Programming in the Awk Language. LiStLA: Little Statistics Library in Awk. Part I.

In the previous articles of this Awk programming series

The AWK Programming Language

Awk Programming II: Life in a Shell

Awk Programming III: the One-Dimensional Cellular Automaton

I gave a short introduction to this very useful Unix program and showed two examples of elaborate applications. In this fourth article of the series, I am going to show a little library of functions that can be used for basic statistical analysis of data sets. I have written (and rewritten) many of these functions, but I have spent little time collecting them in a library that can be used by other users. This article gives me the motivation to achieve this target. I have not tested the library extensively, so I am releasing it as an alpha version. If you spot errors or have improved it, then please send me your modified code!

READING DATA SETS

We start with a function that can be used to read data from a text file (ASCII format). A good data reader should be able to read common data formats such as comma-separated (csv) or space-separated data files. It should also be able to skip blank lines or lines starting with special characters. It would also be handy to select the columns that need to be read, and to check and skip lines with inconsistent data (missing data or NaNs). This is exactly what the function ReadData(), given in the Appendix, does. Let us look at it in more detail.

ReadData(filename,fsep,skipchr,warn,range,ndata,data)

The function reads the data from the file whose name is provided in the variable filename. It skips all empty records and those starting with one of the characters matched by the regular expression skipchr. For example, a regular expression such as skipchr="@|#|;" skips the records starting with the character "at", "hash" or semicolon. The variable warn controls the behavior of the program when alphabetic characters or NaN or INF values are present in the data. If the variable is set to 0, the function gives a warning without stopping the program; if it is set to 1, the function terminates the program after the first warning.
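The skipping rule can be sketched in isolation. The helper below is hypothetical (IsSkipped is not part of the library) and tests the first character of the whole record rather than of the first field, which is a simplifying assumption. It only illustrates how a string such as "@|#|;" is used as a dynamic regular expression.

```awk
# Hypothetical helper, not part of LiStLA: returns 1 when a record
# would be skipped, 0 when it would be read.
function IsSkipped(rec, skipchr) {
    if (rec ~ /^[ \t]*$/) return 1                # empty or blank record
    # skipchr is a string used as a dynamic regular expression
    return (substr(rec, 1, 1) ~ skipchr) ? 1 : 0
}

BEGIN {
    skipchr = "@|#|;"
    lines[1] = "# a comment"
    lines[2] = "@ jjdd"
    lines[3] = ""
    lines[4] = "1,6,37,28,9,9"
    for (i = 1; i <= 4; i++)
        printf "%-16s -> %s\n", "\"" lines[i] "\"", (IsSkipped(lines[i], skipchr) ? "skip" : "read")
}
```

Only the last sample record is reported as "read"; the comment lines and the empty record are skipped.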

The field separator is specified in fsep and is used to set the awk internal variable FS, which defines the separator between the data fields. It can be a single character, such as fsep=" " or fsep=",", or an escape sequence, such as fsep="\t" for tab-delimited files.
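The effect of a separator can be previewed with split(), which applies the same splitting rules as FS. This is only a sketch: CountFields is a hypothetical helper that is not part of the library, and the sample records are illustrative.

```awk
# Hypothetical helper, not part of LiStLA: number of fields that a
# record yields for a given separator.
function CountFields(rec, fsep,    tmp) {
    return split(rec, tmp, fsep)
}

BEGIN {
    # A single space splits on runs of blanks and ignores leading/trailing ones
    printf "%d\n", CountFields("  0.4   2.3  0.4 ", " ")    # 3 fields
    # A comma splits on every comma; blanks stay inside the fields
    printf "%d\n", CountFields("5,2,8,8, 30", ",")          # 5 fields
    # A tab splits on every tab, so two tabs in a row give an empty field
    printf "%d\n", CountFields("1\t\t3", "\t")              # 3 fields
}
```

This difference matters for files with missing values: with a comma or a tab an empty field keeps its position, while with a single space it disappears and the record appears shorter.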

The columns of the data records can be selected in two ways, chosen by setting element zero of the array range[]. For range[0]=0, a contiguous range of fields is specified by giving the first field in range[1] and the last one in range[2]. For range[0]=1, range[1] is the number of fields to read, and the following elements (range[2], range[3], ...) list the positions of those fields in the record.
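The two selection methods can be sketched as a small function that expands range[] into an explicit list of field numbers. ColumnsFromRange is a hypothetical helper written for this illustration, not a function of the library.

```awk
# Hypothetical helper, not part of LiStLA: fills cols[0..n-1] with the
# field numbers selected by range[] and returns n.
function ColumnsFromRange(range, cols,    i, n) {
    split("", cols)                         # portable way to empty an array
    if (range[0] == 0) {
        # Method 0: contiguous fields from range[1] to range[2]
        n = range[2] - range[1] + 1
        for (i = 0; i < n; i++) cols[i] = range[1] + i
    } else {
        # Method 1: range[1] fields, listed from range[2] onwards
        n = range[1]
        for (i = 0; i < n; i++) cols[i] = range[i + 2]
    }
    return n
}

BEGIN {
    r0[0] = 0; r0[1] = 2; r0[2] = 3
    n = ColumnsFromRange(r0, c)
    printf "method 0:"; for (i = 0; i < n; i++) printf " %d", c[i]; print ""   # method 0: 2 3

    r1[0] = 1; r1[1] = 2; r1[2] = 2; r1[3] = 5
    n = ColumnsFromRange(r1, c)
    printf "method 1:"; for (i = 0; i < n; i++) printf " %d", c[i]; print ""   # method 1: 2 5
}
```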

The array ndata returns the number of data sets in ndata[0] and the number of data points in ndata[1]. The output data are stored in the two-dimensional array data[,]. The first index specifies the data set and the second the data point. Note that all the sets are assumed to contain the same number of data points.
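As a sketch of how this layout is traversed, the following code averages each data set of a hand-filled array. SetMean is a hypothetical helper used only for illustration, and the values are made up; the point is the loop over data[n,i] bounded by ndata[0] and ndata[1].

```awk
# Hypothetical helper, not part of LiStLA: mean of the n-th data set
# stored as d[n,0] ... d[n,npts-1].
function SetMean(d, n, npts,    i, sum) {
    sum = 0
    for (i = 0; i < npts; i++) sum += d[n, i]
    return (npts > 0) ? sum / npts : 0
}

BEGIN {
    # Two sets with two points each, filled by hand instead of by ReadData()
    ndata[0] = 2; ndata[1] = 2
    data[0,0] = 1.0;  data[0,1] = 3.0
    data[1,0] = 10.0; data[1,1] = 20.0
    for (n = 0; n < ndata[0]; n++)
        printf "set %d: mean = %.2f\n", n, SetMean(data, n, ndata[1])
}
```

With these values the means are 2.00 for set 0 and 15.00 for set 1.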

Example

# This is a comment line. 
# The following lines contain two records with 7 sets of data
0.4 2.3 0.4 5.0 3. 9. 4.
1.5 4.3 1.4 7.0 2. 5. 1.

If range[0]=0, the first reading method is selected and a range of fields needs to be specified. For example, we can set

range[1]=2

range[2]=3

meaning that the data contained in fields 2 to 3 of each record are read. In this way, ndata[0]=2 and ndata[1]=2, and the array data is filled as follows

data[0,0]=2.3;data[1,0]=0.4

data[0,1]=4.3;data[1,1]=1.4

The function also calls the function CheckNum(warn,line,ff), which checks whether a value that has been read is a NaN or INF or contains alphabetic characters.
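CheckNum() in the Appendix rejects every field containing an alphabetic character, so a value in exponent notation such as 1e5 is rejected as well. A possible alternative, shown here only as a sketch and not as the library's method, is to accept a field when it matches the pattern of a decimal number. IsNumber is a hypothetical name.

```awk
# Hypothetical alternative to CheckNum(), not part of LiStLA:
# returns 1 if s looks like a decimal number, 0 otherwise.
function IsNumber(s) {
    gsub(/^[ \t]+|[ \t]+$/, "", s)      # trim blanks (s is a local copy)
    # optional sign, digits with optional decimal point, optional exponent
    return (s ~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$/) ? 1 : 0
}

BEGIN {
    n = split("3. 1e5 -0.25 NaN x2 -INF", v, " ")
    for (i = 1; i <= n; i++)
        printf "%-6s %s\n", v[i], (IsNumber(v[i]) ? "number" : "rejected")
}
```

The first three sample values are reported as numbers and the last three are rejected.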

HOW TO USE IT

I will show with a simple example how the function can be used. The functions can be copied either into the main program or, better, into a separate file located in the same directory or in a directory listed in the shell variable AWKPATH, which can be set with the command (in the bash shell)

export AWKPATH="$HOME/myawklib/LiStLA"

I suggest using a file name extension such as .awkl (or similar) as a reminder that the file contains library functions.

We now write a main program as a BEGIN {} block.

@include "LiStLa.awkl"
#Test driver
BEGIN {
    filename="test.csv"
    skipchr="#|@|;"
    range[0]=1
    range[1]=2
    range[2]=2
    range[3]=5

    ReadData(filename,",",skipchr,0,range,ndata,data)
    print "======================================================"
    print "=============STATISTICAL  ANALYSIS  =================="
    print "======================================================"

    printf "# Data Sets  : %5d\n", ndata[0]
    printf "# Data Points in each set: %5d\n", ndata[1]
    for (i=0;i<ndata[1];i++) {
        for (n=0;n<ndata[0];n++) {
            printf "%12.3f ", data[n,i]
        }
        printf "\n"
    }
}

The script starts by including the library file containing the function ReadData() with the command (@include is a gawk extension):

@include "LiStLa.awkl" 

Inside the BEGIN block, the variable filename is set to the name of the file to analyze. To pass the file name on the command line instead, filename must be set to the content of the awk variable ARGV[1]. In that case, a check on the number of command-line arguments can also be added to avoid usage mistakes.
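A minimal sketch of such a check is given below. DataFileFromArgs and the script name driver.awk in the usage message are hypothetical; ARGC and ARGV are the standard awk variables, with ARGV[0] holding the name of the interpreter.

```awk
# Hypothetical helper, not part of LiStLA: returns the data file name
# when exactly one argument was given, or "" otherwise.
function DataFileFromArgs(argc, argv) {
    return (argc == 2) ? argv[1] : ""
}

BEGIN {
    filename = DataFileFromArgs(ARGC, ARGV)
    if (filename == "") {
        print "Usage: gawk -f driver.awk datafile" > "/dev/stderr"
        # a real driver would stop here with: exit 1
    } else {
        print "Data file: " filename
    }
}
```

Since the driver consists only of a BEGIN block, awk does not try to read the file named on the command line by itself; the name is simply handed over to ReadData().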

These are the contents of the test.csv file that I have used for testing the code.

# tt
# kkkkkk


1,6,37,28,9,9 
3,7,0,88,33,x2
5,2,8,8, 30
5,1,7       
@ jjdd
3,7,0,88,33,22


1,6,37,28,9,NaN
; dssdsd

It is a comma-separated data file, so the second argument of ReadData() is set to a comma (","). The lines starting with one of the characters matched by "#|@|;" and the empty ones are skipped. The warning level is set to 0, so the program warns but does not stop if it finds NaN or INF values.

Finally, the variable range[0] is set to method 1, so that range[1] contains the number of fields to read (2), and range[2] and range[3] contain their positions in the record. The number of data sets and data points are returned in ndata[], and the data in the array data[,].

The following figure reports the output obtained by running the test script. The program reads the two data sets and then prints their contents on the screen.

In the next article, I will show new functions for different types of statistical analysis of the data sets.


IF YOU LIKE THIS ARTICLE AND WANT TO BE KEPT INFORMED ABOUT NEW ARTICLES, THEN PLEASE REPOST IT AND SUBSCRIBE TO MY BLOG.

APPENDIX

This appendix contains the source code of the functions ReadData() and CheckNum().


function ReadData(filename,fsep,skipchr,warn,range,ndata,data,    green,red,ecol,nset,fi,ff,ndp,nrec,line,bad,rskip,frchr,ii,pp) {
    #======================================================================
    # FUNCTION NAME: ReadData(filename,fsep,skipchr,warn,range,ndata,data)
    #======================================================================
    # DESCRIPTION
    #
    # Read the data from the file "filename", skipping all empty records,
    # those starting with one of the characters in the regular expression
    # skipchr, and those whose selected fields contain alphabetic
    # characters or NaN or INF values.
    #
    # The field separator is specified in fsep and it is used to set
    # the internal variable FS.
    # The fields to read are selected with the method given in range[0]
    # and the remaining elements of the array range[].
    #
    # INPUT PARAMETERS
    # warn                  : 0, report warnings without stopping the program
    #                       : 1, stop the program at the first warning
    # skipchr               : Regular expression such as
    #                         skipchr="@|#|;" to skip the records
    #                         starting with the characters "at",
    #                         "hash" or semicolon.
    # fsep                  : single character such as fsep=" " or
    #                         fsep="," or fsep="\t" for tab-delimited.
    # range[0]=0 -> Method 0: The first field is at range[1],
    #                         the last field at range[2].
    # range[0]=1 -> Method 1: range[1] is the number of fields,
    #                         listed in the following elements.
    # OUTPUT DATA
    #        ndata[0]  : number of data sets
    #        ndata[1]  : number of data points
    #        data[n,m] : mth data point of the nth set (starting from
    #                    data[0,0])
    #
    # The names after the extra spaces in the parameter list are
    # local variables.
    #======================================================================
    # (c) Danilo Roccatano 1987-2020
    #======================================================================

    # Color printing on the terminal screen using ANSI escape codes
    green="\033[1;32m"
    red="\033[1;31m"
    ecol="\033[0m"

    ndp=0
    nrec=0
    FS=fsep
    # Select the range method and count the data sets (columns) to read
    if (range[0]==0) {
        fi=range[1]
        ff=range[2]
        nset=ff-fi+1
    } else {
        nset=range[1]
    }
    # Read data from file "filename"
    line=1
    while ((getline < filename) > 0) {
        if (NF > 0) {
            frchr=substr($1,1,1)
            if (!match(frchr,skipchr)) {
                # Take the number of fields from the first data record
                if (nrec == 0) {
                    nrec=NF
                    print green
                    printf "NOTE: The first record contains %d data sets.\n",nrec
                    printf "NOTE: The number of data sets considered is %d\n",nset
                    printf "NOTE: corresponding to the columns"
                    if (range[0]==0) {
                        # Method: 0
                        printf " spanning from %d to %d.\n",fi,ff
                    } else {
                        # Method: 1
                        printf ":"
                        for (ii=0;ii<nset;ii++) {
                            printf " %d",range[ii+2]
                        }
                        printf ".\n"
                    }
                    print ecol
                    rskip=0
                } else {
                    if (nrec != NF) {
                        print red
                        printf "WARNING: The number of data sets (%d) at line %d\n",NF,line
                        printf "WARNING: is different from the first data record (%d).\n",nrec
                        print "WARNING: Therefore, the record is skipped."
                        print ecol
                        if (warn == 1) exit
                        rskip=1
                    } else {
                        rskip=0
                    }
                }
                if (rskip == 0) {
                    bad=0
                    for (ii=0;ii<nset;ii++) {
                        # Method 0: contiguous fields; Method 1: listed fields
                        pp=(range[0]==0) ? fi+ii : range[ii+2]
                        if (CheckNum(warn,line,$pp)) bad=1
                        data[ii,ndp]=$pp
                    }
                    # Keep the record only if all the selected fields are valid;
                    # otherwise the next valid record overwrites it
                    if (bad == 0) ndp++
                }
            }
        }
        line++
    }
    close(filename)
    ndata[0]=nset
    ndata[1]=ndp
}
function CheckNum(warn,line,ff,    yellow,ecol,chk) {
    #======================================================================
    # FUNCTION NAME: CheckNum(warn,line,ff)
    #======================================================================
    # DESCRIPTION:
    # Check if the data value ff contains alphabetic characters
    # (including NaN or INF) and print an alert message.
    # Return 1 if the value is rejected, 0 otherwise.
    # Note: values in exponent notation (e.g. 1e5) are rejected too.
    #======================================================================
    # (c) Danilo Roccatano 1987-2020
    #======================================================================
    yellow="\033[1;33m"
    ecol="\033[0m"
    chk=0
    if (ff ~ /[[:alpha:]]/) {
        print yellow
        if (warn == 0) {
            printf "WARNING: Line %d is skipped as it contains\n",line
        } else {
            printf "WARNING: Line %d contains\n",line
        }
        if (ff ~ /NaN|INF/) {
            printf "WARNING: NaN or INF data types.\n"
        } else {
            printf "WARNING: alphanumeric characters.\n"
        }
        print ecol
        chk=1
    }
    if (chk == 1 && warn == 1) exit
    return chk
}


Molecular Machines: The Coronavirus SARS-CoV-2 Threat, Part I.

What friends do with us and for us is also something we live through, for it strengthens and advances our personality. What enemies undertake against us we do not live through; we merely learn of it, reject it, and protect ourselves against it as against frost, storm, rain and hail, or other external evils that are to be expected.

Johann Wolfgang von Goethe (1749-1832), Maximen und Reflexionen. Aphorisms and notes.

A virus is life in its simplest form. It is the minimalist reduction of an organism to its essential functional elements. More pragmatically, a virus is a container of genetic code with an efficient molecular mechanism that allows it to enter a host cell of an organism that can reproduce autonomously. As a molecular machine, a virus can resemble the Death Star of the Star Wars saga in shape and destructive power. It is therefore a kind of molecular machine that we absolutely do not want inside us!

As the great Goethe says, the enemy is part of our experience, and we must hunt it down and indeed protect ourselves from other possible enemies. This epic war of nature prompted me to start this blog, in which I will share what I learn about this dangerous molecular machine.


Molecular Machines: The Threat of the Coronavirus SARS-CoV-2

He who knows how to assess his own forces and those of the enemy is rarely defeated.

Niccolò Machiavelli, Dell’arte della guerra (1519-1520)

A virus is life in its simplest form. It is the minimalist reduction of an organism to its essential functional elements. More pragmatically, a virus is a container of genetic code equipped with an efficient molecular mechanism that allows it to invade a host cell of an organism capable of reproducing autonomously. As a molecular machine, a virus can resemble the Death Star of the Star Wars saga in shape and destructive power. It is therefore a kind of molecular machine that we absolutely do not want inside us!

The spread of the coronavirus SARS-CoV-2 (COVID-19) has produced a new pandemic, that is, an infection caused by a pathogen that affects the entire population of a living species, in this case the human one. This global emergency is the result of a natural competition between living species, which reminds us that we are still a piece of the Gaia ecosystem. However, although it is always hard to believe given the state to which we have reduced our planet, we are the most intelligent life form in the known universe. So it would be quite embarrassing to be defeated by an invisible enemy.


Modelling Natural Shapes: (Easter) Eggs 2020

One year ago, I wrote an article about modelling egg shapes, promising at one point to come back to the topic. The next step in studying egg shapes is to look at a real egg or at a copy of one. This is a happy occasion for experimenting with the model using three-dimensional graphics and 3D printing! It is indeed a natural step: take half of the symmetric curve representing the egg shape

y=T(1+x)^{\frac{\lambda}{1+\lambda}}(1-x)^{\frac{1}{1+\lambda}},

where T and \lambda are two parameters, and rotate it around the central axis

\begin{aligned} x' &= x \\ y' &= y\cos(\theta) \\ z' &= y\sin(\theta) \end{aligned}


Nanoparticles in Biology and Medicine

I am very pleased to announce that the second edition of the book Nanoparticles in Biology and Medicine, edited by Enrico Ferrari and Mikhail Soloviev, is now out.

This fully updated volume presents a wide range of methods for synthesis, surface modification, characterization and application of nano-sized materials (nanoparticles) in the life science and medical fields, with a focus on drug delivery and diagnostics. Beginning with a section on the synthesis of nanoparticles and their applications, the book continues with detailed chapters on nanoparticle derivatization, bio-interface, and nanotoxicity, as well as nanoparticle characterization and advanced methods development. Written for the highly successful Methods in Molecular Biology series, chapters include introductions to their respective topics, lists of the necessary materials and reagents, step-by-step, readily reproducible laboratory protocols, and tips on troubleshooting and avoiding known pitfalls. Authoritative and cutting-edge, Nanoparticles in Biology and Medicine: Methods and Protocols, Second Edition serves as an ideal guide for scientists at all levels of expertise to a wide range of biomedical and pharmaceutical applications including functional protein studies, drug delivery, immunochemistry, imaging, and more.

I have contributed a chapter (14) titled The Molecular Dynamics Simulation of Peptides on Gold Nanosurfaces.

In this chapter a short tutorial on the preparation of molecular dynamics (MD) simulations for a peptide in solution at the interface of an uncoated gold nanosurface is given. Specifically, the step-by-step procedure will give guidance to set up the simulation of a 16 amino acid long antimicrobial peptide on a gold layer using the program Gromacs for Molecular Dynamics simulations.


Molecular Machines: the Coronavirus SARS-CoV-2 Menace

“If you know the enemy and know yourself, you need not fear the result of a hundred battles. If you know yourself but not the enemy, for every victory gained you will also suffer a defeat. If you know neither the enemy nor yourself, you will succumb in every battle.”

Sun Tzu, The Art of War

A virus is the Bauhaus of life forms: the minimalist reduction of an organism to its essential functional elements. More pragmatically, it is a container of genetic code provided with a smart mechanism that allows it to invade the cells of a host organism. As a molecular machine, a virus can resemble the Death Star of the Star Wars saga in shape and destructive power. Therefore, it is a molecular machine that we definitely do not want to have within us!

The spread of the coronavirus SARS-CoV-2 has produced a new pandemic, i.e. an infection caused by a pathogen that affects the entire population of a living species, in this case the human one. This global emergency is the result of a natural competition between living species, which reminds us that we are still a small brick of the Gaia ecosystem. However, although it is always difficult to believe given the state to which we have reduced our planet, we are the most intelligent life form in the known universe. So it would be quite embarrassing to be defeated by an invisible enemy.


The Particle in a Box I: the Schrödinger Equation in One-dimension

In 1926, the Austrian physicist Erwin Schrödinger (1887-1961) made a fundamental mathematical discovery that had a profound impact on the study of the molecular world (in 1933, Schrödinger was awarded the Nobel Prize in Physics, just 7 years after his breakthrough discovery). He discovered that the state of a quantum system composed of particles (such as electrons and nucleons) can be described by postulating the existence of a function of the particle coordinates and time, called the state function or wave function (\Psi, psi function). This function is a solution of a wave equation: the so-called Schrödinger equation (SE). Although the SE can be solved analytically only for relatively simple cases, the development of computers and numerical methods has made it possible to apply the SE to the study of complex molecular systems.


Merry Christmas and Happy New Year

Dear Readers,

Thank you very much for stopping by my website during your wanderings through cyberspace and for devoting some of your time to reading my articles. I hope you have found them interesting and useful enough to keep coming back.
I would like to announce some of the upcoming publications.
New titles will appear shortly:

  • The Logistic Map and the Feigenbaum Constants: a Retro Programming Inspired Excursion.
  • The Numerical Integration of Differential Equations, Part II: 50 Years Ago Man Set Foot on the Moon
  • Retro Programming: Acid-base Titration.
  • Retro Programming: Plant evolution.

For the moment, I wish all of you a merry Christmas with your loved ones and a new year full of good news.


A personal tribute to the founder of MD simulation of biological molecules: Prof Herman J.C. Berendsen (1934-2019)

On 7 October 2019, Prof Dr Herman Johan Christiaan Berendsen passed away, shortly after his 85th birthday. Prof Berendsen is considered the founder of the molecular dynamics simulation of biological systems: the area of theoretical research that also shaped my scientific career. He worked at the University of Groningen, in the picturesque northern part of the Netherlands. It was there that I met him for the first time, as he allowed me to conduct research in his lab during the last year of my doctoral research training at the University of Rome “La Sapienza”. After I completed my doctorate, Herman gave me the opportunity to continue working in his group with a postdoc position within the “Protein Folding” EU Training network. This happened just two years before his retirement, and therefore I was also one of his last postdocs. After retirement, Herman dedicated himself to writing two books that distil all his experience in the area of molecular simulation [1] and in education [2]. He stated in a project on the scientific social network ResearchGate that “I am retired and work occasionally on methods for multiscale simulations.”


The First 150 Years of the Periodic Table of the Elements

That the nobility of man, acquired in a hundred centuries of trial and error, lay in making himself the conqueror of matter, and that I had enrolled in chemistry because I wanted to remain faithful to this nobility. That conquering matter is to understand it, and understanding matter is necessary to understanding the universe and ourselves: and that therefore Mendeleev’s Periodic Table, which just during those weeks we were laboriously learning to unravel, was poetry, loftier and more solemn than all the poetry we had swallowed down in liceo; and come to think of it, it even rhymed!

Primo Levi, The Periodic Table.

This year marks the 150th anniversary of the periodic table of the elements (TPE), which currently has 118 entries; the latest arrival (tennessine) was discovered 10 years ago (2009). As a chemist, I feel obliged to give a small informative contribution to celebrate this important event.
