Programming in Awk Language. LiStLA: Little Statistics Library in Awk. Part I.

In the previous articles of this Awk programming series

The AWK Programming Language

Awk Programming II: Life in a Shell

Awk Programming III: the One-Dimensional Cellular Automaton

I gave a short introduction to this very useful Unix program and showed two examples of more elaborate applications. In this fourth article of the series, I am going to present a small library of functions that can be used for basic statistical analysis of data sets. I have written (and rewritten) many of these functions, but I have spent little time collecting them in a library that can be used by other users, so this article gives me the motivation to do so. I have not tested the library extensively, so I am releasing it as an alpha version. If you spot errors or improve it, please send me your modified code!

READING DATA SETS

We start with a function that reads data from a plain text (ASCII) file. A good data reader should be able to read common formats such as comma-separated (CSV) or space-separated data files. It should also be able to skip blank lines and lines starting with special characters. It is also handy to be able to select the columns to read, and to check and skip records with inconsistent data (missing values or NaNs). This is exactly what the function ReadData(), given in the Appendix, does. Let us look at it in more detail.

ReadData(filename,fsep,skipchr,warn,range,ndata,data)

The function reads the data from the file whose name is given in the variable filename. It skips all empty records and those whose first character matches the regular expression skipchr. For example, skipchr="@|#|;" skips the records starting with an at sign, a hash or a semicolon. The variable warn controls the behaviour of the function when alphabetic characters, NaN or INF values are found in the data. If it is set to 0, the function prints a warning and continues; if it is set to 1, the function terminates the program at the first warning.
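As a minimal sketch of the skip rule, the test applied to each record can be written as below. The helper name IsSkipped is hypothetical and not part of the library, and the sketch assumes that leading blanks are ignored before the first character is tested (as happens with the default field separator).

```awk
# Hypothetical helper, not part of LiStLA: returns 1 when a record
# should be skipped, 0 otherwise.
function IsSkipped(record, skipchr,    first) {
    # Empty or blank-only records are always skipped
    if (record ~ /^[ \t]*$/) return 1
    # Assumption: leading blanks are removed before testing the first character
    sub(/^[ \t]+/, "", record)
    first = substr(record, 1, 1)
    return (first ~ skipchr) ? 1 : 0
}

BEGIN {
    skipchr = "@|#|;"
    print IsSkipped("# a comment", skipchr)    # prints 1
    print IsSkipped("; another one", skipchr)  # prints 1
    print IsSkipped("0.4 2.3 0.4", skipchr)    # prints 0
    print IsSkipped("", skipchr)               # prints 1
}
```

Since skipchr is used as a regular expression, characters with a special meaning in regular expressions (for example * or .) would need to be escaped.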

The field separator is specified in fsep, which is used to set the awk built-in variable FS and defines the separator between the data. It can be a single character, such as fsep=" " or fsep=",", or an escape sequence, such as fsep="\t" for tab-delimited files.
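The effect of the different separators can be checked with a short sketch based on the built-in function split(), which splits a string with the same rules used for FS. The helper name CountFields is hypothetical and used here only for illustration.

```awk
# Hypothetical helper: count the fields produced by a given separator.
function CountFields(record, fsep,    parts) {
    return split(record, parts, fsep)
}

BEGIN {
    print CountFields("0.4 2.3   0.4", " ")  # prints 3: a single space means "runs of blanks"
    print CountFields("1,6,37,28", ",")      # prints 4
    print CountFields("1\t6\t37", "\t")      # prints 3
    print CountFields("1,,37", ",")          # prints 3: the empty middle field is counted
}
```

The last case shows that, with a comma separator, a missing value still counts as a field, so it does not change the number of fields in the record.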

The columns of each record can be selected in two ways, chosen by element zero of the array range[]. For range[0]=0, a contiguous range of fields is read, from the first field given in range[1] to the last one given in range[2]. For range[0]=1, range[1] is the number of fields to read, and the following elements (range[2], range[3], ...) list the field numbers where the data are located.
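The two selection methods can be sketched as a small function that translates range[] into an explicit list of field numbers. The helper name FieldList and the array cols[] are hypothetical, and the sketch assumes that the range of method 0 includes both end points.

```awk
# Hypothetical helper: translate range[] into the list of field numbers.
# cols[0..n-1] receives the field numbers; the return value is n.
function FieldList(range, cols,    n, i) {
    if (range[0] == 0) {
        # Method 0: contiguous fields from range[1] to range[2], inclusive
        n = range[2] - range[1] + 1
        for (i = 0; i < n; i++) cols[i] = range[1] + i
    } else {
        # Method 1: range[1] fields, listed in range[2], range[3], ...
        n = range[1]
        for (i = 0; i < n; i++) cols[i] = range[i + 2]
    }
    return n
}

BEGIN {
    # Method 0: fields 2 to 4
    r0[0] = 0; r0[1] = 2; r0[2] = 4
    n = FieldList(r0, c0)
    for (i = 0; i < n; i++) printf "%d ", c0[i]   # prints: 2 3 4
    printf "\n"

    # Method 1: two fields, number 2 and number 5
    r1[0] = 1; r1[1] = 2; r1[2] = 2; r1[3] = 5
    n = FieldList(r1, c1)
    for (i = 0; i < n; i++) printf "%d ", c1[i]   # prints: 2 5
    printf "\n"
}
```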

The array ndata returns the number of data sets in ndata[0] and the number of data points in ndata[1]. The data are stored in the two-dimensional array data[,]: the first index specifies the data set and the second the data point. Note that all sets are assumed to have the same number of data points.
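To illustrate the data[set,point] layout, the following sketch computes the arithmetic mean of one data set. The function SetMean is hypothetical and is shown only as an example of how the arrays can be traversed; the arrays are filled by hand here instead of being read from a file.

```awk
# Hypothetical helper, shown only to illustrate the data[set,point] layout:
# arithmetic mean of data set n.
function SetMean(n, ndata, data,    m, sum) {
    if (ndata[1] == 0) return 0
    sum = 0
    for (m = 0; m < ndata[1]; m++) sum += data[n, m]
    return sum / ndata[1]
}

BEGIN {
    # Two data sets with three points each, filled by hand
    ndata[0] = 2; ndata[1] = 3
    data[0,0] = 1;  data[0,1] = 2;  data[0,2] = 3
    data[1,0] = 10; data[1,1] = 20; data[1,2] = 60
    for (n = 0; n < ndata[0]; n++)
        printf "set %d: mean = %.2f\n", n, SetMean(n, ndata, data)
    # prints: set 0: mean = 2.00
    #         set 1: mean = 30.00
}
```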

Example

# This is a comment line. 
# The following lines contain two records with 7 sets of data
0.4 2.3 0.4 5.0 3. 9. 4.
1.5 4.3 1.4 7.0 2. 5. 1.

If range[0]=0, the first reading method is selected and a range of fields needs to be specified. For example, we can set

range[1]=2

range[2]=3

meaning that the data contained in fields 2 to 3 of each record are read. In this way, ndata[0]=2 and ndata[1]=2, and the array data is filled as follows

data[0,0]=2.3;data[0,1]=4.3

data[1,0]=0.4;data[1,1]=1.4

The function also calls the function CheckNum(warn,line,ff), which checks whether a value that has been read is a NaN or INF.
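CheckNum(), as given in the Appendix, flags any value containing an alphabetic character, so a number written in exponent notation such as 1.5e-3 is also rejected. As a possible alternative (a sketch only, not the method used by the library), a value can be tested against a regular expression that describes a decimal number. The function name IsNumber is hypothetical.

```awk
# Hypothetical alternative check: returns 1 if s looks like a decimal
# number (optional sign, optional exponent, surrounding blanks allowed).
function IsNumber(s) {
    return (s ~ /^[ \t]*[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?[ \t]*$/) ? 1 : 0
}

BEGIN {
    print IsNumber("3.")      # prints 1
    print IsNumber("1.5e-3")  # prints 1
    print IsNumber("9 ")      # prints 1
    print IsNumber("NaN")     # prints 0
    print IsNumber("x2")      # prints 0
    print IsNumber("")        # prints 0
}
```

With this kind of test, NaN, INF and empty fields are all rejected because they do not match the pattern of a number.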

HOW TO USE IT

I will show with a simple example how the function can be used. The functions can be copied either into the main program or, better, into a separate file, located in the same directory or in a directory specified by the shell variable AWKPATH, which can be set with the command (in the bash shell)

export AWKPATH="$HOME/myawklib/LiStLA"

I suggest using a file name extension such as .awkl (or similar) as a reminder that the file contains library functions.

We now write a main program as a BEGIN {} block.

@include "LiStLa.awkl"
#Test driver
BEGIN {
    filename="test.csv"
    skipchr="#|@|;"
    range[0]=1
    range[1]=2
    range[2]=2
    range[3]=5

    ReadData(filename,",",skipchr,0,range,ndata,data)
    print "======================================================"
    print "=============STATISTICAL  ANALYSIS  =================="
    print "======================================================"

    printf "# Data Sets  : %5d\n", ndata[0]
    printf "# Data Points in each set: %5d\n", ndata[1]
    for (i=0;i<ndata[1];i++) {
        for (n=0;n<ndata[0];n++) {
            printf "%12.3f ", data[n,i]
        }
        printf "\n"
    }
}

The script starts by including the library file that contains the function ReadData(), using the gawk command:

@include "LiStLa.awkl" 

Inside the BEGIN block, the variable filename is set to the name of the file to analyze. To pass the file name on the command line instead, filename needs to be set to the content of the awk variable ARGV[1]. In this case, a check on the number of command-line arguments can also be added to avoid usage mistakes.
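A check of this kind could look like the following sketch. The function name FileFromArgs and the script name driver.awk in the usage message are hypothetical; the demonstration uses a hand-made argument array so that it runs without any command-line argument.

```awk
# Hypothetical helper: return the file name given on the command line,
# or an empty string (after printing a usage message) if it is missing.
function FileFromArgs(argc, argv) {
    if (argc != 2) {
        print "Usage: awk -f driver.awk datafile" > "/dev/stderr"
        return ""
    }
    return argv[1]
}

BEGIN {
    # In a real driver this would be:
    #     filename = FileFromArgs(ARGC, ARGV)
    #     if (filename == "") exit 1
    args[0] = "awk"; args[1] = "test.csv"
    print FileFromArgs(2, args)   # prints test.csv
}
```

Because the driver consists only of a BEGIN block, awk does not read the file named on the command line by itself; the name is simply passed on to ReadData().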

This is the content of the test.csv file that I used for testing the code.

# tt
# kkkkkk


1,6,37,28,9,9 
3,7,0,88,33,x2
5,2,8,8, 30
5,1,7       
@ jjdd
3,7,0,88,33,22


1,6,37,28,9,NaN
; dssdsd

It is a comma-separated data file, so the second argument of ReadData() is set to a comma (","). Empty lines and lines starting with one of the characters in "#|@|;" are skipped. The warning level is set to 0, so the program warns but does not stop if it finds NaN or INF values.

Finally, range[0] is set to method 1, so that range[1] contains the number of fields to read (2), and range[2] and range[3] contain their positions. The numbers of data sets and data points are returned in ndata[], and the data in the array data[,].

The following figure reports the output obtained by running the test script. The program reads the two data sets and then prints their contents on the screen.

In the next article, I will show new functions for different types of statistical analysis of the data sets.


APPENDIX

This appendix contains the source code of the functions ReadData() and CheckNum().


function ReadData(filename,fsep,skipchr,warn,range,ndata,data,    fi,ff,ra,ndp,nrec,line,bad,rskip,frchr,ii,pp) {
    #======================================================================
    # FUNCTION NAME: ReadData(filename,fsep,skipchr,warn,range,ndata,data)
    #======================================================================
    # DESCRIPTION
    #
    # Read the data from the file "filename" by skipping
    # all empty records, those starting with one of the characters in the
    # regular expression skipchr and those containing alphabetic
    # characters or NaN or INF values.
    #
    # The field separator is specified in fsep and it is used to set
    # the internal variable FS.
    # The fields to read are selected with one of two methods,
    # specified in the array range[].
    #
    # INPUT PARAMETERS
    # warn                  : 0, report a warning without stopping the program
    #                       : 1, stop the program at the first warning
    # skipchr               : Regular expression such as
    #                         skipchr="@|#|;" to skip the records
    #                         starting with an at sign, a hash or
    #                         a semicolon.
    # fsep                  : single character such as fsep=" " or
    #                         fsep="," or fsep="\t" for tab-delimited.
    # range[0]=0 -> Method 0: The first field is at range[1],
    #                         the last field at range[2]
    # range[0]=1 -> Method 1: range[1] is the number of fields,
    #                         which are listed in the following
    #                         elements.
    # OUTPUT DATA
    #        ndata[0]  : number of data sets
    #        ndata[1]  : number of data points
    #        data[n,m] : mth data point of the nth set (starting from
    #                    data[0,0])
    #
    # The names after the seven arguments are local variables.
    #==================================================================
    # (c) Danilo Roccatano 1987-2020
    #==================================================================

    # Color printing on the terminal screen using ANSI escape codes
    # (left global on purpose, since CheckNum() also uses ecol)
    green="\033[1;32m"
    blue="\033[1;34m"
    red="\033[1;31m"
    ecol="\033[0m"

    ndp=0
    FS=fsep
    # Select the range method; ra is the number of data sets (columns) to read
    if (range[0]==0) {
        fi=range[1]
        ff=range[2]
        ra=ff-fi+1
    } else {
        ra=range[1]
    }
    # Read data from file "filename"
    line=1
    while ((getline < filename) > 0) {
        if (NF > 0) {
            frchr=substr($1,1,1)
            if (!match(frchr,skipchr)) {
                if (nrec == 0) {
                    # Take the number of fields from the first data record
                    nrec=NF
                    print green
                    printf "NOTE: The first record contains %d data sets.\n",nrec
                    printf "NOTE: The number of data sets considered is %d,\n",ra
                    printf "NOTE: corresponding to the columns"
                    if (range[0]==0) {
                        # Method: 0
                        printf " spanning from %d to %d.\n",fi,ff
                    } else {
                        # Method: 1
                        printf ":"
                        for (ii=2;ii<2+ra;ii++) {
                            printf " %d",range[ii]
                        }
                        printf ".\n"
                    }
                    print ecol
                    rskip=0
                } else if (nrec != NF) {
                    print red
                    printf "WARNING: The number of data sets (%d) at line %d\n",NF,line
                    printf "WARNING: is different from the first data record (%d).\n",nrec
                    print "WARNING: Therefore, the record is skipped."
                    print ecol
                    if (warn == 1) exit
                    rskip=1
                } else {
                    rskip=0
                }
                if (rskip == 0) {
                    # Reject the record if any of its fields is not numeric
                    bad=0
                    for (ii=1;ii<=NF;ii++) {
                        if (CheckNum(warn,line,$ii)) {
                            bad=1
                            break
                        }
                    }
                    if (bad == 0) {
                        for (ii=0;ii<ra;ii++) {
                            # Method 0: consecutive fields; Method 1: listed fields
                            pp=(range[0]==0) ? fi+ii : range[ii+2]
                            data[ii,ndp]=$pp
                        }
                        ndp++
                    }
                }
            }
        }
        line++
    }
    # Close the file so that it can be read again by a later call
    close(filename)
    ndata[0]=ra
    ndata[1]=ndp
}
function CheckNum(warn,line,ff,    chk,yellow,ecol) {
    #======================================================================
    # FUNCTION NAME: CheckNum(warn,line,ff)
    #======================================================================
    # DESCRIPTION:
    # Check if the data value ff contains alphabetic characters
    # (which includes the NaN and INF types) and print an alert message.
    # Return 1 if the value is rejected, 0 otherwise.
    # The names after the three arguments are local variables.
    #==================================================================
    # (c) Danilo Roccatano 1987-2020
    #==================================================================
    yellow="\033[1;33m"
    ecol="\033[0m"
    chk=0
    if (ff ~ /[[:alpha:]]/) {
        print yellow
        if (warn == 0) {
            printf "WARNING: Line %d is skipped as it contains \n",line
        } else {
            printf "WARNING: Line %d contains \n",line
        }
        if (ff ~ /NaN|INF/) {
            printf "WARNING: NaN or INF data types.  \n"
        } else {
            printf "WARNING: alphanumeric characters. \n"
        }
        print ecol
        chk=1
    }
    if (chk == 1 && warn == 1) exit
    return chk
}

About Danilo Roccatano

I have a Doctorate in chemistry from the University of Roma “La Sapienza”. I have led educational and research activities at different universities in Italy, The Netherlands, Germany and now in the UK. I am fascinated by the study of nature with theoretical and computational models. For years, my scientific research has focused on the study of molecular systems of biological interest using the technique of Molecular Dynamics simulation. I have developed a server (the link is in one of my posts) for statistical analysis at the amino acid level of the effect of random mutations induced by random mutagenesis methods. I am also very active in teaching physical chemistry, computational chemistry, and molecular modeling. I have several other interests and hobbies, such as video/photography, robotics, computer vision, electronics, programming, microscopy, entomology, recreational mathematics and computational linguistics.