In the previous Awk programming articles
Awk Programming II: Life in a Shell
Awk Programming III: the One-Dimensional Cellular Automaton
I briefly introduced this handy Unix program through two elaborate example applications. In this fourth article of the series, I offer a small library of functions for the essential statistical analysis of data sets. I have written (and rewritten) many of these functions, but I have spent little time collecting them in a library that other users can use, and this article gives me the motivation to do so. Unfortunately, I have not tested the library extensively, so I am releasing it as an alpha version. If you spot errors or improve it, please send me your modified code!
READING DATA SETS
We start with a function that reads data from a text file (ASCII format). A good data reader should handle common data formats such as comma-separated (CSV) or space-separated files. It should also be able to skip blank lines and lines starting with special characters. It is also handy to select the columns to read and to check for and skip records with inconsistent data (missing values or NaNs). This is exactly what the function ReadData(), given in the Appendix, does. Let us look at it in more detail.
ReadData(filename,fsep,skipchr,warn,range,ndata,data)
The function reads the data from the file whose name is given in the variable filename. It skips all empty records and those starting with one of the characters matched by the regular expression skipchr. For example, the regular expression skipchr="@|#|;" skips the records starting with an at sign, a hash, or a semicolon. The variable warn controls the behavior of the function when alphabetic characters or NaN or INF values are present in the data. If it is set to 0, the function prints a warning without stopping the program; if it is set to 1, the function terminates the program after the first warning.
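The skipping rule can be tried in isolation. The snippet below is only a minimal sketch, not the library code: keep_records is a hypothetical shell helper wrapping a one-line awk program that applies the same test as ReadData(), namely a non-empty record whose first character does not match skipchr.

```sh
# keep_records is a hypothetical helper: it prints only the records that
# pass the skip test (non-empty, first character not matched by skipchr).
keep_records() {
  awk -v skipchr='#|@|;' '
    NF > 0 && !match(substr($1, 1, 1), skipchr) { print }
  '
}

printf '# comment\n\n1 2 3\n@ note\n4 5 6\n; end\n' | keep_records
# prints:
# 1 2 3
# 4 5 6
```

Since skipchr is an ordinary awk regular expression, further marker characters can be added as extra alternatives.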
The field separator is specified in fsep, which is used to set the awk internal variable FS and therefore defines the separator between the data fields. It can be a single character, such as fsep=" " or fsep=",", or an escape sequence, such as fsep="\t" for tab-delimited files.
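As a small illustration of how the separator changes the way a record is split, the sketch below passes the separator to awk and assigns it to FS. The helper name second_field is hypothetical and is not part of the library.

```sh
# second_field is a hypothetical helper: it prints field 2 of every record,
# splitting on the separator given as its first argument.
second_field() {
  awk -v fsep="$1" 'BEGIN { FS = fsep } { print $2 }'
}

printf '1,6,37\n'   | second_field ','    # prints 6
printf '1\t6\t37\n' | second_field '\t'   # prints 6
```

The escape sequence \t is expanded by awk itself when the -v assignment is processed, so it can be written literally on the command line.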
The columns to read from each record can be selected in two ways, chosen by element zero of the array range[]. For range[0]=0, a contiguous range of fields is specified: the first field is given in range[1] and the last one in range[2]. For range[0]=1, range[1] holds the number of fields to read, and the following elements (range[2], range[3], and so on) hold the positions of those fields in the record.
The array ndata returns the number of data sets in ndata[0] and the number of data points in ndata[1]. The output data are stored in the two-dimensional array data[,]: the first index identifies the data set and the second the data point. Note that every set is assumed to have the same number of data points.
Example
# This is a comment line.
# The following lines contain two records with 7 sets of data
0.4 2.3 0.4 5.0 3. 9. 4.
1.5 4.3 1.4 7.0 2. 5. 1.
If range[0]=0, the first reading method is selected and a range of fields needs to be specified. For example, we can set
range[1]=2
range[2]=3
meaning that the data contained in fields 2 to 3 of each record are read. In this way, ndata[0]=2 and ndata[1]=2, and the array data is filled as follows
data[0,0]=2.3;data[1,0]=0.4
data[0,1]=4.3;data[1,1]=1.4
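The example can be reproduced with a stand-alone sketch of method 0. It is not the ReadData() code: read_range is a hypothetical helper, and it follows the convention stated for data[,] (first index the data set, second index the data point).

```sh
# read_range FIRST LAST is a hypothetical helper that stores the fields
# FIRST..LAST of every data record in data[set, point] and prints the array.
read_range() {
  awk -v first="$1" -v last="$2" '
    BEGIN { np = 0 }                    # number of data points read so far
    /^#/ || NF == 0 { next }            # skip comment and blank lines
    {
      for (i = first; i <= last; i++) {
        data[i - first, np] = $i
      }
      np++
    }
    END {
      nset = last - first + 1
      for (n = 0; n < nset; n++) {
        for (m = 0; m < np; m++) {
          printf "data[%d,%d]=%s\n", n, m, data[n, m]
        }
      }
    }
  '
}

read_range 2 3 <<'EOF'
# This is a comment line.
# The following lines contain two records with 7 sets of data
0.4 2.3 0.4 5.0 3. 9. 4.
1.5 4.3 1.4 7.0 2. 5. 1.
EOF
# prints:
# data[0,0]=2.3
# data[0,1]=4.3
# data[1,0]=0.4
# data[1,1]=1.4
```

The counter np is initialized explicitly because an unset awk variable used in an array subscript becomes the empty string rather than 0.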
The function also calls CheckNum(warn,line,ff), which checks whether a value just read is NaN or INF or contains other alphabetic characters.
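The idea behind such a check can be sketched as follows. This is a stricter variant written for illustration and is not the author's CheckNum(): the helper name classify and the numeric pattern are assumptions. Unlike a plain test for alphabetic characters, it accepts exponent notation such as 1e5.

```sh
# classify is a hypothetical helper: it labels each input value as
# non-finite (NaN/INF), numeric, or text.
classify() {
  awk '{
    v = toupper($1)
    if (v ~ /^[-+]?(NAN|INF)$/) {
      kind = "non-finite"
    } else if (v ~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)(E[-+]?[0-9]+)?$/) {
      kind = "numeric"
    } else {
      kind = "text"
    }
    printf "%s: %s\n", $1, kind
  }'
}

printf '9.\nx2\nNaN\n-INF\n1e5\n' | classify
# prints:
# 9.: numeric
# x2: text
# NaN: non-finite
# -INF: non-finite
# 1e5: numeric
```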
HOW TO USE IT
I will show with a simple example how the function can be used. The functions can be copied either into the main program or, better, into a separate file located in the same directory or in a directory listed in the shell variable AWKPATH, which can be set with the command (in the bash shell)
export AWKPATH="$HOME/myawklib/LiStLA"
I suggest using a file name extension such as .awkl (or similar) to remind you that the file contains library functions.
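The mechanism can be tried with a throwaway library. The sketch below assumes that gawk is installed, since @include and AWKPATH are gawk features. The names demo_include, hello.awkl and Hello() are hypothetical and are used only for this illustration.

```sh
# demo_include is a hypothetical helper: it creates a tiny library and a
# main program in a temporary directory, then runs the program with gawk.
demo_include() {
  libdir=$(mktemp -d) || return 1
  # A small "library" file holding one function
  printf '%s\n' 'function Hello(name) { return "hello, " name }' > "$libdir/hello.awkl"
  # A main program that includes the library by file name only
  printf '%s\n' '@include "hello.awkl"' 'BEGIN { print Hello("awk") }' > "$libdir/main.awk"
  # gawk searches the directories listed in AWKPATH for included files
  AWKPATH="$libdir" gawk -f "$libdir/main.awk"
  rm -rf "$libdir"
}

# Run the demonstration only when gawk is available
if command -v gawk >/dev/null 2>&1; then
  demo_include    # prints: hello, awk
fi
```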
We now write the main program as a BEGIN {} block.
@include "LiStLa.awkl"
# Test driver
BEGIN {
    filename="test.csv"
    skipchr="#|@|;"
    range[0]=1    # method 1: explicit list of fields
    range[1]=2    # number of fields to read
    range[2]=2    # first field to read
    range[3]=5    # second field to read
    ReadData(filename,",",skipchr,0,range,ndata,data)
    print "======================================================"
    print "=============STATISTICAL ANALYSIS =================="
    print "======================================================"
    printf "# Data Sets : %5d\n", ndata[0]
    printf "# Data Points in each set: %5d\n", ndata[1]
    # One row per data point, one column per data set
    for (i=0;i<ndata[1];i++) {
        for (n=0;n<ndata[0];n++) {
            printf "%12.3f ", data[n,i]
        }
        printf "\n"
    }
}
The script starts by including the library file containing the function ReadData() with the command:
@include "LiStLa.awkl"
Inside the BEGIN block, the variable filename is set to the name of the file to analyze. To pass the file name on the command line instead, filename needs to be set to the contents of the awk variable ARGV[1]. In this case, a check on the number of command-line arguments can also be added to avoid usage mistakes.
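A possible form of this check is sketched below. The shell wrapper show_input and the wording of the usage message are hypothetical; the BEGIN block is the part that would go into a driver script.

```sh
# show_input is a hypothetical wrapper: it forwards its arguments to a
# BEGIN-only awk program that validates the argument count.
show_input() {
  awk 'BEGIN {
    # ARGC also counts the program name, so one file argument gives ARGC == 2
    if (ARGC != 2) {
      print "usage: awk -f driver.awk datafile" > "/dev/stderr"
      exit 1
    }
    filename = ARGV[1]
    print "reading data from " filename
  }' "$@"
}

show_input test.csv    # prints: reading data from test.csv
```

Because the program consists of a BEGIN block only, awk does not try to open the file named in ARGV[1] by itself, so the name can be handed to a reader function without being read twice.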
This is the content of the test.csv file that I used for testing the code.
# tt
# kkkkkk
1,6,37,28,9,9
3,7,0,88,33,x2
5,2,8,8, 30
5,1,7
@ jjdd
3,7,0,88,33,22
1,6,37,28,9,NaN
; dssdsd
It is a comma-separated data file, so the second argument of ReadData() is set to a comma (","). Empty lines and lines starting with one of the characters in "#|@|;" are skipped. The warning level is set to 0, so the program warns but does not stop if it finds NaN or INF values.
Finally, range[0] is set to method 1, so that range[1] contains the number of fields to read (2), and range[2] and range[3] contain their positions. The numbers of data sets and data points are returned in ndata[], and the values in the array data[,].
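The layout of range[] for method 1 can be illustrated with a short stand-alone sketch. It is not the library function: pick_method1 is a hypothetical helper, and it simply ignores records that do not have six fields.

```sh
# pick_method1 is a hypothetical helper: it prints the fields listed in
# range[2], range[3], ... for every record with six fields.
pick_method1() {
  awk -F, '
    BEGIN {
      range[0] = 1    # method 1: explicit list of fields
      range[1] = 2    # number of fields to read
      range[2] = 2    # first field to read
      range[3] = 5    # second field to read
    }
    NF == 6 {
      # The field positions occupy range[2] .. range[1 + range[1]]
      for (ii = 2; ii < 2 + range[1]; ii++) {
        printf "%s%s", $(range[ii]), (ii < 1 + range[1] ? " " : "\n")
      }
    }
  '
}

printf '1,6,37,28,9,9\n3,7,0,88,33,22\n' | pick_method1
# prints:
# 6 9
# 7 33
```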
The following figure shows the output obtained by running the test script. The program reads the two data sets and then prints their contents on the screen.

In the next article, I will show new functions for different types of statistical analysis of the data sets.
APPENDIX
This appendix contains the source code of the functions ReadData() and CheckNum().
function ReadData(filename,fsep,skipchr,warn,range,ndata,data) {
#======================================================================
# FUNCTION NAME: ReadData(filename,fsep,skipchr,warn,range,ndata,data)
#======================================================================
# DESCRIPTION
#
# Read the data from the file "filename", skipping all empty records,
# those starting with one of the characters in the regular expression
# skipchr, and those containing alphabetic characters or NaN or INF
# values in the selected fields.
#
# The field separator is specified in fsep and it is used to set
# the internal variable FS.
# The fields to read are selected in one of two ways, according to
# the method stored in range[0].
#
# INPUT PARAMETERS
# warn       : 0, report warnings without stopping the program
#            : 1, stop the program at the first warning
# skipchr    : Regular expression such as skipchr="@|#|;" to skip
#              the records starting with an at sign, a hash or a
#              semicolon.
# fsep       : single character such as fsep=" " or
#              fsep="," or fsep="\t" for tab-delimited.
# range[0]=0 -> Method 0: The first field is at range[1],
#                         the last field at range[2].
# range[0]=1 -> Method 1: range[1] is the number of fields,
#                         listed in the following elements.
# OUTPUT DATA
# ndata[0]   : number of data sets
# ndata[1]   : number of data points
# data[n,m]  : mth data point of the nth set (starting from
#              data[0,0])
#==================================================================
# (c) Danilo Roccatano 1987-2020
#==================================================================
    # Color printing on the terminal screen using ANSI escape codes
    green="\033[1;32m"
    blue="\033[1;34m"
    red="\033[1;31m"
    ecol="\033[0m"
    ndp=0
    nrec=0
    FS=fsep
    # Select the range method; ra is the number of fields to read
    if (range[0]==0) {
        fi=range[1]
        ff=range[2]
        ra=ff-fi+1
    } else {
        ra=range[1]
    }
    # Read data from file "filename"
    line=1
    while ((getline < filename) > 0) {
        if (NF > 0) {
            frchr=substr($1,1,1)
            if (!match(frchr,skipchr)) {
                # Take the number of fields from the first data record
                if (nrec == 0) {
                    nrec=NF
                    print green
                    printf "NOTE: The first record contains %d data sets.\n",nrec
                    printf "NOTE: The number of data sets considered is %d\n",ra
                    printf "NOTE: corresponding to the columns"
                    if (range[0]==0) {
                        # Method: 0
                        printf " spanning from %d to %d.\n",fi,ff
                    } else {
                        # Method: 1
                        printf ":"
                        for (ii=2;ii<2+ra;ii++) {
                            pp=range[ii]
                            printf " %d",pp
                        }
                        printf ".\n"
                    }
                    print ecol
                    rskip=0
                } else {
                    if (nrec != NF) {
                        print red
                        printf "WARNING: The number of fields (%d) at line %d\n",NF,line
                        #printf "\"%s\"\n", $0
                        printf "WARNING: is different from the first data record (%d).\n",nrec
                        print "WARNING: Therefore, the record is skipped."
                        print ecol
                        if (warn == 1) exit
                        rskip=1
                    } else {
                        rskip=0
                    }
                }
                if (rskip == 0) {
                    # bad becomes 1 if any selected value fails the check
                    bad=0
                    if (range[0]==0) {
                        # Method: 0
                        for (ii=0;ii<ra;ii++) {
                            pp=fi+ii
                            if (CheckNum(warn,line,$pp)) bad=1
                            data[ii,ndp]=$pp
                        }
                    } else {
                        # Method: 1
                        for (ii=2;ii<2+ra;ii++) {
                            pp=range[ii]
                            if (CheckNum(warn,line,$pp)) bad=1
                            data[ii-2,ndp]=$pp
                        }
                    }
                    # Keep the record only if every value passed the check
                    if (bad == 0) ndp++
                }
            }
        }
        line++
    }
    close(filename)
    ndata[0]=ra
    ndata[1]=ndp
}
function CheckNum(warn,line,ff) {
#======================================================================
# FUNCTION NAME: CheckNum(warn,line,ff)
#======================================================================
# DESCRIPTION:
# Check if the data value ff is a NaN or INF type, or contains other
# alphabetic characters, and print an alert message.
# Return 1 if the value is rejected and 0 otherwise.
#==================================================================
# (c) Danilo Roccatano 1987-2020
#==================================================================
    yellow="\033[1;33m"
    ecol="\033[0m"
    chk=0
    if (ff ~ /[[:alpha:]]/) {
        print yellow
        if (warn == 0) {
            printf "WARNING: Line %d is skipped as it contains\n",line
        } else {
            printf "WARNING: Line %d contains\n",line
        }
        # "-INF" also matches INF, so one pattern covers all three cases
        if (ff ~ /NaN|INF/) {
            print "WARNING: NaN or INF data types."
        } else {
            print "WARNING: alphabetic characters."
        }
        print ecol
        chk=1
    }
    if (chk == 1 && warn == 1) exit
    return chk
}