Programming in Awk Language. LiStaLiA: Little Statistics Library in Awk. Part I.

In the following previous Awk programming articles

The AWK Programming Language

Awk Programming II: Life in a Shell

Awk Programming III: the One-Dimensional Cellular Automaton

I have briefly introduced this handy Unix program by showing two examples of elaborate applications. In this fourth article of the series, I will offer a little library of functions that can be used for the essential statistical analysis of data sets. I have written (and rewritten) many of these functions, but I have spent little time collecting them in a library that can be used by other users. So this article gives me the motivation to achieve this target. Unfortunately, I didn’t extensively test the library, so I am releasing it as an alpha version. If you spot errors or improve it, please just send me your modified code!

READING DATA SETS

We start with a function that can be used to read data from a text file (ascii format). A good data reader should be able to read common data format such as comma separated (cvs) or space separated data files. It should also be able to spik blank lines or lines starting with special characters. It would be also handy to select the columns that need to be read and also check and skip lines with inconsistent data sets (missing data or NaNs). This is what exacty work the function ReadData() given in the Appendix. But shall we see it more in details.

ReadData(filename,fsep,skipchr,warn,range,ndata,data)

The function read the data from a file with name provided in the variable filename. The program skips all empty record, those starting with one of the characters contained in the regular expression skipchar. For example, a regular expressions such as skipchr=”@|#|;” skips the occurrence of the characters “at” or “hash” or semicolomn. The variable warn is used to check the behavior of the program if alphabetic characters or NaN or INF values are present in the data. If the variable is set to 0, the function gives a warning without stop the program, if set to 1 then the function terminate the program after the first warning.

The field separator is specified in fsep and it is used to set the awk internal variable FS and define the separator between data. The variable can be assigned with single character such as fsep=” “ or fsep=”,” or ESC codes such as fsep=FS=”\t” for tab-delimited.

The column in the data record can be read in two ways by set the element zero of the array range[]. For range[0]=0, a adjoint range of data is specified by setting the first element is at range[1] the last one in range[2]. For range[0]=1, the first element in range[1] is the number of data to read followed by the specific field in the record where the data is located.

Continue reading