Tuesday, September 3, 2013

Using AWK for Data Science

Over the years I have become convinced that one of the essential tools needed by anyone whose job consists of working with data is the unix scripting language AWK. It will save you an awful lot of time when it comes to processing raw text data files.

For example, taking a large delimited file of some sort and pre-processing its columns to pull out just the data you want, perform basic calculations or prepare it for entry into a program that requires a specific format.

AWK has saved me countless hours over the years, so now I am writing a super brief primer that should not only convince you it is worth learning but show you some examples.

The first thing you need to know about AWK is that it is data driven, unlike most other languages for which execution is constrained largely by procedural layout of the instructions. AWK instructions are defined by patterns in the data to which actions should be applied. If you are familiar with the regular expression type control structures available in PERL then this should seem like a comfortable idea.

The programs are also data driven in the sense that the entire program is applied to every line of the file (as long as there are patterns that match) and furthermore the program has inbuilt access to the columns of data inside the file through the $0, $1, 2 ... variables: where $0 contains the entire line and $1 upwards has the data from individuals columns. By default the columns are expected to be TAB separated, but you can follow your AWK script with FS=',' to use a comma or any other field separator.

To run a simple AWK script type:

   >awk 'AWK_SCRIPT_HERE' FILE_TO_PROCESS

The basic syntax of the scripts themselves consists of mutiple pattern action pairs defined like this:
PATTERN {ACTION}

One need not include an PATTERN, in which case the action will be applied to every line inside the file to which the program is applied.

So for example, the following program will out the sum of columns 3 and 4

>awk '{print $3+$4}' FILENAME

If we only wanted this to happen when column 1 contained the value 'COSTS' we have a number of options. We could simply use the pattern equivalent of an IF statement as follows:

>awk '$1=="COSTS" {print $3+$4} FILENAME

Alternatively we could use a PATTERN expression as follows

>awk '/COSTS/ {print $3+$4} FILENAME

The problem with the second solution is that it if for some reason the word COSTS can appear in other fields or places in the file then may not get the results you are looking for. There can be a trade off for using the power and flexibility of the regular expression patterns and their ability to lull us into a false sense of security about what they are doing.

There are several special execution paths that can be included in the program. In place of the pattern you can include the reserved words BEGIN or END in order to execute routine before or after the file processing occurs. This is particularly useful for doing something like calculating a MEAN, shown below:

>awk '{sum+=$1; count+=1} END {print sum/count}' FILENAME

By now you should be seeing the appeal of AWK. You can manipulate your data quickly with small scripts that do not require loading an enormous file into a spreadsheet, or writing a more complicated JAVA or PYTHON program.

Finally here are a few of the kinds of tasks that I do with AWK all the time

1)  Convert some file with 10 or more columns into one with a sum of a few and reformating the others:

>awk '{print toupper($1) "," ($3/100) "," ($2+$4-$5)}' FILENAME


2) Calculate the Mean and Standard Deviation on a column. (The following is fora sample, just change the n-1 to n for a complete population.

> awk 'pass==1 {sum+=$1; n+=1} pass==2 {mean=sum/(n-1); ssd+=($1-mean)*($1-mean)} END {print sqrt(ssd/(n-1))}' pass=1 FILENAME pass=2 FILENAME


3) Calculate the Pearson correlation coefficient between a pair of columns. Again for a sample of the data. Change n-1 to n to do the calculation on the entire population data.

> awk 'pass==1 {sx+=$1; sy+=$2; n+=1} pass==2 {mx=sx/(n-1); my=sy/(n-1); cov+=($1-mx)*($2-my); ssdx+=($1-mx)*($1-mx); ssdy+=($2-my)*($2-my);} END {print cov / ( sqrt(ssdx) * sqrt(ssdy) ) }' pass=1 FILENAME pass=2 FILENAME

If you have any great tips for getting more out of AWK let me know, I am always looking for shortcuts.


1 comment:

  1. Awk script, grab all lines where the first val is in a list.

    awk '"11993,12096,12372,12404"~$1 {print $0}'

    CAVEAT: Only works if none of the list values are substrings of each other.

    ReplyDelete