## This page is still under development…
We are going to analyse large data sets, working with files whose size is at least 50% larger than the available RAM (but no more than tens or hundreds of GB), using different tools: Linux command-line tools, R and Python.
By default, R loads files into memory (RAM). If a file is bigger than the available RAM, R cannot load it. Moreover, even if a file fits into memory, as soon as you start working on it you may create copies of it (or large output matrices for some analyses), which can quickly overload the RAM.
You should get familiar with these tools (e.g., use man cut): refer to the Linux useful tools section
sudo apt-get install moreutils
Our application will be based on csv files extracted from http://stat-computing.org/dataexpo/2009/: “The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed.”
The aim of the data expo was to focus on different aspects of the data, answering questions such as:
The site suggested other ideas to explore data:
“You are also welcome to work with interesting subsets: you might want to compare flight patterns before and after 9/11, or between the pair of cities that you fly between most often, or all flights to and from a major airport like Chicago (ORD). Smaller subsets may also help you to match up the data to other interesting datasets.”
First, we are going to download and unzip the csv files; this may take up to 45 min depending on your internet speed.
To check the available space on your filesystem: df -h
As the content of /tmp is usually cleaned up at every boot, you should consider saving the Data/ folder in your home directory instead (if you have enough space; in that case just replace /tmp with /home below).
cd /tmp ; mkdir Data ; cd Data
wget http://stat-computing.org/dataexpo/2009/{1987..2008}.csv.bz2
Then you can list the downloaded files and check the total size of these files.
ls -sh
total 1,6G
13M 1987.csv.bz2 48M 1992.csv.bz2 74M 1997.csv.bz2 73M 2002.csv.bz2 116M 2007.csv.bz2
48M 1988.csv.bz2 48M 1993.csv.bz2 74M 1998.csv.bz2 91M 2003.csv.bz2 109M 2008.csv.bz2
47M 1989.csv.bz2 49M 1994.csv.bz2 76M 1999.csv.bz2 106M 2004.csv.bz2
50M 1990.csv.bz2 72M 1995.csv.bz2 79M 2000.csv.bz2 108M 2005.csv.bz2
48M 1991.csv.bz2 73M 1996.csv.bz2 80M 2001.csv.bz2 110M 2006.csv.bz2
It is now a good idea to compute the size of these files once uncompressed. If these files were *.gz files, we could have used gzip -l file.gz to do so. Unfortunately, bzip2 does not provide an option to display the size of the uncompressed file without actually uncompressing it. Nevertheless, we can do as follows:
sizeinbytes="$(bzcat {1987..2008}.csv.bz2 | wc -c)"
Size in Bytes: echo $sizeinbytes
12029214093
Size in MB: echo "$sizeinbytes /1024/1024" | bc
11471
Size in GB: echo "$sizeinbytes /1024/1024/1024" | bc
11
The advantage is that no disk space was used to compute these sizes. The disadvantage is that it takes time (and it will take the same amount of time again to actually uncompress the files).
Since we have enough free space, we uncompress the files.
bzip2 -d *.bz2
and wait … quite a long time.
Then we can list the uncompressed files and check the total size.
wc -c *
127162942 1987.csv
501039472 1988.csv
486518821 1989.csv
509194687 1990.csv
491210093 1991.csv
492313731 1992.csv
490753652 1993.csv
501558665 1994.csv
530751568 1995.csv
533922363 1996.csv
540347861 1997.csv
538432875 1998.csv
552926022 1999.csv
570151613 2000.csv
600411462 2001.csv
530507013 2002.csv
626745242 2003.csv
669879113 2004.csv
671027265 2005.csv
672068096 2006.csv
702878193 2007.csv
689413344 2008.csv
12029214093 total
This corresponds to the previous output.
For more details about the size of each file as well as the total:
ls -sh
(or du -ch *)
total 12G
122M 1987.csv   469M 1991.csv   507M 1995.csv   528M 1999.csv   598M 2003.csv   671M 2007.csv
478M 1988.csv   470M 1992.csv   510M 1996.csv   544M 2000.csv   639M 2004.csv   658M 2008.csv
464M 1989.csv   469M 1993.csv   516M 1997.csv   573M 2001.csv   640M 2005.csv
486M 1990.csv   479M 1994.csv   514M 1998.csv   506M 2002.csv   641M 2006.csv
If you want to count the number of files you just unzipped:
ls * | wc -l
22
We have 22 files.
Let’s have a look at the content of these files.
head 1987.csv
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,15,4,729,730,903,849,PS,1451,NA,94,79,NA,14,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,17,6,741,730,918,849,PS,1451,NA,97,79,NA,29,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,18,7,729,730,847,849,PS,1451,NA,78,79,NA,-2,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,19,1,749,730,922,849,PS,1451,NA,93,79,NA,33,19,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,22,4,728,730,852,849,PS,1451,NA,84,79,NA,3,-2,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,24,6,744,730,908,849,PS,1451,NA,84,79,NA,19,14,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
If we want only the header:
head -n 1 1988.csv
This should only display the first line.
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
We can see that this header (i.e., names of the variables) is present in all files, for example in the second:
head -n 1 1989.csv
At this point, we want to merge all files into one, so we will have to remove the header from all files but one. We investigated several commands to perform this operation and kept the fastest one. Details are given below using the time command.
We are going to remove the header from this file and time this action with different Linux tools.
The time command is used to measure how long a command takes to run.
The time option -p makes sure the output follows the POSIX standard (the Portable Operating System Interface is a family of standards specified by the IEEE Computer Society for maintaining compatibility between operating systems).
We are going to test different commands on a copy of the 1987.csv file.
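For example, the copy can be made as follows (the same command appears again further down, before the merging tests):
cp -f 1987.csv 1987-copy.csv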
Note that the results of the time command differ depending on processor speed, disk performance and many other factors (refer to the time command).
We first try with the awk command:
Redirection operators (> or >>) cannot be used to redirect the content of 1987-copy.csv to itself, because these operators have a higher priority than the command and create/truncate the file before the command is even invoked. To avoid that, you should use appropriate tools such as tee, sponge, editors, or any other tool that can write its result back to the file (e.g., sed -i or sort file -o file).
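The exact command is not reproduced on the original page; a plausible form, skipping the header line with the '!=' inequality and writing the result back with sponge (from moreutils), is:
time -p awk 'NR != 1' 1987-copy.csv | sponge 1987-copy.csv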
The time the awk command took to run on the datalyptus server is as follows (your results will certainly differ):
real 10.88
user 0.66
sys 2.51
user and sys refer to the CPU time used by the process itself, independently of the other processes sharing the CPU. real refers to the elapsed (wall-clock) time, which includes the time taken by the other processes.
user+sys = 3.17sec
If we try this alternative option, awk 'NR>1' 1987-copy.csv | sponge 1987-copy.csv, it also works.
The time (in seconds) this awk command took to run on the datalyptus server is as follows (again, your results will certainly differ):
real 10.60
user 0.68
sys 2.45
user+sys = 3.13sec
We can note that writing the awk test with the '>' comparison or with the '!=' inequality does not significantly change the run time.
We now try with sed, which can edit the file in place:
sed -i '1d' 1987-copy.csv
The result is as follows:
real 29.16
user 1.79
sys 4.84
user+sys = 6.63sec
To suppress the first line of the file and write the result back to the same file, you can also use an appropriate in-place editor such as ex (part of Vim):
ex -c ':1d' -c ':wq' 1987-copy.csv
where:
:1d deletes the first line
:wq saves and quits
The result is as follows:
real 15.66
user 2.56
sys 0.88
user+sys = 3.44sec
We now try with the tail command, which prints the file starting from its second line:
time -p tail -n +2 1987-copy.csv | sponge 1987-copy.csv
real 6.21
user 0.05
sys 1.73
user+sys = 1.78sec
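A second set of timings follows; it most likely corresponds to the more command used in the same way (the exact command is not shown on the original page, so the following is an assumption):
time -p more +2 1987-copy.csv | sponge 1987-copy.csv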
real 6.69
user 0.08
sys 1.74
user+sys = 1.82sec
The more and tail commands are the fastest ways to remove the file header.
To get a more accurate benchmark of these tools, it would be better to run each command one hundred times and average the execution times (since the CPU time differs at each execution).
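A minimal sketch of such a benchmark, assuming we benchmark the tail variant and use GNU time (/usr/bin/time), whose -f option prints the user and sys times of each run on stderr:
for i in $(seq 1 100); do
  cp -f 1987.csv 1987-copy.csv                       # restore the file before each run
  /usr/bin/time -f "%U %S" tail -n +2 1987-copy.csv | sponge 1987-copy.csv
done 2> times.txt
# Average the user and sys times over the 100 runs:
awk '{ u += $1; s += $2 } END { print "mean user:", u/NR, "mean sys:", s/NR }' times.txt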
Let’s make a copy of two files:
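The copies can be made with cp, exactly as done again a bit further down:
cp -f 1987.csv 1987-copy.csv
cp -f 1988.csv 1988-copy.csv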
The number of lines for each file is:
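It can be obtained with wc (the exact invocation is not shown on the original page):
wc -l 1987-copy.csv
wc -l 1988-copy.csv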
1311827 1987-copy.csv
5202097 1988-copy.csv
Now let’s merge these two files into 1988-copy.csv and time it:
time -p cat 1987-copy.csv 1988-copy.csv | sponge 1988-copy.csv
real 40.35
user 0.03
sys 10.46
This gives a total CPU time of user + sys = 10.49 sec.
We can check the number of lines of the resulting file with the wc command:
wc -l 1988-copy.csv
6513924 1988-copy.csv
We can check that it worked: the total number of lines, 6513924, is indeed the sum of the numbers of lines of the two merged files (1311827 + 5202097).
Now let’s merge the same files with the redirection operator >> and time it again:
cp -f 1987.csv 1987-copy.csv
cp -f 1988.csv 1988-copy.csv
time -p cat 1988-copy.csv >> 1987-copy.csv
real 10.07
user 0.01
sys 2.08
This gives a total CPU time of user + sys = 2.09 sec, which is much less than the 10.49 sec we previously found.
We can also check the number of lines of the resulting file with the wc command:
wc -l 1987-copy.csv
6513924 1987-copy.csv
**We can conclude that the method using the cat command and the redirection operator >> is much faster for merging the two files. This can be explained by the fact that >> appends the output of cat to the end of the existing file: only the content of the appended file has to be copied, not the content of both files.**
Now we are going to concatenate all files into airline.csv, removing the header lines (except from 1987.csv). This takes around 2 minutes with command-line tools:
date
nbfiles="$(ls {1987..2008}.csv | wc -l)"
nblinesbefore="$(cat {1987..2008}.csv | wc -l)"
mv 1987.csv airline.csv
# The header line is removed from all the files except from the 1987.csv and
# the content of the processed file is redirected to airline.csv
for res in {1988..2008}; do
tail -n +2 ${res}.csv >> airline.csv
# We delete each file once processed:
rm ${res}.csv
done;
# Since the first line of each file was removed except the first one, the total number of lines in the resulting file is given by:
nbTotalLines=$nblinesbefore-$nbfiles+1
echo $nbTotalLines |bc
date
123534970
We can check that the total number of lines is as expected for airline.csv:
wc -l airline.csv
123534970 airline.csv
Everything is fine!
Now we have one big CSV (comma-separated values) file named airline.csv.
You should refer to the website for a description of these variables; each column is described as follows:
col | Name | Description |
---|---|---|
1 | Year | 1987-2008 |
2 | Month | 1-12 |
3 | DayofMonth | 1-31 |
4 | DayOfWeek | 1 (Monday) - 7 (Sunday) |
5 | DepTime | actual departure time (local, hhmm) |
6 | CRSDepTime | scheduled departure time (local, hhmm) |
7 | ArrTime | actual arrival time (local, hhmm) |
8 | CRSArrTime | scheduled arrival time (local, hhmm) |
9 | UniqueCarrier | unique carrier code |
10 | FlightNum | flight number |
11 | TailNum | plane tail number |
12 | ActualElapsedTime | in minutes |
13 | CRSElapsedTime | in minutes |
14 | AirTime | in minutes |
15 | ArrDelay | arrival delay, in minutes |
16 | DepDelay | departure delay, in minutes |
17 | Origin | origin IATA airport code |
18 | Dest | destination IATA airport code |
19 | Distance | in miles |
20 | TaxiIn | taxi in time, in minutes |
21 | TaxiOut | taxi out time in minutes |
22 | Cancelled | was the flight cancelled? |
23 | CancellationCode | reason for cancellation (A = carrier, B = weather, C = NAS, D = security) |
24 | Diverted | 1 = yes, 0 = no |
25 | CarrierDelay | in minutes |
26 | WeatherDelay | in minutes |
27 | NASDelay | in minutes |
28 | SecurityDelay | in minutes |
29 | LateAircraftDelay | in minutes |
It is now time to check the integrity of both the csv file and our data. The file is too large to notice missing data or errors visually. First, we are going to look for any missing separators.
Knowing we should have 29 variables for each line (record), we are going to check that every line contains exactly 29 fields:
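The original command is not shown here; a possible awk check, wrapped between two calls to date to see when it starts and ends, could be:
date
# Print any line that does not contain exactly 29 comma-separated fields:
awk -F, 'NF != 29 { print NR": "$0 }' airline.csv
date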
mardi 11 avril 2017, 11:19:01 (UTC+1000)
mardi 11 avril 2017, 11:25:00 (UTC+1000)
As nothing but the two dates was printed out, we can conclude that the csv file was correctly created and that no missing separators were encountered.
When, for a given variable, the number of possible values is small and known in advance, we can create a table of counts of these possible values. This will allow us to check the integrity of the data (e.g., how many unexpected values, how many missing values, …). From the description of the variables, we should do this for variables 1 (Year), 2 (Month), 3 (DayofMonth), 4 (DayOfWeek), 9 (UniqueCarrier), 10 (FlightNum), 11 (TailNum), 17 (Origin), 18 (Dest), 22 (Cancelled), 23 (CancellationCode) and 24 (Diverted).
More generally, a table of counts of the other variables can also highlight some information. We are going to create a table of counts for all variables (29 columns) and store each table in a different file (count1.txt for the first var, count2.txt for the second var, …):
As running code on a large file may take a long time, you may use the exit instruction combined with an NR condition to test your awk script on a few lines first. For example, to process the first six lines of airline.csv and create a table of counts (count of the value, value) stored in count*.txt:
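The original script is not reproduced here; a possible sketch, limited to the first lines of the file thanks to the NR condition and exit, is:
# For every column, count how many times each value appears in the first six
# lines of airline.csv (the header, line 1, is skipped):
awk -F, '
  NR == 1 { next }
  NR > 6  { exit }
  { for (i = 1; i <= 29; i++) count[i SUBSEP $i]++ }
  END {
    for (k in count) {
      split(k, a, SUBSEP)
      print count[k], a[2] > ("count" a[1] ".txt")
    }
  }' airline.csv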
If everything is fine, we can process the command on the entire file:
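Again a sketch rather than the original command; it is the same awk program without the two NR-based test lines:
awk -F, '
  NR == 1 { next }
  { for (i = 1; i <= 29; i++) count[i SUBSEP $i]++ }
  END {
    for (k in count) {
      split(k, a, SUBSEP)
      print count[k], a[2] > ("count" a[1] ".txt")
    }
  }' airline.csv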
For variable 23, where the number of possible values should be only 5 (NA = not applicable, A = carrier, B = weather, C = NAS, D = security), we can have a look at count23.txt:
cat count23.txt | sort -n
601 D
149079 C
267054 B
317972 A
38955823
83844440 NA
We can notice that there are 38955823 empty values that should be set to NA.
First, launch the memory monitor (e.g., gnome-system-monitor ).
Then launch R and try to read the whole file the naive way:
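The exact instruction is not shown on the original page; the naive attempt would look something like:
# Try to load the whole 12 GB csv file into RAM at once:
x <- read.csv("airline.csv")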
As you can see, this (most probably) fails.
Now, we use the bigmemory R package to read the file.
Most of what follows comes from the slides of a talk given by John W. Emerson (“Jay”) and Michael J. Kane from Yale University at the UseR! 2009 conference. They used the data set airline.csv which, unfortunately, is not well formed (more on this later). https://www.r-project.org/conferences/useR-2009/slides/Emerson+Kane.pdf
To install the bigmemory library:
install.packages("bigmemory", repos="http://R-Forge.R-project.org")
require(bigmemory)
# Takes around 16 minutes:
system.time({
X <- read.big.matrix("airline.csv",
header = TRUE, type = "integer",
backingfile = "airline.bin",
descriptorfile = "airline.desc",
extraCols = "age")
})
dim(X)
head(X)
q("no")
Unfortunately, all character codes from the files (i.e., columns 9, 17, 18 and 23) have been replaced by NA values. This is because we used the integer type (matrix parameter). Using the char type would not help (since it is only a C way of storing values on a single byte, hence with only 256 distinct values). This might change in the future, but for the moment we need to replace all character entries by an integer numeric code.
To use bigmemory you need to replace all string codes from columns 9 (unique carrier code), 17 (origin IATA airport code), 18 (destination IATA airport code) and 23 (CancellationCode) with an integer code.
To do so, we simply replace the first column of each count*.txt file with the line number, which will stand for the integer code, keep the real (string) code in the second column, and store the result in the corresponding code*.txt file.
For example, for the 9th column, representing the UniqueCarrier variable (unique carrier code), the integer 1 will stand for AA, 2 for XE, … and the result is stored in code9.txt; the same is done for columns 17 (Origin), 18 (Dest) and 23 (CancellationCode).
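The commands used to build the code*.txt files are not shown on the original page; a possible sketch (assuming the count*.txt files are space-separated "count value" files, as displayed above) is:
# Replace the count (first column) by the line number, which becomes the integer code:
for col in 9 17 18 23; do
  awk '{ print NR, $2 }' count${col}.txt > code${col}.txt
done
# Note: the empty value of column 23 yields an empty second field here; it is handled
# separately (replaced by NA) in the next step.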
We can display the content of code9.txt to check everything is fine:
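For instance with cat:
cat code9.txt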
1 AA
2 XE
3 DL
4 OO
5 MQ
6 PA
7 TW
8 B6
9 FL
10 AQ
11 CO
12 ML
13 EA
14 TZ
15 AS
16 YV
17 HA
18 UA
19 PS
20 OH
21 9E
22 NW
23 HP
24 WN
25 F9
26 DH
27 PI
28 EV
29 US
**We are going to replace the empty values ,, by ,NA, in the airline.csv file. We also replace all character entries by their corresponding integer codes (as given in the code*.txt files built above).**
This is done with an awk command that takes about 23 minutes to run on the whole file:
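The original command is not reproduced here; the following is only a sketch of how it could be done. It assumes the code*.txt files built above and writes the result to a temporary file (airline-tmp.csv, a name chosen here) before replacing airline.csv:
awk -v OFS=, '
  # Build the string -> integer lookup tables from the (space-separated) code files:
  FILENAME == "code9.txt"  { map9[$2]  = $1; next }
  FILENAME == "code17.txt" { map17[$2] = $1; next }
  FILENAME == "code18.txt" { map18[$2] = $1; next }
  FILENAME == "code23.txt" { map23[$2] = $1; next }
  # First line of airline.csv: keep the header and switch to comma-separated fields:
  FNR == 1 { FS = ","; print; next }
  {
    # Replace empty fields by NA:
    for (i = 1; i <= NF; i++) if ($i == "") $i = "NA"
    # Replace the string codes by their integer codes:
    if ($9  in map9)  $9  = map9[$9]
    if ($17 in map17) $17 = map17[$17]
    if ($18 in map18) $18 = map18[$18]
    if ($23 in map23) $23 = map23[$23]
    print
  }' code9.txt code17.txt code18.txt code23.txt airline.csv > airline-tmp.csv
mv airline-tmp.csv airline.csv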
Launch R
require(bigmemory)
# Takes around 16 minutes:
system.time({
X <- read.big.matrix("airline.csv",
header = TRUE, type = "integer",
backingfile = "airline.bin",
descriptorfile = "airline.desc",
extraCols = "age")
})
dim(X)
head(X)
q("no")
Subsequent R sessions can connect to the backing file instantaneously, and we can interact with it (e.g., compute some statistics):
library(bigmemory)
xdesc <- dget("airline.desc")
xdesc
# The following command executes very quickly compared to the previous read.big.matrix(...):
x <- attach.big.matrix(xdesc)
colnames(x)
head(x)
system.time(a <- x[,1])
max(x[,]) # This one will not work! -> Error in GetAll.bm(x) : Too many indices (>2^31-1) for extraction.
range(x[,1], na.rm = TRUE)
tail(x, 1)
Can we get all flights from JFK to SFO? Sure!
a <- read.table("count17.txt", sep = ",")
JFK <- a$V1[a$V2 == "JFK"]
SFO <- a$V1[a$V2 == "SFO"]
gc(reset=TRUE)
y <- x[x[, "Origin"] == JFK & x[, "Dest"] == SFO,]
dim(y)
gc()
rm(y)
# Barplot of the number of flights per year (variable 1, Year), from its table of counts:
var1 <- read.table("count1.txt", sep = " ")
barplot(var1$V1, names = var1$V2)
# Histogram of variable 26 (WeatherDelay, in minutes), rebuilt from its table of counts:
var26 <- read.table("count26.txt", sep = " ")
var26 <- var26[with(var26, order(V2, V1)), ]
var26 <- na.omit(var26)[-1,]
breaks26 <- hist(var26$V2, plot = FALSE)$breaks/10
count26 <- rep(NA, length(breaks26) - 1)
for (i in 1:(length(breaks26) - 1)) {
count26[i] <- sum(var26$V1[(breaks26[i] < var26$V2) & (var26$V2 <= breaks26[i + 1])])
}
myhist26 <- list(breaks = breaks26, counts = count26)
class(myhist26) <- "histogram"
plot(myhist26)
require(biganalytics)
# The mean of the first column:
colmean(x, 1, na.rm = TRUE)
# The first column is now cached, so a second operation on the same column is fast:
colrange(x, 1, na.rm = TRUE)
When is the best hour of the day to fly to minimize delays? A simple computation done in parallel on 3 cores.
gnome-system-monitor
require(foreach)
require(doMC)
registerDoMC(cores = 3)
probs <- c(0.9, 0.99, 0.999, 0.9999)
desc <- describe(x)
# delays by hour of day.
anshourofday <- foreach (i = seq(0, colmax(x, "CRSDepTime") - 1, by=60),
.combine = cbind)%dopar%
{
require(bigmemory)
x <- attach.big.matrix(desc)
ind <- mwhich(x, "CRSDepTime", c(i, i + 60),
comps = c('ge', 'lt'))
m <- cbind(probs, quantile(x[ind, "DepDelay"],
probs = probs, na.rm = TRUE))
colnames(m) <- c("Probabilities", "Quantiles")
t(m)[2,]
}
Try to understand this code!
CRSDepTime : scheduled departure time (local, hhmm). DepDelay : departure delay, in minutes.
What has been done just above is not correct! Indeed, the variable CRSDepTime should be in hhmm format in the file airline.csv, but this is not the case. Moreover, the code assumes that this variable is expressed in minutes, which is wrong!
What should be done?
Go back to the original website and download the original data: http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time. We will not do it here since it would take too much time. It would also be interesting to write scripts that download the files automatically, without having to click through the website.
Do older planes suffer more delays? Maybe. A computationally intensive example done in parallel.
uniqTailNum <- na.omit(unique(x[, 11]))
uniqTailNum.len <- length(uniqTailNum)
# 166 different planes whose TailNum is known
planeStart <- big.matrix(nrow = uniqTailNum.len,
ncol = 1, shared = TRUE)
psDesc <- describe(planeStart)
foreach(i=1:uniqTailNum.len) %dopar%
{
require(bigmemory)
x <- attach.big.matrix(desc)
planeStart <- attach.big.matrix(psDesc)
# The first year in which plane i can be found:
yearInds <- mwhich(x, "TailNum", uniqTailNum[i],
comps = 'eq')
minYear <- min( x[yearInds, "Year"], na.rm = TRUE )
# First month in minYear where the plane can be found:
minMonth <- min( x[yearInds, "Month"], na.rm = TRUE )
planeStart[i, 1] <- 12 * minYear + minMonth
return(TRUE)
}
BadTailNum <- mwhich(x, 11, NA, 'eq')
x[BadTailNum, 30] <- NA
MoreRecentDate <- max(x[,1]) * 12 + max(x[,2])
system.time(foreach(i=1:uniqTailNum.len) %dopar%
{
require(bigmemory)
x <- attach.big.matrix(desc)
planeStart <- attach.big.matrix(psDesc)
tmpInds <- mwhich(x, 11, uniqTailNum[i], 'eq')
x[tmpInds, 30] <- as.integer(MoreRecentDate -
planeStart[i, 1])
})
Column "age" (i.e., 30) of x is now filled.
blm1 <- biglm.big.matrix(ArrDelay ~ age, data = x)
( out <- summary(blm1) )
names(out)
out$rsq
blm2 <- biglm.big.matrix(ArrDelay ~ age + Year, data = x)
summary(blm2)
Revolution Analytics’ RevoScaleR package overcomes the memory limitation of R by providing a binary file format (extension .xdf) that is optimized for processing blocks of data at a time. The xdf file format stores data in a way that facilitates the operation of external-memory algorithms. RevoScaleR provides functions for importing data from different kinds of sources into xdf files, manipulating data in xdf files, and performing statistical analyses directly on data in these files.
This package is available in Revolution R Enterprise which is non-free (owned by Microsoft now…): http://www.revolutionanalytics.com/revolution-r-enterprise
Analyze airline.csv with RevoScaleR :
http://www.r-bloggers.com/a-simple-big-data-analysis-using-the-revoscaler-package-in-revolution-r
Still to do: correct the errors.
For the other columns: box plots or histograms in R. Since the data set is too large, we will import one column at a time.
Using bigmemory:
Count the number of flights for each flight number in 2008 and save the result to 2008-flights.csv:
cut -f9,10 -d, 2008.csv | sort | uniq -c > 2008-flights.csv
Sort by the 10th column (FlightNum) (source: http://stat-computing.org/dataexpo/2009/unix-tools.html): sort -t, -k 10,10 2008.csv
To show flights from Des Moines to Chicago O’Hare:
awk -F, '$17 == "DSM" && $18 == "ORD"' 2008.csv
Select only columns 9 (carrier) and 10 (flight num):
cut -f9,10 -d, 2008.csv