## This page is still under development…
We are going to analyse large data sets, working with files whose size is at least 50% larger than the available RAM (but no more than tens or hundreds of GB), using different tools: Linux command-line tools, R and Python.
By default, R loads files into memory (RAM). If a file is bigger than the available RAM, R cannot load it. Moreover, even if a file fits into memory, as soon as you start working on it you may create copies of it (or large output matrices for some analyses), which can quickly overload the RAM.
You should get familiar with these tools (e.g., use man cut): refer to the Linux useful tools section
sudo apt-get install moreutils
Our application will be based on csv files extracted from http://stat-computing.org/dataexpo/2009/: “The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed.”
The aim of the data expo was to focus on different aspects of the data, answering questions such as:
The site suggested other ideas to explore data:
“You are also welcome to work with interesting subsets: you might want to compare flight patterns before and after 9/11, or between the pair of cities that you fly between most often, or all flights to and from a major airport like Chicago (ORD). Smaller subsets may also help you to match up the data to other interesting datasets.”
First, we are going to download and unzip the csv files; this may take up to 45 min depending on your internet speed.
To check the available space on your filesystem: df -h
As the content of /tmp is usually cleaned up at every boot, you should consider saving the Data/ folder in your home directory instead (if you have enough space; in that case just replace /tmp with /home below).
cd /tmp ; mkdir Data ; cd Data
wget http://stat-computing.org/dataexpo/2009/{1987..2008}.csv.bz2
Then you can list the downloaded files and check the total size of these files.
ls -sh
total 1,6G
13M 1987.csv.bz2 48M 1992.csv.bz2 74M 1997.csv.bz2 73M 2002.csv.bz2 116M 2007.csv.bz2
48M 1988.csv.bz2 48M 1993.csv.bz2 74M 1998.csv.bz2 91M 2003.csv.bz2 109M 2008.csv.bz2
47M 1989.csv.bz2 49M 1994.csv.bz2 76M 1999.csv.bz2 106M 2004.csv.bz2
50M 1990.csv.bz2 72M 1995.csv.bz2 79M 2000.csv.bz2 108M 2005.csv.bz2
48M 1991.csv.bz2 73M 1996.csv.bz2 80M 2001.csv.bz2 110M 2006.csv.bz2
It is now a good idea to compute the size of these files once uncompressed. If these files were *.gz files, we could have used gzip -l file.gz to do so. Unfortunately, bzip2 does not provide an option to display the size of the uncompressed file without actually uncompressing it. Nevertheless, we can do as follows:
sizeinbytes="$(bzcat {1987..2008}.csv.bz2 | wc -c)"
Size in Bytes: echo $sizeinbytes
12029214093
Size in MB: echo "$sizeinbytes /1024/1024" | bc
11471
Size in GB: echo "$sizeinbytes /1024/1024/1024" | bc
11
The advantage is that no disk space was used to compute these sizes. The disadvantage is that it takes time (and it will take the same amount of time again to actually uncompress the files).
Since we have enough free space, we uncompress the files.
bzip2 -d *.bz2
and wait … quite a long time.
Then we can list the uncompressed files and check the total size.
wc -c *
127162942 1987.csv
501039472 1988.csv
486518821 1989.csv
509194687 1990.csv
491210093 1991.csv
492313731 1992.csv
490753652 1993.csv
501558665 1994.csv
530751568 1995.csv
533922363 1996.csv
540347861 1997.csv
538432875 1998.csv
552926022 1999.csv
570151613 2000.csv
600411462 2001.csv
530507013 2002.csv
626745242 2003.csv
669879113 2004.csv
671027265 2005.csv
672068096 2006.csv
702878193 2007.csv
689413344 2008.csv
12029214093 total
This corresponds to the previous output.
For more details about the size of each file as well as the total:
ls -sh
(or du -ch *)
total 12G
122M 1987.csv   469M 1991.csv   507M 1995.csv   528M 1999.csv   598M 2003.csv   671M 2007.csv
478M 1988.csv   470M 1992.csv   510M 1996.csv   544M 2000.csv   639M 2004.csv   658M 2008.csv
464M 1989.csv   469M 1993.csv   516M 1997.csv   573M 2001.csv   640M 2005.csv
486M 1990.csv   479M 1994.csv   514M 1998.csv   506M 2002.csv   641M 2006.csv
If you want to count the number of files you just unzipped:
ls * | wc -l
22
We have 22 files.
Let’s have a look at the content of these files.
head 1987.csv
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,15,4,729,730,903,849,PS,1451,NA,94,79,NA,14,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,17,6,741,730,918,849,PS,1451,NA,97,79,NA,29,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,18,7,729,730,847,849,PS,1451,NA,78,79,NA,-2,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,19,1,749,730,922,849,PS,1451,NA,93,79,NA,33,19,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,22,4,728,730,852,849,PS,1451,NA,84,79,NA,3,-2,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,24,6,744,730,908,849,PS,1451,NA,84,79,NA,19,14,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
If we want only the header:
head -n 1 1988.csv
This should only display the first line.
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
We can see that this header (i.e., names of the variables) is present in all files, for example in the second:
head -n 1 1989.csv
At this point, we want to merge all files into one, so we will have to remove the header from all files but one. We investigated several commands to perform this operation and kept the fastest one. Details are given below using the time command.
We are going to remove the header from this file and time this action with different Linux tools.
The time command is used to measure how long a command takes to run.
The time option -p makes sure the output follows the POSIX standard (the Portable Operating System Interface is a family of standards specified by the IEEE Computer Society for maintaining compatibility between operating systems).
We are going to test different commands on a copy of the 1987.csv file.
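For example, the copy can be made as follows (the same command appears again further down, before the merging tests):
cp -f 1987.csv 1987-copy.csv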
Note that the results of the time command differ depending on processor speed, disk performance and many other factors (refer to the time command).
We first try with the awk command:
Redirection operators (> or >>) cannot be used to redirect the content of 1987-copy.csv to itself, because these operators have a higher priority than the command and create/truncate the file before the command is even invoked. To avoid that, you should use appropriate tools such as tee, sponge, editors, or any other tool that can write its result back to the file (e.g., sed -i or sort file -o file).
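The exact command is not reproduced on the original page; a plausible form, skipping the header line with the '!=' inequality and writing the result back with sponge (from moreutils), is:
time -p awk 'NR != 1' 1987-copy.csv | sponge 1987-copy.csv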
The time the awk command took to run on the datalyptus server is as follows (your results will certainly differ):
real 10.88
user 0.66
sys 2.51
user and sys refer to the CPU time used by the process itself, independently of the other processes sharing the CPU. real refers to the elapsed (wall-clock) time, which includes the time taken by the other processes.
user+sys = 3.17sec
If we try this alternative option, awk 'NR>1' 1987-copy.csv | sponge 1987-copy.csv, it also works.
The time (in seconds) this awk command took to run on the datalyptus server is as follows (again, your results will certainly differ):
real 10.60
user 0.68
sys 2.45
user+sys = 3.13sec
We can note that writing the awk test with the '>' comparison or with the '!=' inequality does not significantly change the run time.
We now try with sed, which can edit the file in place:
sed -i '1d' 1987-copy.csv
The result is as follows:
real 29.16
user 1.79
sys 4.84
user+sys = 6.63sec
To suppress the first line of the file and write the result back to the same file, you can also use an appropriate in-place editor such as ex (part of Vim):
ex -c ':1d' -c ':wq' 1987-copy.csv
where:
:1d deletes the first line
:wq saves and quits
The result is as follows:
real 15.66
user 2.56
sys 0.88
user+sys = 3.44sec
We now try with the tail command, which prints the file starting from its second line:
time -p tail -n +2 1987-copy.csv | sponge 1987-copy.csv
real 6.21
user 0.05
sys 1.73
user+sys = 1.78sec
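A second set of timings follows; it most likely corresponds to the more command used in the same way (the exact command is not shown on the original page, so the following is an assumption):
time -p more +2 1987-copy.csv | sponge 1987-copy.csv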
real 6.69
user 0.08
sys 1.74
user+sys = 1.82sec
The more and tail commands are the fastest ways to remove the file header.
To get a more accurate benchmark of these tools, it would be better to run each command one hundred times and average the execution times (since the CPU time differs at each execution).
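A minimal sketch of such a benchmark, assuming we benchmark the tail variant and use GNU time (/usr/bin/time), whose -f option prints the user and sys times of each run on stderr:
for i in $(seq 1 100); do
  cp -f 1987.csv 1987-copy.csv                       # restore the file before each run
  /usr/bin/time -f "%U %S" tail -n +2 1987-copy.csv | sponge 1987-copy.csv
done 2> times.txt
# Average the user and sys times over the 100 runs:
awk '{ u += $1; s += $2 } END { print "mean user:", u/NR, "mean sys:", s/NR }' times.txt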
Let’s make a copy of two files:
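The copies can be made with cp, exactly as done again a bit further down:
cp -f 1987.csv 1987-copy.csv
cp -f 1988.csv 1988-copy.csv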
The number of lines for each file is:
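It can be obtained with wc (the exact invocation is not shown on the original page):
wc -l 1987-copy.csv
wc -l 1988-copy.csv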
1311827 1987-copy.csv
5202097 1988-copy.csv
Now let’s merge these two files into 1988-copy.csv and time it:
time -p cat 1987-copy.csv 1988-copy.csv | sponge 1988-copy.csv
real 40.35
user 0.03
sys 10.46
This gives a total CPU time of user + sys = 10.49 sec.
We can check the number of lines of the resulting file with the wc command:
wc -l 1988-copy.csv
6513924 1988-copy.csv
We can check that it worked: the total number of lines, 6513924, is indeed the sum of the numbers of lines of the two merged files (1311827 + 5202097).
Now let’s merge the same files with the redirection operator >> and time it again:
cp -f 1987.csv 1987-copy.csv
cp -f 1988.csv 1988-copy.csv
time -p cat 1988-copy.csv >> 1987-copy.csv
real 10.07
user 0.01
sys 2.08
This gives a total CPU time of user + sys = 2.09 sec, which is much less than the 10.49 sec we previously found.
We can also check the number of lines of the resulting file with the wc command:
wc -l 1987-copy.csv
6513924 1987-copy.csv
**We can conclude that the method using the cat command and the redirection operator >> is much faster for merging the two files. This can be explained by the fact that >> appends the output of cat to the end of the existing file: only the content of the appended file has to be copied, not the content of both files.**
Now we are going to concatenate all files into airline.csv, removing the header lines (except from 1987.csv). This takes around 2 minutes with command-line tools:
date
nbfiles="$(ls {1987..2008}.csv | wc -l)"
nblinesbefore="$(cat {1987..2008}.csv | wc -l)"
mv 1987.csv airline.csv
# The header line is removed from all the files except from the 1987.csv and
# the content of the processed file is redirected to airline.csv
for res in {1988..2008}; do
tail -n +2 ${res}.csv >> airline.csv
# We delete each file once processed:
rm ${res}.csv
done;
# Since the first line of each file was removed except the first one, the total number of lines in the resulting file is given by:
nbTotalLines=$nblinesbefore-$nbfiles+1
echo $nbTotalLines |bc
date
123534970
We can check that the total number of lines is as expected for airline.csv:
wc -l airline.csv
123534970 airline.csv
Everything is fine!
Now we have one big CSV (comma-separated values) file named airline.csv.
You should refer to the website for a description of these variables; each column is described as follows:
col | Name | Description |
---|---|---|
1 | Year | 1987-2008 |
2 | Month | 1-12 |
3 | DayofMonth | 1-31 |
4 | DayOfWeek | 1 (Monday) - 7 (Sunday) |
5 | DepTime | actual departure time (local, hhmm) |
6 | CRSDepTime | scheduled departure time (local, hhmm) |
7 | ArrTime | actual arrival time (local, hhmm) |
8 | CRSArrTime | scheduled arrival time (local, hhmm) |
9 | UniqueCarrier | unique carrier code |
10 | FlightNum | flight number |
11 | TailNum | plane tail number |
12 | ActualElapsedTime | in minutes |
13 | CRSElapsedTime | in minutes |
14 | AirTime | in minutes |
15 | ArrDelay | arrival delay, in minutes |
16 | DepDelay | departure delay, in minutes |
17 | Origin | origin IATA airport code |
18 | Dest | destination IATA airport code |
19 | Distance | in miles |
20 | TaxiIn | taxi in time, in minutes |
21 | TaxiOut | taxi out time in minutes |
22 | Cancelled | was the flight cancelled? |
23 | CancellationCode | reason for cancellation (A = carrier, B = weather, C = NAS, D = security) |
24 | Diverted | 1 = yes, 0 = no |
25 | CarrierDelay | in minutes |
26 | WeatherDelay | in minutes |
27 | NASDelay | in minutes |
28 | SecurityDelay | in minutes |
29 | LateAircraftDelay | in minutes |
It is now time to check the integrity of both the csv file and our data. The file is too large to notice missing data or errors visually. First, we are going to look for any missing separators.
Knowing we should have 29 variables for each line (record), we are going to check that every line contains exactly 29 fields:
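The original command is not shown here; a possible awk check, wrapped between two calls to date to see when it starts and ends, could be:
date
# Print any line that does not contain exactly 29 comma-separated fields:
awk -F, 'NF != 29 { print NR": "$0 }' airline.csv
date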
mardi 11 avril 2017, 11:19:01 (UTC+1000)
mardi 11 avril 2017, 11:25:00 (UTC+1000)
As nothing but the two dates was printed out, we can conclude that the csv file was correctly created and that no missing separators were encountered.
When, for a given variable, the number of possible values is small and known in advance, we can create a table of counts of these possible values. This will allow us to check the integrity of the data (e.g., how many unexpected values, how many missing values, …). From the description of the variables, we should do this for variables 1 (Year), 2 (Month), 3 (DayofMonth), 4 (DayOfWeek), 9 (UniqueCarrier), 10 (FlightNum), 11 (TailNum), 17 (Origin), 18 (Dest), 22 (Cancelled), 23 (CancellationCode) and 24 (Diverted).
More generally, a table of counts of the other variables can also highlight some information. We are going to create a table of counts for all variables (29 columns) and store each table in a different file (count1.txt for the first var, count2.txt for the second var, …):
As running code on a large file may take a long time, you may use the exit instruction combined with an NR condition to test your awk script on a few lines first. For example, to process the first six lines of airline.csv and create a table of counts (count of the value, value) stored in count*.txt:
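The original script is not reproduced here; a possible sketch, limited to the first lines of the file thanks to the NR condition and exit, is:
# For every column, count how many times each value appears in the first six
# lines of airline.csv (the header, line 1, is skipped):
awk -F, '
  NR == 1 { next }
  NR > 6  { exit }
  { for (i = 1; i <= 29; i++) count[i SUBSEP $i]++ }
  END {
    for (k in count) {
      split(k, a, SUBSEP)
      print count[k], a[2] > ("count" a[1] ".txt")
    }
  }' airline.csv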
If everything is fine, we can process the command on the entire file:
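Again a sketch rather than the original command; it is the same awk program without the two NR-based test lines:
awk -F, '
  NR == 1 { next }
  { for (i = 1; i <= 29; i++) count[i SUBSEP $i]++ }
  END {
    for (k in count) {
      split(k, a, SUBSEP)
      print count[k], a[2] > ("count" a[1] ".txt")
    }
  }' airline.csv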
For variable 23, where the number of possible values should be only 5 (NA = not applicable, A = carrier, B = weather, C = NAS, D = security), we can have a look at count23.txt:
cat count23.txt | sort -n
601 D
149079 C
267054 B
317972 A
38955823
83844440 NA
We can notice that there are 38955823 empty values that should be set to NA.
First, launch the memory monitor (e.g., gnome-system-monitor ).
Then launch R and try to read the whole file the naive way:
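The exact instruction is not shown on the original page; the naive attempt would look something like:
# Try to load the whole 12 GB csv file into RAM at once:
x <- read.csv("airline.csv")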
As you can see, this (most probably) fails.
Now, we use the bigmemory R package to read the file.
Most of what follows comes from the slides of a talk given by John W. Emerson (“Jay”) and Michael J. Kane from Yale University at the UseR! 2009 conference. They used the data set airline.csv which, unfortunately, is not well formed (more on this later). https://www.r-project.org/conferences/useR-2009/slides/Emerson+Kane.pdf
To install the bigmemory library:
install.packages("bigmemory", repos="http://R-Forge.R-project.org")
require(bigmemory)
# Takes around 16 minutes:
system.time({
X <- read.big.matrix("airline.csv",
header = TRUE, type = "integer",
backingfile = "airline.bin",
descriptorfile = "airline.desc",
extraCols = "age")
})
dim(X)
head(X)
q("no")
Unfortunately, all character codes from the files (i.e., columns 9, 17, 18 and 23) have been replaced by NA values. This is because we used the integer type (matrix parameter). Using the char type would not help (since it is only a C way of storing values on a single byte, hence with only 256 distinct values). This might change in the future, but for the moment we need to replace all character entries by an integer numeric code.
To use bigmemory you need to replace all string codes from columns 9 (unique carrier code), 17 (origin IATA airport code), 18 (destination IATA airport code) and 23 (CancellationCode) with an integer code.
To do so, we simply replace the first column of each count*.txt file with the line number, which will stand for the integer code, keep the real (string) code in the second column, and store the result in the corresponding code*.txt file.
For example, for the 9th column, representing the UniqueCarrier variable (unique carrier code), the integer 1 will stand for AA, 2 for XE, … and the result is stored in code9.txt; the same is done for columns 17 (Origin), 18 (Dest) and 23 (CancellationCode).
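The commands used to build the code*.txt files are not shown on the original page; a possible sketch (assuming the count*.txt files are space-separated "count value" files, as displayed above) is:
# Replace the count (first column) by the line number, which becomes the integer code:
for col in 9 17 18 23; do
  awk '{ print NR, $2 }' count${col}.txt > code${col}.txt
done
# Note: the empty value of column 23 yields an empty second field here; it is handled
# separately (replaced by NA) in the next step.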
We can display the content of code9.txt to check everything is fine:
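For instance with cat:
cat code9.txt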
1 AA
2 XE
3 DL
4 OO
5 MQ
6 PA
7 TW
8 B6
9 FL
10 AQ
11 CO
12 ML
13 EA
14 TZ
15 AS
16 YV
17 HA
18 UA
19 PS
20 OH
21 9E
22 NW
23 HP
24 WN
25 F9
26 DH
27 PI
28 EV
29 US
**We are going to replace the empty values ,, by ,NA, in the airline.csv file. We also replace all character entries by their corresponding integer codes (as given in the code*.txt files built above).**
This is done with an awk command that takes about 23 minutes to run on the whole file:
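The original command is not reproduced here; the following is only a sketch of how it could be done. It assumes the code*.txt files built above and writes the result to a temporary file (airline-tmp.csv, a name chosen here) before replacing airline.csv:
awk -v OFS=, '
  # Build the string -> integer lookup tables from the (space-separated) code files:
  FILENAME == "code9.txt"  { map9[$2]  = $1; next }
  FILENAME == "code17.txt" { map17[$2] = $1; next }
  FILENAME == "code18.txt" { map18[$2] = $1; next }
  FILENAME == "code23.txt" { map23[$2] = $1; next }
  # First line of airline.csv: keep the header and switch to comma-separated fields:
  FNR == 1 { FS = ","; print; next }
  {
    # Replace empty fields by NA:
    for (i = 1; i <= NF; i++) if ($i == "") $i = "NA"
    # Replace the string codes by their integer codes:
    if ($9  in map9)  $9  = map9[$9]
    if ($17 in map17) $17 = map17[$17]
    if ($18 in map18) $18 = map18[$18]
    if ($23 in map23) $23 = map23[$23]
    print
  }' code9.txt code17.txt code18.txt code23.txt airline.csv > airline-tmp.csv
mv airline-tmp.csv airline.csv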
Launch R
require(bigmemory)
# Takes around 16 minutes:
system.time({
X <- read.big.matrix("airline.csv",
header = TRUE, type = "integer",
backingfile = "airline.bin",
descriptorfile = "airline.desc",
extraCols = "age")
})
dim(X)
head(X)
q("no")
Subsequent R sessions can connect to the backing file instantaneously, and we can interact with it (e.g., compute some statistics):
library(bigmemory)
xdesc <- dget("airline.desc")
xdesc
# The following command executes very quickly compared to the previous read.big.matrix(...):
x <- attach.big.matrix(xdesc)
colnames(x)
head(x)
system.time(a <- x[,1])
max(x[,]) # This one will not work! -> Error in GetAll.bm(x) : Too many indices (>2^31-1) for extraction.
range(x[,1], na.rm = TRUE)
tail(x, 1)
Can we get all flights from JFK to SFO? Sure!
a <- read.table("count17.txt", sep = ",")
JFK <- a$V1[a$V2 == "JFK"]
SFO <- a$V1[a$V2 == "SFO"]
gc(reset=TRUE)
y <- x[x[, "Origin"] == JFK & x[, "Dest"] == SFO,]
dim(y)
gc()
rm(y)
# Barplot of the number of flights per year (variable 1, Year), from its table of counts:
var1 <- read.table("count1.txt", sep = " ")
barplot(var1$V1, names = var1$V2)
# Histogram of variable 26 (WeatherDelay, in minutes), rebuilt from its table of counts:
var26 <- read.table("count26.txt", sep = " ")
var26 <- var26[with(var26, order(V2, V1)), ]
var26 <- na.omit(var26)[-1,]
breaks26 <- hist(var26$V2, plot = FALSE)$breaks/10
count26 <- rep(NA, length(breaks26) - 1)
for (i in 1:(length(breaks26) - 1)) {
count26[i] <- sum(var26$V1[(breaks26[i] < var26$V2) & (var26$V2 <= breaks26[i + 1])])
}
myhist26 <- list(breaks = breaks26, counts = count26)
class(myhist26) <- "histogram"
plot(myhist26)
require(biganalytics)
# The mean of the first column:
colmean(x, 1, na.rm = TRUE)
# The first column is now cached, so a second operation on the same column is fast:
colrange(x, 1, na.rm = TRUE)
When is the best hour of the day to fly to minimize delays? A simple computation done in parallel on 3 cores.
gnome-system-monitor
require(foreach)
require(doMC)
registerDoMC(cores = 3)
probs <- c(0.9, 0.99, 0.999, 0.9999)
desc <- describe(x)
# delays by hour of day.
anshourofday <- foreach (i = seq(0, colmax(x, "CRSDepTime") - 1, by=60),
.combine = cbind)%dopar%
{
require(bigmemory)
x <- attach.big.matrix(desc)
ind <- mwhich(x, "CRSDepTime", c(i, i + 60),
comps = c('ge', 'lt'))
m <- cbind(probs, quantile(x[ind, "DepDelay"],
probs = probs, na.rm = TRUE))
colnames(m) <- c("Probabilities", "Quantiles")
t(m)[2,]
}
Try to understand this code!
CRSDepTime : scheduled departure time (local, hhmm). DepDelay : departure delay, in minutes.
What has been done just above is not correct! Indeed, the variable CRSDepTime should be in hhmm format in the file airline.csv, but this is not the case. Moreover, the code assumes that this variable is expressed in minutes, which is wrong!
What should be done?
Go back to the original website and download the original data: http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time. We will not do it here since it would take too much time. It would also be interesting to write scripts that download the files automatically, without having to click through the website.
Do older planes suffer more delays? Maybe. A computationally intensive example done in parallel.
uniqTailNum <- na.omit(unique(x[, 11]))
uniqTailNum.len <- length(uniqTailNum)
# 166 different planes whose TailNum is known
planeStart <- big.matrix(nrow = uniqTailNum.len,
ncol = 1, shared = TRUE)
psDesc <- describe(planeStart)
foreach(i=1:uniqTailNum.len) %dopar%
{
require(bigmemory)
x <- attach.big.matrix(desc)
planeStart <- attach.big.matrix(psDesc)
# The first year in which plane i can be found:
yearInds <- mwhich(x, "TailNum", uniqTailNum[i],
comps = 'eq')
minYear <- min( x[yearInds, "Year"], na.rm = TRUE )
# First month in minYear where the plane can be found:
minMonth <- min( x[yearInds, "Month"], na.rm = TRUE )
planeStart[i, 1] <- 12 * minYear + minMonth
return(TRUE)
}
BadTailNum <- mwhich(x, 11, NA, 'eq')
x[BadTailNum, 30] <- NA
MoreRecentDate <- max(x[,1]) * 12 + max(x[,2])
system.time(foreach(i=1:uniqTailNum.len) %dopar%
{
require(bigmemory)
x <- attach.big.matrix(desc)
planeStart <- attach.big.matrix(psDesc)
tmpInds <- mwhich(x, 11, uniqTailNum[i], 'eq')
x[tmpInds, 30] <- as.integer(MoreRecentDate -
planeStart[i, 1])
})
Column "age" (i.e., 30) of x is now filled.
blm1 <- biglm.big.matrix(ArrDelay ~ age, data = x)
( out <- summary(blm1) )
names(out)
out$rsq
blm2 <- biglm.big.matrix(ArrDelay ~ age + Year, data = x)
summary(blm2)
Revolution Analytics’ RevoScaleR package overcomes the memory limitation of R by providing a binary file format (extension .xdf) that is optimized for processing blocks of data at a time. The xdf file format stores data in a way that facilitates the operation of external-memory algorithms. RevoScaleR provides functions for importing data from different kinds of sources into xdf files, manipulating data in xdf files, and performing statistical analyses directly on data in these files.
This package is available in Revolution R Enterprise which is non-free (owned by Microsoft now…): http://www.revolutionanalytics.com/revolution-r-enterprise
Analyze airline.csv with RevoScaleR :
http://www.r-bloggers.com/a-simple-big-data-analysis-using-the-revoscaler-package-in-revolution-r
Still to do: correct the errors.
For the other columns: box plots or histograms in R. Since the data set is too large, we will import one column at a time.
Using bigmemory:
Count the number of flights for each flight number in 2008 and save the result to 2008-flights.csv:
cut -f9,10 -d, 2008.csv | sort | uniq -c > 2008-flights.csv
Sort by the 10th column (FlightNum) (source: http://stat-computing.org/dataexpo/2009/unix-tools.html): sort -t, -k 10,10 2008.csv
To show flights from Des Moines to Chicago O’Hare:
awk -F, '$17 == "DSM" && $18 == "ORD"' 2008.csv
Select only columns 9 (carrier) and 10 (flight num):
cut -f9,10 -d, 2008.csv