AWK Command

The word "AWK" comes from the initials of the language's three developers: A. Aho, P. Weinberger and B. W. Kernighan.

Awk is commonly used for processing column-oriented text data such as tables; many UNIX utilities generate rows and columns of information.

How To Use The Awk Command

Like sed, awk is line-oriented: it reads its input one line at a time and processes every line the same way.

The awk command syntax is as follows:

awk 'pattern { action }' file

In the following example:

awk '$1<30 {print $1}' awk.txt

the action is performed whenever a line matches the specified pattern. The pattern is a test (condition) evaluated for each line read from the awk.txt file: if the condition is true (is $1<30?), the action (print $1) is performed on that line; otherwise awk moves on to the next line.

The pattern is optional. If it is omitted, every line matches and the action is performed for each line of the input stream.

Example: awk '/Montpe.*/ {print $1}' awk.txt where /Montpe.*/ is the pattern (a regular expression). The print action is performed for each line of awk.txt that matches it.
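The three possible forms of a rule (pattern and action, pattern only, action only) can be sketched as follows. The file name ages.txt and its contents are made up for this illustration.

```shell
# Hypothetical sample data: a name and an age on each line.
printf '%s\n' 'Luka 14' 'Eloise 5' 'Zoe 15' > ages.txt

# Pattern and action: print the name when the age is below 10.
awk '$2 < 10 { print $1 }' ages.txt      # prints: Eloise

# Pattern only: the default action prints the whole matching line.
awk '$2 < 10' ages.txt                   # prints: Eloise 5

# Action only: the action runs for every line (Luka, Eloise, Zoe).
awk '{ print $1 }' ages.txt
```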

SOME EXAMPLES

Let's create a simple file to test awk command and display results in a terminal window :

cat > awk.txt << EOF
Name           Gender          Age              City            Weight(Kg)
-------------------------------------------------------------------
Luka            M               14             Sydney               40
Mathias         M               11             Sydney               30
Jules           M               11             Montpellier          31
Eloise          F                5             Montpellier          18
Thibaud         M                3             Barcelone            15
Nina            F               11             Barcelone            35
Zoe             F               15             Perpignan            43
Gaspard         M               6              Perpignan            20
EOF

This file is saved as awk.txt. To display this file, we can use the cat command :

cat awk.txt

$x Variable

As awk is used to process column-oriented text data, $x ($1, $2, $3, etc.) looks like a positional parameter in a shell script. Instead of standing for the xth argument, however, it stands for the xth field of the input line. You can think of a field as a column; the action you specify operates on each line (row).

For example :

- $1 stands for the value of the first column
- $2 stands for the value of the second column
- $3 stands for the value of the third column
- etc.

$0 stands for the whole line.

To print the first and second column of a file, you might use the following awk script :

awk '{print $1,$2}' awk.txt
Name Gender
------------------------------------------------------------------------------ 
Luka M
Mathias M
Jules M
Eloise F
Thibaud M
Nina F
Zoe F
Gaspard M

More Useful Variables :

You might want to change Output Field Separator to put a comma between fields:

awk '{ OFS="," ; print $1,$2 }' awk.txt
Name,Gender
------------------------------------------------------------------------------,
Luka,M
Mathias,M
Jules,M
Eloise,F
Thibaud,M
Nina,F
Zoe,F
Gaspard,M

The pattern is omitted, so the { action } is performed for each line read from awk.txt: it sets the output field separator to a comma and prints the first and second columns.
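As a small sketch of how OFS behaves, the commands below set it once in a BEGIN block and show that $0 is only rebuilt with the new separator after a field has been assigned. The input line is invented for the example.

```shell
# Hypothetical one-line input with three whitespace-separated fields.
line='Luka   M   14'

# OFS is set once, before any input is read.
echo "$line" | awk 'BEGIN { OFS="," } { print $1, $2 }'       # prints: Luka,M

# print $0 keeps the original spacing, because the record was not modified.
echo "$line" | awk 'BEGIN { OFS="," } { print $0 }'           # prints: Luka   M   14

# Assigning to a field rebuilds $0 with OFS between the fields.
echo "$line" | awk 'BEGIN { OFS="," } { $1 = $1; print $0 }'  # prints: Luka,M,14
```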

More Variables :

| Variable | Description |
|----------|-------------|
| ARGC | Number of arguments on the command line |
| ARGV | Array of the command-line arguments |
| CONVFMT | Conversion format used when numbers are converted to strings |
| ENVIRON | Associative array of environment variables |
| FILENAME | Name of the current file (with its path, if one was given) |
| FNR | Number of the current record within the current file; useful when several files are processed in the same command |
| OFMT | Output format for numbers |
| RLENGTH | Length of the string found by the match() function |
| RSTART | Position of the first character of the string found by the match() function |
| SUBSEP | Separator character used internally for the subscripts of multi-dimensional arrays |
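A few of these variables can be inspected from a BEGIN block, as in the sketch below. GREETING is a hypothetical environment variable, and x and y are dummy operands that are never opened because the program contains only a BEGIN block.

```shell
GREETING=hello awk 'BEGIN {
    print ENVIRON["GREETING"]   # value taken from the environment
    print ARGC - 1              # number of operands (ARGV[0] holds the command name)
    print ARGV[1], ARGV[2]      # the operands themselves
}' x y
# hello
# 2
# x y
```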

You might want to change Output Record Separator (line separator) to put one blank line between records (input lines) :

awk '{ ORS="\n\n" ; print $0 }' awk.txt
Name           Gender          Age              City              Weight(Kg)

------------------------------------------------------------------------------

Luka            M               14             Sydney               40

Mathias         M               11             Sydney               30

Jules           M               11             Montpellier          31

Eloise          F                5             Montpellier          18

Thibaud         M                3             Barcelone            15

Nina            F               11             Barcelone            35

Zoe             F               15             Perpignan            43

Gaspard         M               6              Perpignan            20

Let's print the record number (NR), the number of fields (NF) and the whole line ($0) :

awk '{ print NR, NF, $0 }' awk.txt
1 5 Name           Gender          Age              City              Weight(Kg)
2 1 ------------------------------------------------------------------------------
3 5 Luka            M               14             Sydney               40
4 5 Mathias         M               11             Sydney               30
5 5 Jules           M               11             Montpellier          31
6 5 Eloise          F                5             Montpellier          18
7 5 Thibaud         M                3             Barcelone            15
8 5 Nina            F               11             Barcelone            35
9 5 Zoe             F               15             Perpignan            43
10 5 Gaspard         M               6              Perpignan            20

The number of fields is 5, except for the 2nd line, which has only one column ("-------").
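Because NF holds the number of fields, $NF refers to the last field of a line, and NF can be used as a pattern. A minimal sketch, using a made-up file fields.txt:

```shell
printf '%s\n' 'a b c' 'd e' 'f' > fields.txt   # hypothetical sample file

# $NF is the last field of each line, whatever the number of fields.
awk '{ print NF, $NF }' fields.txt
# 3 c
# 2 e
# 1 f

# NF used as a pattern: keep only the lines with exactly 3 fields.
awk 'NF == 3' fields.txt
# a b c
```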

-F Option is used to modify Input Field Separator

By default awk splits input lines into fields based on whitespace (spaces and tabs).

The -F option replaces the default separator with the specified character, for example : or ;.

cat > separator.txt << EOF
Apple:Banana:Ananas
Tomato:Carrot:Zukini
Oignon:Garlic:Leek
EOF

To print the first and second columns of the separator.txt file, you might do :

        awk -F: '{print $1,$2}' separator.txt
Apple Banana
Tomato Carrot
Oignon Garlic

or, setting FS in a BEGIN block :

awk 'BEGIN { FS=":" ; } {print $1,$2 }' separator.txt
Apple Banana
Tomato Carrot
Oignon Garlic
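The separator does not have to be a single character: when it is longer than one character, awk treats it as a regular expression. The sketch below uses invented input lines to show a bracket expression and a tab separator.

```shell
# Hypothetical line mixing ":" and ";" as separators.
echo 'Apple:Banana;Ananas' | awk -F'[:;]' '{ print $1, $3 }'
# Apple Ananas

# A tab-separated line: FS="\t" keeps the space inside "Granny Smith".
printf 'Granny Smith\tApple\n' | awk 'BEGIN { FS="\t" } { print $1 }'
# Granny Smith
```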

Regex Pattern and Printing Action

Using regular expression allows you to apply filters.

Let's select names (column 1) from lines containing the word "Montpellier".

awk '/Montpellier/ {print $1}' awk.txt   
Jules
Eloise

You can also test a single column against a regular expression with the ~ operator :

| Operator | Meaning |
|----------|---------|
| ~ | Matches |
| !~ | Doesn't match |

Let's suppose we would like to select the people whose weight is 40 Kg (column 5).

awk '$5 ~ /40/ {print $1}' awk.txt
Luka
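Keep in mind that ~ looks for the regular expression anywhere in the field. The following sketch, on a made-up file weights.txt, compares an unanchored match, an anchored match, a numeric comparison and the !~ operator.

```shell
printf '%s\n' 'Luka 40' 'Anna 140' 'Zoe 43' > weights.txt   # hypothetical sample file

awk '$2 ~ /40/   { print $1 }' weights.txt   # Luka and Anna: "140" also contains "40"
awk '$2 ~ /^40$/ { print $1 }' weights.txt   # Luka only: the match is anchored
awk '$2 == 40    { print $1 }' weights.txt   # Luka only: numeric comparison
awk '$2 !~ /40/  { print $1 }' weights.txt   # Zoe: the field does not contain "40"
```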

To select names from the line starting with Jules (^Jules) up to the line starting with Zoe (^Zoe), just put a comma between the two patterns :

 awk '/^Jules/,/^Zoe/ {print $1}' awk.txt
Jules
Eloise
Thibaud
Nina
Zoe

Awk Script File

We can store commands in a script file in order to simplify the code or to reuse it:

Note that when the script file is created from a cat here-document, the shell would expand the $ symbol, so special characters have to be escaped with a backslash (e.g. $ is written \$). Escaping is not necessary if you create the script file in a text editor and save it as awkScript.awk.

cat > awkScript.awk << EOF
/^Jules/,/^Zoe/ {print \$0}
EOF

Let's check if our script was successfully created :

cat  awkScript.awk 
/^Jules/,/^Zoe/ {print $0}

Let's execute this script with -f option

awk -f awkScript.awk awk.txt 
Jules           M               11             Montpellier          31
Eloise          F                5             Montpellier          18
Thibaud         M                3             Barcelone            15
Nina            F               11             Barcelone            35
Zoe             F               15             Perpignan            43
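An alternative to escaping each $ is to quote the here-document delimiter, which tells the shell to leave the content untouched. This is a sketch only; rangeScript.awk and the input lines are hypothetical.

```shell
# Quoting the delimiter ('EOF') disables $ expansion inside the here-document.
cat > rangeScript.awk << 'EOF'
# print the first column from the line starting with b up to the line starting with d
/^b/,/^d/ { print $1 }
EOF

printf '%s\n' 'a 1' 'b 2' 'c 3' 'd 4' 'e 5' | awk -f rangeScript.awk
# b
# c
# d
```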

Arithmetic Formula

Awk is a weakly typed language; variables can be either strings or numbers. The conversion rules are simple. The string "32" is automatically converted into the number 32 when placed in a formula. If the string only starts with a number, as in "123X", the leading numeric part is used (123). If it does not start with a number at all, as in "Biology" or "----", it is converted into the number 0.
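These conversions can be checked directly in a BEGIN block. The strings below are arbitrary examples.

```shell
awk 'BEGIN {
    print "32" + 1        # 33: the string is a valid number
    print "123X" + 0      # 123: only the leading numeric part is used
    print "Biology" + 0   # 0: the string does not start with a number
    print 3 " apples"     # 3 apples: concatenation converts the number to a string
}'
```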

You might write $5*2.20462 to convert the weights in column 5 to pounds :

 awk 'NR >2 {print $1,$5,"Kgs",($5*2.20462),"Pounds"}' awk.txt 
Luka 40 Kgs 88.1848 Pounds
Mathias 30 Kgs 66.1386 Pounds
Jules 31 Kgs 68.3432 Pounds
Eloise 18 Kgs 39.6832 Pounds
Thibaud 15 Kgs 33.0693 Pounds
Nina 35 Kgs 77.1617 Pounds
Zoe 43 Kgs 94.7987 Pounds
Gaspard 20 Kgs 44.0924 Pounds

To align the output, use the printf command :

 awk 'NR >2 {printf "%-10s %5d %-6s %5.2f %-6s\n",$1,$5,"Kgs",($5*2.20462),"Pounds"}' awk.txt 
Luka          40 Kgs    88.18 Pounds
Mathias       30 Kgs    66.14 Pounds
Jules         31 Kgs    68.34 Pounds
Eloise        18 Kgs    39.68 Pounds
Thibaud       15 Kgs    33.07 Pounds
Nina          35 Kgs    77.16 Pounds
Zoe           43 Kgs    94.80 Pounds
Gaspard       20 Kgs    44.09 Pounds

Let's create another test file :

cat > awk2.txt << EOF
Name             Math            Literacy        History           Biology
---------    --------------   --------------    ----------      --------------
Camille           97               85              89                  90
Caroline          80               92              50                  85
Leo               85               97              90                  89
EOF

As an example, the following code prints the average mark for each name :

        awk 'NR>2{total=0; for (col=2; col<=NF; col++) total+=$col; print $1, total/(NF-1);}' awk2.txt
Camille 90.25
Caroline 76.75
Leo 90.25

For more, here are some awk arithmetic functions :

| Function | Action |
|----------|--------|
| sqrt(expr) | returns the square root of expr |
| sin(expr) | returns the sine of expr, which is expressed in radians |
| cos(expr) | returns the cosine of expr, which is expressed in radians |
| exp(expr) | returns the exponential value of expr |
| int(expr) | truncates expr to an integer value |
| rand() | returns a random number N, with 0 <= N < 1 |

Examples :

awk 'BEGIN {
   print "Random1 =" , rand()
   print "Random2 =" , rand()
}'
Random1 = 0.382933
Random2 = 0.948479

awk 'BEGIN {
   print "Int num1 =" , int(10.745)
}'
Int num1 = 10
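A few more calls, as a hedged sketch: int() truncates toward zero rather than rounding, and in gawk and several other implementations rand() yields the same sequence on every run unless srand() is called first.

```shell
awk 'BEGIN {
    print sqrt(16)              # 4
    print int(3.9)              # 3: truncation, not rounding
    print int(-3.9)             # -3: truncation goes toward zero
    print int(3.9 + 0.5)        # 4: adding 0.5 first rounds a positive number
    srand()                     # seed the generator from the time of day
    print int(rand() * 6) + 1   # a pseudo-random integer between 1 and 6
}'
```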

Programming with Awk

Pre and Post Operations

Awk offers a pre-processing section (BEGIN) and a post-processing section (END) around the parsing of a file.

When starting a program, awk can execute instructions before the heart of the program. These instructions must be placed in a block called BEGIN.

The opening brace of the BEGIN block must be on the same line as the BEGIN keyword (unless you put a backslash character just before the line break) :

BEGIN { etc }

or :

BEGIN \
{
etc
}

BEGIN blocks are very useful for initializing variables and thus preparing the rest of the program.


Unlike BEGIN blocks, END blocks are executed at the end of the program, once all records have been processed by the heart of the program. An END block otherwise follows the same rules as a BEGIN block. When a program contains several BEGIN or END blocks, they are executed in the order in which they appear.

Let's look at a script file called blockScript.awk, which contains two BEGIN blocks and two END blocks :

cat blockScript.awk

BEGIN {
    print "Start 1";
}

BEGIN {
    print "Start 1";
}
{
print $1;
}

END {
    print "End 1";
}
END {
    print "End 2"
}
awk -f blockScript.awk awk.txt 
Start 1
Start 1
Name
------------------------------------------------------------------------------
Luka
Mathias
Jules
Eloise
Thibaud
Nina
Zoe
Gaspard
End 1
End 2
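A typical use of the two blocks is to print a header in BEGIN and a summary in END. The sketch below assumes a made-up file ages2.txt with a name and an age on each line.

```shell
printf '%s\n' 'Luka 14' 'Mathias 11' 'Jules 11' > ages2.txt   # hypothetical sample file

awk '
BEGIN { total = 0; print "Name Age" }       # runs once, before the first line
      { total += $2; print $1, $2 }         # runs for every line
END   { print "Average age:", total / NR }  # runs once, after the last line
' ages2.txt
# Name Age
# Luka 14
# Mathias 11
# Jules 11
# Average age: 12
```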

Programming Structures

Awk offers the usual programming structures: conditions and loops.

Condition

Let's take, for example, the marks for history (column 4): if a mark is greater than 60, the course is passed and 'PASS' is returned; otherwise the course is failed and 'FAIL' is returned.

cat awk2.txt

cat > awkScript.txt << EOF

BEGIN {
        OFS=","
}
NR <= 2 { next }
{
        if ( \$4 > 60 ) {
                course="PASS"
        } else {
                course="FAIL"
        }

        print \$1, course
}
EOF

This script is then executed :

awk -f awkScript.txt awk2.txt
Camille,PASS
Caroline,FAIL
Leo,PASS

For loop
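As a minimal sketch of the loop syntax, which follows C, the example below walks over the fields of an invented input line with a for loop and then with a while loop.

```shell
echo 'a b c d' | awk '{
    # for: print the fields in reverse order, each followed by a space
    for (i = NF; i >= 1; i--) printf "%s ", $i
    printf "\n"

    # while: print the first two fields with their index
    i = 1
    while (i <= 2) { print i, $i; i++ }
}'
# d c b a 
# 1 a
# 2 b
```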

How to define arrays?

AWK has associative arrays: you can use either a string or a number as an array index.

You do not need to declare the size of an array.

arrayname[string]=value

To loop over an array :

for (var in arrayname) {list of actions to be performed}

Array Script Examples :

awk 'BEGIN { fruits["mango"] = "yellow"; fruits["orange"] = "orange"; fruits["tomato"] = "red"; for (var in fruits) {print var,fruits[var]} }'
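The order in which for (var in array) visits the elements is not specified, so the output of the loop above may vary between implementations. The following sketch, on a hypothetical file cities.txt, also shows the in membership test and the delete statement.

```shell
printf '%s\n' 'Luka Sydney' 'Jules Montpellier' 'Mathias Sydney' > cities.txt   # hypothetical sample file

awk '
{ count[$2]++ }                     # the city name is the array index
END {
    if ("Sydney" in count)          # membership test that does not create the element
        print "Sydney:", count["Sydney"]
    delete count["Sydney"]          # remove one element
    for (city in count)             # only Montpellier is left
        print city ":", count[city]
}' cities.txt
# Sydney: 2
# Montpellier: 1
```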

How to define variables using awk

Use one -v option per variable to pass several shell variables to the awk command:

x=10
y=30
text="Total is : "
awk -v a=$x -v b=$y -v c="$text" 'BEGIN {ans=a+b; print c " " ans}'
Total is :  40
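A variable passed with -v can also be used in a pattern. In this sketch, limit is a hypothetical shell variable and the input lines are invented; quoting "$limit" protects the value from word splitting by the shell.

```shell
limit=12   # hypothetical shell variable

printf '%s\n' 'Luka 14' 'Mathias 11' 'Zoe 15' |
    awk -v max="$limit" '$2 > max { print $1 }'
# Luka
# Zoe
```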

String-Manipulation Functions

toupper, tolower

awk '/Mathias/ { print $1, toupper($1) }' awk.txt 
Mathias MATHIAS

printf

The printf function works essentially like C printf. This can be used when you want to format output or combine fields onto one line in more complex ways.

Printf structure :

%[flag][min width][precision][length modifier][conversion specifier]

There are many format specifiers defined in C. Take a look at the following list :

| Specifier | Description |
|-----------|-------------|
| %i or %d | Decimal integer |
| %c | Character |
| %f | Decimal floating point |
| %s | String of characters |
| %e | Scientific notation with e (ex: 1.86e6) |
| %E | Like e, but with a capital E (1.86E6) |
| %g | Uses the shorter of %e or %f |
| %G | Like g, except it uses the shorter of %E or %f |
| %x | Number in hexadecimal (base 16) |
| %% | Prints a percent sign |

printf integer formatting examples :

The print command implicitly adds a newline; printf doesn't. That is why `\n`, an escape sequence that represents a newline character, is used in the printf statements.

At least eight characters :

printf "%8d\n" 300
     300

With a plus sign, at least eight characters :

printf "%+8d\n" 300
    +300

Left-justified, plus sign, at least eight characters :

printf "%-+8d\n" 300
+300    

Scientific notation with e :

printf "%e\n" 300
3.000000e+02

Zero-filled, at least eight characters :

printf "%08d\n" 300
00000300

printf floating-point formatting examples :

One position after the decimal :

printf "%.1f\n" 10.3456
10.3

Two positions after the decimal :

printf "%.2f\n" 10.3456
10.35

Zero-filled, at least eight characters, three positions after the decimal :

printf "%08.3f\n" 10.3456
0010.346

Left-justified, at least eight characters, two positions after the decimal :

printf "%-8.2f" 10.3456;
10.35   

printf string formatting examples :

A simple string :

printf "%s" "abc"
abc

Minimum length (5 char) :

printf "%5s" "abc"
  abc

Minimum length (5 char), left-justified :

printf "%-5s" "abc"
abc  
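Inside an awk program, the format string and its arguments are separated by commas. The sketch below repeats a few of the formats above in a BEGIN block; the | character is only added to make the padding visible.

```shell
awk 'BEGIN {
    printf "%8d|\n", 300                 #      300|
    printf "%-8d|\n", 300                # 300     |
    printf "%08.3f|\n", 10.3456          # 0010.346|
    printf "%-5s|%5s|\n", "abc", "abc"   # abc  |  abc|
}'
```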

Summary of special printf characters

The following character sequences (escape sequences) have a special meaning when used in a printf format string:

\b backspace
\n newline, or linefeed
\r carriage return
\t tab
\\ backslash

As you can see from that last example, because the backslash character itself is treated specially, you have to print two backslash characters in a row to get one backslash character to appear in your output.

Special characters formatting examples :

Inserting a tab character and a newline character in a string :

printf "Hello\tworld\nHere comes the sun"
Hello   world
Here comes the sun

A Windows path with backslash characters :

printf "C:\\Windows\\System32\\" 
C:\Windows\System32\

Let's suppress the first column from awk.txt. Note that printf "%s" adds no separator, so the remaining fields are printed side by side:

awk '{ for (i=2; i<=NF; i++) printf "%s", $i ; printf "\n";}' awk.txt 
GenderAgeCityWeight(Kg)

M14Sydney40
M11Sydney30
M11Montpellier31
F5Montpellier18
M3Barcelone15
F11Barcelone35
F15Perpignan43
M6Perpignan20

length

Returns the number of characters in a string. If the argument is a number, the length of the digit string representing that number is returned.

awk '{print $1, length($1);}' awk.txt
Name 4
------------------------------------------------------------------------------ 78
Luka 4
Mathias 7
Jules 5
Eloise 6
Thibaud 7
Nina 4
Zoe 3
Gaspard 7

match(str,exp)

The match function returns the position in str where the text matching the regular expression exp starts, or 0 if there is no match. It also sets the RSTART and RLENGTH variables.

awk 'NR >2 { print $1, match($1,/L.*/)}' awk2.txt
Camille 0
Caroline 0
Leo 1

match($1,/L.*/) returns the position in $1 where the match starts (1 for "Leo", whose first character is "L") and 0 when $1 does not match the regular expression.

awk 'NR >2 { print $0, match($0,/9$/)}' awk2.txt
Camille           97               85              89                  90 0
Caroline          80               92              50                  85 0
Leo               85               97              90                  89 73

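match($0,/9$/) returns 73 for the last line, which is the position of its final "9"; the other lines do not end with "9", so it returns 0. Applied to the fifth field only, match($5,/9$/) would return 2 for that line, where $5="89".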
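RSTART and RLENGTH can be combined with substr() to extract the matched text. A minimal sketch on an invented input line:

```shell
echo 'order id=4521 shipped' | awk '{
    if (match($0, /[0-9]+/))    # look for the first run of digits
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
}'
# 10 4 4521
```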

sub(regexp, replacement [, target])

Searches the target for the first (leftmost), longest substring that matches the regular expression regexp and replaces it with replacement. The modified string becomes the new value of target; if target is omitted, $0 is used. Returns the number of substitutions made (zero or one).

awk -v stri="rain, rain, everywhere" 'BEGIN {sub(/ai/, "u", stri); print stri}'
run, rain, everywhere

echo "rain, rain, everywhere" | awk '{sub(/ai/, "u"); print $0}'  
run, rain, everywhere

gsub

gsub(regexp, replacement [, target]) works like sub(), but it replaces all of the matching substrings it can find in target, not only the first one. The 'g' in gsub() stands for "global", which means replace everywhere.
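For comparison with the sub() examples, here is a short sketch of gsub() on the same kind of input. The return value is the number of replacements, and & in the replacement stands for the matched text.

```shell
echo 'rain, rain, everywhere' | awk '{ n = gsub(/ai/, "u"); print n, $0 }'
# 2 run, run, everywhere

# & in the replacement is replaced by the text that matched.
echo 'rain, rain' | awk '{ gsub(/rain/, "[&]"); print }'
# [rain], [rain]
```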

substr

substr(string, start [, length ])

Return a length-character-long substring of string, starting at character number start. The first character of a string is the character number one.

awk '{ print $1, substr($1,2,3) }' awk.txt
Name ame
------------------------------------------------------------------------------ ---
Luka uka
Mathias ath
Jules ule
Eloise loi
Thibaud hib
Nina ina
Zoe oe
Gaspard asp

This returns 3 characters from the name column, starting at the 2nd character.

User Functions

Let's use awk.txt in the following example:

cat awk.txt
Name           Gender          Age              City              Weight(Kg)
------------------------------------------------------------------------------
Luka            M               14             Sydney               40
Mathias         M               11             Sydney               30
Jules           M               11             Montpellier          31
Eloise          F                5             Montpellier          18
Thibaud         M                3             Barcelone            15
Nina            F               11             Barcelone            35
Zoe             F               15             Perpignan            43
Gaspard         M               6              Perpignan            20

The ability to create user functions is one of the most important features of the awk utility. Functions are defined with the keyword function. In the following script, we define a function gentag, which takes the first three letters of the nom parameter, converts them to lowercase and returns them followed by an underscore and the age parameter:

cat > awkScript << EOF
function gentag(nom,age) {
        tmp=tolower(substr(nom,1,3))
        return tmp "_" age
}

BEGIN { 
        FS=" "
        OFS=";"
}

{ 
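        print \$1, \$3, gentag(\$1,\$3)
}

END { 
print NR , "lines"
}
EOF
awk -f awkScript awk.txt
Name;Age;nam_Age
------------------------------------------------------------------------------;;---_
Luka;14;luk_14
Mathias;11;mat_11
Jules;11;jul_11
Eloise;5;elo_5
Thibaud;3;thi_3
Nina;11;nin_11
Zoe;15;zoe_15
Gaspard;6;gas_6
10;lines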

We just used the gentag function to format the output
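Variables used inside an awk function are global unless they appear in the parameter list. A common convention, sketched below, is to declare local variables as extra parameters, separated from the real ones by additional spaces.

```shell
awk '
# "tmp" is declared as an extra parameter, so it is local to the function.
function gentag(nom, age,    tmp) {
    tmp = tolower(substr(nom, 1, 3))
    return tmp "_" age
}
BEGIN {
    tmp = "untouched"
    print gentag("Luka", 14)   # luk_14
    print tmp                  # untouched: the global variable was not overwritten
}'
```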

date

Create a test file date.txt :

cat > date.txt << EOF
Name           Gender          Date           
-------------------------------------------
Thomas            M               2017-09-05            
Simon         M               2011-10-28           
Elliot           M               2015-09-03             
Jeanne         F               2030-06-02                          
EOF

We want to compare a given date with the dates in the third column of our test file date.txt. To do so, we need to pass a value to a script variable (named date in the commands below).

-v option :

-v var=value assigns value to program variable var

awk -v var="hello" 'BEGIN{print var;}'
hello

Suppose we just want to get the records whose date is later than today.

date is not an awk function: it is a shell command that normally prints the current date and time of day in a well-known format.

However, if you give it an argument that begins with a + sign, date interprets the format specifiers in that string using the current time and copies the other characters to standard output unchanged. Dates written as YYYY-MM-DD sort in chronological order when compared as strings, which is what makes the comparison below work.

awk -v date="$(date +%Y-%m-%d)" '$3>date{print $0;}' date.txt 
Name           Gender          Date           
Jeanne         F               2030-06-02                          

or, since printing the whole line is the default action :

awk -v date="$(date +%Y-%m-%d)" '$3>date' date.txt
Name           Gender          Date           
Jeanne         F               2030-06-02                          

We may want to count how many times each value of a variable occurs.

Consider a weather variable whose value can be sunny, rainy, cloudy or stormy :

cat > weather.txt << EOF
DAY  WEATHER
1    sunny
2    sunny
3    rainy
4    sunny
5    stormy
6    rainy
7    rainy
EOF

We are going to create an array with the weather variable as the key and add 1 to the corresponding value each time it is read from the file:

awk '{a[$2]++}END{for(x in a)print x,a[x]}' weather.txt | sort -k2,2
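Note that the command above also counts the header line (WEATHER). One possible variation, shown here on a smaller hypothetical file weather_sample.txt, skips it with NR > 1.

```shell
printf '%s\n' 'DAY WEATHER' '1 sunny' '2 rainy' '3 sunny' > weather_sample.txt   # hypothetical sample file

# NR > 1 skips the header; sort makes the order of the result predictable.
awk 'NR > 1 { a[$2]++ } END { for (x in a) print x, a[x] }' weather_sample.txt | sort
# rainy 1
# sunny 2
```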

Processing more than one file using awk

cat > f1 << EOF
a
b
c
d
EOF
cat > f2 << EOF
e
f
g
h
EOF
awk '{printf("file->[%s] NR->[%d] FNR->[%d] str->[%s]\n", FILENAME, NR, FNR, $0)}' f1 f2
file->[f1] NR->[1] FNR->[1] str->[a]
file->[f1] NR->[2] FNR->[2] str->[b]
file->[f1] NR->[3] FNR->[3] str->[c]
file->[f1] NR->[4] FNR->[4] str->[d]
file->[f2] NR->[5] FNR->[1] str->[e]
file->[f2] NR->[6] FNR->[2] str->[f]
file->[f2] NR->[7] FNR->[3] str->[g]
file->[f2] NR->[8] FNR->[4] str->[h]

FNR is the line number of the current file, NR is the number of lines that have been processed. If you only give one file to awk, FNR will always equal NR. If you give more than one file, FNR will go back to 1 when the next file is reached but NR will continue incrementing. Therefore, NR == FNR only while the first file is being processed.
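The NR == FNR test is often used to load a first file into an array and then filter a second file with it. The sketch below assumes two made-up files, wanted.txt (one key per line) and data.txt.

```shell
printf '%s\n' 'b' 'd' > wanted.txt                        # hypothetical list of keys
printf '%s\n' 'a 1' 'b 2' 'c 3' 'd 4' > data.txt          # hypothetical data file

# First file: remember each key and skip to the next line.
# Second file: print the lines whose first field is a remembered key.
awk 'NR == FNR { keep[$1]; next } $1 in keep' wanted.txt data.txt
# b 2
# d 4
```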

Examples of scripts to check file integrity :

Supposing you would like to detect a missing field in a bunch of files. Let's say each record should have 5 fields :

cat > fileNameScript.awk << EOF
# set f (for filename) to an empty string
BEGIN { f = "" }
{
        # detect when a new file is scanned
        if (f != FILENAME) { line = 1 }   # reset the line number to 1 if a new file is scanned
        # test if number of fields is different from 5
        if (NF != 5) {
                print "MISSING VALUE IN ", FILENAME, " line #", line, "(", NR, "scanned records )\n\n ", \$0, "\n"
        }
        f = FILENAME   # f stores the scanned file name, to be compared with the next FILENAME value
        line++         # line is incremented
}
EOF

To test this script, we are going to create three files:

cat > test1.txt << EOF
Luka            M               14             Sydney               40
Mathias         M               11             Sydney               30
Jules                           11             Montpellier          31
Heloise         F               4              Montpellier          18
Zoe             F               15             Perpignan            43
Gaspard         M               6              Perpignan            20
EOF
cat > test2.txt << EOF
Luka            M               14             Sydney               40
Mathias         M               11             Sydney               30
Jules           M               11             Montpellier          31
Heloise         F               4              Montpellier          18
Zoe             F                              Perpignan            43
Gaspard         M               6              Perpignan            20
EOF
cat > test3.txt << EOF
Luka            M               14             Sydney               40
Mathias         M               11             Sydney               30
Jules           M               11             Montpellier          31
Heloise         F               4                                   18
Zoe             F               15             Perpignan            
Gaspard         M               6              Perpignan            20
EOF
awk -f fileNameScript.awk test1.txt test2.txt test3.txt

The records with a missing field are line 3 of test1.txt (Jules), line 5 of test2.txt (Zoe) and lines 4 and 5 of test3.txt (Heloise and Zoe): each of them has only 4 fields.

This way we checked if any of the files has one or more missing field(s).

TODO

exit and next:

- The next statement forces awk to immediately stop processing the current record and go on to the next record.

- Explain how to exit the command, otherwise awk processes every line: for i in {1..30}; do awk 'NR<3{FS=",";print}NR>4{exit}' airline.csv ; done;
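The difference between next and exit can be sketched on a few invented input lines: next skips the remaining rules for the current line only, whereas exit stops reading the input altogether.

```shell
printf '%s\n' '# comment' 'one' 'two' 'three' 'four' |
    awk '/^#/  { next }    # comment line: the rules below are not tried
         /thr/ { exit }    # stop processing: "three" and "four" are never printed
               { print }'
# one
# two
```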

TODO

Since 1 always evaluates to true, awk performs the default action {print $0} and therefore prints the current line stored in $0.

So awk '(condition){action}1' file is equivalent to, and shorthand for :

awk '(condition){action} {print $0}' file

Again, $0 is the default argument of print, so you could also write :

awk 'condition{action} {print}' file
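A minimal sketch of this idiom on an invented input line: the first rule modifies the line and the trailing 1 prints it.

```shell
echo 'rain, rain' | awk '{ sub(/rain/, "sun") } 1'
# sun, rain
```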

TODO

ARGC ARGV The command-line arguments available to awk programs are stored in an array called ARGV. ARGC is the number of command-line arguments present. See section Other Command Line Arguments. Unlike most awk arrays, ARGV is indexed from zero to ARGC - 1

TODO To select columns, change header titles and remove space and tab characters from a CSV file :

awk 'BEGIN {FS=",";}(NR==1){$1="date";$2="MaxTemp";$3="meanTemp";$4="minTemp";$8="maxHumidity";$9="meanHumidity";$10="minHumidity"}{print $1,",",$2,",",$3,",",$4,",",$8,",",$9,",",$10}' meteoBoston.csv | sed 's/[[:blank:]]//g' > meteo.csv