The word "AWK" comes from the initials of the language's three developers: A. Aho, P. Weinberger and B. W. Kernighan.
Awk is commonly used for processing column-oriented text data, such as tables (many UNIX utilities generate rows and columns of information).
Like the sed tool, awk is line-oriented: it reads its input one line at a time and processes each line the same way.
The AWK Command syntax is as follows :
awk pattern { action }
In the following example :
awk '$1<30 {print $1}' awk.txt
where :
- $1<30 stands for the testing pattern
- {print $1} stands for the action
- awk.txt stands for the input stream

The action is performed whenever a line matches the specified pattern: the pattern specifies a test (condition) that is evaluated for each line read from the awk.txt file. If the condition is true (is $1<30?), the action (print) is performed on the current line; otherwise the next line is read.
The pattern is optional: if it is omitted, every line matches and the action is performed for each line of the input stream.
Example : awk '/Montpe.*/ {print $1}' awk.txt
where /Montpe.*/ is the testing pattern (a regular expression). For each line of awk.txt that matches the pattern, the print action is performed.
Two other important patterns are specified by the keywords "BEGIN" and "END". BEGIN and END patterns require an action :
BEGIN AND END BLOCKS STRUCTURES :
BEGIN { action performed at the beginning }
{ action performed for each line }
END { action performed at the end }
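The three-part structure above can be sketched on a tiny made-up input. The three numbers and the variable names summary and total are assumptions chosen only for illustration; the point is simply when each block runs.

```shell
# Hypothetical three-line input; the numbers are made up for illustration.
summary=$(printf '10\n20\n30\n' | awk '
BEGIN { print "start"; total = 0 }            # runs once, before any input is read
      { total += $1 }                         # runs for every input line
END   { print "lines:", NR, "sum:", total }   # runs once, after the last line
')
echo "$summary"
```

Here the BEGIN block prints before any line is read, the middle block accumulates the first field of every line, and the END block reports the totals.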
Let's create a simple file to test awk command and display results in a terminal window :
cat > awk.txt << EOF
Name Gender Age City Weight(Kg)
-------------------------------------------------------------------
Luka M 14 Sydney 40
Mathias M 11 Sydney 30
Jules M 11 Montpellier 31
Eloise F 5 Montpellier 18
Thibaud M 3 Barcelone 15
Nina F 11 Barcelone 35
Zoe F 15 Perpignan 43
Gaspard M 6 Perpignan 20
EOF
This file is saved as awk.txt. To display this file, we can use the cat command :
cat awk.txt
As AWK is used to process column-oriented text data, $x ("$1" or "$2" or "$3", etc.) has a meaning similar to a positional parameter in a shell script. Instead of standing for the xth argument, it stands for the xth field of the input line. You can think of a field as a column, and the action you specify operates on each line (row).
For example :
- $1 stands for the value of the first column
- $2 stands for the value of the second column
- $3 stands for the value of the third column
- etc.
$0 stands for the whole line
To print the first and second column of a file, you might use the following awk script :
awk '{print $1,$2}' awk.txt
Name Gender
------------------------------------------------------------------------------
Luka M
Mathias M
Jules M
Eloise F
Thibaud M
Nina F
Zoe F
Gaspard M
You might want to change Output Field Separator to put a comma between fields:
awk '{ OFS="," ; print $1,$2 }' awk.txt
Name,Gender
------------------------------------------------------------------------------,
Luka,M
Mathias,M
Jules,M
Eloise,F
Thibaud,M
Nina,F
Zoe,F
Gaspard,M
The pattern is omitted, so the action runs for every line read from awk.txt: it sets the Output Field Separator (OFS) to a comma and prints the first and second columns.
Variable | Description |
---|---|
ARGC | Number of arguments in the command line |
ARGV | Arguments table on the command line |
CONVFMT | Conversion format used when numbers are converted to strings |
ENVIRON | Associative array of environment variables |
FILENAME | Name of the current file (and its path, if specified) |
FNR | Number of the record in the current file useful when many files are processed in the same command |
OFMT | Number Output Format |
RLENGTH | Length of string found by match function () |
RSTART | First position of the string found by the match () function |
SUBSEP | Separator character used internally by the array routines |
You might want to change Output Record Separator (line separator) to put one blank line between records (input lines) :
awk '{ ORS="\n\n" ; print $0 }' awk.txt
Name Gender Age City Weight(Kg)
------------------------------------------------------------------------------
Luka M 14 Sydney 40
Mathias M 11 Sydney 30
Jules M 11 Montpellier 31
Eloise F 5 Montpellier 18
Thibaud M 3 Barcelone 15
Nina F 11 Barcelone 35
Zoe F 15 Perpignan 43
Gaspard M 6 Perpignan 20
Let's print the record number (NR), the number of fields (NF) and the whole line ($0) :
awk '{ print NR, NF, $0 }' awk.txt
1 5 Name Gender Age City Weight(Kg)
2 1 ------------------------------------------------------------------------------
3 5 Luka M 14 Sydney 40
4 5 Mathias M 11 Sydney 30
5 5 Jules M 11 Montpellier 31
6 5 Eloise F 5 Montpellier 18
7 5 Thibaud M 3 Barcelone 15
8 5 Nina F 11 Barcelone 35
9 5 Zoe F 15 Perpignan 43
10 5 Gaspard M 6 Perpignan 20
The number of fields is 5, except for the 2nd line where there is only one column ("-------").
The -F option is used to modify the Input Field Separator.
By default awk splits input lines into fields based on whitespace (spaces and tabs). The -F option replaces the default separator with the specified character, : or ; for example.
cat > separator.txt << EOF
Apple:Banana:Ananas
Tomato:Carrot:Zukini
Oignon:Garlic:Leek
EOF
To print the first two columns of the separator.txt file, you might do :
awk -F: '{print $1,$2}' separator.txt
Apple Banana
Tomato Carrot
Oignon Garlic
or, equivalently, by setting FS in a BEGIN block :
awk 'BEGIN { FS=":" ; } {print $1,$2 }' separator.txt
Apple Banana
Tomato Carrot
Oignon Garlic
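When a file mixes the two separators mentioned above, the separator can also be given as a regular expression. The following is a minimal sketch on made-up input lines (not the separator.txt file); the bracket expression [:;] and the variable name mixed are illustrative assumptions.

```shell
# Hypothetical input mixing ":" and ";" as separators.
# A separator longer than one character is treated as a regular expression,
# so [:;] splits on either character.
mixed=$(printf 'Apple:Banana;Ananas\nTomato;Carrot:Leek\n' | awk -F'[:;]' '{ print $2 }')
echo "$mixed"
```

Each line is split on whichever of the two characters appears, so the second field is the middle word on both lines.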
Using regular expression allows you to apply filters.
Let's select names (column 1) from lines containing the word "Montpellier".
awk '/Montpellier/ {print $1}' awk.txt
Jules
Eloise
Operator | Meaning |
---|---|
~ | Matches |
!~ | Doesn't match |
You could specify the column number and search for a match with a regular expression thanks to the symbol : ~
Let's suppose we would like to select the line for people whose weight is 40Kg (column 5).
awk '$5 ~ /40/ {print $1}' awk.txt
Luka
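Note that a regular expression such as /40/ matches any field that contains "40". The sketch below, on a small made-up list (the name Ada and the value 140 are hypothetical), compares the ~ operator with an exact numeric comparison and with !~.

```shell
# Hypothetical data: name and weight.
data='Luka 40
Ada 140
Zoe 43'

regex_hits=$(printf '%s\n' "$data" | awk '$2 ~ /40/  { print $1 }')   # contains "40"
exact_hits=$(printf '%s\n' "$data" | awk '$2 == 40   { print $1 }')   # equals 40
non_hits=$(printf '%s\n' "$data"   | awk '$2 !~ /40/ { print $1 }')   # does not contain "40"

echo "regex: $regex_hits"
echo "exact: $exact_hits"
echo "no match: $non_hits"
```

If an exact value is wanted, the == comparison (or an anchored expression such as /^40$/) is usually the safer choice.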
To select names from the line starting with Jules (^Jules) to the line starting with Zoe (^Zoe), just add a comma between the two patterns :
awk '/^Jules/,/^Zoe/ {print $1}' awk.txt
Jules
Eloise
Thibaud
Nina
Zoe
Printing the lines between two patterns with AWK is similar to the sed command: we can specify the starting pattern and the ending pattern as follows:
awk '/StartPattern/,/EndPattern/' FileName
We can also store commands in a script file in order to simplify the code or to reuse it. Let's say that we want to print the lines of our previous file from the line starting with Jules to the line starting with Zoe. In this case, the starting pattern is ^Jules and the end pattern is ^Zoe.
We can create this script using the cat command, or write it in a text editor and save it as awkScript.awk.
Note that when creating the script file from a cat here-document, you have to escape special characters with backslashes (e.g. $ is replaced by \$) to prevent the shell from interpreting the $ symbol. (It is not necessary to escape $ if you create the script file in a text editor and save it as awkScript.awk.)
cat > awkScript.awk << EOF
/^Jules/,/^Zoe/ {print \$0}
EOF
Let's check if our script was successfully created :
cat awkScript.awk
/^Jules/,/^Zoe/ {print $0}
Let's execute this script with -f option
awk -f awkScript.awk awk.txt
Jules M 11 Montpellier 31
Eloise F 5 Montpellier 18
Thibaud M 3 Barcelone 15
Nina F 11 Barcelone 35
Zoe F 15 Perpignan 43
Awk is a weakly typed language; variables can be either strings or numbers. The conversion rules are simple. The string "32" is automatically converted into the number 32 when used in an arithmetic expression. A string that starts with a number, such as "123X", is converted to its numeric prefix (123), and a string with no numeric prefix, such as "Biology" or "----", is converted to the number 0.
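These conversions can be checked directly in a BEGIN block. This is only a small sketch; the variable name converted and the word "apples" are illustrative.

```shell
converted=$(awk 'BEGIN {
    print "32" + 1         # the string "32" is used as the number 32
    print "Biology" + 0    # a non-numeric string counts as 0
    print 32 " apples"     # a number is turned into a string when concatenated
}')
echo "$converted"
```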
You might write $5*2.20462 to convert the weights in column 5 to pounds :
awk 'NR >2 {print $1,$5,"Kgs",($5*2.20462),"Pounds"}' awk.txt
Luka 40 Kgs 88.1848 Pounds
Mathias 30 Kgs 66.1386 Pounds
Jules 31 Kgs 68.3432 Pounds
Eloise 18 Kgs 39.6832 Pounds
Thibaud 15 Kgs 33.0693 Pounds
Nina 35 Kgs 77.1617 Pounds
Zoe 43 Kgs 94.7987 Pounds
Gaspard 20 Kgs 44.0924 Pounds
To align the output, use the printf command :
awk 'NR >2 {printf "%-10s %5d %-6s %5.2f %-6s\n",$1,$5,"Kgs",($5*2.20462),"Pounds"}' awk.txt
Luka 40 Kgs 88.18 Pounds
Mathias 30 Kgs 66.14 Pounds
Jules 31 Kgs 68.34 Pounds
Eloise 18 Kgs 39.68 Pounds
Thibaud 15 Kgs 33.07 Pounds
Nina 35 Kgs 77.16 Pounds
Zoe 43 Kgs 94.80 Pounds
Gaspard 20 Kgs 44.09 Pounds
Let's create another test file :
cat > awk2.txt << EOF
Name Math Literacy History Biology
--------- -------------- -------------- ---------- --------------
Camille 97 85 89 90
Caroline 80 92 50 85
Leo 85 97 90 89
EOF
As an example, the following code prints the marks average for each name :
awk 'NR>2{total=0; for (col=2; col<=NF; col++) total+=$col; print $1, total/(NF-1);}' awk2.txt
Camille 90.25
Caroline 76.75
Leo 90.25
For more, here are some awk arithmetic functions :
Function | Action |
---|---|
sqrt(expr) | Returns the square root of expr |
sin(expr) | Returns the sine of expr, which is expressed in radians |
cos(expr) | Returns the cosine of expr, which is expressed in radians |
exp(expr) | Returns the exponential value of expr |
int(expr) | Truncates expr to an integer value |
rand() | Returns a random number N, between 0 and 1 |
Examples :
awk 'BEGIN {
print "Random1 =" , rand()
print "Random2 =" , rand()
}'
Random1 = 0.382933
Random2 = 0.948479
awk 'BEGIN {
print "Int num1 =" , int(10.745)
}'
Int num1 = 10
AWK offers pre-processing (BEGIN) and post-processing (END) sections when parsing a file.
When a program starts, awk can execute instructions before the main body of the program. These instructions must be placed in a block called BEGIN.
The BEGIN keyword must be followed by its opening brace on the same line (unless a backslash is placed just before the line break) :
BEGIN { etc }
or
BEGIN \
{
}
BEGIN blocks are very useful for initializing variables and thus preparing the rest of the program.
Unlike BEGIN blocks, END blocks are executed at the end of the program : once all records have been processed by the heart of the program. It has the same properties as a BEGIN block:
Let's create a file script called blockScript.awk :
cat blockScript.awk
BEGIN {
print "Start 1";
}
BEGIN {
print "Start 1";
}
{
print $1;
}
END {
print "End 1";
}
END {
print "End 2"
}
awk -f blockScript.awk awk.txt
Start 1
Start 1
Name
------------------------------------------------------------------------------
Luka
Mathias
Jules
Eloise
Thibaud
Nina
Zoe
Gaspard
End 1
End 2
Awk offers the usual programming structures: conditions and loops.
Let's take for example the marks for history (column 4): if a mark is greater than 60, the course is passed and 'PASS' is returned; otherwise the course is failed and 'FAIL' is returned.
cat awk2.txt
cat > awkScript.txt << EOF
BEGIN {
OFS=","
}
NR <= 2 { next }
{
if ( \$4 > 60 ) {
course="PASS"
} else {
course="FAIL"
}
print \$1, course
}
EOF
This script is then executed (the first two lines, the header and the separator, are skipped) :
awk -f awkScript.txt awk2.txt
Camille,PASS
Caroline,FAIL
Leo,PASS
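Loops and conditions can be combined in the same action. As a sketch, the following finds the highest mark of each student with a for loop and an if; the two input lines are copied from awk2.txt without the header, and the names best, max and col are illustrative.

```shell
best=$(printf 'Camille 97 85 89 90\nCaroline 80 92 50 85\n' | awk '{
    max = $2 + 0                          # start with the first mark
    for (col = 3; col <= NF; col++)       # visit the remaining marks
        if ($col + 0 > max) max = $col + 0
    print $1, max
}')
echo "$best"
```

Adding 0 to a field forces a numeric comparison, which avoids surprises if a field were ever treated as a string.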
AWK has associative arrays : you can use either a string or a number as an array index.
You do not need to declare the size of an array.
arrayname[string]=value
To loop over an array :
for (var in arrayname) {list of actions to be performed}
Array Script Examples :
awk 'BEGIN { fruits["mango"] = "yellow"; fruits["orange"] = "orange"; fruits["tomato"] = "red"; for (var in fruits) {print var,fruits[var]} }'
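Two other array operations are often useful: testing whether a key exists with the in operator, and removing an element with delete. The sketch below reuses the fruits array from the example; the variable name report is an assumption. Keep in mind that the order in which for (var in arrayname) visits the keys is not specified.

```shell
report=$(awk 'BEGIN {
    fruits["mango"]  = "yellow"
    fruits["tomato"] = "red"
    delete fruits["tomato"]                  # remove one element
    if ("mango" in fruits)     print "mango is", fruits["mango"]
    if (!("tomato" in fruits)) print "tomato was deleted"
}')
echo "$report"
```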
Several shell variables can be passed to awk by repeating the -v option :
x=10
y=30
text="Total is : "
awk -v a=$x -v b=$y -v c="$text" 'BEGIN {ans=a+b; print c " " ans}'
Total is : 40
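Values assigned with -v go through escape-sequence processing (a \t in the value becomes a tab, for instance). When a value must be passed untouched, one possible alternative is the ENVIRON array. This is a sketch only: the environment variable name WIN_PATH and the path are hypothetical.

```shell
path='C:\temp\new'
# The variable is exported only for this awk invocation.
via_env=$(WIN_PATH="$path" awk 'BEGIN { print ENVIRON["WIN_PATH"] }')
printf '%s\n' "$via_env"
```

The value read from ENVIRON keeps its backslashes exactly as they were in the shell.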
The toupper function converts a string to uppercase :
awk '/Mathias/ { print $1, toupper($1) }' awk.txt
Mathias MATHIAS
The printf function works essentially like C printf. This can be used when you want to format output or combine fields onto one line in more complex ways.
%[flag][min width][precision][length modifier][conversion specifier]
The flag controls characters that are added to the output: the plus sign forces a sign to be printed, the flag 0 pads numbers with leading zeros, and the minus sign left-justifies the output (it is right-justified by default).
Min width controls the minimum number of characters to print.
Precision controls the maximum number of characters printed for a string, or the number of digits after the decimal point for a floating-point number.
The length modifier does not modify the length of the output; in C it specifies the size of the argument, helping printf deal with unusually big (or unusually small) variables.
The conversion specifier is the part of the format specifier that determines the basic formatting of the value that is to be printed.
There are many format specifiers defined in C. Take a look at the following list :
Specifier | Description |
---|---|
%i or %d | Decimal integer |
%c | Character |
%f | Decimal floating point |
%s | String of characters |
%e | Scientific notation with e (ex: 1.86e6) |
%E | Like e, but with a capital E (1.86E6) |
%g | Uses the shorter of %e or %f |
%G | Like g, except it uses the shorter of %E or %f |
%x | Number in hexadecimal (base 16) |
%% | Prints a percent sign |
printf does not add a newline automatically: \n (usually called an escape sequence) is used in printf statements and represents a newline character.
At least eight characters :
printf "%8d\n" 300
300
With a plus sign, at least eight characters :
printf "%+8d\n" 300
+300
Left-justified, plus sign, at least eight characters :
printf "%-+8d\n" 300
+300
Scientific notation with e :
printf "%e\n" 300
3.000000e+02
Zero-filled, at least eight characters :
printf "%08d\n" 300
00000300
One position after the decimal :
printf "%.1f\n" 10.3456
10.3
Two positions after the decimal :
printf "%.2f\n" 10.3456
10.35
Zero-filled, at least Eight characters, three positions after the decimal :
printf "%08.3f\n" 10.3456
0010.346
Left-justified, at least eight characters, two positions after the decimal :
printf "%-8.2f" 10.3456;
10.35
A simple string :
printf "%s" "abc"
abc
Minimum length (5 char) :
printf "%5s" "abc"
abc
Minimum length (5 char), left-justified :
printf "%-5s" "abc"
abc
The following character sequences have a special meaning when used in printf format strings:
Sequence | Meaning |
---|---|
\b | backspace |
\n | newline, or linefeed |
\r | carriage return |
\t | tab |
\\ | backslash |
As you can see from that last example, because the backslash character itself is treated specially, you have to print two backslash characters in a row to get one backslash character to appear in your output.
Inserting a tab character and a newline character in a string :
printf "Hello\tworld\nHere comes the sun"
Hello world
Here comes the sun
A Windows path with backslash characters :
printf "C:\\Windows\\System32\\"
C:\Windows\System32\
Let's suppress the first column from awk.txt (the remaining fields are printed without any separator) :
awk '{ for (i=2; i<=NF; i++) printf "%s", $i ; printf "\n";}' awk.txt
GenderAgeCityWeight(Kg)
M14Sydney40
M11Sydney30
M11Montpellier31
F5Montpellier18
M3Barcelone15
F11Barcelone35
F15Perpignan43
M6Perpignan20
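To keep the remaining columns readable, a separator can be printed between fields and a newline after the last one. The following is a sketch on two lines copied from awk.txt; the variable name rest is illustrative.

```shell
rest=$(printf 'Luka M 14 Sydney 40\nZoe F 15 Perpignan 43\n' | awk '{
    for (i = 2; i <= NF; i++)
        # a space after every field except the last, which gets the newline
        printf "%s%s", $i, (i < NF ? " " : "\n")
}')
echo "$rest"
```

With this variant, a line that has a single field prints nothing at all, because the loop body never runs.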
length(string) returns the number of characters in string. If string is a number, the length of the digit string representing that number is returned.
awk '{print $1, length($1);}' awk.txt
Name 4
------------------------------------------------------------------------------ 78
Luka 4
Mathias 7
Jules 5
Eloise 6
Thibaud 7
Nina 4
Zoe 3
Gaspard 7
match(str, regexp) returns the position in str where the regular expression regexp first matches, or 0 if there is no match. It also sets the RSTART and RLENGTH variables.
awk 'NR >2 { print $1, match($1,/L.*/)}' awk2.txt
Camille 0
Caroline 0
Leo 1
match($1,/L.*/) returns the position in $1 where the regular expression matches (for example 1 for $1="Leo", whose first character is "L"), and 0 when there is no match.
awk 'NR >2 { print $0, match($0,/9$/)}' awk2.txt
Camille 97 85 89 90 0
Caroline 80 92 50 85 0
Leo 85 97 90 89 73
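match($0,/9$/) returns the position of a "9" at the very end of the line: 73 for the last line, whose final field is "89", and 0 for the other lines. Applied to that field alone, match($5,/9$/) would return 2.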
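RSTART and RLENGTH are typically combined with substr to extract the text that matched. The sketch below uses one line copied from awk.txt; the pattern /Mont[a-z]*/ and the variable name found are illustrative choices.

```shell
found=$(echo 'Jules M 11 Montpellier 31' | awk '{
    if (match($0, /Mont[a-z]*/))
        # RSTART = where the match begins, RLENGTH = how long it is
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
}')
echo "$found"
```

The match starts at character 12 of the line and is 11 characters long, which is exactly the word Montpellier.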
sub(regexp, replacement [, target]) searches target (the whole line $0 by default) for the first, leftmost-longest substring that matches the regular expression regexp, and replaces it with replacement. The modified string becomes the new value of target. It returns the number of substitutions made (zero or one).
awk -v stri="rain, rain, everywhere" 'BEGIN {sub(/ai/, "u", stri); print stri}'
run, rain, everywhere
echo "rain, rain, everywhere" | awk '{sub(/ai/, "u"); print $0}'
run, rain, everywhere
gsub(regexp, replacement [, target]) searches target for all of the matching substrings it can find and replaces them with replacement. The ‘g’ in gsub() stands for “global,” which means replace everywhere.
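As a short sketch of the global replacement, applied to the same sentence as the sub examples (the variable names all and n are illustrative):

```shell
# gsub returns the number of replacements it made
all=$(echo 'rain, rain, everywhere' | awk '{ n = gsub(/ai/, "u"); print n, $0 }')
echo "$all"
```

Both occurrences are replaced, and the returned count is 2.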
substr(string, start [, length ])
Return a length-character-long substring of string, starting at character number start. The first character of a string is the character number one.
awk '{ print $1, substr($1,2,3) }' awk.txt
Name ame
------------------------------------------------------------------------------ ---
Luka uka
Mathias ath
Jules ule
Eloise loi
Thibaud hib
Nina ina
Zoe oe
Gaspard asp
substr($1,2,3) returns 3 characters from the name column, starting from the 2nd character.
Let's use awk.txt in the following example:
cat awk.txt
Name Gender Age City Weight(Kg)
------------------------------------------------------------------------------
Luka M 14 Sydney 40
Mathias M 11 Sydney 30
Jules M 11 Montpellier 31
Eloise F 5 Montpellier 18
Thibaud M 3 Barcelone 15
Nina F 11 Barcelone 35
Zoe F 15 Perpignan 43
Gaspard M 6 Perpignan 20
The ability to create user functions is one of the most important features of the awk utility. Functions are defined with the keyword function. In the following script, we define a function gentag which takes the first three letters of its first parameter, nom, converts them to lowercase, and returns these three letters followed by an underscore and the age parameter:
cat > awkScript << EOF
function gentag(nom,age) {
tmp=tolower(substr(nom,1,3))
return tmp "_" age
}
BEGIN {
FS=" "
OFS=";"
}
{
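print \$1, \$3, gentag(\$1,\$3)
}
END {
print NR , "lines"
}
EOF
awk -f awkScript awk.txt
Each input line is printed as name;age;tag (for example Luka;14;luk_14), and the END block prints the number of records read. The $ signs are escaped in the here-document so that the shell does not expand them, which would otherwise cause a syntax error in awk.
We just used the gentag function to format the output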
Create a test file date.txt :
cat > date.txt << EOF
Name Gender Date
-------------------------------------------
Thomas M 2017-09-05
Simon M 2011-10-28
Elliot M 2015-09-03
Jeanne F 2030-06-02
EOF
We want to compare a given date with the dates in the third column of our test file date.txt. To do so, we need to assign a value to a script variable (called date in the commands below), which is done with the -v option :
-v var=value assigns value to the program variable var
awk -v var="hello" 'BEGIN{print var;}'
hello
Suppose we just want to get the records where date > today.
The date command returns today's date: normally, date is a shell command that prints the current date and time of day in a well-known format.
However, if you provide an argument that begins with a + sign, date copies non-format-specifier characters to the standard output and interprets the current time according to the format specifiers in the string.
awk -v date="$(date +%Y-%m-%d)" '$3>date{print $0;}' date.txt
Name Gender Date
Jeanne F 2030-06-02
or, since printing the line is the default action :
awk -v date="$(date +%Y-%m-%d)" '$3>date' date.txt
Name Gender Date
Jeanne F 2030-06-02
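Dates written as YYYY-MM-DD are compared as strings, and for this format string order is the same as chronological order. The header line is printed in the outputs above because the word "Date" also compares greater than a string starting with a digit. A possible way to skip it is shown in this sketch, which uses a fixed, hypothetical reference date (limit) instead of today's date so that the result does not change over time.

```shell
upcoming=$(printf 'Name Gender Date\nThomas M 2017-09-05\nJeanne F 2030-06-02\n' |
    awk -v limit='2020-01-01' 'NR > 1 && $3 > limit { print $1 }')   # NR > 1 skips the header
echo "$upcoming"
```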
We may want to count how many times each value of a variable occurs.
Consider a weather variable whose value can be: sunny, rainy, cloudy, stormy
cat > weather.txt << EOF
DAY WEATHER
1 sunny
2 sunny
3 rainy
4 sunny
5 stormy
6 rainy
7 rainy
EOF
We are going to create an array with the weather variable as the key and add 1 to the corresponding value each time it is read from the file:
awk '{a[$2]++}END{for(x in a)print x,a[x]}' weather.txt | sort -k2,2
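As a self-contained sketch of the same counting idea, the data can be fed through a pipe. Two details differ from the command above and are assumptions of this sketch: the header line is skipped with NR > 1, and the result is sorted alphabetically with a plain sort so that the order is predictable.

```shell
counts=$(printf 'DAY WEATHER\n1 sunny\n2 sunny\n3 rainy\n4 sunny\n5 stormy\n6 rainy\n7 rainy\n' |
    awk 'NR > 1 { a[$2]++ } END { for (x in a) print x, a[x] }' | sort)
echo "$counts"
```

Each distinct value of the second column becomes a key of the array a, and the value stored under that key is the number of times it was seen.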
cat > f1 << EOF
a
b
c
d
EOF
cat > f2 << EOF
e
f
g
h
EOF
awk '{printf("file->[%s] NR->[%d] FNR->[%d] str->[%s]\n", FILENAME, NR, FNR, $0)}' f1 f2
file->[f1] NR->[1] FNR->[1] str->[a]
file->[f1] NR->[2] FNR->[2] str->[b]
file->[f1] NR->[3] FNR->[3] str->[c]
file->[f1] NR->[4] FNR->[4] str->[d]
file->[f2] NR->[5] FNR->[1] str->[e]
file->[f2] NR->[6] FNR->[2] str->[f]
file->[f2] NR->[7] FNR->[3] str->[g]
file->[f2] NR->[8] FNR->[4] str->[h]
FNR is the line number of the current file, NR is the number of lines that have been processed. If you only give one file to awk, FNR will always equal NR. If you give more than one file, FNR will go back to 1 when the next file is reached but NR will continue incrementing. Therefore, NR == FNR only while the first file is being processed.
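The NR == FNR test is commonly used to treat the first file as a lookup table. Below is a sketch that prints the lines of a second file that also appear in a first one; the file names keep and all, the array name seen and the temporary directory are hypothetical.

```shell
work=$(mktemp -d)                      # scratch directory for the two files
printf 'b\nd\n'       > "$work/keep"
printf 'a\nb\nc\nd\n' > "$work/all"

# First file: remember every line, then skip to the next record.
# Second file: print the line only if it was remembered.
common=$(awk 'NR == FNR { seen[$0] = 1; next } $0 in seen' "$work/keep" "$work/all")
echo "$common"
rm -r "$work"
```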
Suppose you would like to detect missing fields in a set of files, where each record should have 5 fields :
cat > fileNameScript.awk << EOF
# set f (for filename) to an empty string
BEGIN { f = "" }
{
# detect when a new file is scanned and reset the line number to 1
if (f != FILENAME) { line = 1 }
# test if the number of fields is different from 5
if (NF != 5) {
print "MISSING VALUE IN ", FILENAME, " line #", line, "(", NR, "scanned records )\n\n ", \$0, "\n"
}
f = FILENAME # f stores the scanned file name, to be compared with the next FILENAME value
line++ # line is incremented
}
EOF
To test this script, we are going to create three files:
cat > test1.txt << EOF
Luka M 14 Sydney 40
Mathias M 11 Sydney 30
Jules 11 Montpellier 31
Heloise F 4 Montpellier 18
Zoe F 15 Perpignan 43
Gaspard M 6 Perpignan 20
EOF
cat > test2.txt << EOF
Luka M 14 Sydney 40
Mathias M 11 Sydney 30
Jules M 11 Montpellier 31
Heloise F 4 Montpellier 18
Zoe F Perpignan 43
Gaspard M 6 Perpignan 20
EOF
cat > test3.txt << EOF
Luka M 14 Sydney 40
Mathias M 11 Sydney 30
Jules M 11 Montpellier 31
Heloise F 4 18
Zoe F 15 Perpignan
Gaspard M 6 Perpignan 20
EOF
awk -f fileNameScript.awk test1.txt test2.txt test3.txt
This way we checked if any of the files has one or more missing field(s).
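The same check can be written more compactly with the built-in FNR variable, which already holds the line number within the current file. This is a sketch on two small hypothetical files (t1.txt and t2.txt) whose lines are taken from the test files above; the message format is an illustrative choice.

```shell
work=$(mktemp -d)                      # scratch directory for the sample files
printf 'Luka M 14 Sydney 40\nJules 11 Montpellier 31\n'    > "$work/t1.txt"
printf 'Zoe F Perpignan 43\nGaspard M 6 Perpignan 20\n'    > "$work/t2.txt"

# Report file name, line number within that file, and the field count found.
problems=$(cd "$work" && awk 'NF != 5 { print FILENAME ":" FNR ": " NF " fields" }' t1.txt t2.txt)
echo "$problems"
rm -r "$work"
```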
TODO
exit and next :
- The exit statement causes awk to immediately stop executing the current rule and to stop processing input; any remaining input is ignored.
- The next statement forces awk to immediately stop processing the current record and go on to the next record.
Explain how to leave the command early with exit, since otherwise awk processes every line :
for i in {1..30}; do awk 'NR<3{FS=",";print}NR>4{exit}' airline.csv ; done;
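A minimal sketch of both statements on made-up input: next discards a record, and exit stops reading as soon as enough lines have been printed. The input lines and the counter n are hypothetical.

```shell
first_two=$(printf 'skip me\none\ntwo\nthree\nfour\n' | awk '
/^skip/ { next }        # ignore this record and read the next one
        { print; n++ }  # print the record and count it
n == 2  { exit }        # stop reading input once two lines are printed
')
echo "$first_two"
```

The lines three and four are never processed, which is what makes exit useful on large files.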
TODO
Since 1 always evaluates to true and has no action attached, awk performs the default action {print $0}, and hence prints the current line stored in $0.
So, awk '(condition){action}1' file is equivalent to, and shorthand for :
awk '(condition){action} {print $0}' file
Again, $0 is the default argument of print, so you could also write :
awk 'condition{action} {print}' file
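For instance, the following sketch edits the lines that match and then relies on the trailing 1 to print every line, edited or not. The two input words and the variable name edited are made up.

```shell
# The first rule edits matching lines; the pattern 1 prints every line.
edited=$(printf 'rain\nsun\n' | awk '/rain/ { sub(/ai/, "u") } 1')
echo "$edited"
```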
TODO
ARGC, ARGV : the command-line arguments available to awk programs are stored in an array called ARGV. ARGC is the number of command-line arguments present. Unlike most awk arrays, ARGV is indexed from zero to ARGC - 1.
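As a sketch, the arguments can be listed from a BEGIN block. The arguments first and second are hypothetical and are never opened as files, because the program contains only a BEGIN rule. The loop starts at 1 since ARGV[0] holds the name of the awk program itself, which may differ between implementations.

```shell
args=$(awk 'BEGIN { for (i = 1; i < ARGC; i++) print i, ARGV[i] }' first second)
echo "$args"
```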
TODO To select columns, change the header titles and remove space and tab characters from a csv file :
awk 'BEGIN {FS=","; OFS=","} NR==1 {$1="date";$2="MaxTemp";$3="meanTemp";$4="minTemp";$8="maxHumidity";$9="meanHumidity";$10="minHumidity"} {print $1,$2,$3,$4,$8,$9,$10}' meteoBoston.csv | sed 's/[[:blank:]]//g' > meteo.csv