Awk
I have been meaning to write some documentation about awk for an age. The fact is awk rocks, and if you didn't know it then you are missing one of the most versatile tools in the UNIX kit. I have been tinkering with awk for a while and even now I find it fun to produce sharp one-liners. This howto expects you to already know your way around the command line and to be familiar with scripting or programming in some way.
So just what is awk? Hrmm, sort of a tough question to answer. Awk is a programming language and also the name of the tool that interprets that language (think perl). It is intended for pattern matching and processing text input; that is, it is able to find like grep and alter like sed!
Maybe a simple example will go further to explain some of the basic usage. Please note the single quotes around the awk program: they matter, because without them the shell could expand awk variables such as $1 itself and cause issues.
$ echo Hello, World! | awk '{print}'
Hello, World!
$ awk '{print}' file
a b c
b c a
c a b
abc
Not exactly amazing, but this shows how you can output with awk. Now let's take this a step further:
$ echo Hello, World! | awk '{print $1}'
Hello,
$ awk '{print $1}' file
a
b
c
abc
You can see that awk has cut the text at the whitespace in between
the words and then printed the first 'field'. You can continue this with
$2, $3 etc. You can ask awk to cut at a
different point by either setting the special variable FS or using -F. For
instance let's see what happens if we split at 'b':
$ echo Hello, World! | awk -Fb '{print $1}'
Hello, World!
$ awk -Fb '{print $1}' file
a

c a
a
With echo we now get the whole line of text, as there is no 'b' to split at. With the file example, however, the first field now contains very different information: everything before the first 'b', which is nothing at all for the line that starts with 'b'.
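The -F option and the FS variable are two routes to the same setting. As a small sketch (the colon-separated sample line is made up for illustration, and the shell variable names are arbitrary), both commands below split on ':' and print the second field:

```shell
# A made-up colon-separated record, used only for this illustration.
line='alice:x:1001'

# -F sets the field separator from the command line...
with_flag=$(echo "$line" | awk -F: '{ print $2 }')

# ...while assigning FS in a BEGIN block does the same from inside the program.
with_fs=$(echo "$line" | awk 'BEGIN { FS = ":" } { print $2 }')

echo "$with_flag $with_fs"
```

FS is assigned inside BEGIN so that it is already in place before awk reads and splits the first record.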
You can very quickly juggle bits of text around using these fields:
$ echo Hello, World! | awk '{print $2, $1}'
World! Hello,
$ awk '{print $2, $1, $3}' file
b a c
c b a
a c b
abc
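Now that fields such as $1 and $2 have appeared, the earlier warning about quotes can be made concrete. The sketch below keeps the awk program in single quotes so the shell leaves $2 alone, and uses the -v option to hand a shell value to awk; the variable names greeting and g are just picked for this example:

```shell
# A shell variable we would like awk to see (the name is arbitrary).
greeting="Goodbye,"

# Single quotes stop the shell from touching $2; -v hands the shell value
# to awk as an ordinary awk variable called g.
swapped=$(echo 'Hello, World!' | awk -v g="$greeting" '{ print g, $2 }')
echo "$swapped"
```

This should print "Goodbye, World!". Had the program been wrapped in double quotes instead, the shell would have replaced $2 with its own positional parameter before awk ever saw it.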
Awk can quickly stand in for grep by simply looking for certain strings (or, more correctly, regexes):
$ awk '/ab/' file
abc
We can also ask awk to NOT match something (like grep -v):
$ awk '!/ab/' file
a b c
b c a
c a b
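A bare /regex/ is tested against the whole record. As a small step beyond grep, the ~ and !~ operators let you test a single field instead. The sketch below feeds the same four sample lines on standard input through a helper function (sample is a name invented here):

```shell
# Recreate the four-line sample file on standard input.
sample() { printf 'a b c\nb c a\nc a b\nabc\n'; }

# Print only records whose second field is exactly "c".
second_is_c=$(sample | awk '$2 ~ /^c$/')

# Print records whose first field does not start with "a".
first_not_a=$(sample | awk '$1 !~ /^a/')

echo "$second_is_c"
echo "$first_not_a"
```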
We can also tell awk to do things at the start and at the end of the program:
$ awk 'BEGIN{print "S . ."} {print} END{print ". . E"}' file
S . .
a b c
b c a
c a b
abc
. . E
Normally the BEGIN is used to prepare something within the script and the final END to produce a summary. In this case we simply printed a statement.
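To give a feel for the more typical use, here is a minimal sketch in which BEGIN prints a heading, the main block adds up a column, and END prints the summary. The input numbers and the variable name sum are made up for the illustration:

```shell
# Made-up input: one number per line.
total_report=$(printf '3\n4\n5\n' | awk '
    BEGIN { print "values:" }        # runs once, before any input is read
          { sum += $1; print $1 }    # runs for every record
    END   { print "total:", sum }    # runs once, after the last record
')
echo "$total_report"
```

The last line of the output should read "total: 12".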
A really simple example is to emulate wc.
$ wc -l file
4 file
$ awk 'END{print NR}' file
4
The awk version is actually very straightforward. NR is a special
counter that increases on every 'record' (line of a file for instance) that
passes through awk. In the END block it still holds the count for the
final line in the file, so all we ask awk to do is print it. Of course wc can do more things.
$ wc file
4 10 22 file
$ awk 'BEGIN{w=0;c=0}{w=w+NF;c=length($0)+c+1}END{print NR, w, c}' file
4 10 22
OK, this is a slightly more complex example, so let's break it down into the separate parts:
BEGIN{w=0;c=0}
Set 'w' and 'c' to 0. This is just an initialization step; we will be using 'w' to count words and 'c' to count characters.
w=w+NF
NF is another special awk variable; it holds the Number of Fields in the current record (line). In this case awk is splitting at whitespace, therefore NF is actually the number of words in the line.
c=length($0)+c+1
Well, I explained $1 and its friends but missed out
$0. $0 represents the whole 'record' (just
keeping with awk terms, but yes, it's still the line!). Therefore we calculate
the length of the line and add that to the total characters. We need to add
one here because awk does not count the end-of-line character
(but wc does).
END{print NR, w, c}
So now that we have worked through the file and reached the end, we
print NR (the number of lines), then 'w' (the number of words) and finally
'c' (the number of characters). There it is, awk emulating
wc.
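As a side note, awk variables that have never been assigned behave as 0 in arithmetic, so the BEGIN block is optional here. A shorter sketch of the same idea, using += and the same four sample lines fed on standard input:

```shell
# Same four-line sample as in the text, fed on standard input.
counts=$(printf 'a b c\nb c a\nc a b\nabc\n' |
    awk '{ w += NF; c += length($0) + 1 } END { print NR, w, c }')
echo "$counts"
```

This should print "4 10 22", matching wc. It assumes plain ASCII text where every line ends in a newline; wc reports bytes in its third column, while length() may count characters, so the two can differ on multibyte input.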
I think that it might be nice to continue to show how easily awk
can emulate some of the tools of UNIX. For instance if we look at
uniq:
$ cat file
a b c
a b c
b c a
abc
c a b
abc
$ uniq file
a b c
b c a
abc
c a b
abc
$ awk 'BEGIN{t=""}{if(t!=$0){print} t=$0}' file
a b c
b c a
abc
c a b
abc
So, once more we prepare an empty variable 't'. We then enter the main loop and test whether 't' is the same as the current line; if it isn't, we print the line. Then we store the current line in 't' and loop round again. This way we remove consecutive duplicate lines. Actually we can do better than this by using some awk magic dust:
$ awk 't!=$0; {t=$0}' file
a b c
b c a
abc
c a b
abc
So let's look at the two steps we take.
t!=$0;
We are doing the same test as before, but this time we haven't told awk that it should print. However, awk has a default action, and that is to print the record. Therefore when a pattern has no action of its own, awk prints the line whenever the pattern is true. So we end up with the exact same test, neat eh?
{t=$0}
Once more at the end of each loop we set the current record into t.
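The default action is worth a quick sketch of its own, since any expression can stand alone as a pattern. The input below is made up, and input is just a helper name chosen for this example:

```shell
# Made-up input containing an empty line and lines of different lengths.
input() { printf 'one\n\ntwo words\nthree more words\n'; }

# NF is 0 on an empty line, and 0 is false, so blank lines are dropped.
non_blank=$(input | awk 'NF')

# Keep only records with more than one field.
multi_word=$(input | awk 'NF > 1')

echo "$non_blank"
echo "$multi_word"
```

Neither program contains a print statement; the pattern alone decides which records get printed.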
So you can see that awk is growing more powerful as we move on. There is more trickery here, showing how you can remove all duplicate lines from a file, consecutive or not:
$ awk '!($0 in t){t[$0];print}' file
a b c
b c a
abc
c a b
This time awk has done even better for us, removing duplicates across the whole file. Awk tests to see if the current record is IN t; this means that it treats t as an array and checks whether the record already exists as one of its indexes. If not, then we know that we have a line we haven't seen before, so we add it as an index of t and then print it.
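A commonly seen variation folds the test and the bookkeeping into one expression, and the same kind of array can also count how often each line occurs. This is only a sketch: the array names seen and n are arbitrary, and sort is run with LC_ALL=C so that the ordering is predictable:

```shell
# The six-line sample with repeated lines, fed on standard input.
dups() { printf 'a b c\na b c\nb c a\nabc\nc a b\nabc\n'; }

# seen[$0]++ yields the old count (0 the first time), so !seen[$0]++ is true
# only on the first occurrence; the default action then prints the record.
first_only=$(dups | awk '!seen[$0]++')

# The same kind of array doubles as a tally of how often each line occurred.
# sort is used because the order of "for (k in n)" is unspecified.
tally=$(dups | awk '{ n[$0]++ } END { for (k in n) print n[k], k }' | LC_ALL=C sort)

echo "$first_only"
echo "$tally"
```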
Another command we can emulate with awk is tac (reverse cat):
$ awk '{a[i++]=$0} END {for (j=i-1; j>=0;) print a[j--] }' file
abc
c a b
abc
b c a
a b c
a b c
So we loop over every record, 'loading' it into an array (notice how we also have a counter that we increment after each load). Then at the end of the file, we effectively unload the array backwards.
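Since NR already counts records, it can serve as the array index and the separate counter can go. A minimal sketch of that variation, with made-up input and an array named line:

```shell
# Store each record under its record number, then walk back from NR to 1.
reversed=$(printf 'one\ntwo\nthree\n' |
    awk '{ line[NR] = $0 } END { for (i = NR; i >= 1; i--) print line[i] }')
echo "$reversed"
```

Either way the whole input is held in memory until the end, so this suits modest files rather than huge ones.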
Now let's move on to a couple of simple functions in awk. We have already seen length(), but there are a lot of others, for instance sub and gsub (substitution and global substitution). I think I will just leap in with a tougher example.
$ echo " testing weee " | awk '{gsub(/^[ \t]+|[ \t]+$/,"");print}'
testing weee
So let's explain the regex that we use inside gsub:
^[ \t]+|[ \t]+$
That says match spaces or tabs at the beginning of the line OR match spaces or tabs at the end of the line.
So using that with gsub, what we are asking awk to do is globally (across the record) remove tabs and spaces from the start and end of the line.
Let's give you some more examples of sub and gsub, as they are very useful.
$ echo "abc def abc def" | awk '{sub(/abc/,"def");print}'
def def abc def
This replaces only the first instance of abc in the line, not all of them. However:
$ echo "abc def abc def" | awk '{gsub(/abc/,"def");print}'
def def def def
This replaces all instances of abc in the line.
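Both functions also take an optional third argument naming what to work on, and both return the number of replacements they made. A small sketch limiting the change to the third field (the replacement text xyz and the variable n are chosen just for this example):

```shell
# Restrict the substitution to field 3 by passing a third argument.
# gsub returns the number of replacements it made; n captures it.
result=$(echo 'abc def abc def' |
    awk '{ n = gsub(/abc/, "xyz", $3); print n ": " $0 }')
echo "$result"
```

This should print "1: abc def xyz def": the abc in the first field is left alone because only $3 was handed to gsub.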
You can let awk use external programs inside its code, for instance:
$ awk -F':' '{ print $1 | "sort" }' /etc/passwd
adm
bin
daemon
lp
noaccess
nobody
printer
root
sys
uucp
Here awk grabs the first field separated by : and sends the output through sort. Therefore we can actually harness the power of the other UNIX commands from within awk.
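One detail worth knowing when mixing piped and ordinary output: sort only produces its result once its input is closed. The sketch below uses made-up input standing in for /etc/passwd and calls close() in the END block so that the sorted names appear before the summary line:

```shell
# Made-up colon-separated input standing in for /etc/passwd.
users() { printf 'root:x:0\nadm:x:3\nbin:x:2\n'; }

sorted_report=$(users | awk -F: '
    { print $1 | "sort" }      # each name goes down the pipe to sort
    END {
        close("sort")          # wait for sort to finish and flush its output
        print "users:", NR     # printed after the sorted names
    }
')
echo "$sorted_report"
```

The string given to close() has to match the command string used after the | exactly. Without the close() call, the relative order of the two outputs is not something to rely on.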
This really goes to show you that awk is one of the utilities that truly symbolizes the UNIX ideal. It is capable of being quickly integrated to do simple mindless tasks or, with some nudging, of becoming a fully fledged programming language. If you find yourself in a bind, with missing or broken utilities, what better than to use awk to recover!
I have barely scratched the surface of awk here; it has plenty more functions and special variables to learn. If you stick at it though you will discover one of the best utilities on your boxen!
Please see the Gawk User Manual for a far more in depth tutorial of awk and its power.
