AWK

Paradigm(s)	scripting, procedural, event-driven
Appeared in	1977
Designed by	Alfred Aho, Peter Weinberger, and Brian Kernighan
Stable release	IEEE Std 1003.1-2008 (POSIX) / 1985
Typing discipline	none; can handle strings, integers and floating point numbers; regular expressions
Major implementations	awk, GNU Awk, mawk, nawk, MKS AWK, Thompson AWK (compiler), Awka (compiler)
Dialects	old awk oawk 1977, new awk nawk 1985, GNU Awk gawk
Influenced by	C, SNOBOL4, Bourne shell
Influenced	Tcl, AMPL, Perl, Korn Shell (ksh93, dtksh, tksh), Lua
OS	Cross-platform
Website	cm.bell-labs.com/cm/cs/awkbook

The AWK utility is an interpreted programming language typically used as a data extraction and reporting tool. It is a standard feature of most Unix-like operating systems.

AWK was created at Bell Labs in the 1970s,^[1] and its name is derived from the family names of its authors: Alfred Aho, Peter Weinberger, and Brian Kernighan. The name is usually pronounced as a single word rather than as a string of separate letters, so that it sounds the same as the name of the bird, the auk (which serves as an emblem of the language, for example on the cover of the book The AWK Programming Language; the book is often referred to by the abbreviation TAPL). When written in all lowercase letters, as awk, it refers to the Unix or Plan 9 program that runs scripts written in the AWK programming language.

As one of the early tools to appear in Version 7 Unix, it gained popularity as a way to add computational features to a Unix pipeline, and besides the Bourne shell it is the only scripting language available in a standard Unix environment. It is one of the mandatory utilities of the Single UNIX Specification^[2] and is required by the Linux Standard Base specification.^[3] Implementations of AWK exist for almost all other operating systems.^{[citation needed]}

AWK uses a data-driven scripting language consisting of a set of actions to be taken against textual data (either in files or data streams) for the purpose of producing formatted reports. The language extensively uses the string datatype, associative arrays (that is, arrays indexed by key strings), and regular expressions. Although AWK and sed were designed to support one-liner programs, even the early Bell Labs users of AWK often wrote well-structured large AWK programs, and despite its limited intended area of use, AWK is Turing-complete.^[4] The power, terseness, and limits of early AWK programs inspired Larry Wall to write Perl just as a new, more powerful POSIX AWK and gawk (GNU AWK) were being defined.

Structure of AWK programs

"AWK is a language for processing text files. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is of a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed." - Alfred V. Aho^[5]

An AWK program is a series of pattern-action pairs, written as:

condition { action }

where condition is typically an expression and action is a series of commands. The input is split into records, where by default records are separated by newline characters so that the input is split into lines. The program tests each record against each of the conditions in turn, and executes the action for each expression that is true. Either the condition or the action may be omitted. The condition defaults to matching every record. The default action is to print the record.
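
The three forms described above can be sketched from a POSIX shell as follows. The sample input and the shell variable names (input, both, no_action, no_condition) are made up for illustration and are not part of AWK itself.

```shell
# Hypothetical three-line input: a name and a score per line.
input='alice 42
bob 7
carol 19'

# Condition and action: print the name when the score exceeds 10.
both=$(printf '%s\n' "$input" | awk '$2 > 10 { print $1 }')

# Action omitted: the default action prints the whole matching record.
no_action=$(printf '%s\n' "$input" | awk '$2 > 10')

# Condition omitted: the action runs for every record.
no_condition=$(printf '%s\n' "$input" | awk '{ print $2 }')

printf '%s\n---\n%s\n---\n%s\n' "$both" "$no_action" "$no_condition"
```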

In addition to a simple AWK expression, such as foo == 1 or /^foo/, the condition can be BEGIN or END causing the action to be executed before or after all records have been read, or pattern1, pattern2 which matches the range of records starting with a record that matches pattern1 up to and including the record that matches pattern2 before again trying to match against pattern1 on future lines.

In addition to normal arithmetic and logical operators, AWK expressions include the tilde operator, ~, which matches a regular expression against a string. As handy syntactic sugar, /regexp/ without using the tilde operator matches against the current record.
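
As a small sketch of BEGIN, END, the tilde operator, and a bare /regexp/ working together (the log-like input and the variable names are invented for this example):

```shell
# Hypothetical input: three log-like lines.
input='error disk full
info started
error timeout'

# BEGIN runs before any input is read, END after all of it.
# $1 ~ /^err/ tests a single field; a bare /full/ tests the whole record ($0).
report=$(printf '%s\n' "$input" | awk '
    BEGIN       { print "start" }
    $1 ~ /^err/ { errors++ }
    /full/      { print "matched:", $0 }
    END         { print "errors:", errors }
')
printf '%s\n' "$report"
```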

AWK commands

AWK commands are the statements that are substituted for action in the examples above. AWK commands can include function calls, variable assignments, calculations, or any combination thereof. AWK contains built-in support for many functions; many more are provided by the various flavors of AWK. Also, some flavors support the inclusion of dynamically linked libraries, which can also provide more functions.

For brevity, the enclosing curly braces ( { } ) will be omitted from these examples.

The print command

The print command is used to output text. The output text is always terminated with a predefined string called the output record separator (ORS) whose default value is a newline. The simplest form of this command is:

print: This displays the contents of the current record. In AWK, records are broken down into fields, and these can be displayed separately:
print $1: Displays the first field of the current record
print $1, $3: Displays the first and third fields of the current record, separated by a predefined string called the output field separator (OFS) whose default value is a single space character

Although these fields ($X) may bear resemblance to variables (the $ symbol indicates variables in Perl), they actually refer to the fields of the current record. A special case, $0, refers to the entire record. In fact, the commands "print" and "print $0" are identical in functionality.
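
The following sketch shows these forms of print side by side, assuming a single made-up input record. Changing OFS in a BEGIN block is one way to see what the comma between expressions actually inserts.

```shell
# Hypothetical record with three whitespace-separated fields.
line='one two three'

# A comma between expressions inserts OFS (a single space by default).
default_ofs=$(printf '%s\n' "$line" | awk '{ print $1, $3 }')

# Setting OFS in BEGIN changes what the comma inserts.
dash_ofs=$(printf '%s\n' "$line" | awk 'BEGIN { OFS = "-" } { print $1, $3 }')

# print with no arguments behaves like print $0.
whole=$(printf '%s\n' "$line" | awk '{ print }')

printf '%s\n%s\n%s\n' "$default_ofs" "$dash_ofs" "$whole"
```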

The print command can also display the results of calculations and/or function calls:

print 3+2
print foobar(3)
print foobar(variable)
print sin(3-2)

Output may be sent to a file:

print "expression" > "file name"

or through a pipe:

print "expression" | "command"
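
A minimal sketch of both forms of redirection, assuming a scratch file under the temporary directory (the path, the sample data, and the variable name f are illustrative). The file name is passed in with the standard -v option so that it does not have to be spliced into the program text.

```shell
out_file="${TMPDIR:-/tmp}/awk_print_demo_$$.txt"   # hypothetical scratch file

# Second fields of records whose first field exceeds 1 go to the file;
# every first field is piped through the sort command.
sorted=$(printf '3 c\n1 a\n2 b\n' | awk -v f="$out_file" '
    $1 > 1 { print $2 > f }
           { print $1 | "sort" }
')
saved=$(cat "$out_file")
rm -f "$out_file"

printf 'sorted:\n%s\nsaved:\n%s\n' "$sorted" "$saved"
```

Within one run of awk, the > redirection truncates the file only when it is first opened; later print statements to the same name append to it.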

Built-in variables

Awk's built-in variables include the field variables: $1, $2, $3, and so on ($0 represents the entire record). They hold the text or values in the individual text-fields in a record.

Other variables include:

NR: Keeps a current count of the number of input records.
NF: Keeps a count of the number of fields in an input record. The last field in the input record can be designated by $NF.
FILENAME: Contains the name of the current input-file.
FS: Contains the "field separator" character used to divide fields on the input record. The default, "white space", includes any space and tab characters. FS can be reassigned to another character to change the field separator.
RS: Stores the current "record separator" character. Since, by default, an input line is the input record, the default record separator character is a "newline".
OFS: Stores the "output field separator", which separates the fields when Awk prints them. The default is a "space" character.
ORS: Stores the "output record separator", which separates the output records when Awk prints them. The default is a "newline" character.
OFMT: Stores the format for numeric output. The default format is "%.6g".
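
Several of these variables can be seen at work in a short sketch. The colon-separated input is made up, loosely modelled on a passwd-style file.

```shell
# Hypothetical colon-separated input.
input='root:x:0
daemon:x:1:extra'

# FS controls how input is split; OFS controls how print joins its arguments.
summary=$(printf '%s\n' "$input" | awk '
    BEGIN { FS = ":"; OFS = " | " }
    { print NR, NF, $1, $NF }
')

# OFMT controls how print formats non-integer numbers.
ofmt=$(awk 'BEGIN { x = 3.14159265; print x; OFMT = "%.2f"; print x }')

printf '%s\n%s\n' "$summary" "$ofmt"
```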

Variables and syntax

Variable names can use any of the characters [A-Za-z0-9_], but may not begin with a digit or coincide with a language keyword. The operators + - * / represent addition, subtraction, multiplication, and division, respectively. For string concatenation, simply place two variables (or string constants) next to each other. It is optional to use a space in between if string constants are involved, but two variable names placed adjacent to each other require a space in between. Double quotes delimit string constants. Statements need not end with semicolons. Finally, comments begin with # and extend to the end of the line.
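
A brief sketch of these rules, using invented variable names; the parentheses around the arithmetic are there only to make the grouping explicit.

```shell
result=$(awk 'BEGIN {
    first = "foo"            # a comment runs to the end of the line
    second = "bar"
    joined = first second    # adjacent values are concatenated
    print joined "-" (2 * 3)
}')
printf '%s\n' "$result"
```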

User-defined functions

In a format similar to C, function definitions consist of the keyword function, the function name, argument names and the function body. Here is an example of a function.

function add_three (number) {
    return number + 3
}

This function can be invoked as follows:

print add_three(36) # Outputs 39

Functions can have variables that are in the local scope. The names of these are added to the end of the argument list, though values for these should be omitted when calling the function. It is convention to add some whitespace in the argument list before the local variables, to indicate where the parameters end and the local variables begin.
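
The convention can be sketched as follows; the function name sum_to and its variables are hypothetical. The extra parameters total and i are never supplied by the caller, so they start out empty on each call and do not disturb the global variable of the same name.

```shell
result=$(awk '
    # total and i are locals: listed after extra spaces, never passed by callers.
    function sum_to(n,    total, i) {
        for (i = 1; i <= n; i++)
            total += i
        return total
    }
    BEGIN {
        i = 99                  # global i, untouched by the function
        print sum_to(4), i
    }
')
printf '%s\n' "$result"
```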

Sample applications

Hello World

Here is the customary "Hello world" program written in AWK:

BEGIN { print "Hello, world!" }

Note that an explicit exit statement is not needed here; since the only pattern is BEGIN, no command-line arguments are processed.

Print lines longer than 80 characters

Print all lines longer than 80 characters. Note that the default action is to print the current line.

length($0) > 80

Print a count of words

Count words in the input, and print the number of lines, words, and characters (like wc).

{
    w += NF
    c += length + 1
}
END { print NR, w, c }

As there is no pattern for the first line of the program, every line of input matches by default so the increment actions are executed for every line. Note that w += NF is shorthand for w = w + NF.

Sum last word

{ s += $NF }
END { print s + 0 }

s is incremented by the numeric value of $NF, which is the last word on the line as defined by AWK's field separator, by default white-space. NF is the number of fields in the current line, e.g. 4. Since $4 is the value of the fourth field, $NF is the value of the last field in the line regardless of how many fields this line has, or whether it has more or fewer fields than surrounding lines. $ is actually a unary operator with the highest operator precedence. (If the line has no fields then NF is 0, $0 is the whole line, which in this case is empty apart from possible white-space, and so has the numeric value 0.)

At the end of the input the END pattern matches, so s is printed. However, since there may have been no lines of input at all, in which case no value has ever been assigned to s, it will by default be an empty string. Adding zero to a variable is an AWK idiom for coercing it from a string to a numeric value. (Concatenating an empty string, e.g. s "", coerces in the other direction, from a number to a string. Note that there is no operator for concatenating strings; they are just placed adjacently.) With the coercion the program prints 0 on an empty input; without it, an empty line is printed.
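
The effect of the coercion can be checked with a short sketch that feeds the program an empty input (here /dev/null) and then a small made-up input:

```shell
# Empty input: s is never assigned, so it is still uninitialized in END.
with_coercion=$(awk '{ s += $NF } END { print s + 0 }' < /dev/null)
without_coercion=$(awk '{ s += $NF } END { print s }' < /dev/null)

# Ordinary input: the last field of each line is summed.
total=$(printf 'a 1\nb c 2\nd 3.5\n' | awk '{ s += $NF } END { print s + 0 }')

printf '[%s] [%s] [%s]\n' "$with_coercion" "$without_coercion" "$total"
```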

Match a range of input lines

$ yes Wikipedia | awk 'NR % 4 == 1, NR % 4 == 3 { printf "%6d  %s\n", NR, $0 }' | sed 7q
     1  Wikipedia
     2  Wikipedia
     3  Wikipedia
     5  Wikipedia
     6  Wikipedia
     7  Wikipedia
     9  Wikipedia

The yes command repeatedly prints its argument (by default the letter "y") on a line. In this case, we tell the command to print the word "Wikipedia". The action statement prints each line numbered. The printf function emulates the standard C printf, and works similarly to the print command described above. The pattern to match, however, works as follows: NR is the number of records, typically lines of input, AWK has so far read, i.e. the current line number, starting at 1 for the first line of input. % is the modulo operator. NR % 4 == 1 is true for the first, fifth, ninth, etc., lines of input. Likewise, NR % 4 == 3 is true for the third, seventh, eleventh, etc., lines of input. The range pattern is false until the first part matches, on line 1, and then remains true up to and including when the second part matches, on line 3. It then stays false until the first part matches again on line 5. The sed command is used to print the first 7 lines, to prevent yes running forever. It is equivalent to head -n7 if the head command is available.

The first part of a range pattern being constantly true, e.g. 1, can be used to start the range at the beginning of input. Similarly, if the second part is constantly false, e.g. 0, the range continues until the end of input:

 /^--cut here--$/, 0

prints lines of input from the first line matching the regular expression ^--cut here--$, that is, a line containing only the phrase "--cut here--", to the end.
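
A sketch of this open-ended range on a small invented input:

```shell
# Hypothetical input with a marker line in the middle.
input='header
--cut here--
body 1
body 2'

# The range starts at the marker and, because 0 is never true, never ends.
tail_part=$(printf '%s\n' "$input" | awk '/^--cut here--$/, 0')
printf '%s\n' "$tail_part"
```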

Calculate word frequencies

Word frequency uses associative arrays:

BEGIN {
    FS="[^a-zA-Z]+"
}
{
    for (i=1; i<=NF; i++)
        words[tolower($i)]++
}
END {
    for (i in words)
        print i, words[i]
}

The BEGIN block sets the field separator to any sequence of non-alphabetic characters. Note that separators can be regular expressions. After that, we get to a bare action, which performs the action on every input line. In this case, for every field on the line, we add one to the number of times that word, first converted to lowercase, appears. Finally, in the END block, we print the words with their frequencies. The line

for (i in words)

creates a loop that goes through the array words, setting i to each subscript of the array. This is different from most languages, where such a loop goes through each value in the array. The loop thus prints out each word followed by its frequency count. tolower was an addition to the One True awk (see below) made after the book was published.
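
The order in which for (i in words) visits the subscripts is not specified, so a run of the program is usually piped through sort when stable output is wanted. The sketch below does that on a made-up two-line input. It also adds a guard that skips empty fields, which is an assumption of this example rather than part of the program above: a line ending in punctuation otherwise yields an empty trailing field that would be counted as a word.

```shell
freq=$(printf 'The cat and the hat.\nThe end\n' | awk '
    BEGIN { FS = "[^a-zA-Z]+" }
    {
        for (i = 1; i <= NF; i++)
            if ($i != "")               # skip empty fields (added for this sketch)
                words[tolower($i)]++
    }
    END { for (w in words) print w, words[w] }
' | LC_ALL=C sort)
printf '%s\n' "$freq"
```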

Match pattern from command line

This program can be represented in several ways. The first one uses the Bourne shell to make a shell script that does everything. It is the shortest of these methods:

pattern=$1
shift
awk '/'$pattern'/ { print FILENAME ":" $0 }' "$@"

The $pattern in the awk command is not protected by quotes, so the shell expands it before awk sees the program. A pattern by itself checks whether the whole line ($0) matches. FILENAME contains the current filename. awk has no explicit concatenation operator; two adjacent strings are simply concatenated. $0 expands to the original unchanged input line.

There are alternate ways of writing this. This shell script accesses the environment directly from within awk:

pattern=$1
export pattern
shift
awk '$0 ~ ENVIRON["pattern"] { print FILENAME ":" $0 }' "$@"

This is a shell script that uses ENVIRON, an array introduced in a newer version of the One True awk after the book was published. The subscript of ENVIRON is the name of an environment variable; its result is the variable's value. This is like the getenv function in various standard libraries and POSIX. The shell script makes an environment variable pattern containing the first argument, then drops that argument and has awk look for the pattern in each file.

~ checks to see if its left operand matches its right operand; !~ is its inverse. Note that a regular expression is just a string and can be stored in variables.
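
A sketch of a regular expression held in an ordinary string variable and used with both operators (the input words and the variable name re are invented):

```shell
result=$(printf 'apple\nbanana\ncherry\n' | awk '
    BEGIN    { re = "an+a" }              # a regular expression stored as a string
    $0 ~ re  { print "match:", $0 }
    $0 !~ re { print "no match:", $0 }
')
printf '%s\n' "$result"
```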

The next way uses command-line variable assignment, in which an argument to awk can be seen as an assignment to a variable:

pattern=$1
shift
awk '$0 ~ pattern { print FILENAME ":" $0 }' "pattern=$pattern" "$@"
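
One detail worth knowing about this form is when the assignment takes effect. An assignment given as an operand is processed only when awk reaches it in the argument list, so it is not yet visible in a BEGIN block. The standard -v option, which is not used in the script above, assigns before BEGIN runs. A sketch with a made-up variable v, reading standard input through the special filename -:

```shell
prog='BEGIN { print "begin:[" v "]" } { print "main:[" v "]" }'

# Operand assignment: not yet set inside BEGIN.
operand=$(printf 'x\n' | awk "$prog" v=hello -)

# -v assignment: already set inside BEGIN.
option=$(printf 'x\n' | awk -v v=hello "$prog")

printf '%s\n%s\n' "$operand" "$option"
```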

Finally, this is written in pure awk, without help from a shell and without the need to know too much about the implementation of the awk script (as the command-line variable assignment version requires), but it is a bit lengthy:

BEGIN {
    pattern = ARGV[1]
    for (i = 1; i < ARGC; i++) # remove first argument
        ARGV[i] = ARGV[i + 1]
    ARGC--
    if (ARGC == 1) { # the pattern was the only thing, so force read from standard input (used by book)
        ARGC = 2
        ARGV[1] = "-"
    }
}
$0 ~ pattern { print FILENAME ":" $0 }

The BEGIN is necessary not only to extract the first argument, but also to prevent it from being interpreted as a filename after the BEGIN block ends. ARGC, the number of arguments, is always guaranteed to be ≥1, as ARGV[0] is the name of the command that executed the script, most often the string "awk". Also note that ARGV[ARGC] is the empty string, "". # initiates a comment that extends to the end of the line.

Note the if block. awk checks whether it should read from standard input only once, before it runs the program. This means that

awk 'prog'

works only because the absence of filenames is checked before prog is run. If ARGC is explicitly set to 1 so that there are no arguments, awk will simply quit because it considers there to be no more input files. Therefore, reading from standard input must be requested explicitly with the special filename -.

Self-contained AWK scripts

On Unix-like operating systems self-contained AWK scripts can be constructed using the "shebang" syntax.

For example, a script called hello.awk that prints the string Hello, world! may be built by creating a file named hello.awk containing the following lines:

#!/usr/bin/awk -f
BEGIN { print "Hello, world!" }

The -f tells awk that the argument that follows is the file to read the AWK program from.
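
As a sketch of what this amounts to, the script can be created and then run by spelling out the -f invocation by hand. The scratch path is an assumption of this example. When the file is run through awk -f, the first line is simply treated as a comment.

```shell
script="${TMPDIR:-/tmp}/hello_demo_$$.awk"    # hypothetical scratch path

cat > "$script" <<'EOF'
#!/usr/bin/awk -f
BEGIN { print "Hello, world!" }
EOF
chmod +x "$script"    # needed only for running the file directly as a command

# Executing the file directly asks the system to run: /usr/bin/awk -f <file>
# The same invocation written out explicitly:
greeting=$(awk -f "$script")
rm -f "$script"

printf '%s\n' "$greeting"
```

Running the file directly works only if awk is actually installed at the path named on the first line, which varies between systems.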

Versions and implementations

AWK was originally written in 1977, and distributed with Version 7 Unix.

In 1985 its authors started expanding the language, most significantly by adding user-defined functions. The language is described in the book The AWK Programming Language, published 1988, and its implementation was made available in releases of UNIX System V. To avoid confusion with the incompatible older version, this version was sometimes called "new awk" or nawk. This implementation was released under a free software license in 1996, and is still maintained by Brian Kernighan. (see external links below)

Old versions of Unix, such as UNIX/32V, included awkcc, which converted AWK to C. Kernighan wrote a program to turn awk into C++; its state is not known.^[6]

BWK awk or nawk refers to the version by Brian Kernighan. It has been dubbed the "One True AWK" because of the use of the term in association with the book that originally described the language and the fact that Kernighan was one of the original authors of AWK.^[7] FreeBSD refers to this version as one-true-awk.^[8] This version also has features not in the book, such as tolower and ENVIRON that are explained above; see the FIXES file in the source archive for details. This version is used by e.g. FreeBSD, NetBSD, OpenBSD and OS X.

gawk (GNU awk) is another free software implementation and the only implementation that makes serious progress implementing internationalization and localization and TCP/IP networking. It was written before the original implementation became freely available. It includes its own debugger, its profiler enables the user to make measured performance enhancements to a script, and it lets the user extend functionality via shared libraries. Linux distributions are mostly GNU software, and so they include gawk. FreeBSD before version 5.0 also included gawk version 3.0, but subsequent versions of FreeBSD use BWK awk to avoid the more restrictive GNU General Public License (GPL) as well as for its technical characteristics.^[9] ^[10]

mawk is a very fast AWK implementation by Mike Brennan based on a byte code interpreter.

libmawk is a fork of mawk, allowing applications to embed multiple parallel instances of awk interpreters.

awka (whose front end is written atop the mawk program) is another translator of AWK scripts into C code. When compiled, statically including the author's libawka.a, the resulting executables are considerably sped up and, according to the author's tests, compare very well with other versions of AWK, Perl, or Tcl. Small scripts will turn into programs of 160-170 kB.

tawk (Thompson AWK) is an AWK compiler for Solaris, DOS, OS/2, and Windows, previously sold by Thompson Automation Software (which has ceased its activities).

Jawk is a project to implement AWK in Java, hosted on SourceForge.^[11] Extensions to the language are added to provide access to Java features within AWK scripts (i.e., Java threads, sockets, Collections, etc.).

jawk (Josh's Awk) is a modern implementation of AWK in Perl, which supports ranges, indexing columns by negative numbers, a Perl mode, and more.

xgawk is a SourceForge project based on gawk.^[12] It extends gawk with dynamically loadable libraries.

QSEAWK is an embedded AWK interpreter implementation included in the QSE library that provides embedding application programming interface (API) for C and C++.^[13]

BusyBox includes a sparsely documented AWK implementation that appears to be complete, written by Dmitry Zakharov. This is a very small implementation suitable for embedded systems.

Books

Aho, Alfred V.; Kernighan, Brian W.; Weinberger, Peter J. (1988-01-01). The AWK Programming Language. New York, NY: Addison-Wesley. ISBN 0-201-07981-X. http://cm.bell-labs.com/cm/cs/awkbook/. Retrieved 2009-04-16. The book's webpage includes downloads of the current implementation of Awk and links to others.
Robbins, Arnold (2001-05-15). Effective awk Programming (3rd ed.). Sebastopol, CA: O'Reilly Media. ISBN 0-596-00070-7. http://www.oreilly.com/catalog/awkprog3/. Retrieved 2009-04-16.
Dougherty, Dale; Robbins, Arnold (1997-03-01). sed & awk (2nd ed.). Sebastopol, CA: O'Reilly Media. ISBN 1-56592-225-5. http://www.oreilly.com/catalog/sed2/. Retrieved 2009-04-16.
Robbins, Arnold (2000). Effective Awk Programming: A User's Guide for Gnu Awk (1.0.3 ed.). Bloomington, IN: iUniverse. ISBN 0-595-10034-1. Archived from the original on 12 April 2009. http://www.gnu.org/software/gawk/manual/. Retrieved 2009-04-16. Arnold Robbins maintained the GNU Awk implementation of AWK for more than 10 years. The free GNU Awk manual was also published by O'Reilly in May 2001. Free download of this manual is possible through the following book references.

References

[1] The A-Z of Programming Languages: AWK
[2] The Single UNIX Specification, Version 3, Utilities Interface Table
[3] Linux Standard Base Core Specification 4.0, Chapter 15. Commands and Utilities
[4] Raymond, Eric S. "Applying Minilanguages". The Art of Unix Programming. Case Study: awk. Archived from the original on July 30, 2008. http://web.archive.org/web/20080730063308/http://www.faqs.org/docs/artu/ch08s02.html#awk. Retrieved May 11, 2010. "The awk action language is Turing-complete, and can read and write files."
[5] The A-Z of Programming Languages: AWK. http://www.computerworld.com.au/index.php/id;1726534212;pp;2
[6] An AWK to C++ Translator
[7] The AWK Programming Language, ISBN 0-201-07981-X.
[8] FreeBSD's work log for importing BWK awk into FreeBSD's core, dated 2005-05-16, downloaded 2006-09-20
[9] FreeBSD's view of GPL Advantages and Disadvantages
[10] FreeBSD 5.0 release notes with notice of BWK awk in the base distribution
[11] Jawk at SourceForge
[12] xgawk Home Page
[13] QSEAWK at Google Code

External links

The Amazing Awk Assembler by Henry Spencer.
Awk Community Portal
Awk on flossmanuals.net
AWK at the Open Directory Project
