Computer Hope
Software => Computer programming => Topic started by: briandams on January 16, 2014, 05:47:09 AM
-
awk has been around since the Unix days of the 70s and is a standard tool on most unix-like operating systems. It is primarily used for text and string manipulation. GNU awk (http://www.gnu.org/software/gawk/manual/gawk.html) is one of the most widely used awk versions nowadays, and it has been ported to Windows, so it is convenient to use as part of your batch scripting tool set.
In this thread, I shall show some examples of how one can use this tool for easy file/text manipulation in a Windows batch environment (provided you can download 3rd party tools not installed by default). This is mostly for beginners to awk, or for people looking for tools to parse strings/text.
The syntax for awk is simple
pattern { action }
For example, to print a file
C:\> awk "{print}" myFile.txt
In the above example, the "action" is "print". This is the equivalent of the command
type myFile.txt
The cmd.exe shell on Windows doesn't like single quotes, so we have to use double quotes for the "action" part.
awk has "BEGIN" and "END" pattern blocks. The "BEGIN" block executes only once, before the first record is read. For example, you can initialize variables inside this block
awk "BEGIN{a=10} ....." myFile.txt
or just do some calculation (simple calculator)
C:\>awk "BEGIN { print 1+2 } "
3
Likewise, the "END" block is executed only once, after all the records in the file have been read. For example, to print the last line of the file
C:\> more myFile.txt
C:\original\1\2\3
C:\original\1\2\4
C:\original\1\2\5
C:\original\1\2\36
test
C:\>awk "END{ print $0} " myFile.txt
test
"$0" means the current line/record.
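To see BEGIN, the main body and END working together, here is a small sketch (shown with unix-shell single quotes; on cmd.exe you would flip to double quotes and escape the inner ones — the input text here is just made up for illustration):

```shell
# count the records and report the first and last line of the input
printf 'alpha\nbeta\ngamma\n' | awk '
BEGIN   { print "starting..." }   # runs once, before any input is read
NR == 1 { first = $0 }            # remember the first record
        { last = $0 }             # runs for every record; ends up holding the last
END     { print "read " NR " lines, first=" first ", last=" last }'
# prints:
# starting...
# read 3 lines, first=alpha, last=gamma
```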
to be continued...
- brianadams
-
One common task almost everyone does is getting information from files. If you have a field delimited text file to parse, then awk might be just the tool you need.
For example, if you have this simple csv file where the delimiter is "|"
C:\>type myFile.txt
1|2|3|4|5
6|7|8|9|10
a|b|c|d|e
Suppose you wish to get the 3rd column. In awk, the 3rd column is denoted by $3; likewise, the 2nd column is $2, and so on. So to get the 3rd column, issue this command
C:\>awk -F"|" "{print $3}" myFile.txt
3
8
c
The -F option sets the field delimiter. Here, "|" is specified as the field delimiter, so awk breaks each record into fields (tokens), with each field denoted by "$" and a number: $1 means the first field, $9 means the 9th field, and so on.
The above is roughly equivalent to the DOS for /f command with tokens
for /f "tokens=3 delims=|" ..........
To print the last field, use $NF. To print the second-to-last field, use $(NF-1)
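A quick sketch of $NF and $(NF-1) (unix-shell quoting; the input lines are made up). Because NF is evaluated per record, this works even when lines have different field counts:

```shell
# print the last and second-to-last field of each line
printf '1|2|3\n4|5|6|7\n' | awk -F'|' '{ print $NF, $(NF-1) }'
# prints:
# 3 2
# 7 6
```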
One of the features of awk's -F option is its ability to take a regular expression, or multiple characters, as the field delimiter. For example, suppose we have a file with the delimiter ",%#"
C:\>type myFile.txt
1,%#2,%#3,%#4,%#5
6,%#7,%#8,%#9,%#10
a,%#b,%#c,%#d,%#e
Issue the same command as before, but pass ",%#" to -F, and print the 2nd column
C:\>awk -F",%#" "{print $2}" myFile.txt
2
7
b
to be continued
- brianadams
-
Often we need to get the length of a string or of a line/record in a file. awk provides the length() function for this. For example
C:\>echo "test"| awk "{print length}"
6
Why is it 6 and not 4? This is because the "echo" command in DOS "counts" the double quotes as characters, hence you get 6. To calculate the string length of some value, just pipe it to awk without the quotes
C:\>echo test| awk "{print length}"
4
Use the usual DOS for loop (for /f ...) to capture the result
How about going through a file and displaying the lines that are of a certain length?
eg we have this file
C:\>type myFile.txt
abcd
abcd
abcdefghi
abcdefghi
abcdefghijklmn
and we want to get those lines whose length is 4.
C:\>awk "length==4" myFile.txt
abcd
abcd
Written this way, length==4 is the "pattern" part of the awk syntax, so the default action (printing the record) is implied. It is not the same as this:
c:\>awk "{length==4}" myFile.txt
which evaluates the comparison inside the "action" and silently discards the result. The "pattern" part of the awk syntax is usually a regular expression or some condition.
Another example: searching for lengths greater than 4 and less than 10 yields
C:\>awk "length>4 && length <10" myFile.txt
abcdefghi
abcdefghi
If you write out the "action" part of the awk syntax explicitly, the above is the same as
C:\>awk "length>4 && length <10 {print} " myFile.txt
abcdefghi
abcdefghi
to be continued..
- brianadams
-
awk provides the usual math operators to help you perform calculations in your scripts. Here I list only the commonly used ones.
Exponents
x ^ y
x ** y
Add, minus, divide, multiply -> +, - , / , *
Modulus : %
x++ , ++x : post and pre increment operators
x-- , --x : post and pre decrement operators
x += 1 : Adds 1 to the value of x
x -= 1 : Subtracts 1 from the value of x
Boolean operators:
! : not operator
&& : Logical AND
|| : Logical OR
Relational Operators
<, <=, >, >=, ==, !=
Regular expression matching operators
~ : matching
!~ : non-matching
Ternary operator (conditional expression )
?:
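The ternary operator deserves a tiny example (unix-shell quoting; the variable x is made up). Note the extra parentheses: without them, print (x > 5) can be mis-parsed as a parenthesized print argument:

```shell
awk 'BEGIN{ x = 7; print ((x > 5) ? "big" : "small") }'
# prints: big
```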
For square roots, there is the sqrt() function. eg sqrt(100)
For trigonometry, there are the cos(), sin() and atan2() functions.
To generate random numbers, there is the rand() function eg
C:\>awk "BEGIN{ print rand() }"
0.237788
To generate a different random number every time you run the awk command, seed the generator with the srand() function first
C:\>gawk "BEGIN{ srand(); print rand() }"
0.14306
C:\>gawk "BEGIN{ srand(); print rand() }"
0.807121
C:\>gawk "BEGIN{ srand(); print rand() }"
0.663245
To concatenate strings , just write them next to each other, like this
C:\>awk "BEGIN{ print \"2\" \"3\" }"
23
If writing the awk command on the command line, we have to take care of the double quotes used inside awk by escaping them. In a unix shell, it can be written like this:
awk 'BEGIN{ print "2" "3"}'
For more information on operators, please consult the manual (http://www.gnu.org/software/gawk/manual/gawk.html#Concatenation).
to be continued ...
- brianadams
-
Here I cover simple string manipulation in awk using its in-built string functions
1) Getting part of a string - substring-ing
2) Getting index of a string
3) Splitting a string
4) Uppercase and lowercase
1) Getting part of a string
awk provides the substr() function to get part of a string, for example
C:\>echo chimpanzee| awk "{print substr($0,2,5) }"
himpa
$0 is the current record/line; in this case, it's the standard input piped to awk. substr($0,2,5) says to get 5 characters starting at position 2 of the current record. It is the same as the DOS built-in substring syntax
%variable:~1,5%
where %variable% is "chimpanzee". Note that the "echo" command in DOS is particular about spaces (ref: foxidrive), so in the example above there are no spaces after "chimpanzee" before the pipe
2) Getting index of a string
This is equivalent to saying "get the first occurrence of a string inside another string". E.g. to find the first occurrence of the letter "h" in "elephant"
C:\>echo elephant| awk "{print index($0,\"h\") }"
5
(take note of the escaping of double quotes when writing on the command line)
If the letter is not found, index() returns 0. You can capture the printed value with a DOS for /f loop and test whether it is 0. This is useful if you want to see whether one string is found inside another.
C:\>echo elephant| awk "{print index($0,\"z\") }"
0
3) Splitting a string
Next, the split() function. awk provides split() to split a string based on a pattern. For example, let's split the word "euphoria" on the letter "p"
C:\>echo euphoria| awk "{ n=split($0,array,\"p\") } END{ print array[1], n} "
eu 2
Again, $0 means the current record (here, "euphoria" passed in from standard input). The split() function takes the string to split as its first argument, an array as its second, and the pattern to split on as its last; this pattern can be a regular expression.
The results of the split are stored in "array". In the above example, we print the first item of the array in the END block. split() returns the number of items in the array, so "n" has a value of 2, meaning there are 2 items in the array.
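Since the split pattern can be a regular expression, here is a sketch splitting on runs of digits (unix-shell quoting; the input string is made up):

```shell
# split on one or more digits and walk the resulting array
echo 'one1two22three' | awk '{
    n = split($0, parts, /[0-9]+/)
    for (i = 1; i <= n; i++) print i, parts[i]
}'
# prints:
# 1 one
# 2 two
# 3 three
```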
4) Uppercase and lowercase
Often you may want to change the case of words/strings in your task. Awk provides the built-in functions tolower() and toupper(). E.g.
this one-liner changes all the characters in the file to uppercase
C:\>type myFile.txt
computerhope.com
C:\>awk "{ print toupper($0) ;}" myFile.txt
COMPUTERHOPE.COM
If you want to change only one string,
C:\>echo test|awk "{print toupper($0) }"
TEST
C:\>echo TEST|awk "{print tolower($0) }"
test
As usual, capture the result using a DOS for loop.
to be continued...
- brianadams
-
awk provides 3 ways of printing output:
1) print
2) printf
3) sprintf()
1) print.
The basic statement for displaying output to the user is the print statement. It shouldn't be too difficult to understand how to use it. Just
print "your string"
You can also redirect to an output file inside awk by using the output redirection operator ">"
C:\>awk "BEGIN{print \"computerhope.com\" > \"testfile\" }"
C:\>type testfile
computerhope.com
2) printf().
The printf statement syntax looks like this:
printf("format" , item1, item2 ...)
printf is very similar to printf() from the C language: you can use format specifiers such as %s (string), %d (integer) and %f (float). For example, to format a number or float to 2 decimal places
C:\>awk "BEGIN{ printf(\"%.2f\" , 100) }"
100.00
C:\>awk "BEGIN{ printf(\"%.2f\" , 3.14244) }"
3.14
To right-justify a string in a field 15 characters wide
C:\>awk "BEGIN{ printf(\"%15s\" , \"mystring\") }"
       mystring
If you want to pad a number with 0's in front, eg
C:\>awk "BEGIN{ printf(\"%05d\" , 100) }"
00100
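The specifiers can of course be combined in one format string. A sketch (unix-shell quoting; the values are made up) lining up a string, an integer and a float in fixed-width columns:

```shell
awk 'BEGIN{ printf("%-10s %5d %8.2f\n", "widgets", 42, 3.14159) }'
# prints: widgets       42     3.14
```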
3) sprintf().
sprintf() works the same as printf(), except that it returns the formatted string instead of printing it, allowing you to save the result to a variable.
C:\>awk "BEGIN{PI = sprintf(\"%.4f\", 22/7); print PI }"
3.1429
Here, the value of 22/7 is saved to the "PI" variable with 4 decimal places. The variable can then be used in other parts of the awk script.
For more info and examples on print, printf and sprintf please consult the manual.
to be continued
- brianadams
-
Awk loops work the same as those in the C language. Here I touch on the 2 most common loops,
1) for loop
2) while loop
The syntax for a "for" loop in awk is this
for (initialization; condition; increment)
body
eg to generate a range of numbers from 1 to 9
C:\>awk "BEGIN{ for(i=1;i<10;i++){ print i } }"
1
2
3
4
5
6
7
8
9
Use a DOS for loop to capture each number and use it as desired. This is the same as
FOR /L %%G IN (1,1,9) DO echo %%G
The while loop is another popular looping construct in most programming languages. For example, counting down and printing 10 "*"
C:\>awk "BEGIN{count=10; while(count>0 ){ print \"*\" ; count--} }"
*
*
*
*
*
*
*
*
*
*
To put it more clearly, as a script (no quote-escaping is needed inside a script file):
BEGIN{
count=10 # set count to 10
while(count>0 ) {
print "*" # print *
count-- # decrement the count each time through the loop
}
}
Of course, the above can be written with the for loop as well
C:\>awk "BEGIN{for(c=10;c>0;c--) print \"*\" }"
*
*
*
*
*
*
*
*
*
*
to be continued
- brianadams
-
Most programming languages support data structures such as arrays, which store a collection of related items instead of individual variables. Awk has arrays too, and they are associative arrays: each array is a collection of index/value pairs.
Here are simple examples of how to use arrays in Awk.
C:\>awk "BEGIN{a[1]=\"one\" ; a[2]=10; print a[1]\",\"a[2] }"
one,10
In the above example, we declare array "a" with index "1" having the value "one" (a string) and index "2" having the value 10 (a number). Arrays in awk can mix data types for both indices and values. E.g.
C:\>awk "BEGIN{a[\"two\"]=2; print a[\"two\"] }"
2
Here, the index is "two" (a string) and the value is the number 2.
To iterate over an array, use awk's for-in loop
C:\>awk "BEGIN{a[1]=\"one\" ; a[\"two\"]=2; for(item in a) {print item\" \"a[item] } }"
two 2
1 one
To put it more clearly, as a script:
BEGIN{
a[1]="one"
a["two"]=2
for( item in a ) {
print item" "a[item]
}
}
In awk, arrays have no inherent ordering, unlike normal arrays in C. So when printing the array as above, the order of the results is arbitrary.
To get the size of an array, you can use length() function as described earlier
C:\>awk "BEGIN{a[1]=\"one\" ; a[\"two\"]=2; a[2]=100; print length(a) }"
3
To see if an index exists in an array, use the in operator with an if statement
C:\>awk "BEGIN{a[1]=\"one\" ; a[\"one\"]=1; a[2]=100; if (2 in a) { print \"ok\"} }"
ok
Clearer this way, as a script:
BEGIN{
a[1] = "one" # define array indices and values
a["one"] = 1
a[2] = 100
if (2 in a) {
print "ok"
}
}
To remove an item from an array, use the delete statement, eg
C:\>awk "BEGIN{a[1]=\"one\" ; a[\"one\"]=1; a[2]=100; delete a[2]; if (2 in a) { print \"ok\"}else {print \"not ok\"} }"
not ok
Clearer this way, as a script:
BEGIN{
a[1]="one"
a["one"]=1
a[2]=100
delete a[2] # delete the item at index 2
if (2 in a) {
print "ok"
} else {
print "not ok"
}
}
To delete a whole array, just do: delete array
See the manual for more elaborate examples on using arrays
to be continued ...
- brianadams
-
Making decisions is part of our thought process every day. If you want to tell the computer to do something conditionally, the language must provide if/else constructs for that. :)
Awk provides the usual if/else/else if constructs that most languages have.
C:\>awk "BEGIN{ b=2; if( b==2 ) print \"it is 2\" }"
it is 2
Basic construct
if ( condition ) {
....
} else if (condition) {
...
} else {
....
}
The break statement jumps out of a loop, like this:
for ( conditions ){
...
break #breaks out of for loop
...
}
The continue statement is also used in loops: it skips the rest of the loop body, causing the next cycle of the loop to begin immediately. Eg
for (x = 0; x <= 10; x++) {
if (x == 2) {
continue # this continue skips the print statement below
}
print "something"
}
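A runnable sketch combining both statements (unix-shell quoting; the numbers are made up):

```shell
awk 'BEGIN{
    for (x = 1; x <= 10; x++) {
        if (x == 3) continue   # skip 3, keep looping
        if (x == 6) break      # leave the loop entirely at 6
        print x
    }
}'
# prints:
# 1
# 2
# 4
# 5
```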
Later versions of gawk support the switch statement, but it's seldom needed, as if/else is good enough for most tasks. If you want to know more about switch statements, check the manual.
to be continued
- brianadams
-
Awk has some internal variables that you should be familiar with for parsing strings and files.
1) NR
2) NF
3) FS
4) RS
5) ORS
6) OFS
1) NR
NR is the number of input records awk has processed since the beginning of the program's execution. For example, to find the line count of a file
C:\>type myFile.txt
1,%#2,%#3,%#4,%#5
6,%#7,%#8,%#9,%#10
a,%#b,%#c,%#d,%#e
C:\>type myFile.txt | awk "END{print NR}"
3
This is the same as what the Unix wc -l command gives you.
2) NF
NF stands for the number of fields in the current input record. For example
C:\>type myFile.txt
1,2,3,4,5
6,7,8,9,0,10
C:\> awk -F"," "{print NF}" myFile.txt
5
6
Here, because we have set the -F option (field delimiter) to comma, the first record has 5 fields and the 2nd record has 6.
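NF is handy for validating delimited files. A sketch (unix-shell quoting; sample data made up) that flags rows which do not have exactly 5 fields:

```shell
printf '1,2,3,4,5\n6,7,8,9,0,10\n' | awk -F',' 'NF != 5 { print "line " NR " has " NF " fields" }'
# prints: line 2 has 6 fields
```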
3) FS
This is the input field separator, the same setting the -F option controls. Usually it's defined in the BEGIN block, before any records are processed
awk "BEGIN{FS=\",\"} {print}" myFile.txt
FS can be a single character, multiple characters, or a regular expression
4) RS
RS stands for input record separator. By default awk's RS is the newline character, which is why awk processes input line by line. You can set RS to a different value. For example, let's display the myFile.txt above with each number on a line by itself.
C:\>more myFile.txt
1,2,3,4,5
6,7,8,9,0,10
C:\>awk "BEGIN{RS=\",\"}{ print $0 } " myFile.txt
1
2
3
4
5
6
7
8
9
0
10
Here, RS is set to comma ",", so now each record is just one number by itself.
5) ORS
ORS stands for output record separator. Its default is the newline "\n", which is printed after every print statement. For example, let's say you want to "wrap" the lines in a file into a single line, eg
C:\>awk "BEGIN{ORS=\"#\"}{ print $0 } " myFile.txt
1,2,3,4,5#6,7,8,9,0,10#
You can change ORS to "#", and the output becomes one line. Notice the "#": originally it was "\n", now it's "#", which gives the effect of joining the lines into a single line.
6) OFS
This is the output field separator. Its default is a space, and it is printed between the fields output by a print statement. For example, changing the output field separator to "#"
C:\>type myFile.txt
1,2,3,4,5
6,7,8,9,0,10
C:\>awk "BEGIN{OFS=\"#\"; FS=\",\"}{$1=$1;print } " myFile.txt
1#2#3#4#5
6#7#8#9#0#10
In the above example, because we are changing the OFS, the record needs to be rebuilt to "reflect" the change; hence the common idiom $1=$1, which forces awk to rebuild the record. (You can consult the manual for the full explanation.)
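To see why the rebuild matters, here is a sketch (unix-shell quoting; input made up) printing the same record before and after touching a field — only the second print reflects the new OFS:

```shell
echo '1,2,3' | awk 'BEGIN{ FS = ","; OFS = "#" } { print; $1 = $1; print }'
# prints:
# 1,2,3
# 1#2#3
```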
to be continued ..
-brianadams
-
Awk has built-in pattern matching and functions for string substitution. Here I show some basic examples of simple matching and substitution. Regular expressions are a vast topic, so for in-depth regex, please consult a regex book. My favorite is Mastering Regular Expressions from O'Reilly.
Pattern matching
In awk, simple matching uses the ~ operator. (All examples use myFile.txt)
C:\> type myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,NL,CHN,1,E,N,6,52,15,37
1981,NL,CHN,2,E,N,5,51,23,28
1981,NL,CIN,1,W,N,2,56,35,21
1981,NL,CIN,2,W,N,2,52,31,21
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,AL,DET,1,E,N,4,57,31,26
1981,AL,DET,2,E,N,2,52,29,23
1981,NL,HOU,1,W,N,3,57,28,29
1981,NL,HOU,2,W,N,1,53,33,20
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,NL,LAN,1,W,N,1,57,36,21
1981,NL,LAN,2,W,N,4,53,27,26
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
C:\>awk "/divID/" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
The above says to find any lines that have the string "divID". For pattern matching, the regex to find is usually enclosed in / /.
If you want a case-insensitive search, use gawk's IGNORECASE variable
C:\>awk "BEGIN{IGNORECASE=1}/divid/" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
Setting IGNORECASE to 0 toggles it back to case-sensitive.
If you want to find all records whose 2nd column starts with "A", then
C:\>awk -F"," "$2 ~ /^A/ {print}" myFile.txt
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,AL,DET,1,E,N,4,57,31,26
1981,AL,DET,2,E,N,2,52,29,23
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
First, give the -F"," option because the file is comma delimited. Then use $2 because it's the 2nd column, and match it against the regex /^A/ ("^" anchors the match to the start). After that, the "{print}" action prints the matching records.
In awk, you can negate matches using the !~ operator. For example, to find
records that don't have "DET" as the 3rd field
C:\>awk -F"," "$3 !~ /DET/{print}" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,NL,CHN,1,E,N,6,52,15,37
1981,NL,CHN,2,E,N,5,51,23,28
1981,NL,CIN,1,W,N,2,56,35,21
1981,NL,CIN,2,W,N,2,52,31,21
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,NL,HOU,1,W,N,3,57,28,29
1981,NL,HOU,2,W,N,1,53,33,20
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,NL,LAN,1,W,N,1,57,36,21
1981,NL,LAN,2,W,N,4,53,27,26
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
If you just want to find records that don't contain the string "DET" anywhere, negate the regex with the "!" operator
C:\>awk -F"," "!/DET/" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,NL,CHN,1,E,N,6,52,15,37
1981,NL,CHN,2,E,N,5,51,23,28
1981,NL,CIN,1,W,N,2,56,35,21
1981,NL,CIN,2,W,N,2,52,31,21
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,NL,HOU,1,W,N,3,57,28,29
1981,NL,HOU,2,W,N,1,53,33,20
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,NL,LAN,1,W,N,1,57,36,21
1981,NL,LAN,2,W,N,4,53,27,26
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
These are very simple examples of using the regex operators ~ and !~ for searching strings.
String replacement
Awk provides the sub() and gsub() functions to replace strings in files
The syntax for sub() is
sub(regexp, replacement [, target])
for example, replace "LAN" with "NAL"
C:\>awk "{sub(\"LAN\",\"NAL\", $0); print }" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,NL,CHN,1,E,N,6,52,15,37
1981,NL,CHN,2,E,N,5,51,23,28
1981,NL,CIN,1,W,N,2,56,35,21
1981,NL,CIN,2,W,N,2,52,31,21
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,AL,DET,1,E,N,4,57,31,26
1981,AL,DET,2,E,N,2,52,29,23
1981,NL,HOU,1,W,N,3,57,28,29
1981,NL,HOU,2,W,N,1,53,33,20
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,NL,NAL,1,W,N,1,57,36,21
1981,NL,NAL,2,W,N,4,53,27,26
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
sub() only replaces the first occurrence of the string. For global replacement, use gsub(), which has the same syntax as sub().
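gsub() also returns the number of substitutions it made, which is occasionally useful. A sketch (unix-shell quoting; the sentence is made up):

```shell
echo 'the cat and the hat' | awk '{ n = gsub(/the/, "a"); print n " replaced: " $0 }'
# prints: 2 replaced: a cat and a hat
```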
To replace the "BAL" string on the 4th line only, use NR==4 as the "pattern", then use sub().
C:\>awk "NR==4 { sub(\"BAL\",\"LAB\") } {print}" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,LAB,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
.....
to be continued
- brianadams
-
Awk commands are not just for one-liners, as we have seen so far. You can put awk commands in a script (that is, a text file) and have awk run them for you. It's the same as writing a vbscript and having the cscript engine run the commands for you.
The syntax for running an awk script is simply (the -f option)
c:\> awk -f myawkscript.awk input_file.csv
Let's round up this part of the primer with an example. Say you have the last 20 days of Google financial data in a comma-delimited csv file. You want to find the average of the closing price (column 5) and find out how many of the days (records) have a closing price greater than the average.
Date,Open,High,Low,Close,Volume,Adj Close
2014-01-03,1115.00,1116.93,1104.93,1105.00,1666700,1105.00
2014-01-02,1115.46,1117.75,1108.26,1113.12,1821400,1113.12
2013-12-31,1112.24,1121.00,1106.26,1120.71,1357900,1120.71
2013-12-30,1120.34,1120.50,1109.02,1109.46,1236100,1109.46
2013-12-27,1120.00,1120.28,1112.94,1118.40,1569700,1118.40
2013-12-26,1114.01,1119.00,1108.69,1117.46,1337800,1117.46
2013-12-24,1114.97,1115.24,1108.10,1111.84,734200,1111.84
2013-12-23,1107.84,1115.80,1105.12,1115.10,1721600,1115.10
2013-12-20,1088.30,1101.17,1088.00,1100.62,3261600,1100.62
2013-12-19,1080.77,1091.99,1079.08,1086.22,1665700,1086.22
2013-12-18,1071.85,1084.95,1059.04,1084.75,2210300,1084.75
2013-12-17,1072.82,1080.76,1068.38,1069.86,1535700,1069.86
2013-12-16,1064.00,1074.69,1062.01,1072.98,1602000,1072.98
2013-12-13,1075.40,1076.29,1057.89,1060.79,2162400,1060.79
2013-12-12,1079.57,1082.94,1069.00,1069.96,1595900,1069.96
2013-12-11,1087.40,1091.32,1075.17,1077.29,1695800,1077.29
2013-12-10,1076.15,1092.31,1075.65,1084.66,1853900,1084.66
2013-12-09,1070.99,1082.31,1068.02,1078.14,1482600,1078.14
2013-12-06,1069.79,1070.00,1060.08,1069.87,1428800,1069.87
2013-12-05,1057.20,1059.66,1051.09,1057.34,1133700,1057.34
For this, it's too "complicated" to be a one-liner, so we put the commands inside a file. You can use any text editor to create your script.
The basic layout of the script goes like this:
BEGIN{
# here you can initialize variables
}
{
# here you do processing for every record
}
END {
# here you can do end processing, like printing the final result
}
Here's a snapshot of the script
BEGIN{
# here you can initialize variables
FS = "," # set the field delimiter to comma
sum = 0  # a variable called sum to store the total of column 5
}
NR>1{
# use NR > 1 to exclude the header row
# here you do processing for every record
sum += $5 # awk implicitly converts each column 5 value to a number
}
END {
# here you can do end processing, like printing the final result
# NR counts the header row too, so the number of data records is NR-1
print "The total sum is " sum
print "The average is " sum/(NR-1)
}
NR is the total number of records read, including the header row, so to average column 5 (the closing price), divide the sum by NR-1 in the END block.
Running the script gives
C:\>awk -f average.awk google.csv
The total sum is 21823.6
The average is 1091.18
Next we find how many days are there in the file that has closing price greater than average. This is the code
BEGIN{
# here you can initialize variables
FS = "," # set the field delimiter to comma
sum = 0  # a variable called sum to store the total of column 5
}
NR>1{
# use NR > 1 to exclude the header row
# here you do processing for every record
sum += $5 # awk implicitly converts each column 5 value to a number
days[$1] = $5 # store the closing price in an array, with the first column (the date) as the index
}
END {
# here you can do end processing, like printing the final result
average = sum/(NR-1) # NR-1 because NR counts the header row too
print "The total sum is " sum
print "The average is " average
print "Days greater than average"
for( d in days ) {
if ( days[d] > average ) {
print d, days[d]
}
}
}
Running the script gives (remember that for-in visits the array in arbitrary order)
C:\>awk -f average.awk google.csv
The total sum is 21823.6
The average is 1091.18
Days greater than average
2013-12-20 1100.62
2013-12-30 1109.46
2013-12-23 1115.10
2013-12-31 1120.71
2013-12-24 1111.84
2014-01-02 1113.12
2013-12-26 1117.46
2014-01-03 1105.00
2013-12-27 1118.40
This is a very simple example to illustrate the concepts shown so far. I hope you can see how to use simple awk in your batch files.
to be continued ..
- brianadams
-
Let's say you want to get some information from systeminfo command. eg you want to get the data from these items:
OS Name
System type
System Up Time
Original Install Date
Total Physical Memory
Available Physical Memory
BIOS Version
OS Version
Here is the code, save as parse_systeminfo.awk
BEGIN{
# here you can initialize variables
FS = ":[ ]+" # set the field delimiter to ":" followed by one or more spaces
# initialize lookup table
array["OS Name"]=""
array["System type"] = ""
array["System Up Time"] = ""
array["Original Install Date"] = ""
array["Total Physical Memory"] = ""
array["Available Physical Memory"] = ""
array["BIOS Version"] = ""
array["OS Version"] = ""
}
{
# update table
if ( $1 in array ){
array[$1] = $2
}
}
END {
for( item in array ){
# beautify output by adjusting width using printf
printf("%-30s ===> %-30s\n" , item, array[item])
}
}
Another way to do it is just to use a regex in the body instead of the lookup table, eg
/OS Name|BIOS Version|....../ {
array[$1] = $2
}
Results:
C:\>systeminfo | awk -f parse_systeminfo.awk
System Up Time ===> 0 Days, 8 Hours, 25 Minutes, 58 Seconds
OS Version ===> 5.1.2600 Service Pack 3 Build 2600
System type ===> X86-based PC
Available Physical Memory ===> 244 MB
Total Physical Memory ===> 575 MB
BIOS Version ===> VBOX - 1
OS Name ===> Microsoft Windows XP Professional
Original Install Date ===> 2013/12/09, 12:04:49 AM
-
For this section I am going to introduce user-defined functions in awk. Awk is in fact a little "programming language", as you can already see from the features covered so far. As such, you can create user-defined functions inside an awk script. The purpose of functions is to avoid repeating the same code for recurring tasks in a program. The syntax of awk functions is similar to other languages.
function name( argument1, argument2 ... )
{
body-of-function
return [expression]
}
You can put all the function declarations before the BEGIN block. Eg say you want a function that prints a horizontal line at various parts of your output
function horizontal_line(){
# function prints 100 dashes
for(i=0;i<100;i++){
printf "-"
}
print # add the final newline
}
BEGIN{
print "Initializing..."
horizontal_line()
print "After horizontal_line function is called ..."
}
output results:
C:\>awk -f myScript.awk
Initializing...
----------------------------------------------------------------------------------------------------
After horizontal_line function is called ...
This is a simple example of a function with no arguments.
In awk, if you pass an array as a function argument, the array is said to be "passed by reference". Otherwise, the argument is said to be "passed by value". For example, a string is passed by value.
function zoo( string ){
print string
string = "snake" # this only changes the local copy
print string
}
BEGIN{
animal = "monkey"
zoo( animal )
}
The function zoo does not change the value of "animal" in the main code. This is called "passed by value".
Arrays, however, are passed by reference, as in this example
function zoo(b){
b[1] = "hippo" # here we change the item to hippo
}
BEGIN{
# main code
a[1] = "test" # define an item in array
print "a[1] before function is: " a[1]
zoo(a) # call zoo function
print "a[1] after function is: " a[1]
}
result:
C:\>awk -f myScript.awk
a[1] before function is: test
a[1] after function is: hippo
We can see that the array item is changed in the main code after calling the function zoo.
Values can be passed back to the caller by using the return keyword.
function calculate(){
.. calculation code here...
result = ....
return result
}
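Here is a concrete sketch of a function returning a value (unix-shell quoting; the area() function is made up for illustration):

```shell
awk 'function area(w, h) {
    return w * h           # the return value goes back to the caller
}
BEGIN{ print "area = " area(3, 4) }'
# prints: area = 12
```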
This is a simple introduction to user defined functions in awk
to be continued...
- brianadams
-
In awk, you can get user input using getline, eg
BEGIN{
print "Enter something"
getline entered
print "You entered " entered
}
result
C:\>awk -f test.awk
Enter something
test
You entered test
Here, the variable "entered" contains the value the user entered.
There is another common use of getline: reading a file. Here's an example of how to read a file inside an awk script
BEGIN{
while ( ( getline line < "myFile.txt" ) > 0 ){
print "Read: " line
}
}
result
C:\>type myFile.txt
computerhope.com
is
the
best
C:\>awk -f myScript.awk
Read: computerhope.com
Read: is
Read: the
Read: best
Let's dissect the while loop. First, getline reads one line from the file:
( getline line < "myFile.txt" )
getline returns a value greater than 0 for every line successfully read:
( getline line < "myFile.txt" ) > 0
You can then use a while loop to iterate through the file,
while ( ( getline line < "myFile.txt" ) > 0 ){
# do something with line
}
each time checking whether the return value is greater than 0. When the end of the file is reached, getline stops returning a positive value and the while loop ends.
Lastly, another common way to use getline is with a pipe. Let's say you want to display the output of the "dir" DOS command inside awk. Here's how to do it, still using a while loop coupled with getline
BEGIN{
while ( ("dir" | getline line ) > 0 ){
print "Read: " line
}
close("dir") # close the pipe properly for next use in the program
}
result
C:\>awk -f myScript.awk
Read: Volume in drive C has no label.
Read: Volume Serial Number is DCEB-67C9
Read:
Read: Directory of C:\
Read:
....
... [ too long ] ...
That's how you can call an external DOS command and have it displayed inside awk program itself.
getline returns 1 if it finds a record, and 0 if the end of the file is encountered. If there is some error in getting a record, such as a file that cannot be opened, then getline returns -1. It is generally good practice to always explicitly test for >0 while reading a file or handling input from pipes.
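A quick way to see all three return codes side by side (the file names are placeholders; POSIX single quotes shown, use double quotes on cmd.exe):

```shell
# Create a one-line file so the second read hits end of file.
printf 'one line\n' > exists.txt
awk 'BEGIN {
    r1 = (getline line < "exists.txt");   print "first read: " r1   # 1 = got a record
    r2 = (getline line < "exists.txt");   print "second read: " r2  # 0 = end of file
    r3 = (getline line < "no_such_file"); print "bad file: " r3     # -1 = open failed
}'
rm exists.txt
```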
to be continued
- brianadams
-
Dealing with date and time is a common task when batch scripting. Awk provides simple date and time functions for basic time/date manipulation needs:
1) systime()
2) strftime()
3) mktime()
1) systime().
This returns the number of seconds since the system epoch. systime() is commonly used as a random number seed.
C:\>awk "BEGIN{ print systime(); } "
1389169226
2) strftime().
This function formats a timestamp based on the contents of a format string. This is useful if you want to create a time stamp on Windows, e.g. to get the full 4-digit year, use the "%Y" format:
C:\>awk "BEGIN{ print strftime(\"%Y\") } "
2014
To get YYYY-MM-DD-HH-mm-ss timestamp
C:\>awk "BEGIN{ print strftime(\"%Y-%m-%d-%H-%M-%S\") } "
2014-01-08-16-24-23
You can then capture the result with the usual DOS for /f loop.
3) mktime( date specs )
"date specs" argument to mktime is a string of the form YYYY MM DD HH MM SS.
YYYY = full year
MM = month, 1 to 12
DD = day, 1 to 31
HH = hour, 0 to 23
mm = minute, 0 to 59
SS = seconds, 0 to 59
mktime creates an epoch-seconds timestamp, just like systime(), e.g.
C:\>awk "BEGIN{string=\"2014 01 01 0 0 0\"; print mktime(string) } "
1388505600
mktime is commonly used to compute time differences, e.g. compare the date "2014 01 01 0 0 0" against today's date and get the difference (in seconds):
C:\>awk "BEGIN{string=\"2014 01 01 0 0 0\"; s=mktime(string); print (systime() - s) } "
664866
This is useful if, for example, you are parsing a log file and filtering the date/time column for a specific date.
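Once you have two epoch-second values, turning the gap into days and hours is plain arithmetic that works in any awk. The two timestamps below are hard-coded sample values rather than live systime()/mktime() calls (POSIX single quotes shown, use double quotes on cmd.exe):

```shell
awk 'BEGIN {
    start = 1388505600            # e.g. mktime("2014 01 01 0 0 0")
    now   = 1389169226            # e.g. a later systime() value
    diff  = now - start
    days  = int(diff / 86400)     # 86400 seconds in a day
    hours = int((diff % 86400) / 3600)
    print diff " seconds = " days " days " hours " hours"
}'
```

This prints: 663626 seconds = 7 days 16 hours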
to be continued
- brianadams
-
Sometimes you may want to merge a collection of similar items, e.g.
C_1,KOG0155
C_1,KOG0306
C_2,KOG3259
C_3,KOG0931
C_2,KOG3638
C_4,KOG0956
C_6,KOG0155
C_1,KOG0306
C_3,KOG3259
C_4,KOG0931
C_5,KOG3638
C_1,KOG0956
to become something like this:
C_1,KOG0155,KOG0306,KOG0306,KOG0956
C_2,KOG3259,KOG3638
C_3,KOG0931,KOG3259
C_4,KOG0956,KOG0931
C_6,KOG0155
C_5,KOG3638
You can make use of associative arrays in awk
C:\>awk -F"," "{ array[$1] = array[$1]\",\"$2 }END{ for(idx in array) print idx, a[idx]}"
C_3 ,KOG0931,KOG3259
C_4 ,KOG0956,KOG0931
C_5 ,KOG3638
C_6 ,KOG0155
C_1 ,KOG0155,KOG0306,KOG0306,KOG0956
C_2 ,KOG3259,KOG3638
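The same one-liner written out as a commented sketch, with an extra branch to avoid the leading comma in the output (groups.csv is a made-up file name; POSIX single quotes shown, use double quotes on cmd.exe):

```shell
# groups.csv stands in for your own data file.
printf 'C_1,KOG0155\nC_2,KOG3259\nC_1,KOG0306\n' > groups.csv
awk -F"," '
{
    if ($1 in array)
        array[$1] = array[$1] "," $2   # append to the existing group
    else
        array[$1] = $2                 # first item: no leading comma
}
END {
    for (idx in array) print idx "," array[idx]
}' groups.csv
rm groups.csv
```

Note that the iteration order of `for (idx in array)` is unspecified, which is why the groups came out shuffled in the one-liner's output above; pipe through sort if you need a stable order.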
-
Lots of awk stuff lately from you.
-
getline returns 1 if it finds a record, and 0 if the end of the file is encountered. If there is some error in getting a record, such as a file that cannot be opened, then getline returns -1. It is generally good practice to always explicitly test for >0 while reading a file or handling input from pipes.
The error is stored within AWK's error variable. It does not pass the error back to the calling batch file or CMD window you have open.
-
The error is stored within AWK's error variable. It does not pass the error back to the calling batch file or CMD window you have open.
awk internally doesn't have a mechanism for checking file existence, such as the -f test in Linux shells. So most of the time, if you want to do that, you have to make a system call, OR call getline and check for -1.
C:\>awk "BEGIN{ x=getline < \"ddd\" ; print x }"
-1
ERRNO is just an internal string variable in awk.
C:\>awk "BEGIN{ getline < \"ddd\" ; print ERRNO }"
No such file or directory
So it doesn't get returned to the DOS errorlevel. You can pass a status back, though, using exit().
C:\>awk "BEGIN{ x=getline < \"ddd\" ; exit(x) }"
C:\>echo %errorlevel%
-1
or
C:\> awk "BEGIN{ if ((\"ddd\" | getline) <= 0 ) exit(-1) ; }" 2>nul
C:\>echo %errorlevel%
-1
-
awk internally doesn't have a mechanism for checking file existence such as -f test for linux. so most of the time if you want to do that then have to make a system call , OR to call getline and check -1.
Then why not use the shells built-in functionality to check for the file existence before running your AWK command.
IF EXIST foo.txt awk.........
-
The error is stored within AWK's error variable. It does not pass the error back to the calling batch file or CMD window you have open.
awk internally doesn't have a mechanism for checking file existence such as -f test for linux. so most of the time if you want to do that then have to make a system call , OR to call getline and check -1.
I can see copy-pasting your posts from another forum (http://www.dostips.com/forum/viewtopic.php?f=3&t=5248) practically verbatim, because they had never really been posted here and so could be valuable to some. But when responses like the above are copy-pasted verbatim to rather different questions, that's just a bit weird, I think.
-
copy-pasting your posts from another forum
I wondered about that.
-
Then why not use the shells built-in functionality to check for the file existence before running your AWK command.
IF EXIST foo.txt awk.........
This can be done in awk as well, as shown in the examples, but if you want to do it in the shell, that's up to the individual.
-
But when responses like the above are copy-pasted verbatim to rather different questions, that's just a bit weird, I think.
The author of that dostips thread is yours truly, hence I can copy and paste all I want. I don't have a blog; if I did, I would just redirect readers there. That's not a different question. I just felt the response looked a bit similar to one I had answered on dostips, hence the copy and paste.
-
I wondered about that.
As explained, I am the original author of that dostips thread.
-
Not that hard to start a free blog or free website.
-
Sometimes you may need to filter a file using keywords from another file. Say you have file1.txt and file2.txt:
C:\>type file1.txt
cheese
milk
sausage
C:\>type file2.txt
milk
cheese
popcorn
pasta
milk
sausage
cheese
melon
You want to filter file2.txt with file1.txt so that only the non-matching lines remain, e.g.
popcorn
pasta
melon
We can do this with an awk one-liner.
C:\>awk "FNR==NR{ a[$1] ;next} { if ( !($0 in a) ) { print } }" file1.txt file2.txt
popcorn
pasta
melon
Explanation:
FNR==NR : FNR is the record number within the current file; NR is the TOTAL number of records read across all files so far. The two are only equal while awk is reading the first file, so the idiom FNR==NR selects the first file's records, which we store as array keys.
When awk finishes the first file and starts the second, FNR and NR hold different values, so only the second block runs there. The
if ( !($0 in a) ) { print }
statement checks whether the current record exists as a key in the array and prints it only if it does not.
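Flipping the test gives you the intersection instead, i.e. only the lines of file2.txt that DO appear in file1.txt. This sketch keys the array on $0 rather than $1, which is equivalent here since each line is a single word (POSIX single quotes shown, use double quotes on cmd.exe):

```shell
printf 'cheese\nmilk\nsausage\n' > file1.txt
printf 'milk\ncheese\npopcorn\n' > file2.txt
# A pattern with no action prints the record by default.
awk 'FNR==NR { a[$0]; next } ($0 in a)' file1.txt file2.txt
rm file1.txt file2.txt
```

This prints milk and cheese, the two lines shared by both files.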
-
Here are some commonly used one liners for file/text parsing
1) Deleting last line of a file
2) Deleting first line of file
3) Print a range of lines
4) Print lines not in a range
5) Concatenating two files
6) Transposing a file (column to row)
7) Print first and last line
8) Print the line above and below a pattern
9) Print all lines until a matched pattern
10) Print from a matched pattern till the end of file
1) Deleting last line of a file
C:\>type myFile.txt
CAT
MAT
RAT
C:\>awk "BEGIN{ RS=\"\0\"} { for(i=1;i<NF;i++) print $i } " myFile.txt
CAT
MAT
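The RS="\0" trick slurps the whole file into one record, which relies on gawk accepting a NUL record separator and loads the entire file into memory. A sketch of a buffer-one-line alternative that works in any awk (POSIX single quotes shown, use double quotes on cmd.exe):

```shell
printf 'CAT\nMAT\nRAT\n' > myFile.txt
# Print each line one record late, so the final line never gets printed.
awk 'NR > 1 { print prev } { prev = $0 }' myFile.txt
rm myFile.txt
```

This prints CAT and MAT, dropping the last line RAT.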
2) Deleting first line of file
C:\> awk "NR>1 { print } " myFile.txt
MAT
RAT
3) Print a range of lines. eg print line 3 to line 5
C:\> type myFile.txt
CAT
MAT
RAT
BAT
TAT
DAT
PAT
C:\> awk "NR==3,NR==5{ print } " myFile.txt
RAT
BAT
TAT
4) Print lines not in a range, e.g. don't print lines 3 to 5
C:\>awk "!(NR>=3 && NR<=5) { print }" myFile.txt
CAT
MAT
DAT
PAT
5) Concatenating two files
C:\>awk "{print}" file1 file2 > newFile.txt
6) Transposing a file (column to row)
C:\> awk "BEGIN{ORS=\" \"}{print}" myFile.txt
CAT MAT RAT BAT TAT DAT PAT
7) Print first and last line
C:\> awk "NR==1;END{print}" myFile.txt
CAT
PAT
8) Print the line above and below a pattern. eg Search for "RAT" and print the lines above and below
C:\> type myFile.txt
CAT
MAT
RAT
BAT
TAT
DAT
PAT
C:\> awk "/RAT/{print y;print;f=1;next}f{print;f=0}{y=$0}" myFile.txt
MAT
RAT
BAT
9) Print all lines until a matched pattern. eg Print until the word "BAT" is found
C:\> awk "/BAT/{exit}{print}" myFile.txt
10) Print from a matched pattern till the end of file
C:\> awk "/TAT/,0" myFile.txt
TAT
DAT
PAT
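One-liner 8 is the densest of the bunch, so here it is again as a commented sketch (the file contents are sample data; POSIX single quotes shown, use double quotes on cmd.exe):

```shell
printf 'CAT\nMAT\nRAT\nBAT\nTAT\n' > myFile.txt
awk '
/RAT/ { print prev      # the line saved from the previous record
        print           # the matching line itself
        f = 1           # flag: also print the next record
        next }
f     { print; f = 0 }  # the line right after the match
      { prev = $0 }     # remember every line for the "above" case
' myFile.txt
rm myFile.txt
```

This prints MAT, RAT, BAT. One caveat: if the pattern matches the very first line, prev is still empty, so a blank line is printed in place of the missing "above" line.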