Computer Hope
Software => Computer programming => Topic started by: briandams on January 16, 2014, 05:47:09 AM
-
awk has been around since the Unix days of the 70s and is a standard tool on most unix-like operating systems. It is primarily used for text and string manipulation. GNU awk (http://www.gnu.org/software/gawk/manual/gawk.html) is one of the most widely used awk versions nowadays, and it has been ported to Windows, so it is convenient to use as part of your batch scripting tool set.
In this thread, I shall show some examples of how one can use this tool for easy file/text manipulation in a Windows batch environment (provided you can download 3rd party tools not installed by default). This is mostly for beginners to awk, or for people looking for tools to parse strings/text.
The syntax for awk is simple
pattern { action }
For example, to print a file
C:\> awk "{print}" myFile.txt
In the above example, the "action" is "print". This is the equivalent of the command
type myFile.txt
The cmd.exe shell on Windows doesn't like single quotes, so we have to use double quotes for the "action" part.
awk has "BEGIN" and "END" pattern blocks. The "BEGIN" block executes only once, before the first record is read. For example, you can initialize variables inside this block
awk "BEGIN{a=10} ....." myFile.txt
or just do some calculation (simple calculator)
C:\>awk "BEGIN { print 1+2 } "
3
Likewise, the "END" block is executed only once, after all the records in the file have been read. For example, to print the last line of the file
C:\> more myFile.txt
C:\original\1\2\3
C:\original\1\2\4
C:\original\1\2\5
C:\original\1\2\36
test
C:\>awk "END{ print $0} " myFile.txt
test
"$0" means the current line/record.
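To see BEGIN, the main body and END working together, here is a small sketch (shown with unix-shell single quotes; on cmd.exe you would flip to double quotes and escape the inner ones — the input text here is just made up for illustration):

```shell
# count the records and report the first and last line of the input
printf 'alpha\nbeta\ngamma\n' | awk '
BEGIN   { print "starting..." }   # runs once, before any input is read
NR == 1 { first = $0 }            # remember the first record
        { last = $0 }             # runs for every record; ends up holding the last
END     { print "read " NR " lines, first=" first ", last=" last }'
# prints:
# starting...
# read 3 lines, first=alpha, last=gamma
```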
to be continued...
- brianadams
-
One common task almost everyone does is getting information from files. If you have a field delimited text file to parse, then awk might be just the tool you need.
For example, if you have this simple csv file where the delimiter is "|"
C:\>type myFile.txt
1|2|3|4|5
6|7|8|9|10
a|b|c|d|e
Suppose you wish to get the 3rd column. In awk, the 3rd column is denoted by $3; likewise, the 2nd column is $2, and so on. So to get the 3rd column, issue this command
C:\>awk -F"|" "{print $3}" myFile.txt
3
8
c
The -F option sets the field delimiter. Here, "|" is specified as the field delimiter, so awk breaks each record into fields (tokens), with each field denoted by "$" and a number: $1 means the first field, $9 means the 9th field, and so on.
The above is roughly equivalent to the DOS for /f command with tokens
for /f "tokens=3 delims=|" ..........
To print the last field, use $NF. To print the second-to-last field, use $(NF-1)
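A quick sketch of $NF and $(NF-1) (unix-shell quoting; the input lines are made up). Because NF is evaluated per record, this works even when lines have different field counts:

```shell
# print the last and second-to-last field of each line
printf '1|2|3\n4|5|6|7\n' | awk -F'|' '{ print $NF, $(NF-1) }'
# prints:
# 3 2
# 7 6
```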
One of the features of awk's -F option is its ability to take a regular expression, or multiple characters, as the field delimiter. For example, suppose we have a file with the delimiter ",%#"
C:\>type myFile.txt
1,%#2,%#3,%#4,%#5
6,%#7,%#8,%#9,%#10
a,%#b,%#c,%#d,%#e
Issue the same command as before, but pass ",%#" to -F, and print the 2nd column
C:\>awk -F",%#" "{print $2}" myFile.txt
2
7
b
to be continued
- brianadams
-
Often we need to get the length of a string or of a line/record in a file. awk provides the length() function for this. For example
C:\>echo "test"| awk "{print length}"
6
Why is it 6 and not 4? This is because the "echo" command in DOS "counts" the double quotes as characters, hence you get 6. To calculate the string length of some value, just pipe it to awk without the quotes
C:\>echo test| awk "{print length}"
4
Use the usual DOS for loop (for /f ...) to capture the result
How about going through a file and displaying the lines that are of a certain length?
eg we have this file
C:\>type myFile.txt
abcd
abcd
abcdefghi
abcdefghi
abcdefghijklmn
and we want to get those lines whose length is 4.
C:\>awk "length==4" myFile.txt
abcd
abcd
Written this way, length==4 is the "pattern" part of the awk syntax, so the default action (printing the record) is implied. It is not the same as this:
c:\>awk "{length==4}" myFile.txt
which evaluates the comparison inside the "action" and silently discards the result. The "pattern" part of the awk syntax is usually a regular expression or some condition.
Another example: searching for lengths greater than 4 and less than 10 yields
C:\>awk "length>4 && length <10" myFile.txt
abcdefghi
abcdefghi
If you write out the "action" part of the awk syntax explicitly, the above is the same as
C:\>awk "length>4 && length <10 {print} " myFile.txt
abcdefghi
abcdefghi
to be continued..
- brianadams
-
awk provides the usual math operators to help you perform calculations in your scripts. Here I list only the commonly used ones.
Exponents
x ^ y
x ** y
Add, minus, divide, multiply -> +, - , / , *
Modulus : %
x++ , ++x : post and pre increment operators
x-- , --x : post and pre decrement operators
x += 1 : Adds 1 to the value of x
x -= 1 : Subtracts 1 from the value of x
Boolean operators:
! : not operator
&& : Logical AND
|| : Logical OR
Relational Operators
<, <=, >, >=, ==, !=
Regular expression matching operators
~ : matching
!~ : non-matching
Ternary operator (conditional expression )
?:
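The ternary operator deserves a tiny example (unix-shell quoting; the variable x is made up). Note the extra parentheses: without them, print (x > 5) can be mis-parsed as a parenthesized print argument:

```shell
awk 'BEGIN{ x = 7; print ((x > 5) ? "big" : "small") }'
# prints: big
```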
For square roots, there is the sqrt() function. eg sqrt(100)
For trigonometry, there are the cos(), sin() and atan2() functions.
To generate random numbers, there is the rand() function eg
C:\>awk "BEGIN{ print rand() }"
0.237788
To generate a different random number every time you run the awk command, seed the generator with the srand() function first
C:\>gawk "BEGIN{ srand(); print rand() }"
0.14306
C:\>gawk "BEGIN{ srand(); print rand() }"
0.807121
C:\>gawk "BEGIN{ srand(); print rand() }"
0.663245
To concatenate strings , just write them next to each other, like this
C:\>awk "BEGIN{ print \"2\" \"3\" }"
23
If writing the awk command on the command line, we have to take care of the double quotes used inside awk by escaping them. In a unix shell, it can be written like this:
awk 'BEGIN{ print "2" "3"}'
For more information on operators, please consult the manual (http://www.gnu.org/software/gawk/manual/gawk.html#Concatenation).
to be continued ...
- brianadams
-
Here I cover simple string manipulation in awk using its in-built string functions
1) Getting part of a string - substring-ing
2) Getting index of a string
3) Splitting a string
4) Uppercase and lowercase
1) Getting part of a string
awk provides the substr() function to get part of a string, for example
C:\>echo chimpanzee| awk "{print substr($0,2,5) }"
himpa
$0 is the current record/line; in this case, it's the standard input piped to awk. substr($0,2,5) says to get 5 characters starting at position 2 of the current record. It is the same as the DOS built-in substring syntax
%variable:~1,5%
where %variable% is "chimpanzee". Note that the "echo" command in DOS is particular about spaces (ref: foxidrive), so in the example above there are no spaces after "chimpanzee" before the pipe
2) Getting index of a string
This is equivalent to saying "get the first occurrence of a string inside another string". E.g. to find the first occurrence of the letter "h" in "elephant"
C:\>echo elephant| awk "{print index($0,\"h\") }"
5
(take note of the escaping of double quotes when writing on the command line)
If the letter is not found, index() returns 0. You can capture the printed value with a DOS for /f loop and test whether it is 0. This is useful if you want to see whether one string is found inside another.
C:\>echo elephant| awk "{print index($0,\"z\") }"
0
3) Splitting a string
Next, the split() function. awk provides split() to split a string based on a pattern. For example, let's split the word "euphoria" on the letter "p"
C:\>echo euphoria| awk "{ n=split($0,array,\"p\") } END{ print array[1], n} "
eu 2
Again, $0 means the current record (here, "euphoria" passed in from standard input). The split() function takes the string to split as its first argument, an array as its second, and the pattern to split on as its last; this pattern can be a regular expression.
The results of the split are stored in "array". In the above example, we print the first item of the array in the END block. split() returns the number of items in the array, so "n" has a value of 2, meaning there are 2 items in the array.
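Since the split pattern can be a regular expression, here is a sketch splitting on runs of digits (unix-shell quoting; the input string is made up):

```shell
# split on one or more digits and walk the resulting array
echo 'one1two22three' | awk '{
    n = split($0, parts, /[0-9]+/)
    for (i = 1; i <= n; i++) print i, parts[i]
}'
# prints:
# 1 one
# 2 two
# 3 three
```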
4) Uppercase and lowercase
Often you may want to change the case of words/strings in your task. Awk provides the built-in functions tolower() and toupper(). E.g.
this one-liner changes all the characters in the file to uppercase
C:\>type myFile.txt
computerhope.com
C:\>awk "{ print toupper($0) ;}" myFile.txt
COMPUTERHOPE.COM
If you want to change only one string,
C:\>echo test|awk "{print toupper($0) }"
TEST
C:\>echo TEST|awk "{print tolower($0) }"
test
As usual, capture the result using a DOS for loop.
to be continued...
- brianadams
-
awk provides 3 ways of printing output:
1) print
2) printf
3) sprintf()
1) print.
The basic statement for displaying output to the user is the print statement. It shouldn't be too difficult to understand how to use it. Just
print "your string"
You can also redirect to an output file inside awk by using the output redirection operator ">"
C:\>awk "BEGIN{print \"computerhope.com\" > \"testfile\" }"
C:\>type testfile
computerhope.com
2) printf().
The printf statement syntax looks like this:
printf("format" , item1, item2 ...)
printf is very similar to printf() from the C language: you can use format specifiers such as %s (string), %d (integer) and %f (float). For example, to format a number or float to 2 decimal places
C:\>awk "BEGIN{ printf(\"%.2f\" , 100) }"
100.00
C:\>awk "BEGIN{ printf(\"%.2f\" , 3.14244) }"
3.14
To right-justify a string in a field 15 characters wide
C:\>awk "BEGIN{ printf(\"%15s\" , \"mystring\") }"
       mystring
If you want to pad a number with 0's in front, eg
C:\>awk "BEGIN{ printf(\"%05d\" , 100) }"
00100
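The specifiers can of course be combined in one format string. A sketch (unix-shell quoting; the values are made up) lining up a string, an integer and a float in fixed-width columns:

```shell
awk 'BEGIN{ printf("%-10s %5d %8.2f\n", "widgets", 42, 3.14159) }'
# prints: widgets       42     3.14
```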
3) sprintf().
sprintf() works the same as printf(), except that it returns the formatted string instead of printing it, allowing you to save the result to a variable.
C:\>awk "BEGIN{PI = sprintf(\"%.4f\", 22/7); print PI }"
3.1429
Here, the value of 22/7 is saved to the "PI" variable with 4 decimal places. The variable can then be used in other parts of the awk script.
For more info and examples on print, printf and sprintf please consult the manual.
to be continued
- brianadams
-
Awk loops work the same as those in the C language. Here I touch on the 2 most common loops,
1) for loop
2) while loop
The syntax for a "for" loop in awk is this
for (initialization; condition; increment)
body
eg to generate a range of numbers from 1 to 9
C:\>awk "BEGIN{ for(i=1;i<10;i++){ print i } }"
1
2
3
4
5
6
7
8
9
Use a DOS for loop to capture each number and use it as desired. This is the same as
FOR /L %%G IN (1,1,9) DO echo %%G
The while loop is another popular looping construct in most programming languages. For example, counting down and printing 10 "*"
C:\>awk "BEGIN{count=10; while(count>0 ){ print \"*\" ; count--} }"
*
*
*
*
*
*
*
*
*
*
To put it more clearly, as a script (no quote-escaping is needed inside a script file):
BEGIN{
count=10 # set count to 10
while(count>0 ) {
print "*" # print *
count-- # decrement the count each time through the loop
}
}
Of course, the above can be written with the for loop as well
C:\>awk "BEGIN{for(c=10;c>0;c--) print \"*\" }"
*
*
*
*
*
*
*
*
*
*
to be continued
- brianadams
-
Most programming languages support data structures such as arrays, which store a collection of related items instead of individual variables. Awk has arrays too, and they are associative arrays: each array is a collection of index/value pairs.
Here are simple examples of how to use arrays in Awk.
C:\>awk "BEGIN{a[1]=\"one\" ; a[2]=10; print a[1]\",\"a[2] }"
one,10
In the above example, we declare array "a" with index "1" having the value "one" (a string) and index "2" having the value 10 (a number). Arrays in awk can mix data types for both indices and values. E.g.
C:\>awk "BEGIN{a[\"two\"]=2; print a[\"two\"] }"
2
Here, the index is "two" (a string) and the value is the number 2.
To iterate over an array, use awk's for-in loop
C:\>awk "BEGIN{a[1]=\"one\" ; a[\"two\"]=2; for(item in a) {print item\" \"a[item] } }"
two 2
1 one
To put it more clearly, as a script:
BEGIN{
a[1]="one"
a["two"]=2
for( item in a ) {
print item" "a[item]
}
}
In awk, arrays have no inherent ordering, unlike normal arrays in C. So when printing the array as above, the order of the results is arbitrary.
To get the size of an array, you can use length() function as described earlier
C:\>awk "BEGIN{a[1]=\"one\" ; a[\"two\"]=2; a[2]=100; print length(a) }"
3
To see if an index exists in an array, use the in operator with an if statement
C:\>awk "BEGIN{a[1]=\"one\" ; a[\"one\"]=1; a[2]=100; if (2 in a) { print \"ok\"} }"
ok
Clearer this way, as a script:
BEGIN{
a[1] = "one" # define array indices and values
a["one"] = 1
a[2] = 100
if (2 in a) {
print "ok"
}
}
To remove an item from an array, use the delete statement, eg
C:\>awk "BEGIN{a[1]=\"one\" ; a[\"one\"]=1; a[2]=100; delete a[2]; if (2 in a) { print \"ok\"}else {print \"not ok\"} }"
not ok
Clearer this way, as a script:
BEGIN{
a[1]="one"
a["one"]=1
a[2]=100
delete a[2] # delete the item at index 2
if (2 in a) {
print "ok"
} else {
print "not ok"
}
}
To delete a whole array, just do: delete array
See the manual for more elaborate examples on using arrays
to be continued ...
- brianadams
-
Making decisions is part of our thought process every day. If you want to tell the computer to do something conditionally, the language must provide if/else constructs for that. :)
Awk provides the usual if/else/else if constructs that most languages have.
C:\>awk "BEGIN{ b=2; if( b==2 ) print \"it is 2\" }"
it is 2
Basic construct
if ( condition ) {
....
} else if (condition) {
...
} else {
....
}
The break statement jumps out of a loop, like this:
for ( conditions ){
...
break #breaks out of for loop
...
}
The continue statement is also used in loops: it skips the rest of the loop body, causing the next cycle of the loop to begin immediately. Eg
for (x = 0; x <= 10; x++) {
if (x == 2) {
continue # this continue skips the print statement below
}
print "something"
}
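A runnable sketch combining both statements (unix-shell quoting; the numbers are made up):

```shell
awk 'BEGIN{
    for (x = 1; x <= 10; x++) {
        if (x == 3) continue   # skip 3, keep looping
        if (x == 6) break      # leave the loop entirely at 6
        print x
    }
}'
# prints:
# 1
# 2
# 4
# 5
```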
Later versions of gawk support the switch statement, but it's seldom needed, as if/else is good enough for most tasks. If you want to know more about switch statements, check the manual.
to be continued
- brianadams
-
Awk has some internal variables that you should be familiar with for parsing strings and files.
1) NR
2) NF
3) FS
4) RS
5) ORS
6) OFS
1) NR
NR is the number of input records awk has processed since the beginning of the program's execution. For example, to find the line count of a file
C:\>type myFile.txt
1,%#2,%#3,%#4,%#5
6,%#7,%#8,%#9,%#10
a,%#b,%#c,%#d,%#e
C:\>type myFile.txt | awk "END{print NR}"
3
This is the same as what the Unix wc -l command gives you.
2) NF
NF stands for the number of fields in the current input record. For example
C:\>type myFile.txt
1,2,3,4,5
6,7,8,9,0,10
C:\> awk -F"," "{print NF}" myFile.txt
5
6
Here, because we have set the -F option (field delimiter) to comma, the first record has 5 fields and the 2nd record has 6.
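NF is handy for validating delimited files. A sketch (unix-shell quoting; sample data made up) that flags rows which do not have exactly 5 fields:

```shell
printf '1,2,3,4,5\n6,7,8,9,0,10\n' | awk -F',' 'NF != 5 { print "line " NR " has " NF " fields" }'
# prints: line 2 has 6 fields
```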
3) FS
This is the input field separator, the same setting the -F option controls. Usually it's defined in the BEGIN block, before any records are processed
awk "BEGIN{FS=\",\"} {print}" myFile.txt
FS can be a single character, multiple characters, or a regular expression
4) RS
RS stands for input record separator. By default awk's RS is the newline character, which is why awk processes input line by line. You can set RS to a different value. For example, let's display the myFile.txt above with each number on a line by itself.
C:\>more myFile.txt
1,2,3,4,5
6,7,8,9,0,10
C:\>awk "BEGIN{RS=\",\"}{ print $0 } " myFile.txt
1
2
3
4
5
6
7
8
9
0
10
Here, RS is set to comma ",", so now each record is just one number by itself.
5) ORS
ORS stands for output record separator. Its default is the newline "\n", which is printed after every print statement. For example, let's say you want to "wrap" the lines in a file into a single line, eg
C:\>awk "BEGIN{ORS=\"#\"}{ print $0 } " myFile.txt
1,2,3,4,5#6,7,8,9,0,10#
You can change ORS to "#", and the output becomes one line. Notice the "#": originally it was "\n", now it's "#", which gives the effect of joining the lines into a single line.
6) OFS
This is the output field separator. Its default is a space, and it is printed between the fields output by a print statement. For example, changing the output field separator to "#"
C:\>type myFile.txt
1,2,3,4,5
6,7,8,9,0,10
C:\>awk "BEGIN{OFS=\"#\"; FS=\",\"}{$1=$1;print } " myFile.txt
1#2#3#4#5
6#7#8#9#0#10
In the above example, because we are changing the OFS, the record needs to be rebuilt to "reflect" the change; hence the common idiom $1=$1, which forces awk to rebuild the record. (You can consult the manual for the full explanation.)
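To see why the rebuild matters, here is a sketch (unix-shell quoting; input made up) printing the same record before and after touching a field — only the second print reflects the new OFS:

```shell
echo '1,2,3' | awk 'BEGIN{ FS = ","; OFS = "#" } { print; $1 = $1; print }'
# prints:
# 1,2,3
# 1#2#3
```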
to be continued ..
-brianadams
-
Awk has built-in pattern matching and functions for string substitution. Here I show some basic examples of simple matching and substitution. Regular expressions are a vast topic, so for in-depth regex, please consult a regex book. My favorite is Mastering Regular Expressions from O'Reilly.
Pattern matching
In awk, simple matching uses the ~ operator. (All examples use myFile.txt)
C:\> type myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,NL,CHN,1,E,N,6,52,15,37
1981,NL,CHN,2,E,N,5,51,23,28
1981,NL,CIN,1,W,N,2,56,35,21
1981,NL,CIN,2,W,N,2,52,31,21
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,AL,DET,1,E,N,4,57,31,26
1981,AL,DET,2,E,N,2,52,29,23
1981,NL,HOU,1,W,N,3,57,28,29
1981,NL,HOU,2,W,N,1,53,33,20
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,NL,LAN,1,W,N,1,57,36,21
1981,NL,LAN,2,W,N,4,53,27,26
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
C:\>awk "/divID/" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
The above says to find any lines that have the string "divID". For pattern matching, the regex to find is usually enclosed in / /.
If you want a case-insensitive search, use gawk's IGNORECASE variable
C:\>awk "BEGIN{IGNORECASE=1}/divid/" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
Setting IGNORECASE to 0 toggles it back to case-sensitive.
If you want to find all records whose 2nd column starts with "A", then
C:\>awk -F"," "$2 ~ /^A/ {print}" myFile.txt
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,AL,DET,1,E,N,4,57,31,26
1981,AL,DET,2,E,N,2,52,29,23
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
First, give the -F"," option because the file is comma delimited. Then use $2 because it's the 2nd column, and match it against the regex /^A/ ("^" anchors the match to the start). After that, the "{print}" action prints the matching records.
In awk, you can negate matches using the !~ operator. For example, to find
records that don't have "DET" as the 3rd field
C:\>awk -F"," "$3 !~ /DET/{print}" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,NL,CHN,1,E,N,6,52,15,37
1981,NL,CHN,2,E,N,5,51,23,28
1981,NL,CIN,1,W,N,2,56,35,21
1981,NL,CIN,2,W,N,2,52,31,21
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,NL,HOU,1,W,N,3,57,28,29
1981,NL,HOU,2,W,N,1,53,33,20
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,NL,LAN,1,W,N,1,57,36,21
1981,NL,LAN,2,W,N,4,53,27,26
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
If you just want to find records that don't contain the string "DET" anywhere, negate the regex with the "!" operator
C:\>awk -F"," "!/DET/" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,NL,CHN,1,E,N,6,52,15,37
1981,NL,CHN,2,E,N,5,51,23,28
1981,NL,CIN,1,W,N,2,56,35,21
1981,NL,CIN,2,W,N,2,52,31,21
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,NL,HOU,1,W,N,3,57,28,29
1981,NL,HOU,2,W,N,1,53,33,20
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,NL,LAN,1,W,N,1,57,36,21
1981,NL,LAN,2,W,N,4,53,27,26
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
These are very simple examples of using the regex operators ~ and !~ for searching strings.
String replacement
Awk provides the sub() and gsub() functions to replace strings in files
The syntax for sub() is
sub(regexp, replacement [, target])
for example, replace "LAN" with "NAL"
C:\>awk "{sub(\"LAN\",\"NAL\", $0); print }" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,NL,CHN,1,E,N,6,52,15,37
1981,NL,CHN,2,E,N,5,51,23,28
1981,NL,CIN,1,W,N,2,56,35,21
1981,NL,CIN,2,W,N,2,52,31,21
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,AL,DET,1,E,N,4,57,31,26
1981,AL,DET,2,E,N,2,52,29,23
1981,NL,HOU,1,W,N,3,57,28,29
1981,NL,HOU,2,W,N,1,53,33,20
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,NL,NAL,1,W,N,1,57,36,21
1981,NL,NAL,2,W,N,4,53,27,26
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
sub() only replaces the first occurrence of the string. For global replacement, use gsub(), which has the same syntax as sub().
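gsub() also returns the number of substitutions it made, which is occasionally useful. A sketch (unix-shell quoting; the sentence is made up):

```shell
echo 'the cat and the hat' | awk '{ n = gsub(/the/, "a"); print n " replaced: " $0 }'
# prints: 2 replaced: a cat and a hat
```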
To replace the "BAL" string on the 4th line only, use NR==4 as the "pattern", then use sub().
C:\>awk "NR==4 { sub(\"BAL\",\"LAB\") } {print}" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,LAB,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
.....
to be continued
- brianadams
-
Awk commands are not just for one-liners, as we have seen so far. You can put awk commands in a script (that is, a text file) and have awk run them for you. It's the same as writing a vbscript and having the cscript engine run the commands for you.
The syntax for running an awk script is simply (the -f option)
c:\> awk -f myawkscript.awk input_file.csv
Let's round up this part of the primer with an example. Say you have the last 20 days of Google financial data in a comma-delimited csv file. You want to find the average of the closing price (column 5) and find out how many of the days (records) have a closing price greater than the average.
Date,Open,High,Low,Close,Volume,Adj Close
2014-01-03,1115.00,1116.93,1104.93,1105.00,1666700,1105.00
2014-01-02,1115.46,1117.75,1108.26,1113.12,1821400,1113.12
2013-12-31,1112.24,1121.00,1106.26,1120.71,1357900,1120.71
2013-12-30,1120.34,1120.50,1109.02,1109.46,1236100,1109.46
2013-12-27,1120.00,1120.28,1112.94,1118.40,1569700,1118.40
2013-12-26,1114.01,1119.00,1108.69,1117.46,1337800,1117.46
2013-12-24,1114.97,1115.24,1108.10,1111.84,734200,1111.84
2013-12-23,1107.84,1115.80,1105.12,1115.10,1721600,1115.10
2013-12-20,1088.30,1101.17,1088.00,1100.62,3261600,1100.62
2013-12-19,1080.77,1091.99,1079.08,1086.22,1665700,1086.22
2013-12-18,1071.85,1084.95,1059.04,1084.75,2210300,1084.75
2013-12-17,1072.82,1080.76,1068.38,1069.86,1535700,1069.86
2013-12-16,1064.00,1074.69,1062.01,1072.98,1602000,1072.98
2013-12-13,1075.40,1076.29,1057.89,1060.79,2162400,1060.79
2013-12-12,1079.57,1082.94,1069.00,1069.96,1595900,1069.96
2013-12-11,1087.40,1091.32,1075.17,1077.29,1695800,1077.29
2013-12-10,1076.15,1092.31,1075.65,1084.66,1853900,1084.66
2013-12-09,1070.99,1082.31,1068.02,1078.14,1482600,1078.14
2013-12-06,1069.79,1070.00,1060.08,1069.87,1428800,1069.87
2013-12-05,1057.20,1059.66,1051.09,1057.34,1133700,1057.34
For this, it's too "complicated" to be a one-liner, so we put the commands inside a file. You can use any text editor to create your script.
The basic layout of the script goes like this:
BEGIN{
# here you can initialize variables
}
{
# here you do processing for every record
}
END {
# here you can do end processing, like printing the final result
}
Here's a snapshot of the script
BEGIN{
# here you can initialize variables
FS = "," # set the field delimiter to comma
sum = 0  # a variable called sum to store the total of column 5
}
NR>1{
# use NR > 1 to exclude the header row
# here you do processing for every record
sum += $5 # awk implicitly converts each column 5 value to a number
}
END {
# here you can do end processing, like printing the final result
# NR counts the header row too, so the number of data records is NR-1
print "The total sum is " sum
print "The average is " sum/(NR-1)
}
NR is the total number of records read, including the header row, so to average column 5 (the closing price), divide the sum by NR-1 in the END block.
Running the script gives
C:\>awk -f average.awk google.csv
The total sum is 21823.6
The average is 1091.18
Next we find how many days are there in the file that has closing price greater than average. This is the code
BEGIN{
# here you can initialize variables
FS = "," # set the field delimiter to comma
sum = 0  # a variable called sum to store the total of column 5
}
NR>1{
# use NR > 1 to exclude the header row
# here you do processing for every record
sum += $5 # awk implicitly converts each column 5 value to a number
days[$1] = $5 # store the closing price in an array, with the first column (the date) as the index
}
END {
# here you can do end processing, like printing the final result
average = sum/(NR-1) # NR-1 because NR counts the header row too
print "The total sum is " sum
print "The average is " average
print "Days greater than average"
for( d in days ) {
if ( days[d] > average ) {
print d, days[d]
}
}
}
Running the script gives (remember that for-in visits the array in arbitrary order)
C:\>awk -f average.awk google.csv
The total sum is 21823.6
The average is 1091.18
Days greater than average
2013-12-20 1100.62
2013-12-30 1109.46
2013-12-23 1115.10
2013-12-31 1120.71
2013-12-24 1111.84
2014-01-02 1113.12
2013-12-26 1117.46
2014-01-03 1105.00
2013-12-27 1118.40
This is a very simple example to illustrate the concepts shown so far. I hope you can see how to use simple awk in your batch files.
to be continued ..
- brianadams
-
Let's say you want to get some information from systeminfo command. eg you want to get the data from these items:
OS Name
System type
System Up Time
Original Install Date
Total Physical Memory
Available Physical Memory
BIOS Version
OS Version
Here is the code, save as parse_systeminfo.awk
BEGIN{
# here you can initialize variables
FS = ":[ ]+" # set the field delimiter to ":" followed by one or more spaces
# initialize lookup table
array["OS Name"]=""
array["System type"] = ""
array["System Up Time"] = ""
array["Original Install Date"] = ""
array["Total Physical Memory"] = ""
array["Available Physical Memory"] = ""
array["BIOS Version"] = ""
array["OS Version"] = ""
}
{
# update table
if ( $1 in array ){
array[$1] = $2
}
}
END {
for( item in array ){
# beautify output by adjusting width using printf
printf("%-30s ===> %-30s\n" , item, array[item])
}
}
Another way to do it is just to use a regex in the body instead of the lookup table, eg
/OS Name|BIOS Version|....../ {
array[$1] = $2
}
Results:
C:\>systeminfo | awk -f parse_systeminfo.awk
System Up Time ===> 0 Days, 8 Hours, 25 Minutes, 58 Seconds
OS Version ===> 5.1.2600 Service Pack 3 Build 2600
System type ===> X86-based PC
Available Physical Memory ===> 244 MB
Total Physical Memory ===> 575 MB
BIOS Version ===> VBOX - 1
OS Name ===> Microsoft Windows XP Professional
Original Install Date ===> 2013/12/09, 12:04:49 AM
-
For this section I am going to introduce user-defined functions in awk. Awk is in fact a little "programming language", as you can already see from the features covered so far. As such, you can create user-defined functions inside an awk script. The purpose of functions is to avoid repeating the same code for recurring tasks in a program. The syntax of awk functions is similar to other languages.
function name( argument1, argument2 ... )
{
body-of-function
return [expression]
}
You can put all the function declarations before the BEGIN block. Eg say you want a function that prints a horizontal line at various parts of your output
function horizontal_line(){
# function prints 100 dashes
for(i=0;i<100;i++){
printf "-"
}
print # add the final newline
}
BEGIN{
print "Initializing..."
horizontal_line()
print "After horizontal_line function is called ..."
}
output results:
C:\>awk -f myScript.awk
Initializing...
----------------------------------------------------------------------------------------------------
After horizontal_line function is called ...
This is a simple example of a function with no arguments.
In awk, if you pass an array as a function argument, the array is said to be "passed by reference". Otherwise, the argument is said to be "passed by value". For example, a string is passed by value.
function zoo( string ){
print string
string = "snake" # this only changes the local copy
print string
}
BEGIN{
animal = "monkey"
zoo( animal )
}
The function zoo does not change the value of "animal" in the main code. This is called "passed by value".
Arrays, however, are passed by reference, as in this example
function zoo(b){
b[1] = "hippo" # here we change the item to hippo
}
BEGIN{
# main code
a[1] = "test" # define an item in array
print "a[1] before function is: " a[1]
zoo(a) # call zoo function
print "a[1] after function is: " a[1]
}
result:
C:\>awk -f myScript.awk
a[1] before function is: test
a[1] after function is: hippo
We can see that the array item is changed in the main code after calling the function zoo.
Values can be passed back to the caller by using the return keyword.
function calculate(){
.. calculation code here...
result = ....
return result
}
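Here is a concrete sketch of a function returning a value (unix-shell quoting; the area() function is made up for illustration):

```shell
awk 'function area(w, h) {
    return w * h           # the return value goes back to the caller
}
BEGIN{ print "area = " area(3, 4) }'
# prints: area = 12
```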
This is a simple introduction to user defined functions in awk
to be continued...
- brianadams
-
In awk, you can get user input using getline, eg
BEGIN{
print "Enter something"
getline entered
print "You entered " entered
}
result
C:\>awk -f test.awk
Enter something
test
You entered test
Here, the variable "entered" contains the value the user entered.
There is another common use of getline: reading a file. Here's an example of how to read a file inside an awk script
BEGIN{
while ( ( getline line < "myFile.txt" ) > 0 ){
print "Read: " line
}
}
result
C:\>type myFile.txt
computerhope.com
is
the
best
C:\>awk -f myScript.awk
Read: computerhope.com
Read: is
Read: the
Read: best
Let's dissect the while loop. First, getline reads one line from the file:
( getline line < "myFile.txt" )
getline returns a value greater than 0 for every line successfully read:
( getline line < "myFile.txt" ) > 0
You can then use a while loop to iterate through the file,
while ( ( getline line < "myFile.txt" ) > 0 ){
# do something with line
}
each time checking whether the return value is greater than 0. When the end of the file is reached, getline stops returning a positive value and the while loop ends.
Lastly, another common way to use getline is with a pipe. Let's say you want to display the output of the "dir" DOS command inside awk. Here's how to do it, still using a while loop coupled with getline
BEGIN{
while ( ("dir" | getline line ) > 0 ){
print "Read: " line
}
close("dir") # close the pipe properly for next use in the program
}
result
C:\>awk -f myScript.awk
Read: Volume in drive C has no label.
Read: Volume Serial Number is DCEB-67C9
Read:
Read: Directory of C:\
Read:
....
... [ too long ] ...
That's how you can call an external DOS command and have it displayed inside awk program itself.
getline returns 1 if it finds a record, and 0 if the end of the file is encountered. If there is some error in getting a record, such as a file that cannot be opened, then getline returns -1. It is generally good practice to always explicitly test for >0 while reading a file or handling input from pipes.
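A quick way to see all three return codes side by side (the file names are placeholders; POSIX single quotes shown, use double quotes on cmd.exe):

```shell
# Create a one-line file so the second read hits end of file.
printf 'one line\n' > exists.txt
awk 'BEGIN {
    r1 = (getline line < "exists.txt");   print "first read: " r1   # 1 = got a record
    r2 = (getline line < "exists.txt");   print "second read: " r2  # 0 = end of file
    r3 = (getline line < "no_such_file"); print "bad file: " r3     # -1 = open failed
}'
rm exists.txt
```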
to be continued
- brianadams
-
Dealing with date and time is a common task when batch scripting. Awk provides simple date and time functions for basic time/date manipulation needs:
1) systime()
2) strftime()
3) mktime()
1) systime().
This returns the number of seconds since the system epoch. systime() is commonly used as a random number seed.
C:\>awk "BEGIN{ print systime(); } "
1389169226
2) strftime().
This function formats a timestamp based on the contents of a format string. This is useful if you want to create a time stamp on Windows, e.g. to get the full 4-digit year, use the "%Y" format:
C:\>awk "BEGIN{ print strftime(\"%Y\") } "
2014
To get YYYY-MM-DD-HH-mm-ss timestamp
C:\>awk "BEGIN{ print strftime(\"%Y-%m-%d-%H-%M-%S\") } "
2014-01-08-16-24-23
You can then capture the result with the usual DOS for /f loop.
3) mktime( date specs )
"date specs" argument to mktime is a string of the form YYYY MM DD HH MM SS.
YYYY = full year
MM = month, 1 to 12
DD = day, 1 to 31
HH = hour, 0 to 23
mm = minute, 0 to 59
SS = seconds, 0 to 59
mktime creates an epoch-seconds timestamp, just like systime(), e.g.
C:\>awk "BEGIN{string=\"2014 01 01 0 0 0\"; print mktime(string) } "
1388505600
mktime is commonly used to compute time differences, e.g. compare the date "2014 01 01 0 0 0" against today's date and get the difference (in seconds):
C:\>awk "BEGIN{string=\"2014 01 01 0 0 0\"; s=mktime(string); print (systime() - s) } "
664866
This is useful if, for example, you are parsing a log file and filtering the date/time column for a specific date.
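Once you have two epoch-second values, turning the gap into days and hours is plain arithmetic that works in any awk. The two timestamps below are hard-coded sample values rather than live systime()/mktime() calls (POSIX single quotes shown, use double quotes on cmd.exe):

```shell
awk 'BEGIN {
    start = 1388505600            # e.g. mktime("2014 01 01 0 0 0")
    now   = 1389169226            # e.g. a later systime() value
    diff  = now - start
    days  = int(diff / 86400)     # 86400 seconds in a day
    hours = int((diff % 86400) / 3600)
    print diff " seconds = " days " days " hours " hours"
}'
```

This prints: 663626 seconds = 7 days 16 hours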
to be continued
- brianadams
-
Sometimes you may want to merge a collection of similar items, e.g.
C_1,KOG0155
C_1,KOG0306
C_2,KOG3259
C_3,KOG0931
C_2,KOG3638
C_4,KOG0956
C_6,KOG0155
C_1,KOG0306
C_3,KOG3259
C_4,KOG0931
C_5,KOG3638
C_1,KOG0956
to become something like this:
C_1,KOG0155,KOG0306,KOG0306,KOG0956
C_2,KOG3259,KOG3638
C_3,KOG0931,KOG3259
C_4,KOG0956,KOG0931
C_6,KOG0155
C_5,KOG3638
You can make use of associative arrays in awk
C:\>awk -F"," "{ array[$1] = array[$1]\",\"$2 }END{ for(idx in array) print idx, a[idx]}"
C_3 ,KOG0931,KOG3259
C_4 ,KOG0956,KOG0931
C_5 ,KOG3638
C_6 ,KOG0155
C_1 ,KOG0155,KOG0306,KOG0306,KOG0956
C_2 ,KOG3259,KOG3638
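The same one-liner written out as a commented sketch, with an extra branch to avoid the leading comma in the output (groups.csv is a made-up file name; POSIX single quotes shown, use double quotes on cmd.exe):

```shell
# groups.csv stands in for your own data file.
printf 'C_1,KOG0155\nC_2,KOG3259\nC_1,KOG0306\n' > groups.csv
awk -F"," '
{
    if ($1 in array)
        array[$1] = array[$1] "," $2   # append to the existing group
    else
        array[$1] = $2                 # first item: no leading comma
}
END {
    for (idx in array) print idx "," array[idx]
}' groups.csv
rm groups.csv
```

Note that the iteration order of `for (idx in array)` is unspecified, which is why the groups came out shuffled in the one-liner's output above; pipe through sort if you need a stable order.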
-
Lots of awk stuff lately from you.
-
getline returns 1 if it finds a record, and 0 if the end of the file is encountered. If there is some error in getting a record, such as a file that cannot be opened, then getline returns -1. It is generally good practice to always explicitly test for >0 while reading a file or handling input from pipes.
The error is stored within AWK's error variable. It does not pass the error back to the calling batch file or CMD window you have open.
-
The error is stored within AWK's error variable. It does not pass the error back to the calling batch file or CMD window you have open.
awk internally doesn't have a mechanism for checking file existence, such as the -f test in Linux shells. So most of the time, if you want to do that, you have to make a system call, OR call getline and check for -1.
C:\>awk "BEGIN{ x=getline < \"ddd\" ; print x }"
-1
ERRNO is just an internal string variable in awk.
C:\>awk "BEGIN{ getline < \"ddd\" ; print ERRNO }"
No such file or directory
So it doesn't get returned to the DOS errorlevel. You can pass a status back, though, using exit().
C:\>awk "BEGIN{ x=getline < \"ddd\" ; exit(x) }"
C:\>echo %errorlevel%
-1
or
C:\> awk "BEGIN{ if ((\"ddd\" | getline) <= 0 ) exit(-1) ; }" 2>nul
C:\>echo %errorlevel%
-1
-
awk internally doesn't have a mechanism for checking file existence such as -f test for linux. so most of the time if you want to do that then have to make a system call , OR to call getline and check -1.
Then why not use the shells built-in functionality to check for the file existence before running your AWK command.
IF EXIST foo.txt awk.........
-
The error is stored within AWK's error variable. It does not pass the error back to the calling batch file or CMD window you have open.
awk internally doesn't have a mechanism for checking file existence such as -f test for linux. so most of the time if you want to do that then have to make a system call , OR to call getline and check -1.
I can see copy-pasting your posts from another forum (http://www.dostips.com/forum/viewtopic.php?f=3&t=5248) practically verbatim, because they had never really been posted here and so could be valuable to some. But when responses like the above are copy-pasted verbatim to rather different questions, that's just a bit weird, I think.
-
copy-pasting your posts from another forum
I wondered about that.
-
Then why not use the shells built-in functionality to check for the file existence before running your AWK command.
IF EXIST foo.txt awk.........
This can be done in awk as well, as shown in the examples, but if you want to do it in the shell, that's up to the individual.
-
But when responses like the above are copy-pasted verbatim to rather different questions, that's just a bit weird, I think.
The author of that dostips thread is yours truly, hence I can copy and paste all I want. I don't have a blog; if I did, I would just redirect readers there. That's not a different question. I just felt the response looked a bit similar to one I had answered on dostips, hence the copy and paste.
-
I wondered about that.
As explained, I am the original author of that dostips thread.
-
Not that hard to start a free blog or free website.
-
Sometimes you may need to filter a file using keywords from another file. Say you have file1.txt and file2.txt:
C:\>type file1.txt
cheese
milk
sausage
C:\>type file2.txt
milk
cheese
popcorn
pasta
milk
sausage
cheese
melon
You want to filter file2.txt with file1.txt so that only the non-matching lines remain, e.g.
popcorn
pasta
melon
We can do this with an awk one-liner.
C:\>awk "FNR==NR{ a[$1] ;next} { if ( !($0 in a) ) { print } }" file1.txt file2.txt
popcorn
pasta
melon
Explanation:
FNR==NR : FNR is the record number within the current file; NR is the TOTAL number of records read across all files so far. The two are only equal while awk is reading the first file, so the idiom FNR==NR selects the first file's records, which we store as array keys.
When awk finishes the first file and starts the second, FNR and NR hold different values, so only the second block runs there. The
if ( !($0 in a) ) { print }
statement checks whether the current record exists as a key in the array and prints it only if it does not.
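Flipping the test gives you the intersection instead, i.e. only the lines of file2.txt that DO appear in file1.txt. This sketch keys the array on $0 rather than $1, which is equivalent here since each line is a single word (POSIX single quotes shown, use double quotes on cmd.exe):

```shell
printf 'cheese\nmilk\nsausage\n' > file1.txt
printf 'milk\ncheese\npopcorn\n' > file2.txt
# A pattern with no action prints the record by default.
awk 'FNR==NR { a[$0]; next } ($0 in a)' file1.txt file2.txt
rm file1.txt file2.txt
```

This prints milk and cheese, the two lines shared by both files.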
-
Here are some commonly used one liners for file/text parsing
1) Deleting last line of a file
2) Deleting first line of file
3) Print a range of lines
4) Print lines not in a range
5) Concatenating two files
6) Transposing a file (column to row)
7) Print first and last line
8) Print the line above and below a pattern
9) Print all lines until a matched pattern
10) Print from a matched pattern till the end of file
1) Deleting last line of a file
C:\>type myFile.txt
CAT
MAT
RAT
C:\>awk "BEGIN{ RS=\"\0\"} { for(i=1;i<NF;i++) print $i } " myFile.txt
CAT
MAT
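The RS="\0" trick slurps the whole file into one record, which relies on gawk accepting a NUL record separator and loads the entire file into memory. A sketch of a buffer-one-line alternative that works in any awk (POSIX single quotes shown, use double quotes on cmd.exe):

```shell
printf 'CAT\nMAT\nRAT\n' > myFile.txt
# Print each line one record late, so the final line never gets printed.
awk 'NR > 1 { print prev } { prev = $0 }' myFile.txt
rm myFile.txt
```

This prints CAT and MAT, dropping the last line RAT.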
2) Deleting first line of file
C:\> awk "NR>1 { print } " myFile.txt
MAT
RAT
3) Print a range of lines. eg print line 3 to line 5
C:\> type myFile.txt
CAT
MAT
RAT
BAT
TAT
DAT
PAT
C:\> awk "NR==3,NR==5{ print } " myFile.txt
RAT
BAT
TAT
4) Print lines not in a range, e.g. don't print lines 3 to 5
C:\>awk "!(NR>=3 && NR<=5) { print }" myFile.txt
CAT
MAT
DAT
PAT
5) Concatenating two files
C:\>awk "{print}" file1 file2 > newFile.txt
6) Transposing a file (column to row)
C:\> awk "BEGIN{ORS=\" \"}{print}" myFile.txt
CAT MAT RAT BAT TAT DAT PAT
7) Print first and last line
C:\> awk "NR==1;END{print}" myFile.txt
CAT
PAT
8) Print the line above and below a pattern. eg Search for "RAT" and print the lines above and below
C:\> type myFile.txt
CAT
MAT
RAT
BAT
TAT
DAT
PAT
C:\> awk "/RAT/{print y;print;f=1;next}f{print;f=0}{y=$0}" myFile.txt
MAT
RAT
BAT
9) Print all lines until a matched pattern. eg Print until the word "BAT" is found
C:\> awk "/BAT/{exit}{print}" myFile.txt
10) Print from a matched pattern till the end of file
C:\> awk "/TAT/,0" myFile.txt
TAT
DAT
PAT
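One-liner 8 is the densest of the bunch, so here it is again as a commented sketch (the file contents are sample data; POSIX single quotes shown, use double quotes on cmd.exe):

```shell
printf 'CAT\nMAT\nRAT\nBAT\nTAT\n' > myFile.txt
awk '
/RAT/ { print prev      # the line saved from the previous record
        print           # the matching line itself
        f = 1           # flag: also print the next record
        next }
f     { print; f = 0 }  # the line right after the match
      { prev = $0 }     # remember every line for the "above" case
' myFile.txt
rm myFile.txt
```

This prints MAT, RAT, BAT. One caveat: if the pattern matches the very first line, prev is still empty, so a blank line is printed in place of the missing "above" line.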