Awk has in built pattern matching and functions for string substitutions. Here I show some basic examples of simple matching and substitution. Regular expressions is a vast topic so if for in depth regex , please consult a regex book. My favorite is Mastering Regular Expression from Oreilly.
Pattern matching
In awk, simple matching goes like this using the ~ operator. (all examples use myFile.txt)
C:\> type myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,NL,CHN,1,E,N,6,52,15,37
1981,NL,CHN,2,E,N,5,51,23,28
1981,NL,CIN,1,W,N,2,56,35,21
1981,NL,CIN,2,W,N,2,52,31,21
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,AL,DET,1,E,N,4,57,31,26
1981,AL,DET,2,E,N,2,52,29,23
1981,NL,HOU,1,W,N,3,57,28,29
1981,NL,HOU,2,W,N,1,53,33,20
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,NL,LAN,1,W,N,1,57,36,21
1981,NL,LAN,2,W,N,4,53,27,26
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
C:\>awk "/divID/" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
The above says to find any lines that has the string "divID" . For pattern matching, the regex pattern to find is usually enclosed in / /.
If you want case-insensitive search , use the IGNORECASE variable
C:\>awk "BEGIN{IGNORECASE=1}/divid/" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
Setting IGNORECASE to 0 toggles it back to case-sensitive.
If you want to find all records with 2nd column starting with "A", then
C:\>awk -F"," "$2 ~ /^A/ {print}" myFile.txt
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,AL,DET,1,E,N,4,57,31,26
1981,AL,DET,2,E,N,2,52,29,23
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
First, give the -F"," option because the file is "," delimited. Then use $2 because its the 2nd column. Then using the regex /^A/. "^" means "starts with". After that "{print}" action will print the relevant records.
In awk, you can negate matches using !~ operator. For example , you want to find
records that doesn't have "DET" as the 3rd field
C:\>awk -F"," "$3 !~ /DET/{print}" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,NL,CHN,1,E,N,6,52,15,37
1981,NL,CHN,2,E,N,5,51,23,28
1981,NL,CIN,1,W,N,2,56,35,21
1981,NL,CIN,2,W,N,2,52,31,21
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,NL,HOU,1,W,N,3,57,28,29
1981,NL,HOU,2,W,N,1,53,33,20
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,NL,LAN,1,W,N,1,57,36,21
1981,NL,LAN,2,W,N,4,53,27,26
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
If you just want to find records that doesn't have the string "DET", just do a !/DET/ using the "!" operator
C:\>awk -F"," "!/DET/" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,NL,CHN,1,E,N,6,52,15,37
1981,NL,CHN,2,E,N,5,51,23,28
1981,NL,CIN,1,W,N,2,56,35,21
1981,NL,CIN,2,W,N,2,52,31,21
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,NL,HOU,1,W,N,3,57,28,29
1981,NL,HOU,2,W,N,1,53,33,20
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,NL,LAN,1,W,N,1,57,36,21
1981,NL,LAN,2,W,N,4,53,27,26
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
These are very simple examples on using regex operator ~, !~ for searching strings.
String replacement
Awk provides the sub() and gsub() functions to replace strings in files
The syntax for sub() is
sub(regexp, replacement [, target])
for example, replace "LAN" with "NAL"
C:\>awk "{sub(\"LAN\",\"NAL\", $0); print }" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,BAL,1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
1981,AL,BOS,2,E,N,2,52,29,23
1981,AL,CAL,1,W,N,4,60,31,29
1981,AL,CAL,2,W,N,6,50,20,30
1981,AL,CHA,1,W,N,3,53,31,22
1981,AL,CHA,2,W,N,6,53,23,30
1981,NL,CHN,1,E,N,6,52,15,37
1981,NL,CHN,2,E,N,5,51,23,28
1981,NL,CIN,1,W,N,2,56,35,21
1981,NL,CIN,2,W,N,2,52,31,21
1981,AL,CLE,1,E,N,6,50,26,24
1981,AL,CLE,2,E,N,5,53,26,27
1981,AL,DET,1,E,N,4,57,31,26
1981,AL,DET,2,E,N,2,52,29,23
1981,NL,HOU,1,W,N,3,57,28,29
1981,NL,HOU,2,W,N,1,53,33,20
1981,AL,KCA,1,W,N,5,50,20,30
1981,AL,KCA,2,W,N,1,53,30,23
1981,NL,[color=#800000]NAL[/color],1,W,N,1,57,36,21
1981,NL,[color=#800000]NAL[/color],2,W,N,4,53,27,26
1981,AL,MIN,1,W,N,7,56,17,39
1981,AL,MIN,2,W,N,4,53,24,29
sub() only replaces one occurence of the string. For global replacement, use gsub() which has the same syntax as sub().
To replace the "BAL" string from the 4th line only, use NR==4 as the "pattern". then use sub().
C:\>awk "NR==4 { sub(\"BAL\",\"LAB\") } {print}" myFile.txt
yearID,lgID,teamID,Half,divID,DivWin,Rank,G,W,L
1981,NL,ATL,1,W,N,4,54,25,29
1981,NL,ATL,2,W,N,5,52,25,27
1981,AL,[color=#800000]LAB[/color],1,E,N,2,54,31,23
1981,AL,BAL,2,E,N,4,51,28,23
1981,AL,BOS,1,E,N,5,56,30,26
.....
to be continued
- brianadams