As a Database Administrator or database developer, you may move data from file to database or from one database to another. The data coming to you from another source in the form or excel or text file may not be in the format that your destination database supports. Sometime, the file has more columns than your table, the phone or date format is not supported by the destination database. Before loading a file, data needs proper formatting to your need and then load it. How do you scrub or massage data and format it to your need? You can write a code in Java, Perl or C++ to read from a file, format data to your need and spit out to a file. Writing a program just to format to our need may take hours in these language. To avoid expensive and long programming hours, someone came with AWK programming which is famously known for one-line code. You can almost do anything with a single line code using AWK.
new line: \n
tab : \t
Redirection Operator to write result to a file.
What is AWK? It is a programming language for text processing, extracting and a reporting tool that comes standard for most Linux-like operating systems. AWK was developed in Bell Labs in 1970's by Alfred Aho, Peter Weingberger, and Brian Kernighan which got the name from their surname.
We are going to run through some examples to understand how AWK works? We have two files with data that are not in the correct format. We will work from these examples to slice and dice these files to format or extract the only data we need. We have two file statecapital.txt and employee.txt
pbaniya@dbarepublic.com:~ $cat statecapital.txt
S/N STATE CAPITAL
1) Alabama Montgomery
2) Alaska Juneau
3) Arizona Phoenix
4) Arkansas Little Rock
5) California Sacramento
6) Colorado Denver
7) Connecticut Hartford
8) Delaware Dover
9) Florida Tallahassee
10) Georgia Atlanta
Printing all Field using AWKawk '{print}' statecapital.txt
pbaniya@dbarepublic.com:~ $ awk '{printf}' state_capital.txt
S/N STATE CAPITAL
1) Alabama Montgomery
2) Alaska Juneau
3) Arizona Phoenix
4) Arkansas Little Rock
5) California Sacramento
6) Colorado Denver
7) Connecticut Hartford
8) Delaware Dover
9) Florida Tallahassee
10) Georgia Atlanta
Printing STATE and CAPITAL onlypbaniya@dbarepublic.com:~ $ awk '{print $2, $3}' state_capital.txt
STATE CAPITAL
Alabama Montgomery
Alaska Juneau
Arizona Phoenix
Arkansas Little
California Sacramento
Colorado Denver
Connecticut Hartford
Delaware Dover
Florida Tallahassee
Georgia Atlan
Printing Searched Column: Print row that contains Alabamapbaniya@dbarepublic.com:~ $ awk '/Alabama/' state_capital.txt
1) Alabama Montgomery
Adding Double Quote on State and Capitalpbaniya@dbarepublic.com:~ $ awk '{printf "\""$2"\"" "\t" "\""$3"\"" "\n"}' state_capital.txt
"STATE" "CAPITAL"
"Alabama" "Montgomery"
"Alaska" "Juneau"
"Arizona" "Phoenix"
"Arkansas" "Little"
"California" "Sacramento"
"Colorado" "Denver"
"Connecticut" "Hartford"
"Delaware" "Dover"
"Florida" "Tallahassee"
"Georgia" "Atlanta"
AWK Escape character: \new line: \n
tab : \t
Redirection Operator to write result to a file.
pbaniya@dbarepublic.com:~ $ awk '{printf "\""$2"\"" "\t" "\""$3"\"" "\n"}' state_capital.txt > stat_capital_formatted.txt
pbaniya@dbarepublic.com:~ $ cat state_capital_formatted.txt
"STATE" "CAPITAL"
"Alabama" "Montgomery"
"Alaska" "Juneau"
"Arizona" "Phoenix"
"Arkansas" "Little"
"California" "Sacramento"
"Colorado" "Denver"
"Connecticut" "Hartford"
"Delaware" "Dover"
"Florida" "Tallahassee"
"Georgia" "Atlanta"
Remove Double Quote: awk '{print $1, $2 }' state_capital_formatted.txt | sed 's/\"//g'pbaniya@dbarepublic.com:~ $ awk '{print $1 "\t" $2 }' state_capital_formatted.txt | sed 's/\"//g'
STATE CAPITAL
Alabama Montgomery
Alaska Juneau
Arizona Phoenix
Arkansas Little
California Sacramento
Colorado Denver
Connecticut Hartford
Delaware Dover
Florida Tallahassee
Georgia Atlanta
pbaniya@dbarepublic.com:~/sql_loader $ awk '{print }' employee_data.txt
EMPLOYEE_ID","FIRST_NAME","LAST_NAME","EMAIL","PHONE_NUMBER","HIRE_DATE","JOB_ID","SALARY","COMMISSION_PCT","MANAGER_ID","DEPARTMENT_ID"
100,"Steven","King","SKING","515.123.4567",17-JUN-87,"AD_PRES",24000,,,90
101,"Neena","Kochhar","NKOCHHAR","515.123.4568",21-SEP-89,"AD_VP",17000,,100,90
102,"Lex","De Haan","LDEHAAN","515.123.4569",13-JAN-93,"AD_VP",17000,,100,90
103,"Alexander","Hunold","AHUNOLD","590.423.4567",03-JAN-90,"IT_PROG",9000,,102,60
104,"Bruce","Ernst","BERNST","590.423.4568",21-MAY-91,"IT_PROG",6000,,103,60
105,"David","Austin","DAUSTIN","590.423.4569",25-JUN-97,"IT_PROG",4800,,103,60
106,"Valli","Pataballa","VPATABAL","590.423.4560",05-FEB-98,"IT_PROG",4800,,103,60
107,"Diana","Lorentz","DLORENTZ","590.423.5567",07-FEB-99,"IT_PROG",4200,,103,60
108,"Nancy","Greenberg","NGREENBE","515.124.4569",17-AUG-94,"FI_MGR",12000,,101,100
109,"Daniel","Faviet","DFAVIET","515.124.4169",16-AUG-94,"ACCOUNT",9000,,108,100
110,"John","Chen","JCHEN","515.124.4269",28-SEP-97,"ACCOUNT",8200,,108,100
111,"Ismael","Sciarra","ISCIARRA","515.124.4369",30-SEP-97,"ACCOUNT",7700,,108,100
112,"Jose Manuel","Urman","JMURMAN","515.124.4469",07-MAR-98,"ACCOUNT",7800,,108,100
113,"Luis","Popp","LPOPP","515.124.4567",07-DEC-99,"ACCOUNT",6900,,108,100
114,"Den","Raphaely","DRAPHEAL","515.127.4561",07-DEC-94,"PU_MAN",11000,,100,30
115,"Alexander","Khoo","AKHOO","515.127.4562",18-MAY-95,"PU_CLERK",3100,,114,30
Removing Double Quote from employee.txt filepbaniya@dbarepublic.com:~/sql_loader $ awk '{print }' employee_data.txt | sed 's/\"//g'
EMPLOYEE_ID,FIRST_NAME,LAST_NAME,EMAIL,PHONE_NUMBER,HIRE_DATE,JOB_ID,SALARY,COMMISSION_PCT,MANAGER_ID,DEPARTMENT_ID
100,Steven,King,SKING,515.123.4567,17-JUN-87,AD_PRES,24000,,,90
101,Neena,Kochhar,NKOCHHAR,515.123.4568,21-SEP-89,AD_VP,17000,,100,90
102,Lex,De Haan,LDEHAAN,515.123.4569,13-JAN-93,AD_VP,17000,,100,90
103,Alexander,Hunold,AHUNOLD,590.423.4567,03-JAN-90,IT_PROG,9000,,102,60
104,Bruce,Ernst,BERNST,590.423.4568,21-MAY-91,IT_PROG,6000,,103,60
105,David,Austin,DAUSTIN,590.423.4569,25-JUN-97,IT_PROG,4800,,103,60
106,Valli,Pataballa,VPATABAL,590.423.4560,05-FEB-98,IT_PROG,4800,,103,60
107,Diana,Lorentz,DLORENTZ,590.423.5567,07-FEB-99,IT_PROG,4200,,103,60
108,Nancy,Greenberg,NGREENBE,515.124.4569,17-AUG-94,FI_MGR,12000,,101,100
109,Daniel,Faviet,DFAVIET,515.124.4169,16-AUG-94,ACCOUNT,9000,,108,100
110,John,Chen,JCHEN,515.124.4269,28-SEP-97,ACCOUNT,8200,,108,100
111,Ismael,Sciarra,ISCIARRA,515.124.4369,30-SEP-97,ACCOUNT,7700,,108,100
112,Jose Manuel,Urman,JMURMAN,515.124.4469,07-MAR-98,ACCOUNT,7800,,108,100
113,Luis,Popp,LPOPP,515.124.4567,07-DEC-99,ACCOUNT,6900,,108,100
114,Den,Raphaely,DRAPHEAL,515.127.4561,07-DEC-94,PU_MAN,11000,,100,30
115,Alexander,Khoo,AKHOO,515.127.4562,18-MAY-95,PU_CLERK,3100,,114,30
AWK field seperator: -Fpbaniya@dbarepublic.com:$ awk -F "," '{print $1, $2 }' employee_data.txt
100 Steven
101 Neena
102 Lex
103 Alexander
104 Bruce
105 David
106 Valli
107 Diana
108 Nancy
109 Daniel
110 John
111 Ismael
112 Jose Manuel
113 Luis
114 Den
115 Alexander
To everyone, picture talks thousands words; for database developer, developer, and dba, these examples talk million words. No explanation is needed. I hope these examples are useful to you as they were for me.
Interested in working with me? I can be reached at pbaniya04[at]gmail.com for any questions, consulting opportunities or you may drop a line to say HELLO. Thank your again for visiting my blog and looking forward to serving you more.
Have a Database-ious Day!
Have a Database-ious Day!
No comments