Top examples of awk command in unix: How to make use of awk to fetch meaningful information from raw log data

Meaning of some of the Awk Built-in Variables used below:
---------------------------------------------------------------------------------------

NF : Number of fields in current line/record
NR : Ordial number of current line/record
FS : Field Separator (Also -F can be used)
OFS : Output Field Separator (default=blank)
FILENAME : Name of current input file

All of following Awk one liners are based on the following input file 'test1.txt'
--------------------------------------------------------------------------------------------------------------

$ cat test1.txt
Continent:Val
AS:12000
AF:9800
AS:12300
NA:3400
OC:12000
AF:500
AS:1000

A) Print 'line number' NR and 'Number of fields' NF for each line

$ awk -F ":" '{print NR,NF}' test1.txt
1 2
2 2
3 2
4 2
5 2
6 2
7 2
8 2

B) Print first field, colon delimited

$ awk -F ":" '{print $1}' test1.txt
Continent
AS
AF
AS
NA
OC
AF
AS

C) Same as B, but excluding the 'first line' (NR!=1)

$ awk -F ":" 'NR!=1 {print $1}' test1.txt
AS
AF
AS
NA
OC
AF
AS

D) Same as B but only for line number 1 (NR==1)

$ awk -F ":" 'NR==1 {print $1}' test1.txt
Continent

E) Print first and second field

$ awk -F ":" 'NR!=1 {print $1,$2}' test1.txt
AS 12000
AF 9800
AS 12300
NA 3400
OC 12000
AF 500
AS 1000

F) Setting output field separator as pipe

$ awk -F ":" 'BEGIN{OFS="|"} NR!=1 {print $1,$2}' test1.txt
AS|12000
AF|9800
AS|12300
NA|3400
OC|12000
AF|500
AS|1000

G) FS and OFS can be included in BEGIN section

$ awk 'BEGIN{FS=":"; OFS="|"} NR!=1 {print $1,$2}' test1.txt
AS|12000
AF|9800
AS|12300
NA|3400
OC|12000
AF|500
AS|1000

H) Anything on BEGIN executes first

$ awk 'BEGIN{FS=":"; OFS="|"; print "Con|SomeVal"} NR!=1 {print $1,$2}' test1.txt
Con|SomeVal
AS|12000
AF|9800
AS|12300
NA|3400
OC|12000
AF|500
AS|1000

I) Printing FILENAME, will be printed for all the lines

$ awk -F ":" '{print FILENAME}' test1.txt
test1.txt
test1.txt
test1.txt
test1.txt
test1.txt
test1.txt
test1.txt
test1.txt

#Same as above but printing only last instance using END clause

$ awk -F ":" ' END {print FILENAME}' test1.txt
test1.txt

J) Revisiting NF, number of fields in each line

$ awk -F ":" '{print NF}' test1.txt
2
2
2
2
2
2
2
2

K) Printing the last field of the file, same as printing $2 as there are only 2 fields

$ awk -F ":" '{print $NF}' test1.txt
Val
12000
9800
12300
3400
12000
500
1000

L) Matching, printing lines begin with "AS"

$ awk -F ":" '/^AS/' test1.txt
AS:12000
AS:12300
AS:1000

#'Not condition' of the L

$ awk -F ":" '!/^AS/' test1.txt
Continent:Val
AF:9800
NA:3400
OC:12000
AF:500

#Direct matching, first field as "AS"

$ awk -F ":" '$1=="AS"' test1.txt
AS:12000
AS:12300
AS:1000

$ awk -F ":" '$1=="AS" {print $2}' test1.txt
12000
12300
1000

#$0 prints the full line, same as {print}

$ awk -F ":" '$1=="AS" {print $0}' test1.txt
AS:12000
AS:12300
AS:1000

$ awk -F ":" '$1=="AS" {print}' test1.txt
AS:12000
AS:12300
AS:1000

M) 'Or' condition

$ awk -F ":" '$1=="AS" || $1=="OC" {print}' test1.txt
AS:12000
AS:12300
OC:12000
AS:1000

# 'Or' and 'AND' together

$ awk -F ":" '($1=="AS" || $1=="OC") && $NF > 11000 {print}' test1.txt
AS:12000
AS:12300
OC:12000

$ awk -F ":" '($1=="AS" || $1=="OC") && $NF > 11000 {print $2-10000}' test1.txt
2000
2300
2000

$ awk -F ":" '($1=="AS" || $1=="OC") && $NF > 11000 {print ($2-($2%100))}' test1.txt
12000
12300
12000

N) Partial Matching

$ awk -F ":" '$1 ~ /AS/ {print}' test1.txt
AS:12000
AS:12300
AS:1000

$ awk -F ":" '$1 ~ /A/ {print}' test1.txt
AS:12000
AF:9800
AS:12300
NA:3400
AF:500
AS:1000

O) Reading from STDOUT

$ cat test1.txt | awk -F ":" '!/Continent/ {print $1}' | sort | uniq
AF
AS
NA
OC

P) Add value 1000 to the 2nd field, where first field is "AF" and then print the output file

$ awk -F ":" '$1=="AF" {$2+=1000} {print}' test1.txt
Continent:Val
AS:12000
AF 10800
AS:12300
NA:3400
OC:12000
AF 1500
AS:1000

#As no OFS is mentioned above, by default, OFS is blank, specifying it now

$ awk -F ":" 'BEGIN {OFS=":"} $1=="AF" {$2+=1000} {print}' test1.txt
Continent:Val
AS:12000
AF:10800
AS:12300
NA:3400
OC:12000
AF:1500
AS:1000

Q) Sum of 2nd fields, exclude first line

$ awk -F ":" 'NR!=1 {sum+=$NF} END {print sum}' test1.txt
51000

#If END is not mentioned ?

$ awk -F ":" 'NR!=1 {sum+=$NF} {print sum}' test1.txt
12000
21800
34100
37500
49500
50000
51000

#Average of 2nd field, as first field is excluded, (NR-1) instead of NR for total number of items

$ awk -F ":" 'NR!=1 {sum+=$NF} END {print sum/(NR-1)}' test1.txt
7285.71

$ awk -F ":" 'NR!=1 && $1=="AS" {sum+=$NF} END {print sum}' test1.txt
25300

R) 'Group by' and 'Count' using associative array in awk

$ awk ' BEGIN {FS=OFS=":"}
NR==1 {print "Continent:Count:Sum(val)"}
NR>1 {a[$1]++;b[$1]=b[$1]+$2} END{for (i in a) print i,a[i],b[i]}' test1.txt

Continent:Count:Sum(val)
OC:1:12000
NA:1:3400
AF:2:10300
AS:3:25300

#Count of each of the continents(1st field)

$ awk -F ":" '
NR==1 {next}
NR>1 {a[$1]++} END{for (i in a) print i,a[i]}' test1.txt
OC 1
NA 1
AF 2
AS 3

#An alternative

$ awk -F ":" 'NR!=1 {print $1}' test1.txt | sort | uniq -c
2 AF
3 AS
1 NA
1 OC

#Reading from STDOUT

$ cat test1.txt | awk 'BEGIN {OFS=":"} {print NR-1,$0}'
0:Continent:Val
1:AS:12000
2:AF:9800
3:AS:12300
4:NA:3400
5:OC:12000
6:AF:500
7:AS:1000

S) Accessing external variable in awk:

$ awk -F ":" '$1=="AS" {print $2}' test1.txt
12000
12300
1000

#Suppose value of filterval is "AS"

$ filterval="AS"
$ echo $filterval
AS

Append text to filename using Awk:
----------------------------------------------------

I have some files in my current directory whose file-name is of this pattern:
$ ls -1
log.1024.94.1326776200.1326776300.172.16.12.6.1326844995.0.s-1326528000.r-8192.txt
log.1024.94.1326776400.1326776400.172.16.12.5.1326844995.0.s-1326528000.r-2234.txt
log.1024.95.1326776420.1326776460.172.16.12.5.1326844995.0.s-1326528000.r-8192.txt
Requirement: Append a text "MY-2" as the 6th field (dot delimited) of the filename. E.g.

log.1024.94.1326776200.1326776300.172.16.12.6.1326844995.0.s-1326528000.r-8192.txt
should be renamed to
log.1024.94.1326776200.1326776300.MY-2.172.16.12.6.1326844995.0.s-1326528000.r-8192.txt
A bash script using awk to achieve this:
for file in $(ls)
do
newfilename=$(echo $file | awk 'BEGIN {FS=OFS="."} {$6="MY-2" OFS $6} {print}')
mv -v $file $newfilename
done
All the files are renamed to:
$ ls -1
log.1024.94.1326776200.1326776300.MY-2.172.16.12.6.1326844995.0.s-1326528000.r-8192.txt
log.1024.94.1326776400.1326776400.MY-2.172.16.12.5.1326844995.0.s-1326528000.r-2234.txt
log.1024.95.1326776420.1326776460.MY-2.172.16.12.5.1326844995.0.s-1326528000.r-8192.txt

Time bucketing using Awk in Unix:
---------------------------------------------------

Each line of input file file.txt is having UNIX epoch timestamp and some value.
$ cat file.txt
Epoch,Value
1351605000,120
1351605060,130
1351605120,340
1351605180,200
1351605240,120
1351605300,890
1351605360,124
1351605420,450
1351605480,120
1351605540,120
1351605600,200
1351605660,120
1351605720,340
1351605780,670
1351605840,990
The UNIX epoch timestamps are of 1 minute interval. Converting the UNIX epoch timestamps to human readable format using awk 'strftime' function:
$ awk 'BEGIN {FS=OFS=","} NR != 1 {$1=strftime("%c",$1)} {print}' file.txt
Output:
Epoch,Value
Tue 30 Oct 2012 01:50:00 PM GMT,120
Tue 30 Oct 2012 01:51:00 PM GMT,130
Tue 30 Oct 2012 01:52:00 PM GMT,340
Tue 30 Oct 2012 01:53:00 PM GMT,200
Tue 30 Oct 2012 01:54:00 PM GMT,120
Tue 30 Oct 2012 01:55:00 PM GMT,890
Tue 30 Oct 2012 01:56:00 PM GMT,124
Tue 30 Oct 2012 01:57:00 PM GMT,450
Tue 30 Oct 2012 01:58:00 PM GMT,120
Tue 30 Oct 2012 01:59:00 PM GMT,120
Tue 30 Oct 2012 02:00:00 PM GMT,200
Tue 30 Oct 2012 02:01:00 PM GMT,120
Tue 30 Oct 2012 02:02:00 PM GMT,340
Tue 30 Oct 2012 02:03:00 PM GMT,670
Tue 30 Oct 2012 02:04:00 PM GMT,990
Required: Group each 5 minutes entries together to one time bucket and perform a corresponding sum(Value) of the entries. E.g. All entries from '01:50:00 PM' (1351605000) till '01:54:00 PM' (1351605240) should be grouped together to '01:50:00 PM' (1351605000) time bucket, '01:55:00 PM' (1351605300) till '01:59:00 PM' (1351605540) to '01:55:00 PM' (1351605300) bucket and so on.
$ awk 'BEGIN {FS=OFS=","} NR != 1 {arr[$1-($1%300)]+=$2} END {for (i in arr) print i,arr[i]}' file.txt
Output:

1351605000,910
1351605300,1704
1351605600,2320
Converting UNIX epoch timestamps to human readable format:
$ awk 'BEGIN {FS=OFS=","} NR != 1 {arr[$1-($1%300)]+=$2} END {for (i in arr) print i,arr[i]}' file.txt | awk '
BEGIN {FS=OFS=","} {$1=strftime("%c",$1)} {print}'

Separate last field with spaces :
----------------------------------------------

Each line of following file 'file.txt' actually has total 4 fields/columns. As you can see some of the values of the last field (4th field, header : Description) has spaces in them.
$ cat file.txt
ID Name Active Description
2312 DEMO-1 1 Demo for VOD LIVE
1245 DEMO-4 1 LIVE
1002 CUST2 0 VOD Event
19000 DEMO-2 1 VOD
189 DEMO-3 1 Demo for LIVE
Required output:
We need to separate the 4 fields with a comma (,) delimiter such that all the values after 3rd field is considered as 4th field/column. i.e. Required output is something like this:
ID, Name, Active, Description
2312, DEMO-1, 1, Demo for VOD LIVE
1245, DEMO-4, 1, LIVE
1002, CUST2, 0, VOD Event
19000, DEMO-2, 1, VOD
189, DEMO-3, 1, Demo for LIVE
One way to achieve this is to add comma (,) to each of the first 3 fields of every line/row.
$ awk ' { for ( i=1;i<=3;i++ ) {$i=$i","} print }' file.txt
Output:
ID, Name, Active, Description
2312, DEMO-1, 1, Demo for VOD LIVE
1245, DEMO-4, 1, LIVE
1002, CUST2, 0, VOD Event
19000, DEMO-2, 1, VOD
189, DEMO-3, 1, Demo for LIVE
Any other alternative ? Feel free to post in the comment section, much appreciated. Thanks.

And to print the last field of the above input file one solution using UNIX cut command:
$ cut -d" " -f4- file.txt
Output:
Description
Demo for VOD LIVE
LIVE
VOD Event
VOD
Demo for LIVE

Compare two numeric fields of two files:
----------------------------------------------------------

Following example will show how we can use UNIX paste and join command with awk to find difference of two numeric columns of two similar files.

e.g. Input file:
$ cat file1.txt
Aug-1,100
Aug-2,220
Aug-4,230
Aug-5,100

$ cat file2.txt
Aug-1,100
Aug-2,120
Aug-4,400
Aug-5,250
Required output:
As you can see both file1.txt and file2.txt has got first field (comma separated) in the same order. We will have to find out the corresponding (2nd field of file1.txt - 2nd field of file2.txt) and the output required is something like this:
Aug-1,100,100,0
Aug-2,220,120,100
Aug-4,230,400,-170
Aug-5,100,250,-150
Solution#1:
$ paste -d "," file1.txt file2.txt
Aug-1,100,Aug-1,100
Aug-2,220,Aug-2,120
Aug-4,230,Aug-4,400
Aug-5,100,Aug-5,250

$ paste -d "," file1.txt file2.txt | awk 'BEGIN {FS=OFS=","} {print $1,$2,$NF,$2-$NF}'
Aug-1,100,100,0
Aug-2,220,120,100
Aug-4,230,400,-170
Aug-5,100,250,-150
Solution#2:
$ join -t, -1 1 -2 1 file1.txt file2.txt
Aug-1,100,100
Aug-2,220,120
Aug-4,230,400
Aug-5,100,250

$ join -t, -1 1 -2 1 file1.txt file2.txt | awk ' BEGIN {FS=OFS=","} {print $0,$2-$NF}'
Aug-1,100,100,0
Aug-2,220,120,100
Aug-4,230,400,-170
Aug-5,100,250,-150

Convert epoch to date in the same file:
-------------------------------------------------------

Example 1)
Input file:
$ cat test1.txt
2|Z|1219071600|AF|0
3|N|1219158000|AF|89
4|N|1220799600|AS|12
1|Z|1220886000|AS|67
5|N|1220972400|EU|23
6|R|1221058800|OC|89

Required output:
2|Z|Mon 18 Aug 2008 03:00:00 PM UTC|AF|0
3|N|Tue 19 Aug 2008 03:00:00 PM UTC|AF|89
4|N|Sun 07 Sep 2008 03:00:00 PM UTC|AS|12
1|Z|Mon 08 Sep 2008 03:00:00 PM UTC|AS|67
5|N|Tue 09 Sep 2008 03:00:00 PM UTC|EU|23
6|R|Wed 10 Sep 2008 03:00:00 PM UTC|OC|89

i.e. convert the UNIX epoch values on 3 rd field of the above file to standard human readable date format using Awk.

The Awk solution using "strftime" function:
$ awk 'BEGIN {FS=OFS="|"}{$3=strftime("%c",$3)} {print}' test1.txt

More about strftime format specifiers can be found here

Example 2)
Input file:
$ cat test2.txt
2|Z|time:1219071600|AF|0
3|N|time:1219158000|AF|89
4|N|time:1220799600|AS|12
1|Z|time:1220886000|AS|67
5|N|time:1220972400|EU|23
6|R|time:1221058800|OC|89

Required output:
2|Z|time:Mon 18 Aug 2008 03:00:00 PM UTC|AF|0
3|N|time:Tue 19 Aug 2008 03:00:00 PM UTC|AF|89
4|N|time:Sun 07 Sep 2008 03:00:00 PM UTC|AS|12
1|Z|time:Mon 08 Sep 2008 03:00:00 PM UTC|AS|67
5|N|time:Tue 09 Sep 2008 03:00:00 PM UTC|EU|23
6|R|time:Wed 10 Sep 2008 03:00:00 PM UTC|OC|89

The awk solution:
$ awk 'BEGIN {FS=OFS="|"}{$3="time:"strftime("%c",substr($3,6,10))} {print}' test2.txt

Replace n-th occurrence of pattern:
----------------------------------------------------

Input file:
$ cat file.txt
FR 24
AA 33
EE 34
EE 46
BE 30
AA 31
DE 90
AL 10
AA 50
FR 67
EE 94
AA 80

Required:
Replace the "third" occurrence of first field "AA" with text "XX" in the above file.

Awk Solutions:
$ awk '$1=="AA" {
count++
if(count == 3){
sub("AA","XX",$1)
}
}
{print}' file.txt

Output:
FR 24
AA 33
EE 34
EE 46
BE 30
AA 31
DE 90
AL 10
XX 50
FR 67
EE 94
AA 80

In-case you want to add 100 to the 2nd field of that line where first field "AA" has occurred for the 3rd time, here is a way:
$ awk '$1=="AA" {
count++
if(count == 3){
$2=$2+100
}
}
{print}' file.txt

Output:
FR 24
AA 33
EE 34
EE 46
BE 30
AA 31
DE 90
AL 10
AA 150
FR 67
EE 94
AA 80

Print up-to nth occurrence of pattern :
--------------------------------------------------------

Input file:
$ cat file.txt
FR 24
AA 33
AA 36
EE 34
BE 30
AA 31
DE 90
AL 10
AA 50
FR 67

Required: Print the portion of the above file up-to the 3rd occurrence of the pattern "AA" as the first field.
$ awk '
$1=="AA" {c++} {print $0}
c==3 {exit}
' file.txt

Output:
FR 24
AA 33
AA 36
EE 34
BE 30
AA 31

Sum of multiple columns of file:
-----------------------------------------------

Input file:
$ cat file.txt
500:120:100
100:120:700
200:900:125
120:120:900

Required:
Compute the sum of individual columns of the above file. i.e. required output:

920:1260:1825

Awk solution - 1:

$ awk 'BEGIN {FS=OFS=":"}
NR == 1 { n1 = $1; n2 = $2; n3 = $3; next }
{ n1 += $1; n2 += $2; n3 += $3 }
END { print n1, n2, n3 }' file.txt

Output:

920:1260:1825

Awk solution - 2:

$ awk -F ":" '
{ for (i=1; i<=NF; ++i) sum[i] += $i; j=NF }
END { for (i=1; i <= j; ++i) printf "%s ", sum[i]; printf "\n"; }
' file.txt

A newbie tutorial on UNIX Awk utility

Sharing a few Awk one liners (very basic and random) which will help the Awk newbies to gain a quick confidence on one of the beautiful UNIX tools - Awk. Some of my recommendations for learning UNIX Awk are mentioned at the bottom of this post. Have a great time with Awk.

Check equality of multiple numbers:
-----------------------------------------------------

My input file 'file.txt' contains 4 values of a certain metric for each of the following 'Continents'.

$ cat file.txt
Continent Val1 Val2 Val3 Val4
AS 440518 440518 440516 440516
AF 253317 253317 253315 253317
EU 245397 245397 245397 245397
OC 226410 226410 226410 226410
NA 221961 221961 221962 221961

Required : I was required to find out only those 'Continents' for which 'all' values are 'same'.

Solutions:

1) Using awk:

$ awk '
/^Continent/ {print $1; next}
$2==$3 && $3==$4 && $4==$5 {print $1}
' file.txt

Output:

Continent
EU
OC

2) Wrote this python program using python 'sets' (Unordered collections of unique elements) to achieve the same. Something like:

from sets import Set
for line in open("file.txt"):
if line.startswith('Continent'):
print line.split()[0]
firstfield = line.split()[0]
remaining = line.split()[1:]
vals = Set(remaining)
if len(vals) == 1:
print firstfield

Executing it:

$ python printequal.py
Continent
EU
OC

Print particular instances of a file:
--------------------------------------------------

I am requesting everyone to provide a better alternative (in any scripting language) to this problem. Thanks in advance.

My input file has the following format:
- A instance is a combination of 'h' 'v' and one or more 'i' lines.
- All lines starting with 'h' are header lines and the 3rd field in that line is the 'header number'.
- All lines starting with 'v' are version lines and the second field in that line is the 'version number'.

$ cat file.txt
h,1,100
v,1
i,rt,200
i,rt,210
i,rt,810
h,1,101
v,5
i,rt,500
i,rt,700
h,1,100
v,2
i,rt,100
i,rt,910
h,1,500
v,1
i,rt,190
h,1,100
v,1
i,rt,900
i,rt,210
h,1,300
v,1
i,rt,800
i,rt,210

Required:
- Print all the 'i' lines associated with header number '100' and version number '1'
i.e. required output:

i,rt,200
i,rt,210
i,rt,810
i,rt,900
i,rt,210

The quick solution I can think about is to associate 'header number' and 'version number' with all 'i' lines

$ awk -F "," '$1=="h" {h_value=$NF}
$1=="v" {v_value=$NF}
$1=="i" {print h_value,v_value,$0}
' file.txt

Output:

100 1 i,rt,200
100 1 i,rt,210
100 1 i,rt,810
101 5 i,rt,500
101 5 i,rt,700
100 2 i,rt,100
100 2 i,rt,910
500 1 i,rt,190
100 1 i,rt,900
100 1 i,rt,210
300 1 i,rt,800
300 1 i,rt,210

And then print only the lines with header number=100 and version number=1.

$ awk -F "," '$1=="h" {h_value=$NF}
$1=="v" {v_value=$NF}
$1=="i" {print h_value,v_value,$0}
' file.txt | awk '$1==100 && $2==1 {print $NF}'

Output:

i,rt,200
i,rt,210
i,rt,810
i,rt,900
i,rt,210

Print double quotes in unix Awk

I have already described in one of the earlier posts on how to print string within single quote in Awk print statement. Here is how we can print strings without "double quotes" in Awk.
$ cat file.txt
6289693505455 Plan_DAIL_30D_AA
6289693505475 Plan_DAIL_30D_AA
6289693505462 Plan_DAIL_30D_AB
Output required:
Plan_DAIL_30D_AA "6289693505455"
Plan_DAIL_30D_AA "6289693505475"
Plan_DAIL_30D_AB "6289693505462"
i.e. Print the first field of the file within "double quotes".
Here are some of the alternatives:
$ awk '{print $2,$1}' file.txt
Plan_DAIL_30D_AA 6289693505455
Plan_DAIL_30D_AA 6289693505475
Plan_DAIL_30D_AB 6289693505462

$ awk '{print $2,"\""$1"\""}' file.txt
Plan_DAIL_30D_AA "6289693505455"
Plan_DAIL_30D_AA "6289693505475"
Plan_DAIL_30D_AB "6289693505462"

#Assigning the quotes sequence to a variable x
$ awk -v x="\"" '{print $2,x$1x}' file.txt
Plan_DAIL_30D_AA "6289693505455"
Plan_DAIL_30D_AA "6289693505475"
Plan_DAIL_30D_AB "6289693505462"

#Using octal code of double quotes
$ awk '{print $2,"\042"$1"\042"}' file.txt
Plan_DAIL_30D_AA "6289693505455"
Plan_DAIL_30D_AA "6289693505475"
Plan_DAIL_30D_AB "6289693505462"

#Using ASCII code of double quotes
$ awk '{print $2,"\x22"$1"\x22"}' file.txt
Plan_DAIL_30D_AA "6289693505455"
Plan_DAIL_30D_AA "6289693505475"
Plan_DAIL_30D_AB “6289693505462"

Sum two times of hh-mm-ss format:
----------------------------------------------------

2nd and 3rd field for input file 'file.txt' is in hh:mm:ss format.

$ cat file.txt
#slno round1 round2
505 01:54:15 00:24:05
509 01:34:39 00:23:03
503 01:51:55 00:22:55
503 01:45:10 00:12:15

Required:

- Sum $2 and $3 of the above file and print it as the 4th field in 'seconds' format.

The awk solution:

$ awk '
function convert(t) {

split(t,Arr,":")
return Arr[1]*3600+Arr[2]*60+Arr[3]

}

/^#/ {print $0,"\ttotal(sec)"; next}
{print $0,"\t",convert($2)+convert($3) }
' file.txt

Output:

#slno round1 round2 total(sec)
505 01:54:15 00:24:05 8300
509 01:34:39 00:23:03 7062
503 01:51:55 00:22:55 8090
503 01:45:10 00:12:15 7045

Sum numbers in each row:
-----------------------------------------

Input file contains the scores of few students in certain rounds of a game in the following format.

id,Name,score1,score2,score3 etc

$ cat file.txt
id9,Mohit Kishore,19,13,14,10
id2,Niraj Kumar,13,8,23,8
id8,Kate Nil,19,18,15
id4,Rashi S,19,28,65,10,19

Required output:

Calculate the sum and average score of each student (notice that the number of rounds played by each student is not constant, few played 3 rounds, few 4 etc)

The awk script:

$ awk '
BEGIN {FS=OFS=","}
{
sum=0; n=0
for(i=3;i<=NF;i++)
{sum+=$i; ++n}
print $0,"sum:"sum,"count:"n,"avg:"sum/n
}' file.txt

Output:

id9,Mohit Kishore,19,13,14,10,sum:56,count:4,avg:14
id2,Niraj Kumar,13,8,23,8,sum:52,count:4,avg:13
id8,Kate Nil,19,18,15,sum:52,count:3,avg:17.3333
id4,Rashi S,19,28,65,10,19,sum:141,count:5,avg:28.2
Awk - Print first occurrence of a set

Input file 'order-pref.txt' contains two sql UPDATE commands for each item.

$ cat order-pref.txt
computer networking book
UPDATE orders SET quantity = '5' WHERE id = '9';
UPDATE orders SET quantity = '5' WHERE id = '8';
puzzle and skill games
UPDATE orders SET quantity = '5' WHERE id = '99';
UPDATE orders SET quantity = '5' WHERE id = '98';
arithmetic sequence
UPDATE orders SET quantity = '5' WHERE id = '55';
UPDATE orders SET quantity = '5' WHERE id = '56';

Required:
For each item find and print the first UPDATE sql command (i.e. first occurrence of each set of sql commands for each item).

i.e. required output:

UPDATE orders SET quantity = '5' WHERE id = '9';
UPDATE orders SET quantity = '5' WHERE id = '99';
UPDATE orders SET quantity = '5' WHERE id = '55';

Awk solution:

$ awk '/^UPDATE/{++c;if(c%2==1)print}' order-pref.txt > o.sql

Output:

$ cat o.sql
UPDATE orders SET quantity = '5' WHERE id = '9';
UPDATE orders SET quantity = '5' WHERE id = '99';
UPDATE orders SET quantity = '5' WHERE id = '55’;

Expand entries in file using awk :
--------------------------------------------------

Input file:

$ cat data.txt
Manager3|sw5
Manager2|sw engg9,sw12
Manager1|sw1,sw4,sw2,sw engg0

Output required:

Manager3|sw5
Manager2|sw engg9
Manager2|sw12
Manager1|sw1
Manager1|sw4
Manager1|sw2
Manager1|sw engg0

I have already posted (using awk and another using python) the reverse solution of the above, i.e. converting the above expected output to the input file format.

Awk solution:

$ awk -F "|" '{
n=split($2,Arr,",")
for(i=1; i<=n;i++){printf "%s|%s\n",$1,Arr[i]}}
‘ data.txt

Printing single quote in awk:
------------------------------------------

Input file:

$ cat /tmp/file.txt
Computer programming:Zia:78
discrete mathematics:Nil:82
Quantum physics:Leni:91
biomedical engineering:Qureg:82
computer architecture:Anu:90

Required output:

Top in 'Computer programming' : 'Zia'
Top in 'discrete mathematics' : 'Nil'
Top in 'Quantum physics' : 'Leni'
Top in 'biomedical engineering' : 'Qureg'
Top in 'computer architecture' : 'Anu'

i.e.

Top in '1st field' : '2nd field'

Using awk variable assignment technique, i.e. assigning the value 'single quote' to the variable x below:

$ awk -F: -v x="'" '
{print "Top in",x$1x,":",x$2x}
' /tmp/file.txt

And to use "double-quote":

$ awk -F: -v x="\"" '
{print "Top in",x$1x,":",x$2x}
' /tmp/file.txt

Top in "Computer programming" : "Zia"
Top in "discrete mathematics" : "Nil"
Top in "Quantum physics" : "Leni"
Top in "biomedical engineering" : "Qureg"
Top in "computer architecture" : "Anu"

Another solution will be to use the ASCII for 'single colon':

$ awk -F: '
{print "Top in","\x27"$1"\x27",":","\x27"$2"\x27"}
' /tmp/file.txt

Since number of fields in the input file is very few, we can try this using 'sed'; something like:

$ sed "
s_$.*$:$.*$:$.*$_Top in '\1' : '\2'_g
" /tmp/file.txt

Numbering lines in file using awk:
--------------------------------------------------

Input file 'file.txt' contains names of few students.

$ cat file.txt
Sam G
Ashok Niak
Rosy M
Peter K
Sid Thom
Rasi Yad
Papu S
Niaraj J
Aloh N K
Nipu H
Quam L

Required output:

For the entries of the above file,
- add a serial number to each line
- Also add 'House' number such that all the students are group into total 4 houses in the following fashion:

Sl No,Name,House
1,Sam G,House1
2,Ashok Niak,House2
3,Rosy M,House3
4,Peter K,House4
5,Sid Thom,House1
6,Rasi Yad,House2
7,Papu S,House3
8,Niaraj J,House4
9,Aloh N K,House1
10,Nipu H,House2
11,Quam L,House3

The awk solution using awk NR variable:

$ awk '
BEGIN {OFS=","; print "Sl No,Name,House"}
{print NR,$0,"House"((NR-1)%4)+1}
' file.txt

Lets format the output for a better look:

$ awk '
BEGIN {
FORMAT="%-8s%-18s%s\n" ;
{printf FORMAT,"Sl No","Name","House"}
}
{printf FORMAT,NR,$0,"House"((NR-1)%4)+1}
' file.txt

Output:

Sl No Name House
1 Sam G House1
2 Ashok Niak House2
3 Rosy M House3
4 Peter K House4
5 Sid Thom House1
6 Rasi Yad House2
7 Papu S House3
8 Niaraj J House4
9 Aloh N K House1
10 Nipu H House2
11 Quam L House3

A Bash script for the same will be something like this:

#!/bin/sh
i=0
while read
do
echo "$((i+1)),$REPLY,House$((i++ % 4 + 1))"
done < file.txt

Output:

$ sh numbering.sh
1,Sam G,House1
2,Ashok Niak,House2
3,Rosy M,House3
4,Peter K,House4
5,Sid Thom,House1
6,Rasi Yad,House2
7,Papu S,House3
8,Niaraj J,House4
9,Aloh N K,House1
10,Nipu H,House2
11,Quam L,House3

Now a question:
What is that '$REPLY' in the above script ?

Answer: '$REPLY' is the default value when a variable is not supplied to read.

So the above script is same as:

#!/bin/sh
i=0
while read line
do
echo "$((i+1)),$line,House$((i++ % 4 + 1))"
done < file.txt

In general, numbering of the lines of a file can be done in several ways viz

Using UNIX/Linux nl(1) command - number lines of files

$ nl file.txt
1 Sam G
2 Ashok Niak
3 Rosy M
4 Peter K
5 Sid Thom
6 Rasi Yad
7 Papu S
8 Niaraj J
9 Aloh N K
10 Nipu H
11 Quam L

Using awk NR:

$ awk '{print "\t"NR"\t"$0}' file.txt
1 Sam G
2 Ashok Niak
3 Rosy M
4 Peter K
5 Sid Thom
6 Rasi Yad
7 Papu S
8 Niaraj J
9 Aloh N K
10 Nipu H
11 Quam L

Using sed syntax:

$ sed = file.txt | sed 'N;s/\n/\t/'
1 Sam G
2 Ashok Niak
3 Rosy M
4 Peter K
5 Sid Thom
6 Rasi Yad
7 Papu S
8 Niaraj J
9 Aloh N K
10 Nipu H
11 Quam L

Extract range of lines using sed awk bash:
------------------------------------------------------------

Below are few different ways to print or extract a section of a file based on line numbers.

Lets try to extract lines between line number 27 and line number 99 of input file 'file.txt'

Using sed editor:

$ sed -n '27,99 p' file.txt > /tmp/file1

Which is same as:

$ sed '27,99 !d' file.txt > /tmp/file2

Awk alternative : you can make use of awk NR variable

$ awk 'NR >= 27 && NR <= 99' file.txt > /tmp/file3

Using Linux/UNIX 'head' and 'tail' command:

$ head -99 file.txt | tail -73 > /tmp/file4

Which is basically:

$ head -99 file.txt | tail -$(((99-27)+1)) > /tmp/file5

In vi editor, we can use the following command in ex mode (open the main file 'file.txt' in vi):

:27,99 w! /tmp/file6

i.e. Write lines between line number 27 and line number 99 of main file 'file.txt' to file '/tmp/file6'

Perl alternative would be:

$ perl -ne 'print if 27..99' file.txt > /tmp/file7

And the solution using python:

$ python
Python 2.5.2 (r252:60911, Jul 22 2009, 15:35:03)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2

>>> fp = open("/tmp/file8","w")
>>> for i,line in enumerate(open("file.txt")):
... if i >= 26 and i < 99 :
... fp.write(line)
...
>>>

So the contents of all the output files produced (i.e /tmp/file[1-8]) will be the same (i.e. line number 27 to line number 99 of 'file.txt')

Exponential value is awk sum output:
------------------------------------------------------

In one of my Debian box with mawk 1.3.3 (mawk is an interpreter for the AWK Programming Language), if I try to add the 2nd fields of the following file using awk:

$ cat data.txt
a:99540232
b:89795683
a:08160808
c:0971544
d:99500728
a:12212539898
d:98065599
e:92640031
a:3129013
c:4085555

The output:

$ awk -F ":" '{sum+=$NF} END {print sum}' data.txt
1.27084e+10

So, awk is giving sum output as exponential format as seen above.

To get the above sum output in integer, here is a way:

$ awk -F ":" '{sum+=$NF} END { printf ("%0.0f\n", sum)} ‘ data.txt
12708429091

Split file vertically on columns:
-----------------------------------------------

I have already put a post on - how we can split a file into multiple sub-files based on different conditions (that was basically a horizontal splitting of file); lets see how we can split a file vertically.

Input file 'file.txt' is a csv file:

$ cat file.txt
A,B,C,D,E,F,G,H,I
1,2,3,4,5,6,7,8,9
I,II,III,IV,V,VII,VIII,IX
a,b,c,d,e,f,g,h,i

Required:

Split the above file into two sub-files such that 1st 3 columns are written to sub-file1 and rest of the columns to sub-file2.
i.e.

sub-file1 content will be

A,B,C
1,2,3
I,II,III
a,b,c

And sub-file2 content will be

D,E,F,G,H,I
4,5,6,7,8,9
IV,V,VII,VIII,IX
d,e,f,g,h,i

Well, this is a pretty simple task using Linux/UNIX cut command

#Printing first 3 columns of 'file.txt'
$ cut -d"," -f1-3 file.txt
or
$ cut -d"," -f-3 file.txt

and

#Printing from 4th column till end
$ cut -d"," -f4-9 file.txt
or
$ cut -d"," -f4- file.txt

Awk solution:

$ awk -F "," '
{
for(i=1;i<=NF;i++) {
if(i <= 3) {
printf "%s,", $i >> "sub-file1"
if(i==3){
printf "\n" >> "sub-file1"
}
} else {
printf "%s,", $i >> "sub-file2"
if(i==NF){
printf "\n" >> "sub-file2"
}
}
}
}' file.txt

Sub-files generated after running the above awk script:

$ cat sub-file1
A,B,C,
1,2,3,
I,II,III,
a,b,c,

$ cat sub-file2
D,E,F,G,H,I,
4,5,6,7,8,9,
IV,V,VII,VIII,IX,
d,e,f,g,h,i,

If else examples in awk:
-----------------------------------

Input file: Each line of 'num.txt' contains 2 numbers (say A and B).

$ cat num.txt
34,140
190,140
89,120
110,110
210,115

Required: Calculate and print percentage (A/B)*100 with the following conditions:

- If percentage is less than 100, print the calculated actual percentage
- If percentage is more than 100, print the percentage as 100

First solution:

$ awk '
BEGIN {FS=OFS=","}
{if($1>$2) {print $0,100}
else {print $0,($1/$2)*100}
}' num.txt

Output:
34,140,24.2857
190,140,100
89,120,74.1667
110,110,100
210,115,100

Lets do some text alignment and formatting using awk.

$ awk '
BEGIN {FS="," ; {printf "%-10s%-8s%s\n","A","B","% age"}}
{if($1>=$2) {printf "%-10s%-8s%s\n",$1,$2,100}
else {printf "%-10s%-8s%2.2f\n",$1,$2,($1/$2)*100}
}' num.txt

Output:
A B % age
34 140 24.29
190 140 100
89 120 74.17
110 110 100
210 115 100

Or a different look of the above script:

$ awk '
BEGIN {
FS="," ; FORMAT="%-10s%-8s%s\n" ;
{printf FORMAT,"A","B","% age"}
}
{
if($1>=$2) {printf FORMAT,$1,$2,100}
else {printf FORMAT,$1,$2,($1/$2)*100}
}' num.txt

Output:
A B % age
34 140 24.2857
190 140 100
89 120 74.1667
110 110 100
210 115 100

Another way of writing if else in AWK.

$ awk '
{printf("%-10s%-8s%2.2f\n",\
$1,$2, ($1<=$2) ? ($1/$2)*100 : 100)
}' FS="," num.txt

Output:
34 140 24.29
190 140 100.00
89 120 74.17
110 110 100.00
210 115 100.00

Print except few columns using awk :
------------------------------------------------------

Input file:

$ cat details.txt
AX|23.45|1932323|A|VI|-|Y|0
TY|93.45|2932323|B|VI|-|Y|1
RE|63.25|8932323|A|VI|0|N|1
AY|83.85|0932323|C|VI|-|Y|0

Required:
Print all columns from the above file except column number 2 and 7.
i.e. required output:

AX|1932323|A|VI|-|0
TY|2932323|B|VI|-|1
RE|8932323|A|VI|0|1
AY|0932323|C|VI|-|0

Basically for the above file we have to print column # 1,3,4,5,6,8

i.e.

$ awk '
BEGIN{FS=OFS="|"}{print $1,$3,$4,$5,$6,$8}
' details.txt

But if number of fields is very large on the input file, the above method is not going to be so useful. So here is another technique.

$ awk '
BEGIN{FS=OFS="|"}
{ for (i=1; i<=NF;i++)
if( i==2 || i==7 ) continue
else
printf("%s%s", $i,(i!=NF) ? OFS : ORS)}
' details.txt

And if you want to exclude a range of column numbers (say exclude column 3 to column 6) here is my earlier post

An additional tip:
Suppose you need to generate the print 'statement' for printing a number of consecutive fields for an awk program, here is a quick way:

$ seq -s ",$" 1 8 | sed 's/.*/{print $&}/'

Output:

{print $1,$2,$3,$4,$5,$6,$7,$8}

Split file using awk:
--------------------------------

One of good use of awk is splitting files (based on different conditions) into sub-files.
Lets see some examples:

Example 1:

Input file log1.txt:
- Line starting with H is the main header line.
- Line starting with "h" and subsequent lines starting with "s" (till the next "h" line) are part of the same entry/section.

$ cat log1.txt
H,555,etho0
h,1,3
s,1233,456456,1212
s,4251,452456,7215
s,6283,851456,1219
h,9,2
s,2233,156456,1912
s,9233,256456,8212
h,2,4
s,4233,456456,1212
s,7251,252456,7215
s,1288,851456,9219
s,9183,851456,6219

Required:

- Split or subdivide the above file into sub files corresponding to each entry (one entry being the section starting with "h" and "s" lines till the next "h" line)
- Each sub-file should contain(start with) the main header line ("H" line).
- The required output is 3 sub files with the following contents and filename convention.

$ cat 555.1.1.log
H,555,etho0
h,1,3
s,1233,456456,1212
s,4251,452456,7215
s,6283,851456,1219

$ cat 555.9.2.log
H,555,etho0
h,9,2
s,2233,156456,1912
s,9233,256456,8212

$ cat 555.2.3.log
H,555,etho0
h,2,4
s,4233,456456,1212
s,7251,252456,7215
s,1288,851456,9219
s,9183,851456,6219

The awk program:

$ awk -F "," '
$1=="H" {mainH=$0;id=$2;next}
/^h/{
hid=$2;close(id"."hid"."f".log")
f++
print mainH > id"."hid"."f".log"
}
{print $0 > id"."hid"."f".log"}
' log1.txt

Example 2:

Input file:

$ cat log.txt
1252468812,yahoo,3.5
1252468812,hotmail,2.4
1252468819,yahoo,1.2
1252468812,msn,8.9
1252468923,gmail,12
1252468819,live,3.4
1252468929,yahoo,9.0
1252468929,msn,1.2

Required:

a) Split the above files based on the first field (i.e. lines with same first field should go to the same file)

The awk one liner:

$ awk -F "," '{close(f);f=$1}{print > f".txt"}' log.txt

Output:
Above file is splited into the following sub-files.

$ cat 1252468812.txt
1252468812,yahoo,3.5
1252468812,hotmail,2.4
1252468812,msn,8.9

$ cat 1252468819.txt
1252468819,yahoo,1.2
1252468819,live,3.4

$ cat 1252468923.txt
1252468923,gmail,12

$ cat 1252468929.txt
1252468929,yahoo,9.0
1252468929,msn,1.2

b) Send every 3 lines of above file into a sub file.

The awk code:

$ awk '{print >("log_" int((NR+2)/3))}' log.txt

Output:
The sub-files generated.

$ cat log_1
1252468812,yahoo,3.5
1252468812,hotmail,2.4
1252468819,yahoo,1.2

$ cat log_2
1252468812,msn,8.9
1252468923,gmail,12
1252468819,live,3.4

$ cat log_3
1252468929,yahoo,9.0
1252468929,msn,1.2

Replace duplicate line with blank :
--------------------------------------------------

I just received an query as a comment on one of my older post on "removing duplicates based on fields using awk"

Question was:

Any Idea on how to replace duplicate line with blank line instead of deleting them?
e.g.

Input:

test1
test1
test2
test2
test2
test3

Output:

test1

test2

test3

Thought of making it a separate post here.

The solution using awk:

$ awk 'x[$0]++ {$0=""} {print}’ file.txt

GROUP BY clause functionality in awk - bash

Q: For a sample two-column data below (cont_bd.txt), how to sum the second column and group by the first column:

Input file:

$ cat cont_bd.txt
continent:mval
SA:2345
AF:123
SA:89
OC:890
EU:24
AF:90
NA:5678
AF:345
OC:90
OC:23
SA:1234
EU:90
AF:12
SA:909

Awk solution for group by clause implementation:

$ awk 'BEGIN{FS=":"; print "continent count total avg"} NR!=1 {a[$1]++;b[$1]=b[$1]+$2}END{for (i in a) printf("%s %10.0f %10.0f %10.2f\n", i, a[i], b[i], b[i]/a[i])} ‘ cont_bd.txt

Bash rename command - rename multiple files

Bash rename command is very useful to rename files in bulk.

e.g.

$ ls
a.cpp b.cpp c.cpp d.cpp

Now to rename all the *.cpp to *.cpp.bak using rename

$ rename 's/\.cpp$/.cpp.bak/' *.cpp

$ ls
a.cpp.bak b.cpp.bak c.cpp.bak d.cpp.bak

Now to rename all files matching "*.bak" to strip the extension,

$ rename 's/\.bak$//' *.bak

$ ls
a.cpp b.cpp c.cpp d.cpp
Delete lines based on another file - awk

$ cat main.txt
ID1:A:45
ID2:B:12
ID4:C:12
ID3:D:56
ID7:F:90
ID9:K:14
ID5:P:32

$ cat filter.txt
ID7:0
ID3:0
ID4:0

Required output: Delete those lines from "main.txt" for which the ID field (first field) matched with that in "filter.txt". Basically the output file say rest.txt will be subtraction of filter.txt from main.txt.

The awk solution:

$ awk >rest.txt 'NR==FNR{arr[$1];next}!($1 in arr)' FS=":" filter.txt main.txt

or

$ awk >rest.txt 'NR==FNR{_[$1];next}!($1 in _)' FS=":" filter.txt main.txt

Result:

$ cat rest.txt
ID1:A:45
ID2:B:12
ID9:K:14
ID5:P:32

Matrix addition using awk in bash:
---------------------------------------------------

This is how can do matrix addition using awk ( basically adding columns of two files)

Input files:

$ cat mat1.txt
1 5 6
2 2 6
4 1 8

$ cat mat2.txt
4 5 3
2 4 5
2 4 6

Output required:
5 10 9
4 6 11
6 5 14

$ awk '
FNR==NR {
for(i=1; i<=NF; i++)
_[FNR,i]=$i
next
}
{
for(i=1; i<=NF; i++)
printf("%d%s", $i+_[FNR,i], (i==NF) ? "\n" : FS);
}' mat1.txt mat2.txt

Calculate sum and average of multiple lines - awk in bash:
---------------------------------------------------------------------------------

Input file:

$ cat details.txt
line1|5002|1200|90
line2|3002|4200|80
line3|5052|1600|90
line4|2006|3260|10

Required: Calculate and print the sum and averages of each fields in all the lines of details.txt

$ awk 'BEGIN {FS=OFS="|"} { print; for (i=2; i<=NF; ++i) sum[i] += $i; j=NF }
END { printf "%s%s", "------------------", "\ntotal"; for (i=2; i <= j; ++i) printf "%s%s", OFS, sum[i]; printf "\n"; }' details.txt

Output:
line1|5002|1200|90
line2|3002|4200|80
line3|5052|1600|90
line4|2006|3260|10
------------------
total|15062|10260|270

$ awk 'BEGIN {FS=OFS="|"} { print; for (i=2; i<=NF; ++i) sum[i] += $i; j=NF }
END { printf "%s%s", "------------------", "\nAvg"; for (i=2; i <= j; ++i) printf "%s%s", OFS, sum[i]/NR; printf "\n"; }' details.txt

Output:
line1|5002|1200|90
line2|3002|4200|80
line3|5052|1600|90
line4|2006|3260|10
------------------
Avg|3765.5|2565|67.5

Sort date in ddmmyyyy format - awk and bash script

Input file is having first field as ddmmyyyy format.

$ cat myf.dat
12082008;pull done;ret=34;Y
08072008;push hanged;s=3;N
15082008;pull done;ret=34;Y
01062008;psuh done;ret=23;Y
18082007;old entry;old;N

Required output: We need to sort the above file based on first field date in ddmmyyyy format; so that the final output after sort should be:

18082007;old entry;old;N
01062008;psuh done;ret=23;Y
08072008;push hanged;s=3;N
12082008;pull done;ret=34;Y
15082008;pull done;ret=34;Y

The solution is divided into 3 steps:

1) Adding a temporary field to the beginning. This field is nothing but the yyyymmdd format of the corresponding first field.

$ awk '{
tempfield=sprintf("%s%s%s",substr($1,5),substr($1,3,2),substr($1,1,2))
print tempfield","$0
}' FS=";" myf.dat

20080812,12082008;pull done;ret=34;Y
20080708,08072008;push hanged;s=3;N
20080815,15082008;pull done;ret=34;Y
20080601,01062008;psuh done;ret=23;Y
20070818,18082007;old entry;old;N

2) Now Doing a numeric sort.

$ awk '{
tempfield=sprintf("%s%s%s",substr($1,5),substr($1,3,2),substr($1,1,2))
print tempfield","$0
}' FS=";" myf.dat | sort -n

20070818,18082007;old entry;old;N
20080601,01062008;psuh done;ret=23;Y
20080708,08072008;push hanged;s=3;N
20080812,12082008;pull done;ret=34;Y
20080815,15082008;pull done;ret=34;Y

3) Removing the temporary field from beginning.

$ awk '{
tempfield=sprintf("%s%s%s",substr($1,5),substr($1,3,2),substr($1,1,2))
print tempfield","$0
}' FS=";" myf.dat | sort -n | cut -d"," -f2

18082007;old entry;old;N
01062008;psuh done;ret=23;Y
08072008;push hanged;s=3;N
12082008;pull done;ret=34;Y
15082008;pull done;ret=34;Y

Merging lines using awk - bash:
-----------------------------------------------

Input file:

$ cat myfile.txt
unix,bash
,scripting
awk is powerful
u1,u2
,u3,u4,u5

Objective: Merge the lines starting with comma "," with the previous line.
i.e. the output required.

unix,bash,scripting
awk is powerful
u1,u2,u3,u4,u5

Awk solution:

$ awk 'END { print r }
r && !/^,/ { print r; r = "" }
{ r = r ? r $0 : $0 }
' myfile.txt

Append text to filename using Awk

---------------------------------------------------

I have some files in my current directory whose file-name is of this pattern:

$ ls -1

log.1024.94.1326776200.1326776300.172.16.12.6.1326844995.0.s-1326528000.r-8192.txt

log.1024.94.1326776400.1326776400.172.16.12.5.1326844995.0.s-1326528000.r-2234.txt

log.1024.95.1326776420.1326776460.172.16.12.5.1326844995.0.s-1326528000.r-8192.txt

Requirement: Append a text "MY-2" as the 6th field (dot delimited) of the filename. E.g.

log.1024.94.1326776200.1326776300.172.16.12.6.1326844995.0.s-1326528000.r-8192.txt

should be renamed to

log.1024.94.1326776200.1326776300.MY-2.172.16.12.6.1326844995.0.s-1326528000.r-8192.txt

A bash script using awk to achieve this:

for file in $(ls)

newfilename=$(echo $file | awk 'BEGIN {FS=OFS="."} {$6="MY-2" OFS $6} {print}')

mv -v $file $newfilename

done

All the files are renamed to:

$ ls -1

log.1024.94.1326776200.1326776300.MY-2.172.16.12.6.1326844995.0.s-1326528000.r-8192.txt

log.1024.94.1326776400.1326776400.MY-2.172.16.12.5.1326844995.0.s-1326528000.r-2234.txt

log.1024.95.1326776420.1326776460.MY-2.172.16.12.5.1326844995.0.s-1326528000.r-8192.txt

Merge multiple consecutive lines

------------------------------------------------

Input file:

$ cat infile.txt

aid=33

pw=3

nn=90

aid=32

pw=30

nn=70

aid=56

pw=3

nn=93

Required:

Combine or merge every three consecutive lines of the above file so that the output becomes:

aid=33,pw=3,nn=90

aid=32,pw=30,nn=70

aid=56,pw=3,nn=93

Awk solution: If line number is divisible by 3 then put a new line(\n) else put a comma(,) i.e.

$ awk '{printf("%s%s", $0, (NR%3 ? "," : "\n"))}' infile.txt

aid=33,pw=3,nn=90

aid=32,pw=30,nn=70

aid=56,pw=3,nn=93

Another way using Awk:

$ awk 'NR%3{printf $0",";next;}1' infile.txt

aid=33,pw=3,nn=90

aid=32,pw=30,nn=70

aid=56,pw=3,nn=93

Using UNIX paste command:

$ paste -d"," - - - < infile.txt

aid=33,pw=3,nn=90

aid=32,pw=30,nn=70

aid=56,pw=3,nn=93

A bash command line solution:

$ while read line1; do read line2; read line3; echo "$line1,$line2,$line3"; done < infile.txt

aid=33,pw=3,nn=90

aid=32,pw=30,nn=70

aid=56,pw=3,nn=93

Time bucketing using Awk in Unix

---------------------------------------------------

Each line of input file file.txt is having UNIX epoch timestamp and some value.

$ cat file.txt

Epoch,Value

1351605000,120

1351605060,130

1351605120,340

1351605180,200

1351605240,120

1351605300,890

1351605360,124

1351605420,450

1351605480,120

1351605540,120

1351605600,200

1351605660,120

1351605720,340

1351605780,670

1351605840,990

The UNIX epoch timestamps are of 1 minute interval. Converting the UNIX epoch timestamps to human readable format using awk 'strftime' function:

$ awk 'BEGIN {FS=OFS=","} NR != 1 {$1=strftime("%c",$1)} {print}' file.txt

Output:

Epoch,Value

Tue 30 Oct 2012 01:50:00 PM GMT,120

Tue 30 Oct 2012 01:51:00 PM GMT,130

Tue 30 Oct 2012 01:52:00 PM GMT,340

Tue 30 Oct 2012 01:53:00 PM GMT,200

Tue 30 Oct 2012 01:54:00 PM GMT,120

Tue 30 Oct 2012 01:55:00 PM GMT,890

Tue 30 Oct 2012 01:56:00 PM GMT,124

Tue 30 Oct 2012 01:57:00 PM GMT,450

Tue 30 Oct 2012 01:58:00 PM GMT,120

Tue 30 Oct 2012 01:59:00 PM GMT,120

Tue 30 Oct 2012 02:00:00 PM GMT,200

Tue 30 Oct 2012 02:01:00 PM GMT,120

Tue 30 Oct 2012 02:02:00 PM GMT,340

Tue 30 Oct 2012 02:03:00 PM GMT,670

Tue 30 Oct 2012 02:04:00 PM GMT,990

Required: Group each 5 minutes entries together to one time bucket and perform a corresponding sum(Value) of the entries. E.g. All entries from '01:50:00 PM' (1351605000) till '01:54:00 PM' (1351605240) should be grouped together to '01:50:00 PM' (1351605000) time bucket, '01:55:00 PM' (1351605300) till '01:59:00 PM' (1351605540) to '01:55:00 PM' (1351605300) bucket and so on.

$ awk 'BEGIN {FS=OFS=","} NR != 1 {arr[$1-($1%300)]+=$2} END {for (i in arr) print i,arr[i]}' file.txt

Output:

1351605000,910

1351605300,1704

1351605600,2320

Converting UNIX epoch timestamps to human readable format:

$ awk 'BEGIN {FS=OFS=","} NR != 1 {arr[$1-($1%300)]+=$2} END {for (i in arr) print i,arr[i]}' file.txt | awk '

BEGIN {FS=OFS=","} {$1=strftime("%c",$1)} {print}'

#Accessing value of variable filterval inside Awk

$ awk -F ":" -v con=$filterval '$1==con {print $2}' test1.txt
12000
12300
1000

#Set 2nd value as 0 where first field is "AS"

$ awk -F ":" 'BEGIN {OFS=":"} $1=="AS" {$2=0} {print}' test1.txt
Continent:Val
AS:0
AF:9800
AS:0
NA:3400
OC:12000
AF:500
AS:0

Top examples of awk command in unix

Wednesday, 11 March 2015

How to make use of awk to fetch meaningful information from raw log data

No comments:

Post a Comment