Tuesday, November 22, 2005

Regular Expression in Vi, Sed, grep, egrep

Here are a few representative, simple examples.

    vi command                    What it does

    :%s/  */ /g                   Change 1 or more spaces into a single space.
    :%s/ *$//                     Remove all spaces from the end of the line.
    :%s/^/ /                      Insert a space at the beginning of every line.
    :%s/^[0-9][0-9]* //           Remove a number, and the space that follows it, from the beginning of a line.
    :%s/b[aeio]g/bug/g            Change all occurrences of bag, beg, big, and bog to bug.
    :%s/t\([aou]\)g/h\1t/g        Change all occurrences of tag, tog, and tug to hat, hot, and hut respectively.
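If you want to try these patterns outside the editor, a vi command of the form :%s/pat1/pat2/g corresponds, for basic patterns like these, to sed 's/pat1/pat2/g'. Here is a minimal sketch; the sample strings and variable names are made up for illustration.

```shell
# Collapse runs of spaces ("  *" is one space followed by zero or more spaces).
squeezed=$(printf 'a   b  c\n' | sed 's/  */ /g')

# Strip trailing spaces.
trimmed=$(printf 'end   \n' | sed 's/ *$//')

# Remove a leading number and the space after it.
unnumbered=$(printf '42 answer\n' | sed 's/^[0-9][0-9]* //')

# bag, beg, big, bog -> bug
bugs=$(printf 'bag beg big bog\n' | sed 's/b[aeio]g/bug/g')

# tag, tog, tug -> hat, hot, hut (\1 is the tagged vowel)
hats=$(printf 'tag tog tug\n' | sed 's/t\([aou]\)g/h\1t/g')

printf '%s\n' "$squeezed" "$trimmed" "$unnumbered" "$bugs" "$hats"
```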



Medium Examples (Strange Incantations)

Example 1

Change all instances of foo(a,b,c) to foo(b,a,c), where a, b, and c can be any parameters supplied to foo(). That is, we must be able to make changes like the following:

    Before                        After

    foo(10,7,2)                   foo(7,10,2)
    foo(x+13,y-2,10)              foo(y-2,x+13,10)
    foo( bar(8), x+y+z, 5)        foo( x+y+z, bar(8), 5)

The following substitution command will do the trick:

    :%s/foo(\([^,]*\),\([^,]*\),\([^)]*\))/foo(\2,\1,\3)/g

Now, let's break this apart and analyze what's happening. The idea behind this expression is to identify invocations of foo() with three parameters between the parentheses. The first parameter is identified by the regular expression \([^,]*\), which we can analyze from the inside out.

    [^,]            means any character which is not a comma
    [^,]*           means 0 or more characters which are not commas
    \([^,]*\)       tags the non-comma characters as \1 for use in the replacement part of the command
    \([^,]*\),      means that we must match 0 or more non-comma characters which are followed by a comma. The non-comma characters are tagged.

This is a good time to point out one of the most common problems people have with regular expressions. Why would we use an expression like [^,]*, instead of something more straightforward like .*, to match the first parameter? Consider applying the pattern .*, to the string "10,7,2". Should it match "10," or "10,7,"? To resolve this ambiguity, regular expressions always match the longest string possible. In this case that is "10,7,", which covers two parameters instead of the one we want. So, by using the expression [^,]*, we force the pattern to match all characters up to the first comma.
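To see the longest-match rule in action, the sketch below brackets whatever each pattern matched in the string 10,7,2. It uses sed, where & in the replacement stands for the whole match; the variable names are only for illustration.

```shell
# ".*," grabs as much as it can: everything through the LAST comma.
greedy=$(printf '10,7,2\n' | sed 's/.*,/[&]/')

# "[^,]*," cannot cross a comma, so it stops at the FIRST one.
negated=$(printf '10,7,2\n' | sed 's/[^,]*,/[&]/')

printf '%s\n' "$greedy" "$negated"
```

The first line printed shows two parameters inside the brackets, the second shows one.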

The expression up to this point is: foo(\([^,]*\), and can be roughly translated as "after you find foo( tag all characters up to the next comma as \1". We tag the second parameter just like the first and it can be referenced as \2. The tag used on the third parameter is exactly like the others except that we search for all characters up to the right parenthesis. It may be superfluous to search for the last parameter since we don't have to move it. But this pattern guarantees that we update only those instances of foo() where 3 parameters are specified. In these times of function and method overloading, being explicit often proves to be useful. In the substitution portion of the command, we explicitly enter the invocation of foo() as we want it, referencing the matched patterns in the new order where the first and second parameter have been switched.
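As a rough check of the explanation above, the same substitution can be run through sed on a few sample calls. This is a sketch only: swap_args is a hypothetical helper name, and foo(1,2) is an extra test case added here to show that a two-parameter call is left alone.

```shell
# Swap the first two parameters of every three-parameter foo() call.
swap_args() {
    sed 's/foo(\([^,]*\),\([^,]*\),\([^)]*\))/foo(\2,\1,\3)/g'
}

out=$(printf '%s\n' 'foo(10,7,2)' 'foo(x+13,y-2,10)' 'foo(1,2)' | swap_args)
printf '%s\n' "$out"
```

One caveat worth knowing: the pattern counts commas, so a parameter that itself contains a comma, such as foo(bar(1,2),3,4), would be split in the wrong place.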

Example 2

We have a CSV (comma separated value) file with information we need, but in the wrong format. The columns of data are currently arranged in the following order: Name, Company Name, State, Postal Code. We need to reorganize the data into the following order in order to use it with a particular piece of software: Name, State-Postal Code, Company Name. This means that we must change the order of the columns in addition to merging two columns to form a new column value. The particular piece of software that needs this data will not work if there are any whitespace characters (spaces or tabs) before or after the commas. So we must remove whitespace around the commas.

Here are a few lines from the data we have:

    Bill Jones, HI-TEK Corporation , CA, 95011
    Sharon Lee Smith, Design Works Incorporated, CA, 95012
    B. Amos , Hill Street Cafe, CA, 95013
    Alexander Weatherworth, The Crafts Store, CA, 95014
    ...

We need to transform them to look like this:

    Bill Jones,CA 95011,HI-TEK Corporation
    Sharon Lee Smith,CA 95012,Design Works Incorporated
    B. Amos,CA 95013,Hill Street Cafe
    Alexander Weatherworth,CA 95014,The Crafts Store
    ...

We'll look at two regular expressions to solve this problem. The first moves the columns around and merges the data. The second removes the excess spaces.

Here is the first pass at a substitution command that will solve the problem:

    :%s/\([^,]*\),\([^,]*\),\([^,]*\),\(.*\)/\1,\3 \4,\2/

The approach is similar to that of Example 1. The Name is matched by the expression \([^,]*\), that is, all characters up to the first comma. The name can then be referenced as \1 in the replacement pattern. The Company Name and State fields are matched just like the Name field and are referenced as \2 and \3 in the replacement pattern. The last field is matched with the expression \(.*\) which can be translated as "match all characters through the end of the line". The replacement pattern is constructed by calling out each tagged expression in the appropriate order and adding or not adding the delimiter.

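The following substitution command will remove the excess spaces. Run it before the reordering command; otherwise the space already in front of the postal code is kept in addition to the one the replacement inserts, and the space after the company name is left at the end of the line: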

    :%s/[ \t]*,[ \t]*/,/g

To break it down: [ \t] matches a space or tab character; [ \t]* matches 0 or more spaces or tabs; [ \t]*, matches 0 or more spaces or tabs followed by a comma; and finally [ \t]*,[ \t]* matches 0 or more spaces or tabs followed by a comma followed by 0 or more spaces or tabs. In the replacement pattern, we simply replace whatever we matched with a single comma. The optional g parameter is added to the end of the substitution command to apply the substitution to all commas in the line.
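Both substitutions can be tried together with sed. The sketch below makes two assumptions of its own: it runs the whitespace cleanup first, so that no stray spaces get carried into the reordered fields, and it uses the POSIX class [[:blank:]] (space or tab) in place of [ \t], because not every sed understands \t inside brackets. fix_csv is a hypothetical helper name.

```shell
fix_csv() {
    # Step 1: drop spaces and tabs around every comma.
    # Step 2: reorder to Name, State Postal Code, Company Name.
    sed -e 's/[[:blank:]]*,[[:blank:]]*/,/g' \
        -e 's/\([^,]*\),\([^,]*\),\([^,]*\),\(.*\)/\1,\3 \4,\2/'
}

out=$(printf '%s\n' \
    'Bill Jones, HI-TEK Corporation , CA, 95011' \
    'B. Amos , Hill Street Cafe, CA, 95013' | fix_csv)
printf '%s\n' "$out"
```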

Example 3

Suppose you have a multi-character sequence that repeats. For example, consider the following:

    Billy tried really hard
    Sally tried really really hard
    Timmy tried really really really hard
    Johnny tried really really really really hard

Now suppose you want to change "really", "really really", and any number of consecutive "really" strings to a single word: "very". The command

    :%s/\(really \)\(really \)*/very /

changes the text above to:

    Billy tried very hard
    Sally tried very hard
    Timmy tried very hard
    Johnny tried very hard

The expression \(really \)* matches 0 or more sequences of "really ". The sequence \(really \)\(really \)* matches one or more instances of the sequence "really ".
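The same expression works unchanged in sed, which makes for a quick way to check it. A minimal sketch on two of the sample lines:

```shell
# One "really " followed by zero or more further "really " becomes "very ".
out=$(printf '%s\n' \
    'Billy tried really hard' \
    'Timmy tried really really really hard' |
    sed 's/\(really \)\(really \)*/very /')
printf '%s\n' "$out"
```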

Hard Examples (Magical Hieroglyphics)

coming soon.


OK, you'd like to use regular expressions, but you can't bring yourself to use vi. Here, then, are a few examples of how to use regular expressions in other tools. Also, I have attempted to summarize the differences in regular expressions you will find between different programs.

You can use regular expressions in the Visual C++ editor. Select Edit->Replace, then be sure to check the checkbox labeled "Regular expression". For vi expressions of the form :%s/pat1/pat2/g, set the Find What field to pat1 and the Replace with field to pat2. To simulate the range (% in this case) and the g option, you will have to use the Replace All button or appropriate combinations of Find Next and Replace.

sed

Sed is a Stream EDitor which can be used to make changes to files or pipes. For complete details, see the sed man page.

Here are a few interesting sed scripts. Assume that we're processing a file called price.txt. Note that the edits don't actually happen to the input file; sed simply processes each line of the file with the command you supply and echoes the result to its standard output.

    sed script                              Description

    sed '/^$/d' price.txt                   removes all empty lines
    sed '/^[[:space:]]*$/d' price.txt       removes all lines that are empty or contain only whitespace
    sed 's/"//g' price.txt                  removes all quotation marks
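Here is a small sketch of these scripts at work. Deleting a line uses the address form /pattern/d, that is, a pattern followed by the d command, while removing characters within a line uses s. The sample lines stand in for price.txt and are invented for the demo, as is the helper name make_sample.

```shell
# Four lines: data, an empty line, a line of space-tab-space, data.
make_sample() {
    printf 'widget "small" 1.50\n\n \t \ngadget "large" 2.00\n'
}

no_empty=$(make_sample | sed '/^$/d')               # whitespace-only line survives
no_blank=$(make_sample | sed '/^[[:space:]]*$/d')   # both blank lines are dropped
no_quotes=$(make_sample | sed 's/"//g')             # all four lines kept, quotes gone

printf '%s\n' "$no_blank"
```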

awk

Awk is a programming language which can be used to perform sophisticated analysis and manipulation of text data. For complete details, see the awk man page. Its peculiar name is an acronym made up of the first character of its authors' last names (Aho, Weinberger, and Kernighan).

There are many good awk examples in the book The AWK Programming Language (written by Aho, Kernighan, and Weinberger). Please don't form any broad opinions about awk's capabilities based on the following trivial sample scripts. For purposes of these examples, assume that we're working with a file called price.txt. As with sed, awk simply echoes its output to its standard output.

    awk script                                                Description

    awk '$0 !~ /^$/' price.txt                                removes all empty lines
    awk 'NF > 0' price.txt                                    a better way to remove all empty lines in awk
    awk '$2 ~ /^[JT]/ {print $3}' price.txt                   print the third field of all lines whose second field begins with 'J' or 'T'
    awk '$2 !~ /[Mm]isc/ {print $3 + $4}' price.txt           for all lines whose second field does not contain 'Misc' or 'misc' print the sum of columns 3 and 4 (assumed to be numbers).
    awk '$3 !~ /^[0-9]+\.[0-9]*$/ {print $0}' price.txt       print all lines where field 3 is not a number. The number must be of the form: d.d or d. where d is any number of digits from 0 to 9.
    awk '$2 ~ /John|Fred/ {print $0}' price.txt               print the entire line if the second field contains 'John' or 'Fred'
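To make these scripts concrete, here is a sketch that runs three of them on invented data. The layout of price.txt is not given in this article, so the four columns used here (item number, name, price, tax) and the helper name make_prices are assumptions for the demo. The third script adds NF > 0 because an empty line has no second field and would otherwise also count as "not Misc".

```shell
make_prices() {
    printf '%s\n' '101 Jam 2.50 0.25' '102 Misc 1.00 0.10' '' '103 Tea 4.00 0.40'
}

# Keep only lines that have at least one field.
nonempty=$(make_prices | awk 'NF > 0')

# Third field of lines whose second field starts with J or T.
jt_prices=$(make_prices | awk '$2 ~ /^[JT]/ {print $3}')

# Sum of columns 3 and 4 for non-empty lines whose second field is not Misc/misc.
sums=$(make_prices | awk 'NF > 0 && $2 !~ /[Mm]isc/ {print $3 + $4}')

printf '%s\n' "$jt_prices" "$sums"
```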

grep

grep is a program used to match regular expressions in one or more specified files or in an input stream. For complete details, see the grep man page. Its peculiar name stems from its roots as an editor command, g/re/p, meaning global regular expression print; the command originated in ed and also works in ex and vi.

For the examples below, assume we have the text below in a file named phone.txt. Its format is last name followed by a comma, first name followed by a tab, then a phone number.

    Francis, John 5-3871
    Wong, Fred 4-4123
    Jones, Thomas 1-4122
    Salazar, Richard 5-2522

    grep command                                 Description

    grep "$(printf '\t')5-...1" phone.txt        print all the lines in phone.txt where the phone number begins with 5 and ends with 1. Most grep implementations do not treat \t as a tab, so printf supplies the tab character.
    grep '^S[^ ]* R' phone.txt                   print lines where the last name begins with S and first name begins with R.
    grep '^[JW]' phone.txt                       print lines where the last name begins with J or W
    grep ", ....$(printf '\t')" phone.txt        print lines where the first name is 4 characters. The tab character is again supplied by printf.
    grep -v '^[JW]' phone.txt                    print lines that do not begin with J or W
    grep '^[M-Z]' phone.txt                      print lines where the last name begins with any letter from M to Z.
    grep '^[M-Z].*[12]$' phone.txt               print lines where the last name begins with a letter from M to Z and where the phone number ends with a 1 or 2.
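The sketch below rebuilds the phone.txt data with real tab characters and runs a few of the searches. Two things here are my own choices rather than part of the original examples: the tab is produced with printf, since grep patterns generally do not understand \t, and LC_ALL=C is set so that the range [M-Z] means plain uppercase ASCII letters. make_phone is a hypothetical helper name.

```shell
tab=$(printf '\t')

# Emit "Last, First<TAB>phone" for each pair of arguments.
make_phone() {
    printf '%s\t%s\n' 'Francis, John' '5-3871' 'Wong, Fred' '4-4123' \
        'Jones, Thomas' '1-4122' 'Salazar, Richard' '5-2522'
}

five_one=$(make_phone | grep "${tab}5-...1")             # number 5-???1
four_letter=$(make_phone | grep ", ....${tab}")          # 4-letter first name
not_jw=$(make_phone | grep -v '^[JW]')                   # last name not J or W
m_to_z=$(make_phone | LC_ALL=C grep '^[M-Z].*[12]$')     # M-Z and ends in 1 or 2

printf '%s\n' "$five_one" "$m_to_z"
```

The trailing $ in the last pattern matters: without it, [12] may match a 1 or 2 anywhere after the first letter, not just at the end of the phone number.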

egrep

egrep is an extended version of grep. It supports a few more metacharacters in its regular expressions. For the examples below, assume we have the text below in a file named phone.txt. Its format is last name followed by a comma, first name followed by a tab, then a phone number.

    Francis, John 5-3871
    Wong, Fred 4-4123
    Jones, Thomas 1-4122
    Salazar, Richard 5-2522

    egrep command                           Description

    egrep '(John|Fred)' phone.txt           print all lines that contain the name John or Fred.
    egrep 'John|22$|^W' phone.txt           print lines that contain John or that end with 22 or that begin with W.
    egrep 'net(work)?s' report.txt          print lines in report.txt that contain networks or nets.
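Here is a sketch of the extended patterns on sample data. It spells the command grep -E, which is the POSIX name for the same extended syntax and avoids the deprecation notice some newer grep releases print for egrep. The three words fed to the last search are invented, and make_phone is a hypothetical helper name.

```shell
make_phone() {
    printf '%s\t%s\n' 'Francis, John' '5-3871' 'Wong, Fred' '4-4123' \
        'Jones, Thomas' '1-4122' 'Salazar, Richard' '5-2522'
}

# Alternation: lines containing John or Fred.
names=$(make_phone | grep -E '(John|Fred)')

# Three alternatives; on this data every line satisfies at least one of them.
mixed=$(make_phone | grep -E 'John|22$|^W')

# Optional group: "nets" and "networks" match, "network" does not.
nets=$(printf '%s\n' 'networks' 'nets' 'network' | grep -E 'net(work)?s')

printf '%s\n' "$names" "$nets"
```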

    Command or Environment    .    [ ]    ^    $    \( \)    \{ \}    ?    +    |    ( )

    vi                        X    X      X    X    X
    Visual C++                X    X      X    X    X
    awk                       X    X      X    X                      X    X    X    X
    sed                       X    X      X    X    X        X
    Tcl                       X    X      X    X    X                 X    X    X    X
    ex                        X    X      X    X    X        X
    grep                      X    X      X    X    X        X
    egrep                     X    X      X    X    X                 X    X    X    X
    fgrep                     X    X      X    X    X
    perl                      X    X      X    X    X                 X    X    X    X

The vi Substitution Command

Vi's substitution command has the form

    :ranges/pat1/pat2/g

where

    : begins an ex (command line editor) command which is applied to the file currently being edited.

    range is the line range specifier. Use the percent sign (%) to indicate all lines. Use the dot (.) to indicate the current line. Use the dollar sign to indicate the last line. You can also use specific line numbers. Examples: 10,20 means lines 10 through 20; .,$ means from the current line to the last line; .+2,$-5 means from two lines after the current through the fifth line up from the end of the file.

    s is the substitution command.

    pat1 is the regular expression to be searched for. This paper is full of examples.

    pat2 is the replacement pattern. This paper is full of examples.

    g is optional. When present the substitution is made to all matches on the line. When it is not present, the substitution is applied only to the first match on the line.
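The range and the g flag have close counterparts in sed, which makes them easy to experiment with: a line range written as numbers (or $ for the last line) goes in front of the s command. The % and dot forms do not carry over, since sed has no current line. A minimal sketch, with made-up input and a hypothetical helper name:

```shell
# Four identical lines, each containing two "a" characters.
four_lines() {
    printf '%s\n' 'a a' 'a a' 'a a' 'a a'
}

# Lines 2 through 3, first match on each line only (no g).
first_only=$(four_lines | sed '2,3s/a/b/')

# Line 2 through the last line, every match on each line (g).
every_match=$(four_lines | sed '2,$s/a/b/g')

printf '%s\n' "$first_only" "$every_match"
```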