The power of the UNIX shell command line comes from a foundation of computer science abstractions defined roughly 30 years ago. That abstraction layer is the building block of power techniques, including regular expressions. The abstractions include modular software design and the way modules talk to each other via a stream, which carries one of two types of content: data or control. This paper starts at that abstract level and introduces derived concepts and terminology to explain how to read a regular expression. Because reading is the hard part, a person who can read a regular expression can write simple ones immediately, and complex ones with practice.
Regular Expressions (REs) have spread from the UNIX platform to all platforms and most high level languages (C has library support and Java has classes for them). REs are used by power shell users and programmers not just once a week, but several times a day. To change from a newbie to a guru, REs are a firm requirement.
I found REs hard to learn in the early 80's, though they are really easy with the right foundation. Not much was written about them then except in the vi man pages. Today the O'Reilly books and the Perl POD authors have done a wonderful job of providing many perspectives on REs that honed my understanding, and I recommend them.
To bridge the gap that remains for the novice I have developed this paper. It starts with the strong foundation of UNIX science abstractions (modules and streams) and introduces the concepts and terminology unique to REs.
The ability to read a RE is all that is needed to write one. Once you can read any RE, writing or modifying one is easy. I have used this paper to teach users with no RE knowledge how to read and write them by the end of the lesson.
This paper was updated for a Linux Users of Los Angeles LULA monthly meeting. The paper is not intended as a stand alone learning aid but may be used as one. The paper is web published in the historical tradition of the Internet, that of free distribution of educational and research materials. Please use this paper to learn REs and to assist others.
The term "Regular Expression" is commonly abbreviated as regex, regexp, re or RE. These abbreviations are used in this paper.
The "shell command line interpreter" is referred to by any combination of those words in this paper.
A RE is primarily a text string pattern matching method with extremely fancy wild cards, especially ranges, a variety of restrictive wild characters, and groups.
REs can be used on the command line for globbing (filename expansion), in UNIX filters like sed, awk, grep, egrep and perl, in text editors like vi, joe, emacs and textpad, and in high level languages like Basic, C, Fortran, and Java.
This paper will approach the use of RE's from the high ground.
To ease the understanding of RE a few technical terms will be introduced. The terms may appear to have no relationship to RE's but RE's are intimately, academically and theoretically based on the abstractions and concepts introduced below. Understanding these concepts will greatly ease reading and creating your own RE's.
It is very important to know that many RE rules have exceptions, some common and some uncommon. These exceptions give REs much of their power and flexibility. Nearly every "rule" below has an exception; it is outside the scope of this paper to document all of them, but the more common ones are explained. An example of a RE rule with its exception would read like this familiar English spelling one:
"i" before "e" except after "c."
At first, just learn one way of doing each RE and do not try the others, and you will hardly ever need to worry about exceptions.
Several UNIX technical terms are introduced.
A stream is data (a sequence of binary, hex, or character values) that is passed from one process to another process. A process can be the command line, a program, a file, or a monitor screen, to mention a few. Processes can also exist within another process. For example, in a high level language, an assignment statement within an if statement where the value of the assignment is passed to the if statement for testing. Here the word "process" is not used as the typical UNIX "process."
Also good to understand is that streams can be contained inside streams. In fact, that understanding is crucial for the rest of the paper.
A stream has two types, data and control commands (or just control). Data and control commands can range from binary bits to paragraphs.
A stream can be all data, all commands, or mixed. When mixed, the data and commands are typically rigorously parseable, which determines the type of each stream part. The parsing may have to be done by a special process, because a mixed stream may have positional parts whose type, data or control, is determined by context, sometimes by what comes before or after the stream part. Examples are presented later.
A parser examines a stream byte by byte, or frequently a text string character by character, and classifies each part as either data or control.
Frequently in a mixed stream an "escape" character is used to toggle between data and command. The escape could be a blank, as on the command line, or the backslash character "\". A character's position may also define it as an escape, such as a starting quote, which with its matching closing quote defines a text string but is part of neither the string nor any control command used later in the handling. In the quote case two escape characters sit next to each other: the blank next to the quote is also an escape character, being neither data nor control. Unless you look at it from certain processes, like the "parser," which treats blanks as control commands, that is, as "end the existing field" commands.
The concept of nesting is heavily used by power RE users.
Certain characters used as escape or control characters come in pairs, a start character and an end character. These are actually quite easy to remember as they match the keyboard and your expectations as follows:
[] {} ()

One might think that angle brackets, the greater than and less than characters <>, would be used as well. They are not (yet:).
In some RE cases there is no legal nesting of these characters when they are used as control characters. That is, it is not legal to use them inside each other.
LEGAL USES: [] {} () [][] [()] [{}] ({}) ([])
ILLEGAL:    {()} {[]}
The concept of quoting is not heavily used in REs, if at all.
However, REs on the command line that contain quotes require escaping, thus some details for quotes on the command line, inside and outside REs is presented later. Detailed usage of quotes on the command line is outside the scope of this RE paper.
The above concepts are now used to explain some stream details important to understand how to read and write RE's.
A pattern is what an RE expresses, that is, a text string or set of text strings. It could be a filename, or a set of filenames, perhaps something as simple as a partial filename followed by a wild card.
(Hmmm, actually that is handled by "globbing," a RE-like process used by the command line interpreter, as the wild card asterisk in that case has a slightly different meaning than a true RE asterisk. It was stated there are exceptions. ;)
Pattern matching is the process of comparing the pattern against an input data stream; it is why you write REs in the first place. It is a form of data filtering.
Pattern substitution is one of the most powerful and common uses of RE's. It uses the data filtering of the RE to select data streams for substitution, thus not all lines have replacements, only those selected.
Hashing is a special high speed pattern matching method. Hashing methods are outside the scope of this RE paper.
Sometimes many "passes" are required over the stream. For example, the command line interpreter parses the user typed command text string (a simple data stream), then treats the first token as a control command, finds that command/program within the PATH environment variable directories, invokes the program, and passes the remainder of the command line to the program, which then accepts the input and scans it for more commands or "options" and data or option/argument values. For example:
prompt: perl -n -e "{ print; }" filename<enter>

contains the following data and control streams, broken down into three steps as handled by the three "processes" that handle the stream:
As typed by the user
DATA:    perl -n -e "{ print; }" filename
CONTROL: press the "enter" key
As viewed by the command line interpreter
CONTROL: perl
DATA:    -n -e "{ print; }" filename
CONTROL: press the "enter" key

The interpreter, upon seeing the keyboard code value for the 'enter' key in the keyboard stream, will buffer future keyboard input for processing after the current input data stream is processed by the shell.
As viewed by the perl command line option parser which gets the input data string
-n -e "{ print; }" filename

which is tokenized as follows:
CONTROL: -n
CONTROL: -e
DATA:    "{ print; }"
DATA:    filename
As viewed by the perl interpreter/compiler, which gets the data string following the control option "-e"
CONTROL: "{ print; }"

Notice that this perl code snippet was first viewed as a data string by perl and later as a control stream that affected what perl would do. Streams can change types, typically from data to control, not the reverse.
The user types a data stream onto the command line and presses enter, generating a control character called "enter" at the end of the data stream to inform the next process to take control.
What type the parsed stream parts (tokens) are is not determined by the command line interpreter, except for the first token, "perl." The remainder of the command line input is treated as data.
The program perl is then responsible for making a second pass at the data stream and assigning each stream part a type, control or data, based upon context. That is, control and data can occur in any order, some control streams are required to be followed by a data stream, and some fields are optional while others (the filename) are required.
The example below was chosen because it introduces many RE concepts:
^[a-z]* +\(.*\) +[0-9]$
Keeping in mind the concepts just learned, one knows that this RE is a mixed stream consisting of data, commands, and escape characters. One could guess that the "brackets" and "parens" are likely like quotes, that is, a starting and an ending escape character that define an internal text string.
CONTROL: ^      - when starting an RE means the pattern starts at the first character.
                  That is, the character after the ^ must be the first character.
MIXED:   [a-z]  - "one" character that matches the RANGE lower case letters a to z
CONTROL: *      - the preceding character can occur zero, one or more times.
DATA:    " "    - a space or blank
CONTROL: +      - the preceding character can occur one or more times.
MIXED:   \(.*\) - a "group" THE MOST IMPORTANT CONSTRUCT FOR SUBSTITUTION!!!
DATA:    " "    - a space or blank
CONTROL: +      - the preceding character can occur one or more times.
MIXED:   [0-9]  - "one" character that matches any in the RANGE from 0 to 9
CONTROL: $      - when ending an RE means the pattern ends at the last character.
                  That is, the character before the $ must be the last character.
Taking just the GROUP and breaking it down further:
BREAKDOWN:
MIXED: \( - start a GROUP
MIXED: .* - zero or more characters (a COMMON pair of controls)
MIXED: \) - end a GROUP

FURTHER BREAKDOWN:
ESCAPE:  \ - next character is treated as CONTROL
CONTROL: ( - start a GROUP
CONTROL: . - any character (the RE wild card)
CONTROL: * - the preceding character can occur zero, one or more times.
ESCAPE:  \ - next character is treated as CONTROL
CONTROL: ) - end a GROUP

NOTE: A paren with no preceding ESCAPE \ is data, just another character.
EXCEPTION: Perl REs use the opposite meanings!!!
  ( )   - defines a GROUP
  \( \) - just data, normal start and end paren.
NOTE: Another perl exception starting with Perl 5: the At Sign "@" must be
preceded with an escape backslash \, otherwise it is treated as an array
variable, which is outside the scope of this paper on REs.
Taking the RANGE [a-z] and breaking it down further:
BREAKDOWN:
CONTROL: [ - start a RANGE
DATA:    a - lower case a
CONTROL: - - a DASH inside brackets is THE RANGE character
DATA:    z - lower case z
CONTROL: ] - end the RANGE
Correcting a little white lie: the dash is the RANGE character ONLY when used inside brackets and surrounded by "like" data in the proper ascending order.
Here are bad ranges:
z-a 8-4 E-A a-Z 9-0 1-0
Note that "all digits" MUST be 0-9, not 1-0, as zero comes first.
Good ranges are:
a-z 4-8 A-E a-zA-Z 0-9 0-1
Note the "order" of the characters surrounding the dash must be "increasing"
Note the range of upper and lower case letters where two ranges are side by side.
Note the range 0-1 has no digits inside the range but is still legal.
Remember the dash must be used inside brackets to be a RANGE.
To get an ordinary dash inside brackets either use an escaped dash using the backslash \- or start or end with a dash inside the brackets, that is [-aeiou] or [aeiou-].
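The range-versus-literal-dash rule can be checked at the prompt; a sketch assuming GNU grep (for the -o option, which prints each match on its own line):

```shell
# [a-z] is a range: the dash in the input splits it into two matches
echo "cat-dog" | grep -o '[a-z][a-z]*'
# prints: cat
#         dog

# [-a-z]: the leading dash is a literal, so one match covers the whole string
echo "cat-dog" | grep -o '^[-a-z][-a-z]*'
# prints: cat-dog
```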
RE parsing is VERY intelligent and will seldom produce an error message if the RE stream can be legally parsed. Of course, your RE is likely to give you unexpected results.
Continuing to correct the white lie the brackets express a RANGE of values for just ONE character in the PATTERN.
Now back to the RE example. The equivalent wording of

^[a-z]* +\(.*\) +[0-9]$

is:
^[a-z]*  a string starting with zero or more lower case letters
 +       followed by at least one blank
\(.*\)   grouping everything found between the blanks
 +       at least one or more blanks before
[0-9]$   a single trailing digit
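A quick way to verify the wording is to use the group in a sed substitution; a sketch assuming GNU sed, where one-or-more is spelled \+ in a basic RE (the bare + above is the egrep/perl spelling):

```shell
# keep only what the \( \) group captured
echo "abc mid 7" | sed 's/^[a-z]* \+\(.*\) \+[0-9]$/\1/'
# prints: mid
```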
When reading a RE, remember that the following meta characters have their standard egrep-ish meanings:
\  Quote the next meta character
^  Match the beginning of the line
.  Match any character (except newline)
$  Match the end of the line
|  Alternation
() Grouping
[] Character class
The following standard quantifiers are recognized:
*      Match 0 or more times
+      Match 1 or more times
?      Match 1 or 0 times
{n}    Match exactly n times
{n,}   Match at least n times
{n,m}  Match at least n but not more than m times
*?     Match 0 or more times, not greedily
+?     Match 1 or more times, not greedily
??     Match 0 or 1 time, not greedily
{n}?   Match exactly n times, not greedily
{n,}?  Match at least n times, not greedily
{n,m}? Match at least n but not more than m times, not greedily
Since patterns are processed as double quoted strings, the following also work:
\t    tab
\n    newline
\r    return
\f    form feed
\v    vertical tab, whatever that is
\a    alarm (bell)
\e    escape
\033  octal char
\x1b  hex char
\c[   control char
\l    lowercase next char
\u    uppercase next char
\L    lowercase till \E
\U    uppercase till \E
\E    end case modification
\Q    quote regexp meta characters till \E
In addition, Perl defines the following:
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-word character
\s Match a white space character
\S Match a non-white space character
\d Match a digit character
\D Match a non-digit character
Note that \w matches a single alphanumeric character, not a whole word. To match a word you'd need to say \w+. You may use \w, \W, \s, \S, \d and \D within character classes (though not as either end of a range).
Perl defines the following zero-width assertions:
\b Match a word boundary
\B Match a non-(word boundary)
\A Match only at beginning of string
\Z Match only at end of string
\G Match only where previous m//g left off
Code Line Counting
Perl where the comments start with a # symbol:
grep -v "^[[:space:]]*#" | grep -v "^[[:space:]]*$" | wc -l

C where the comments start with /* and end with */:
perl -n -e "if ( ! ( /\/\*/ .. /\*\// ) ) { print; }" | wc -l
Counting Files by Extension
To count the files with just one file extension in one subdirectory:
ls *.h | wc -l

To count the files with one or more file extensions in one subdirectory:
ls *.[ch] | wc -l

To count the files with one or more file extensions three subdirectories down:
ls *.[ch] */*.[ch] */*/*.[ch] | wc -l
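The deepening pile of */ globs can also be written with find; a sketch assuming GNU find's -maxdepth option (depth 1 is the current directory, so 4 covers three subdirectory levels; drop -maxdepth to recurse with no limit):

```shell
find . -maxdepth 4 -name '*.[ch]' | wc -l
```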
Command Line foreach Example
How to make multiple substitutions in multiple files with the csh foreach loop:
echo "s/\.htm\b/\.html/g" > sub.sed
echo "s/\.jpeg/\.jpg/g" >> sub.sed
echo "s/\.JPG/\.jpg/g" >> sub.sed
echo "s/\.JPg/\.jpg/g" >> sub.sed
echo "s/\\\$/\$/g" >> sub.sed
echo "s/POB /P.O. Box /g" >> sub.sed
echo "s/mispelled/misspelled/g" >> sub.sed
mkdir 2
foreach file (*.html)
    sed -f sub.sed $file > 2/$file
end
mkdir backup
mv *.html backup/.
cp 2/* .
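The foreach above is csh syntax; the same batch edit as a POSIX sh sketch (assuming sub.sed and the *.html files already exist in the current directory):

```shell
mkdir -p 2
for file in *.html; do
    sed -f sub.sed "$file" > "2/$file"   # edited copy lands in 2/
done
```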
Miscellaneous Examples
Telephone Number Reformatting from (###) ###-#### to ###.###.####
sed (basic REs): s/(\([0-9][0-9][0-9]\)) \([0-9][0-9][0-9]\)-\([0-9][0-9][0-9][0-9]\)/\1.\2.\3/
GNU sed:         s/(\([0-9]\{3\}\)) \([0-9]\{3\}\)-\([0-9]\{4\}\)/\1.\2.\3/
perl:            s/\((\d{3})\) (\d{3})-(\d{4})/$1.$2.$3/
Testing a web page form input field for Post Office Box
where PO is the minimum and P.O. Box is the maximum, including variations such as P.O., Pob, or PO Box:

/P\.* *O\.* *B*O*X*/i
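A sketch of the pattern accepting the listed variations (note how loose it is: only the P and O are required, so any input containing "po", such as "Apollo St", also matches):

```shell
for addr in "PO 12" "P.O. Box 12" "Pob 12" "PO Box 12"; do
    echo "$addr" | perl -ne 'print if /P\.* *O\.* *B*O*X*/i'
done
# prints all four addresses
```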
Find a filename by wildcard
ls name* */name* */*/name* */*/*/name* */*/*/*/name* */*/*/*/*/name*

find . -name 'name*' -print
Find records of a certain date and time in the Sendmail log or Apache web log
grep "^Sep 7 11:56" maillog
grep "^[^ ]* [^ ]* [^ ]* \[16/Sep/2003:16:20" access_log
Matching Shell Globs as Regular Expressions
To convert from shell globbing to Perl, the differences lie in how the wildcard characters "?" and "*" are spelled ("." and ".*" in a RE), and the string must be forced to match from the very beginning to the very end (use of ^ and $). See below for some examples. (Thanks to O'Reilly's "Perl Cookbook," pages 180 and 181.)
Shell       Perl
list.?      ^list\..$
*           ^.*$
*.*         ^.*\..*$
project.*   ^project\..*$
*old        ^.*old$
type*.[ch]  ^type.*\.[ch]$
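One row of the table can be verified by feeding candidate filenames through perl; a minimal sketch:

```shell
# the RE ^type.*\.[ch]$ should select the same names as the glob type*.[ch]
for name in types.c type.h type.txt mytype.c; do
    echo "$name" | perl -ne 'print if /^type.*\.[ch]$/'
done
# prints: types.c
#         type.h
```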
The concept of quoting is not heavily used in REs, if at all. However, REs on the command line that contain quotes require escaping. The normal use of quotes on the command line without REs is covered first, followed by the use of quotes inside command line REs. Detailed usage of quotes on the command line may be found in the man pages for your shell.
Quotes include the following characters

"" '' ``

where a starting and ending quote are shown and a 'string' would occur inside them.
But in certain processes, like command line REs, quotes are used in a special way. Quotes used in a RE on the command line must be "escaped," or forced to be a "literal," to avoid being interpreted by the command line interpreter. The escaped RE quotes are then processed by the RE expansion algorithm.
Single and double quotes on the command line have different effects.
Single quotes prevent shell variable substitution and globbing on the command line.
For example:
> echo $PATH
/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin
> echo '$PATH'
$PATH

where the second echo command input was treated as a literal with no additional command line processing such as variable substitution. The dollar sign character is treated as a literal dollar sign, not as a control character indicating the start of a variable name.
Wild card characters are 'turned off' for both the globbing asterisk "*" and the RE ".*" wildcard.
> ls *.html
about.html event.html guestbook.html index.html photo.html
> ls '*.html'
ls: *.html: No such file or directory

where the second ls command filename input was treated as a literal with no additional command line processing such as globbing.
On the command line a pair of double quotes is used to group a single text string
that contains embedded spaces that would otherwise be parsed by the command line
interpreter into separate tokens, not a single text string.
For example:
> grep "expensive photo" event.html
For Sale: expensive photo dated 1801
> grep expensive photo event.html
grep: photo: No such file or directory

where the last grep command, without the double quotes, resulted in an error of filename 'photo' not found.
Strings contained inside double quotes on the command line will have both globbing and variable substitution performed on the string before the shell's interpreter sends the resulting string to the invoked 'command.'
In the example below the 'ls' command sees only the option "-l" and not the literal *.html, which will have been expanded by the shell interpreter to the string

about.html event.html guestbook.html index.html photo.html

before the 'ls' command is invoked. Thus, what the 'ls' command executable reads from the command line is this string:

-l about.html event.html guestbook.html index.html photo.html

> ls -l *.html
-rw-rw-r-- 1 pete pete  7439 Dec 18  2000 about.html
-rw-rw-r-- 1 pete pete  4339 Sep 15 15:22 event.html
-rw-rw-r-- 1 pete pete  4339 Aug  3 19:40 guestbook.html
-rw-rw-r-- 1 pete pete 10240 Mar 17  2003 index.html
-rw-rw-r-- 1 pete pete 23418 Apr 15  2002 photo.html

Care must be taken with globbing to not have the expanded string exceed the command line buffer size.
One might think that the backquote ` (often miscalled an apostrophe) is special like quotes. While a pair of backquotes "enclose" a string, that string is treated specially, depending upon the process handling the backquotes.

Inside a RE the backquote character has no special RE meaning. It is a literal backquote, and it does not have to be escaped unless the RE is on the command line.

On the command line the enclosed string is substituted, that is, the enclosed string is treated as a command and the STDOUT stream from the command replaces the enclosed string. The STDOUT stream will have its newline characters replaced with blanks.
For example:
`ls -1 *.html`

becomes

about.html event.html guestbook.html index.html photo.html

instead of

about.html
event.html
guestbook.html
index.html
photo.html

To see this effect try these commands:

ls *.html
ls -1 *.html
echo `ls *.html`
echo `ls -1 *.html`

where the last two 'echo' commands have identical output of a single line, while the first two 'ls' commands have different outputs of multiple lines, with either two or more filenames per line or, with the option -1, only one filename per line.
A Perl assignment statement with a string enclosed in backquotes runs the string as a shell command and sets the variable's value to the STDOUT. If the Perl variable is a scalar, the STDOUT is one string including the newlines. If the Perl variable is an array, the STDOUT is split into array elements at the newline characters.
Backquotes used inside awk or sed have no special meaning (as of the 1998 first version of this paper).
When using quotes inside a RE, the quote must be preceded with the escape character backslash "\" in order to force the command line interpreter to treat the quote as a literal. The interpreter will remove the escape backslash and pass the resulting RE string to the invoked command. Example:

sed -e "s/\"Book Title Here\"/<i>Book Title Here<\/i>/g" bookreview.html
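Run against a sample line, the escaped quotes survive the shell and reach sed as literal quotes; a minimal sketch (bookreview.html above stands in for whatever file you are editing):

```shell
echo 'She reviewed "Book Title Here" twice' |
  sed -e "s/\"Book Title Here\"/<i>Book Title Here<\/i>/g"
# prints: She reviewed <i>Book Title Here</i> twice
```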
RE resources come from four primary areas: man pages, perl POD, web pages and books. Some of each are below.
A complete set of RE manual pages is traditionally found in the

man vi

pages, and now more commonly in

man regex
Command line globbing rules are in the

man sh

pages or the shell man pages of your choice.
Manual pages for perl RE can be found linked from http://www.perl.com about 3-4 mouse clicks away or try the search.
Perl 5.8 Regular Expression POD resources are good reading for any type of RE:
and are installed with any port of Perl in perl/html/perlmain.html

This paper's URL will always be linked from http://peterbenjamin.com/seminars
A quick and easy to read RE web page is http://www.mit.edu/perl/perlre.html
The ORA Safari online, keyword searchable books.
O'Reilly RE book search (as there are more every year).
(c) 1998-2003 Copyright by Peter Benjamin. All rights reserved.