The power of the UNIX shell command line comes from a foundation of computer science abstractions defined roughly 30 years ago. That abstraction layer is the building block of power techniques, including regular expressions. The abstractions include modular software design and the way modules talk to each other via a stream, which carries one of two types of content: data or control. This paper starts at that abstract level and introduces derived concepts and terminology to explain how to read a regular expression. Because reading is the hard part, a person who can read a regular expression can write simple ones immediately, and complex ones with practice.
Regular Expressions (REs) have spread from the UNIX platform to all platforms and most high level languages (C has library support and Java has classes for them). REs are used by power shell users and programmers not just once a week, but several times a day. To change from a newbie to a guru, REs are a firm requirement.
I found REs hard to learn in the early 80's, though they are really easy with the right foundation. Not much was written about them then except in the vi man pages. Today the O'Reilly books and the Perl POD authors have done a wonderful job of providing many perspectives on REs that honed my understanding, and I recommend them.
To bridge the gap that remains for the novice I have developed this paper. It starts with the strong foundation of UNIX science abstractions (modules and streams) and introduces the concepts and terminology unique to REs.
The ability to read a RE is all that is needed to write one. Once you can read any RE, writing or modifying one is easy. I have used this paper to teach users with no RE knowledge how to read and write them by the end of the lesson.
This paper was updated for a Linux Users of Los Angeles LULA monthly meeting. The paper is not intended as a stand alone learning aid but may be used as one. The paper is web published in the historical tradition of the Internet, that of free distribution of educational and research materials. Please use this paper to learn REs and to assist others.
The term "Regular Expression" is commonly abbreviated as regex, regexp, re or RE. These abbreviations are used in this paper.
The "shell command line interpreter" is referred to by any combination of those words in this paper.
A RE is primarily a text string pattern matching method with extremely fancy wild cards, especially ranges, a variety of restrictive wild characters, and groups.
REs can be used on the command line for globbing (filename expansion), in UNIX filters like sed, awk, grep, egrep and perl, in text editors like vi, joe, emacs and textpad, and in high level languages like Basic, C, Fortran, and Java.
This paper will approach the use of RE's from the high ground.
To ease the understanding of RE a few technical terms will be introduced. The terms may appear to have no relationship to RE's but RE's are intimately, academically and theoretically based on the abstractions and concepts introduced below. Understanding these concepts will greatly ease reading and creating your own RE's.
It is very important to know that many RE rules have exceptions, some common and some uncommon. These exceptions give REs much of their power and flexibility. Nearly every "rule" below has an exception; it is outside the scope of this paper to document all of them, but the more common ones are explained. An example of a RE rule with its exception would read like this familiar English spelling one:
"i" before "e" except after "c."
At first, just learn one way of doing each RE and do not try the others, and you will hardly ever need to worry about exceptions.
Several UNIX technical terms are introduced.
A stream is data (a sequence of binary, hex, or character values) that is passed from one process to another process. A process can be the command line, a program, a file, or a monitor screen, to mention a few. Processes can also exist within another process. For example, in a high level language, an assignment statement within an if statement where the value of the assignment is passed to the if statement for testing. Here the word "process" is not used as the typical UNIX "process."
Also good to understand is that streams can be contained inside streams. In fact, that understanding is crucial for the rest of the paper.
A stream has two types, data and control commands (or just control). Data and control commands can range from binary bits to paragraphs.
A stream can be all data, all commands, or mixed. When mixed, the data and commands are typically rigorously parseable, which determines the type of each stream part. The parsing may have to be done by a special process, because a mixed stream may have positional parts whose type, data or control, is determined by context, sometimes by what comes before or after the stream part. Examples are presented later.
A parser examines a stream byte by byte, or frequently a text string character by character, and classifies each part as either data or control.
Frequently in a mixed stream an "escape" character is used to toggle between data and command. The escape could be a blank, as on the command line, or the backslash character "\". A character's position may also define it as an escape, such as a starting quote, which with its matching closing quote defines a text string but is part of neither the string nor any control command used later in the handling. In the quote case two escape characters sit next to each other: the blank next to the quote is also an escape character, being neither data nor control. Unless you look at it from certain processes, like the "parser," which treats blanks as control commands, that is, as "end the existing field" commands.
The concept of nesting is heavily used by power RE users.
Certain characters used as escape or control characters come in pairs, a start character and an end character. These are actually quite easy to remember as they match the keyboard and your expectations as follows:
[] {} ()

One might think that angle brackets, the greater than and less than characters <>, would be used as well. They are not (yet:).
In some RE cases there is no legal nesting of these characters when they are used as control characters. That is, it is not legal to use them inside each other.
LEGAL USES: [] {} () [][] [()] [{}] ({}) ([])
ILLEGAL:    {()} {[]}
The concept of quoting is not heavily used in REs, if at all.
However, REs on the command line that contain quotes require escaping, thus some details for quotes on the command line, inside and outside REs is presented later. Detailed usage of quotes on the command line is outside the scope of this RE paper.
The above concepts are now used to explain some stream details important to understand how to read and write RE's.
A pattern is what an RE expresses, that is, a text string or set of text strings. It could be a filename, or a set of filenames, perhaps something as simple as a partial filename followed by a wild card.
(Hmmm, actually that is handled by "globbing," a RE-like process used by the command line interpreter, as the wild card asterisk in that case has a slightly different meaning than a true RE asterisk. It was stated there are exceptions. ;)
Pattern matching is the process of comparing the pattern against an input data stream; it is why you write REs in the first place. It is a form of data filtering.
Pattern substitution is one of the most powerful and common uses of RE's. It uses the data filtering of the RE to select data streams for substitution, thus not all lines have replacements, only those selected.
Hashing is a special high speed pattern matching method. Hashing methods are outside the scope of this RE paper.
Sometimes many "passes" are required over the stream. For example, the command line interpreter parses the user typed command text string (a simple data stream), then treats the first token as a control command, finds that command/program within the PATH environment variable directories, invokes the program, and passes the remainder of the command line to the program, which then accepts the input and scans it for more commands or "options" and data or option/argument values. For example:
prompt: perl -n -e "{ print; }" filename<enter>

contains the following data and control streams, broken down into three steps as handled by the three "processes" that handle the stream:
As typed by the user
DATA:    perl -n -e "{ print; }" filename
CONTROL: press the "enter" key
As viewed by the command line interpreter
CONTROL: perl
DATA:    -n -e "{ print; }" filename
CONTROL: press the "enter" key

The interpreter, upon seeing the keyboard code value for the 'enter' key in the keyboard stream, will buffer future keyboard input for processing after the current input data stream is processed by the shell.
As viewed by the perl command line option parser which gets the input data string
-n -e "{ print; }" filename

which is tokenized as follows:
CONTROL: -n
CONTROL: -e
DATA:    "{ print; }"
DATA:    filename
As viewed by the perl interpreter/compiler, which gets the data string following the control option "-e"
CONTROL: "{ print; }"

Notice that this perl code snippet was first viewed as a data string by perl and later as a control stream that affected what perl would do. Streams can change types, typically from data to control, not the reverse.
The user types a data stream onto the command line and presses enter, generating a control character called "enter" at the end of the data stream to inform the next process to take control.
What type the parsed stream parts (tokens) are is not determined by the command line interpreter, except for the first token, "perl." The remainder of the command line input is treated as data.
The program perl is then responsible for making a second pass at the data stream and assigning each stream part a type, control or data, based upon context. That is, control and data can occur in any order, some control streams are required to be followed by a data stream, and some fields are optional while others (the filename) are required.
The example below was chosen because it introduces many RE concepts:
^[a-z]* +\(.*\) +[0-9]$
Keeping in mind the concepts just learned, one knows that this RE is a mixed stream consisting of data, commands, and escape characters. One could guess that the "brackets" and "parens" are likely like quotes, that is, a starting and an ending escape character that define an internal text string.
CONTROL: ^      - when starting an RE means the pattern starts at the first character.
                  That is, the character after the ^ must be the first character.
MIXED:   [a-z]  - "one" character that matches the RANGE lower case letters a to z
CONTROL: *      - the preceding character can occur zero, one or more times.
DATA:    " "    - a space or blank
CONTROL: +      - the preceding character can occur one or more times.
MIXED:   \(.*\) - a "group" THE MOST IMPORTANT CONSTRUCT FOR SUBSTITUTION!!!
DATA:    " "    - a space or blank
CONTROL: +      - the preceding character can occur one or more times.
MIXED:   [0-9]  - "one" character that matches any in the RANGE from 0 to 9
CONTROL: $      - when ending an RE means the pattern ends at the last character.
                  That is, the character before the $ must be the last character.
Taking just the GROUP and breaking it down further:
BREAKDOWN:
MIXED: \( - start a GROUP
MIXED: .* - zero or more characters (a COMMON pair of controls)
MIXED: \) - end a GROUP

FURTHER BREAKDOWN:
ESCAPE:  \ - next character is treated as CONTROL
CONTROL: ( - start a GROUP
CONTROL: . - any character (the RE wild card)
CONTROL: * - the preceding character can occur zero, one or more times.
ESCAPE:  \ - next character is treated as CONTROL
CONTROL: ) - end a GROUP

NOTE: A paren with no preceding ESCAPE \ is data, just another character.
EXCEPTION: Perl REs use the opposite meanings!!!
  ( )   - defines a GROUP
  \( \) - just data, normal start and end paren.
NOTE: Another perl exception starting with Perl 5: the At Sign "@" must be
preceded with an escape backslash \, otherwise it is treated as an array
variable, which is outside the scope of this paper on REs.
Taking the RANGE [a-z] and breaking it down further:
BREAKDOWN:
CONTROL: [ - start a RANGE
DATA:    a - lower case a
CONTROL: - - a DASH inside brackets is THE RANGE character
DATA:    z - lower case z
CONTROL: ] - end the RANGE
Correcting a little white lie: the dash is the RANGE character ONLY when used inside brackets and surrounded by "like" data in the proper ascending order.
Here are bad ranges:
z-a 8-4 E-A a-Z 9-0 1-0
Note that "all digits" MUST be 0-9, not 1-0, as zero comes first.
Good ranges are:
a-z 4-8 A-E a-zA-Z 0-9 0-1
Note the "order" of the characters surrounding the dash must be "increasing"
Note the range of upper and lower case letters where two ranges are side by side.
Note the range 0-1 has no digits inside the range but is still legal.
Remember the dash must be used inside brackets to be a RANGE.
To get an ordinary dash inside brackets either use an escaped dash using the backslash \- or start or end with a dash inside the brackets, that is [-aeiou] or [aeiou-].
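The range-versus-literal-dash rule can be checked at the prompt; a sketch assuming GNU grep (for the -o option, which prints each match on its own line):

```shell
# [a-z] is a range: the dash in the input splits it into two matches
echo "cat-dog" | grep -o '[a-z][a-z]*'
# prints: cat
#         dog

# [-a-z]: the leading dash is a literal, so one match covers the whole string
echo "cat-dog" | grep -o '^[-a-z][-a-z]*'
# prints: cat-dog
```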
RE parsing is VERY intelligent and will seldom produce an error message if the RE stream can be legally parsed. Of course, your RE is likely to give you unexpected results.
Continuing to correct the white lie the brackets express a RANGE of values for just ONE character in the PATTERN.
Now back to the RE example. The equivalent wording of

^[a-z]* +\(.*\) +[0-9]$

is:
^[a-z]*  a string starting with zero or more lower case letters
 +       followed by at least one blank
\(.*\)   grouping everything found between the blanks
 +       at least one or more blanks before
[0-9]$   a single trailing digit
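A quick way to verify the wording is to use the group in a sed substitution; a sketch assuming GNU sed, where one-or-more is spelled \+ in a basic RE (the bare + above is the egrep/perl spelling):

```shell
# keep only what the \( \) group captured
echo "abc mid 7" | sed 's/^[a-z]* \+\(.*\) \+[0-9]$/\1/'
# prints: mid
```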
When reading a RE, remember that the following meta characters have their standard egrep-ish meanings:
\  Quote the next meta character
^  Match the beginning of the line
.  Match any character (except newline)
$  Match the end of the line
|  Alternation
() Grouping
[] Character class
The following standard quantifiers are recognized:
*      Match 0 or more times
+      Match 1 or more times
?      Match 1 or 0 times
{n}    Match exactly n times
{n,}   Match at least n times
{n,m}  Match at least n but not more than m times
*?     Match 0 or more times, not greedily
+?     Match 1 or more times, not greedily
??     Match 0 or 1 time, not greedily
{n}?   Match exactly n times, not greedily
{n,}?  Match at least n times, not greedily
{n,m}? Match at least n but not more than m times, not greedily
Since patterns are processed as double quoted strings, the following also work:
\t    tab
\n    newline
\r    return
\f    form feed
\v    vertical tab, whatever that is
\a    alarm (bell)
\e    escape
\033  octal char
\x1b  hex char
\c[   control char
\l    lowercase next char
\u    uppercase next char
\L    lowercase till \E
\U    uppercase till \E
\E    end case modification
\Q    quote regexp meta characters till \E
In addition, Perl defines the following:
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-word character
\s Match a white space character
\S Match a non-white space character
\d Match a digit character
\D Match a non-digit character
Note that \w matches a single alphanumeric character, not a whole word. To match a word you'd need to say \w+. You may use \w, \W, \s, \S, \d and \D within character classes (though not as either end of a range).
Perl defines the following zero-width assertions:
\b Match a word boundary
\B Match a non-(word boundary)
\A Match only at beginning of string
\Z Match only at end of string
\G Match only where previous m//g left off
Code Line Counting
Perl where the comments start with a # symbol:
grep -v "^[[:space:]]*#" | grep -v "^[[:space:]]*$" | wc -l

C where the comments start with /* and end with */:
perl -n -e "if ( ! ( /\/\*/ .. /\*\// ) ) { print; }" | wc -l
Counting Files by Extension
To count the files with just one file extension in one subdirectory:
ls *.h | wc -l

To count the files with one or more file extensions in one subdirectory:
ls *.[ch] | wc -l

To count the files with one or more file extensions three subdirectories down:
ls *.[ch] */*.[ch] */*/*.[ch] | wc -l
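The deepening pile of */ globs can also be written with find; a sketch assuming GNU find's -maxdepth option (depth 1 is the current directory, so 4 covers three subdirectory levels; drop -maxdepth to recurse with no limit):

```shell
find . -maxdepth 4 -name '*.[ch]' | wc -l
```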
Command Line foreach Example
How to make multiple substitutions in multiple files with the csh foreach loop:
echo "s/\.htm\b/\.html/g" > sub.sed
echo "s/\.jpeg/\.jpg/g" >> sub.sed
echo "s/\.JPG/\.jpg/g" >> sub.sed
echo "s/\.JPg/\.jpg/g" >> sub.sed
echo "s/\\\$/\$/g" >> sub.sed
echo "s/POB /P.O. Box /g" >> sub.sed
echo "s/mispelled/misspelled/g" >> sub.sed
mkdir 2
foreach file (*.html)
    sed -f sub.sed $file > 2/$file
end
mkdir backup
mv *.html backup/.
cp 2/* .
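The foreach above is csh syntax; the same batch edit as a POSIX sh sketch (assuming sub.sed and the *.html files already exist in the current directory):

```shell
mkdir -p 2
for file in *.html; do
    sed -f sub.sed "$file" > "2/$file"   # edited copy lands in 2/
done
```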
Miscellaneous Examples
Telephone Number Reformatting from (###) ###-#### to ###.###.####
sed (basic REs): s/(\([0-9][0-9][0-9]\)) \([0-9][0-9][0-9]\)-\([0-9][0-9][0-9][0-9]\)/\1.\2.\3/
GNU sed:         s/(\([0-9]\{3\}\)) \([0-9]\{3\}\)-\([0-9]\{4\}\)/\1.\2.\3/
perl:            s/\((\d{3})\) (\d{3})-(\d{4})/$1.$2.$3/
Testing a web page form input field for Post Office Box
where PO is the minimum and P.O. Box is the maximum, including variations such as P.O., Pob, or PO Box:

/P\.* *O\.* *B*O*X*/i
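A sketch of the pattern accepting the listed variations (note how loose it is: only the P and O are required, so any input containing "po", such as "Apollo St", also matches):

```shell
for addr in "PO 12" "P.O. Box 12" "Pob 12" "PO Box 12"; do
    echo "$addr" | perl -ne 'print if /P\.* *O\.* *B*O*X*/i'
done
# prints all four addresses
```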
Find a filename by wildcard
ls name* */name* */*/name* */*/*/name* */*/*/*/name* */*/*/*/*/name*

find . -name 'name*' -print
Find records of a certain date and time in the Sendmail log or Apache web log
grep "^Sep 7 11:56" maillog
grep "^[^ ]* [^ ]* [^ ]* \[16/Sep/2003:16:20" access_log
Matching Shell Globs as Regular Expressions
To convert from shell globbing to Perl, the differences lie in how the wildcard characters "?" and "*" are spelled ("." and ".*" in a RE), and the string must be forced to match from the very beginning to the very end (use of ^ and $). See below for some examples. (Thanks to O'Reilly's "Perl Cookbook," pages 180 and 181.)
Shell       Perl
list.?      ^list\..$
*           ^.*$
*.*         ^.*\..*$
project.*   ^project\..*$
*old        ^.*old$
type*.[ch]  ^type.*\.[ch]$
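One row of the table can be verified by feeding candidate filenames through perl; a minimal sketch:

```shell
# the RE ^type.*\.[ch]$ should select the same names as the glob type*.[ch]
for name in types.c type.h type.txt mytype.c; do
    echo "$name" | perl -ne 'print if /^type.*\.[ch]$/'
done
# prints: types.c
#         type.h
```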
The concept of quoting is not heavily used in REs, if at all. However, REs on the command line that contain quotes require escaping. The normal use of quotes on the command line without REs is covered first, followed by the use of quotes inside command line REs. Detailed usage of quotes on the command line may be found in the man pages for your shell.
Quotes include the following characters

"" '' ``

where a starting and ending quote are shown and a 'string' would occur inside them.
But in certain processes, like command line REs, quotes are used in a special way. Quotes used in a RE on the command line must be "escaped," or forced to be a "literal," to avoid being interpreted by the command line interpreter. The escaped RE quotes are then processed by the RE expansion algorithm.
Single and double quotes on the command line have different effects.
Single quotes prevent shell variable substitution and globbing on the command line.
For example:
> echo $PATH
/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin
> echo '$PATH'
$PATH

where the second echo command input was treated as a literal with no additional command line processing such as variable substitution. The dollar sign character is treated as a literal dollar sign, not as a control character indicating the start of a variable name.
Wild card characters are 'turned off' for both the globbing asterisk "*" and the RE ".*" wildcard.
> ls *.html
about.html event.html guestbook.html index.html photo.html
> ls '*.html'
ls: *.html: No such file or directory

where the second ls command filename input was treated as a literal with no additional command line processing such as globbing.
On the command line a pair of double quotes is used to group a single text string
that contains embedded spaces that would otherwise be parsed by the command line
interpreter into separate tokens, not a single text string.
For example:
> grep "expensive photo" event.html
For Sale: expensive photo dated 1801
> grep expensive photo event.html
grep: photo: No such file or directory

where the last grep command, without the double quotes, resulted in an error of filename 'photo' not found.
Strings contained inside double quotes on the command line will have both globbing and variable substitution performed on the string before the shell's interpreter sends the resulting string to the invoked 'command.'
In the example below the 'ls' command sees only the option "-l" and not the literal *.html, which will have been expanded by the shell interpreter to the string

about.html event.html guestbook.html index.html photo.html

before the 'ls' command is invoked. Thus, what the 'ls' command executable reads from the command line is this string:

-l about.html event.html guestbook.html index.html photo.html

> ls -l *.html
-rw-rw-r-- 1 pete pete  7439 Dec 18  2000 about.html
-rw-rw-r-- 1 pete pete  4339 Sep 15 15:22 event.html
-rw-rw-r-- 1 pete pete  4339 Aug  3 19:40 guestbook.html
-rw-rw-r-- 1 pete pete 10240 Mar 17  2003 index.html
-rw-rw-r-- 1 pete pete 23418 Apr 15  2002 photo.html

Care must be taken with globbing to not have the expanded string exceed the command line buffer size.
One might think that the backquote ` (often miscalled an apostrophe) is special like quotes. While a pair of backquotes "enclose" a string, that string is treated specially, depending upon the process handling the backquotes.

Inside a RE the backquote character has no special RE meaning. It is a literal backquote, and it does not have to be escaped unless the RE is on the command line.

On the command line the enclosed string is substituted, that is, the enclosed string is treated as a command and the STDOUT stream from the command replaces the enclosed string. The STDOUT stream will have its newline characters replaced with blanks.
For example:
`ls -1 *.html`

becomes

about.html event.html guestbook.html index.html photo.html

instead of

about.html
event.html
guestbook.html
index.html
photo.html

To see this effect try these commands:

ls *.html
ls -1 *.html
echo `ls *.html`
echo `ls -1 *.html`

where the last two 'echo' commands have identical output of a single line, while the first two 'ls' commands have different outputs of multiple lines, with either two or more filenames per line or, with the option -1, only one filename per line.
A Perl assignment statement with a string enclosed in backquotes runs the string as a shell command and sets the variable's value to the STDOUT. If the Perl variable is a scalar, the STDOUT is one string including the newlines. If the Perl variable is an array, the STDOUT is split into array elements at the newline characters.
Backquotes used inside awk or sed have no special meaning (as of the 1998 first version of this paper).
When using quotes inside a RE, the quote must be preceded with the escape character backslash "\" in order to force the command line interpreter to treat the quote as a literal. The interpreter will remove the escape backslash and pass the resulting RE string to the invoked command. Example:

sed -e "s/\"Book Title Here\"/<i>Book Title Here<\/i>/g" bookreview.html
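Run against a sample line, the escaped quotes survive the shell and reach sed as literal quotes; a minimal sketch (bookreview.html above stands in for whatever file you are editing):

```shell
echo 'She reviewed "Book Title Here" twice' |
  sed -e "s/\"Book Title Here\"/<i>Book Title Here<\/i>/g"
# prints: She reviewed <i>Book Title Here</i> twice
```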
RE resources come from four primary areas: man pages, perl POD, web pages and books. Some of each are below.
A complete set of RE manual pages is traditionally found in the

man vi

pages, and now more commonly in

man regex
Command line globbing rules are in the

man sh

pages or the shell man pages of your choice.
Manual pages for perl RE can be found linked from http://www.perl.com about 3-4 mouse clicks away or try the search.
Perl 5.8 Regular Expression POD resources are good reading for any type of RE:
and are installed with any port of Perl in perl/html/perlmain.html

This paper's URL will always be linked from http://peterbenjamin.com/seminars
A quick and easy to read RE web page is http://www.mit.edu/perl/perlre.html
The ORA Safari online, keyword searchable books.
O'Reilly RE book search (as there are more every year).
(c) 1998-2003 Copyright by Peter Benjamin. All rights reserved.