Regular Expressions

Regular Expressions
May 21, 1998
By Peter Benjamin

Prepared for a learning seminar Linux Users of Los Angeles LULA. The paper is not intended as a stand alone learning aid.

Regular Expressions has spread from the unix platform to all platforms and some high level languages (C and Java have classes for them).

The Perl 5 HTML Manual page URL is:

http://reference.perl.com/guides/perl5.html

and the RE page is at:

http://www.mit.edu/perl/perlre.html

Regular Expressions is commonly abbreviated as Regex, regexp, RE, re and these will be used in this paper.

What is a RE? Primarly a text string pattern matching method with extremely fancy wild cards, especially ranges of a variety of restrictive wild characters.

REs can used on the command line for globbing (filename expansion), and in unix filters like sed, awk and perl, and in text editors like vi, joe, emacs, textpad, and in high level languages like Basic, C, Fortran, and Java.

This paper will approach the use of RE's from the high ground.

From an abstract level a stream of data has patterns matched.
From a concept level a list of data is examined for matching text strings.
And at a detailed level a text string is examined for matching multiple patterns as defined by a data stream that alternates command and data streams.

To ease the understanding of RE a few technical terms will be introduced. The terms may appear to have no relationship to RE's but RE's are intimately, academically and theoretically based on the abstractions and concepts introduced below. Understanding these concepts will greatly ease reading and creating your own RE's.

EXCEPTIONS

Each RE definition or rule may have one "exceptions," but not more . There are common ones and uncommon ones. It is very important to know that many RE rules have exceptions. It is these exceptions that give RE's such power and flexibility.

Every "rule" below for RE's usually has an exception. It is outside the scope of this paper to document all of them. The more common ones are explained. And example of a RE rule with its exception would read like this familiar english spelling one:

"i" before "e" except after "c."

Mostly just learn one way of doing an RE and do not try any others and you will hardly ever need to worry about exceptions.

ABSTRACTIONS AND CONCEPTS

STREAM

A stream is data (a sequence of binary, hex, or character values) that is passed from one process to another process. A process can be the command line, a program, a file, or a monitor screen, to mention a few. Processes can also exist within another process. For example, in a high level language, an assignment statement within an if statement where the value of the assignment is passed to the if statement for testing. Here the word "process" is not used as the typical unix "process."

Also good to understand is that streams can be contained inside streams. In fact, that understanding is crucial for the rest of the paper.

STREAM: DATA, CONTROL or MIXED

A stream has two types, data and control commands. Data and control commands can range from binary bits to paragraphs.

A stream can be all data, all commands or mixed. When mixed the data and commands are typically rigorously parseable which determines what type the stream part is. The parsing may have to be done by a special process as the mixed stream may have positional type streams, that their type, data or control, is determine by context. Examples are presented below.

PARSE and TOKENS

To examine a string, frequently a text string, character by character and either add the current character to the last stream, start a new stream, or start a new stream with the current character. Remember a stream is either data or a control command. In the case of "parsing" these streams are frequently called "tokens".

ESCAPE CHARACTER

Frequently in a mixed stream an "escape" character is used to toggle or alternate from data to command. The escape could be a blank, for example on the command line. Or an escape character could be the backslash character "\" or its position may defined it as an escape character such as a starting quote, which with its matching closing quote defines a text string, but is not part of the string, nor a control command used later in the handling. In the case of the quote two escape characters are next to each other, the blank next to the quote is also an escape character, being neither data nor control (unless you look at if from certain processes like the "parser" which treats blanks as control commands, that is "end the existing field" commands.

NESTING

The concept of nesting is heavily used by power RE users. So a brief introduction is in order.

Certain characters used as escape or control characters come in pairs, a start character and an end character. These are actually quite easy to remember as they match the keyboard and your expectations as follows:

[] {} ()

One might think that angle brackets or greater than and less than characters <> is used as well. They are not (yet:).

In some RE cases there is no legal nesting of these characters when they are used as control characters. That is, it is not legal to use them inside each other.

LEGAL USES:   [] {} () [][]   [()]  [{}]  ({})  ([])

ILLEGAL:  {()}  {[]}

QUOTING

The concept of quoting is NOT heavily used in RE. But in certain processes, like the command line RE, the command line uses quotes in a special way, and thus when quotes are used in an RE on the command line the quote must be "escaped" or force to be a "literal" to be interpreted not by the command line interpreter but by the RE expansion algorithm. So a brief introduction is in order. The actual usage of quotes on the command line is outside the scope of this RE paper.

The special use of double and single quotes "" '' `` One might think that apostrophe ` is special like quotes, but while they "enclose" a string, that string is treated special, depending upon the process handling the apostrophe.

On the command line the enclosed string is substituted, that is the enclosed string is treated as a command and the STDOUT stream from the command replaces the enclosed string value.

Used inside perl

Used inside awk or sed no special meaning is used.

USING THE CONCEPTS TO LEARN THE DETAILS

The above concepts are now used to explain some stream details important to understand how to read and write RE's.

PATTERN

A pattern is what an RE expresses, that is a text string or set of text strings. It could a filename. Or a set of filenames. Something as simple as a partial filename followed by a wild card.

(Hmm, actually that is handled by "globbing," a RE like process used by the command line interpreter, as the wild card asterisk in that case is slightly different than a true RE asterisk meaning. It was stated there are exceptions... ;)

PATTERN MATCHING

Pattern matching is the process of comparing the pattern against an input data stream. It is why you write RE's in the first place to do. It is a form of data filtering.

PATTERN SUBSTITUTION

Pattern substitution is one of the most powerful and common uses of RE's. It uses the data filtering of the RE to select data streams for substitution, thus not all lines have replacements, only those selected.

HASH

Special high speed pattern matching method. Hashing methods are outside the scope of this RE paper.

DYNAMIC STREAM TYPES

Sometimes many "passes" are required at the stream such as the command line interpreter parses the user typed command text string (a simple data stream), then the command line interpreter treats the first "data" as a control command and finds that command/program within the PATH environment variable directories, invokes the program and passes the remainder of the command line to the program, which then accepts the input and scans it for more commands or "options" and data or option/argument values. For example:

prompt: perl -n -e "{ print; }" filename

contains the follow data and control streams browken down into three steps as handled by the three "processes" that handles the stream: 1) user, 2) Command line interpreter, and 3) program perl.

As type by the user

DATA: perl -n -e "{ print; }" filename
CONTROL: press "enter/return" key

As viewed by the command line interpreter

CONTROL: perl
DATA: -n -e "{ print; }" filename

As viewed by the perl command line option parser

CONTROL: -n
CONTROL: -e 
DATA: "{ print; }" 
DATA: filename

As viewed by the perl interpretator/compiler

CONTROL: "{ print; }"

The user types a data stream onto the command line and presses enter or generates a control character called "enter" at the end of the data stream to inform the next process to take control.

What type the parsed stream parts (tokens) are is not determined by the command line interpreter except the first token "perl." The remainder of the command line input is treated as data.

The program perl is then responsible for making a second pass at the data stream and assigning the stream type, control or data, based upon context. That is control and data can occur in any order and some control streams are required to be followed by a data stream, and some fields are optional and others are required (filename).

RE EXAMPLE

Here is a typical RE and its parts:

^[a-z]* +\(.*\) +[0-9]$

Keeping in mind that concepts just learned one knows that this RE is a mixed stream is consisting of data and commands and escape characters. One could guess that the "brackets" and "parens" are likely like quotes, that is there is a starting and and ending escape character that defines an internal text string.

CONTROL: ^      - when starting an RE means the pattern starts at the first character.
                - That is the character after the ^ must be the first character
MIXED:   [a-z]  - "one" character that matches the RANGE lower case letters a to z 
CONTROL: *      - the preceding character can occur zero, one or more times.
DATA:    " "    - a space or blank
CONTROL: +      - the preceding character can occur       one or more times.
MIXED:   \(.*\) - a "group"  THE MOST IMPORTANT CONSTRUCT FOR SUBSTITION!!!
DATA:    " "    - a space or blank
CONTROL: +      - the preceding character can occur       one or more times.
MIXED:   [0-9]  - "one" character that matches any the RANGE from 0 to 9
CONTROL: $      - when ending an RE means the pattern ends at the last character
                - That is the character before the $ must be the last character

Taking just the GROUP and breaking it down further:

BREAKDOWN: 
MIXED: \( - start a GROUP
MIXED: .* - zero or more characters (a COMMON pair or controls)
MIXED: \) - ends a GROUP


FURTHER BREAKDOWN:
ESCAPE:  \ - next character is treated as CONTROL
CONTROL: ( - start a GROUP
CONTROL: . - any character (the RE wild card)
CONTROL: * - the preceding character can occur zero, one or more times.
ESCAPE:  \ - next character is treated as CONTROL
CONTROL: ) - end a GROUP

NOTE: A paren with no preceding ESCAPE \ is data, just another character

EXCEPTION: Perl RE uses the opposite meanings!!!
	( ) - defines a GROUP
	\( \) - just data, normal start and end paren.

NOTE: Another perl exception starting with Perl 5 is:
      The At Sign "@" must be preceded with an escape backslash \,
      otherwise they are treated as an association array variable,
      which is outside the scope of this paper on RE's.

Taking the RANGE [a-z] and breaking it down further:

BREAKDOWN: 
CONTROL: [ - start a RANGE
DATA:    a - lower case a
CONTROL: - - a DASH isnside brackets is THE RANGE Character
DATA:    z - lower case z
CONTROL: ] - end the RANGE

Correcting a little white lie the RANGE character is ONLY the dash when used inside brackets and the dash is surrounded by "like" data in the proper ascending order. That is here are bad ranges:

z-a   8-4  E-A  a-Z     9-0  1-0

Note that all digits MUST be 0-9, not 1-0, as zero comes first.

Good ranges are:

a-z   4-8  A-E  a-zA-Z  0-9  0-1

Note the "order" of the characters surrounding the dash must be "increasing"

Note the range of upper and lower case letters where two ranges are side by side.

Note the range 0-1 has no digits inside the range but is still legal.

Remember the dash must be used inside brackets to be a RANGE.

To get an ordinary dash inside brackets either use an escaped dash using the backslash \- or start or end with a dash inside the brackets, that is [-aeiou] or [aeiou-].

RE parsing is VERY intelligent and will seldom produce an error message if the RE stream can be legally parsed. Of course, your RE is likely to give you unexpected results.

Continuing to correct the white lie the brackets express a RANGE of values for just ONE character in the PATTERN.

Now back to the RE example and the equivalent wording of

^[a-z]* +\(.*\) +[0-9]$

^[a-z]*  a string starting with zero or more lower case letters   
 +       followed by at least one blank
\(.*\)   grouping everything found before
 +       at least one or more blanks before
[0-9]$   a single trailing digit

When reading a RE remember that

\ is an escape character.
GROUPS are  - in Perl 5 GROUPS are ()
RANGES are [ ]
- inside a [] means a RANGE only if surrounded by similar character types in order.
^ and $ when at the start or end of the RE respectively mean the adjacent character must start or end the pattern respectively.
. is the WILD CARD
* is zero, one or more of the preceding character
+ is one or more of the preceding character

and that completes the basic set of RE's.

A complete set of RE manual pages is found in the

man vi

pages. Globbing rules are in the

man sh

pages.

Manual pages for perl RE can be found at http://www.perl.com or linked from there 2-3 clicks away and is now commonly distributed with Perl as web page perlexp.html.

A BRIEF LIST OF THE SPECIAL REGEX CHARACTERS

The following metacharacters have their standard egrep-ish meanings:

\        Quote the next metacharacter
^        Match the beginning of the line
.        Match any character (except newline)
$        Match the end of the line
|        Alternation
()       Grouping
[]       Character class

The following standard quantifiers are recognized:

*      Match 0 or more times
+      Match 1 or more times
?      Match 1 or 0 times
{n}    Match exactly n times
{n,}   Match at least n times
{n,m}  Match at least n but not more than m times


*?     Match 0 or more times
+?     Match 1 or more times
??     Match 0 or 1 time
{n}?   Match exactly n times
{n,}?  Match at least n times
{n,m}? Match at least n but not more than m times

Since patterns are processed as double quoted strings, the following also work:

\t        tab
\n        newline
\r        return
\f        form feed
\v        vertical tab, whatever that is
\a        alarm (bell)
\e        escape
\033      octal char
\x1b      hex char
\c[       control char
\l        lowercase next char
\u        uppercase next char
\L        lowercase till \E
\U        uppercase till \E
\E        end case modification
\Q        quote regexp metacharacters till \E

In addition, Perl defines the following:

\w        Match a "word" character (alphanumeric plus "_")
\W        Match a non-word character
\s        Match a whitespace character
\S        Match a non-whitespace character
\d        Match a digit character
\D        Match a non-digit character

Note that \w matches a single alphanumeric character, not a whole word. To match a word you'd need to say \w+. You may use \w, \W, \s, \S, \d and \D within character classes (though not as either end of a range).

Perl defines the following zero-width assertions:

\b        Match a word boundary
\B        Match a non-(word boundary)
\A        Match only at beginning of string
\Z        Match only at end of string
\G        Match only where previous m//g left off