Chapter 2: Matching and Substitution

Regular Expressions

In the previous example, the expression /success/ is a very simple instance of a more general concept: the "regular expression".

All pattern matching in Perl is based on this concept of regular expressions. Regular expressions are an important part of computer science, and entire books are devoted to the topic. Regular expressions form a standard way of expressing almost any text pattern unambiguously.

Luckily, for our purposes regular expressions can start out simple. Once you have mastered these key concepts, you'll be able to learn more on your own through the many online resources. One to note is https://perldoc.perl.org/perlop.html, the perlop (Perl operators) documentation, which describes the matching and substitution operators in detail. Look in particular at m// and s///. Another is https://perldoc.perl.org/perlre.html, which discusses regular expressions in great detail.

The power of regular expressions starts to become clear when you discover that they can represent not only words and phrases but also far more general patterns of text.

Note: In many of the following examples of pattern matching, only the pattern itself is shown. To put any of these patterns into effect you still have to use a variable and the binding operator (=~). Usually. ;-)
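As a reminder of what the complete form looks like, here is a minimal sketch. The variable name $status and its contents are made up for illustration. The second half shows one reading of the "Usually" above: when no binding operator is given, Perl tests the pattern against the default variable $_.

```perl
use strict;
use warnings;

my $status = "The operation was a success";   # hypothetical sample string

# Explicit form: variable, binding operator (=~), then the pattern.
my $explicit = ($status =~ /success/) ? "yes" : "no";

# Implicit form: with no binding operator, the pattern is tested against $_.
# The for loop sets $_ to $status for the duration of the block.
my $implicit;
for ($status) {
    $implicit = /success/ ? "yes" : "no";
}

print "explicit: $explicit, implicit: $implicit\n";
```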

Plain Character Expressions

Many letters and characters can represent themselves in a matching pattern, so often just the plain word by itself will act as a regular expression. E.g. /success/, /failure/, and /nearly all plain text/ are all pattern matches that are very straightforward in their meaning.

It's important to note that without any additional qualification, these search patterns can occur anywhere in the string being searched, so /success/ would match any of the strings: "success", "This sentence contains success", and "unsuccessful". Many times, plain vanilla search patterns like this are adequate for the job. Virtually any plain English word without any punctuation can be used as a regular expression to represent itself as a search pattern.
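The "anywhere in the string" behaviour can be checked with a short script. This is only a sketch: the helper name contains_success and the extra test string "failure" are assumptions added for illustration.

```perl
use strict;
use warnings;

# Returns 1 if the plain pattern /success/ occurs anywhere in $text, else 0.
sub contains_success {
    my ($text) = @_;
    return ($text =~ /success/) ? 1 : 0;
}

for my $text ("success", "This sentence contains success", "unsuccessful", "failure") {
    my $verdict = contains_success($text) ? "match" : "no match";
    print "\"$text\": $verdict\n";
}
```

The first three strings report a match, because the pattern is found somewhere inside each of them. Only "failure" does not.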

Special Characters

Some special characters or combinations of characters have a special meaning and do not represent themselves. This is what gives regular expressions their power. For example, the lowly period does not stand for a period in a match. Instead, it stands for any single character (except, by default, a newline).

The pattern /b.g/ would match "bag", "big", "bug", etc., as well as any other three-character sequence of that shape: "b2g", "b&g", "b]g" and so on. It would also match the literal string "b.g", because a period is itself a character. /b.g/ would also match longer strings that contain such a sequence: "bigger", "bug swatter".
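A quick way to see what the period does is to filter a list of candidates through the pattern. The list below is a made-up sample. The last two entries are there to show what does not match: "bg" has nothing between the b and the g, and by default the period does not match a newline.

```perl
use strict;
use warnings;

# Hypothetical candidate strings; the last two are expected to fail.
my @candidates = ("bag", "b2g", "b.g", "bigger", "bug swatter", "bg", "b\ng");

# grep keeps only the elements for which the pattern matches $_.
my @matched = grep { /b.g/ } @candidates;

print join(", ", @matched), "\n";
```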

Matching simply means "found somewhere, anywhere, within the searched string". You can use special characters to specify the position where the search pattern must be located.

A ^ character stands for the beginning of the searched string, so:

/^success/ would match "success" but not "unsuccessful".

A $ character stands for the end of the searched string, so:

/success$/ would match "unsuccess" but not "successful".

Using both ^ and $ together nails the pattern down at both ends, so:

/^success$/ will only match the exact string "success".
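The three anchored patterns can be compared side by side. This is a small sketch. The helper name anchor_matches is an assumption, and "unsuccessful" is added as a string that none of the three patterns match.

```perl
use strict;
use warnings;

# Returns the list of anchored patterns (as text) that match $s.
sub anchor_matches {
    my ($s) = @_;
    my @hits;
    push @hits, '/^success/'  if $s =~ /^success/;   # must start the string
    push @hits, '/success$/'  if $s =~ /success$/;   # must end the string
    push @hits, '/^success$/' if $s =~ /^success$/;  # must be the whole string
    return @hits;
}

for my $s ("success", "unsuccess", "successful", "unsuccessful") {
    my @hits = anchor_matches($s);
    print "$s: ", (@hits ? join(", ", @hits) : "none"), "\n";
}
```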

Other special characters include:

\ - a form of a "quote" character
| - alternation, used for "or'ing"
() - grouping matched elements
[] - character class

The first character, "\", is used in combination with special characters to take away their special meaning. E.g.:

\. will match a period
\$ will match a dollar sign
\^ will match a caret
\\ will match a backslash

and so on.
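Here is a short sketch of these escapes in action. The sample strings are invented for illustration. Note that they are written in single quotes, so that Perl keeps the $ and \ characters in the strings exactly as typed.

```perl
use strict;
use warnings;

# Hypothetical sample strings.
my $price = 'Total: $5.00';
my $path  = 'C:\temp';
my $power = '2^8';

my %matched = (
    period        => ($price =~ /5\.00/)    ? 1 : 0,  # \. is a literal period
    dollar        => ($price =~ /\$5/)      ? 1 : 0,  # \$ is a literal dollar sign
    caret         => ($power =~ /2\^8/)     ? 1 : 0,  # \^ is a literal caret
    backslash     => ($path  =~ /C:\\temp/) ? 1 : 0,  # \\ is a literal backslash
    unescaped_dot => ('5x00' =~ /5.00/)     ? 1 : 0,  # plain . accepts the x
    escaped_dot   => ('5x00' =~ /5\.00/)    ? 1 : 0,  # \. does not
);

print "$_: $matched{$_}\n" for sort keys %matched;
```

Every entry comes out as 1 except escaped_dot, which is 0 because "5x00" contains no period.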

The pipe symbol "|" is used to provide alternatives:

/good|bad/ will match either "good vibes" or "bad karma".

The parentheses group matched elements, so

/(good|bad) example/

is the same as searching simultaneously for

/good example/ or
/bad example/

Without the ()'s, this would be the same as searching simultaneously for

/good/ or
/bad example/
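The difference the parentheses make shows up as soon as you try both patterns on the same phrases. The following is a sketch only, and the helper name grouping_results is an assumption.

```perl
use strict;
use warnings;

# Returns two flags: does the grouped pattern match, does the ungrouped one?
sub grouping_results {
    my ($phrase) = @_;
    my $grouped   = ($phrase =~ /(good|bad) example/) ? 1 : 0;
    my $ungrouped = ($phrase =~ /good|bad example/)   ? 1 : 0;
    return ($grouped, $ungrouped);
}

for my $phrase ("good example", "bad example", "good vibes", "bad karma") {
    my ($grouped, $ungrouped) = grouping_results($phrase);
    print "$phrase: with ()=$grouped, without ()=$ungrouped\n";
}
```

"good vibes" is the telling case: the grouped pattern rejects it, but the ungrouped pattern accepts it because the left alternative is just /good/.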

The square brackets indicate a class of characters, so

/^[abcdefg]/ would match any string beginning with one of the letters a through g. This can also be written in shorthand as /^[a-g]/.
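To confirm that the long and short forms of the class behave the same way, you can run both over a list of words. The word list is a made-up sample. "Banana" is included to show that matching is case-sensitive, so a capital B is not in the class.

```perl
use strict;
use warnings;

my @words = ("apple", "grape", "hotel", "Banana", "zebra");   # hypothetical sample

# The same character class written out in full and as a range.
my @long_form  = grep { /^[abcdefg]/ } @words;
my @short_form = grep { /^[a-g]/ }     @words;

print "long form:  @long_form\n";
print "short form: @short_form\n";
```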

Special Backslash Combination Characters

The backslash character is not just used to "quote metacharacters" (in other words, to remove their special meaning) as above. It is also used in conjunction with non-special characters to give them a special meaning. For instance:

\t is a tab character
\n is a newline character
\d is any digit
\D is any non-digit
\s is any whitespace character
\S is any non-whitespace character

You'll find yourself using these backslash combinations a lot in practice.
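As a rough illustration, the sketch below tests several of these combinations against one sample line. The line itself and the hash key names are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical sample line: a word, a tab, a number, a space, a word, a newline.
my $line = "Order\t42 shipped\n";

my %found = (
    tab             => ($line =~ /\t/)     ? 1 : 0,
    newline         => ($line =~ /\n/)     ? 1 : 0,
    two_digits      => ($line =~ /\d\d/)   ? 1 : 0,
    space_then_word => ($line =~ /\s\S/)   ? 1 : 0,
    three_digits    => ($line =~ /\d\d\d/) ? 1 : 0,  # only two digits present
    digits_only     => ("42" =~ /\D/)      ? 0 : 1,  # no non-digit found in "42"
);

print "$_: $found{$_}\n" for sort keys %found;
```

Everything is found except three_digits, since the line contains only the two digits of 42.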

Repetition Characters

The expressions above show you how to match certain characters, but they don't let you control how many times in a row something must appear. Repetition is controlled by a few other special characters, each of which applies to the character or group immediately before it:

+ means 1 or more matches
* means 0 or more matches
? means 0 or 1 matches
{n} means exactly n matches
{m,n} means m to n matches
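Here is a small sketch of the repetition characters at work. Each one is applied to the letter b, with ^ and $ pinning the pattern to the whole string so that the count really matters. The helper name matching_quantifiers and the test strings are assumptions for illustration.

```perl
use strict;
use warnings;

# Returns the repetition forms that accept $s as "a, some b's, then c".
sub matching_quantifiers {
    my ($s) = @_;
    my @hits;
    push @hits, 'b+'     if $s =~ /^ab+c$/;      # 1 or more b's
    push @hits, 'b*'     if $s =~ /^ab*c$/;      # 0 or more b's
    push @hits, 'b?'     if $s =~ /^ab?c$/;      # 0 or 1 b
    push @hits, 'b{3}'   if $s =~ /^ab{3}c$/;    # exactly 3 b's
    push @hits, 'b{1,2}' if $s =~ /^ab{1,2}c$/;  # 1 to 2 b's
    return @hits;
}

for my $s ("ac", "abc", "abbc", "abbbc") {
    print "$s: ", join(" ", matching_quantifiers($s)), "\n";
}
```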

The best way to learn regular expressions is by example, so let's go on to see how these amazing things can be put to work together in the next section.

Mike Gossland's Perl Tutorial Course for Windows
