Regular Expressions
The patterns used in pattern matching are regular expressions such as those
supplied in the Version 8 regexp routines. (The routines are, in fact,
distantly derived from Henry Spencer's freely redistributable
reimplementation of the V8 routines.) See the section on Version 8 Regular
Expressions for details.
In particular the following metacharacters have their standard egrep-ish
meanings:
    \     Quote the next metacharacter
    ^     Match the beginning of the line
    .     Match any character (except newline)
    $     Match the end of the line (or before newline at the end)
    |     Alternation
    ()    Grouping
    []    Character class
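As a rough illustration of the metacharacters listed above, the following sketch classifies a few sample strings. The sample strings and the labels ("animal", "ends-in-dot", "vowel-then-t") are made up for this example.

```perl
use strict;
use warnings;

my @results;
for my $line ("cat", "dog.", "caterpillar") {
    my @hits;
    # ^ anchors at the start; () groups the | alternatives
    push @hits, "animal"       if $line =~ /^(cat|dog)/;
    # \ quotes the "." so it matches a literal dot; $ anchors at the end
    push @hits, "ends-in-dot"  if $line =~ /\.$/;
    # [] matches any one of the listed characters
    push @hits, "vowel-then-t" if $line =~ /[aeiou]t/;
    push @results, $line . ": " . join(" ", @hits);
}
print "$_\n" for @results;
# cat: animal vowel-then-t
# dog.: animal ends-in-dot
# caterpillar: animal vowel-then-t
```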
By default, the "^" character is guaranteed to match only at the beginning
of the string and the "$" character only at the end (or before the newline
at the end), and Perl does certain optimizations on the assumption that the
string contains only one line. Neither "^" nor "$" will match at an
embedded newline. You may, however, wish to treat a string as a multi-line
buffer, such that "^" will match after any newline within the string,
and "$" will match before any newline. At the cost of a little more
overhead, you can do this by using the /m modifier on the pattern match
operator. (Older programs did this by setting $*, but this practice is now
deprecated.)
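A minimal sketch of the difference /m makes, using a made-up three-line string. The variable names are hypothetical.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\nthird line\n";

# Without /m, "^" can match only at the very start of the string.
my @default = $text =~ /^(\w+)/g;

# With /m, "^" also matches just after each embedded newline ...
my @starts = $text =~ /^(\w+)/mg;

# ... and "$" matches just before each newline.
my @ends = $text =~ /(\w+)$/mg;

print "default: @default\n";   # default: first
print "starts:  @starts\n";    # starts:  first second third
print "ends:    @ends\n";      # ends:    line line line
```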
To facilitate multi-line substitutions, the "." character never matches a
newline unless you use the /s modifier, which in effect tells Perl to
pretend the string is a single line--even if it isn't. The /s modifier
also overrides the setting of $*, in case you have some (badly behaved)
older code that sets it in another module.
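The effect of /s on "." can be sketched as follows. The string and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $text = "BEGIN\nbody\nEND";

# Without /s, "." stops at the newline, so the match fails.
my $without_s = $text =~ /BEGIN.*END/ ? 1 : 0;

# With /s, "." matches newlines too, so the pattern spans all three lines.
my $with_s = $text =~ /BEGIN.*END/s ? 1 : 0;

# Capture what lies between the two marker lines.
my ($body) = $text =~ /BEGIN\n(.*)\nEND/s;

print "without /s: $without_s\n";   # without /s: 0
print "with /s:    $with_s\n";      # with /s:    1
print "body:       $body\n";        # body:       body
```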
The following standard quantifiers are recognized:
    *      Match 0 or more times
    +      Match 1 or more times
    ?      Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times
(If a curly bracket occurs in any other context, it is treated as a regular
character.) The "*" quantifier is equivalent to {0,}, the "+" quantifier to
{1,}, and the "?" quantifier to {0,1}. n and m are limited to integral
values less than 65536.
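A short sketch of the counted quantifiers, filtering a made-up list of digit strings. The anchors "^" and "$" force the whole string to be matched.

```perl
use strict;
use warnings;

my @candidates = ("7", "42", "123", "1234", "12345");

my @exactly_3   = grep { /^\d{3}$/ }   @candidates;   # exactly three digits
my @at_least_3  = grep { /^\d{3,}$/ }  @candidates;   # three or more digits
my @two_to_four = grep { /^\d{2,4}$/ } @candidates;   # two to four digits

# "+" behaves like {1,}: both reject the empty string, both accept "aaa".
my $plus_same = (("aaa" =~ /^a+$/) && ("aaa" =~ /^a{1,}$/)
              && ("" !~ /^a+$/)    && ("" !~ /^a{1,}$/)) ? 1 : 0;

print "exactly 3:   @exactly_3\n";     # exactly 3:   123
print "at least 3:  @at_least_3\n";    # at least 3:  123 1234 12345
print "two to four: @two_to_four\n";   # two to four: 42 123 1234
print "plus same:   $plus_same\n";     # plus same:   1
```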
By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
allowing the rest of the pattern to match. If you want it to match the
minimum number of times possible, follow the quantifier with a "?". Note
that the meanings don't change, just the "greediness":
    *?     Match 0 or more times
    +?     Match 1 or more times
    ??     Match 0 or 1 time
    {n}?   Match exactly n times
    {n,}?  Match at least n times
    {n,m}? Match at least n but not more than m times
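To make the greedy/non-greedy distinction concrete, here is a small sketch on invented input. Each pair of patterns differs only by the trailing "?".

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: ".+" grabs as much as it can and gives back only what is
# needed for the final ">" to match.
my ($greedy) = $html =~ /<(.+)>/;

# Non-greedy: ".+?" stops at the first ">" that lets the match succeed.
my ($lazy) = $html =~ /<(.+?)>/;

# The same contrast with a counted quantifier.
my ($greedy_digits) = "12345" =~ /(\d{2,4})/;
my ($lazy_digits)   = "12345" =~ /(\d{2,4}?)/;

print "greedy: $greedy\n";          # greedy: b>bold</b> and <i>italic</i
print "lazy:   $lazy\n";            # lazy:   b
print "greedy digits: $greedy_digits\n";   # greedy digits: 1234
print "lazy digits:   $lazy_digits\n";     # lazy digits:   12
```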
Because patterns are processed as double quoted strings, the following also
work:
    \t     tab                   (HT, TAB)
    \n     newline               (LF, NL)
    \r     return                (CR)
    \f     form feed             (FF)
    \a     alarm (bell)          (BEL)
    \e     escape (think troff)  (ESC)
    \033   octal char (think of a PDP-11)
    \x1B   hex char
    \c[    control char
    \l     lowercase next char (think vi)
    \u     uppercase next char (think vi)
    \L     lowercase till \E (think vi)
    \U     uppercase till \E (think vi)
    \E     end case modification (think vi)
    \Q     quote regexp metacharacters till \E
If use locale is in effect, the case map used by \l, \L, \u and \U is
taken from the current locale. See the perllocale manpage.
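A brief sketch of a few of these escapes, assuming an ASCII platform and no use locale. The variables $name and $price are hypothetical. The \Q example shows why quoting matters when a variable is interpolated into a pattern.

```perl
use strict;
use warnings;

my $name = "perl";
my $capitalized = "\u$name";       # uppercase the next character only
my $shouted     = "\U$name\E!";    # uppercase until \E
my $lower_first = "\lPERL";        # lowercase the next character only

# Without \Q, the "." in $price is a metacharacter and matches the "x".
my $price    = "3.50";
my $unquoted = "3x50" =~ /^$price$/     ? 1 : 0;
# With \Q...\E, the "." is quoted and must match a literal dot.
my $quoted   = "3x50" =~ /^\Q$price\E$/ ? 1 : 0;

# On ASCII platforms these four all denote the same ESC character.
my $same_esc = ("\e" eq "\033" && "\033" eq "\x1B" && "\x1B" eq "\c[") ? 1 : 0;

print "$capitalized $shouted $lower_first\n";   # Perl PERL! pERL
print "unquoted: $unquoted\n";                  # unquoted: 1
print "quoted:   $quoted\n";                    # quoted:   0
print "same esc: $same_esc\n";                  # same esc: 1
```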
In addition, Perl defines the following:
    \w     Match a "word" character (alphanumeric plus "_")
    \W     Match a non-word character
    \s     Match a whitespace character
    \S     Match a non-whitespace character
    \d     Match a digit character
    \D     Match a non-digit character
Note that \w matches a single alphanumeric character, not a whole word. To
match a word you'd need to say \w+. If use locale is in effect, the list
of alphabetic characters generated by \w is taken from the current locale.
See the perllocale manpage. You may use \w, \W, \s, \S, \d, and \D within
character classes (though not as either end of a range).
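The following sketch applies these classes to an invented log line. It shows \w+ matching whole words and \d used inside a character class.

```perl
use strict;
use warnings;

my $line = "user_42 logged in at 10:15";

# \w matches one word character; \w+ matches a run of them.
my @words = $line =~ /(\w+)/g;

# \d+ picks out runs of digits, including those inside "user_42".
my @numbers = $line =~ /(\d+)/g;

# \s+ as a field separator.
my @fields = split /\s+/, $line;

# \d inside a character class, alongside a literal colon.
my ($time) = $line =~ /([\d:]+)$/;

print join(",", @words), "\n";     # user_42,logged,in,at,10,15
print join(",", @numbers), "\n";   # 42,10,15
print scalar(@fields), "\n";       # 5
print "$time\n";                   # 10:15
```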
Perl defines the following zero-width assertions:
    \b     Match a word boundary
    \B     Match a non-(word boundary)
    \A     Match at only beginning of string
    \Z     Match at only end of string (or before newline at the end)
    \G     Match only where previous m//g left off (works only with /g)
A word boundary (\b) is defined as a spot between two characters that has a
\w on one side of it and a \W on the other side of it (in either order),
counting the imaginary characters off the beginning and end of the string
as matching a \W. (Within character classes \b represents backspace rather
than a word boundary.) The \A and \Z are just like "^" and "$" except that
they won't match multiple times when the /m modifier is used, while "^" and
"$" will match at every internal line boundary. To match the actual end of
the string, not ignoring newline, you can use \Z(?!\n). The \G assertion
can be used to chain global matches (using m//g), as described in the
section on Regexp Quote-Like Operators in the perlop manpage.
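The assertions described above can be sketched as follows. All strings and variable names are invented for the example; the \G loop is only one possible way to chain matches.

```perl
use strict;
use warnings;

# \b: "cat" as a whole word. The "cat" inside "concat" is skipped;
# the one in "cat's" counts because "'" is a \W character.
my $word_count = () = "cat concat cat's\n" =~ /\bcat\b/g;

# \Z tolerates a trailing newline; \Z(?!\n) does not.
my $z_loose  = "line\n" =~ /line\Z/       ? 1 : 0;
my $z_strict = "line\n" =~ /line\Z(?!\n)/ ? 1 : 0;
my $z_exact  = "line"   =~ /line\Z(?!\n)/ ? 1 : 0;

# Under /m, "^" matches at every line start but \A only at the string start.
my @caret = "one\ntwo\n" =~ /^(\w+)/mg;
my @big_a = "one\ntwo\n" =~ /\A(\w+)/mg;

# \G: each match must start exactly where the previous one ended,
# so the loop stops at "abc" rather than skipping ahead to "56".
my $input = "12 34 abc 56";
my @nums;
push @nums, $1 while $input =~ /\G\s*(\d+)/g;

print "words: $word_count\n";                        # words: 2
print "loose/strict/exact: $z_loose $z_strict $z_exact\n";
                                                     # loose/strict/exact: 1 0 1
print "caret: @caret\n";                             # caret: one two
print "big_a: @big_a\n";                             # big_a: one
print "nums:  @nums\n";                              # nums:  12 34
```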