Here is a complete description of the regexp pattern language recognized by the pregexp procedures.
The assertions ^ and $ identify the beginning and the end of the text string respectively. They ensure that their adjoining regexps match at one or other end of the text string.
Examples:
(pregexp-match-positions "^contact" "first contact") => #f
The regexp fails to match because contact does not occur at the beginning of the text string.
(pregexp-match-positions "laugh$" "laugh laugh laugh laugh") => ((18 . 23))
The regexp matches the last laugh.
The metasequence \b asserts that a word boundary exists.
(pregexp-match-positions "yack\\b" "yackety yack") => ((8 . 12))
The yack in yackety doesn’t end at a word boundary so it isn’t matched. The second yack does and is.
The metasequence \B has the opposite effect to \b. It asserts that a word boundary does not exist.
(pregexp-match-positions "an\\B" "an analysis") => ((3 . 5))
The an that doesn’t end in a word boundary is matched.
Typically a character in the regexp matches the same character in the text string. Sometimes it is necessary or convenient to use a regexp metasequence to refer to a single character. Thus, metasequences \n, \r, \t, and \. match the newline, return, tab and period characters respectively.
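The single-character metasequences can be tried out as follows. This is a minimal sketch: it assumes the pregexp procedures can be loaded from a file named pregexp.scm in the current directory (the file location is an assumption), and it builds the tab character with integer->char so that it does not rely on a #\tab literal.

```scheme
(load "pregexp.scm") ; assumption: pregexp.scm is in the current directory

;; \. matches only a literal period, unlike the bare metacharacter .
(pregexp-match "p\\.t" "p.t")   ; => ("p.t")
(pregexp-match "p\\.t" "pat")   ; => #f

;; \n and \t match the newline and tab characters in the text string
(define tab-char (integer->char 9))
(pregexp-match-positions "\\n" (string #\a #\newline #\b)) ; => ((1 . 2))
(pregexp-match-positions "\\t" (string #\a tab-char #\b))  ; => ((1 . 2))
```

As with the other metasequences, each backslash is doubled because the regexp is written as a Scheme string.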
The metacharacter period (.) matches any character other than newline.
(pregexp-match "p.t" "pet") => ("pet")
It also matches pat, pit, pot, put, and p8t but not peat or pfffft.
A character class matches any one character from a set of characters. A typical format for this is the bracketed character class [...], which matches any one character from the non-empty sequence of characters enclosed within the brackets (see footnote 2). Thus "p[aeiou]t" matches pat, pet, pit, pot, put and nothing else.
Inside the brackets, a hyphen (-) between two characters specifies the ASCII range between the characters. Eg, "ta[b-dgn-p]" matches tab, tac, tad, tag, tan, tao, and tap.
An initial caret (^) after the left bracket inverts the set specified by the rest of the contents, ie, it specifies the set of characters other than those identified in the brackets. Eg, "do[^g]" matches all three-character sequences starting with do except dog.
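The range and inversion examples given in prose above can be checked directly. A minimal sketch, assuming pregexp.scm is loadable from the current directory (the file location is an assumption):

```scheme
(load "pregexp.scm") ; assumption: pregexp.scm is in the current directory

;; [b-dgn-p] accepts b, c, d, g, n, o, p and nothing else
(pregexp-match "ta[b-dgn-p]" "tag")  ; => ("tag")
(pregexp-match "ta[b-dgn-p]" "tae")  ; => #f  (e lies outside both ranges)

;; [^g] accepts any single character except g
(pregexp-match "do[^g]" "dot")       ; => ("dot")
(pregexp-match "do[^g]" "dog")       ; => #f
```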
Note that the metacharacter ^ inside brackets means something quite different from what it means outside. Most other metacharacters (., *, +, ?, etc) cease to be metacharacters when inside brackets, although you may still escape them for peace of mind. The hyphen (-) is a metacharacter only when it’s inside brackets, and neither the first nor the last character.
Bracketed character classes cannot contain other bracketed character classes (although they can contain certain other types of character classes; see below). Thus a left bracket ([) inside a bracketed character class doesn’t have to be a metacharacter; it can stand for itself. Eg, "[a[b]" matches a, [, and b.
Furthermore, since empty bracketed character classes are disallowed, a right bracket (]) immediately occurring after the opening left bracket also doesn’t need to be a metacharacter. Eg, "[]ab]" matches ], a, and b.
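The rules for literal brackets and hyphens inside a bracketed class can be seen in the following minimal sketch, which assumes pregexp.scm is loadable from the current directory (the file location is an assumption). Each class is quantified with + so that the match shows exactly which characters the class accepts.

```scheme
(load "pregexp.scm") ; assumption: pregexp.scm is in the current directory

;; a left bracket inside the class stands for itself
(pregexp-match "[a[b]+" "[ab]")  ; => ("[ab")

;; a right bracket directly after the opening bracket stands for itself
(pregexp-match "[]ab]+" "[ab]")  ; => ("ab]")

;; a hyphen in last position is an ordinary character, not a range marker
(pregexp-match "[a-]+" "b-a-c")  ; => ("-a-")
```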
Some standard character classes can be conveniently represented as metasequences instead of as explicit bracketed expressions. \d matches a digit ([0-9]); \s matches a whitespace character; and \w matches a character that could be part of a “word” (see footnote 3). The upper-case versions of these metasequences stand for the inversions of the corresponding character classes. Thus \D matches a non-digit, \S a non-whitespace character, and \W a non-“word” character.
Remember to include a double backslash when putting these metasequences in a Scheme string:
(pregexp-match "\\d\\d" "0 dear, 1 have 2 read catch 22 before 9") => ("22")
These character classes can be used inside a bracketed expression. Eg, "[a-z\\d]" matches a lower-case letter or a digit.
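A short illustration of metasequences used inside brackets follows. It is a sketch only, assuming pregexp.scm is loadable from the current directory (the file location is an assumption).

```scheme
(load "pregexp.scm") ; assumption: pregexp.scm is in the current directory

;; lower-case letters or digits; the upper-case words are skipped
(pregexp-match "[a-z\\d]+" "ABC x9 DEF")  ; => ("x9")

;; whitespace or comma
(pregexp-match "[\\s,]+" "a, b")          ; => (", ")
```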
A POSIX character class is a special metasequence of the form [:...:] that can be used only inside a bracketed expression. The POSIX classes supported are
[:alnum:] | letters and digits |
[:alpha:] | letters |
[:algor:] | the letters c, h, a and d |
[:ascii:] | 7-bit ascii characters |
[:blank:] | widthful whitespace, ie, space and tab |
[:cntrl:] | “control” characters, viz, those with code < 32 |
[:digit:] | digits, same as \d |
[:graph:] | characters that use ink |
[:lower:] | lower-case letters |
[:print:] | ink-users plus widthful whitespace |
[:space:] | whitespace, same as \s |
[:upper:] | upper-case letters |
[:word:] | letters, digits, and underscore, same as \w |
[:xdigit:] | hex digits |
For example, the regexp "[[:alpha:]_]" matches a letter or underscore.
(pregexp-match "[[:alpha:]_]" "--x--") => ("x")
(pregexp-match "[[:alpha:]_]" "--_--") => ("_")
(pregexp-match "[[:alpha:]_]" "--:--") => #f
The POSIX class notation is valid only inside a bracketed expression. For instance, [:alpha:], when not inside a bracketed expression, will not be read as the letter class. Rather it is (from previous principles) the character class containing the characters :, a, l, p, h.
(pregexp-match "[:alpha:]" "--a--") => ("a")
(pregexp-match "[:alpha:]" "--_--") => #f
By placing a caret (^) immediately after [:, you get the inversion of that POSIX character class. Thus, [:^alpha:] is the class containing all characters except the letters.
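The inverted POSIX class can be compared with its uninverted form in a minimal sketch like the one below. It assumes pregexp.scm is loadable from the current directory (the file location is an assumption); note that the inverted class, like any POSIX class, still has to sit inside a bracketed expression.

```scheme
(load "pregexp.scm") ; assumption: pregexp.scm is in the current directory

;; letters only
(pregexp-match "[[:alpha:]]+" "abc123def")   ; => ("abc")

;; everything except letters
(pregexp-match "[[:^alpha:]]+" "abc123def")  ; => ("123")
```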
The quantifiers *, +, and ? match respectively: zero or more, one or more, and zero or one instances of the preceding subpattern.
(pregexp-match-positions "c[ad]*r" "cadaddadddr") => ((0 . 11))
(pregexp-match-positions "c[ad]*r" "cr") => ((0 . 2))
(pregexp-match-positions "c[ad]+r" "cadaddadddr") => ((0 . 11))
(pregexp-match-positions "c[ad]+r" "cr") => #f
(pregexp-match-positions "c[ad]?r" "cadaddadddr") => #f
(pregexp-match-positions "c[ad]?r" "cr") => ((0 . 2))
(pregexp-match-positions "c[ad]?r" "car") => ((0 . 3))
You can use braces to specify much finer-tuned quantification than is possible with *, +, ?.
The quantifier {m} matches exactly m instances of the preceding subpattern. m must be a nonnegative integer.
The quantifier {m,n} matches at least m and at most n instances. m and n are nonnegative integers with m <= n. You may omit either or both numbers, in which case m defaults to 0 and n to infinity.
It is evident that + and ? are abbreviations for {1,} and {0,1} respectively. * abbreviates {,}, which is the same as {0,}.
(pregexp-match "[aeiou]{3}" "vacuous") => ("uou")
(pregexp-match "[aeiou]{3}" "evolve") => #f
(pregexp-match "[aeiou]{2,3}" "evolve") => #f
(pregexp-match "[aeiou]{2,3}" "zeugma") => ("eu")
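The brace equivalents of *, + and ? can be checked against the earlier cadr-style examples. This is a minimal sketch that assumes pregexp.scm is loadable from the current directory (the file location is an assumption); it uses only the explicit forms {0,}, {1,} and {0,1}.

```scheme
(load "pregexp.scm") ; assumption: pregexp.scm is in the current directory

;; {1,} behaves like +
(pregexp-match-positions "c[ad]{1,}r" "cadaddadddr")  ; => ((0 . 11))
(pregexp-match-positions "c[ad]{1,}r" "cr")           ; => #f

;; {0,1} behaves like ?
(pregexp-match-positions "c[ad]{0,1}r" "car")         ; => ((0 . 3))
(pregexp-match-positions "c[ad]{0,1}r" "cadaddadddr") ; => #f

;; {0,} behaves like *
(pregexp-match-positions "c[ad]{0,}r" "cr")           ; => ((0 . 2))
```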
The quantifiers described above are greedy, ie, they match the maximal number of instances that would still lead to an overall match for the full pattern.
(pregexp-match "<.*>" "<tag1> <tag2> <tag3>") => ("<tag1> <tag2> <tag3>")
To make these quantifiers non-greedy, append a ? to them. Non-greedy quantifiers match the minimal number of instances needed to ensure an overall match.
(pregexp-match "<.*?>" "<tag1> <tag2> <tag3>") => ("<tag1>")
The non-greedy quantifiers are respectively: *?, +?, ??, {m}?, {m,n}?. Note the two uses of the metacharacter ?: as a quantifier in its own right, and as a suffix that makes another quantifier non-greedy.
Clustering, ie, enclosure within parens (...), identifies the enclosed subpattern as a single entity. It causes the matcher to capture the submatch, or the portion of the string matching the subpattern, in addition to the overall match.
(pregexp-match "([a-z]+) ([0-9]+), ([0-9]+)" "jan 1, 1970") => ("jan 1, 1970" "jan" "1" "1970")
Clustering also causes a following quantifier to treat the entire enclosed subpattern as an entity.
(pregexp-match "(poo )*" "poo poo platter") => ("poo poo " "poo ")
The number of submatches returned is always equal to the number of subpatterns specified in the regexp, even if a particular subpattern happens to match more than one substring or no substring at all.
(pregexp-match "([a-z ]+;)*" "lather; rinse; repeat;") => ("lather; rinse; repeat;" " repeat;")
Here the *-quantified subpattern matches three times, but it is the last submatch that is returned.
It is also possible for a quantified subpattern to fail to match, even if the overall pattern matches. In such cases, the failing submatch is represented by #f.
(define date-re
  ;match ‘month year’ or ‘month day, year’.
  ;the second subpattern matches the day, if present
  (pregexp "([a-z]+) +([0-9]+,)? *([0-9]+)"))
(pregexp-match date-re "jan 1, 1970") => ("jan 1, 1970" "jan" "1," "1970")
(pregexp-match date-re "jan 1970") => ("jan 1970" "jan" #f "1970")
Submatches can be used in the insert string argument of the procedures pregexp-replace and pregexp-replace*. The insert string can use \n as a backreference to refer back to the nth submatch, ie, the substring that matched the nth subpattern. \0 refers to the entire match, and it can also be specified as \&.
(pregexp-replace "_(.+?)_" "the _nina_, the _pinta_, and the _santa maria_" "*\\1*") => "the *nina*, the _pinta_, and the _santa maria_"
(pregexp-replace* "_(.+?)_" "the _nina_, the _pinta_, and the _santa maria_" "*\\1*") => "the *nina*, the *pinta*, and the *santa maria*"
;recall: \S stands for non-whitespace character
(pregexp-replace "(\\S+) (\\S+) (\\S+)" "eat to live" "\\3 \\2 \\1") => "live to eat"
Use \\ in the insert string to specify a literal backslash. Also, \$ stands for an empty string, and is useful for separating a backreference \n from an immediately following number.
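The whole-match reference \& and the empty-string separator \$ can be illustrated as follows. This is a minimal sketch, assuming pregexp.scm is loadable from the current directory (the file location is an assumption); the text strings are made up for the example.

```scheme
(load "pregexp.scm") ; assumption: pregexp.scm is in the current directory

;; \& (like \0) inserts the entire match
(pregexp-replace "ca+t" "a caaat here" "[\\&]")  ; => "a [caaat] here"

;; \$ ends the backreference \1 so that the 0 after it is inserted
;; literally instead of being read as part of the number 10
(pregexp-replace "(\\d+)" "room 12" "\\1\\$0")   ; => "room 120"
```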
Backreferences can also be used within the regexp pattern to refer back to an already matched subpattern in the pattern. \n stands for an exact repeat of the nth submatch (see footnote 4).
(pregexp-match "([a-z]+) and \\1" "billions and billions") => ("billions and billions" "billions")
Note that the backreference is not simply a repeat of the previous subpattern. Rather it is a repeat of the particular substring already matched by the subpattern.
In the above example, the backreference can only match billions. It will not match millions, even though the subpattern it harks back to, ([a-z]+), would have had no problem doing so:
(pregexp-match "([a-z]+) and \\1" "billions and millions") => #f
The following corrects doubled words:
(pregexp-replace* "(\\S+) \\1" "now is the the time for all good men to to come to the aid of of the party" "\\1") => "now is the time for all good men to come to the aid of the party"
The following marks all immediately repeating patterns in a number string:
(pregexp-replace* "(\\d+)\\1" "123340983242432420980980234" "{\\1,\\1}") => "12{3,3}40983{24,24}3242{098,098}0234"
It is often required to specify a cluster (typically for quantification) but without triggering the capture of submatch information. Such clusters are called non-capturing. In such cases, use (?: instead of ( as the cluster opener. In the following example, the non-capturing cluster eliminates the “directory” portion of a given pathname, and the capturing cluster identifies the basename.
(pregexp-match "^(?:[a-z]*/)*([a-z]+)$" "/usr/local/bin/mzscheme") => ("/usr/local/bin/mzscheme" "mzscheme")
The location between the ? and the : of a non-capturing cluster is called a cloister (see footnote 5). You can put modifiers there that will cause the enclustered subpattern to be treated specially. The modifier i causes the subpattern to match case-insensitively:
(pregexp-match "(?i:hearth)" "HeartH") => ("HeartH")
The modifier x causes the subpattern to match space-insensitively, ie, spaces and comments within the subpattern are ignored. Comments are introduced as usual with a semicolon (;) and extend till the end of the line. If you need to include a literal space or semicolon in a space-insensitized subpattern, escape it with a backslash.
(pregexp-match "(?x: a   lot)" "alot") => ("alot")
(pregexp-match "(?x: a  \\  lot)" "a lot") => ("a lot")
(pregexp-match "(?x:
   a \\ man  \\; \\   ; ignore
   a \\ plan \\; \\   ; me
   a \\ canal         ; completely
   )"
 "a man; a plan; a canal")
=> ("a man; a plan; a canal")
The global variable *pregexp-comment-char* contains the comment character (#\;). For Perl-like comments,
(set! *pregexp-comment-char* #\#)
You can put more than one modifier in the cloister.
(pregexp-match "(?ix:
   a \\ man  \\; \\   ; ignore
   a \\ plan \\; \\   ; me
   a \\ canal         ; completely
   )"
 "A Man; a Plan; a Canal")
=> ("A Man; a Plan; a Canal")
A minus sign before a modifier inverts its meaning. Thus, you can use -i and -x in a subcluster to overturn the insensitivities caused by an enclosing cluster.
(pregexp-match "(?i:the (?-i:TeX)book)" "The TeXbook") => ("The TeXbook")
This regexp will allow any casing for the and book but insists that TeX not be differently cased.
You can specify a list of alternate subpatterns by separating them by |. The | separates subpatterns in the nearest enclosing cluster (or in the entire pattern string if there are no enclosing parens).
(pregexp-match "f(ee|i|o|um)" "a small, final fee") => ("fi" "i")
(pregexp-replace* "([yi])s(e[sdr]?|ing|ation)" "it is energising to analyse an organisation pulsing with noisy organisms" "\\1z\\2") => "it is energizing to analyze an organization pulsing with noisy organisms"
Note again that if you wish to use clustering merely to specify a list of alternate subpatterns but do not want the submatch, use (?: instead of (.
(pregexp-match "f(?:ee|i|o|um)" "fun for all") => ("fo")
An important thing to note about alternation is that the leftmost matching alternate is picked regardless of its length. Thus, if one of the alternates is a prefix of a later alternate, the latter may not have a chance to match.
(pregexp-match "call|call-with-current-continuation" "call-with-current-continuation") => ("call")
To allow the longer alternate to have a shot at matching, place it before the shorter one:
(pregexp-match "call-with-current-continuation|call" "call-with-current-continuation") => ("call-with-current-continuation")
In any case, an overall match for the entire regexp is always preferred to an overall nonmatch. In the following, the longer alternate still wins, because its preferred shorter prefix fails to yield an overall match.
(pregexp-match "(?:call|call-with-current-continuation) constrained" "call-with-current-continuation constrained") => ("call-with-current-continuation constrained")
We’ve already seen that greedy quantifiers match the maximal number of times, but the overriding priority is that the overall match succeed. Consider
(pregexp-match "a*a" "aaaa")
The regexp consists of two subregexps, a* followed by a. The subregexp a* cannot be allowed to match all four a’s in the text string "aaaa", even though * is a greedy quantifier. It may match only the first three, leaving the last one for the second subregexp. This ensures that the full regexp matches successfully.
The regexp matcher accomplishes this via a process called backtracking. The matcher tentatively allows the greedy quantifier to match all four a’s, but then when it becomes clear that the overall match is in jeopardy, it backtracks to a less greedy match of three a’s. If even this fails, as in the call
(pregexp-match "a*aa" "aaaa")
the matcher backtracks even further. Overall failure is conceded only when all possible backtracking has been tried with no success.
Backtracking is not restricted to greedy quantifiers. Nongreedy quantifiers match as few instances as possible, and progressively backtrack to more and more instances in order to attain an overall match. There is backtracking in alternation too, as the more rightward alternates are tried when locally successful leftward ones fail to yield an overall match.
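One way to observe how far a quantifier backtracked is to wrap it in a capturing cluster and inspect the submatch. The sketch below does this for the two calls discussed above and for a non-greedy counterpart; it assumes pregexp.scm is loadable from the current directory (the file location is an assumption), and the capturing parens are added purely for illustration.

```scheme
(load "pregexp.scm") ; assumption: pregexp.scm is in the current directory

;; greedy: a* starts with all four a's, then gives back just enough
(pregexp-match "(a*)a" "aaaa")   ; => ("aaaa" "aaa")
(pregexp-match "(a*)aa" "aaaa")  ; => ("aaaa" "aa")

;; non-greedy: a*? starts with zero a's and takes more only when forced
(pregexp-match "(a*?)a" "aaaa")  ; => ("a" "")
(pregexp-match "(a*?)b" "aaab")  ; => ("aaab" "aaa")
```

In the last call the non-greedy quantifier ends up matching all three a’s, because nothing shorter lets the following b match.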
Sometimes it is efficient to disable backtracking. For example, we may wish to commit to a choice, or we know that trying alternatives is fruitless. A nonbacktracking regexp is enclosed in (?>...).
(pregexp-match "(?>a+)." "aaaa") => #f
In this call, the subregexp ?>a+ greedily matches all four a’s, and is denied the opportunity to backpedal. So the overall match is denied. The effect of the regexp is therefore to match one or more a’s followed by something that is definitely non-a.
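For contrast, the same pattern without the nonbacktracking cluster, and with a text string that does end in a non-a, behaves as follows. A minimal sketch, assuming pregexp.scm is loadable from the current directory (the file location is an assumption):

```scheme
(load "pregexp.scm") ; assumption: pregexp.scm is in the current directory

;; ordinary a+ backtracks by one a so that . has something to match
(pregexp-match "a+." "aaaa")       ; => ("aaaa")

;; (?>a+) keeps all four a's, so . finds nothing left
(pregexp-match "(?>a+)." "aaaa")   ; => #f

;; with a trailing non-a, no backtracking is needed and the match succeeds
(pregexp-match "(?>a+)." "aaaab")  ; => ("aaaab")
```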
You can have assertions in your pattern that look ahead or behind to ensure that a subpattern does or does not occur. These “look around” assertions are specified by putting the subpattern checked for in a cluster whose leading characters are: ?= (for positive lookahead), ?! (negative lookahead), ?<= (positive lookbehind), ?<! (negative lookbehind). Note that the subpattern in the assertion does not generate a match in the final result. It merely allows or disallows the rest of the match.
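That the asserted subpattern is left out of the result is easiest to see with pregexp-match, which returns the matched text itself. A minimal sketch, assuming pregexp.scm is loadable from the current directory (the file location is an assumption):

```scheme
(load "pregexp.scm") ; assumption: pregexp.scm is in the current directory

;; hound must follow, but only grey is returned
(pregexp-match "grey(?=hound)" "greyhound")   ; => ("grey")

;; grey must precede, but only hound is returned
(pregexp-match "(?<=grey)hound" "greyhound")  ; => ("hound")

;; the negative forms reject the same text string
(pregexp-match "grey(?!hound)" "greyhound")   ; => #f
(pregexp-match "(?<!grey)hound" "greyhound")  ; => #f
```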
Positive lookahead (?=) peeks ahead to ensure that its subpattern could match.
(pregexp-match-positions "grey(?=hound)" "i left my grey socks at the greyhound") => ((28 . 32))
The regexp "grey(?=hound)" matches grey, but only if it is followed by hound. Thus, the first grey in the text string is not matched.
Negative lookahead (?!) peeks ahead to ensure that its subpattern could not possibly match.
(pregexp-match-positions "grey(?!hound)" "the gray greyhound ate the grey socks") => ((27 . 31))
The regexp "grey(?!hound)" matches grey, but only if it is not followed by hound. Thus the grey just before socks is matched.
Positive lookbehind (?<=) checks that its subpattern could match immediately to the left of the current position in the text string.
(pregexp-match-positions "(?<=grey)hound" "the hound in the picture is not a greyhound") => ((38 . 43))
The regexp (?<=grey)hound matches hound, but only if it is preceded by grey.
Negative lookbehind (?<!) checks that its subpattern could not possibly match immediately to the left.
(pregexp-match-positions "(?<!grey)hound" "the greyhound in the picture is not a hound") => ((38 . 43))
The regexp (?<!grey)hound matches hound, but only if it is not preceded by grey.
Lookaheads and lookbehinds can be convenient when they are not confusing.
2 Requiring a bracketed character class to be non-empty is not a limitation, since an empty character class can be more easily represented by an empty string.
3 Following regexp custom, we identify “word” characters as [A-Za-z0-9_], although these are too restrictive for what a Schemer might consider a “word”.
4 \0, which is useful in an insert string, makes no sense within the regexp pattern, because the entire regexp has not yet matched at the point where you would refer back to it.
5 A useful, if terminally cute, coinage from the abbots of Perl [3].