*
, *?
, +
, +?
, {n,m}
, and
{n,m}?
.
For a regular expression to match, the entire regular expression must match, not just part of it. So if the beginning of a pattern containing a quantifier succeeds in a way that causes later parts in the pattern to fail, the matching engine backs up and recalculates the beginning part--that's why it's called backtracking.
Here is an example of backtracking: Let's say you want to find the word following ``foo'' in the string ``Food is on the foo table.'':
When the match runs, the first part of the regular expression (\b(foo)
)
finds a possible match right at the beginning of the string, and loads up
$1 with ``Foo''. However, as soon as the matching engine sees that there's
no whitespace following the ``Foo'' that it had saved in $1, it realizes its
mistake and starts over again one character after where it had had the
tentative match. This time it goes all the way until the next occurrence
of ``foo''. The complete regular expression matches this time, and you get
the expected output of ``table follows foo.''
Sometimes minimal matching can help a lot. Imagine you'd like to match everything between ``foo'' and ``bar''. Initially, you write something like this:
Which perhaps unexpectedly yields:
That's because .*
was greedy, so you get everything between the
first ``foo'' and the last ``bar''. In this case, it's more effective
to use minimal matching to make sure you get the text between a ``foo''
and the first ``bar'' thereafter.
Here's another example: let's say you'd like to match a number at the end of a string, and you also want to keep the preceding part the match. So you write this:
That won't work at all, because .*
was greedy and gobbled up the
whole string. As \d*
can match on an empty string the complete
regular expression matched successfully.
Here are some variants, most of which don't work:
That will print out:
As you see, this can be a bit tricky. It's important to realize that a regular expression is merely a set of assertions that gives a definition of success. There may be 0, 1, or several different ways that the definition might succeed against a particular string. And if there are multiple ways it might succeed, you need to understand backtracking in order to know which variety of success you will achieve.
When using lookahead assertions and negations, this can all get even tricker. Imagine you'd like to find a sequence of nondigits not followed by ``123''. You might try to write that as
But that isn't going to match; at least, not the way you're hoping. It claims that there is no 123 in the string. Here's a clearer picture of why it that pattern matches, contrary to popular expectations:
This prints
You might have expected test 3 to fail because it just seems to a more
general purpose version of test 1. The important difference between
them is that test 3 contains a quantifier (\D*
) and so can use
backtracking, whereas test 1 will not. What's happening is
that you've asked "Is it true that at the start of $x, following 0 or more
nondigits, you have something that's not 123?" If the pattern matcher had
let \D*
expand to ``ABC'', this would have caused the whole pattern to
fail.
The search engine will initially match \D*
with ``ABC''. Then it will
try to match (?!123
with ``123'' which, of course, fails. But because
a quantifier (\D*
) has been used in the regular expression, the
search engine can backtrack and retry the match differently
in the hope of matching the complete regular expression.
Well now,
the pattern really, really wants to succeed, so it uses the
standard regexp backoff-and-retry and lets \D*
expand to just ``AB'' this
time. Now there's indeed something following ``AB'' that is not
``123''. It's in fact ``C123'', which suffices.
We can deal with this by using both an assertion and a negation. We'll say that the first part in $1 must be followed by a digit, and in fact, it must also be followed by something that's not ``123''. Remember that the lookaheads are zero-width expressions--they only look, but don't consume any of the string in their match. So rewriting this way produces what you'd expect; that is, case 5 will fail, but case 6 succeeds:
In other words, the two zero-width assertions next to each other work like
they're ANDed together, just as you'd use any builtin assertions: /^$/
matches only if you're at the beginning of the line AND the end of the
line simultaneously. The deeper underlying truth is that juxtaposition in
regular expressions always means AND, except when you write an explicit OR
using the vertical bar. /ab/
means match ``a'' AND (then) match ``b'',
although the attempted matches are made at different positions because ``a''
is not a zero-width assertion, but a one-width assertion.
One warning: particularly complicated regular expressions can take exponential time to solve due to the immense number of possible ways they can use backtracking to try match. For example this will take a very long time to run
And if you used *
's instead of limiting it to 0 through 5 matches, then
it would take literally forever--or until you ran out of stack space.