Regular Expressions — Beware Of (+)

Two levels of optional repetition yield a power set of possibilities.

Brief Explanation

If a regular expression (RE) engine uses backtracking to emulate parallelism (an NFA in the jargon) it can suffer from exponential slowdown. This page graphs the exhaustive nesting that can make a perfectly logical calculation seem to crash.



Try It Yourself

The regular expression we use is very reasonable: a list of comma-separated numbers. How hard can it be? The problem arises with an unexpected non-comma, here a semicolon.

If there is a match, the black line just goes right across. If there is no match, the line shows the path a regex engine might take, with backtracks in red.

Regular Expression (it’s fixed for now)
([0-9]+|,)

Click a or type some text here and hit the tab key.


Particular values




Remarks

The problem arises with the nesting of two (or more) optional repetitions. Some RE engines (see below) will — quite reasonably — explore all the implications of this nesting. This results in examining all the possible ways a sequence (not set) can be divided into subsequences. For a sequence of N elements, there are 2N-1 subsequences. For this reason, it may take longer than the known lifetime of the Universe to apply such an RE to a non-matching real-world source text.

This is an example more completely dissected in Jeffrey Friedl’s articles in The Perl Journal and in his monumental, magisterial book Mastering Regular Expressions. The behavior of the tool (e.g. awk, sed, Perl, Python, etc.) when confronted with such a RE can help you distinguish between three types of RE engines:



See Also



This page has been loaded at least several times.