I have a very simple regex similar to this:
With this test string...
I expected it to match _HOHO___HO_ (shortest match, non-greedy).
Instead it matches _HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_ (longest match, looks greedy).
Why? How can I make it match the shortest match?
Adding and removing the ? gives the same result.
Edit - better test string that shows why [^HOHO] doesn't work:
All I can think of is that maybe it is matching multiple times - but there's only one match for _HO_, so I don't understand why it isn't taking the shortest match that ends at the _HO_, discarding the rest.
I've browsed all the questions I can find with titles like "Non-greedy regex acts greedy", but they all seem to have some other problem.
I figured out a solution with some help from Regex lazy vs greedy confusion.
Where there are many left-hand matches for one right-hand match, it will always go from the first it reaches (which will actually give the longest match).
Essentially, it goes through the string one character at a time asking "Are there matches from this character? If so, match the shortest and finish. If no, move to next character, repeat". I expected it to be "Are there matches anywhere in this string? If so, match the shortest of all of them".
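To make the difference between the two mental models concrete, here is a minimal Python sketch. The sample string is an assumption: it wraps the longest match quoted in the question in some made-up surrounding text, and the two helper functions are hypothetical names used only for illustration. The first helper mimics what the engine does (stop at the first start position that can match), the second does what I originally expected (consider every start position and keep the shortest).

```python
import re

# Assumed test string: the long match from the question with made-up padding.
sample = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev"
pattern = re.compile(r"HOHO.*?_HO_")


def first_match_by_scanning(pat, text):
    """Mimic the engine: try each start position in order, stop at the first hit."""
    for start in range(len(text) + 1):
        m = pat.match(text, start)  # anchored attempt at exactly this position
        if m:
            return m.group()
    return None


def shortest_of_all(pat, text):
    """The behaviour I expected: collect a match from every start, keep the shortest."""
    candidates = []
    for start in range(len(text) + 1):
        m = pat.match(text, start)
        if m:
            candidates.append(m.group())
    return min(candidates, key=len) if candidates else None


print(first_match_by_scanning(pattern, sample))
print(shortest_of_all(pattern, sample))
```

The first function returns the same thing as a plain pattern.search(sample), which is the long match; only the second one returns the short one.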
You can approximate a regex that is non-greedy in both directions by replacing the . with a negation meaning "not the left-side match". Negating a string like this requires negative lookaheads and non-capturing groups, but it's as simple as dropping the string into the lookahead of (?:(?!).).
For example, the equivalent of HOHO.*?_HO_ which is non-greedy on the left and right would be:
So the regex engine is essentially going through each character like this:
HOHO - Does this match the left side?
(?:(?!HOHO).)* - If so, can I reach the right-hand side without any repeats of the left side?
_HO_ - If so, grab everything until the right-hand match
? modifier on * or + - If there are multiple right-hand matches, choose the nearest one
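Putting the pieces listed above together, a pattern of this shape can be sketched in Python as follows. This is an illustration under assumptions, not necessarily the exact expression from the original post: the sample string is made up around the match quoted in the question, and between() is a hypothetical helper name.

```python
import re

# Assumed test string: the long match from the question with made-up padding.
sample = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev"


def between(left, right):
    """Build 'left ... right' where the middle may not contain another 'left'."""
    # (?:(?!left).)*? consumes one character at a time, but only while the
    # left-hand string does not start at the current position.
    tempered = rf"(?:(?!{re.escape(left)}).)*?"
    return re.compile(re.escape(left) + tempered + re.escape(right))


lazy_dot = re.compile(r"HOHO.*?_HO_")
tempered_dot = between("HOHO", "_HO_")

print(lazy_dot.search(sample).group())      # starts at the first HOHO
print(tempered_dot.search(sample).group())  # starts at the HOHO nearest to _HO_
```

Note that . does not match newlines by default in Python, so a multi-line input would need re.DOTALL.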
Why does it match the whole string?
This is because regular-expression pattern matching is done by finding the first position in the string at which a match is possible. Since a match is possible starting at the first character of the string, shorter matches starting at subsequent characters are never even considered.
Let's consider the regular expression /a+?b/ and the test string "aaaaaaaaab". When applied to the string, it matches the whole string, not just the last b. This is because the first position in the string where a match is possible is at the first a.
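The a+?b case is easy to check yourself. Here is a small Python sketch of it (Python's re module behaves the same way as described here; the variable names are just for illustration). It also lists every start position from which a match would be possible, to show that the engine simply takes the first one.

```python
import re

text = "aaaaaaaaab"
lazy = re.compile(r"a+?b")

# search() stops at the first start position where a match is possible.
m = lazy.search(text)
print(m.start(), m.group())

# Every 'a' is a valid starting point, but only the first one is ever used.
possible_starts = [i for i in range(len(text)) if lazy.match(text, i)]
print(possible_starts)

# The shortest conceivable match exists, it is just never considered.
shortest = lazy.match(text, len(text) - 2).group()
print(shortest)
```

Laziness only decides where a match ends once its starting point is fixed; it has no say in where the match begins.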
So, if you want to match aaaaaaaaab, use a negated character class based regex rather than a lazy dot:
See the regex demo.
The result is non-greedy, because it's the shortest match from the first occurrence of HOHO until _HO_ is reached; the engine traverses the string from left to right, and because it doesn't have to backtrack, it won't attempt to shorten anything.
To make it work in the way that's expected here, you need to have a greedy prefix in your expression:
The first memory capture contains the string that you're after; the greedy prefix will try to skip as many characters as possible, so it will match the last occurrence of HOHO.
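As a rough illustration of the greedy-prefix idea, here is a Python sketch. The exact expression is an assumption based on the description above (a greedy .* in front, with the lazy pattern inside a capture group), and the sample string is made up around the match quoted in the question.

```python
import re

# Assumed test string: the long match from the question with made-up padding.
sample = "fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev"

# The greedy .* swallows as much as it can, then backtracks only until
# HOHO.*?_HO_ can still match, so group 1 starts at the last usable HOHO.
# Assumes single-line input; add re.DOTALL if the text contains newlines.
greedy_prefix = re.compile(r".*(HOHO.*?_HO_)")

m = greedy_prefix.search(sample)
print(m.group(1))  # the part you are after
print(m.group(0))  # the whole match, including the skipped prefix
```

The overall match still begins at the start of the string; only the capture group is short, so read group 1 rather than the full match.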