Greedy

Updated: 06/30/2020 by Computer Hope

With regular expressions and wildcards, greedy describes a type of matching that takes as much text as possible: a greedy quantifier keeps consuming characters and gives them back only when the rest of the pattern would otherwise fail. For example, the Perl greedy regular expression ".*e" below matches all text up to the last letter "e" in the $example variable. This example prints "Matched: Computer Hope", not "Matched: Compute", because the text has more than one "e" and the greedy match runs to the last one.

my $example = "Computer Hope";
# Greedy: .* takes as much text as it can, so the match ends at the last "e".
$example =~ m/.*e/;
print "Matched: $&\n";
Tip

In Perl, $& is a quick way to find everything that was matched.

One way to make the regular expression not greedy (lazy matching) is to add a question mark (?) after the asterisk (*), as shown below. The question mark tells the regular expression engine to match as few characters as possible, so the match ends at the first place where the rest of the pattern can succeed.

my $example = "Computer Hope";
# Lazy: .*? takes as little text as it can, so the match ends at the first "e".
$example =~ m/.*?e/;
print "Matched: $&\n";
print "After: $'\n";

Running the script above returns the following text.

Matched: Compute
After: r Hope
Tip

In Perl, $' is a quick way to find everything after the match.

The question mark can be added to other greedy quantifiers as well. For example, if you're using a plus (+) instead of an asterisk, change it to "+?" in your regular expression to make it lazy.
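
As a minimal sketch of the same idea with other quantifiers, the script below compares the greedy and lazy forms of a plus and of a counted range. The $digits string and the patterns are made up for illustration and are not part of the examples above.

```perl
use strict;
use warnings;

# Hypothetical sample text, chosen only for illustration.
my $digits = "12345";

# Greedy plus: takes as many digits as it can.
my ($greedy_plus) = $digits =~ m/(\d+)/;

# Lazy plus: stops after the minimum of one digit.
my ($lazy_plus) = $digits =~ m/(\d+?)/;

# Greedy range: takes up to four digits.
my ($greedy_range) = $digits =~ m/(\d{2,4})/;

# Lazy range: stops after the minimum of two digits.
my ($lazy_range) = $digits =~ m/(\d{2,4}?)/;

print "Greedy +: $greedy_plus\n";       # 12345
print "Lazy +?: $lazy_plus\n";          # 1
print "Greedy {2,4}: $greedy_range\n";  # 1234
print "Lazy {2,4}?: $lazy_range\n";     # 12
```

In each pair, the lazy form stops at the smallest number of characters the quantifier allows because nothing after it in the pattern forces it to take more.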

Why should you not do greedy matching?

A greedy match often captures more text than intended and can force the regular expression engine to backtrack, which is extra work that's usually not required. For example, if you're parsing an HTML (hypertext markup language) file and want to remove all HTML tags, the following greedy regular expression removes too much and gives the wrong result.

my $html = "Test <b>one</b> two <b>three</b>.";
# Greedy: .* runs from the first "<" to the last ">" in the string.
$html =~ s/<.*>//g;
print "Output: $html\n";

Because of the greedy matching, the regular expression matches the opening of the first tag and then everything up to the end of the last tag. In the example above, the following text is printed because everything between the first less-than sign (<) and the last greater-than sign (>) is matched and removed.

Output: Test .
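
To see exactly what the greedy pattern consumed, a short sketch like the one below can capture the match instead of deleting it. The $greedy_match variable name is only an illustration.

```perl
use strict;
use warnings;

my $html = "Test <b>one</b> two <b>three</b>.";

# Capture what the greedy pattern matches instead of removing it.
my ($greedy_match) = $html =~ m/(<.*>)/;

print "Greedy match: $greedy_match\n";  # <b>one</b> two <b>three</b>
```

The single match spans all four tags and the text between them, which is why only "Test ." is left after the substitution.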

Adding a question mark after the asterisk makes the regular expression lazy and produces the expected output.

my $html = "Test <b>one</b> two <b>three</b>.";
# Lazy: .*? stops at the first ">" after each "<", so each tag is matched on its own.
$html =~ s/<.*?>//g;
print "Output: $html\n";

With lazy matching, text is matched only up to the first greater-than sign that follows, and because the /g (global) modifier is used, the substitution repeats until the string has no more HTML tags. This regular expression gives the following output.

Output: Test one two three.
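
As a rough illustration of how /g repeats the lazy match, the sketch below collects every match in a list rather than removing it. The @tags array is a name chosen for this example.

```perl
use strict;
use warnings;

my $html = "Test <b>one</b> two <b>three</b>.";

# In list context, a /g match returns every non-overlapping match.
my @tags = $html =~ m/<.*?>/g;

print "Found: @tags\n";  # Found: <b> </b> <b> </b>
```

Each of the four tags is a separate match, so removing them one by one leaves the text between them intact.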

Lazy matching helps correct problems with matching too much text. However, the pattern can be improved further by replacing ".*" with exactly what text to match or not match, which also limits how far the engine has to search. For example, with the same HTML string, we could use the following regular expression.

my $html = "Test <b>one</b> two <b>three</b>.";
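# Negated character class: match "<", then one or more characters that are neither "<" nor ">", then ">".
$html =~ s/<[^<>]+>//g;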
print "Output: $html\n";

In the example above, [^<>]+ in a Perl regular expression says to match one or more characters that are not a less-than or greater-than sign. This regular expression is usually faster because it cannot run past the end of a tag, and it also avoids matching malformed HTML or text containing a less-than or greater-than sign that was not escaped using its entity name or number (for example, &lt; or &gt;).
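
The difference between the lazy pattern and the negated character class shows up when the text contains a stray, unescaped less-than sign. The sketch below runs both substitutions on a made-up string; the $text, $lazy, and $negated names are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical input with a stray, unescaped less-than sign.
my $text = "1 < 2 and <b>bold</b> text";

# Copy the string, then strip tags from the copy with the lazy pattern.
(my $lazy = $text) =~ s/<.*?>//g;

# Copy the string, then strip tags from the copy with the negated class.
(my $negated = $text) =~ s/<[^<>]+>//g;

print "Lazy: $lazy\n";              # Lazy: 1 bold text
print "Negated class: $negated\n";  # Negated class: 1 < 2 and bold text
```

The lazy pattern starts at the stray "<" and runs to the first ">", so it removes "< 2 and <b>" along with the tag. The negated class cannot cross the second "<", so it skips the stray sign and removes only the real tags.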
