Regular Expressions in C++

Richard Eigenmann, 10 Nov 2015

Revised 21 Nov 2015, 27 Mar, 31 Mar, 13 Apr 2016

Unpleasant Surprise

Regular Expressions don't work in all compilers!

They worked fine in Microsoft Visual Studio but failed in gcc.

Before version 4.9.0, gcc shipped a <regex> that was only "partially implemented". This was fixed in gcc 4.9.0. See Bug 53631

See also Stack Overflow: Options for using C++11 <regex> with a circa 2013 compiler

Patterns:

13-digit ISBN. Wikipedia

978-0-321-99278-9

  1. prefix
  2. registration group element
  3. registrant element
  4. publication element
  5. check digit

Pattern: digits- digits- digits- digits- digit

\d+-\d+-\d+-\d+-\d

regex101

Swiss phone numbers:

Example: +41 44 2429788

Pattern: + 2 digits 1 space 2 digits 1 space 7 digits

\+\d{2}\s\d{2}\s\d{7}

regex101

IP v4 addresses:

Example: 192.168.0.1

1-3 digits . 1-3 digits . 1-3 digits . 1-3 digits

\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

regex101

Syntax for one char:

\d       digit
\D       everything not a digit
\s       whitespace
\S       any char but a whitespace
\w       word character: A-Z, a-z, 0-9, and _ (underscore)
\W       everything but a \w
[abc]    a single char a, b, or c
[^abc]   a single char other than a, b, or c
[a-z]    a single lowercase char a to z
[^a-z]   a single char that is not a lowercase a to z
.        any single char
\.       a period (note the escaping!)
\\       a backslash
\+       a plus
-        a dash (no escaping needed outside [...])

Repetition:

*        Zero or more of the preceding element
+        One or more of the preceding element
?        Zero or one of the preceding element
{n}      Exactly n of the preceding element
{n,}     n or more of the preceding element
{n,m}    Between n and m of the preceding element

Anchors:

^        Beginning of line
$        End of line
\b       Word boundary (\bdone\b doesn't match abandoned)

C++ regex_match

#include <boost/regex.hpp>
#include <iostream>
#include <string>
using namespace boost;
using namespace std;
int main() {
    // ^ and $ anchor the pattern to the beginning and end of the input
    regex ISBNPattern{ R"(^\d+-\d+-\d+-\d+-\d$)" };
    string isbn1 = "978-0-321-99278-9";
    string isbn2 = "978-0-321-99278";   // check digit missing
    // regex_match returns true only if the whole string matches
    cout << isbn1 << " regex_match "
        << regex_match( isbn1, ISBNPattern ) << '\n';
    cout << isbn2 << " regex_match "
        << regex_match( isbn2, ISBNPattern ) << '\n';
    return 0;
}

Output:

978-0-321-99278-9 regex_match 1
978-0-321-99278 regex_match 0

This example uses Boost.Regex so that it also works with gcc older than 4.9.0. With a newer compiler, use #include <regex> and namespace std instead.

Run on Coliru

Regex magic

regex ISBNPattern{ R"(^\d+-\d+-\d+-\d+-\d$)" };

The R"(...)" syntax denotes a raw string literal.

Bjarne writes: To get a double quote into a string literal we have to precede it with a backslash. This can quickly become unmanageable. In fact, in real use this “special character problem” gets so annoying that C++ and other languages have introduced the notion of raw string literals to be able to cope with realistic regular expression patterns. In a raw string literal a backslash is simply a backslash character (rather than an escape character) and a double quote is simply a double quote character (rather than an end of string).

Matching IPv4 addresses

#include <boost/regex.hpp>
#include <iostream>
#include <string>
using namespace boost;
using namespace std;
int main() {
    // four groups of 1-3 digits, separated by literal periods
    regex IPv4Pattern { R"(^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$)" };
    string ip1 = "192.168.0.1";
    string ip2 = "7000.168.0.1";    // first group has 4 digits
    string ip3 = "192.168.0";       // only 3 groups
    cout << ip1 << " regex_match "
            << regex_match( ip1, IPv4Pattern ) << '\n';
    cout << ip2 << " regex_match "
            << regex_match( ip2, IPv4Pattern ) << '\n';
    cout << ip3 << " regex_match "
            << regex_match( ip3, IPv4Pattern ) << '\n';
}

Output:

192.168.0.1 regex_match 1
7000.168.0.1 regex_match 0
192.168.0 regex_match 0

Run on Coliru

What if we want the 4 numbers?

Extracting Matches

#include <boost/regex.hpp>
#include <iostream>
#include <string>
using namespace boost;
using namespace std;
int main() {
    // each pair of parentheses captures one of the 4 numbers
    regex IPv4ExtractPattern
        { R"(^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$)" };
    smatch matches;
    string ip1 = "192.168.0.1";
    if ( regex_search( ip1, matches, IPv4ExtractPattern ) ) {
        // matches[0] is the whole match, matches[1..4] are the groups
        cout << matches.size() << " matches\n";
        for (size_t i = 0; i < matches.size(); ++i)
            cout << "matches[" << i << "] = " << matches[i]<<'\n';
    }
   return 0;
}

Output:

5 matches
matches[0] = 192.168.0.1
matches[1] = 192
matches[2] = 168
matches[3] = 0
matches[4] = 1

Run on Coliru

Greediness:

Given:    <TD>Cute Kittens</TD><TD>Funny Cats</TD>
Pattern:  <TD>(.*)<\/TD>
Match:    <TD>Cute Kittens</TD><TD>Funny Cats</TD> (the whole line: .* runs on to the last </TD>)
Pattern:  <TD>(.*?)<\/TD>
Match:    <TD>Cute Kittens</TD> (only the first cell: .*? stops at the first </TD>)

A ? after a Repetition requests non-greedy repetition

regex101

Further reading:

Source: xkcd.com

Useful Links:

regexpal

regex101.com

reveal.js

highlight.js

coliru