Regular Expressions

A regular expression is a sequence of characters that defines a search pattern.

Regular expressions are used to:

* replace substrings in a string
* test whether a string matches a search expression
* extract strings from a text
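As a minimal sketch of these three uses, assuming Python's standard re module (the sample text and variable names are made up for illustration):

```python
import re

text = "Order 66 shipped on 2024-05-01, order 67 on 2024-05-03."
date_pattern = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"  # a date written as YYYY-MM-DD

# Test whether the text contains the pattern.
has_date = re.search(date_pattern, text) is not None

# Extract every substring that matches the pattern.
dates = re.findall(date_pattern, text)

# Replace every match with a placeholder.
masked = re.sub(date_pattern, "<date>", text)

print(has_date)  # True
print(dates)     # ['2024-05-01', '2024-05-03']
print(masked)    # Order 66 shipped on <date>, order 67 on <date>.
```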

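There are different standards and conventions (different syntaxes) for defining a regular expression, depending on the utility or programming language you are working with. This document covers POSIX basic regular expressions (BRE), POSIX extended regular expressions (ERE) and Perl compatible regular expressions (PCRE).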

Extended regular expressions support more features than basic regular expressions.

Perl compatible regular expressions support more features than POSIX regular expressions.

POSIX Regular Expressions

The POSIX family of standards splits regular expressions into two types: basic regular expressions (BRE) and extended regular expressions (ERE).

Most UNIX utilities use basic regular expressions, but a few utilities, such as awk and egrep, use extended ones.

Extended expressions consist of the same metacharacters as basic expressions, with a few additions.

BRE is the default syntax in Linux commands such as grep and sed.

ERE is used in commands such as egrep (grep -E) and awk.

Regular Expression Syntax

“Each character in a regular expression (that is, each character in the string describing its pattern) is either a metacharacter, having a special meaning, or a regular character that has a literal meaning. For example, in the regex a., a is a literal character which matches just ‘a’, while . is a metacharacter that matches every character except a newline.” (Wikipedia)

Special Characters (metacharacters) in regular expressions

The following characters have a special meaning in regex: [, ], ^, $, ., |, ?, *, +, (, ), {, }, and -.

Most of those special characters have to be escaped with a backslash symbol in order to be treated as literal characters when used in regular expressions.
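The effect of escaping can be sketched in Python, whose re module uses the same backslash convention. The sample string is hypothetical; re.escape() is a standard helper that inserts the backslashes automatically:

```python
import re

price = "Total: $5.00 (approx.)"

# Unescaped, "$" is an end anchor and "." matches any character,
# so this pattern cannot find the literal text "$5.00".
unescaped = re.search(r"$5.00", price)

# Escaping with backslashes makes both characters literal.
escaped = re.search(r"\$5\.00", price)

# re.escape() builds the escaped pattern from a literal string.
auto = re.search(re.escape("$5.00"), price)

print(unescaped)        # None
print(escaped.group())  # $5.00
print(auto.group())     # $5.00
```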

Parentheses “(” “)”, curly braces “{” “}”, and the ? and + characters are exceptions to this rule: their behaviour differs depending on the regex type, basic or extended. Refer to the BRE and ERE Features section.

Anchor Characters

In regular expressions, “anchor” characters are used to match a position before, after or between characters.

The character ^ is the start anchor, and the character $ is the end anchor. In basic regular expressions, ^ is only an anchor if it is the first character of the pattern, and $ is only an anchor if it is the last character. Used anywhere else in the pattern, they no longer act as anchors and are matched literally.

| Anchor | Meaning |
| --- | --- |
| `^` (as first character) | start of string, or start of line depending on multiline mode |
| `$` (as last character) | end of string, or end of line depending on multiline mode |

| Pattern | Match |
| --- | --- |
| `^A` | “A” at the beginning of a line |
| `A$` | “A” at the end of a line |
| `A^` | “A^” anywhere on a line |
| `$A` | “$A” anywhere on a line |
| `^^` | “^” at the beginning of a line |
| `$$` | “$” at the end of a line |

^ and $ lose their special meaning if they are not placed at the beginning and at the end of the regular expression, respectively.
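The "multiline mode" mentioned in the table can be illustrated with a short sketch in Python, where it is enabled with the re.MULTILINE flag (the sample text is made up):

```python
import re

text = "Apple\nbanana A\nAvocado"

# Without re.MULTILINE, ^ only matches at the start of the whole string.
start_only = re.findall(r"^A", text)

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
line_starts = re.findall(r"^A", text, flags=re.MULTILINE)
line_ends = re.findall(r"A$", text, flags=re.MULTILINE)

print(start_only)   # ['A']
print(line_starts)  # ['A', 'A']
print(line_ends)    # ['A']
```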

The metacharacter ^ is also used as a special character inside classes, where it expresses a negative condition.

Period

A period, . matches any single character (except newline \n).

As an example, the regex defined by two consecutive periods, .., matches any group of two consecutive characters. The following strings successfully return one or more matches: "ab", "bc", "abc", 'afdc', etc. But none of the following strings return a match: "a", 'x', 'c', etc.

Python code to test whether the pattern .. finds a match at the start of the string “abcd”:

import re

pattern = '..'
test = 'abcd'

# re.match() tries the pattern at the beginning of the string
if re.match(pattern, test):
    print("Search successful.")
else:
    print("Search unsuccessful.")
## Search successful.

Another Python code example, using anchors to test whether the pattern ‘^b...s$’ matches the string ‘boats’:

import re

pattern = '^b...s$'
test = 'boats'

# "b", any three characters, then "s", and nothing else
if re.match(pattern, test):
    print("Search successful.")
else:
    print("Search unsuccessful.")
## Search successful.

Same pattern tested with the string ‘boats are on the shore’:

import re

pattern = '^b...s$'
test = 'boats are on the shore'

# the $ anchor requires the string to end right after the "s"
if re.match(pattern, test):
    print("Search successful.")
else:
    print("Search unsuccessful.")
## Search unsuccessful.

No match is found because the pattern requires the whole string to be exactly 5 characters long, starting with ‘b’ and ending with ‘s’.
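The examples above call re.match(), which only tries the pattern at the start of the string. As a hedged sketch of how it differs from re.search() and re.fullmatch() (all three belong to Python's standard re module; the variable names are illustrative):

```python
import re

test = "boats are on the shore"

# re.match() only tries the pattern at the start of the string.
at_start = re.match(r"b...s", test)
not_at_start = re.match(r"shore", test)

# re.search() scans the string and returns the first match found anywhere.
anywhere = re.search(r"shore", test)

# re.fullmatch() requires the entire string to match the pattern.
whole = re.fullmatch(r"b...s", test)

print(at_start.group())  # boats
print(not_at_start)      # None
print(anywhere.group())  # shore
print(whole)             # None
```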

Quantifiers

Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found. List of quantifiers:

* special characters: +, *, ?
* {n}, {n,} and {n,m}, assuming n<m, with n and m representing positive integers.

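| ERE Syntax | BRE Syntax | Meaning |
| --- | --- | --- |
| `+` | `\+` | one or more times |
| `*` | `*` | zero or more times |
| `?` | `\?` | zero or one time |
| `{n}` | `\{n\}` | n times |
| `{n,m}` | `\{n,m\}` | between n and m times |
| `{n,}` | `\{n,\}` | n or more times |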

Examples:

| ERE Pattern | Match |
| --- | --- |
| `foob.*r` | ‘foobar’, ‘foobalkjdflkj9r’, ‘foobr’, … |
| `foob.+r` | ‘foobar’, ‘foobalkjdflkj9r’, …, but not ‘foobr’ |
| `foob.?r` | ‘foobar’, ‘foobbr’, ‘foobr’, … but not ‘foobalkj9r’ |
| `fooba{2}r` | ‘foobaar’ |
| `fooba{2,}r` | ‘foobaar’, ‘foobaaar’, ‘foobaaaar’, … |
| `fooba{2,3}r` | ‘foobaar’ and ‘foobaaar’ but not ‘foobaaaar’ |
| `^ *` | zero or more spaces at the beginning of the line |
| `1+1` | ‘11’, ‘111’, ‘1111’, etc. (one or more ‘1’ followed by a final ‘1’) |
| `1\+1` | ‘1+1’, since the escaped + is no longer interpreted as a special character |
| `ba?` | “b” or “ba” |
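A few of these rows can be checked with a short sketch in Python, whose re module writes these quantifiers the same way as ERE. The candidate strings come from the table; the dictionary layout is just one convenient way to organise the check:

```python
import re

# Each pattern is paired with candidate strings; re.fullmatch() requires
# the whole candidate to match the pattern.
cases = {
    r"foob.*r": ["foobar", "foobr"],
    r"foob.+r": ["foobar", "foobr"],
    r"fooba{2,3}r": ["foobar", "foobaar", "foobaaar", "foobaaaar"],
}

results = {
    pattern: [s for s in candidates if re.fullmatch(pattern, s)]
    for pattern, candidates in cases.items()
}

for pattern, matched in results.items():
    print(pattern, "->", matched)
```

With .* the string ‘foobr’ is accepted, while .+ rejects it because at least one character is required between ‘foob’ and ‘r’.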

Classes

You can specify a class of characters by enclosing a list of characters in [...], or specify a range of characters using - inside the square brackets. The class matches any single character from the list.

| Class | Match |
| --- | --- |
| `[abcd]` | a, b, c, and d |
| `[a-z]` | any lowercase letter |
| `[A-Z]` | any uppercase letter |
| `[0-9]` | any digit |
| `[a-e]` | same as `[abcde]` |
| `[tT]he` | the or The |
| `[a-z0-9]` | any lowercase letter or digit |

If the first character after [ is ^ the class matches any character not in the list (negative conditions).

| Pattern | Match |
| --- | --- |
| `[cC]at` | “cat” and “Cat” |
| `hell[aei]r` | ‘hellar’, ‘heller’ and ‘hellir’ |
| `hell[^aeiou]r` | ‘hellbr’, ‘hellcr’, … but not ‘hellar’, ‘heller’, … |
| `^[0123456789]$` | any line that contains exactly one digit |
| `^[0-9]$` | any line that contains exactly one digit |
| `[A-Za-z0-9_]` | any single character that is a letter, digit, or underscore |
| `[0-9a-z]` | any single character that is a digit or a letter between “a” and “z” |
| `[0-9]\]` | any digit followed by “]” |
| `[0-9]*` | nothing, or any sequence of digits of any length |
| `[hc]+at` | “hat”, “cat”, “hhat”, “chat”, “hcat”, “ccchat” etc. |
| `[hc]?at` | “hat”, “cat” and “at” |
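As an illustrative sketch in Python (character classes behave the same way in its re module; the word list and sentence are invented for the example):

```python
import re

words = ["hellar", "hellbr", "heller", "hellzr"]

# [aei] accepts one of the listed vowels; [^aeiou] accepts any other character.
listed = [w for w in words if re.fullmatch(r"hell[aei]r", w)]
negated = [w for w in words if re.fullmatch(r"hell[^aeiou]r", w)]

# A class combined with a quantifier: one or more "h" or "c", then "at".
found = re.findall(r"[hc]+at", "a cat in a hat had a chat")

print(listed)   # ['hellar', 'heller']
print(negated)  # ['hellbr', 'hellzr']
print(found)    # ['cat', 'hat', 'chat']
```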

Grouping

By placing part of a regular expression inside parentheses, (...) in ERE syntax or \(...\) in BRE syntax, you create a group that captures the text it matches. This allows you to define subexpressions to perform the pattern matching, to apply a quantifier to the entire group, or to restrict alternation to part of the regex. Only parentheses can be used for grouping.

Example:

| Pattern | Match |
| --- | --- |
| `(abc){2}` | abcabc |
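A small sketch in Python shows both roles of a group, scoping a quantifier and capturing text. The date-like string is an assumption made for the example:

```python
import re

# The quantifier {2} applies to the whole group, not just the last character.
grouped = re.fullmatch(r"(abc){2}", "abcabc")
ungrouped = re.fullmatch(r"abc{2}", "abcabc")  # would match "abcc" instead

# Groups also capture the text they matched, numbered from 1.
date = re.search(r"([0-9]{4})-([0-9]{2})", "released 2024-05")

print(grouped is not None)           # True
print(ungrouped is not None)         # False
print(date.group(1), date.group(2))  # 2024 05
```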

Shorthand character classes and assertions. These escapes come from Perl and PCRE; they are not part of the POSIX standard, although some tools (GNU grep, for example) accept a few of them, such as \w, \s and \b, as extensions:

| Pattern | Meaning |
| --- | --- |
| `\d` | any decimal digit. Equivalent to `[0-9]` |
| `\D` | any character that is not a decimal digit. Equivalent to `[^0-9]` |
| `\s` | any white space character. Equivalent to `[ \t\n\r\f\v]` |
| `\S` | any character that is not a white space character. Equivalent to `[^ \t\n\r\f\v]` |
| `\w` | any “word” character. Equivalent to `[a-zA-Z0-9_]` |
| `\W` | any “non-word” character. Equivalent to `[^a-zA-Z0-9_]` |
| `\N` | a character that is not a newline |
| `\R` | a newline sequence |
| `\v` | a vertical white space character |
| `\V` | a character that is not a vertical white space character |
| `\b` | a word boundary, that is, the position at the beginning or end of a word: `\bbath` matches in ‘bathroom’ and ‘bathtub’ but not in ‘sunbath’ |
| `\B` | a position that is not a word boundary, so the pattern must sit inside a word: `\Bno` matches in ‘annotate’, ‘unknown’, ‘hypnosis’, … |
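As a hedged illustration, Python's re module supports \d, \w, \s, \b and \B (but not every escape in the table, \R for instance). The sample sentence is invented:

```python
import re

text = "Room 101: bathroom, sunbath, bath 2x"

digits = re.findall(r"\d+", text)         # runs of digits
words = re.findall(r"\w+", text)          # runs of word characters
starts = re.findall(r"\bbath\w*", text)   # words that start with "bath"
inner = re.findall(r"\Bbath\b", text)     # "bath" at the end of a longer word

print(digits)  # ['101', '2']
print(words)   # ['Room', '101', 'bathroom', 'sunbath', 'bath', '2x']
print(starts)  # ['bathroom', 'bath']
print(inner)   # ['bath']
```

The single result in inner comes from ‘sunbath’: the standalone word ‘bath’ is rejected because \B requires a preceding word character.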

The POSIX standard defines some classes or categories of characters as shown below. A class name is used inside a bracket expression, for example [[:digit:]].

| Class | Similar to | Meaning |
| --- | --- | --- |
| `[:alnum:]` | `[A-Za-z0-9]` | digits, upper and lowercase letters |
| `[:alpha:]` | `[A-Za-z]` | upper and lowercase letters |
| `[:blank:]` | `[ \t]` | space and TAB characters only |
| `[:cntrl:]` | | control characters |
| `[:digit:]` | `[0-9]` | digits |
| `[:graph:]` | | printable characters, not including space |
| `[:lower:]` | `[a-z]` | lowercase letters |
| `[:print:]` | | printable characters, including space |
| `[:punct:]` | | punctuation (all graphic characters except letters and digits) |
| `[:space:]` | `[ \t\n\r\f\v]` | blank (whitespace) characters |
| `[:upper:]` | `[A-Z]` | uppercase letters |
| `[:xdigit:]` | `[0-9A-Fa-f]` | hexadecimal digits |

The Alternation Character : |

POSIX BRE syntax does not support alternation.

In ERE or PCRE syntax, |, the alternation operator matches either the expression before or the expression after. As an example, if you want to search for the literal text “cat” or “dog”, you can separate both options as follows : cat|dog. If you want more options, simply expand the list: cat|dog|mouse|fish.

ERE syntax and what it matches:

* (a|b) matches “a” or “b”
* ([cC]at)|([dD]og) matches “cat”, “Cat”, “dog”, “Dog”
* a\.(\(|\)) matches “a.)” or “a.(”
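The interaction between alternation and grouping can be sketched in Python, which uses the same | operator. The sentence is invented, and (?:...) is a non-capturing group used here only so that re.findall() returns whole matches:

```python
import re

text = "A Cat chased a dog; the catalog was left on the doghouse."

# Without grouping, | splits the whole pattern into two alternatives.
pets = re.findall(r"[cC]at|[dD]og", text)

# Parentheses restrict the alternation, so both \b anchors apply to each option.
whole_words = re.findall(r"\b(?:[cC]at|[dD]og)\b", text)

print(pets)         # ['Cat', 'dog', 'cat', 'dog']
print(whole_words)  # ['Cat', 'dog']
```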

BRE and ERE Features

Some special characters behave differently in basic and extended regular expressions: ‘?’, ‘+’, parentheses, braces (‘{}’), and ‘|’.

| Desired pattern | Basic (BRE) Syntax | Extended (ERE) Syntax |
| --- | --- | --- |
| {N} with a literal meaning | `{N}` | `\{N\}` |
| {N} as a quantifier | `\{N\}` | `{N}` |
| literal + (plus sign) | `a+b=c` | `a\+b=c` |
| + as a quantifier* | `a\+b` | `a+b` |

*One or more ‘a’ characters followed by ‘b’

Note that in Linux, the ! character invokes bash’s history substitution. When followed by a string, it tries to expand to the last history event that began with that string: !echo would expand to the last echo command in your history. For more details, see the QUOTING section of man bash.

PCRE (Perl Compatible Regular Expressions)

PCRE is a library written in C that implements a regular expression engine inspired by the capabilities of the Perl programming language. PCRE has inspired the regular expression syntax of most programming languages. It has its own native API, as well as a set of wrapper functions that correspond to the POSIX regular expression API. The PCRE syntax is much more flexible than the POSIX regular expression syntax.

Note that the Linux command grep can use the PCRE syntax when run with the -P option.

Let’s see some of the special features of the PCRE syntax.

Greedy and lazy quantifiers

The quantifiers *, + and ? are all greedy: they match as much text as possible. This is a behaviour that you may not expect. To make them lazy, so that they match as little text as possible, add the character ? to the quantifier as follows: *?, +?, ??

Let’s study the greedy combination .* also called dot-star or quantified dot.

Let’s see through an example how greedy quantifiers work:

Assume that you want to extract substrings from a string using delimiter tags such as {START} and {END}.

As an example, let’s consider the following string: "{START} Richard {END} is {START} in the kitchen {END}"

We would like to extract two substrings : "{START} Richard {END}" and "{START} in the kitchen {END}"

If we apply the regex: {START}.*{END} and search the string to get a match, it returns the entire string as a match.

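How does the search pattern {START}.*{END} work? After matching the first {START}, the greedy .* consumes everything up to the end of the string. The engine then backtracks one character at a time until {END} can be matched, which happens at the last {END} in the string.

How can we change this behaviour and prevent the engine from running to the end of the string as a first step? By making the quantifier lazy: in {START}.*?{END}, the .*? expands one character at a time and stops at the first {END} it meets, so the search finds "{START} Richard {END}" and then "{START} in the kitchen {END}".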
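One way to see both behaviours side by side is the following sketch in Python. The braces are escaped here to make explicit that they are literal characters, which is a choice made for this example:

```python
import re

text = "{START} Richard {END} is {START} in the kitchen {END}"

# Greedy: .* runs to the end of the string, then backtracks to the last {END}.
greedy = re.findall(r"\{START\}.*\{END\}", text)

# Lazy: .*? grows one character at a time and stops at the first {END}.
lazy = re.findall(r"\{START\}.*?\{END\}", text)

print(greedy)  # one match: the entire string
print(lazy)    # ['{START} Richard {END}', '{START} in the kitchen {END}']
```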

As another example, taken from the Python documentation, let’s consider the following string: "<a> b <c>"

We would like to extract the two tags "<a>" and "<c>" from the string.

If we apply the regex <.*> and search the string to find a match, it returns the entire string “<a> b <c>” as a match (greedy behaviour).

Since we just want <a> and <c> to be matched, we need to add the character ? after the quantified dot (like in the previous example) to perform the match in a non-greedy way. The new search pattern, <.*?>, will find two matches: <a> and <c>.
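The same comparison as a short Python sketch. The third pattern, which uses a negated class instead of a lazy quantifier, is an alternative added here for illustration and is not part of the original example:

```python
import re

text = "<a> b <c>"

greedy_tags = re.findall(r"<.*>", text)    # greedy: one long match
lazy_tags = re.findall(r"<.*?>", text)     # lazy: stops at the first ">"
class_tags = re.findall(r"<[^>]*>", text)  # negated class: cannot cross a ">"

print(greedy_tags)  # ['<a> b <c>']
print(lazy_tags)    # ['<a>', '<c>']
print(class_tags)   # ['<a>', '<c>']
```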

Look-around assertion (look-ahead and look-behind)

| Search Pattern | Name | Meaning |
| --- | --- | --- |
| `(?=regex)` | Positive lookahead | match something followed by the expression regex |

The following Python code uses data(?=[a-z]) as a search pattern (note that the pattern includes the look-ahead assertion (?=[a-z])). Let’s see what the search engine returns with the string ‘dataset’:

import re
print(re.search('data(?=[a-z])', 'dataset'))
## <re.Match object; span=(0, 4), match='data'>

Let’s see what the search engine returns using the same pattern on the string ‘data123’:

import re
print(re.search('data(?=[a-z])', 'data123'))
## None

| Search Pattern | Name | Meaning |
| --- | --- | --- |
| `(?!regex)` | Negative lookahead | match something not followed by the expression regex |

In data(?![a-z]), the negative lookahead assertion specifies that ‘data’ must not be followed by a lowercase alphabetic character. If we change the lookahead into a negative lookahead in our previous examples, the first search returns None (‘data’ is followed by ‘s’) while the second one returns a match:

import re
print(re.search('data(?![a-z])', 'dataset'))
## None

import re
print(re.search('data(?![a-z])', 'data1'))
## <re.Match object; span=(0, 4), match='data'>

| Search Pattern | Name | Meaning |
| --- | --- | --- |
| `(?<=regex)` | Positive look-behind | match something preceded by the expression regex without including it in the match |

Example: to find expression A only where expression B precedes it, use (?<=B)A

| Search Pattern | Name | Meaning |
| --- | --- | --- |
| `(?<!regex)` | Negative look-behind | match something not preceded by the expression regex without including it in the match |

Example: to find expression A only where expression B does not precede it, use (?<!B)A
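As a hedged sketch of both look-behind forms in Python (the sample text is invented; note that Python's re module only accepts look-behind patterns of fixed width, which holds for the single-character ones used here):

```python
import re

text = "price: $42, discount: 15, total: $27"

# Positive look-behind: digits that come right after a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative look-behind: digits not preceded by "$" or by another digit.
no_dollar = re.findall(r"(?<![$\d])\d+", text)

print(after_dollar)  # ['42', '27']
print(no_dollar)     # ['15']
```

The dollar sign itself is never part of the result, because a look-behind only checks the preceding text without including it in the match.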

Unicode symbol and property support: TO BE DEVELOPED

Capturing group or not

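A group written (?:...) is a non-capturing group: it groups its contents for a quantifier or an alternation without storing the matched text. For example:

color=(?:red|green|blue)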
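The difference between the two kinds of group can be sketched in Python (the input string is an assumption made for the example):

```python
import re

text = "color=green"

capturing = re.search(r"color=(red|green|blue)", text)
non_capturing = re.search(r"color=(?:red|green|blue)", text)

# Both patterns match the same text, but only the first stores the colour.
print(capturing.group())       # color=green
print(capturing.groups())      # ('green',)
print(non_capturing.group())   # color=green
print(non_capturing.groups())  # ()
```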

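A lookahead does not consume any characters, so a group placed right after it starts matching at the position the lookahead has just checked. In the following example, the named group (?P<ref>.) captures the ‘s’ tested by the lookahead:

import re
result = re.search('data(?=[a-z])(?P<ref>.)', 'dataset')
print(result.group('ref'))
## s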

Let’s compare the previous search to the following example which does not contain any lookahead:

import re
result = re.search('data([a-z])(?P<ref>.)', 'dataset')
print(result.group('ref'))
## e

TODO: TO BE CONTINUED…

Regular Expression Flags

| Flag | Meaning |
| --- | --- |
| `i` | ignore case |
| `m` | `^` and `$` match the start and end of each line |
| `s` | `.` matches newline as well |
| `x` | allow spaces and comments |
| `J` | duplicate group names allowed |
| `U` | ungreedy quantifiers |
| `(?iLmsux)` | set flags within the regex |
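As an illustrative sketch, Python's re module exposes the i, m, s and x flags as re.IGNORECASE, re.MULTILINE, re.DOTALL and re.VERBOSE; the J and U flags are PCRE options and have no counterpart there. The sample text is invented:

```python
import re

text = "First line\nsecond LINE"

ignore_case = re.findall(r"line", text, flags=re.IGNORECASE)
multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)
no_dotall = re.search(r"First.*LINE", text)            # "." stops at the newline
dotall = re.search(r"First.*LINE", text, flags=re.DOTALL)
inline = re.findall(r"(?i)line", text)                 # flag set inside the regex

# re.VERBOSE ignores whitespace in the pattern and allows comments.
verbose = re.compile(r"""
    (\w+)   # a word
    \s+     # whitespace between the words
    (\w+)   # another word
""", flags=re.VERBOSE)
first_two = verbose.search(text).groups()

print(ignore_case)  # ['line', 'LINE']
print(multiline)    # ['First', 'second']
print(no_dotall)    # None
print(inline)       # ['line', 'LINE']
print(first_two)    # ('First', 'line')
```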

Regex Cheatsheet: TO BE CONTINUED

https://remram44.github.io/regex-cheatsheet/regex.html

POSIX (BRE) POSIX extended (ERE) Perl/PCRE Python re module