Regular Expressions

A regular expression is a sequence of characters that defines a search pattern. Regular expressions are used to replace substrings in a string, to test whether a string matches a search expression, to extract strings from a text, and so on.

Regular expressions are used to:

Process and format user input
Extract information from files (e.g., log files)
Validate input in web forms (e.g., dates, password, phone numbers, etc.)
Prepare and Clean a dataset…
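
As a minimal sketch of the three basic operations mentioned above (replace, test, extract), here is how they might look with Python's built-in re module. The sample sentence and the date-like pattern are made up for illustration, and the pattern only checks the shape of a date, not its validity.

```python
import re

text = "Order 66 shipped on 2024-01-15; order 67 shipped on 2024-02-03."
date_like = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"  # illustrative pattern, no range checks

# Replace: mask every run of digits with '#'
masked = re.sub(r"[0-9]+", "#", text)

# Test: does the whole string have the shape of a date?
is_date = re.fullmatch(date_like, "2024-01-15") is not None

# Extract: collect every date-like substring
dates = re.findall(date_like, text)

print(masked)
print(is_date)
print(dates)
```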

Different standards and conventions (that is, different syntaxes) exist for defining regular expressions, depending on the utility or programming language you are working with:

One of the POSIX (Portable Operating System Interface for uniX) standards is dedicated to the following two types of regular expressions: BRE (Basic Regular Expression) and ERE (Extended Regular Expression).

Extended regular expressions support more features than basic regular expressions.

PCRE (Perl Compatible Regular Expressions) syntax, or a close variant of it, is used in most programming languages (Perl, Python, R, PHP, JavaScript, C++, C#, Java, and more), but note that each programming language has its own shorthands and differences.

Perl-compatible regular expressions support more features than POSIX regular expressions.

Some utilities use their own regex standard, e.g. Emacs has its own regular expression syntax (Emacs Regular Expression).

POSIX Regular Expressions

The POSIX family of standards splits regular expression implementations into two types:

Basic Regular Expression or BRE
Extended Regular Expression or ERE

Most UNIX utilities use basic regular expressions, but a few utilities such as awk and egrep use extended ones.

Extended expressions consist of the same metacharacters as basic expressions, with a few additions.

BRE is used in the following Linux commands :

sed (search and replace tool)
grep
expr
vi (text editor)

ERE is used in :

sed with -E option
grep with -E option
egrep
awk
nawk

Regular Expression Syntax

“Each character in a regular expression (that is, each character in the string describing its pattern) is either a metacharacter, having a special meaning, or a regular character that has a literal meaning. For example, in the regex a., a is a literal character which matches just ‘a’, while . is a metacharacter that matches every character except a newline.” (Wikipedia)

Special Characters (metacharacters) in regular expressions

Some characters have a special meaning in regex: [, ], ^, $, ., |, ?, *, +, (, ), {, }, and -.

Most of those special characters have to be escaped with a backslash in order to be treated as literal characters when used in regular expressions.

Parentheses “(” “)”, curly braces “{” “}”, and the ? and + characters are exceptions to the rule: their behaviour differs depending on the regex type, basic or extended. Refer to the section BRE and ERE Features.
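
The effect of escaping can be sketched in Python, whose re module follows the extended (Perl-style) conventions rather than BRE. The sample strings are made up; re.escape is a standard-library helper that backslash-escapes the metacharacters of a string for you.

```python
import re

price = "Total: $4.50 (approx.)"

# Unescaped, '.' matches any character, so the pattern also matches "4x50"
dot_as_wildcard = re.search(r"4.50", "4x50")

# Escaped, '\.' matches only a literal period
dot_as_literal = re.search(r"4\.50", "4x50")

# re.escape() escapes every metacharacter in the given string
escaped = re.escape("$4.50 (approx.)")
found = re.search(escaped, price)

print(dot_as_wildcard is not None)
print(dot_as_literal is not None)
print(found.group())
```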

Anchor Characters

In regular expressions, “anchor” characters are used to match a position before, after or between characters.

The character ^ is the starting anchor, and the character $ is the end anchor. The ^ is only an anchor if it is the first character in a regular expression. The $ is only an anchor if it is the last character. If they are not used at the proper end of the pattern, they no longer act as anchors.

Anchor	Meaning
^ (as first character)	start of string or start of line depending on multiline mode
$ (as last character)	end of string or end of line depending on multiline mode

Pattern	Match
^A	“A” at the beginning of a line
A$	“A” at the end of a line
A^	“A^” anywhere on a line
$A	“$A” anywhere on a line
^^	“^” at the beginning of a line
$$	“$” at the end of a line

^ and $ lose their special meaning if they are not placed, respectively, at the beginning and at the end of the regular expression.

The metacharacter, ^, is also used as a special character in classes for negative conditions.
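
The dependence on multiline mode can be sketched with Python's re module. Note one assumption: Python treats ^ and $ as anchors wherever they appear in the pattern, so the BRE-specific rows above (such as A^ matching a literal “A^”) do not carry over. The three-line sample text is made up.

```python
import re

text = "Apple\nBanana\nAvocado"

# Default mode: ^ matches only at the very start of the string
start_default = re.findall(r"^A\w*", text)

# Multiline mode: ^ also matches right after each newline
start_multi = re.findall(r"^A\w*", text, re.MULTILINE)

# Default mode: $ matches only at the end of the string
end_default = re.findall(r"\w*a$", text)

# Multiline mode: $ also matches right before each newline
end_multi = re.findall(r"\w*a$", text, re.MULTILINE)

print(start_default)  # ['Apple']
print(start_multi)    # ['Apple', 'Avocado']
print(end_default)    # []
print(end_multi)      # ['Banana']
```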

Period

A period, . matches any single character (except newline \n).

As an example, the regex defined by two consecutive periods, .., matches any group of two consecutive characters. The following strings successfully return one or more matches: "ab", "bc", "abc", 'afdc', etc. But none of the following strings return a match: "a", 'x', 'c', etc.

Python code to test if the pattern .. finds any match in the string “abcd”:

import re

pattern = '..'
test = 'abcd'

if re.match(pattern, test):
  print("Search successful.")
else:
  print("Search unsuccessful.")

Search successful.

Another Python code example, using anchors to test whether the pattern ‘^b...s$’ finds any match in the string ‘boats’:


import re

pattern = '^b...s$'
test = 'boats'

if re.match(pattern, test):
  print("Search successful.")
else:
  print("Search unsuccessful.")

Search successful.

Same pattern tested with the string ‘boats are on the shore’:


import re

pattern = '^b...s$'
test = 'boats are on the shore'

if re.match(pattern, test):
  print("Search successful.")
else:
  print("Search unsuccessful.")

Search unsuccessful.

No match is found since the pattern requires a string of exactly 5 characters that starts with ‘b’ and ends with ‘s’.
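
One detail worth keeping in mind with the examples above: in Python, re.match only tries the pattern at the beginning of the string, whereas re.search scans the whole string and re.fullmatch requires the entire string to match. The following sketch, using a made-up sentence and the same pattern without anchors, illustrates the difference.

```python
import re

pattern = "b...s"

# re.match: the pattern must match at position 0
at_start = re.match(pattern, "boats are on the shore")
not_at_start = re.match(pattern, "the boats are on the shore")

# re.search: the pattern may match anywhere in the string
anywhere = re.search(pattern, "the boats are on the shore")

# re.fullmatch: the pattern must cover the whole string
whole = re.fullmatch(pattern, "boats are on the shore")

print(at_start.group())      # boats
print(not_at_start)          # None
print(anywhere.span())       # (4, 9)
print(whole)                 # None
```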

Quantifiers

Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found. List of quantifiers:

The special characters +, * and ?
{n}, {n,} and {n,m}, where n and m are non-negative integers and n ≤ m

ERE Syntax	BRE Syntax	Meaning
+	\+	one or more times
*	*	zero or more times
?	\?	zero or one time
{n}	\{n\}	n times
{n,m}	\{n,m\}	between n and m times
{n,}	\{n,\}	n or more times

Examples:

ERE Pattern	Match
foob.*r	‘foobar’, ‘foobalkjdflkj9r’, ‘foobr’, …
foob.+r	‘foobar’, ‘foobalkjdflkj9r’, …, but not ‘foobr’
foob.?r	‘foobar’, ‘foobbr’, ‘foobr’, … but not ‘foobalkj9r’
fooba{2}r	‘foobaar’
fooba{2,}r	‘foobaar’, ‘foobaaar’, ‘foobaaaar’, …
fooba{2,3}r	‘foobaar’ and ‘foobaaar’ but not ‘foobaaaar’
^ *	zero or more spaces at the beginning of the line
1+1	11, 111, 1111, 111111, etc. (one or more “1” followed by a final “1”)
1\+1	1+1, since the escaped `+` is no longer interpreted as a special character
ba?	“b” or “ba”
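
The quantifier examples can be checked with a short Python sketch. Python's re module uses the same quantifier syntax as ERE for these cases. The list of cases below is modelled on the examples above, and re.fullmatch is used so that the whole test string has to match the pattern.

```python
import re

# (pattern, test string, expected result)
cases = [
    (r"foob.*r", "foobr", True),
    (r"foob.+r", "foobr", False),
    (r"foob.?r", "foobar", True),
    (r"foob.?r", "foobalkj9r", False),
    (r"fooba{2}r", "foobaar", True),
    (r"fooba{2,3}r", "foobaaaar", False),
    (r"1+1", "111", True),
    (r"1\+1", "1+1", True),
]

results = [re.fullmatch(p, s) is not None for p, s, _ in cases]
all_as_expected = all(got == exp for got, (_, _, exp) in zip(results, cases))

for (pattern, string, _), matched in zip(cases, results):
    print(f"{pattern:<14}{string:<14}{matched}")
print(all_as_expected)
```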

Classes

You can specify a class of characters by enclosing a list of characters in [...], or specify a range of characters using - inside the square brackets. The class matches any single character from the list.

class	Match
[abcd]	`a`, `b`, `c`, and `d`
[a-z]	any lowercase letter
[A-Z]	Any uppercase letter
[0-9]	Any digit
[a-e]	same as [abcde]
[tT]he	the or The
[a-z0-9]	Any lowercase letter or digit

If the first character after [ is ^ the class matches any character not in the list (negative conditions).

Pattern	Match
[cC]at	“cat” and “Cat”
hell[aei]r	‘hellar’, ‘heller’ and ‘hellir’
hell[^aeiou]r	‘hellbr’, ‘hellcr’, … but not ‘hellar’, ‘heller’, …
^[0123456789]$	Any line that contains exactly one digit
^[0-9]$	Any line that contains exactly one digit
[A-Za-z0-9_]	Any single character that is a letter, digit, or underscore
[0-9a-z]	Any single character that is a digit, or a character between “a” and “z”.
[0-9]\]	Any digit followed by “]”
[0-9]*	Zero or more digits (this also matches the empty string)
[hc]+at	“hat”, “cat”, “hhat”, “chat”, “hcat”, “ccchat” etc.
[hc]?at	“hat”, “cat” and “at”
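
A few of these class patterns can be tried out in Python, as a rough check rather than a complete test. The word lists are made up, and re.fullmatch is used so that each candidate has to match the pattern entirely.

```python
import re

# Optional class: [hc]?at
words = ["cat", "Cat", "hat", "chat", "at", "bat"]
optional_class = [w for w in words if re.fullmatch(r"[hc]?at", w)]

# Negated class: any character except a lowercase vowel before the final r
candidates = ["hellar", "hellbr", "heller", "hellcr"]
negated_class = [w for w in candidates if re.fullmatch(r"hell[^aeiou]r", w)]

# Exactly one digit on the line
lines = ["7", "42", "x", ""]
one_digit = [s for s in lines if re.fullmatch(r"[0-9]", s)]

print(optional_class)  # ['cat', 'hat', 'at']
print(negated_class)   # ['hellbr', 'hellcr']
print(one_digit)       # ['7']
```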

Grouping

By placing part of a regular expression inside parentheses, (...) in ERE syntax or \(...\) in BRE syntax, you can capture a group in a regular expression. This allows you to define subexpressions, to apply a quantifier to the entire group, or to restrict alternation to part of the regex. Only parentheses can be used for grouping.

Example:

Pattern	Match
(abc){2}	abcabc
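
The difference a group makes to a quantifier, and the way captured text can be read back, might be sketched in Python as follows (Python uses the unescaped ERE-style parentheses). The year-month string is a made-up example.

```python
import re

# With a group, {2} repeats the whole subexpression "abc"
grouped = re.fullmatch(r"(abc){2}", "abcabc")

# Without a group, {2} applies only to the preceding character "c"
ungrouped_1 = re.fullmatch(r"abc{2}", "abcabc")
ungrouped_2 = re.fullmatch(r"abc{2}", "abcc")

# Each pair of parentheses captures the text it matched
year_month = re.fullmatch(r"([0-9]{4})-([0-9]{2})", "2024-01")

print(grouped is not None)      # True
print(ungrouped_1 is not None)  # False
print(ungrouped_2 is not None)  # True
print(year_month.groups())      # ('2024', '01')
```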

More features, available in Perl-compatible syntax (these shorthands are not part of the POSIX ERE standard, although some GNU tools accept a few of them, such as \w and \b):

Pattern	Meaning
\d	any decimal digit. Equivalent to [0-9]
\D	any character that is not a decimal digit. Equivalent to [^0-9]
\s	any whitespace character. Equivalent to [ \t\n\r\f\v]
\S	any character that is not a whitespace character. Equivalent to [^ \t\n\r\f\v]
\w	any “word” character. Equivalent to [a-zA-Z0-9_]
\W	any “non-word” character. Equivalent to [^a-zA-Z0-9_]
\N	a character that is not a newline
\R	a newline sequence
\v	a vertical whitespace character
\V	a character that is not a vertical whitespace character
\b	matches at a word boundary (the beginning or end of a word): \bbath matches the ‘bath’ in ‘bathroom’ and ‘bathtub’, but not in ‘sunbath’
\B	matches at a position that is not a word boundary: \Bno matches the ‘no’ inside ‘annotate’, ‘unknown’, ‘hypnosis’,…
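
Some of these shorthands can be tried in Python, which supports \d, \s, \w, \b and \B. One caveat, stated as an assumption about the default settings: for str patterns Python's \d and \w also match non-ASCII digits and letters unless the re.ASCII flag is given, so the bracket equivalents in the table hold exactly only in ASCII mode. The sample sentence is made up.

```python
import re

line = "bathroom has 2 baths; sunbath at 10am"

# \d+ : runs of digits
digits = re.findall(r"\d+", line, re.ASCII)

# \bbath : "bath" only at the start of a word
bath_at_word_start = re.findall(r"\bbath\w*", line)

# \Bbath : "bath" only when it is not at the start of a word
bath_inside_word = re.findall(r"\Bbath\w*", line)

print(digits)              # ['2', '10']
print(bath_at_word_start)  # ['bathroom', 'baths']
print(bath_inside_word)    # ['bath']
```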

The POSIX standard defines some classes or categories of characters as shown below. These names are used inside a bracket expression, for example [[:digit:]].

class	Similar to	Meaning
[:alnum:]	[A-Za-z0-9]	digits, upper and lowercase letters
[:alpha:]	[A-Za-z]	upper and lowercase letters
[:blank:]	[ \t]	space and TAB characters only
[:cntrl:]		control characters
[:digit:]	[0-9]	digits
[:graph:]		printable characters, not including space
[:lower:]	[a-z]	lowercase letters
[:print:]		printable characters, including space
[:punct:]		punctuation (all graphic characters except letters and digits)
[:space:]	[ \t\n\r\f\v]	blank (whitespace) characters
[:upper:]	[A-Z]	uppercase letters
[:xdigit:]	[0-9A-Fa-f]	hexadecimal digits

The Alternation Character : `|`

POSIX BRE syntax does not support alternation.

In ERE or PCRE syntax, |, the alternation operator matches either the expression before or the expression after. As an example, if you want to search for the literal text “cat” or “dog”, you can separate both options as follows : cat|dog. If you want more options, simply expand the list: cat|dog|mouse|fish.

ERE syntax	Match
(a|b)	“a” or “b”
([cC]at)|([dD]og)	“cat”, “Cat”, “dog”, “Dog”
a\.(\(|\))	“a.)” or “a.(”
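
Alternation, and the way grouping limits its scope, can be sketched in Python (which uses the unescaped | like ERE and PCRE). The sentence and the gray/grey example are illustrative additions, not taken from the table.

```python
import re

# Alternation between whole words
animals = re.findall(r"cat|dog|mouse|fish", "my dog chased a cat and a mouse")

# Inside a group, the alternation only covers "a" or "e"
grouped = re.fullmatch(r"gr(a|e)y", "grey")

# Without the group, the alternatives are "gra" and "ey"
ungrouped = re.fullmatch(r"gra|ey", "grey")

print(animals)              # ['dog', 'cat', 'mouse']
print(grouped is not None)  # True
print(ungrouped)            # None
```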

BRE and ERE Features

Some special characters behave differently in basic and extended regular expressions: ‘?’, ‘+’, parentheses, braces (‘{}’), and ‘|’.

In BRE syntax, ?, +, { and }, ( and ) have a literal meaning. To be interpreted as special, they have to be escaped with a preceding backslash (for ? and +, the escaped forms \? and \+ are a GNU extension).
In ERE syntax, it is exactly the opposite: ?, +, braces and parentheses lose their special meaning if they are escaped with a backslash.

Desired pattern	Basic (BRE) Syntax	Extended (ERE) Syntax
{N} as literal text	{N}	\{N\}
{N} as a quantifier	\{N\}	{N}
literal + (plus sign)	a+b=c	a\+b=c
+ as a quantifier*	a\+b	a+b

*One or more ‘a’ characters followed by ‘b’

Note that in Linux, the ! character invokes bash’s history substitution. When followed by a string, it tries to expand to the last history event that began with that string: !echo would expand to the last echo command in your history. For more details, refer to man bash (see the QUOTING section).

PCRE (Perl Compatible Regular Expressions)

PCRE is a library written in C which implements a regular expression engine inspired by the capabilities of the Perl programming language. PCRE has inspired the regular expression syntax of most programming languages. It has its own native API, as well as a set of wrapper functions that correspond to the POSIX regular expression API. The PCRE syntax is much more flexible than the POSIX regular expression syntax.

Note that the Linux command grep (GNU grep) can use the PCRE syntax with the -P option.

Let’s see some of the special features of the PCRE syntax.

Greedy and lazy quantifiers

The quantifiers, *, +, and ?, are all greedy: they match as much text as possible. This is a behaviour that you may not expect. To turn their greedy behaviour into a lazy behaviour, we add the character ? to the quantifier as follows: *?, +?, ??

Let’s study the greedy combination .* also called dot-star or quantified dot.

Let’s see through an example how greedy quantifiers work:

Assume that you want to extract substrings from a string using delimiters tags such as {START} and {END}.

As an example, let’s consider the following string: "{START} Richard {END} is {START} in the kitchen {END}"

We would like to extract two substrings : "{START} Richard {END}" and "{START} in the kitchen {END}"

If we apply the regex: {START}.*{END} and search the string to get a match, it returns the entire string as a match.

How does the search pattern, {START}.*{END}, work?

After matching {START}, the parser moves to the next character.
The greedy sub-pattern, .*, matches all characters to the end of the string. It stops at the very last character }.
From the end of the string, the parser backtracks one character at a time, giving back }, D, N, E and { in turn, until the rest of the pattern, {END}, can match. It succeeds at the last {END} tag in the string.
The final match is the entire string.

How can we change this behaviour and prevent the parser from reaching the end of the string as a first step?

The easiest way is to make the dot-star “lazy” by adding a question mark ? at the end of the search sub-pattern: .*?
.*? is called lazy since it guarantees that the quantified dot only matches as many characters as needed for the rest of the pattern to succeed: the pattern only matches one {START}...{END} item at a time.
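
The greedy and lazy behaviours described above can be compared with a short Python sketch on the same string. The braces are escaped here to make it explicit that they are literal characters and not a {n,m} quantifier.

```python
import re

text = "{START} Richard {END} is {START} in the kitchen {END}"

# Greedy: .* runs to the end of the string, then backtracks to the last {END}
greedy = re.findall(r"\{START\}.*\{END\}", text)

# Lazy: .*? stops at the first {END} that lets the pattern succeed
lazy = re.findall(r"\{START\}.*?\{END\}", text)

print(greedy)  # one match: the entire string
print(lazy)    # ['{START} Richard {END}', '{START} in the kitchen {END}']
```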

As another example, taken from the Python documentation, let’s consider the following string: "<a> b <c>"

We would like to extract the two tags "<a>" and "<c>" from the string.

If we apply the regex, <.*>, and search the string to find a match, it returns the entire string "<a> b <c>" as a match (greedy behaviour).

Since we just want <a> and <c> to be matched, we need to add the character ? after the quantified dot (like in the previous example) to perform the match in a non-greedy way. The new search pattern, <.*?>, will find the two matches "<a>" and "<c>".
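
Here is a small Python sketch of this tag example. The third pattern, which uses a negated class instead of a lazy quantifier, is an additional alternative that is not discussed above. It gives the same result on this input.

```python
import re

s = "<a> b <c>"

greedy_tags = re.findall(r"<.*>", s)    # greedy: one long match
lazy_tags = re.findall(r"<.*?>", s)     # lazy: one match per tag

# Alternative (an addition for comparison): forbid '>' inside the tag
negated_tags = re.findall(r"<[^>]*>", s)

print(greedy_tags)   # ['<a> b <c>']
print(lazy_tags)     # ['<a>', '<c>']
print(negated_tags)  # ['<a>', '<c>']
```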

Look-around assertion (look-ahead and look-behind)

Search Pattern	Name	Meaning
`(?=regex)`	Positive lookahead	match something followed by the expression `regex`

The following Python code uses data(?=[a-z]) as a search pattern (note that the pattern includes a look-ahead assertion, (?=[a-z])). Let’s see what the search engine returns with the string ‘dataset’:

import re
print(re.search('data(?=[a-z])', 'dataset'))

## <re.Match object; span=(0, 4), match='data'>

The lookahead assertion (?=[a-z]) specifies that ‘data’ should be followed by a lowercase alphabetic character.
In this case, what follows is the character ‘s’, so the search succeeds. Note that the match returned is ‘data’ only: the ‘s’ is checked but is not part of the match.

Let’s see what the search engine returns using the same pattern on the string ‘data123’:

import re
print(re.search('data(?=[a-z])', 'data123'))

## None

The lookahead pattern fails in finding any match with the input since the character next to ‘data’ is ‘1’.
None is returned.

Search Pattern	Name	Meaning
`(?!regex)`	Negative lookahead	match something not followed by the expression `regex`

In data(?![a-z]), the negative lookahead assertion specifies that ‘data’ should not be followed by a lowercase alphabetic character. If we replace the lookahead in our previous examples with a negative lookahead, the first search returns None (‘data’ is followed by ‘s’) while the second one returns a match:

import re
print(re.search('data(?![a-z])', 'dataset'))

## None

import re
print(re.search('data(?![a-z])', 'data1'))

## <re.Match object; span=(0, 4), match='data'>

Search Pattern	Name	Meaning
`(?<=regex)`	Positive look-behind	match something preceded by the expression `regex` without including it in the match

Example: to find expression A only where expression B precedes it, use (?<=B)A

Search Pattern	Name	Meaning
`(?<!regex)`	Negative look-behind	match something not preceded by the expression `regex` without including it in the match

Example: to find expression A only where expression B does not precede it, use (?<!B)A
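
Both look-behind forms can be sketched in Python with a made-up string of prices and quantities. One constraint to be aware of: Python's re module only accepts look-behind expressions of fixed width, which is the case for the single-character expressions used here.

```python
import re

prices = "cost $30, qty 12, tax $5"

# Positive look-behind: digits that come right after a '$'
# (the '$' itself is not part of the match)
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# Negative look-behind: digits not preceded by '$' or by another digit
# (excluding digits avoids matching the tail of a number such as the "0" in "$30")
not_after_dollar = re.findall(r"(?<![\$\d])\d+", prices)

print(after_dollar)      # ['30', '5']
print(not_after_dollar)  # ['12']
```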

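Other constructs:

(?(id/name)yes-pattern|no-pattern) Conditional: tries to match yes-pattern if the group with the given id or name exists, and no-pattern otherwise.
(?#...) A comment; the contents of the parentheses are simply ignored.
(?:...) Non-capturing group (see the section on capturing groups).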

Unicode symbol and property support: TO BE DEVELOPED

Capturing group or not

(?:...) Non-capturing group. A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

color=(?:red|green|blue)
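
The difference between a capturing and a non-capturing group can be seen in Python by inspecting groups() on the match object. This is a minimal sketch on a made-up input string.

```python
import re

# Capturing group: the matched colour can be retrieved afterwards
capturing = re.search(r"color=(red|green|blue)", "color=green")

# Non-capturing group: same match, but nothing is stored
non_capturing = re.search(r"color=(?:red|green|blue)", "color=green")

print(capturing.groups())       # ('green',)
print(non_capturing.groups())   # ()
print(non_capturing.group())    # color=green
```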

(?P<Y>...) Capturing group named Y. Similar to regular parentheses but the substring matched by the group is accessible via the symbolic group name Y

import re
result = re.search('data(?=[a-z])(?P<ref>.)', 'dataset')
print(result.group('ref'))

## s

The first part of the regex, data, finds a match in dataset: the parser reads and consumes these four characters.
The next part, (?=[a-z]), is a lookahead assertion that successfully matches ‘s’, but note that the parser remains at the same position, it does not consume the ‘s’.
(?P<ref>.) matches the next single character available, which is ‘s’, and captures it in a group named ref.
The last statement, result.group('ref') returns the group named ref which contains ‘s’.

Let’s compare the previous search to the following example which does not contain any lookahead:

import re
result = re.search('data([a-z])(?P<ref>.)', 'dataset')
print(result.group('ref'))

## e

The first part of the regex, data, finds a match in dataset: the parser reads and consumes these four characters.
The next part, ([a-z]), finds a match in the next letter ‘s’. The parser reads and consumes ‘s’.
(?P<ref>.) matches the next single character available: ‘e’.
result.group('ref') returns the group named ref which contains ‘e’.

TODO: TO BE CONTINUED…

(?P=Y) Match the named group Y: A backreference to Y; it matches whatever text was matched by the earlier group named Y.
(?>...) Atomic group
(?|...) Duplicate group numbers
\Y Match the Y’th captured group
(?R) Recurse into entire pattern
(?Y) Recurse into numbered group Y
(?&Y) Recurse into named group Y
\g{Y} Match the named or numbered group Y
\g<Y> Recurse into named or numbered group Y
(?#...) A comment; the contents of the parentheses are simply ignored.
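
Of the constructs listed above, backreferences are easy to try in Python, which supports both the named form (?P=Y) and the numbered form \Y. The sketch below looks for a doubled word in a made-up sentence. As far as I know, several other items in the list (for example the recursion constructs) are PCRE features that Python's re module does not provide, so they are not shown here.

```python
import re

sentence = "it was was a typo"

# Named backreference: (?P=word) must match the same text as group "word"
named = re.search(r"\b(?P<word>\w+) (?P=word)\b", sentence)

# Numbered backreference: \1 refers to the first captured group
numbered = re.search(r"\b(\w+) \1\b", sentence)

print(named.group("word"))  # was
print(numbered.group(1))    # was
print(named.group())        # was was
```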

Regular Expression Flags

i	Ignore case
m	For ^ and $ to match start and end of line
s	For . to match newline as well
x	Allow spaces and comments
J	Duplicate group names allowed
U	Ungreedy quantifiers
(?iLmsux)	Set flags within regex
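
In Python, the i, m, s and x flags correspond to re.IGNORECASE, re.MULTILINE, re.DOTALL and re.VERBOSE. The following sketch uses a made-up two-line text. The J and U flags are PCRE options and, to my knowledge, have no direct counterpart in Python's re module.

```python
import re

text = "First line\nsecond LINE"

# i: ignore case
ignore_case = re.findall(r"line", text, re.IGNORECASE)

# m: ^ matches at the start of every line
multiline = re.findall(r"^\w+", text, re.MULTILINE)

# s: '.' also matches the newline character
without_dotall = re.search(r"First.*LINE", text)
with_dotall = re.search(r"First.*LINE", text, re.DOTALL)

# x: whitespace and comments are allowed inside the pattern
year_month = re.compile(
    r"""
    (\d{4})   # year
    -         # separator
    (\d{2})   # month
    """,
    re.VERBOSE,
)

# Inline flag: (?i) at the start of the pattern sets ignore-case
inline = re.findall(r"(?i)line", text)

print(ignore_case)                  # ['line', 'LINE']
print(multiline)                    # ['First', 'second']
print(without_dotall)               # None
print(with_dotall is not None)      # True
print(year_month.search("released 2024-01").groups())  # ('2024', '01')
print(inline)                       # ['line', 'LINE']
```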

Regex Cheatsheet: TO BE CONTINUED

https://remram44.github.io/regex-cheatsheet/regex.html

POSIX (BRE)	POSIX extended (ERE)	Perl/PCRE	Python re module
