A regular expression is a sequence of characters that defines a search pattern. Regular expressions are used to replace substrings in a string, to test whether a string matches a search expression, to extract strings from a text, and so on.
Different standards and conventions (that is, different syntaxes) exist for defining a regular expression, depending on the utility or programming language you are working with. The ones covered here are POSIX basic regular expressions (BRE), POSIX extended regular expressions (ERE) and Perl compatible regular expressions (PCRE).
Extended regular expressions support more features than basic regular expressions.
Perl compatible regular expressions support more features than POSIX regular expressions.
The POSIX family of standards splits regular expression implementations into two types: basic regular expressions (BRE) and extended regular expressions (ERE).
Most UNIX utilities use basic regular expressions, but a few utilities such as awk and egrep use the extended ones.
Extended expressions consist of the same metacharacters as basic expressions, with a few additions.
BRE is used in the following Linux commands:

* `sed` (search and replace tool)
* `grep`
* `expr`
* `vi` (text editor)

ERE is used in:

* `sed` with the `-E` option
* `grep` with the `-E` option
* `egrep`
* `awk`
* `nawk`
“Each character in a regular expression (that is, each character in the string describing its pattern) is either a metacharacter, having a special meaning, or a regular character that has a literal meaning. For example, in the regex `a.`, `a` is a literal character which matches just ‘a’, while `.` is a metacharacter that matches every character except a newline.” (Wikipedia)
The following characters have a special meaning in regex: `[`, `]`, `^`, `$`, `.`, `|`, `?`, `*`, `+`, `(`, `)`, `{`, `}`, and `-`.
Most of those special characters have to be escaped with a backslash symbol in order to be treated as literal characters when used in regular expressions.
Parentheses `(` `)`, curly braces `{` `}`, and the `?` and `+` characters are exceptions to the rule: their behaviour differs depending on the regex type, basic or extended. Refer to BRE and ERE FEATURES.
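As a rough illustration of escaping, here is a minimal Python sketch. It assumes Python's `re` module, whose syntax is Perl-like rather than POSIX BRE or ERE, so it only shows the general idea. The sample string is made up for the example.

```python
import re

text = "1+1=2"

# Unescaped, '+' is a quantifier: the pattern means "one or more '1', then '1'".
unescaped = re.search(r"1+1", text)

# Escaped with a backslash, '+' is matched as a literal plus sign.
escaped = re.search(r"1\+1", text)

# re.escape() adds the backslashes for us.
auto_escaped = re.search(re.escape("1+1"), text)

print(unescaped)             # None
print(escaped.group())       # 1+1
print(auto_escaped.group())  # 1+1
```

`re.escape()` is handy when the text to search for comes from user input and must be taken literally.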
In regular expressions, “anchor” characters are used to match a position before, after or between characters.
The character `^` is the starting anchor, and the character `$` is the end anchor. In basic regular expressions, `^` is only an anchor if it is the first character in a regular expression, and `$` is only an anchor if it is the last character. If they are not used at the proper end of the pattern, they no longer act as anchors.
Anchor | Meaning |
---|---|
^ (as first character) | start of string or start of line depending on multiline mode |
$ (as last character) | end of string or end of line depending on multiline mode |
Pattern | Match |
---|---|
^A | “A” at the beginning of a line |
A$ | “A” at the end of a line |
A^ | “A^” anywhere on a line |
$A | “$A” anywhere on a line |
^^ | “^” at the beginning of a line |
$$ | “$” at the end of a line |
`^` and `$` lose their special meaning if they are not placed respectively at the beginning and at the end of the regular expression.

The metacharacter `^` is also used as a special character in classes for negative conditions.
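The effect of multiline mode on both anchors can be sketched in Python as follows. This is only an illustration with a made-up sample string. Note that Python's `re` module treats `^` and `$` as anchors wherever they appear, so the `A^` and `$A` rows of the table do not apply to it.

```python
import re

text = "Alpha\nBeta\nGamma"

# Default mode: ^ and $ anchor to the start and end of the whole string.
first_default = re.findall(r"^\w+", text)
last_default = re.findall(r"\w+$", text)

# Multiline mode: ^ and $ also anchor at the start and end of every line.
first_multi = re.findall(r"^\w+", text, re.MULTILINE)
last_multi = re.findall(r"\w+$", text, re.MULTILINE)

print(first_default)  # ['Alpha']
print(last_default)   # ['Gamma']
print(first_multi)    # ['Alpha', 'Beta', 'Gamma']
print(last_multi)     # ['Alpha', 'Beta', 'Gamma']
```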
A period, `.`, matches any single character (except newline `\n`).

As an example, the regex defined by two consecutive periods, `..`, matches any group of two consecutive characters. The following strings successfully return one or more matches: `"ab"`, `"bc"`, `"abc"`, `"afdc"`, etc. But none of the following strings return a match: `"a"`, `"x"`, `"c"`, etc.
Python code to test if the pattern `..` finds any match in the string “abcd”:

```python
import re

pattern = '..'
test = 'abcd'
# re.match() tries the pattern at the beginning of the string
if re.match(pattern, test):
    print("Search successful.")
else:
    print("Search unsuccessful.")
```

```
Search successful.
```
Another Python code example, using anchors to test if the pattern `^b...s$` finds any match in the string ‘boats’:

```python
import re

# 'b', then any three characters, then 's', and nothing else
pattern = '^b...s$'
test = 'boats'
if re.match(pattern, test):
    print("Search successful.")
else:
    print("Search unsuccessful.")
```

```
Search successful.
```
The same pattern tested with the string ‘boats are on the shore’:

```python
import re

pattern = '^b...s$'
test = 'boats are on the shore'
if re.match(pattern, test):
    print("Search successful.")
else:
    print("Search unsuccessful.")
```

```
Search unsuccessful.
```
No match is found because the pattern requires a string of exactly five characters that starts with ‘b’ and ends with ‘s’.
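The examples above rely on `re.match()`, which only looks for a match at the beginning of the string. As a small sketch of the difference with the other matching functions of Python's `re` module (the `sh..e` pattern is made up for the illustration):

```python
import re

test = "boats are on the shore"

# re.match() only tries the pattern at the start of the string.
at_start = re.match(r"sh..e", test)

# re.search() scans the whole string and returns the first match.
anywhere = re.search(r"sh..e", test)

# re.fullmatch() requires the entire string to match, like ^...$.
whole_ok = re.fullmatch(r"b...s", "boats")
whole_ko = re.fullmatch(r"b...s", test)

print(at_start)          # None
print(anywhere.group())  # shore
print(whole_ok.group())  # boats
print(whole_ko)          # None
```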
Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found. List of quantifiers:

* special characters: `+`, `*`, `?`
* `{n}`, `{n,}` and `{n,m}`, assuming n < m, with n and m representing positive integers.
ERE Syntax | BRE Syntax | Meaning |
---|---|---|
+ | \+ | one or more times |
* | * | zero or more times |
? | \? | zero or one time |
{n} | \{n\} | n times |
{n,m} | \{n,m\} | between n and m times |
{n,} | \{n,\} | n or more times |
Examples:
ERE Pattern | Match |
---|---|
foob.*r | ‘foobar’, ‘foobalkjdflkj9r’, ‘foobr’, … |
foob.+r | ‘foobar’, ‘foobalkjdflkj9r’, …, but not ‘foobr’ |
foob.?r | ‘foobar’, ‘foobbr’, ‘foobr’, … but not ‘foobalkj9r’ |
fooba{2}r | ‘foobaar’ |
fooba{2,}r | ‘foobaar’, ‘foobaaar’, ‘foobaaaar’, … |
fooba{2,3}r | ‘foobaar’ and ‘foobaaar’ but not ‘foobaaaar’ |
^ * | zero or more spaces at the beginning of the line |
1+1 | 11, 111, 1111, 111111, etc. (one or more ‘1’ followed by ‘1’) |
1\+1 | 1+1 since the escaped + is no longer interpreted as a special character |
ba? | “b” or “ba” |
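Several rows of the table can be checked with a short Python sketch. It assumes that Python's `re` syntax behaves like ERE for these quantifiers, and it uses `re.fullmatch()` so that the whole string has to match. The `cases` list is just a convenient structure for the illustration.

```python
import re

# (pattern, strings that should match, strings that should not match)
cases = [
    (r"foob.*r", ["foobar", "foobr"], []),
    (r"foob.+r", ["foobar"], ["foobr"]),
    (r"foob.?r", ["foobar", "foobr"], ["foobalkj9r"]),
    (r"fooba{2}r", ["foobaar"], ["foobar", "foobaaar"]),
    (r"fooba{2,}r", ["foobaar", "foobaaaar"], ["foobar"]),
    (r"fooba{2,3}r", ["foobaar", "foobaaar"], ["foobaaaar"]),
]

results = {}
for pattern, good, bad in cases:
    all_good = all(re.fullmatch(pattern, s) for s in good)
    none_bad = not any(re.fullmatch(pattern, s) for s in bad)
    results[pattern] = all_good and none_bad
    print(pattern, results[pattern])  # True for every row
```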
You can specify a class of characters by enclosing a list of characters in `[...]`, or specify a range of characters using `-` inside square brackets. The class matches any single character from the list.
class | Match |
---|---|
[abcd] | a, b, c, or d |
[a-z] | any lowercase letter |
[A-Z] | Any uppercase letter |
[0-9] | Any digit |
[a-e] | same as [abcde] |
[tT]he | the or The |
[a-z0-9] | Any lowercase letter or digit |
If the first character after `[` is `^`, the class matches any character not in the list (negative condition).
Pattern | Match |
---|---|
[cC]at | “cat” and “Cat” |
hell[aei]r | ‘hellar’, ‘heller’ and ‘hellir’ |
hell[^aeiou]r | ‘hellbr’, ‘hellcr’, … but not ‘hellar’, ‘heller’, … |
^[0123456789]$ | Any line that contains exactly one digit |
^[0-9]$ | Any line that contains exactly one digit |
[A-Za-z0-9_] | Any single character that is a letter, digit, or underscore |
[0-9a-z] | Any single character that is a digit, or a character between “a” and “z” |
[0-9]\] | Any digit followed by “]” |
[0-9]* | Zero or more digits (including nothing at all) |
[hc]+at | “hat”, “cat”, “hhat”, “chat”, “hcat”, “ccchat” etc. |
[hc]?at | “hat”, “cat” and “at” |
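A few of these classes can be tried out with the following Python sketch. The word list is made up for the illustration, and `re.fullmatch()` is used so that each word has to match in full.

```python
import re

words = ["cat", "Cat", "hat", "chat", "at", "hellar", "hellbr", "heller"]

cat_like = [w for w in words if re.fullmatch(r"[cC]at", w)]
hc_plus = [w for w in words if re.fullmatch(r"[hc]+at", w)]
hc_optional = [w for w in words if re.fullmatch(r"[hc]?at", w)]
negated = [w for w in words if re.fullmatch(r"hell[^aeiou]r", w)]
digits = re.findall(r"[0-9]", "a1b22")

print(cat_like)     # ['cat', 'Cat']
print(hc_plus)      # ['cat', 'hat', 'chat']
print(hc_optional)  # ['cat', 'hat', 'at']
print(negated)      # ['hellbr']
print(digits)       # ['1', '2', '2']
```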
By placing part of a regular expression inside parentheses (ERE syntax `(...)` or BRE syntax `\(...\)`), you can capture a group in a regular expression. This allows you to define subexpressions to perform the pattern matching, to apply a quantifier to the entire group, or to restrict alternation to part of the regex. Only parentheses can be used for grouping.
Example:
Pattern | Match |
---|---|
(abc){2} | abcabc |
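The two roles of parentheses, applying a quantifier to a whole group and capturing, can be sketched in Python as follows. The date-like pattern is a made-up example.

```python
import re

# The quantifier {2} applies to the whole group "abc".
repeated = re.fullmatch(r"(abc){2}", "abcabc")

# Without the parentheses, {2} applies only to the preceding "c".
ungrouped = re.fullmatch(r"abc{2}", "abcc")

# Capturing: each pair of parentheses stores the text it matched.
date = re.fullmatch(r"([0-9]{4})-([0-9]{2})", "2024-05")

print(repeated.group(0))   # abcabc
print(repeated.group(1))   # abc (the last repetition of the group)
print(ungrouped.group(0))  # abcc
print(date.groups())       # ('2024', '05')
```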
Pattern | Meaning |
---|---|
\d | any decimal digit. Equivalent to [0-9] |
\D | any character that is not a decimal digit. Equivalent to [^0-9] |
\s | any white space character. Equivalent to [ \t\n\r\f\v] |
\S | any character that is not a white space character. Equivalent to [^ \t\n\r\f\v] |
\w | any “word” character. Equivalent to [a-zA-Z0-9_] |
\W | any “non-word” character. Equivalent to [^a-zA-Z0-9_] |
\N | a character that is not a new line |
\R | a newline sequence |
\v | a vertical white space character |
\V | a character that is not a vertical white space character |
\b | matches at the beginning or end of a word: \bbath matches ‘bathroom’, ‘bathtub’, … but not ‘sunbath’ |
\B | matches at a position that is not a word boundary: \Bno matches ‘annotate’, ‘unknown’, ‘hypnosis’, … |
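Some of these shorthand classes can be tried in Python, as in the sketch below. Two caveats: Python's `re` module does not support `\N`, `\R` or `\V` with the meanings listed above, and for `str` patterns `\d` and `\w` also match non-ASCII digits and letters unless the `re.ASCII` flag is given. The sample text is made up.

```python
import re

text = "Room 42, bathroom, sunbath"

digits = re.findall(r"\d+", text)
words = re.findall(r"\w+", text)

# \b: "bath" must start at a word boundary.
bath_at_start = re.findall(r"\bbath\w*", text)

# \B: "no" must NOT start at a word boundary.
no_inside = re.search(r"\Bno", "annotate")
no_at_start = re.search(r"\Bno", "note")

print(digits)             # ['42']
print(words)              # ['Room', '42', 'bathroom', 'sunbath']
print(bath_at_start)      # ['bathroom']
print(no_inside.group())  # no
print(no_at_start)        # None
```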
The POSIX standard defines some classes or categories of characters as shown below.
class | Similar to | Meaning |
---|---|---|
[:alnum:] | [A-Za-z0-9] | digits, upper and lowercase letters |
[:alpha:] | [A-Za-z] | upper and lowercase letters |
[:blank:] | [ \t] | space and TAB characters only |
[:cntrl:] | | control characters |
[:digit:] | [0-9] | digits |
[:graph:] | | printable characters, not including space |
[:lower:] | [a-z] | lowercase letters |
[:print:] | | printable characters, including space |
[:punct:] | | punctuation (all graphic characters except letters and digits) |
[:space:] | [ \t\n\r\f\v] | whitespace characters |
[:upper:] | [A-Z] | uppercase letters |
[:xdigit:] | [0-9A-Fa-f] | hexadecimal digits |
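Python's `re` module does not implement POSIX classes, so a pattern written with them has to be translated first. The sketch below is one possible way to do it, under stated assumptions: `POSIX_CLASSES` and `expand_posix` are hypothetical names introduced for this example, the mapping is ASCII only and taken from the "Similar to" column above, and classes without an entry in that column are left out.

```python
import re

# Hypothetical ASCII-only mapping, based on the "Similar to" column above.
POSIX_CLASSES = {
    "[:alnum:]": "A-Za-z0-9",
    "[:alpha:]": "A-Za-z",
    "[:blank:]": " \\t",
    "[:digit:]": "0-9",
    "[:lower:]": "a-z",
    "[:space:]": " \\t\\n\\r\\f\\v",
    "[:upper:]": "A-Z",
    "[:xdigit:]": "0-9A-Fa-f",
}

def expand_posix(pattern):
    """Replace POSIX class names found inside bracket expressions."""
    for name, chars in POSIX_CLASSES.items():
        pattern = pattern.replace(name, chars)
    return pattern

# An identifier: a letter or underscore, then letters, digits or underscores.
translated = expand_posix("^[[:alpha:]_][[:alnum:]_]*$")

print(translated)                               # ^[A-Za-z_][A-Za-z0-9_]*$
print(bool(re.match(translated, "my_var1")))    # True
print(bool(re.match(translated, "1abc")))       # False
```

Tools such as `grep` understand the POSIX form directly, so this kind of translation is only needed when moving a pattern to an engine that lacks these classes.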
POSIX BRE syntax does not support alternation.
In ERE or PCRE syntax, the alternation operator `|` matches either the expression before it or the expression after it. As an example, if you want to search for the literal text “cat” or “dog”, you can separate both options as follows: `cat|dog`. If you want more options, simply expand the list: `cat|dog|mouse|fish`.
ERE syntax | Match |
---|---|
(a|b) | “a” or “b” |
([cC]at)|([dD]og) | “cat”, “Cat”, “dog”, “Dog” |
a\.(\(|\)) | “a.)” or “a.(” |
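Alternation can be sketched in Python as follows. The sentence and the `gr(a|e)y` pattern are made up for the illustration, and the last pattern uses a non-capturing group `(?:...)` only so that `re.findall()` returns the whole match.

```python
import re

animals = re.findall(r"cat|dog|mouse|fish", "a dog chased a cat and a fish")

# Parentheses restrict the alternation to part of the pattern.
grouped = re.fullmatch(r"gr(a|e)y", "grey")

# Without parentheses, the alternatives are "gra" and "ey".
ungrouped = re.fullmatch(r"gra|ey", "grey")

# Escaped parentheses are literal characters inside the alternation.
literal = re.findall(r"a\.(?:\(|\))", "a.( a.) a.x")

print(animals)          # ['dog', 'cat', 'fish']
print(grouped.group())  # grey
print(ungrouped)        # None
print(literal)          # ['a.(', 'a.)']
```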
Some special characters behave differently in basic and extended regular expressions: ‘?’, ‘+’, parentheses, braces (‘{}’), and ‘|’.

In BRE syntax, `?`, `+`, `{` and `}`, `(` and `)` have a literal meaning. To be interpreted as special characters, they have to be escaped with a preceding backslash (`\?` and `\+` are GNU extensions to POSIX BRE).

In ERE syntax, it is exactly the opposite: `?`, `+`, braces and parentheses lose their special meaning if they are escaped with a backslash.
Desired pattern | Basic (BRE) Syntax | Extended (ERE) Syntax |
---|---|---|
{N} as a literal | {N} | \{N\} |
{N} as a quantifier | \{N\} | {N} |
literal + (plus sign) | a+b=c | a\+b=c |
+ as a quantifier* | a\+b | a+b |
*One or more ‘a’ characters followed by ‘b’
Note that in Linux, the `!` character invokes bash’s history substitution. When followed by a string, it tries to expand to the last history event that began with that string: `!echo` would expand to the last echo command in your history. For more details, refer to `man bash` (see the QUOTING section).
PCRE is a library written in C that implements a regular expression engine inspired by the capabilities of the Perl programming language. PCRE has inspired the regular expression syntax of many programming languages. It has its own native API, as well as a set of wrapper functions that correspond to the POSIX regular expression API. The PCRE syntax is much more flexible than the POSIX regular expression syntax.
Note that the Linux command `grep` with the `-P` option can use the PCRE syntax.
Let’s see some of the special features of the PCRE syntax.
The quantifiers `*`, `+`, and `?` are all greedy: they match as much text as possible. This is a behaviour that you may not expect. To turn their greedy behaviour into a lazy behaviour, we add the character `?` to the quantifier as follows: `*?`, `+?`, `??`.

Let’s study the greedy combination `.*`, also called dot-star or quantified dot.
Let’s see through an example how greedy quantifiers work:
Assume that you want to extract substrings from a string using delimiter tags such as `{START}` and `{END}`.

As an example, let’s consider the following string: `"{START} Richard {END} is {START} in the kitchen {END}"`

We would like to extract two substrings: `"{START} Richard {END}"` and `"{START} in the kitchen {END}"`

If we apply the regex `{START}.*{END}` and search the string to get a match, it returns the entire string as a match.
How does the search pattern `{START}.*{END}` work?

* After matching `{START}`, the parser moves to the next character.
* `.*` matches all characters to the end of the string. It stops at the very last character `}`.
* The parser then backtracks one character at a time, giving back `}`, `D`, `N`, `E` and `{`. Then the parser stops since it gets a match with the last tag `{END}` from the regex.

How can we change this behaviour and prevent the parser from reaching the end of the string as a first step?
The easiest way is to make the dot-star “lazy” by adding a question mark `?` at the end of the search sub-pattern: `.*?`.

`.*?` is called lazy since it guarantees that the quantified dot only matches as many characters as needed for the rest of the pattern to succeed: the pattern only matches one `{START}...{END}` item at a time.
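The greedy and lazy behaviours described above can be compared with a short Python sketch. The braces are escaped here so that they are read as literal characters in any case, which is a small precaution added for this example.

```python
import re

text = "{START} Richard {END} is {START} in the kitchen {END}"

# Greedy dot-star: runs to the end of the string, then backtracks to the last {END}.
greedy = re.findall(r"\{START\}.*\{END\}", text)

# Lazy dot-star: stops at the first {END} that lets the pattern succeed.
lazy = re.findall(r"\{START\}.*?\{END\}", text)

print(greedy == [text])  # True: a single match covering the entire string
print(lazy)              # ['{START} Richard {END}', '{START} in the kitchen {END}']
```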
As another example, taken from the Python documentation, let’s consider the following string: `"<a> b <c>"`

We would like to extract the two tags `"<a>"` and `"<c>"` from the string.

If we apply the regex `<.*>` and search the string to find a match, it returns the entire string “<a> b <c>” as a match (greedy behaviour).

Since we just want `<a>` and `<c>` to be matched, we need to add the character `?` after the quantified dot (like in the previous example) to perform the match in a non-greedy way. The new search pattern, `<.*?>`, will find a match in `<a>` and `<c>`.
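The same comparison for the tag example, as a minimal Python sketch. The third pattern, based on a negated character class, is not part of the text above. It is shown as a common alternative that gives the same result here without a lazy quantifier.

```python
import re

text = "<a> b <c>"

greedy = re.findall(r"<.*>", text)
lazy = re.findall(r"<.*?>", text)
negated_class = re.findall(r"<[^>]*>", text)

print(greedy)         # ['<a> b <c>']
print(lazy)           # ['<a>', '<c>']
print(negated_class)  # ['<a>', '<c>']
```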
Search Pattern | Name | Meaning |
---|---|---|
(?=regex) | Positive lookahead | match something followed by the expression regex |
The following Python code uses `data(?=[a-z])` as a search pattern (note that the pattern includes a lookahead assertion, `(?=[a-z])`). Let’s see what the search engine returns with the string ‘dataset’:
## <re.Match object; span=(0, 4), match='data'>
Let’s see what the search engine returns using the same pattern on the string ‘data123’:
## None
`None` is returned.

Search Pattern | Name | Meaning |
---|---|---|
(?!regex) | Negative lookahead | match something not followed by the expression regex |
In `data(?![a-z])`, the negative lookahead assertion specifies that ‘data’ should not be followed by a lowercase alphabetic character. If we change the lookahead into a negative lookahead in our previous examples, the first search returns `None` (data is followed by `s`) while the second one returns a match:
## None
## <re.Match object; span=(0, 4), match='data'>
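The four results above can be reproduced with a sketch along these lines. It assumes the searches were run with `re.search()`, which is consistent with the `re.Match` objects shown, but the exact original calls are not given in the text.

```python
import re

positive = re.compile(r"data(?=[a-z])")  # 'data' followed by a lowercase letter
negative = re.compile(r"data(?![a-z])")  # 'data' NOT followed by a lowercase letter

pos_dataset = positive.search("dataset")
pos_data123 = positive.search("data123")
neg_dataset = negative.search("dataset")
neg_data123 = negative.search("data123")

print(pos_dataset)  # a match object for 'data', span (0, 4)
print(pos_data123)  # None
print(neg_dataset)  # None
print(neg_data123)  # a match object for 'data', span (0, 4)
```

In both successful cases the match is only `data`: the character checked by the lookahead is not part of the match.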
Search Pattern | Name | Meaning |
---|---|---|
(?<=regex) | Positive look-behind | match something preceded by the expression regex without including it in the match |

Example: find expression A where expression B precedes it: `(?<=B)A`
Search Pattern | Name | Meaning |
---|---|---|
(?<!regex) | Negative look-behind | match something not preceded by the expression regex without including it in the match |

Example: find expression A where expression B does not precede it: `(?<!B)A`
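As a small Python sketch of both look-behind forms, with A being a run of digits and B a dollar sign. The sample string is made up. In Python's `re` module the look-behind expression must have a fixed width, which is the case here.

```python
import re

prices = "cost: $42, weight: 42kg, discount: $7"

# (?<=B)A: digits preceded by a dollar sign; the '$' is not part of the match.
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# (?<!B)A: digits not preceded by a dollar sign. The digit is also excluded
# from the look-behind so that a number is never matched from its middle.
not_after_dollar = re.findall(r"(?<![\$\d])\d+", prices)

print(after_dollar)      # ['42', '7']
print(not_after_dollar)  # ['42']
```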
`(?()|)`
Conditional: `(?(id/name)yes-pattern|no-pattern)`
`(?#...)`
A comment; the contents of the parentheses are simply ignored.
`(?:...)`
Non-capturing group. A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern. Example: `color=(?:red|green|blue)`
`(?P<Y>...)`
Capturing group named Y. Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name Y.
## 's'
* `data` finds a match in `dataset`: the parser reads and consumes these four characters.
* `(?=[a-z])` is a lookahead assertion that successfully matches ‘s’, but note that the parser remains at the same position: it does not consume the ‘s’.
* `(?P<ref>.)` matches the next single character available, which is ‘s’, and captures it in a group named `ref`.
* `result.group('ref')` returns the group named `ref`, which contains ‘s’.

Let’s compare the previous search to the following example, which does not contain any lookahead:
## e
* `data` finds a match in `dataset`: the parser reads and consumes these four characters.
* Without the lookahead, the next character ‘s’ is consumed as well.
* `(?P<ref>.)` matches the next single character available: ‘e’.
* `result.group('ref')` returns the group named `ref`, which contains ‘e’.

TODO: TO BE CONTINUED…
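The two searches compared above can be sketched as follows. The patterns are assumptions reconstructed from the step-by-step explanation and from the outputs, since the original code is not shown. In particular, the second pattern assumes that the lookahead was replaced by a plain `[a-z]`, which consumes the ‘s’.

```python
import re

# Assumed pattern with a lookahead: the 's' is checked but not consumed.
with_lookahead = re.search(r"data(?=[a-z])(?P<ref>.)", "dataset")

# Assumed pattern without a lookahead: [a-z] consumes the 's'.
without_lookahead = re.search(r"data[a-z](?P<ref>.)", "dataset")

print(with_lookahead.group("ref"))     # s
print(without_lookahead.group("ref"))  # e
```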
`(?P=Y)`
Match the named group Y: a backreference to Y; it matches whatever text was matched by the earlier group named Y.
(?>...)
Atomic group
(?|...)
Duplicate group numbers
\Y
Match the Y’th captured group
(?R)
Recurse into entire pattern
(?Y)
Recurse into numbered group Y
(?&Y)
Recurse into named group Y
\g{Y}
Match the named or numbered group Y
\g<Y>
Recurse into named or numbered group Y
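Only part of this list is available in Python's `re` module. The sketch below is limited to constructs that Python supports: the named backreference `(?P=Y)`, the numbered backreference `\1` and, for contrast, a non-capturing group. The sample strings are made up. Recursion, `(?|...)` and `\g{Y}` inside a pattern are PCRE features and are not shown.

```python
import re

# Named backreference: (?P=word) must match the same text as (?P<word>...).
doubled_word = re.search(r"\b(?P<word>\w+) (?P=word)\b", "this is is a test")

# Numbered backreference: \1 must match the same text as the first group.
doubled_letters = re.findall(r"(\w)\1", "letter book")

# A non-capturing group is not stored, so there is nothing to refer back to.
colour = re.search(r"color=(?:red|green|blue)", "color=green")

print(doubled_word.group("word"))  # is
print(doubled_letters)             # ['t', 'o']
print(colour.group())              # color=green
print(colour.groups())             # ()
```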
`i`
Ignore case
`m`
Make `^` and `$` match the start and end of each line
`s`
Make `.` match newline as well
`x`
Allow spaces and comments
`J`
Duplicate group names allowed
`U`
Ungreedy quantifiers
`(?iLmsux)`
Set flags within regex
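In Python, the `i`, `m`, `s` and `x` flags correspond to `re.IGNORECASE`, `re.MULTILINE`, `re.DOTALL` and `re.VERBOSE`. The `J` and `U` flags are PCRE options with no equivalent in the `re` module. A minimal sketch with a made-up sample text:

```python
import re

text = "First line\nsecond LINE"

ignore_case = re.findall(r"line", text, re.IGNORECASE)  # i
multiline = re.findall(r"^\w+", text, re.MULTILINE)     # m
dot_default = re.search(r"First.*LINE", text)           # '.' stops at the newline
dot_all = re.search(r"First.*LINE", text, re.DOTALL)    # s

# x: whitespace and comments inside the pattern are ignored.
verbose = re.search(r"""
    (\w+)   # first word
    \s+     # whitespace
    (\w+)   # second word
    """, text, re.VERBOSE)

# The same flag can be set inside the regex itself.
inline = re.findall(r"(?i)line", text)

print(ignore_case)       # ['line', 'LINE']
print(multiline)         # ['First', 'second']
print(dot_default)       # None
print(dot_all.group() == text)  # True
print(verbose.groups())  # ('First', 'line')
print(inline)            # ['line', 'LINE']
```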
https://remram44.github.io/regex-cheatsheet/regex.html
POSIX (BRE) | POSIX extended (ERE) | Perl/PCRE | Python re module |
---|---|---|---|