Regular Expressions

A regular expression (regex) is a pattern of characters used to find matches in a larger string or text. The pattern describes what a matching string should look like.

For example, if we have the following text, “the gray hound chased the red fox for over a full mile,” we might want to find the color gray in the text. A regex to find the word “gray” in the text would simply be the characters ‘gray’.

In [4]:
#One way to do this
string="the gray hound chased the red fox for over a full mile"
string2match="gray"
for x in string.split():
    if (x==string2match):
        print("match")
        
match
In [9]:
#using regex
import re
pattern = re.compile("gray") #regex constructor
string="the grey hound chased the red fox for over a full mile"
match = re.search(pattern, string) # returns a match object, or None if nothing matches
if match:
    print("match")
else:
    print("no match")
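
On success, `re.search` returns a match object (and `None` otherwise), so the result carries more than a yes/no answer. A minimal sketch reusing the example sentence; the variable names `text`, `found` and `missing` are illustrative and not part of the original notebook:

```python
import re

text = "the gray hound chased the red fox for over a full mile"

# re.search returns a match object on success and None on failure
found = re.search("gray", text)
missing = re.search("grey", text)

print(found.group())  # the text that was matched
print(found.span())   # (start, end) indexes of the match inside text
print(missing)        # None, because "grey" does not occur in text
```

Because a match object is truthy and `None` is falsy, the `if match:` test used in the cells above works on either result.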

Literal Characters:

Most characters in a pattern are literals that match themselves, as the characters ‘gray’ did above. Regex also has a number of characters with special meaning, called metacharacters; we will explore how to use them later in this chapter. The metacharacters are: ^ $ . | { } [ ] ( ) * + ? \

Square Brackets []

We are going to introduce our first metacharacters, the square brackets []. The square brackets provide a list of potential matching characters at a single position in the search text. So, to search for 7pm regardless of case, we could use the regex pattern 7[Pp][Mm].
In [12]:
import re
string="I will go to the meeting  at 7pM"
pattern=re.compile('7[Pp][Mm]')
match = re.search(pattern, string) 
if match:
    print("match")
else:
    print("no match")
match
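
As an alternative to listing both cases in every set, Python's `re` module offers the `re.IGNORECASE` flag. The sketch below compares the two approaches on the same sentence; the names `by_sets` and `by_flag` are made up for illustration:

```python
import re

text = "I will go to the meeting  at 7pM"

# Spelling out both cases by hand with character sets
by_sets = re.search("7[Pp][Mm]", text)

# Letting the re.IGNORECASE flag handle case for the whole pattern
by_flag = re.search("7pm", text, flags=re.IGNORECASE)

print(by_sets.group())
print(by_flag.group())
```

The character-set form is portable to other regex engines, while the flag keeps longer patterns readable.
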
In [14]:
import re
# how to match 1 or 9 a.m., or 1 or 9 p.m.? List both digits in a set
string="I will go to the office at 6am, wanna join me?"
pattern=re.compile('[19][AaPp][Mm]') # the set [19] lists only 1 and 9, so 6am does not match
match = re.search(pattern, string) 
if match:
    print("match")
else:
    print("no match")
no match
In [16]:
import re
# Use a range: [1-9] matches any single digit from 1 to 9
string="I will go to the office at 6am, wanna join me?"
pattern=re.compile('[1-9][AaPp][Mm]')
match = re.search(pattern, string) 
if match:
    print("match")
else:
    print("no match")
match
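
A character set, whether written as a list or a range, always stands for exactly one character. The following sketch illustrates this with a made-up sentence (the text and the name `times` are assumptions for the example):

```python
import re

text = "Trains leave at 6am, 9PM and 12pm"

# [1-9] consumes a single digit, so "12pm" is only matched from the "2" onward
times = re.findall(r"[1-9][AaPp][Mm]", text)

print(times)  # ['6am', '9PM', '2pm']
```

Matching two-digit hours needs a way to repeat the set, which is what the quantifiers later in this chapter provide.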

Some characters do not have ASCII equivalents; you can use their Unicode code point to search for them. For example, to look for the Euro symbol €, we could use the regex pattern \u20AC. Here we write the pattern as a raw string, r"\u20AC", so that the backslash escape reaches the regex engine unchanged; in a normal string, '\u20AC', Python converts the escape to the € character before the regex is compiled. Both forms match the same text.

In [24]:
import re
string="I have some €€€ to spend on vodka"
pattern=re.compile(r"\u20AC")
match = re.search(pattern, string)
if match:
    print("match")
else:
    print("no match")
match
In [25]:
import re
string="I have some $$$ to spend on vodka"
pattern=re.compile(r"\u0024")
match = re.search(pattern, string) 
if match:
    print("match")
else:
    print("no match")
match
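
To make the role of the raw string concrete, the sketch below runs the euro pattern both ways and also shows a backslash-escaped dollar sign. The sample sentence and variable names are illustrative assumptions:

```python
import re

text = "I have some €€€ and $$$ to spend"

# Raw string: the regex engine itself interprets the \u20AC escape
raw_euro = re.findall(r"\u20AC", text)

# Plain string: Python turns \u20AC into the euro character before re sees it
plain_euro = re.findall("\u20AC", text)

# An escaped dollar sign matches the literal character;
# an unescaped $ would instead anchor the match to the end of the string
dollars = re.findall(r"\$", text)

print(raw_euro == plain_euro)
print(len(raw_euro), len(dollars))
```

Both euro patterns find the same three symbols; the raw-string habit matters more for escapes such as `\d` or `\w`, which only the regex engine understands.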

Negation ^

When you build a character set within the brackets, it is sometimes easier to specify which values you want to exclude. This is what the ^ metacharacter allows you to do when it is the first character inside the square brackets.

In [58]:
import re
digit="32039"
pattern=re.compile(r"\d") # shorthand for a digit, an alternative to [0-9]
match = re.search(pattern, digit) 
if match:
    print("match")
else:
    print("no match")
match
In [61]:
import re
string="no digits around here"
pattern=re.compile("[^0-9]") # alternative r"\D"
match = re.search(pattern, string) 
if match:
    print("match")
else:
    print("no match")
match
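
The meaning of ^ depends on where it appears. This short sketch, using an invented sample string, contrasts the negated set with the use of ^ outside brackets, where it anchors a pattern to the start of the string:

```python
import re

text = "Room 101, floor 3"

# Inside brackets, a leading ^ negates the set: collect runs of non-digits
non_digits = re.findall(r"[^0-9]+", text)

# Outside brackets, ^ is an anchor meaning "start of the string"
starts_with_room = re.search(r"^Room", text)
starts_with_floor = re.search(r"^floor", text)

print(non_digits)                 # ['Room ', ', floor ']
print(starts_with_room is not None)
print(starts_with_floor is None)
```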

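Alphanumeric (word characters)

The \w metacharacter matches any word character (any uppercase or lowercase letter, any digit, and the underscore).

For ASCII text it is the same as [a-zA-Z0-9_]. In Python 3, \w in a str pattern also matches letters and digits from other alphabets unless the re.ASCII flag is set.

The \W metacharacter negates that and matches any non-word character (i.e. [^a-zA-Z0-9_] for ASCII text).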
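
The difference between \w and the explicit set [a-zA-Z0-9_] only shows up on non-ASCII text. A small sketch, assuming a sample string with accented letters (written with explicit escapes so the code points are unambiguous):

```python
import re

text = "caf\u00e9_1 na\u00efve"  # "café_1 naïve"

# Default str behaviour: accented letters count as word characters
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w is restricted to [a-zA-Z0-9_]
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)  # ['café_1', 'naïve']
print(ascii_words)    # ['caf', '_1', 'na', 've']
```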

The dot (.)

The dot metacharacter matches any single character (with the exception of newline).
In [140]:
import re
pattern = re.compile("gr.y") 
string1="the grey hound chased the red fox for over a full mile" #British English
string2="the gray hound chased the red fox for over a full mile" #American English
match = re.search(pattern, string1) # returns a match object, or None if nothing matches
if match:
    print("match")
else:
    print("no match")
match
In [69]:
import re
pattern = re.compile("[gr].[y]") 
string="the greedy hound chased the red fox for over a full mile"
match = re.search(pattern, string) # returns a match object, or None if nothing matches
if match:
    print("match")
else:
    print("no match")
no match
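
The newline exception can be seen directly, and Python's `re.DOTALL` flag lifts it. A minimal sketch with an artificial two-line string (the names are illustrative):

```python
import re

two_lines = "gr\ny"

# By default the dot refuses to match the newline between "gr" and "y"
default_dot = re.search("gr.y", two_lines)

# With re.DOTALL the dot matches any character, newline included
dotall_dot = re.search("gr.y", two_lines, flags=re.DOTALL)

print(default_dot)                # None
print(dotall_dot is not None)
```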

Curly brackets {}

The curly bracket metacharacters allow you to specify the number of times a particular character or character set occurs. For a simple example, let’s take our euro regex from earlier in this chapter and make it a more general-purpose currency search.

First, we want a currency symbol, which in this case is the euro or the dollar sign: [\u20AC\u0024]. It is followed by at least one digit, but no more than five digits, then a period and two digits: \d{1,5}\.\d\d. The period is escaped with a backslash because an unescaped dot matches any character.

The {1,5} following the \d indicates we need at least one, but no more than five, digits after the currency symbol and before the period. Our options with the curly brackets are:

 {n} - Preceding character must occur exactly “n” times.
 {n,} - At least “n” times, but no upper limit.
 {n,m} - Between “n” and “m” times.

In [1]:
import re
string="I have some €12.80 to spend on vodka"
pattern=re.compile(r"[\u20AC\u0024]\d{1,5}\.\d\d")
match = re.search(pattern, string)
if match:
    print("match")
else:
    print("no match")
match
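
The three curly-bracket forms can be compared side by side. This sketch uses `re.fullmatch`, which succeeds only when the whole string fits the pattern, on a made-up list of digit strings (`samples` and the result names are assumptions for the example):

```python
import re

samples = ["7", "42", "123", "2024", "99999"]

# {n}: exactly three digits
exactly_three = [s for s in samples if re.fullmatch(r"\d{3}", s)]

# {n,}: three digits or more
at_least_three = [s for s in samples if re.fullmatch(r"\d{3,}", s)]

# {n,m}: between two and four digits
two_to_four = [s for s in samples if re.fullmatch(r"\d{2,4}", s)]

print(exactly_three)   # ['123']
print(at_least_three)  # ['123', '2024', '99999']
print(two_to_four)     # ['42', '123', '2024']
```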

Quantifier Symbols

( * ) Match zero or more times {0,}

( + ) Match one or more times {1,}

( ? ) Match zero or one time {0,1}

In [104]:
import re
string="<h1>This is a <strong>header</strong></h1>"
string2="<h1>This is a <i>header</i></h1>"
pattern=re.compile("<h1>.*<strong>.*</strong>.*</h1>")
match = re.search(pattern, string)
if match:
    print("match")
else:
    print("no match")
match
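
To see how the three symbols differ, the sketch below applies each one to the letter u in a small, invented word list. It uses `re.fullmatch` so that the whole word has to fit the pattern:

```python
import re

words = ["colr", "color", "colour", "colouur"]

# u* : zero or more occurrences of "u"
zero_or_more = [w for w in words if re.fullmatch(r"colou*r", w)]

# u+ : one or more occurrences of "u"
one_or_more = [w for w in words if re.fullmatch(r"colou+r", w)]

# u? : zero or one occurrence of "u"
zero_or_one = [w for w in words if re.fullmatch(r"colou?r", w)]

print(zero_or_more)  # ['color', 'colour', 'colouur']
print(one_or_more)   # ['colour', 'colouur']
print(zero_or_one)   # ['color', 'colour']
```

Each quantifier applies only to the item immediately before it, here the single letter u.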

How to extract the result of a match?

In [133]:
# EXTRACTING THE RESULTS
import re
string="my tshirt cost 5$, my jacket was 20$ and I am 6 years old "

pattern=re.compile(r"\d*[\u0024]") # raw string so \d reaches the regex engine unchanged
result=re.findall(pattern,string)
print(result)
['5$', '20$']
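
When only part of each match is wanted, parentheses can mark the part to keep. The sketch below is one possible way to pull out just the numbers; it uses `re.finditer` and `group(1)`, and the names `amounts` and `total` are illustrative:

```python
import re

text = "my tshirt cost 5$, my jacket was 20$ and I am 6 years old"

# The parentheses capture the digits only; the dollar sign must follow
# but stays outside the group, so "6 years" is ignored
amounts = [int(m.group(1)) for m in re.finditer(r"(\d+)\$", text)]
total = sum(amounts)

print(amounts)  # [5, 20]
print(total)    # 25
```

Using `\d+` instead of `\d*` here also guarantees that a dollar sign with no digits in front of it is not captured.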

How to remove noisy words?

In [159]:
import re
tweet="Mah god, pizza is not for breakfast #are_U_Seriously#lmfao"
pattern=re.compile(r"(#\w*)") # raw string so \w reaches the regex engine unchanged
result=re.sub(pattern,'',tweet)
print(result)
Mah god, pizza is not for breakfast 

How to modify your data?

In [163]:
import re
tweet="Mah god, pizza is not for breakfast #are_U_Seriously#lmfao@little_cousin"
pattern=re.compile(r"(#\w*)") # hashtags
result=re.sub(pattern,'hashtag ',tweet)
pattern=re.compile(r"(@\w*)") # mentions
result=re.sub(pattern,'reference ',result)
print(result)
Mah god, pizza is not for breakfast hashtag hashtag reference 

How to combine regexes together?

In [168]:
#combine regex together
import re
tweet="Mah god, pizza is not for breakfast #are_U_Seriously#lmfao@little_cousin"
pattern=re.compile(r"((#\w*)|(@\w*))") # | means "either the left or the right pattern"
result=re.sub(pattern,'',tweet)

print(result)
Mah god, pizza is not for breakfast 
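
When the alternatives differ only in their first character, a character set can stand in for the | operator. This sketch checks that the two spellings give the same result on the tweet above; the variable names are assumptions for the example:

```python
import re

tweet = "Mah god, pizza is not for breakfast #are_U_Seriously#lmfao@little_cousin"

# Alternation: either a hashtag or a mention
with_alternation = re.sub(r"(#\w*)|(@\w*)", "", tweet)

# Character set: one of # or @, followed by word characters
with_char_set = re.sub(r"[#@]\w*", "", tweet)

print(with_alternation == with_char_set)
print(repr(with_char_set))
```

Alternation remains the more general tool, since its branches can be entirely different patterns.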
