Regular Expressions

A regular expression (regex) is a pattern of characters used to find matches in a larger string or text. The pattern describes what a matching string should look like.

For example, if we have the following text, “the gray hound chased the red fox for over a full mile,” we might want to find the color gray in the text. A regex to find the word “gray” in the text would simply be the characters ‘gray’.

# One way to do this: split the text into words and compare each one
string = "the gray hound chased the red fox for over a full mile"
string2match = "gray"
for x in string.split():
    if x == string2match:
        print("match")

match

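# Using regex
import re
pattern = re.compile("gray")  # compile the pattern into a regex object
string="the grey hound chased the red fox for over a full mile"  # spelled "grey", so the literal pattern "gray" does not match
match = re.search(pattern, string)  # returns a Match object, or None if nothing matches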
if match:
    print("match")
else:
    print("no match")
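
Since re.search returns a Match object rather than a boolean, the object can also tell us what was matched and where. The following is a minimal sketch of that idea; the variable names text and miss are ours and not part of the original cells.

```python
import re

text = "the gray hound chased the red fox for over a full mile"

# On success, re.search returns a Match object, which is truthy.
match = re.search("gray", text)
if match:
    print(match.group())  # the matched text: gray
    print(match.span())   # (start, end) indices of the match: (4, 8)

# On failure, re.search returns None, which is falsy.
miss = re.search("gray", "the grey hound")
print(miss)  # None
```

This is why `if match:` works as a test: a Match object counts as true and None counts as false.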

import re
string="I will go to the meeting  at 7pM"
pattern=re.compile('7[Pp][Mm]')
match = re.search(pattern, string) 
if match:
    print("match")
else:
    print("no match")

match

import re
# How to capture 1 or 9 a.m., or 1 or 9 p.m.?
string="I will go to the office at 6am, wanna join me?"
pattern=re.compile('[19][AaPp][Mm]')  # [19] accepts only the digits 1 and 9, so 6am is not matched
match = re.search(pattern, string) 
if match:
    print("match")
else:
    print("no match")

no match

import re
# Use a range: [1-9] accepts any digit from 1 to 9
string="I will go to the office at 6am, wanna join me?"
pattern=re.compile('[1-9][AaPp][Mm]')
match = re.search(pattern, string) 
if match:
    print("match")
else:
    print("no match")

match
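
The range [1-9] covers single-digit hours only. One possible extension, sketched here under our own assumptions, also accepts 10, 11 and 12 by using alternation with |. The word-boundary anchor \b and the name time_pattern are additions that do not appear in the original cells; \b keeps "13pm" from matching on its trailing "3pm".

```python
import re

# Hypothetical extension: hours 1-12 followed by am/pm in any letter case.
# 1[0-2] covers 10, 11, 12; [1-9] covers the single-digit hours.
time_pattern = re.compile(r"\b(1[0-2]|[1-9])[AaPp][Mm]\b")

for text in ["office at 6am", "lunch at 12PM", "closing at 13pm"]:
    match = time_pattern.search(text)
    print(text, "->", match.group() if match else None)
```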

Some characters are awkward to type or have no ASCII equivalent; you can search for them by their Unicode code point instead. For example, to look for the euro symbol €, we could use the regex pattern \u20AC. Writing the pattern as a raw string, r"\u20AC", passes the backslash sequence to the re module unchanged, so the regex engine, not the Python string literal, interprets the escape.

import re
string="I have some €€€ to spend on vodka"
pattern=re.compile(r"\u20AC")
match = re.search(pattern, string)
if match:
    print("match")
else:
    print("no match")

match

import re
string="I have some $$$ to spend on vodka"
pattern=re.compile(r"\u0024")
match = re.search(pattern, string) 
if match:
    print("match")
else:
    print("no match")

match
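
To see what the raw-string prefix changes, we can compare three ways of writing the same pattern. This is a small sketch; the dictionary name candidates is ours. All three spellings find the euro sign, but only the raw string hands the six-character escape sequence to the regex engine.

```python
import re

text = "I have some €€€ to spend"

candidates = {
    "literal": "€",             # the character itself
    "string escape": "\u20AC",  # Python converts this to € before re sees it
    "raw escape": r"\u20AC",    # re itself interprets the \u escape
}

# Every spelling finds the euro sign.
results = {name: bool(re.search(p, text)) for name, p in candidates.items()}
print(results)

# The non-raw pattern is one character long; the raw one is six.
print(len(candidates["string escape"]), len(candidates["raw escape"]))
```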

Negation ^

When you build a character set within square brackets, it is sometimes easier to specify which characters to exclude. A ^ placed as the first character inside the brackets does this: [^0-9] matches any character that is not a digit.

import re
digit="32039"
pattern=re.compile(r"\d") # \d matches a digit, like [0-9]
match = re.search(pattern, digit) 
if match:
    print("match")
else:
    print("no match")

match

import re
string="no digits around here"
pattern=re.compile("[^0-9]") # alternative r"\D"
match = re.search(pattern, string) 
if match:
    print("match")
else:
    print("no match")

match
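
A negated set only needs one character outside the set to succeed, so it is worth checking what happens when there is none. A minimal sketch, using the hypothetical name non_digit:

```python
import re

non_digit = re.compile(r"[^0-9]")  # any character that is not a digit

# Every character is a digit, so nothing matches.
all_digits = non_digit.search("32039")
print(all_digits)  # None

# The first non-digit character is returned.
first = non_digit.search("room 101").group()
print(first)  # r

# findall collects every non-digit character.
letters = non_digit.findall("a1b2")
print(letters)  # ['a', 'b']
```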

Alphanumeric (word characters)

The \w metacharacter matches any word character: any uppercase or lowercase letter, any digit, and the underscore.

For ASCII text it is the same as [a-zA-Z0-9_]. In Python 3, \w also matches letters and digits from other scripts unless the re.ASCII flag is set.

The \W metacharacter negates that and matches any non-word character (for ASCII text, [^a-zA-Z0-9_]).
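
The difference between \w and \W is easiest to see with findall. The sample sentence below is made up for illustration, and the accented letter is written as an escape so the example does not depend on how the file is encoded. The last call uses the re.ASCII flag to restrict \w to [a-zA-Z0-9_].

```python
import re

# "caf\u00e9" is "café"; the text also contains a euro sign.
text = "user_42 paid €5 at caf\u00e9!"

words = re.findall(r"\w+", text)       # runs of word characters
non_words = re.findall(r"\W", text)    # single non-word characters
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(words)        # ['user_42', 'paid', '5', 'at', 'café']
print(non_words)    # [' ', ' ', '€', ' ', ' ', '!']
print(ascii_words)  # ['user_42', 'paid', '5', 'at', 'caf']
```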

import re
pattern = re.compile("gr.y")  # the dot matches any single character except a newline
string1="the grey hound chased the red fox for over a full mile" # British English
string2="the gray hound chased the red fox for over a full mile" # American English
match = re.search(pattern, string1) # returns a Match object or None; string2 matches as well
if match:
    print("match")
else:
    print("no match")

match

import re
pattern = re.compile("[gr].[y]") 
string="the greedy hound chased the red fox for over a full mile"
match = re.search(pattern, string) # returns a Match object or None
if match:
    print("match")
else:
    print("no match")

no match
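
The dot accepts any single character, which can be more permissive than intended. As a sketch of the trade-off, the comparison below sets "gr.y" against a character set that allows only the two spellings; the word list and the names dot and strict are ours.

```python
import re

words = ["gray", "grey", "gr8y", "gr y", "gry"]

dot = re.compile(r"gr.y")        # any one character between "gr" and "y"
strict = re.compile(r"gr[ae]y")  # only "a" or "e" in that position

dot_hits = [w for w in words if dot.search(w)]
strict_hits = [w for w in words if strict.search(w)]

print(dot_hits)     # ['gray', 'grey', 'gr8y', 'gr y']
print(strict_hits)  # ['gray', 'grey']
```

"gry" matches neither pattern because both require exactly one character between "gr" and "y".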

Curly brackets {}

The curly brackets metacharacters allow you to specify the number of times a particular character set occurs. For a simple example, let’s take our euro regex from the previous chapter and make it a more general purpose currency search.

First, we want a currency symbol, which in this case is the euro or the dollar sign: [\u20AC\u0024]. It is followed by at least one digit but no more than five digits, then a period, then two digits: \d{1,5}\.\d\d. The period is escaped as \. because an unescaped . matches any character.

The {1,5} following \d indicates that we need at least one, but no more than five, digits after the currency symbol and before the period. Our options with the curly brackets are:

{n} - The preceding character must occur exactly “n” times.
{n,} - At least “n” times, with no upper limit.
{n,m} - Between “n” and “m” times.
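
The three forms can be compared side by side on a few digit strings. This sketch uses re.fullmatch, which succeeds only when the whole string matches the pattern, so the counts are not blurred by partial matches. The sample values are invented for illustration.

```python
import re

samples = ["7", "42", "123", "12345", "1234567"]

# {n}: exactly three digits
exactly_three = [s for s in samples if re.fullmatch(r"\d{3}", s)]
# {n,}: three or more digits
three_or_more = [s for s in samples if re.fullmatch(r"\d{3,}", s)]
# {n,m}: between one and five digits
one_to_five = [s for s in samples if re.fullmatch(r"\d{1,5}", s)]

print(exactly_three)  # ['123']
print(three_or_more)  # ['123', '12345', '1234567']
print(one_to_five)    # ['7', '42', '123', '12345']
```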

import re
string="I have some €12.80 to spend on vodka"
pattern=re.compile(r"[\u20AC\u0024]\d{1,5}\.\d\d")
match = re.search(pattern, string)
if match:
    print("match")
else:
    print("no match")

match

Quantifier Symbols

( * ) Match zero or more times; same as {0,}

( + ) Match one or more times; same as {1,}

( ? ) Match zero or one time; same as {0,1}
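
Each symbol applies to the character or set just before it. A small sketch with an invented word list shows how the three quantifiers treat the letter "u"; re.fullmatch is used so that the entire word has to fit the pattern.

```python
import re

samples = ["color", "colour", "colouur"]

# u* : zero or more "u"
star = [s for s in samples if re.fullmatch(r"colou*r", s)]
# u+ : one or more "u"
plus = [s for s in samples if re.fullmatch(r"colou+r", s)]
# u? : zero or one "u"
optional = [s for s in samples if re.fullmatch(r"colou?r", s)]

print(star)      # ['color', 'colour', 'colouur']
print(plus)      # ['colour', 'colouur']
print(optional)  # ['color', 'colour']
```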

import re
string="<h1>This is a <strong>header</strong></h1>"
string2="<h1>This is a <i>header</i></h1>"
pattern=re.compile("<h1>.*<strong>.*</strong>.*</h1>")
match = re.search(pattern, string)
if match:
    print("match")
else:
    print("no match")

match

How to extract the result of a match?

# EXTRACTING THE RESULTS
import re
string="my tshirt cost 5$, my jacket was 20$ and I am 6 years old "

pattern=re.compile(r"\d*[\u0024]")  # raw string, so \d and \u0024 reach the regex engine unchanged
result=re.findall(pattern,string)
print(result)

['5$', '20$']
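
If we want the numbers without the dollar sign, one option is a capturing group: when the pattern contains a single pair of parentheses, re.findall returns only the text captured by that group. The sketch below also uses \d+ rather than \d*, an assumption on our part, so that a bare "$" with no digits is not reported.

```python
import re

text = "my tshirt cost 5$, my jacket was 20$ and I am 6 years old"

# (\d+) captures the digits; \$ matches a literal dollar sign after them.
amounts = re.findall(r"(\d+)\$", text)
print(amounts)  # ['5', '20']

# The captured strings can then be converted and summed.
prices = [int(a) for a in amounts]
total = sum(prices)
print(total)  # 25
```

The age "6" is skipped because no dollar sign follows it.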

How to remove noisy words?

import re
tweet="Mah god, pizza is not for breakfast #are_U_Seriously#lmfao"
pattern=re.compile(r"(#\w*)")  # a hashtag: "#" followed by word characters
result=re.sub(pattern,'',tweet)
print(result)

Mah god, pizza is not for breakfast

How to modify your data?

import re
tweet="Mah god, pizza is not for breakfast #are_U_Seriously#lmfao@little_cousin"
pattern=re.compile(r"(#\w*)")  # hashtags
result=re.sub(pattern,'hashtag ',tweet)
pattern=re.compile(r"(@\w*)")  # mentions
result=re.sub(pattern,'reference ',result)
print(result)

Mah god, pizza is not for breakfast hashtag hashtag reference

Combine regexes together

# Combine both patterns with the alternation operator |
import re
tweet="Mah god, pizza is not for breakfast #are_U_Seriously#lmfao@little_cousin"
pattern=re.compile(r"((#\w*)|(@\w*))")
result=re.sub(pattern,'',tweet)

print(result)

Mah god, pizza is not for breakfast
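
A combined pattern can also label hashtags and mentions in a single pass. re.sub accepts a function as the replacement; it is called once per match and returns the replacement text. The following is a sketch under our own naming: the labels dictionary and the label function are hypothetical helpers, not part of the original cells.

```python
import re

tweet = "Mah god, pizza is not for breakfast #are_U_Seriously#lmfao@little_cousin"

# Map the leading symbol to the word that should replace the whole token.
labels = {"#": "hashtag ", "@": "reference "}

def label(match):
    # group(1) is the captured leading symbol, "#" or "@".
    return labels[match.group(1)]

# ([#@]) captures the symbol; \w* consumes the rest of the token.
tagged = re.sub(r"([#@])\w*", label, tweet)
print(tagged)

# Removing the tokens instead, then trimming the leftover trailing space.
cleaned = re.sub(r"[#@]\w*", "", tweet).strip()
print(cleaned)
```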
