A regular expression (regex) is a pattern of characters used to find matches in a larger string or text. The pattern describes what a matching string should look like.
For example, if we have the following text, “the gray hound chased the red fox for over a full mile,” we might want to find the color gray in the text. A regex to find the word “gray” in the text would simply be the characters ‘gray’.
#One way to do this
string="the gray hound chased the red fox for over a full mile"
string2match="gray"
for x in string.split():
    if x == string2match:
        print("match")
#using regex
import re
pattern = re.compile("gray") #regex constructor
string="the gray hound chased the red fox for over a full mile"
match = re.search(pattern, string) # returns a match object, or None if nothing matches
if match:
    print("match")
else:
    print("no match")
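re.search returns a match object when the pattern is found and None otherwise, which is why it can be tested with a plain if. The sketch below shows what the match object carries; the names text, m, matched_text, start and end are illustrative and not part of the original example.

```python
import re

text = "the gray hound chased the red fox for over a full mile"
m = re.search("gray", text)  # match object if found, otherwise None

if m is not None:
    matched_text = m.group()  # the substring that matched
    start, end = m.span()     # start index (inclusive), end index (exclusive)
    print(matched_text, start, end)

# A pattern that does not occur in the text gives None.
missing = re.search("grey", text)
print(missing)
```

A match object is always truthy, so `if m:` and `if m is not None:` behave the same way here.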
import re
string="I will go to the meeting at 7pM"
pattern=re.compile('7[Pp][Mm]')
match = re.search(pattern, string)
if match:
    print("match")
else:
    print("no match")
import re
# Exercise: how do we capture any hour between 1 and 9, a.m. or p.m.?
string="I will go to the office at 6am, wanna join me?"
pattern=re.compile() # fill in a pattern; re.compile() raises TypeError without one
match = re.search(pattern, string)
if match:
    print("match")
else:
    print("no match")
import re
# Use a range: [1-9] matches one digit from 1 to 9, just as [A-Z] matches one uppercase letter
string="I will go to the office at 6am, wanna join me?"
pattern=re.compile('[1-9][AaPp][Mm]')
match = re.search(pattern, string)
if match:
    print("match")
else:
    print("no match")
Some characters have no ASCII equivalent, but you can search for them by their Unicode code point. For example, to find the Euro symbol €, we can use the regex pattern \u20AC. We write such a pattern as a raw string, r"\u20AC", so that the escape sequence reaches the regex engine unchanged instead of being converted by Python's own string-literal processing, as it would be in the ordinary string '\u20AC'.
import re
string="I have some €€€ to spend on vodka"
pattern=re.compile(r"\u20AC")
match = re.search(pattern, string)
if match:
    print("match")
else:
    print("no match")
import re
string="I have some $$$ to spend on vodka"
pattern=re.compile(r"\u0024")
match = re.search(pattern, string)
if match:
    print("match")
else:
    print("no match")
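To see why the raw-string prefix matters, the sketch below compares a raw and an ordinary string for both symbols. For the Euro sign the two spellings happen to match the same text, but for the dollar sign they do not, because a bare $ is itself a regex metacharacter (the end-of-string anchor). The variable names are illustrative only.

```python
import re

euros = "I have some €€€ to spend"
dollars = "I have some $$$ to spend"

# Raw string: the regex engine receives the six characters \u20AC
# and interprets the escape itself.
raw_euro = re.compile(r"\u20AC")
# Ordinary string: Python converts the escape to the single character € first.
plain_euro = re.compile("\u20AC")
print(len(raw_euro.pattern), len(plain_euro.pattern))

euro_raw_hits = raw_euro.findall(euros)
euro_plain_hits = plain_euro.findall(euros)

# With the dollar sign the difference is visible in the results:
# r"\u0024" is a literal dollar sign, while "\u0024" becomes the
# pattern "$", which only matches the empty position at the end.
dollar_raw_hits = re.findall(r"\u0024", dollars)
dollar_plain_hits = re.findall("\u0024", dollars)
print(dollar_raw_hits, dollar_plain_hits)
```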
Negation ^
When you build a character set within square brackets, it is sometimes easier to specify the values you want to exclude. Placing the ^ metacharacter first inside the square brackets does this: the set then matches any character that is not listed.
import re
digit="32039"
pattern=re.compile(r"\d") # shorthand for a digit, like [0-9]
match = re.search(pattern, digit)
if match:
    print("match")
else:
    print("no match")
import re
string="no digits around here"
pattern=re.compile("[^0-9]") # alternative r"\D"
match = re.search(pattern, string)
if match:
    print("match")
else:
    print("no match")
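As a small sketch of the negated set in action, the example below collects every non-digit character with both [^0-9] and the \D shorthand and checks that they agree on an ASCII sample. The sample text and variable names are made up for illustration.

```python
import re

text = "Room 42, floor 3"

digits = re.findall(r"[0-9]", text)            # characters inside the set
non_digits_set = re.findall(r"[^0-9]", text)   # negated set: everything else
non_digits_short = re.findall(r"\D", text)     # shorthand for a non-digit

print(digits)
print("".join(non_digits_set))
```

Note that ^ only means negation when it is the first character inside the brackets; outside brackets it anchors the match to the start of the string.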
Alphanumeric (word characters)
The \w metacharacter matches any word character (any uppercase or lowercase letter, any digit, and the underscore).
For ASCII text it is the same as [a-zA-Z0-9_]; on Python 3 strings, \w also matches letters and digits from other scripts unless the re.ASCII flag is set.
The \W metacharacter negates that and matches any non-word character (i.e. [^a-zA-Z0-9_] for ASCII text).
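The following sketch shows \w and \W on a short sample, and how the re.ASCII flag narrows \w to [a-zA-Z0-9_]. The sample text and variable names are illustrative; the accented letter is written as an escape so the example does not depend on file encoding.

```python
import re

text = "user_42 says: h\u00e9llo!"  # \u00e9 is the letter é

# Default behaviour on str patterns: \w is Unicode-aware, so é counts as a word character.
unicode_words = re.findall(r"\w+", text)
# With re.ASCII, \w is limited to [a-zA-Z0-9_], so é splits the last word.
ascii_words = re.findall(r"\w+", text, re.ASCII)
# \W picks out the characters that are not word characters.
non_word = re.findall(r"\W", text)

print(unicode_words)
print(ascii_words)
print(non_word)
```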
import re
pattern = re.compile("gr.y") # . matches any single character except a newline
string1="the grey hound chased the red fox for over a full mile" #British English
string2="the gray hound chased the red fox for over a full mile" #American English
match = re.search(pattern, string1) # returns a match object, or None if nothing matches
if match:
    print("match")
else:
    print("no match")
import re
pattern = re.compile("[gr].[y]") # one character from the set g/r, then any character, then y
string="the greedy hound chased the red fox for over a full mile"
match = re.search(pattern, string) # returns a match object, or None if nothing matches
if match:
    print("match")
else:
    print("no match")
Curly brackets {}
The curly brackets metacharacters allow you to specify the number of times a particular character set occurs. For a simple example, let’s take our euro regex from the previous chapter and make it a more general purpose currency search.
First, we want a currency symbol, which in this case is the euro or dollar sign: [\u20AC\u0024]. It is followed by at least one digit but no more than five digits, then a period and two digits: \d{1,5}\.\d\d (the period is escaped as \. so that it matches a literal period rather than any character).
The {1,5} following \d indicates that we need at least one, but no more than five, digits after the currency symbol and before the period. Our options with the curly brackets are:
{n} - Preceding character must occur exactly “n” times.
{n,} - At least “n” times, but no upper limit.
{n,m} - Between “n” and “m” times.
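The three forms can be checked with a short sketch. It uses re.fullmatch so that the whole sample must fit the pattern, which makes the upper and lower limits visible; the sample strings and variable names are made up for illustration.

```python
import re

# {n,m}: between one and five digits before the period.
price = re.compile(r"[\u20AC\u0024]\d{1,5}\.\d\d")
samples = ["€5.00", "$12345.99", "$123456.99", "€.50"]
results = {s: bool(price.fullmatch(s)) for s in samples}
print(results)

# {n}: exactly three digits.
exactly_three = bool(re.fullmatch(r"\d{3}", "123"))
four_is_too_many = bool(re.fullmatch(r"\d{3}", "1234"))

# {n,}: at least two digits, no upper limit.
at_least_two = bool(re.fullmatch(r"\d{2,}", "1234567"))
one_is_too_few = bool(re.fullmatch(r"\d{2,}", "1"))
```

With re.search instead of re.fullmatch, "$123456.99" would not be rejected so cleanly, since search only needs some part of the string to fit the pattern.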
import re
string="I have some €12.80 to spend on vodka"
pattern=re.compile(r"[\u20AC\u0024]\d{1,5}\.\d\d")
match = re.search(pattern, string)
if match:
    print("match")
else:
    print("no match")
Quantifier Symbols
( * ) Match zero or more times {0,}
( + ) Match one or more times {1,}
( ? ) Match zero or one time {0,1}
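A quick way to convince yourself of these equivalences is to run each symbol and its curly-bracket spelling over the same samples. This is only a sketch; the patterns, samples, and the table name are illustrative.

```python
import re

# Each pair holds a quantifier symbol and its curly-bracket equivalent.
pairs = [
    (r"ab*c", r"ab{0,}c"),   # zero or more b's
    (r"ab+c", r"ab{1,}c"),   # one or more b's
    (r"ab?c", r"ab{0,1}c"),  # zero or one b
]
samples = ["ac", "abc", "abbc"]

table = {}
for symbol, braces in pairs:
    symbol_hits = [bool(re.fullmatch(symbol, s)) for s in samples]
    brace_hits = [bool(re.fullmatch(braces, s)) for s in samples]
    assert symbol_hits == brace_hits  # both spellings agree on every sample
    table[symbol] = symbol_hits

for symbol, hits in table.items():
    print(symbol, hits)
```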
import re
string="<h1>This is a <strong>header</strong></h1>"
string2="<h1>This is a <i>header</i></h1>"
pattern=re.compile("<h1>.*<strong>.*</strong>.*</h1>")
match = re.search(pattern, string)
if match:
    print("match")
else:
    print("no match")
# EXTRACTING THE RESULTS
import re
string="my tshirt cost 5$, my jacket was 20$ and I am 6 years old "
pattern=re.compile(r"\d*[\u0024]") # raw string, so \d is not treated as a Python string escape
result=re.findall(pattern,string)
print(result)
import re
tweet="Mah god, pizza is not for breakfast #are_U_Seriously#lmfao"
pattern=re.compile(r"(#\w*)") # a hashtag: # followed by any run of word characters
result=re.sub(pattern,'',tweet)
print(result)
import re
tweet="Mah god, pizza is not for breakfast #are_U_Seriously#lmfao@little_cousin"
pattern=re.compile(r"(#\w*)") # hashtags
result=re.sub(pattern,'hashtag ',tweet)
pattern=re.compile(r"(@\w*)") # mentions
result=re.sub(pattern,'reference ',result)
print(result)
#combine regex together
import re
tweet="Mah god, pizza is not for breakfast #are_U_Seriously#lmfao@little_cousin"
pattern=re.compile(r"((#\w*)|(@\w*))") # | means "either the left or the right pattern"
result=re.sub(pattern,'',tweet)
print(result)
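Because the two alternatives differ only in their first character, the same substitution can also be written with a character set. This is one possible alternative rather than the only way to combine the patterns; the variable names are illustrative.

```python
import re

tweet = "Mah god, pizza is not for breakfast #are_U_Seriously#lmfao@little_cousin"

# Alternation: either a hashtag or a mention.
with_alternation = re.sub(r"(#\w*)|(@\w*)", "", tweet)
# Character set: # or @, followed by any run of word characters.
with_char_set = re.sub(r"[#@]\w*", "", tweet)

print(repr(with_char_set))  # repr() makes the trailing space visible
```

Both versions leave the space that preceded the first hashtag, so a call to .rstrip() may be worth adding if trailing whitespace matters.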