diff --git a/18_Day/18_regular_expression.md b/18_Day/18_regular_expression.md
index 63b7824..b5371c9 100644
--- a/18_Day/18_regular_expression.md
+++ b/18_Day/18_regular_expression.md
@@ -19,52 +19,62 @@
 ![30DaysOfPython](../images/30DaysOfPython_banner3@2x.png)
 - [📘 Day 18](#%f0%9f%93%98-day-18)
-  - [Regular Expression](#regular-expression)
-    - [Import re module](#import-re-module)
-    - [re functions](#re-functions)
+  - [Regular Expressions](#regular-expression)
+    - [The *re* Module](#The-re-module)
+    - [Functions in *re* Module](#functions-in-re-module)
       - [Match](#match)
       - [Search](#search)
-      - [Searching all matches using findall](#searching-all-matches-using-findall)
-      - [Replacing a substring](#replacing-a-substring)
-  - [Spliting text using RegEx split](#spliting-text-using-regex-split)
-  - [Writing RegEx pattern](#writing-regex-pattern)
-    - [Square Bracket](#square-bracket)
+      - [Searching for All Matches Using *findall*](#searching-for-all-matches-using-findall)
+      - [Replacing a Substring](#replacing-a-Substring)
+  - [Splitting Text Using RegEx Split](#splitting-text-using-regex-split)
+  - [Writing RegEx Patterns](#writing-regex-patterns)
+    - [Square Brackets](#square-brackets)
     - [Escape character(\\) in RegEx](#escape-character-in-regex)
     - [One or more times(+)](#one-or-more-times)
     - [Period(.)](#period)
-    - [Zero or more times(*)](#zero-or-more-times)
+    - [Zero or more times(\*)](#zero-or-more-times)
     - [Zero or one times(?)](#zero-or-one-times)
-    - [Quantifier in RegEx](#quantifier-in-regex)
+    - [Quantifiers in RegEx](#quantifiers-in-regex)
     - [Cart ^](#cart)
   - [💻 Exercises: Day 18](#%f0%9f%92%bb-exercises-day-18)
 # 📘 Day 18
-## Regular Expression
-A regular expression or RegEx is a small programming language that helps to find pattern in data. A RegEx can be used to check if some pattern exists in a different data type. To use RegEx in python first we should import the RegEx module which is *re*.
-### Import re module
+## Regular Expressions
+
+A regular expression or RegEx is a special text string that helps to find patterns in data. A RegEx can be used to check if some pattern exists in a different data type. To use RegEx in python first we should import the RegEx module which is called *re*.
+
+### The *re* Module
+
 After importing the module we can use it to detect or find patterns.
+
 ```py
 import re
 ```
-### re functions
-To find a pattern we use different set of *re* functions that allows to search a string for match.
-* *re.match()*:searches only in the beginning of the first line of the string and return match object if found, else return none.
-* *re.search*:Returns a Match object if there is a match anywhere in the string including or in multiline string.
-* *re.findall*:Returns a list containing all matches
-* *re.split*: Returns a list where the string has been split at each match
-* *re.sub*: Replaces one or many matches with a string
+
+### Functions in *re* Module
+
+To find a pattern we use different set of *re* character sets that allows to search for a match in a string.
+* *re.match()*: searches only in the beginning of the first line of the string and returns matched objects if found, else returns none.
+* *re.search*: Returns a match object if there is one anywhere in the string, including multiline strings.
+* *re.findall*: Returns a list containing all matches
+* *re.split*: Takes a string, splits it at the match points, returns a list
+* *re.sub*: Replaces one or many matches within a string
 #### Match
+
 ```py
 # syntac
 re.match(substring, string, re.I)
 # substring is a string or a pattern, string is the text we look for a pattern , re.I is case ignore
 ```
+
 ```py
-txt = 'I love to teach python or javaScript'
-# It return an object with span, and match
+import re
+
+txt = 'I love to teach python and javaScript'
+# It returns an object with span, and match
 match = re.match('I love to teach', txt, re.I)
 print(match) #
 # We can get the starting and ending position of the match as tuple using span
@@ -76,19 +86,23 @@ print(start, end) # 0, 15
 substring = txt[start:end]
 print(substring) # I love to teach
 ```
-As you can see from the above example, the pattern we are looking for or the substring *I love to teach* is the beginning of the text. The match function only returns an object if the text starts with the pattern.
+
+As you can see from the example above, the pattern we are looking for (or the substring we are looking for) is *I love to teach*. The match function returns an object **only** if the text starts with the pattern.
 #### Search
+
 ```py
 # syntax
 re.match(substring, string, re.I)
 # substring is a pattern, string is the text we look for a pattern , re.I is case ignore flag
 ```
 ```py
-txt = '''Python is the most beautiful language that a human begin has ever created.
+import re
+
+txt = '''Python is the most beautiful language that a human being has ever created.
 I recommend python for a first programming language'''
-# It return an object with span, and match
+# It returns an object with span and match
 match = re.search('first', txt, re.I)
 print(match) #
 # We can get the starting and ending position of the match as tuple using span
@@ -100,13 +114,15 @@ print(start, end) # 100 105
 substring = txt[start:end]
 print(substring) # first
 ```
-As you can see search is much better than match because it can look for the pattern through out the text. Search return returns a match object right way a first match found. A much better *re* function is *findall*. This function check the pattern through the string and returns all the matches as a list.
-#### Searching all matches using findall
+As you can see, search is much better than match because it can look for the pattern throughout the text. Search returns a match object with a first match that was found, otherwise it returns _None_. A much better *re* function is *findall*. This function checks for the pattern through the whole string and returns all the matches as a list.
+
+#### Searching for All Matches Using *findall*
+
 *findall()* returns all the matches as a list
 ```py
-txt = '''Python is the most beautiful language that a human begin has ever created.
+txt = '''Python is the most beautiful language that a human being has ever created.
 I recommend python for a first programming language'''
 # It return a list
@@ -114,11 +130,12 @@ matches = re.findall('language', txt, re.I)
 print(matches) # ['language', 'language']
 ```
-As you can see, the word language found two times in the string. Let's practice more
-Let's look for the word both Python and python in the string
+As you can see, the word language was found two times in the string. Let's practice some more.
+Now we will look for both Python and python words in the string:
+
 ```py
-txt = '''Python is the most beautiful language that a human begin has ever created.
+txt = '''Python is the most beautiful language that a human being has ever created.
 I recommend python for a first programming language'''
 # It returns list
@@ -126,9 +143,11 @@ matches = re.findall('python', txt, re.I)
 print(matches) # ['Python', 'python']
 ```
-Since we are using *re.I* both lowercase and uppercase are included but if we don't have the flag, we write our pattern differently. Let's see that
+
+Since we are using *re.I* both lowercase and uppercase letters are included. If we don't have that flag, then we will have to write our pattern differently. Let's check it out:
+
 ```py
-txt = '''Python is the most beautiful language that a human begin has ever created.
+txt = '''Python is the most beautiful language that a human being has ever created.
 I recommend python for a first programming language'''
 matches = re.findall('Python|python', txt)
@@ -139,49 +158,60 @@ matches = re.findall('[Pp]ython', txt)
 print(matches) # ['Python', 'python']
 ```
-#### Replacing a substring
+
+#### Replacing a Substring
+
 ```py
-txt = '''Python is the most beautiful language that a human begin has ever created.
+txt = '''Python is the most beautiful language that a human being has ever created.
 I recommend python for a first programming language'''
 match_replaced = re.sub('Python|python', 'JavaScript', txt, re.I)
-print(match_replaced) # JavaScript is the most beautiful language that a human begin has ever created.
+print(match_replaced) # JavaScript is the most beautiful language that a human being has ever created.
 # OR
 match_replaced = re.sub('[Pp]ython', 'JavaScript', txt, re.I)
-print(match_replaced) # JavaScript is the most beautiful language that a human begin has ever created.
+print(match_replaced) # JavaScript is the most beautiful language that a human being has ever created.
 ```
-Let's add one more example, the following string is really hard to read unless we remove the % symbol. Replacing the % with a empty string will clean the text.
+
+Let's add one more example. The following string is really hard to read unless we remove the % symbol. Replacing the % with an empty string will clean the text.
+
 ```py
 txt = '''%I a%m te%%a%%che%r% a%n%d %% I l%o%ve te%ach%ing.
-T%he%re i%s n%o%th%ing as m%ore r%ewarding a%s e%duc%at%i%ng a%n%d e%m%p%ow%er%ing p%e%o%ple.
+T%he%re i%s n%o%th%ing as r%ewarding a%s e%duc%at%i%ng a%n%d e%m%p%ow%er%ing p%e%o%ple.
 I fo%und te%a%ching m%ore i%n%t%er%%es%ting t%h%an any other %jobs.
-D%o%es thi%s m%ot%iv%a%te %y%o%u to b%e a t%e%a%cher.'''
+D%o%es thi%s m%ot%iv%a%te %y%o%u to b%e a t%e%a%cher?'''
 matches = re.sub('%', '', txt)
-print(matches) # ['Python', 'python']
+print(matches)
 ```
 ```sh
 I am teacher and I love teaching.
-There is nothing as more rewarding as educating and empowering people.
+There is nothing as rewarding as educating and empowering people.
 I found teaching more interesting than any other jobs.
-Does this motivate you to be a teacher.
+Does this motivate you to be a teacher?
 ```
-## Spliting text using RegEx split
+
+## Splitting Text Using RegEx Split
+
 ```py
 txt = '''I am teacher and I love teaching.
-There is nothing as more rewarding as educating and empowering people.
+There is nothing as rewarding as educating and empowering people.
 I found teaching more interesting than any other jobs.
-Does this motivate you to be a teacher.'''
-print(re.split('\n', txt))
+Does this motivate you to be a teacher?'''
+print(re.split('\n', txt)) # splitting using \n - end of line symbol
 ```
 ```sh
-['I am teacher and I love teaching.', 'There is nothing as more rewarding as educating and empowering people.', 'I found teaching more interesting than any other jobs.', 'Does this motivate you to be a teacher.']
+['I am teacher and I love teaching.', 'There is nothing as rewarding as educating and empowering people.', 'I found teaching more interesting than any other jobs.', 'Does this motivate you to be a teacher?']
 ```
-## Writing RegEx pattern
+
+## Writing RegEx Patterns
+
 To declare a string variable we use a single or double quote. To declare RegEx variable *r''*. The following pattern only identifies apple with lowercase, to make it case insensitive either we should rewrite our pattern or we should add a flag.
+
 ```py
+import re
+
 regex_pattern = r'apple'
 txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away. '
 matches = re.findall(regex_pattern, txt)
@@ -190,7 +220,7 @@ print(matches) # ['apple']
 # To make case insensitive adding flag '
 matches = re.findall(regex_pattern, txt, re.I)
 print(matches) # ['Apple', 'apple']
-# or we can use set of characters method
+# or we can use a set of characters method
 regex_pattern = r'[Aa]pple' # this mean the first letter could be Apple or apple
 matches = re.findall(regex_pattern, txt)
 print(matches) # ['Apple', 'apple']
@@ -198,71 +228,75 @@ print(matches) # ['Apple', 'apple']
 ```
 * []: A set of characters
   * [a-c] means, a or b or c
-  * [a-z] means, any letter a to z
-  * [A-Z] means, any character A to Z
+  * [a-z] means, any letter from a to z
+  * [A-Z] means, any character from A to Z
   * [0-3] means, 0 or 1 or 2 or 3
-  * [0-9] means any number 0 to 9
-  * [A-Za-z0-9] any character which is a to z, A to Z, 0 to 9
+  * [0-9] means any number from 0 to 9
+  * [A-Za-z0-9] any single character, that is a to z, A to Z or 0 to 9
 * \\: uses to escape special characters
-  * \d mean:match where the string contains digits (numbers from 0-9)
-  * \D mean: match where the string does not contain digits
+  * \d means: match where the string contains digits (numbers from 0-9)
+  * \D means: match where the string does not contain digits
 * . : any character except new line character(\n)
 * ^: starts with
-  * r'^substring' eg r'^love', a sentence which starts with a word love
-  * r'[^abc] mean not a, not b, not c.
+  * r'^substring' eg r'^love', a sentence that starts with a word love
+  * r'[^abc] means not a, not b, not c.
 * $: ends with
-  * r'substring$' eg r'love$', sentence ends with a word love
+  * r'substring$' eg r'love$', sentence that ends with a word love
 * *: zero or more times
-  * r'[a]*' means a optional or it can be occur many times.
+  * r'[a]*' means a optional or it can occur many times.
 * +: one or more times
-  * r'[a]+' mean at least once or more times
-* ?: zero or one times
-  * r'[a]?' mean zero times or once
+  * r'[a]+' means at least once (or more)
+* ?: zero or one time
+  * r'[a]?' means zero times or once
 * {3}: Exactly 3 characters
-* {3,}: At least 3 character
+* {3,}: At least 3 characters
 * {3,8}: 3 to 8 characters
 * |: Either or
-  * r'apple|banana' mean either of an apple or a banana
+  * r'apple|banana' means either apple or a banana
 * (): Capture and group
 ![Regular Expression cheat sheet](../images/regex.png)
-Let's use example to clarify the above meta characters
+Let's use examples to clarify the meta characters above
+
 ### Square Bracket
+
 Let's use square bracket to include lower and upper case
+
 ```py
 regex_pattern = r'[Aa]pple' # this square bracket mean either A or a
 txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away. '
 matches = re.findall(regex_pattern, txt)
 print(matches) # ['Apple', 'apple']
 ```
+
 If we want to look for the banana, we write the pattern as follows:
+
 ```py
-regex_pattern = r'[Aa]pple|[Bb]anana' # this square bracket mean either A or a
+regex_pattern = r'[Aa]pple|[Bb]anana' # this square bracket means either A or a
 txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away. '
 matches = re.findall(regex_pattern, txt)
 print(matches) # ['Apple', 'banana', 'apple', 'banana']
 ```
+
 Using the square bracket and or operator , we manage to extract Apple, apple, Banana and banana.
 ### Escape character(\\) in RegEx
+
 ```py
 regex_pattern = r'\d' # d is a special character which means digits
-txt = 'This regular expression example was made in December 6, 2019.'
+txt = 'This regular expression example was made on December 6, 2019.'
 matches = re.findall(regex_pattern, txt)
 print(matches) # ['6', '2', '0', '1', '9'], this is not what we want
-
-regex_pattern = r'\d+' # d is a special character which means digits, + mean one or more
-txt = 'This regular expression example was made in December 6, 2019.'
-matches = re.findall(regex_pattern, txt)
-print(matches) # ['6', '2019']
 ```
+
 ### One or more times(+)
+
 ```py
 regex_pattern = r'\d+' # d is a special character which means digits, + mean one or more times
-txt = 'This regular expression example was made in December 6, 2019.'
+txt = 'This regular expression example was made on December 6, 2019.'
 matches = re.findall(regex_pattern, txt)
-print(matches) # ['6', '2019']
+print(matches) # ['6', '2019'] - now, this is better!
 ```
 ### Period(.)
@@ -277,60 +311,74 @@ matches = re.findall(regex_pattern, txt)
 print(matches) # ['and banana are fruits']
 ```
-### Zero or more times(*)
+
+### Zero or more times(\*)
+
 Zero or many times. The pattern could may not occur or it can occur many times.
+
 ```py
-regex_pattern = r'[a].*' # . any character, + any character one or more times
+regex_pattern = r'[a].*' # . any character, * any character zero or more times
 txt = '''Apple and banana are fruits'''
 matches = re.findall(regex_pattern, txt)
 print(matches) # ['and banana are fruits']
 ```
-### Zero or one times(?)
-Zero or one times. The pattern could may not occur or it may occur once.
+
+### Zero or one time(?)
+
+Zero or one time. The pattern may not occur or it may occur once.
+
 ```py
 txt = '''I am not sure if there is a convention how to write the word e-mail.
 Some people write it email others may write it as Email or E-mail.'''
-regex_pattern = r'[Ee]-?mail' # ? means optional
+regex_pattern = r'[Ee]-?mail' # ? means here that '-' is optional
 matches = re.findall(regex_pattern, txt)
 print(matches) # ['e-mail', 'email', 'Email', 'E-mail']
 ```
+
 ### Quantifier in RegEx
-We can specify the length of the substring we look for in a text, using a curly bracket. Lets imagine, we are interested in substring that their length are 4 characters
+
+We can specify the length of the substring we are looking for in a text, using a curly bracket. Lets imagine, we are interested in a substring with a length of 4 characters:
+
 ```py
-txt = 'This regular expression example was made in December 6, 2019.'
+txt = 'This regular expression example was made on December 6, 2019.'
 regex_pattern = r'\d{4}' # exactly four times
 matches = re.findall(regex_pattern, txt)
 print(matches) # ['2019']
-txt = 'This regular expression example was made in December 6, 2019.'
+txt = 'This regular expression example was made on December 6, 2019.'
 regex_pattern = r'\d{1, 4}' # 1 to 4
 matches = re.findall(regex_pattern, txt)
 print(matches) # ['6', '2019']
 ```
+
 ### Cart ^
+
 * Starts with
 ```py
-txt = 'This regular expression example was made in December 6, 2019.'
+txt = 'This regular expression example was made on December 6, 2019.'
 regex_pattern = r'^This' # ^ means starts with
+matches = re.findall(regex_pattern, txt)
 print(matches) # ['This']
 ```
 * Negation
+
 ```py
-txt = 'This regular expression example was made in December 6, 2019.'
+txt = 'This regular expression example was made on December 6, 2019.'
 regex_pattern = r'[^A-Za-z ]+' # ^ in set character means negation, not A to Z, not a to z, no space
 matches = re.findall(regex_pattern, txt)
-print(matches) # ['e-mail', 'email', 'Email', 'E-mail']
+print(matches) # ['6,', '2019.']
 ```
 ## 💻 Exercises: Day 18
- 1. What is the most frequent word in the following paragraph ?
+
+ 1. What is the most frequent word in the following paragraph?
 ```py
 paragraph = 'I love teaching. If you do not love teaching what else can you love. I love Python if you do not love something which can give you all the capabilities to develop an application what else can you love.
 ```
@@ -358,20 +406,26 @@ print(matches) # ['e-mail', 'email', 'Email', 'E-mail']
 (1, 'Python'), (1, 'If')]
 ```
-2. The position of some particles on the horizontal x-axis -12, -4, -3 and -1 in the negative direction, 0 at origin, 4 and 8 in the positive direction. Extract these numbers and find the distance between the two furthest particles.
+
+2. The position of some particles on the horizontal x-axis -12, -4, -3 and -1 in the negative direction, 0 at origin, 4 and 8 in the positive direction. Extract these numbers from this whole text and find the distance between the two furthest particles.
+
 ```py
 points = ['-1', '2', '-4', '-3', '-1', '0', '4', '8']
 sorted_points = [-4, -3, -1, -1, 0, 2, 4, 8]
 distance = 12
 ```
-3. Write a pattern which identify if a string is a valid python variable
+
+3. Write a pattern which identifies if a string is a valid python variable
+
 ```sh
 is_valid_variable('first_name') # True
 is_valid_variable('first-name') # False
 is_valid_variable('1first_name') # False
 is_valid_variable('firstname') # True
 ```
+
 4. Clean the following text. After cleaning, count three most frequent words in the string.
+
 ```py
 sentence = '''%I $am@% a %tea@cher%, &and& I lo%#ve %tea@ching%;. There $is nothing; &as& mo@re rewarding as educa@ting &and& @emp%o@wering peo@ple. ;I found tea@ching m%o@re interesting tha@n any other %jo@bs. %Do@es thi%s mo@tivate yo@u to be a tea@cher!?'''