7 |
Perl Pattern Matching |
With great resolve Pickard looked at the ambassador and said, "A lie of omission is still a lie." He stormed out of the room and beamed up to the Enterprise. When he arrive on the ship, the crew was awaiting him. One look and they all realized that not everything went well on the planet below
|
Now we want to search through this file for the word 'Enterprise'. We can accomplish this readily using the following snippet of code:
open ST, "/home/ruben/startrek" or die "Cannot open startrek: $!";
while (<ST>) {
    $line = $_;
    print "Found it!\n" if ($line =~ m!Enterprise!);
}

We can now see the two elements of pattern matching in their natural environment. First we see the binding operator =~, which binds a string to a pattern. Second we see the matching quote m!Enterprise!.
When we use the operator =~, the string on the left of the operator is searched for the pattern described in the matching quote on the right. The return value of the binding operator is a boolean value. Perl has two binding operators.
=~ | Does it match? or is the pattern present? |
!~ | Does it not match? or Is the pattern Absent |
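The two binding operators can be sketched side by side. This is only an illustration: the sample line and the word Klingon are made up for the example and do not come from the startrek file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line, standing in for one line read from the file.
my $line = "He stormed out of the room and beamed up to the Enterprise.";

# =~ is true when the pattern is found somewhere in the string.
my $has_ship = ($line =~ m/Enterprise/) ? 1 : 0;

# !~ is true when the pattern is NOT found anywhere in the string.
my $no_klingon = ($line !~ m/Klingon/) ? 1 : 0;

print "Mentions the Enterprise\n" if $has_ship;
print "No Klingons here\n"        if $no_klingon;
```

Both operators return a boolean, so they can be used directly in if, unless, and statement modifiers.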
The quote mechanism can look like any of the following. These all have the same meaning except for the quote character (which is a metacharacter by definition).
m!Enterprise! | ! is the quote character. |
m|Enterprise| | | is the quote character. |
/Enterprise/ | Standard matching quote character is /. The m is not needed |
The pattern string is interpolated in the quotes as with any double-quoted string. In this program, for example, we see that in both cases we get a successful match.
#!/usr/bin/perl
$pat = "Bicycle";
$string = "Mary likes a bicycle";

# The pattern comes from an interpolated scalar
if ($string =~ m!$pat!i) {
    print "Found Match\n";
} else {
    print "Uh Oh!\n";
}

# The same pattern written literally
if ($string =~ m!Bicycle!i) {
    print "Found Match\n";
} else {
    print "Uh Oh!\n";
}
Note that the i after the pattern tells Perl to match regardless of the case of the letters. This is a switch to turn on case-insensitive matching. In addition to the regular interpolation, Perl provides a series of powerful metacharacters to extend its pattern matching capabilities. These metacharacters together form a whole division of programming called regular expressions. Regular expressions are not unique to Perl. They were originally designed for Unix and were used in the ed editor, in the shell, and extended later to sed, grep, vi and awk, among other places.
Completely mastering regular expressions can take years of practice. Even so, the novice programmer can quickly master a subset of regex to produce powerful and useful code. Let's look first at the most common metacharacters in Perl regex and see how they are used with simple m// syntax.
. | Match any character | m/b./ matches ba bb bc ... Note that any character means it doesn't care about what character it matches and retains no memory of what it matched. It doesn't care what character is in front of it. It does not match a linefeed |
* | Match the previous character zero or more times | m/br*/ will match ba bb br and even b. Note this is not the same behaviour seen on the command line. This is used mostly with a dot. ".*" behaves similarly to * in the Bourne shell. |
? | Matches the previous character zero or one times. | m/br?/ matches br, brrr, bragley or b |
+ | Match 1 or more of the previous character | m/br+/ matches brrrrr brrring but not b |
\ | Escapes from the default meaning as usual | m/\./ matches b. bbb. . etc |
^ | Matches the beginning of the line | m/^Br/ matches Br but not rBr |
$ | Matches the end of the line | Similar to VI where 1,$s/ra/ar/ goes from line one to the end. m/ben$/ matches Ruben not Rubin |
\d | matches digits | m/\d+/ matches 123 not abc |
\s | Matches whitespace | m/b\s+b/ Match "b b" but not "bb" |
\w | Matches word characters | Matches alphanumerics and underscores m/\w+/ matches May_var |
\b | Matches the word boundary | m/\bring/ matches ring, not bring |
Here is a little program that you can use to help you understand how the metacharacters work. Experimentation and use are the quickest route to mastering these metacharacters.
#!/usr/bin/perl
$zero = "bbbb";
$one = "b";
$two = "brrrr";
$three = "baaaa";
$four = "bagly";
$five = "agly";
@array = ($zero, $one, $two, $three, $four, $five);

# Change the pattern below to experiment with other metacharacters
for $tmp (@array) {
    if ($tmp =~ m/(bb+)/) {
        print "$tmp matches\n";
        print "$1 is the actual matched pattern\n\n";
    }
}
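Along the same lines, the rows of the metacharacter table can be checked mechanically. The following is a minimal sketch, not part of the original course files: the @checks list and the $failures counter are names invented for this example, and each row simply restates a claim from the table.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each row: pattern (single-quoted so backslashes survive), test string,
# and whether the table above says it should match (1) or not (0).
my @checks = (
    [ 'br*',     'b',       1 ],   # zero r's is still a match
    [ 'br+',     'b',       0 ],   # + needs at least one r
    [ 'br+',     'brrring', 1 ],
    [ '^Br',     'rBr',     0 ],   # Br is not at the start
    [ 'ben$',    'Ruben',   1 ],
    [ 'ben$',    'Rubin',   0 ],
    [ '\d+',     'abc',     0 ],
    [ 'b\s+b',   'b b',     1 ],
    [ '\bring',  'bring',   0 ],   # no word boundary between b and r
    [ '\bring',  'ring',    1 ],
);

my $failures = 0;
for my $check (@checks) {
    my ($pat, $str, $expect) = @$check;
    my $got = ($str =~ m/$pat/) ? 1 : 0;   # the pattern is interpolated
    $failures++ if $got != $expect;
    printf "%-10s %-10s %s\n", "m/$pat/", "'$str'", $got ? "matches" : "no match";
}
print "$failures row(s) disagreed with the table\n";
```

Adding rows of your own to the list is a quick way to test a guess about how a metacharacter behaves.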
Perl's regular expression engine also takes switches after the ending quote character. We saw that m/enterprise/i will match without regard to case: 'i' is the case-insensitive matching switch. You can remember the switches with the acronym misx:
m | Multiline searching | By default ^ and $ match only at the start and end of the entire string (or just before a final newline). This switch changes that behaviour so that they mark the start and end of each line, that is, just after and just before every \n. If we use the string $starwars = "Once upon a time\n in a galaxy far far away"; and run the match $starwars =~ m/time$/m we match 'time'. Without the switch, m/time$/ does not match. |
i | Match case insensitive | Enterprise, enterPrise and enterprise are all viewed as the same pattern. |
s | Single-line matching | Changes the default behaviour of the . metacharacter. Normally the dot does not match a \n, which facilitates matching line-oriented input. With this switch the dot does match \n. If we use the string $starwars = "Once upon a time\n in a galaxy far far away"; and run the match $starwars =~ m/time.+far/s we match 'time\n in a galaxy far far'. Without the switch, m/time.+far/ does not match. |
x | Permits white space in the regular expression string to ease legibility of the code | Permits things like m!\s.+away \.!x |
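The effect of each switch can be seen by running the same match with and without it. This is a sketch that reuses the $starwars string from the table; the result variable names are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $starwars = "Once upon a time\n in a galaxy far far away";

# m switch: $ also matches just before the embedded \n
my $dollar_plain = ($starwars =~ m/time$/)  ? 1 : 0;
my $dollar_multi = ($starwars =~ m/time$/m) ? 1 : 0;

# s switch: the dot is allowed to match the \n
my $dot_plain  = ($starwars =~ m/time.+far/)  ? 1 : 0;
my $dot_single = ($starwars =~ m/time.+far/s) ? 1 : 0;

# i switch: letter case is ignored
my $nocase = ($starwars =~ m/GALAXY/i) ? 1 : 0;

# x switch: white space inside the pattern is ignored
my $spaced = ($starwars =~ m! galaxy \s+ far !x) ? 1 : 0;

print "m/time\$/     : $dollar_plain\n";
print "m/time\$/m    : $dollar_multi\n";
print "m/time.+far/ : $dot_plain\n";
print "m/time.+far/s: $dot_single\n";
print "m/GALAXY/i   : $nocase\n";
print "m! galaxy \\s+ far !x: $spaced\n";
```

Switches can be combined, so m/time.+far$/ms is legal when both behaviours are wanted.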
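Perl also lets you state explicitly how many times something may repeat, by following it with a quantifier in braces. For example, if we have a matching quote string of m!Galaxy.*far! we can use this quote instead: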
m!Galaxy.{0,}far!
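The quantifier applies to the character (or group) just before it and matches it repeatedly according to the following rules:
{n} | Match exactly n times |
{n,} | Match at least n times |
{n,m} | Match at least n but no more than m times |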
#!/usr/bin/perl
$zero = "bbbb";
$one = "b";
$two = "brrrr";
$three = "moooooooo";
$four = "bagly";
$five = "agly";
$six = "Once upon a time\n in a galaxy far far away";
@array = ($zero, $one, $two, $three, $four, $five, $six);

# {0}: an m followed by exactly zero o's
for $tmp (@array) {
    if ($tmp =~ m/(mo{0})/s) {
        print "$tmp matches\n";
        print "\'$1\' is the actual matched pattern\n\n";
    }
}
ruben@ruben:/home/ruben/perl_course > ./file63.pl
moooooooo matches
'm' is the actual matched pattern
Once upon a time
in a galaxy far far away matches
'm' is the actual matched pattern
#!/usr/bin/perl
$zero = "bbbb";
$one = "b";
$two = "brrrr";
$three = "moooooooo";
$four = "bagly";
$five = "agly";
$six = "Once upon a time\n in a galaxy far far away";
@array = ($zero, $one, $two, $three, $four, $five, $six);

# {0,}: an m followed by zero or more o's
for $tmp (@array) {
    if ($tmp =~ m/(mo{0,})/s) {
        print "$tmp matches\n";
        print "\'$1\' is the actual matched pattern\n\n";
    }
}

#!/usr/bin/perl
$zero = "bbbb";
$one = "b";
$two = "brrrr";
$three = "moooooooo";
$four = "bagly";
$five = "agly";
$six = "Once upon a time\n in a galaxy far far away";
@array = ($zero, $one, $two, $three, $four, $five, $six);

# {0,2}: an m followed by zero to two o's
for $tmp (@array) {
    if ($tmp =~ m/(mo{0,2})/s) {
        print "$tmp matches\n";
        print "\'$1\' is the actual matched pattern\n\n";
    }
}

#!/usr/bin/perl
$zero = "bbbb";
$one = "b";
$two = "brrrr";
$three = "moooooooo";
$four = "bagly";
$five = "agly";
$six = "Once upon a time\n in a galaxy far far away";
@array = ($zero, $one, $two, $three, $four, $five, $six);

# {1,3}: an m followed by one to three o's
for $tmp (@array) {
    if ($tmp =~ m/(mo{1,3})/s) {
        print "$tmp matches\n";
        print "\'$1\' is the actual matched pattern\n\n";
    }
}

#!/usr/bin/perl
$zero = "bbbb";
$one = "b";
$two = "brrrr";
$three = "moooooooo";
$four = "bagly";
$five = "agly";
$six = "Once upon a time\n in a galaxy far far away";
@array = ($zero, $one, $two, $three, $four, $five, $six);

# {1,}: an m followed by one or more o's
for $tmp (@array) {
    if ($tmp =~ m/(mo{1,})/s) {
        print "$tmp matches\n";
        print "\'$1\' is the actual matched pattern\n\n";
    }
}
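To compare the quantifiers used in these scripts at a glance, they can all be run against one string in a single loop. This is a sketch rather than one of the course files; the %captured hash is a name invented here, and the test string is built as an m followed by eight o's.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# An m followed by eight o's (built with the x operator to avoid miscounting).
my $string = 'm' . ('o' x 8);

# Remember what each quantifier captured, keyed by the quantifier text.
my %captured;
for my $quant ('{0}', '{0,}', '{0,2}', '{1,3}', '{1,}') {
    # The quantifier text is interpolated into the pattern before matching.
    if ($string =~ m/(mo$quant)/) {
        $captured{$quant} = $1;
        printf "mo%-6s captured '%s'\n", $quant, $1;
    }
}
```

The quantifiers are greedy: each one takes as many o's as its upper limit allows, which is why {0,} and {1,} capture the same text here.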
If parentheses are used with the explicit quantifier, the quantifier applies to the whole group in the parentheses. Otherwise, it applies only to the last character.
#!/usr/bin/perl
$zero = "bbbb";
$one = "b";
$two = "brrrr";
$three = "moooooooo";
$four = "bagly";
$five = "agly";
$six = "Once upon a time\n in a galaxy far far away";
$seven = "Once upon a time\n in a galaxy far far far far far far far away";
@array = ($zero, $one, $two, $three, $four, $five, $six, $seven);

# The quantifier applies to the whole inner group (far )
for $tmp (@array) {
    if ($tmp =~ m/((far ){1,})/s) {
        print "$tmp matches\n";
        print "\'$1\' and \'$2\' are the actual matched pattern\n";
        @fars = split /\s+/, $1;
        print "There are " . @fars . " fars in our string\n\n";
    }
}

ruben@JBSapphire:~/perl_course>./file68.pl
Once upon a time
 in a galaxy far far away matches
'far far ' and 'far ' are the actual matched pattern
There are 2 fars in our string

Once upon a time
 in a galaxy far far far far far far far away matches
'far far far far far far far ' and 'far ' are the actual matched pattern
There are 7 fars in our string

In this example, we have nested parentheses. I wanted to match 'far ' as many times as it appears and needed to use the internal parentheses to accomplish this. The external parentheses assign $1 since they are seen first, and the internal ones assign $2. Notice that if we change the {1,} to {4,}, string six is no longer a match but string seven still is! Try it on your own.
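The {4,} experiment can be sketched as follows, using the same two strings. The variable names $six_matches, $seven_matches and $count are introduced only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $six   = "Once upon a time\n in a galaxy far far away";
my $seven = "Once upon a time\n in a galaxy far far far far far far far away";

# {4,} demands at least four consecutive repetitions of the group (far )
my $six_matches   = ($six   =~ m/((far ){4,})/) ? 1 : 0;
my $seven_matches = ($seven =~ m/((far ){4,})/) ? 1 : 0;

# Count the words captured from the longer string
my $count = 0;
if ($seven =~ m/((far ){4,})/) {
    my @fars = split /\s+/, $1;
    $count = @fars;
}

print "String six matches:   $six_matches\n";
print "String seven matches: $seven_matches ($count fars captured)\n";
```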
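Alternation lets a pattern match one thing or another. To use it, put the '|' symbol in your pattern. At each position in the string, Perl looks for the first pattern, and if it can't find it there, it then looks for the second pattern. The '|' has the lowest precedence of the regex operators, so without parentheses each alternative extends to the end of the whole pattern. If you use parentheses, the alternation is confined to the group, which Perl considers a single unit generically called an atom in regex speak. Hence: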
#!/usr/bin/perl
$zero = "bbbb";
$one = "b";
$two = "brrrr";
$three = "moooooooo";
$four = "bagly";
$five = "agly";
$six = "Once upon a time\n in a galaxy far far away";
$seven = "Once upon a time\n in a galaxy far far far far far far far away";
@array = ($zero, $one, $two, $three, $four, $five, $six, $seven);

for $tmp (@array) {
    if ($tmp =~ m/((far )|(far far))/s) {
        print "$tmp matches\n";
        print "\'$1\' and \'$2\' and \'$3\' are the actual matched pattern\n";
        @fars = split /\s+/, $1;
        print "There are " . @fars . " fars in our string\n\n";
    }
}

ruben@JBSapphire:~/perl_course>file69.pl
Once upon a time
 in a galaxy far far away matches
'far ' and 'far ' and '' are the actual matched pattern
There are 1 fars in our string

Once upon a time
 in a galaxy far far far far far far far away matches
'far ' and 'far ' and '' are the actual matched pattern
There are 1 fars in our string

But with a little change:

#!/usr/bin/perl
$zero = "bbbb";
$one = "b";
$two = "brrrr";
$three = "moooooooo";
$four = "bagly";
$five = "agly";
$six = "Once upon a time\n in a galaxy far far away";
$seven = "Once upon a time\n in a galaxy far far far far far far far away";
@array = ($zero, $one, $two, $three, $four, $five, $six, $seven);

for $tmp (@array) {
    if ($tmp =~ m/((Once )|(far far))/s) {
        print "$tmp matches\n";
        print "\'$1\' and \'$2\' and \'$3\' are the actual matched pattern\n";
        @fars = split /\s+/, $1;
        print "There are " . @fars . " words in our string\n\n";
    }
}

ruben@JBSapphire:~/perl_course>file69a.pl
Once upon a time
 in a galaxy far far away matches
'Once ' and 'Once ' and '' are the actual matched pattern
There are 1 words in our string

Once upon a time
 in a galaxy far far far far far far far away matches
'Once ' and 'Once ' and '' are the actual matched pattern
There are 1 words in our string

or

#!/usr/bin/perl
$zero = "bbbb";
$one = "b";
$two = "brrrr";
$three = "moooooooo";
$four = "bagly";
$five = "agly";
$six = "Once upon a time\n in a galaxy close close away";
$seven = "Once upon a time\n in a galaxy far far far far far far far away";
@array = ($zero, $one, $two, $three, $four, $five, $six, $seven);

for $tmp (@array) {
    if ($tmp =~ m/((close close )|(far far))/s) {
        print "$tmp matches\n";
        print "\'$1\' and \'$2\' and \'$3\' are the actual matched pattern\n";
        @fars = split /\s+/, $1;
        print "There are " . @fars . " words in our string\n\n";
    }
}

ruben@JBSapphire:~/perl_course>file69b.pl
Once upon a time
 in a galaxy close close away matches
'close close ' and 'close close ' and '' are the actual matched pattern
There are 2 words in our string

Once upon a time
 in a galaxy far far far far far far far away matches
'far far' and '' and 'far far' are the actual matched pattern
There are 2 words in our string
Notice and account for the different order that the patterns are assigned when printed.
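One way to account for the order is to test which capture variable is defined after the match: only the alternative that actually matched fills its own parentheses. The sketch below assumes shortened versions of the two strings, and the %which hash is a name made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened, hypothetical versions of strings six and seven.
my @strings = (
    "in a galaxy close close away",
    "in a galaxy far far far away",
);

my %which;   # records which alternative matched for each string
for my $tmp (@strings) {
    if ($tmp =~ m/((close close )|(far far))/) {
        # $1 is the outer group and is always set on a match.
        # $2 belongs to the first alternative, $3 to the second;
        # the one that did not take part in the match stays undefined.
        $which{$tmp} = defined $2 ? 'first' : 'second';
        print "'$tmp' matched the $which{$tmp} alternative as '$1'\n";
    }
}
```

Parentheses are numbered by the position of their opening bracket, counting from the left, regardless of which alternative ends up matching.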
#!/usr/bin/perl
$zero = "bbbb";
$one = "b";
$two = "brrrr";
$three = "moooooooo";
$four = "bagly";
$five = "agly";
$six = "Once upon a time ....\n in a galaxy close close away";
@array = ($zero, $one, $two, $three, $four, $five, $six);

# s/// replaces what the pattern matched and returns true on success
for $tmp (@array) {
    if ($tmp =~ s/(close close )/far far /) {
        print "$tmp is the matched variable\n";
        print "\'$1\' is the actual matched pattern\n";
        @fars = split /\s+/, $1;
        print "There are " . @fars . " words in our string\n\n";
    }
}
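is an example of a simple substitution: the s/// operator searches the bound string for the pattern in its first half and replaces what it matched with the contents of its second half. One thing we can do wrong in the second half of the substitution is to include parentheses around far. There they are not grouping metacharacters, and they are inserted literally.
The second half behaves like a regular double-quoted string, so you can put things such as scalars and escapes like \n there.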
The general rule for substitution is:
s/PATTERN/REPLACEMENT/egimosx
Substitution has different switches than plain matches. This table explains the use of the switches.
s///i | Case insensitive Matching | Just like with plain matching |
s///g | Global replacement. | Replaces every pattern matched in the bound variable. Similar to VI |
s///e | evaluate the second half of the substitution string. | This can be most useful, but can also be a security risk. It works similarly as the function eval (perldoc -f eval). Any correct perl syntax can be put on the right side and it is evaluated on the fly by Perl. |
s///m | Multiple lines. | As before, changes the default behavior of '^' and '$' so that they match at each \n. |
s///s | Single line | As before, it changes the behavior of '.' so that it matches linefeeds. |
s///x | Allows white space in your string | Works like matches |
s///o | Compile regex once. | Normally, the regular expression engine will evaluate a pattern on the left. Before doing so, it interpolates any scalars that might be included within it. If you are running it in a loop or under other conditions, the pattern will keep being re-evaluated and the scalars re-accessed. If you don't want this to happen, with its overhead to your program, you can use the s///o switch to evaluate the pattern only once. If the scalars change, the pattern will not change. |
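A few of these switches can be compared on small strings. This is a sketch with made-up sample data ("close Close close" and "3 apples"); it uses the common (my $copy = $original) =~ s/// idiom so that each substitution works on its own copy.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "close Close close";

# No switch: only the first match is replaced.
(my $first = $text) =~ s/close/far/;

# g: every match is replaced, but the capitalised Close is left alone.
(my $global = $text) =~ s/close/far/g;

# g and i together: case is ignored as well.
(my $nocase = $text) =~ s/close/far/gi;

# e: the second half is evaluated as Perl code, so $1 * 2 is computed.
(my $evaled = "3 apples") =~ s/(\d+)/$1 * 2/e;

# Without e the second half is just an interpolated string.
(my $literal = "3 apples") =~ s/(\d+)/$1 * 2/;

print "$first\n$global\n$nocase\n$evaled\n$literal\n";
```

Because s///e runs code, the second half should never be built from untrusted input.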
#!/usr/bin/perl
$zero = "far";
$one = "close";
$two = "far far ";
$three = "Solar System";
$four = "TIME";
$five = "print $tmp";
$six = "Once upon a time ....\n in a galaxy close close close close close away";
@array = ($zero, $one, $two, $three, $four, $five, $six);

# FIRST: a literal replacement string
for $tmp (@array) {
    if ($tmp =~ s/((close ){1,})/far far /) {
        print "This is the FIRST altered string:\n$tmp\n";
        print "\'$1\' and \'$2\' are the actual matched pattern\n";
        @fars = split /\s+/, $1;
        print "There are " . @fars . " words in our matched string\n\n";
    }
}

# SECOND: a scalar interpolated into the replacement
@array = &setright;
for $tmp (@array) {
    if ($tmp =~ s/((close ){1,})/$two/) {
        print "This is the SECOND altered string:\n$tmp\n";
        print "\'$1\' and \'$2\' are the actual matched pattern\n";
        @fars = split /\s+/, $1;
        print "There are " . @fars . " words in our matched string\n\n";
    }
}

# THIRD: the replacement is evaluated as Perl code because of the e switch
@array = &setright;
for $tmp (@array) {
    if ($tmp =~ s/((close ){1,})/"$zero " x5/e) {
        print "This is the THIRD altered string:\n$tmp\n";
        print "\'$1\' and \'$2\' are the actual matched pattern\n";
        @fars = split /\s+/, $1;
        print "There are " . @fars . " words in our matched string\n\n";
    }
}

# FOURTH: the same replacement without the e switch
@array = &setright;
for $tmp (@array) {
    if ($tmp =~ s/((close ){1,})/"$zero " x5/) {
        print "This is the FOURTH altered string:\n$tmp\n";
        print "\'$1\' and \'$2\' are the actual matched pattern\n";
        @fars = split /\s+/, $1;
        print "There are " . @fars . " words in our matched string\n\n";
    }
}

# FIFTH: two statements in the evaluated replacement
@array = &setright;
for $tmp (@array) {
    # Notice it is not easy to predict WHEN this will print
    if ($tmp =~ s/((close ){1,})/"$zero " x5; print "$tmp\n";/e) {
        print "This is the FIFTH altered string:\n$tmp\n";
        print "\'$1\' and \'$2\' are the actual matched pattern\n";
        @fars = split /\s+/, $1;
        print "There are " . @fars . " words in our matched string\n\n";
    }
}

# SIXTH: a scalar interpolated into the pattern
@array = &setright;
$six = $array[6];
for $tmp (@array) {
    if ($tmp =~ s/(($six){1,})/"$zero " x5/e) {
        print "This is the SIXTH altered string:\n$tmp\n";
        print "\'$1\' and \'$2\' are the actual matched pattern\n";
        @fars = split /\s+/, $1;
        print "There are " . @fars . " words in our matched string\n\n";
    }
}

# SEVENTH: the o switch compiles the pattern only once
@array = &setright;
for $tmp (@array) {
    $i++;
    $place = $tmp;
    print "This is the SEVENTH sub $i string before attempting to alter it:\n$place\n";
    $tmp =~ s/(($place){1,})/"$zero " x5/oe;
    print "This is the SEVENTH sub $i the string after we attempted to alter:\n$tmp\n";
    print "\'$1\' and \'$2\' are the actual matched pattern\n";
    @fars = split /\s+/, $1;
    print "There are " . @fars . " words in our matched string\n\n";
}
# NOTICE THAT $1 and $2 seem to remain 'far' after the initial match
# and $place within the regex does not alter

# EIGHTH: the same loop without the o switch
$i = 0;
@array = &setright;
for $tmp (@array) {
    $i++;
    $place = $tmp;
    $tmp =~ s/(($place){1,})/"$zero " x5/e;
    print "This is the EIGHTH sub $i altered string:\n$tmp\n";
    print "\'$1\' and \'$2\' are the actual matched pattern\n";
    @fars = split /\s+/, $1;
    print "There are " . @fars . " words in our matched string\n\n";
}
# NOTICE THAT $1 and $2 change

# NINTH: global replacement with the g switch
$i = 0;
@array = &setright;
for $tmp (@array) {
    $i++;
    $place = $tmp;
    $tmp =~ s/($one)/$zero/g;
    print "This is the NINTH sub $i altered string:\n$tmp\n";
    print "\'$1\' is the actual matched pattern\n";
    @fars = split /\s+/, $1;
    print "There are " . @fars . " words in our matched string\n\n";
}

# Returns a fresh copy of the original strings
sub setright {
    my $zero = "far";
    my $one = "close";
    my $two = "far far ";
    my $three = "Solar System";
    my $four = "TIME";
    my $five = "print $tmp";
    my $six = "Once upon a time ....\n in a galaxy close close close close close away";
    my @array = ($zero, $one, $two, $three, $four, $five, $six);
    return @array;
}
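A character class, written in square brackets, matches any one of the characters listed inside it. For example, s{[abcd]}{efg} will replace any one of the letters a, b, c or d with the string efg (only the first one found, unless the g switch is added). You can also use a hyphen in your class to specify a range of characters, such as m/[a-z]/, which will match any character between a and z, but not numbers, capital letters, extended ASCII characters, etc.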
You can invert the logic of the match and permit a match of anything but the characters in your class by beginning the class with a caret '^'. m([^.\-&\\]) will match anything but the period, hyphen, ampersand or backslash. You might see such code in security functions. Since '-' has special meaning within the class brackets, backslash it if you wish to match it.
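Character classes can be tried out on a few short strings. The sample strings below ("a cab", "perl", "Perl5", "a.b-c&d") are invented for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace every a, b, c or d with efg (g makes it apply to all of them).
(my $replaced = "a cab") =~ s{[abcd]}{efg}g;

# A range: the whole string must consist of lowercase letters.
my $all_lower = ("perl"  =~ m/^[a-z]+$/) ? 1 : 0;
my $has_other = ("Perl5" =~ m/^[a-z]+$/) ? 1 : 0;

# A negated class: collect every character that is NOT a period,
# hyphen, ampersand or backslash.
my @kept = ("a.b-c&d" =~ m/([^.\-&\\])/g);
my $kept = join '', @kept;

print "$replaced\n";
print "perl is all lowercase: $all_lower\n";
print "Perl5 is all lowercase: $has_other\n";
print "characters kept: $kept\n";
```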
You now have a fairly good introduction to pattern matching in Perl for your general needs. Perl regular expressions are much more detailed than what is covered in this section. The two most important documents that come with your Perl distribution are man perlre and man perlop. Very complex things can be done with regex in Perl, including lookaheads, returned lists, and a host of special variables which alter how your pattern behaves. In truth, Perl regular expressions are so extensive that a full course can be given on the subject.
Be aware that many Perl functions interplay with regular expressions. Two of the most important ones are split and grep.
split is defined as @array = split PATTERN, $scalar. split is used extensively for data manipulation. It is often the case that you receive data as some delimited string. The Unix /etc/passwd file, which defines users, is an example. It is a colon-delimited text file. If you want to open it and assign each user's record to a database in memory, split is the way to go. Try this program, which uses many of the programming techniques that we have learned so far. Can you alter this program to send an email to each user on the list?
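As a minimal sketch of split and grep working together, the following uses a few hard-coded, hypothetical passwd-style records instead of reading the real /etc/passwd file. The field names follow the usual passwd layout, and the variable names are chosen for this example only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical records in /etc/passwd format (not read from a real system).
my @lines = (
    "root:x:0:0:root:/root:/bin/bash",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin",
    "ruben:x:1000:100:Ruben:/home/ruben:/bin/bash",
);

# split breaks one record into its seven colon-delimited fields.
my ($user, $pass, $uid, $gid, $gecos, $home, $shell) = split /:/, $lines[2];
print "$user has uid $uid, home $home and shell $shell\n";

# grep keeps the records whose last field is /bin/bash;
# map then pulls the user name (field 0) out of each of them.
my @bash_users = map { (split /:/, $_)[0] } grep { m{:/bin/bash$} } @lines;
print "bash users: @bash_users\n";
```

Reading the real file would only change where @lines comes from, for example a while loop over an opened filehandle.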
|