Regular Expressions and Matching (Modern Perl 2011-2012)
© 2010-2012 chromatic
Published by Onyx Neon
Regular Expressions and Matching
Perl's text processing power comes from its use of regular expressions. A regular expression (regex or regexp) is a pattern which describes characteristics of a piece of text. A regular expression engine interprets patterns and applies them to match or modify pieces of text.
Perl's core regex documentation includes a tutorial (perldoc perlretut), a reference guide (perldoc perlreref), and full documentation (perldoc perlre). Jeffrey Friedl's book Mastering Regular Expressions explains the theory and the mechanics of how regular expressions work. While mastering regular expressions is a daunting pursuit, a little knowledge will give you great power.
Literals
Regexes can be as simple as substring patterns:
    my $name = 'Chatfield';
    say 'Found a hat!' if $name =~ /hat/;
The match operator (m//, abbreviated //) identifies a regular expression; in this example, hat. This pattern is not a word. Instead it means "the h character, followed by the a character, followed by the t character." Each character in the pattern is an indivisible element, or atom. It matches or it doesn't.
The regex binding operator (=~) is an infix operator (Fixity) which applies the regex of its second operand to a string provided by its first operand. When evaluated in scalar context, a match evaluates to a true value if it succeeds. The negated form of the binding operator (!~) evaluates to a true value unless the match succeeds.
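As a small sketch of the two binding operators side by side, the following reuses the $name value from the example above. The variable names $has_hat and $lacks_cap are illustrative only.

```perl
use strict;
use warnings;
use feature 'say';

my $name = 'Chatfield';

# =~ evaluates to a true value when the pattern matches somewhere in the string
my $has_hat = $name =~ /hat/ ? 1 : 0;

# !~ evaluates to a true value when the pattern matches nowhere in the string
my $lacks_cap = $name !~ /cap/ ? 1 : 0;

say "has hat: $has_hat";
say "lacks cap: $lacks_cap";
```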
Remember index!
The index builtin can also search for a literal substring within a string. Using a regex engine for that is like flying your autonomous combat helicopter to the corner store to buy cheese, but Perl allows you to decide what you find most maintainable.
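For comparison, here is a minimal sketch of the same literal search written with index. It returns the zero-based position of the first occurrence, or -1 when the substring is absent. The variable names are illustrative.

```perl
use strict;
use warnings;
use feature 'say';

my $name = 'Chatfield';

# index returns the zero-based offset of the substring, or -1 if it is absent
my $hat_position = index( $name, 'hat' );
my $cap_position = index( $name, 'cap' );

say 'Found a hat!' if $hat_position >= 0;
say 'No cap here.' if $cap_position == -1;
```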
The substitution operator, s///, is in one sense a circumfix operator (Fixity) with two operands. Its first operand is a regular expression to match against the string supplied through the regex binding operator. The second operand is a replacement string which takes the place of the matched portion of that string. For example, to cure pesky summer allergies:
    my $status = 'I feel ill.';
    $status =~ s/ill/well/;
    say $status;
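The text above does not discuss what s/// itself evaluates to. As a hedged aside, a substitution returns a true value when it replaces something and a false value when nothing matched, which this sketch uses to detect whether the string changed. The names $changed and $changed_again are illustrative.

```perl
use strict;
use warnings;
use feature 'say';

my $status = 'I feel ill.';

# The first substitution finds "ill" and replaces it
my $changed = ( $status =~ s/ill/well/ );

# The second finds nothing left to replace, so it evaluates to false
my $changed_again = ( $status =~ s/ill/well/ );

say $status;
say 'First substitution changed the string'   if $changed;
say 'Second substitution changed nothing'     if !$changed_again;
```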
The qr// Operator and Regex Combinations
The qr// operator creates first-class regexes. Interpolate them into the match operator to use them:
    my $hat = qr/hat/;
    say 'Found a hat!' if $name =~ /$hat/;
... or combine multiple regex objects into complex patterns:
    my $hat   = qr/hat/;
    my $field = qr/field/;
    say 'Found a hat in a field!'
        if $name =~ /$hat$field/;
    like( $name, qr/$hat$field/,
        'Found a hat in a field!' );
Like is, with More like
Test::More's like function tests that the first argument matches the regex provided as the second argument.
Quantifiers
Regular expressions get more powerful through the use of regex quantifiers, which allow you to specify how often a regex component may appear in a matching string. The simplest quantifier is the zero or one quantifier, or ?:
    my $cat_or_ct = qr/ca?t/;
    like( 'cat', $cat_or_ct, "'cat' matches /ca?t/" );
    like( 'ct',  $cat_or_ct, "'ct' matches /ca?t/"  );
Any atom in a regular expression followed by the ? character means "match zero or one of this atom." This regular expression matches if zero or one a characters immediately follow a c character and immediately precede a t character, either the literal substring cat or ct.
The one or more quantifier, or +, matches only if there is at least one of the quantified atom:
    my $some_a = qr/ca+t/;
    like( 'cat',    $some_a, "'cat' matches /ca+t/" );
    like( 'caat',   $some_a, "'caat' matches"       );
    like( 'caaat',  $some_a, "'caaat' matches"      );
    like( 'caaaat', $some_a, "'caaaat' matches"     );
    unlike( 'ct',   $some_a, "'ct' does not match"  );
There is no theoretical limit to the maximum number of quantified atoms which can match.
The zero or more quantifier, *, matches zero or more instances of the quantified atom:
    my $any_a = qr/ca*t/;
    like( 'cat',    $any_a, "'cat' matches /ca*t/" );
    like( 'caat',   $any_a, "'caat' matches"       );
    like( 'caaat',  $any_a, "'caaat' matches"      );
    like( 'caaaat', $any_a, "'caaaat' matches"     );
    like( 'ct',     $any_a, "'ct' matches"         );
As silly as this seems, it allows you to specify optional components of a regex. Use it sparingly, though: it's a blunt and expensive tool. Most regular expressions benefit from using the ? and + quantifiers far more than *. Precision of intent often improves clarity.
Numeric quantifiers express specific numbers of times an atom may match. {n} means that a match must occur exactly n times.
    # equivalent to qr/cat/;
    my $only_one_a = qr/ca{1}t/;
    like( 'cat', $only_one_a, "'cat' matches /ca{1}t/" );
{n,} matches an atom at least n times:
    # equivalent to qr/ca+t/;
    my $some_a = qr/ca{1,}t/;
    like( 'cat',    $some_a, "'cat' matches /ca{1,}t/" );
    like( 'caat',   $some_a, "'caat' matches"          );
    like( 'caaat',  $some_a, "'caaat' matches"         );
    like( 'caaaat', $some_a, "'caaaat' matches"        );
{n,m} means that a match must occur at least n times and cannot occur more than m times:
    my $few_a = qr/ca{1,3}t/;
    like( 'cat',      $few_a, "'cat' matches /ca{1,3}t/" );
    like( 'caat',     $few_a, "'caat' matches"           );
    like( 'caaat',    $few_a, "'caaat' matches"          );
    unlike( 'caaaat', $few_a, "'caaaat' doesn't match"   );
You may express the symbolic quantifiers in terms of the numeric quantifiers, but most programs use the former far more often than the latter.
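One way to convince yourself of that equivalence is to run each symbolic quantifier and its numeric spelling against the same inputs and count disagreements. This is only a sketch; the sample strings and the variable names (@pairs, @samples, $mismatches) are chosen for illustration.

```perl
use strict;
use warnings;
use feature 'say';

# Each pair holds a symbolic quantifier and its numeric equivalent
my @pairs = (
    [ qr/ca?t/, qr/ca{0,1}t/ ],
    [ qr/ca+t/, qr/ca{1,}t/  ],
    [ qr/ca*t/, qr/ca{0,}t/  ],
);

my @samples    = qw( ct cat caat caaat );
my $mismatches = 0;

for my $pair (@pairs)
{
    my ($symbolic, $numeric) = @$pair;

    for my $sample (@samples)
    {
        my $by_symbol = $sample =~ $symbolic ? 1 : 0;
        my $by_number = $sample =~ $numeric  ? 1 : 0;

        # Count any input on which the two spellings disagree
        $mismatches++ if $by_symbol != $by_number;
    }
}

say "mismatches: $mismatches";
```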
Greediness
The + and * quantifiers are greedy, as they try to match as much of the input string as possible. This is particularly pernicious. Consider a naïve use of the "zero or more non-newline characters" pattern of .*:
    # a poor regex
    my $hot_meal = qr/hot.*meal/;
    say 'Found a hot meal!'
        if 'I have a hot meal' =~ $hot_meal;
    say 'Found a hot meal!'
        if 'one-shot, piecemeal work!' =~ $hot_meal;
Greedy quantifiers start by matching everything at first, and back off a character at a time only when it's obvious that the match will not succeed.
The ? quantifier modifier turns a greedy quantifier parsimonious:
    my $minimal_greedy = qr/hot.*?meal/;
When given a non-greedy quantifier, the regular expression engine will prefer the shortest possible potential match and will increase the number of characters identified by the .*? token combination only if the current number fails to match. Because * matches zero or more times, the minimal potential match for this token combination is zero characters:
    say 'Found a hot meal'
        if 'ilikeahotmeal' =~ /$minimal_greedy/;
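The difference between the greedy and non-greedy forms is easiest to see by capturing what each one actually consumed. The sketch below wraps both patterns in capturing parentheses; the sample sentence and variable names are illustrative.

```perl
use strict;
use warnings;
use feature 'say';

my $text = 'hot meal, then another meal';

# Greedy: .* runs to the end of the string, then backs off to the LAST "meal"
my ($greedy) = $text =~ /(hot.*meal)/;

# Non-greedy: .*? grows one character at a time and stops at the FIRST "meal"
my ($non_greedy) = $text =~ /(hot.*?meal)/;

say "greedy:     $greedy";
say "non-greedy: $non_greedy";
```

Both patterns succeed on this input; only the extent of the match differs.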
Use +? to match one or more items non-greedily:
    my $minimal_greedy_plus = qr/hot.+?meal/;
    unlike( 'ilikeahotmeal', $minimal_greedy_plus );
    like( 'i like a hot meal', $minimal_greedy_plus );
The ? quantifier modifier also applies to the ? (zero or one matches) quantifier as well as the range quantifiers. In every case, it causes the regex to match as little of the input as possible.
The greedy patterns .+ and .* are tempting but dangerous. A cruciverbalist (a crossword puzzle aficionado) who needs to fill in four boxes of 7 Down ("Rich soil") will find too many invalid candidates with the pattern:
    my $seven_down = qr/l$letters_only*m/;
She'll have to discard Alabama, Belgium, and Bethlehem long before the program suggests loam. Not only are those words too long, but the matches start in the middle of the words. A working understanding of greediness helps, but there is no substitute for copious testing with real, working data.
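To see those false positives concretely, the following sketch captures the portion of each word the pattern matched. The excerpt never shows a definition of $letters_only, so the lowercase ASCII class used here is an assumption, as is the lowercased word list.

```perl
use strict;
use warnings;
use feature 'say';

# Assumption: $letters_only is not defined in the text; this class stands in for it
my $letters_only = qr/[a-z]/;
my $seven_down   = qr/l$letters_only*m/;

my @words = qw( alabama belgium bethlehem loam );
my %matched;

for my $word (@words)
{
    # Capture the matched substring to show where the match starts and ends
    $matched{$word} = $1 if $word =~ /($seven_down)/;
}

say "$_ matched on '$matched{$_}'" for grep { exists $matched{$_} } @words;
```

Every word in the list matches, and only loam matches as a whole word; the others match on a fragment beginning at an interior l.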
Regex Anchors
Regex anchors force the regex engine to start or end a match at an absolute position. The start of string anchor (\A) dictates that any match must start at the beginning of the string:
    # also matches "lammed", "lawmaker", and "layman"
    my $seven_down = qr/\Al${letters_only}{2}m/;
The end of string anchor (\Z) requires that a match end at the end of the string, or just before a trailing newline at the end of the string.
    # also matches "loom", but an obvious improvement
    my $seven_down = qr/\Al${letters_only}{2}m\Z/;
The word boundary anchor (\b) matches only at the boundary between a word character (\w) and a non-word character (\W). Use an anchored regex to find loam while prohibiting Belgium:
    my $seven_down = qr/\bl${letters_only}{2}m\b/;
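The three anchored variants can be compared by filtering one list of candidates through each. This is a sketch under assumptions: $letters_only is again a stand-in lowercase ASCII class, and the candidate list (including the phrase 'rich loam soil') is invented for illustration.

```perl
use strict;
use warnings;
use feature 'say';

# Assumption: $letters_only is not defined in the text; this class stands in for it
my $letters_only = qr/[a-z]/;

my $start_only = qr/\Al${letters_only}{2}m/;
my $both_ends  = qr/\Al${letters_only}{2}m\Z/;
my $word_bound = qr/\bl${letters_only}{2}m\b/;

my @candidates = (
    'lammed', 'lawmaker', 'layman', 'loam', 'loom',
    'belgium', 'rich loam soil',
);

# Keep only the candidates each pattern accepts
my @start = grep { $_ =~ $start_only } @candidates;
my @both  = grep { $_ =~ $both_ends  } @candidates;
my @bound = grep { $_ =~ $word_bound } @candidates;

say 'start only: ', join ', ', @start;
say 'both ends:  ', join ', ', @both;
say 'word bound: ', join ', ', @bound;
```

Anchoring only the start still admits the longer words; anchoring both ends leaves loam and loom; the word boundary form additionally finds loam inside the phrase, because it anchors to word edges rather than string edges.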
Metacharacters
Perl interprets several characters in regular expressions as metacharacters, characters which represent something other than their literal interpretation. Metacharacters give regex wielders power far beyond mere substring matches. The regex engine treats all metacharacters as atoms.
The . metacharacter means "match any character except a newline". Remember that caveat; many novices forget it. A simple regex search (ignoring the obvious improvement of using anchors) for 7 Down might be /l..m/. Of course, there's always more than one way to get the right answer:
    for my $word (@words)
    {
        next unless length( $word ) == 4;
        next unless $word =~ /l..m/;
        say "Possibility: $word";
    }
If the potential matches in @words are more than the simplest English words, you will get false positives. . also matches punctuation characters, whitespace, and numbers. Be specific! The \w metacharacter represents all alphanumeric characters (Unicode and Strings) and the underscore:
    next unless $word =~ /l\w\wm/;
The \d metacharacter matches digits (also in the Unicode sense):
    # not a robust phone number matcher
    next unless $number =~ /\d{3}-\d{3}-\d{4}/;
    say "I have your number: $number";
Use the \s metacharacter to match whitespace, whether a literal space, a tab character, a carriage return, a form-feed, or a newline:
    my $two_three_letter_words = qr/\w{3}\s\w{3}/;
Negated Metacharacters
These metacharacters have negated forms. Use \W to match any character except a word character. Use \D to match a non-digit character. Use \S to match anything but whitespace. Use \B to match anywhere except a word boundary.
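A short sketch of the metacharacters and their negated forms at work follows. It relies on the /g modifier in list context to collect every match, which this section has not introduced; the sample strings and variable names are illustrative.

```perl
use strict;
use warnings;
use feature 'say';

my $text = "Room 42\tis open";

# \d collects each digit; \S+ collects each run of non-whitespace characters
my @digits = $text =~ /\d/g;
my @chunks = $text =~ /\S+/g;

# \w includes the underscore, so snake_case contains no \W character
my $snake_is_all_word = 'snake_case' !~ /\W/ ? 1 : 0;

# The hyphen is not a word character, so kebab-case does contain one
my $kebab_has_non_word = 'kebab-case' =~ /\W/ ? 1 : 0;

say 'digits: ', join ', ', @digits;
say 'chunks: ', join ', ', @chunks;
say "snake_case is all word characters: $snake_is_all_word";
say "kebab-case has a non-word character: $kebab_has_non_word";
```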
Character Classes
When none of those metacharacters is specific enough, specify your own character class by enclosing its members in square brackets:
    my $ascii_vowels = qr/[aeiou]/;
    my $maybe_cat    = qr/c${ascii_vowels}t/;
Interpolation Happens
Without those curly braces, Perl's parser would interpret the variable name as $ascii_vowelst, which either causes a compile-time error about an unknown variable or interpolates the contents of an existing $ascii_vowelst into the regex.
The hyphen character (-) allows you to specify a contiguous range of characters in a class, such as this $ascii_letters_only regex:
    my $ascii_letters_only = qr/[a-zA-Z]/;
To include the hyphen as a member of the class, move it to the start or end:
    my $interesting_punctuation = qr/[-!?]/;
... or escape it:
    my $line_characters = qr/[|=\-_]/;
Use the caret (^) as the first element of the character class to mean "anything except these characters":
    my $not_an_ascii_vowel = qr/[^aeiou]/;
Metacharacters in Character Classes
Use a caret anywhere but the first position to make it a member of the character class. To include a hyphen in a negated character class, place it after the caret or at the end of the class, or escape it.
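The placement rules for the caret and hyphen can be checked with a few single-character tests. This is a sketch; the class contents and variable names are chosen only to illustrate the rules.

```perl
use strict;
use warnings;
use feature 'say';

# A caret that is not first is an ordinary member of the class
my $caret_member = qr/[a^b]/;

# A hyphen directly after the negating caret is a literal hyphen, not a range
my $not_hyphen_or_a = qr/[^-a]/;

my $caret_matches   = '^' =~ $caret_member    ? 1 : 0;
my $hyphen_excluded = '-' !~ $not_hyphen_or_a ? 1 : 0;
my $b_allowed       = 'b' =~ $not_hyphen_or_a ? 1 : 0;

say "literal caret matches [a^b]: $caret_matches";
say "hyphen is excluded by [^-a]: $hyphen_excluded";
say "b is allowed by [^-a]:       $b_allowed";
```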
Capturing
Regular expressions allow you to group and capture portions of the match for later use. To extract an American telephone number of the form (202) 456-1111 from a string:
    my $area_code    = qr/\(\d{3}\)/;
    my $local_number = qr/\d{3}-?\d{4}/;
    my $phone_number = qr/$area_code\s?$local_number/;
Note especially the escaping of the parentheses within $area_code. Parentheses are special in Perl 5 regular expressions. They group atoms into larger units and also capture portions of matching strings. To match literal parentheses, escape them with backslashes as seen in $area_code.
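To illustrate the two roles of parentheses, this sketch keeps the escaped, literal parentheses inside $area_code and adds unescaped parentheses around each piece to capture it. A match in list context returns the captured substrings. The contact string and the names $area and $local are illustrative.

```perl
use strict;
use warnings;
use feature 'say';

my $area_code    = qr/\(\d{3}\)/;       # escaped parentheses match literally
my $local_number = qr/\d{3}-?\d{4}/;

my $contact = 'Call (202) 456-1111 before noon';

# Unescaped parentheses capture; list context returns the captures in order
my ($area, $local) = $contact =~ /($area_code)\s?($local_number)/;

say "area code:    $area";
say "local number: $local";
```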
Named Captures
Perl 5.10 added named captures, which allow you to capture portions of matches from applying a regular expression and access them later, such as finding a phone number in a string of contact information:
    if ($contact_info =~ /(?