data/TWiki/RegularExpression.txt,v
changeset 0 414e01d06fd5
-1:000000000000 0:414e01d06fd5
       
     1 head	1.8;
       
     2 access;
       
     3 symbols;
       
     4 locks; strict;
       
     5 comment	@# @;
       
     6 
       
     7 
       
     8 1.8
       
     9 date	2007.01.16.04.12.04;	author TWikiContributor;	state Exp;
       
    10 branches;
       
    11 next	1.7;
       
    12 
       
    13 1.7
       
    14 date	2006.04.01.05.55.09;	author TWikiContributor;	state Exp;
       
    15 branches;
       
    16 next	1.6;
       
    17 
       
    18 1.6
       
    19 date	2006.02.01.12.01.17;	author TWikiContributor;	state Exp;
       
    20 branches;
       
    21 next	1.5;
       
    22 
       
    23 1.5
       
    24 date	2003.04.15.05.19.25;	author PeterThoeny;	state Exp;
       
    25 branches;
       
    26 next	1.4;
       
    27 
       
    28 1.4
       
    29 date	2003.03.22.05.12.00;	author PeterThoeny;	state Exp;
       
    30 branches;
       
    31 next	1.3;
       
    32 
       
    33 1.3
       
    34 date	2002.11.23.05.52.00;	author PeterThoeny;	state Exp;
       
    35 branches;
       
    36 next	1.2;
       
    37 
       
    38 1.2
       
    39 date	2000.08.23.06.58.32;	author PeterThoeny;	state Exp;
       
    40 branches;
       
    41 next	1.1;
       
    42 
       
    43 1.1
       
    44 date	2000.08.18.08.47.58;	author PeterThoeny;	state Exp;
       
    45 branches;
       
    46 next	;
       
    47 
       
    48 
       
    49 desc
       
    50 @none
       
    51 @
       
    52 
       
    53 
       
    54 1.8
       
    55 log
       
    56 @buildrelease
       
    57 @
       
    58 text
       
    59 @%META:TOPICINFO{author="TWikiContributor" date="1164227726" format="1.1" version="8"}%
       
    60 ---+!! Regular Expressions
       
    61 
       
    62 %TOC%
       
    63 ---++ Introduction
       
    64 
       
    65 Regular expressions (REs), unlike simple queries, allow you to search for text which matches a particular pattern.
       
    66 
       
     67 REs are similar to (but more powerful than) the "wildcards" used in the command-line interfaces found in operating systems such as Unix and MS-DOS. REs are used by sophisticated search engines, as well as by many Unix-based languages and tools (e.g., =awk=, =grep=, =lex=, =perl=, and =sed=).
       
    68 
       
    69 ---++ Examples
       
    70 
       
    71 | =compan(y|ies)= | Search for *company*, *companies* |
       
    72 | =(peter|paul)= | Search for *peter*, *paul* |
       
    73 | =bug*= | Search for *bug*, *bugg*, *buggg* or simply *bu* (a star matches *zero* or more instances of the previous character) |
       
    74 | =bug.*= | Search for *bug*, *bugs*, *bugfix* (a dot-star matches zero or more instances of *any* character) |
       
    75 | =[Bb]ag= | Search for *Bag*, *bag* |
       
    76 | =b[aiueo]g= | Second letter is a vowel. Matches *bag*, *bug*, *big* |
       
     77 | =b.g= | Second character is any single character, not just a letter. Matches *bag* and *bug*, but also *b&g* |
       
    78 | =[a-zA-Z]= | Matches any one letter (but not a number or a symbol) |
       
     79 | =[^0-9a-zA-Z]= | Matches any one character that is not a number or a letter (for example a symbol or a space) |
       
    80 | =[A-Z][A-Z]*= | Matches one or more uppercase letters |
       
    81 | =[0-9]{3}-[0-9]{2}-[0-9]{4}= | US social security number, e.g. *123-45-6789* |
       
    82 | =PNG;Chart= | Search for topics containing the words *PNG* _and_ *Chart*. The =";"= _and_ separator is TWiki-specific and is not a regular expression; it is a useful facility that is enabled when regular expression searching is enabled. |
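
A few rows of the table above can be checked with a short script. This is only a sketch in Python, whose =re= module uses Perl-style syntax that agrees with the table for these simple patterns. The sample words and the helper =matching_words= are made up for illustration. Note that =re.fullmatch= requires the whole word to match, whereas a TWiki search looks for the pattern anywhere in the text.

```python
import re

# Patterns copied from the table above; the sample words are illustrative only.
samples = {
    r"compan(y|ies)": ["company", "companies", "compans"],
    r"bug*": ["bu", "bug", "buggg"],
    r"b[aiueo]g": ["bag", "big", "b&g"],
    r"b.g": ["bag", "b&g", "bg"],
    r"[0-9]{3}-[0-9]{2}-[0-9]{4}": ["123-45-6789", "123-456-789"],
}


def matching_words(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]


results = {p: matching_words(p, ws) for p, ws in samples.items()}
for pattern, hits in results.items():
    print(pattern, "->", hits)
```

The star in =bug*= accepts *bu* because it allows zero copies of the preceding *g*, and the dot in =b.g= accepts *b&g* but not *bg* because it stands for exactly one character.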
       
    83 
       
    84 ---++ Searches with "and" combinations
       
    85 
       
    86    * TWiki extends the regular expressions with an _and_ search. The delimiter is a semicolon =;=. Example search for "form" _and_ "template": =form;template=
       
    87 
       
    88    * Use Google if your TWiki site is public. Example search for "form" _and_ "template" at TWiki.org: =site:twiki.org +form +template=
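
The idea behind the =;= separator can be sketched as follows. This is not TWiki's implementation, only a minimal Python illustration that assumes the query is split at every semicolon and that a topic matches when each part is found somewhere in its text. The function =matches_all= and the sample topic texts are hypothetical.

```python
import re


def matches_all(query, text):
    """Return True if every ';'-separated regular expression is found in text."""
    # Naive split: a pattern that needs a literal ';' cannot be expressed this way.
    parts = [p for p in query.split(";") if p]
    return bool(parts) and all(re.search(p, text) for p in parts)


# Made-up topic texts, used only to exercise the function.
topics = {
    "TWikiForms": "Attach a form to a topic by editing its template.",
    "TWikiTemplates": "A template defines the page layout.",
    "WebHome": "Welcome to the web.",
}

hits = [name for name, text in topics.items() if matches_all("form;template", text)]
print(hits)
```

Each part is still an ordinary regular expression, so a query such as =[Ff]orm;templates?= would combine both features in this sketch.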
       
    89 
       
    90 ---++ Advanced
       
    91 
       
    92 Here is stuff for our UNIX freaks: (copied from 'man egrep')
       
    93 
       
    94 <blockquote>
       
    95 A regular expression is a pattern that describes a set of strings. Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions.
       
    96 
       
    97 The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any metacharacter with special meaning may be quoted by preceding it with a backslash.
       
    98 
       
    99 A bracket expression is a list of characters enclosed by [ and ]. It matches any single character in that list; if the first character of the list is the caret ^ then it matches any character not in the list. For example, the regular expression [0123456789] matches any single digit.
       
   100 
       
   101 Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, inclusive, using the locale's collating sequence and character set. For example, in the default C locale, [a-d] is equivalent to [abcd]. Many locales sort characters in dictionary order, and in these locales [a-d] is typically not equivalent to [abcd]; it might be equivalent to [aBbCcDd], for example.
       
   102 
       
   103 Finally, certain named classes of characters are predefined within bracket expressions, as follows. Their names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [<nop>[:alnum:]<nop>] means [0-9A-Za-z], except the latter form depends upon the C locale and the ASCII character encoding, whereas the former is independent of locale and character set. (Note that the brackets in these class names are part of the symbolic names, and must be included in addition to the brackets delimiting the bracket list.) Most metacharacters lose their special meaning inside lists. To include a literal ] place it first in the list. Similarly, to include a literal ^ place it anywhere but first. Finally, to include a literal - place it last.
       
   104 
       
    105 The period . matches any single character. The symbol \w is a synonym for [<nop>[:alnum:]<nop>] and \W is a synonym for [^[:alnum:]<nop>].
       
   106 
       
   107 The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line. The symbols \&lt; and \&gt; respectively match the empty string at the beginning and end of a word. The symbol \b matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the edge of a word.
       
   108 
       
   109 A regular expression may be followed by one of several repetition operators:
       
   110 | ? | The preceding item is optional and matched at most once. |
       
   111 | * | The preceding item will be matched zero or more times. |
       
   112 | + | The preceding item will be matched one or more times. |
       
   113 | {n} | The preceding item is matched exactly n times. |
       
   114 | {n,} | The preceding item is matched n or more times. |
       
   115 | {n,m} | The preceding item is matched at least n times, but not more than m times. |
       
   116 
       
   117 Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions.
       
   118 
       
   119 Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression.
       
   120 
       
   121 Repetition takes precedence over concatenation, which in turn takes precedence over alternation. A whole subexpression may be enclosed in parentheses to override these precedence rules.
       
   122 
       
   123 The backreference \n, where n is a single digit, matches the substring previously matched by the nth parenthesized subexpression of the regular expression.
       
   124 </blockquote>
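
The precedence, repetition, word-edge and backreference rules quoted above can be tried out with a small script. The sketch below uses Python's =re= module, which follows Perl-style rather than POSIX =egrep= syntax (it has no =[:alnum:]= classes, for instance) but behaves the same way for these operators. The helper =full= and all test strings are illustrative assumptions.

```python
import re


def full(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# Repetition binds tighter than concatenation: in "ab*" the star applies to "b" only.
star_on_char = [full(r"ab*", s) for s in ("a", "abbb", "abab")]
# Parentheses override this: "(ab)*" repeats the whole group.
star_on_group = [full(r"(ab)*", s) for s in ("", "abab", "abb")]

# Concatenation binds tighter than alternation: "ab|cd" means "ab" or "cd".
alternation = [full(r"ab|cd", s) for s in ("ab", "cd", "abd")]
grouped_alt = [full(r"a(b|c)d", s) for s in ("abd", "acd", "ab")]

# {n,m}: at least n and at most m repetitions of the preceding item.
bounded = [full(r"x{2,3}", s) for s in ("x", "xx", "xxx", "xxxx")]

# \b matches only at the edge of a word, so "bug" inside "debugging" is rejected.
word_edge = [re.search(r"\bbug\b", s) is not None for s in ("a bug here", "debugging")]

# \1 must repeat exactly what the first parenthesized group matched.
doubled = [full(r"(\w+) \1", s) for s in ("bye bye", "bye now")]

for name, value in [
    ("star_on_char", star_on_char),
    ("star_on_group", star_on_group),
    ("alternation", alternation),
    ("grouped_alt", grouped_alt),
    ("bounded", bounded),
    ("word_edge", word_edge),
    ("doubled", doubled),
]:
    print(name, value)
```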
       
   125 
       
   126 __Related Links:__ 
       
   127    * http://perldoc.perl.org/perlretut.html - Regular expressions tutorial
       
   128    * http://www.perl.com/doc/manual/html/pod/perlre.html - Perl regular expressions
       
   129 
       
   130 __Related Topics:__ UserDocumentationCategory
       
   131 @
       
   132 
       
   133 
       
   134 1.7
       
   135 log
       
   136 @buildrelease
       
   137 @
       
   138 text
       
   139 @d1 1
       
   140 a1 1
       
   141 %META:TOPICINFO{author="TWikiContributor" date="1111929255" format="1.0" version="7"}%
       
   142 d68 4
       
   143 @
       
   144 
       
   145 
       
   146 1.6
       
   147 log
       
   148 @buildrelease
       
   149 @
       
   150 text
       
   151 @d1 1
       
   152 a1 1
       
   153 %META:TOPICINFO{author="TWikiContributor" date="1111929255" format="1.0" version="6"}%
       
   154 d28 1
       
   155 a28 1
       
   156 	* TWiki extends the regular expressions with an _and_ search. The delimiter is a semicolon =;=. Example search for "form" _and_ "template": =form;template=
       
   157 d30 1
       
   158 a30 1
       
   159 	* Use Google if your TWiki site is public. Example search for "form" _and_ "template" at TWiki.org: =site:twiki.org +form +template=
       
   160 @
       
   161 
       
   162 
       
   163 1.5
       
   164 log
       
   165 @none
       
   166 @
       
   167 text
       
   168 @d1 68
       
   169 a68 66
       
   170 %META:TOPICINFO{author="PeterThoeny" date="1050383965" format="1.0" version="1.5"}%
       
   171 ---+!! Regular Expressions
       
   172 
       
   173 %TOC%
       
   174 ---++ Introduction
       
   175 
       
   176 Regular expressions (REs), unlike simple queries, allow you to search for text which matches a particular pattern.
       
   177 
       
   178 REs are similar to (but more poweful than) the "wildcards" used in the command-line interfaces found in operating systems such as Unix and MS-DOS. REs are used by sophisticated search engines, as well as by many Unix-based languages and tools ( e.g., =awk=, =grep=, =lex=, =perl=, and =sed= ).
       
   179 
       
   180 ---++ Examples
       
   181 
       
   182 | =compan(y&#124;ies)= | Search for *company*, *companies* |
       
   183 | =(peter&#124;paul)= | Search for *peter*, *paul* |
       
   184 | =bug*= | Search for *bug*, *bugg*, *buggg* or simply *bu* (a star matches *zero* or more instances of the previous character) |
       
   185 | =bug.*= | Search for *bug*, *bugs*, *bugfix* (a dot-star matches zero or more instances of *any* character) |
       
   186 | =[Bb]ag= | Search for *Bag*, *bag* |
       
   187 | =b[aiueo]g= | Second letter is a vowel. Matches *bag*, *bug*, *big* |
       
   188 | =b.g= | Second letter is any letter. Matches also *b&amp;g* |
       
   189 | =[a-zA-Z]= | Matches any one letter (but not a number or a symbol) |
       
   190 | =[^0-9a-zA-Z]= | Matches any symbol (but not a number or a letter) |
       
   191 | =[A-Z][A-Z]*= | Matches one or more uppercase letters |
       
   192 | =[0-9]{3}-[0-9]{2}-[0-9]{4}= | US social security number, e.g. *123-45-6789* |
       
   193 | =PNG;Chart= | Search for topics containing the words *PNG* _and_ *Chart*. The =";"= _and_ separator is TWiki-specific and is not a regular expression; it is a useful facility that is enabled when regular expression searching is enabled. |
       
   194 
       
   195 ---++ Searches with "and" combinations
       
   196 
       
   197 	* TWiki extends the regular expressions with an _and_ search. The delimiter is a semicolon =;=. Example search for "form" _and_ "template": =form;template=
       
   198 
       
   199 	* Use Google if your TWiki site is public. Example search for "form" _and_ "template" at TWiki.org: =site:twiki.org +form +template=
       
   200 
       
   201 ---++ Advanced
       
   202 
       
   203 Here is stuff for our UNIX freaks: (copied from 'man egrep')
       
   204 
       
   205 <blockquote>
       
   206 A regular expression is a pattern that describes a set of strings. Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions.
       
   207 
       
   208 The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any metacharacter with special meaning may be quoted by preceding it with a backslash.
       
   209 
       
   210 A bracket expression is a list of characters enclosed by [ and ]. It matches any single character in that list; if the first character of the list is the caret ^ then it matches any character not in the list. For example, the regular expression [0123456789] matches any single digit.
       
   211 
       
   212 Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, inclusive, using the locale's collating sequence and character set. For example, in the default C locale, [a-d] is equivalent to [abcd]. Many locales sort characters in dictionary order, and in these locales [a-d] is typically not equivalent to [abcd]; it might be equivalent to [aBbCcDd], for example.
       
   213 
       
   214 Finally, certain named classes of characters are predefined within bracket expressions, as follows. Their names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [<nop>[:alnum:]<nop>] means [0-9A-Za-z], except the latter form depends upon the C locale and the ASCII character encoding, whereas the former is independent of locale and character set. (Note that the brackets in these class names are part of the symbolic names, and must be included in addition to the brackets delimiting the bracket list.) Most metacharacters lose their special meaning inside lists. To include a literal ] place it first in the list. Similarly, to include a literal ^ place it anywhere but first. Finally, to include a literal - place it last.
       
   215 
       
   216 The period . matches any single character. The symbol \w is a synonym for [<nop>[:alnum:]<nop>] and \W is a synonym for [^[:alnum]<nop>].
       
   217 
       
   218 The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line. The symbols \&lt; and \&gt; respectively match the empty string at the beginning and end of a word. The symbol \b matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the edge of a word.
       
   219 
       
   220 A regular expression may be followed by one of several repetition operators:
       
   221 | ? | The preceding item is optional and matched at most once. |
       
   222 | * | The preceding item will be matched zero or more times. |
       
   223 | + | The preceding item will be matched one or more times. |
       
   224 | {n} | The preceding item is matched exactly n times. |
       
   225 | {n,} | The preceding item is matched n or more times. |
       
   226 | {n,m} | The preceding item is matched at least n times, but not more than m times. |
       
   227 
       
   228 Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions.
       
   229 
       
   230 Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression.
       
   231 
       
   232 Repetition takes precedence over concatenation, which in turn takes precedence over alternation. A whole subexpression may be enclosed in parentheses to override these precedence rules.
       
   233 
       
   234 The backreference \n, where n is a single digit, matches the substring previously matched by the nth parenthesized subexpression of the regular expression.
       
   235 </blockquote>
       
   236 @
       
   237 
       
   238 
       
   239 1.4
       
   240 log
       
   241 @none
       
   242 @
       
   243 text
       
   244 @d1 1
       
   245 a1 1
       
   246 %META:TOPICINFO{author="PeterThoeny" date="1048309920" format="1.0" version="1.4"}%
       
   247 d15 2
       
   248 a16 1
       
   249 | =bug*= | Search for *bug*, *bugs*, *bugfix* |
       
   250 @
       
   251 
       
   252 
       
   253 1.3
       
   254 log
       
   255 @none
       
   256 @
       
   257 text
       
   258 @d1 1
       
   259 a1 1
       
   260 %META:TOPICINFO{author="PeterThoeny" date="1038030720" format="1.0" version="1.3"}%
       
   261 a4 1
       
   262 
       
   263 d7 1
       
   264 a7 1
       
   265 Regular expressions (REs), unlike simple queries, allow you to search for text which matches a particular pattern. 
       
   266 d23 1
       
   267 a23 1
       
   268 | =PNG;Chart= | Search for topics containing the words *PNG* _and_ *Chart*. This is not a regular expression! But a useful facility that is enabled when regular expression searching is enabled. |
       
   269 d42 1
       
   270 a42 1
       
   271 Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, inclusive, using the locale's collating sequence and character set. For example, in the default C locale, [a-d] is equivalent to [abcd]. Many locales sort characters in dictionary order, and in these locales [a-d] is typically not equivalent to [abcd]; it might be equivalent to [aBbCcDd], for example. 
       
   272 @
       
   273 
       
   274 
       
   275 1.2
       
   276 log
       
   277 @none
       
   278 @
       
   279 text
       
   280 @d1 7
       
   281 d12 50
       
   282 a61 1
       
   283 *Examples*
       
   284 d63 1
       
   285 a63 161
       
   286 <TABLE>
       
   287   <TR>
       
   288 	 <TD>
       
   289 		compan(y|ies)
       
   290 	 </TD><TD>
       
   291 		Search for _company_ , _companies_
       
   292 	 </TD>
       
   293   </TR><TR>
       
   294 	 <TD>
       
   295 		(peter|paul)
       
   296 	 </TD><TD>
       
   297 		Search for _peter_ , _paul_
       
   298 	 </TD>
       
   299   </TR><TR>
       
   300 	 <TD>
       
   301 		bug*
       
   302 	 </TD><TD>
       
   303 		Search for _bug_ , _bugs_ , _bugfix_
       
   304 	 </TD>
       
   305   </TR><TR>
       
   306 	 <TD>
       
   307 		[Bb]ag
       
   308 	 </TD><TD>
       
   309 		Search for _Bag_ , _bag_
       
   310 	 </TD>
       
   311   </TR><TR>
       
   312 	 <TD>
       
   313 		b[aiueo]g
       
   314 	 </TD><TD>
       
   315 		Second letter is a vowel. Matches _bag_ , _bug_ , _big_
       
   316 	 </TD>
       
   317   </TR><TR>
       
   318 	 <TD>
       
   319 		b.g
       
   320 	 </TD><TD>
       
   321 		Second letter is any letter. Matches also _b&g_
       
   322 	 </TD>
       
   323   </TR><TR>
       
   324 	 <TD>
       
   325 		[a-zA-Z]
       
   326 	 </TD><TD>
       
   327 		Matches any one letter (not a number and a symbol)
       
   328 	 </TD>
       
   329   </TR><TR>
       
   330 	 <TD>
       
   331 		[^0-9a-zA-Z]
       
   332 	 </TD><TD>
       
   333 		Matches any symbol (not a number or a letter)
       
   334 	 </TD>
       
   335   </TR><TR>
       
   336 	 <TD>
       
   337 		[A-Z][A-Z]*
       
   338 	 </TD><TD>
       
   339 		Matches one or more uppercase letters
       
   340 	 </TD>
       
   341   </TR><TR>
       
   342 	 <TD>
       
   343 		[0-9][0-9][0-9]-[0-9][0-9]- <br> [0-9][0-9][0-9][0-9]
       
   344 	 </TD><TD VALIGN="top">
       
   345 		US social security number, e.g. 123-45-6789
       
   346 	 </TD>
       
   347   </TR>
       
   348 </TABLE>
       
   349 
       
   350 Here is stuff for our UNIX freaks: <BR>
       
   351 (copied from 'man grep')
       
   352 
       
   353 <pre>
       
   354 	  \c	A backslash (\) followed by any special character is  a
       
   355 			 one-character  regular expression that matches the spe-
       
   356 			 cial character itself.  The special characters are:
       
   357 
       
   358 					+	 `.', `*', `[',  and  `\'  (period,  asterisk,
       
   359 						  left  square  bracket, and backslash, respec-
       
   360 						  tively), which  are  always  special,  except
       
   361 						  when they appear within square brackets ([]).
       
   362 
       
   363 					+	 `^' (caret or circumflex), which  is  special
       
   364 						  at the beginning of an entire regular expres-
       
   365 						  sion, or when it immediately follows the left
       
   366 						  of a pair of square brackets ([]).
       
   367 
       
   368 					+	 $ (currency symbol), which is special at  the
       
   369 						  end of an entire regular expression.							  
       
   370 
       
   371 	  .	 A `.' (period) is a  one-character  regular  expression
       
   372 			 that matches any character except NEWLINE.
       
   373  
       
   374 	  [string]
       
   375 			 A non-empty string of  characters  enclosed  in  square
       
   376 			 brackets  is  a  one-character  regular expression that
       
   377 			 matches any one character in that string.  If, however,
       
   378 			 the  first  character of the string is a `^' (a circum-
       
   379 			 flex or caret), the  one-character  regular  expression
       
   380 			 matches  any character except NEWLINE and the remaining
       
   381 			 characters in the string.  The  `^'  has  this  special
       
   382 			 meaning only if it occurs first in the string.  The `-'
       
   383 			 (minus) may be used to indicate a range of  consecutive
       
   384 			 ASCII  characters;  for example, [0-9] is equivalent to
       
   385 			 [0123456789].  The `-' loses this special meaning if it
       
   386 			 occurs  first (after an initial `^', if any) or last in
       
   387 			 the string.  The `]' (right square  bracket)  does  not
       
   388 			 terminate  such a string when it is the first character
       
   389 			 within it (after an initial  `^',  if  any);  that  is,
       
   390 			 []a-f]  matches either `]' (a right square bracket ) or
       
   391 			 one of the letters a through  f  inclusive.	The  four
       
   392 			 characters  `.', `*', `[', and `\' stand for themselves
       
   393 			 within such a string of characters.
       
   394 
       
   395 	  The following rules may be used to construct regular expres-
       
   396 	  sions:
       
   397 
       
   398 	  *	 A one-character regular expression followed by `*'  (an
       
   399 			 asterisk)  is a regular expression that matches zero or
       
   400 			 more occurrences of the one-character  regular  expres-
       
   401 			 sion.	If  there  is  any choice, the longest leftmost
       
   402 			 string that permits a match is chosen.
       
   403 
       
   404 	  ^	 A circumflex or caret (^) at the beginning of an entire
       
   405 			 regular  expression  constrains that regular expression
       
   406 			 to match an initial segment of a line.
       
   407 
       
   408 	  $	 A currency symbol ($) at the end of an  entire  regular
       
   409 			 expression  constrains that regular expression to match
       
   410 			 a final segment of a line.
       
   411 
       
   412 	  *	 A  regular  expression  (not  just	a	one-
       
   413 			 character regular expression) followed by `*'
       
   414 			 (an asterisk) is a  regular  expression  that
       
   415 			 matches  zero or more occurrences of the one-
       
   416 			 character regular expression.	If  there  is
       
   417 			 any  choice, the longest leftmost string that
       
   418 			 permits a match is chosen.
       
   419 
       
   420 	  +	 A regular expression followed by `+' (a  plus
       
   421 			 sign)  is  a  regular expression that matches
       
   422 			 one or more occurrences of the  one-character
       
   423 			 regular  expression.  If there is any choice,
       
   424 			 the longest leftmost string  that  permits  a
       
   425 			 match is chosen.
       
   426 
       
   427 	  ?	 A regular expression followed by `?' (a ques-
       
   428 			 tion  mark)  is  a  regular  expression  that
       
   429 			 matches zero or one occurrences of  the  one-
       
   430 			 character  regular  expression.	If there is
       
   431 			 any choice, the longest leftmost string  that
       
   432 			 permits a match is chosen.
       
   433 
       
   434 	  |	 Alternation:	 two	 regular	 expressions
       
   435 			 separated  by  `|'  or NEWLINE match either a
       
   436 			 match for  the  first  or  a  match  for  the
       
   437 			 second.
       
   438 
       
   439 	  ()	A regular expression enclosed in  parentheses
       
   440 			 matches a match for the regular expression.
       
   441 
       
   442 	  The order of precedence of operators at the same parenthesis
       
   443 	  level  is  `[ ]'  (character  classes),  then  `*'  `+'  `?'
       
   444 	  (closures),then  concatenation,  then  `|'  (alternation)and
       
   445 	  NEWLINE.
       
   446 </pre>
       
   447 d65 2
       
   448 @
       
   449 
       
   450 
       
   451 1.1
       
   452 log
       
   453 @none
       
   454 @
       
   455 text
       
   456 @d1 3
       
   457 a3 1
       
   458 Regular expressions allow more specific queries then a simple query.
       
   459 @