[regex-tutorial]: Part 2

Gerd Ewald Mon, 20 May 2002 07:15:12 -0700

Hi Batsmen,

this is the second part of the regex tutorial. This time we will learn
something about special meta characters which anchor the search
pattern like line and word boundaries. Furthermore you will be able to
use alternatives in search patterns.


The third part is in preparation. To let you know what comes next in
*Part 3*: it will explain quantifiers, groups, subpatterns.

But let's start with part 2 which will be online soon
at http://www.silverstones.com/thebat/Regex.html and
www.pro-privacy.de

=====================

4. Complex Patterns

Ok, that was an easy start! But it wasn't very interesting, was it?
But if simple search patterns were all that "Regular Expressions"
offer, it wouldn't be worth a tutorial.

So, there has to be more! Okay, let's get going with the more
complicated stuff:

4.1 Line Boundaries

Instead of having a regex look for text anywhere in the string we can
force it to search in specific parts of the string. These "anchored"
patterns have their own metacharacters: ^ and $. The circumflex ^ means
that the search pattern is anchored to the start of the line; the
dollar $ means that the regex will look for the pattern at the end of
a line. (Yes, dear experts, for now, let's take a string as one line.
Ok?)

Example: "^give or take" This pattern will only be matched if 'give'
is at the beginning of a line and is followed by 'or take'.

Or: "This is the end$" is only matched if it appears at the end of the
line. It doesn't matter what comes first: 'This is the end' has to be
the end of the line!
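
If you have no regex tester at hand, the two anchors can also be tried
in a few lines of Python. This is only a sketch: Python's re module is
my assumption here (it is not the regex machine of TB!), and the
sample strings other than the patterns themselves are made up.

```python
import re

# Unanchored: 'give or take' may appear anywhere in the string.
unanchored = re.search(r"give or take",
                       "You have to forgive or take the consequences!")

# Anchored with ^: the match has to start at the beginning of the string.
at_start = re.search(r"^give or take", "give or take a few")
not_at_start = re.search(r"^give or take", "You can give or take a few")

# Anchored with $: the match has to finish at the end of the string.
at_end = re.search(r"This is the end$", "My friend, This is the end")
not_at_end = re.search(r"This is the end$", "This is the end, my friend")

print(unanchored is not None)    # True
print(at_start is not None)      # True
print(not_at_start is not None)  # False
print(at_end is not None)        # True
print(not_at_end is not None)    # False
```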

You can use these two metacharacters to speed up the regex. I admit,
it is not all that important when you use regex in TB! because you
won't be working with large amounts of data. But on the other hand: it
can't hurt anyone ;-)

Why does the regex work faster if you use the circumflex or the
dollar, you ask? Ok, let's use our example regex "^give or take" on
the string 'Once upon a time': the regex machine first checks whether
it is at the beginning of the line. This returns TRUE. Next it checks
whether the following character is a 'g'. This returns FALSE, so the
search is cancelled at once! Now what would have happened without the
circumflex? The regex machine would have tried the second, third,
fourth etc. character as a starting point for the search pattern, only
to find out that the pattern doesn't exist in that string. The longer
the string, the more time the regex machine takes to fail ;-)

4.2 Word Boundaries

But there is more that regexian offers. Word boundaries! Some people
forget about this because they think there is another way to define
word boundaries. Believe me, there is, but it's nowhere near as easy
as this!

"\b" makes the regex search for the pattern at a word boundary, that
is at the position where a word begins or ends: "\bgive or take".

Hey, we know this one, don't we? That is our first example again!
Without "\b" the pattern was found in 'You have to forgive or take the
consequences!', but now it won't be found there, thanks to the word
boundary metacharacter: there is no word boundary between the 'r' and
the 'g' of 'forgive'.

I remember a discussion in one of the German TB-lists where someone
asked why this metacharacter is necessary, because a word could be
recognized by surrounding spaces. This is not a good idea: words can
also end at a question mark, an exclamation mark, a full stop... A
regex like "ain " would indeed match 'Again a good idea' but wouldn't
find 'Oh no, not again.' You can avoid that when you use "\b" instead
of the space: "ain\b" matches in both strings.
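
The difference between a trailing space and "\b" can be checked with a
small sketch. Again I assume Python's re module as a stand-in for the
regex tester; the two sample strings are the ones from the paragraph
above.

```python
import re

samples = ["Again a good idea", "Oh no, not again."]

# A literal space after 'ain' misses the word that ends at the full stop.
with_space = [bool(re.search(r"ain ", s)) for s in samples]

# \b accepts the space as well as the full stop as the end of the word.
with_boundary = [bool(re.search(r"ain\b", s)) for s in samples]

# \b in front of 'give' rejects the 'give' hidden inside 'forgive'.
forgive = re.search(r"\bgive or take",
                    "You have to forgive or take the consequences!")

print(with_space)     # [True, False]
print(with_boundary)  # [True, True]
print(forgive)        # None
```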

Of course, this metacharacter can be negated, as can the others: "\B"
means that the regex should match at any position in a string that is
not a word boundary.

Another example should explain this: "Re\B." The regex has to match
the characters 'Re' as long as they are not followed by a word
boundary, and then any other character (the dot). Now, we have the
string: 'Re: or Reply:'. Try it in the regex tester. What happens? The
result is 'Rep': the first 'Re' is followed by a colon, which is a
word boundary, so only the 'Re' inside 'Reply' qualifies. Replace \B
by \b and the regex matches 'Re:'. Everything clear now?
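
For those who prefer to see it run, here is the same experiment as a
short sketch in Python (my choice of tool, not the TB! regex tester):

```python
import re

subject = "Re: or Reply:"

# \B: 'Re' must NOT be followed by a word boundary, so 'Re:' is skipped
# and the match is found inside 'Reply'.
not_boundary = re.search(r"Re\B.", subject).group()

# \b: 'Re' must be followed by a word boundary, so the very first 'Re'
# qualifies and the dot picks up the colon.
boundary = re.search(r"Re\b.", subject).group()

print(not_boundary)  # Rep
print(boundary)      # Re:
```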

4.3 Alternatives

You remember the first example in this tutorial "give or take"? When I
introduced it I made the redundant remark that this regex wouldn't
match 'give' OR 'take'. Well, this remark wasn't really redundant: I
needed something to start this chapter, some kind of transition <bg>.
Because this is the chapter that explains how we can use the OR; how
alternative patterns are defined.

To search for alternative patterns, regexian offers a special
metacharacter: the vertical bar, perhaps better known as the pipe
symbol "|". So, what would have been necessary to search for 'give' or
'take'? "give|take". The regex checks whether it matches 'give'. If
not, it checks the string for 'take'.

What happens if the string contains both alternatives? Well, to be
honest, when I started with regex I was convinced that the first
alternative in the regular expression would be matched. But no! The
regex will match the alternative that comes first in the string! Let's
get into details with an example:

Given the regex "this|the|that" and the string 'the hand that signed
this paper' (Ok, ok. You didn't really expect sample strings from
Shakespeare or Yeats, did you?) What does the regex return? 'the' is
the answer! Try it in the regex-tester!
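
A quick sketch of this example, assuming Python's re module instead of
the regex-tester. Listing all matches is an extra I added to show the
order in which the alternatives turn up in the string.

```python
import re

pattern = r"this|the|that"
subject = "the hand that signed this paper"

# The first match is the alternative that comes first in the STRING,
# not the one that is written first in the pattern.
first = re.search(pattern, subject)

# All matches, in the order in which they appear in the string.
all_hits = re.findall(pattern, subject)

print(first.group(), first.start())  # the 0
print(all_hits)                      # ['the', 'that', 'this']
```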

You may combine alternatives as you have seen in the last example.
Just have a look at the following: "^re:|^aw:|^fwd:". This means that
in all three alternatives the regex has to match the beginning of the
line first. Some characters follow and each alternative ends with a
colon. Yep, you are right: there must be a way to simplify this one.
And as in mathematics you can use brackets to make the regex shorter:
"^(re|aw|fwd):".

Well, those simplifications do not necessarily make it easier to read:
"th(is|e|at)" would be a correct and simple alternative to the first
example in this chapter but it is not exactly an easy-to-read example.
;-)
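
Whether the long and the short form really behave alike can be checked
with a little sketch. Python's re module and the sample subject lines
are my assumptions; I use the 'fwd' spelling of the bracketed version
in both patterns.

```python
import re

long_form = r"^re:|^aw:|^fwd:"
short_form = r"^(re|aw|fwd):"

subjects = ["re: hi", "aw: hallo", "fwd: news", "hello re: hi"]

def matched_text(pattern, subject):
    """Return the matched text, or None if the pattern does not match."""
    hit = re.search(pattern, subject)
    return hit.group(0) if hit else None

long_results = [matched_text(long_form, s) for s in subjects]
short_results = [matched_text(short_form, s) for s in subjects]

# The same trick applied to the first example of this chapter.
compact = matched_text(r"th(is|e|at)", "the hand that signed this paper")

print(long_results)   # ['re:', 'aw:', 'fwd:', None]
print(short_results)  # ['re:', 'aw:', 'fwd:', None]
print(compact)        # the
```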

4.4 Special Character Groups and Classes

We have already introduced some of the special search patterns for
groups and classes of characters. I would like to present some others
with varying significance.

In almost every real regex you find the character class "\s". It
represents so-called whitespace characters, that is any character
which produces white space on the screen: space, tab, newline (line
feed), carriage return, form feed. It's ok if you just remember that
any void space in a string will be matched. And, of course, you may
negate this pattern: "\S" matches any character that does not appear
as white space in the string.
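
A small sketch of "\s" and "\S" at work, once more assuming Python's
re module; the sample strings are invented for the purpose.

```python
import re

text = "one two\tthree\nfour"

# \s finds the space, the tab and the newline alike.
whitespace = re.findall(r"\s", text)

# Splitting at every whitespace character leaves the words.
words = re.split(r"\s", text)

# \S finds everything that is NOT whitespace.
visible = re.findall(r"\S", "a b")

print(whitespace)  # [' ', '\t', '\n']
print(words)       # ['one', 'two', 'three', 'four']
print(visible)     # ['a', 'b']
```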

"\A" is a seldom used search pattern: it matches the beginning of the
string. This is not the beginning of the line; no, to search for that
we would have used ^. Later when we talk about options like multiline
you will see where you can use this one. "\Z" is related to "\A": \Z
matches the end of the string and again I can only say: "This is not
the end of the line" because that would have been $. You will see the
difference when we talk about options. Sorry, but you have to be
patient :-)
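
For the impatient, here is a rough preview as a sketch. It assumes
Python's re module and its MULTILINE flag, which is not necessarily
how the option is set in TB!, and a made-up string of two lines
without a trailing line break (regex machines may differ in how \Z
treats a line break at the very end).

```python
import re

text = "Re: one\nRe: two"  # two lines in one string

# With the multiline option ^ matches at the beginning of every line...
caret_hits = re.findall(r"^Re", text, re.MULTILINE)
# ...but \A still matches only at the beginning of the whole string.
a_hits = re.findall(r"\ARe", text, re.MULTILINE)

# Likewise $ matches at the end of every line...
dollar_hits = re.findall(r"one$|two$", text, re.MULTILINE)
# ...while \Z matches only at the end of the whole string.
z_hits = re.findall(r"one\Z|two\Z", text, re.MULTILINE)

print(caret_hits)   # ['Re', 'Re']
print(a_hits)       # ['Re']
print(dollar_hits)  # ['one', 'two']
print(z_hits)       # ['two']
```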

4.5 Overview and Summary

This chapter explained some more possibilities in defining search patterns:
 
· Line boundaries are matched by circumflex ^ (beginning of a line)
  and dollar $ (end of a line).

· Word boundaries are matched by "\b". It matches at the position
  where a word begins or ends. "\B" matches at every position that is
  not the beginning or the end of a word.
 
· Search patterns can contain alternatives to match. The
  alternatives are separated by a vertical bar "|". Characters that
  appear at the same place in each alternative can be placed before or
  after brackets that enclose the alternatives: "^(Re|Aw|Fwd)". All
  alternatives must appear at the beginning of the line in a string to
  be matched.

· Spaces, tabs and line breaks are so-called whitespace characters
  for which a special search pattern exists: \s. The negation is \S.

· Beginning and end of a string: \A and \Z

Exercises:

1. Given the regular expression "(R.:$|^R.:)" and the string 'Ra: or
Re:'. What does the regex match?

2. I want to match 'Re:' at the beginning of a line even if it comes
with a reply counter e.g. 'Re[2]:'. With what we have learned about
regular expressions so far: what is the regex for doing that?

3. Let's try to DIY a regex that matches 'Re' at the beginning of a
line or ')' at the end of a line.

4. What do these search patterns mean?
a)      "^"
b)      "^x$"
c)      "^$"

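Solutions:

First exercise: the regex matches 'Ra:'. We expected that, didn't we?
The regex matches the alternative which comes first in the string:
'Ra:' stands at the beginning of the line, so "^R.:" matches there
before the regex machine ever gets to 'Re:' at the end.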

Ooops, the solution of the second exercise already looks quite
professional, doesn't it: "^(Re|Re\[\d\]):". Ok, maybe you have
something different; something that looks a bit simplified like:
"^Re(|\[\d\]):". It is a good example because the simplified version
shows an absolute void as the first alternative in the brackets: the
'|' symbol has nothing to the left of it other than the open bracket
that starts the subpattern.
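
Both solutions can be compared with a short sketch, assuming Python's
re module and a handful of invented subject lines. Note that "\d"
stands for exactly one digit, so a counter like 'Re[12]:' is not
matched by either version; that has to wait for the quantifiers in
Part 3.

```python
import re

verbose = r"^(Re|Re\[\d\]):"
compact = r"^Re(|\[\d\]):"  # empty first alternative inside the brackets

subjects = ["Re: hello", "Re[2]: hello", "Fwd: Re: hello", "Re[12]: hello"]

def matched_text(pattern, subject):
    """Return the matched text, or None if the pattern does not match."""
    hit = re.search(pattern, subject)
    return hit.group(0) if hit else None

verbose_results = [matched_text(verbose, s) for s in subjects]
compact_results = [matched_text(compact, s) for s in subjects]

print(verbose_results)  # ['Re:', 'Re[2]:', None, None]
print(compact_results)  # ['Re:', 'Re[2]:', None, None]
```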

Third exercise:
"(^Re|\)$)" is one solution. You didn't forget to escape the bracket,
did you? Fine, well done *g*. Now, if you can, try this one in the
regex-tester with the following string: 'Re[2]: bladibla (was: more
bla)'. You will see that the regex matches just 'Re' because at this
point the regex machine returned TRUE for the match. Only if the
beginning of the string is changed to something else will the regex
match the bracket.
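
Here is a sketch of that test, assuming Python's re module. I wrote
the pattern with 'Re' anchored by ^, as the exercise asks for the
beginning of a line; the second subject line with 'Fwd' is my own
variation of the sample string.

```python
import re

pattern = r"(^Re|\)$)"

with_re = "Re[2]: bladibla (was: more bla)"
without_re = "Fwd: bladibla (was: more bla)"

# 'Re' at the beginning is found first, so the search stops there.
first_hit = re.search(pattern, with_re).group()

# Without 'Re' at the beginning only the closing bracket at the end is left.
second_hit = re.search(pattern, without_re).group()

print(first_hit)   # Re
print(second_hit)  # )
```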


Fourth exercise:

The first pattern matches the beginning of a line and nothing more. It
is found in any text, since every line has a beginning: even a void
line would be matched.

The second pattern just looks for a single x character that is alone
in a line.

Last, but not least, the third pattern: it searches for lines that
have a beginning and an end, but nothing else: these are void lines!
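
The three patterns can be put side by side in a sketch like the
following one. It assumes Python's re module, and the four sample
lines (one of them void) are made up.

```python
import re

lines = ["", "x", "xx", "text"]
patterns = ["^", "^x$", "^$"]

# For every pattern: does it match the void line, 'x', 'xx', 'text'?
results = {
    p: [bool(re.search(p, line)) for line in lines]
    for p in patterns
}

print(results["^"])    # [True, True, True, True]
print(results["^x$"])  # [False, True, False, False]
print(results["^$"])   # [True, False, False, False]
```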

=============

CU soon.

-- 
Best regards,
 Gerd 
======================================
Tutorial for using PGP with TheBat! www.pro-privacy.de
----------------------------------------------------------------------------
Never hit a guy with glasses; always use your fists!
----------------------------------------------------------------------------
now playing: WDR2 :-)


