[regex-tutorial]: Part 3

Gerd Ewald Sun, 26 May 2002 00:54:07 -0700

Hi everyone,

this is the third part, which was the most difficult to translate. Ask
Marck: without him it wouldn't have turned out like this. Thank you
again, Marck :-) There are some very long regular expressions in this
part. They might be wrapped for layout reasons. Sorry for that!


The fourth part will be delayed because I won't be at home for a week
and will have no chance to work on it. So, sorry for that, but you will
have to be patient. Anyway, I think part 3 is quite difficult and you
will need some time to work through it.

Good luck.

===Start====

5. Special Elements  - Part 1

Everything we've had so far hasn't been too difficult. But this
chapter is heavy stuff. Please, do me a favour: read this chapter
carefully. Be patient! Try everything with the regex tester; get
familiar with the elements in this chapter: they are essential for
creating proper regex. Although this may be a bit more complicated
than the chapters before, it is certainly more interesting ;-)

5.1 Quantifier

We already know how to define patterns for matching single characters,
groups of characters, character classes or ranges of characters. We
can use alternatives in our search patterns. But something of
absolutely vital interest is missing - the ability to define
repetitions.

You remember the example that was a regex to search for the European
formatted date:

"\d\d\.\d\d\.\d\d\d\d"

For every single digit we wrote "\d". Isn't there another way, much
simpler than repeating the metacharacter once for every character the
regex has to find? Yes, there is! There are quantifiers!

+ * ? are the most important quantifiers. 


The "+"-character means that the character preceding the plus-sign has
to appear at least once at the specific point of the string. "fo+l"
matches 'fool', 'fol' and 'foooool'. "Re:\s+", for example, means that
at least one whitespace has to follow 'Re:' to be matched.
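
For those who prefer a script to the regex tester, here is a small
sketch. It uses Python's re module, which is an assumption on my side:
TB has its own regex machine, but for the plus-sign both should behave
alike. The subject lines are invented.

```python
import re

# "fo+l": an 'f', at least one 'o', then an 'l'.
words = ["fl", "fol", "fool", "foooool"]
plus_matches = [w for w in words if re.fullmatch(r"fo+l", w)]
print(plus_matches)

# "Re:\s+": 'Re:' must be followed by at least one whitespace character.
subjects = ["Re: hello", "Re:hello"]  # invented subject lines
space_matches = [s for s in subjects if re.search(r"Re:\s+", s)]
print(space_matches)
```

'fl' is rejected because there is no 'o' at all, and 'Re:hello' is
rejected because no whitespace follows the colon.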

I hear some of you experts: yes, the usage of quantifiers is not only
restricted to characters. You can use them to repeat metacharacters,
character classes and some other elements we are yet to learn. ;-)

The star "*" represents any number of occurrences of the preceding
character at the specific point in the string. 'Any' really means
'any', even if the character doesn't appear at all. Ooops, what's the
use of that?

Well, let's have a look at the following example:  "Re:\s*\w+" 

Huh, that already looks as cryptic as those regex the experts use <g>.
What does this regex mean?

Search for a 'Re' followed by a colon. Then any number of whitespace
characters may appear - even no spaces at all. What for? In proper
subject lines there should be a space. But imagine we would like to
match any subject string even if someone modified it manually and
deleted the space. We have to tell the regex that there might or might
not be a space. Anyway, both possibilities should be found. This can
be done with the star as quantifier. Well, finally, there has to be at
least one alphanumeric character.
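
The same check can be sketched in a few lines of Python. This assumes
that Python's re module handles \s, \w, the star and the plus-sign like
the regex machine in TB; the subject lines are made up for the example.

```python
import re

# 'Re:', any amount of whitespace (even none), then at least one word character.
pattern = re.compile(r"Re:\s*\w+")

# Hypothetical subject lines, invented for this sketch.
subjects = ["Re: Hello", "Re:Hello", "Re:    Hello", "Re: ", "Fwd: Hello"]

results = {s: bool(pattern.search(s)) for s in subjects}
for subject, matched in results.items():
    print(repr(subject), matched)
```

The first three subjects match with one, no and several spaces. 'Re: '
fails because no alphanumeric character follows.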

Caution: the meaning of this quantifier is sometimes misinterpreted.
Look at the following task: a regex has to be defined that matches
only lines of a string with only digits in it. One solution I saw was:
"^[0-9]*$"

But this regex matches empty lines as well; the star stands for 'no
digit' as well as for 'any number of digits'. So the regex machine
returns TRUE when no digit is in a line. If you want to make sure that
there is at least one digit in a line you have to use the plus-sign:
"^[0-9]+$".
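
A minimal Python sketch of this pitfall, assuming Python's re module
treats the anchors and quantifiers the way TB does. The three test
lines are my own.

```python
import re

lines = ["12345", "", "12a45"]  # the second entry is an empty line

star_hits = [bool(re.search(r"^[0-9]*$", line)) for line in lines]
plus_hits = [bool(re.search(r"^[0-9]+$", line)) for line in lines]

print(star_hits)  # the star accepts the empty line
print(plus_hits)  # the plus-sign rejects it
```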

The question mark means that the preceding character may appear once
or not at all at the specific point of the string. It is a bit like
the star, except that the maximum number of occurrences is 1. "h..?s"
matches 'hers', 'hips' and 'his' or 'has'. Within 'house' it matches
'hous'; within 'hose' it matches 'hos'.
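
To see exactly which characters are matched, a short Python sketch may
help. Python's re module is an assumption here; it stands in for the
regex tester.

```python
import re

# 'h', one arbitrary character, an optional second one, then 's'.
pattern = re.compile(r"h..?s")

words = ["hers", "hips", "his", "has", "house", "hose"]
found = {w: pattern.search(w).group(0) for w in words}
for word, text in found.items():
    print(word, "->", text)
```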

There is another way to define repetitions: "{x,y}" This is a way to
explicitly define how many repetitions of the preceding character you
want. In this formula 'x' denotes the minimum number and 'y' the
maximum number of repetitions of the preceding character. "\d{2,4}"
means that two to four digits in a row are matched.

If you omit the second number 'y' but leave the comma in the curly
brackets "{x,}", then there is no upper limit and the minimum is
x-times the preceding character. "\w{3,}" matches any string with at
least three word-characters.

If you omit not only the second number but the comma as well "{x}",
then this means the exact number of appearances of the preceding
character. "\d{6}" matches exactly six digits. This quantifier gives
us a new way to write our regex that matches European formatted dates:
"\d{2}\.\d{2}\.\d{4}"

The three quantifiers I introduced at the beginning of this chapter
are simply special ways to write one of the following regex:
{0,1} = ?
{1,} = +
{0,} = *
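
These equivalences can be checked with a short Python sketch. The
helper verdicts() and the sample strings are only for illustration, and
I assume Python's re module reads the curly brackets like TB does.

```python
import re

# Each pair: shorthand quantifier and its curly-bracket spelling.
pairs = [(r"ab?c", r"ab{0,1}c"), (r"ab+c", r"ab{1,}c"), (r"ab*c", r"ab{0,}c")]
samples = ["ac", "abc", "abbc", "abbbc"]

def verdicts(pattern):
    """Return, per sample, whether the whole sample matches the pattern."""
    return [bool(re.fullmatch(pattern, s)) for s in samples]

same = all(verdicts(short) == verdicts(long) for short, long in pairs)
print(same)

# The date regex in its long and its short spelling.
date = "26.05.2002"
long_ok = bool(re.fullmatch(r"\d\d\.\d\d\.\d\d\d\d", date))
short_ok = bool(re.fullmatch(r"\d{2}\.\d{2}\.\d{4}", date))
print(long_ok, short_ok)
```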

Before I can tell you more about quantifiers and what has to be kept
in mind when using them, I have to introduce parentheses (round
brackets) as a grouping device.


5.2 Grouping of Elements, Subpattern and Quantifiers again

Grouping of Elements 

In the chapter about alternatives, the parentheses crossed our way for
the first time. They were used as they are in maths: common parts of
the pattern are written outside the round brackets.

Now we will learn something new: we can use the parentheses to group
parts of the regex to be dealt with as a single element of the
pattern. A following quantifier is applied to the grouped part of the
regex. E.g.: "foo(bar)?" matches 'foo' and 'foobar'

Another example:
"Re\s*(\[\d+\])?:" There it is again, the reply counter in a subject
line. This time it looks already quite professional. First of all we
look for 'Re'. After any number of whitespaces (or none at all) digits
in square brackets may follow. This part is grouped. Finally there has
to be a colon.

Let's have a closer look at the regex: why is it defined in that way?

First the whitespaces: we don't know whether the author of the subject
line inadvertently added one or more spaces after the 'Re'. Even if he
did nothing and left the string untouched we want the Regex to match
the string. Well, I agree, there shouldn't be any space, but you never
know ... ;-) That's why we use "\s*" at this point.

Then the digits in square brackets: we allow any number of digits in
the square brackets by using the plus-sign as quantifier. But there
has to be at least one digit! Because there is no upper limit for this
character, the way to infinity is free <vbg>.

Finally the counter '[#]' itself: this part is grouped. This element
need not appear in the string to result in a successful match. That is
why we use the question mark.

The regex therefore will match:
'Re:'
'Re [1]:'
'Re[123]:'

It will not match 'Re[]:'. Something to think about and to try on your
own: what has to be changed so that the regex matches this one?

Ok, here is the solution: replace the '+'-sign in the square bracket
with a star: "Re\s*(\[\d*\])?:"
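
Both versions can be compared in a small Python sketch (Python's re
module is assumed to behave like the regex machine in TB here):

```python
import re

strict = re.compile(r"Re\s*(\[\d+\])?:")   # at least one digit in the brackets
relaxed = re.compile(r"Re\s*(\[\d*\])?:")  # the brackets may be empty

subjects = ["Re:", "Re [1]:", "Re[123]:", "Re[]:"]

strict_hits = [s for s in subjects if strict.fullmatch(s)]
relaxed_hits = [s for s in subjects if relaxed.fullmatch(s)]

print(strict_hits)
print(relaxed_hits)
```

Only the relaxed version accepts 'Re[]:'.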

Within 'Re [1]: [3]:' it matches 'Re [1]:'. It does not match the
second reply counter. Ok, if we want to find such awful subject lines
we have to work on our regex a bit more: it should match any number of
counters that may have colons and -you never know - that may or may
not be followed by spaces. Finally the last character has to be at
least one colon: "Re\s*(\[\d+\]:*\s*)*:+"
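
A quick Python sketch shows the difference between the two regex on
this subject line. As before it is only an illustration that assumes
Python's re module behaves like TB's regex machine.

```python
import re

single = re.compile(r"Re\s*(\[\d+\])?:")
multi = re.compile(r"Re\s*(\[\d+\]:*\s*)*:+")

subject = "Re [1]: [3]:"

single_match = single.search(subject).group(0)
multi_match = multi.search(subject).group(0)

print(repr(single_match))  # stops after the first counter
print(repr(multi_match))   # takes both counters
```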

Well, it is possible for a subject to begin like that although there
is only a small probability that it will really happen. I can envisage
many combinations of reply counters. The regex does not match all
of them. If you want to have the regex match other combinations, go
ahead, try it! Test it with a regex of your own making, but: there is
one major point you should keep in mind. There is no perfect Regex.
The more you try to improve the regex to match even more possibilities
and combinations of characters, the more complicated the result will
be. You will have to pay for this kind of perfectionism: either you
won't be able to read your regex anymore or the Regex will become
buggy whenever you make even the smallest change to it. It is easier
to live with some erroneous matches and to sort them out manually than
to create the perfect Regex. Jeffrey Friedl published a regex to match
email-addresses in "Mastering Regular Expressions": it is more than
6000 bytes. It was a good example of being too perfect, as he stated.

Ok, back to the job-in-hand: let's have another example of how to
group elements. We had a pattern to match European formatted dates:
"\d{2}\.\d{2}\.\d{4}" As you can see, the beginning "\d{2}\." is
repeated. Right, so this can be simplified: "(\d{2}\.){2}\d{4}" The
first part, now grouped in parentheses, has to appear twice. This is
for example '01.02.'. This is not an optimal version of the search
pattern: day and month numbers still have to be two digit numbers and
silly values for both are still allowed. But wait; you will get your
chance. Let us learn some more elements before you are given the job
of optimising the pattern in an exercise <g>.


Subpattern

Grouping with parentheses has another effect in regexian that is
widely used in a lot of regular expressions in TB. Characters that
were found due to a grouped pattern or element are stored in a
temporary variable for further use. Such a variable is known as a
subpattern (SubPatt in TB). We should have a look at an example to
help us understand that:

'[EMAIL PROTECTED]'

We use the regex "(\w+)\.(\w+)@.*". The first pair of parentheses
matches 'bill', the second one 'door'. These two are now stored in
subpattern 1 and subpattern 2 respectively.

Or:

"(\d+\.)(\d+\.)" When the string is '22.05.' then '22.' is stored in
subpattern 1 and '05.' in subpattern 2.

How do I find out which is the first subpattern? Well, in our simple
examples it is obvious: everything that is matched by the first pair
of round brackets goes to subpattern 1, the second pair returns
subpattern 2, etc. But what if the regex looks like:
"Re\s*(\[(\d+)\])*:" The part that is enclosed by the first opening
bracket and its corresponding closing bracket is stored in subpattern
1. The part that is enclosed by the second pair starting at the second
opening bracket is stored in subpattern 2. With 'Re [4]:' our example
would result in: Subpattern 1 = '[4]' Subpattern 2 = '4'

Important: each opening bracket creates a new variable or subpattern.
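
In Python the subpatterns are called groups and are numbered the same
way, by the position of the opening bracket. A minimal sketch, assuming
Python's re module as a stand-in for the regex tester:

```python
import re

m = re.search(r"Re\s*(\[(\d+)\])*:", "Re [4]:")

whole = m.group(0)   # the whole match
outer = m.group(1)   # first opening bracket: counter with square brackets
inner = m.group(2)   # second opening bracket: digits only

print(repr(whole), repr(outer), repr(inner))
```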

What does the regex-machine store in a subpattern when a quantifier is
applied on a grouped element? Example: "(\d{2}\.){2}\d{4}"

If the string is '23.05.2002' the first pattern is matched at '23.'.
Now the regex machine goes on to find the same pattern in the string a
second time. If successful the matched characters are stored in the
same subpattern. In other words: the second match overwrites the first
one. In our example the subpattern will show '05.'

The regex-tester shows the contents of each subpattern: with every
subpattern it will offer another tab panel. That one with '0' on it
shows the whole match, while that with '1' on it shows the match of
the first subpattern, etc.
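
The overwriting can be seen in a short Python sketch. Python's re
module is an assumption here; its group(0) corresponds to the tab '0'
of the regex tester and group(1) to the tab '1'.

```python
import re

m = re.search(r"(\d{2}\.){2}\d{4}", "23.05.2002")

whole = m.group(0)  # everything the regex matched
last = m.group(1)   # only the last repetition of the group survives

print(repr(whole), repr(last))
```

The '23.' of the first repetition is gone; the subpattern holds '05.'.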


And Quantifier again

Ok, now let's move on to some special behaviour relating to
quantifiers. Some of them have a 'human' peculiarity: they are greedy!
You don't believe that? Well, look at the following string <g>:

"The abbreviation 'ISP' stands for 'Internet Service Provider'."

We want a regex that finds the text that is enclosed by inverted
commas and stores it in a subpattern:

"(.*)'(.*)'.*"

Nothing difficult really: find everything that comes before an
inverted comma, then everything in between and finally everything that
follows...

And? Did you try it on the regex-tester? What is in subpattern 2?
"Internet Service Provider". Ooops, I expected "ISP" because it comes
first in the string. :-o The first group (.*) greedily matched as much
of the string as it could and left only the minimum that the rest of
the regex, including subpattern 2, needed for a successful match.
Furthermore, the last element ".*" in the regex allowed 'nothing' to
follow. Keep this in mind: this part leads to a successful match even
if nothing is to be matched. The star stands for as many appearances as
there are or none at all!
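
If you have no regex tester at hand, a Python sketch shows the same
greedy behaviour (assuming that Python's re module and TB agree on the
star, which I believe they do):

```python
import re

text = "The abbreviation 'ISP' stands for 'Internet Service Provider'."

m = re.search(r"(.*)'(.*)'.*", text)

first = m.group(1)   # the greedy first group
second = m.group(2)  # what is left for the second group

print(repr(first))
print(repr(second))
```

The first group swallows everything up to the last pair of inverted
commas, including 'ISP'.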

Ok, here's another example:
We want to extract as many parts of an email-address as possible.
We've already got a solution for the first part, the name; but that
wasn't a good one because it only allowed word characters. We have to
make this more generic. Let's take (.*) for the first part. The second
part is some text delimited by a dot. But this may appear more than
once before the @-sign ends the name section. The Regex should
therefore find the following examples of addresses:

'[EMAIL PROTECTED]'
'[EMAIL PROTECTED]'
'[EMAIL PROTECTED]'

So, the regex starts with "(.*)\.(.*)*@". After that any text may
follow, possibly delimited by more dots. We will ignore this for the
example and go for extracting only that text that comes last after the
last dot, so that the regex does not get too complicated. This should
be done with "(.*)\.(.*)"

"(.*)\.(.*)*@(.*)\.(.*)" 

What do we expect in the subpatterns when the string is
'[EMAIL PROTECTED]'?

Subpattern 1 = '12-34' ?
Subpattern 2 = '.abc' or '.def' or 'abc.def' ?
Subpattern 3 = 'mail' ?
Subpattern 4 = 'com' ?

Ask the regex-tester:

Subpattern 1 = '12-34.abc'
Subpattern 2 = 'def'
Subpattern 3 = 'mail'
Subpattern 4 = 'com'

Subpattern 1 contains almost all of the first part, subpattern 2
only the last three characters before the @. Of course, we expected
that, didn't we? We already know that the star is greedy: it stored as
many characters as it could into the first subpattern.

Caution: not only stars, I mean star-signs are mean and greedy <vbg>,
the plus-sign is as well! Don't forget that!

Let's take another string to test the regex:
'[EMAIL PROTECTED]'. Now the star in the third pair of parentheses
"(.*)" is greedy and 'eats' almost everything after the @ up to the
last dot, storing 'mail.test' and not 'mail'.

How can we avoid that? We are going to learn another meaning of the
question mark (Calm down, this is only the second one. There are many
more to come and you will eventually come to understand why a regex is
full of these funny question marks *g*): just add a question mark to
the greedy pattern and you make the pattern less greedy.

Let's do that. We add a ?-sign to the first pattern: 
"(.*?)\.(.*)@(.*)\.(.*)"

Subpattern 1= '12-34'
Subpattern 2= 'abc.def'
Subpattern 3= 'mail.test'
Subpattern 4='com'
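
Here is a Python sketch that compares the greedy and the ungreedy
version. Two assumptions: Python's re module stands in for TB's regex
machine, and the address is a made-up one that I pieced together from
the subpattern values listed above, because the test string itself is
not readable here.

```python
import re

# Hypothetical address, reconstructed from the listed subpatterns.
address = "12-34.abc.def@mail.test.com"

greedy = re.search(r"(.*)\.(.*)@(.*)\.(.*)", address)
lazy = re.search(r"(.*?)\.(.*)@(.*)\.(.*)", address)

greedy_groups = greedy.groups()
lazy_groups = lazy.groups()

print(greedy_groups)
print(lazy_groups)
```

Only the first subpattern changes its behaviour: with the question mark
it stops at the first dot instead of the last one before the @.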

For a better understanding I shall try to explain what the
regex-machine does. A greedy (.*) first grabs as much as it can and
then gives characters back one at a time until the rest of the regex
matches. With the question mark, (.*?) works the other way round: it
starts with as little as possible, that is with nothing at all, and
takes one more character only when the rest of the regex cannot match
from the current position.

I'm going to explain it using our example regex "(.*?)\.(.*)*" and the
string '12-34.abc.def'. The Regex machine first stores nothing in the
first subpattern and checks whether the next element, the dot, matches
at this position. No, it does not: the next character is '1'. So the
first subpattern takes the '1' and the machine checks again. Still no
hit, because a '2' follows. This goes on character by character until
the first subpattern holds '12-34'. Now the next character is the first
dot, the "\." matches, and the rest of the regex takes 'abc.def'. The
machine has found a successful match with the minimum of characters in
the first subpattern, so it stops here. That is why subpattern 1 is
'12-34' and not '12-34.abc'. In reality, the machine does some more
bookkeeping. But I reckon we've looked deep enough into the way it
works for now.

Back to our first example where we wanted to match text between
inverted commas. The regex was "(.*)'(.*)'.*" and the text "The
abbreviation 'ISP' stands for 'Internet Service Provider'." Let's
alter the Regex to "(.*?)'(.*?)'.*"

Both grouped elements need a question mark otherwise "ISP' stands for
'Internet Service Provider" would be stored in the second pattern. To
add a question mark in the second element alone wouldn't help very
much because the first (.*) remains greedy.
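
The three variants side by side, as a Python sketch (again assuming
that Python's re module treats the question mark like TB does):

```python
import re

text = "The abbreviation 'ISP' stands for 'Internet Service Provider'."

both_lazy = re.search(r"(.*?)'(.*?)'.*", text).group(2)
first_lazy = re.search(r"(.*?)'(.*)'.*", text).group(2)
second_lazy = re.search(r"(.*)'(.*?)'.*", text).group(2)

print(repr(both_lazy))    # question mark in both groups
print(repr(first_lazy))   # question mark in the first group only
print(repr(second_lazy))  # question mark in the second group only
```

Only the version with both question marks returns 'ISP' in the second
subpattern.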
 
5.3 Overview and Summary

This was a quite difficult section. Not only for you to read and
understand. No, it was even difficult to write and create the text,
from which I hope you got some idea. This section covers one of the
basic elements of regexian that you will need in every Regex.

The following elements were presented:

- Characters that repeat preceding characters are called quantifiers:
    + the preceding character must appear at least once
    ? the preceding character may appear once or never
    * the preceding character may appear any number of times or
      never

- There are quantifiers that let you define exact ranges for the
  frequency of the preceding character: {x,y} the preceding character
  has to appear at least x-times but not more than y-times. One may omit
  parts of the range: {x,} stands for at least x-times with no maximum.
  {x} means exactly x-times.

- Parentheses are used to group multiple character sequences into
  patterns so that we can apply quantifiers to them. "(ab)+" means that
  the combination of 'ab' has to appear at least once to be matched.

- Patterns in parentheses are stored in variables for further use.
  These variables are called subpatterns in TB. In the case of multiple
  parentheses where groups are grouped, the outer subpattern contains
  all inner subpatterns. Furthermore, the first opening round bracket
  creates the first subpattern, the second defines the second subpattern
  and so on.

- Quantifiers with no upper limit are greedy. + and * after a dot make
  the regex take in as much as it can while still leading to a
  successful match. (.*)(.*) will include the whole match in
  subpattern 1 and nothing in subpattern 2.

- A greedy pattern can be made ungreedy by adding a question mark to
  it: (.*?) It then starts with as few characters as possible and takes
  one more character at a time only until the rest of the regex can
  lead to a successful match.


Exercises

1: The last regex we created for searching European format dates was:
"(\d{2}\.){2}\d{4}" It wasn't perfect because it didn't allow single
digit days or months nor two digit years to be matched (D.M.YY or any
other combination). That's worth making into an exercise, isn't it?

2: You've got the solution for question 1? Ok, that solution is quite
interesting but now we can try to write an improved Regex for matching
European formatted dates. If possible we would like to allow only
combinations of digits that look like a terrestrial date. Well, we do
not want to exaggerate: it's ok if the Regex matches February, 29th
(29.02.) even if it isn't a leap year ;-). The only important points
are: it should be in the format DD.MM.YYYY or D.M.YY or any
combination and it should be restricted to dates that exist.

3: Imagine you receive bug-reports via an on-line system. The reports
are standardized and all have the same format (more or less). We need
a regex that extracts the more important information. The reports look
like:
Sender: [EMAIL PROTECTED]
Date: DD.MM.YYYY
Report-no.: xyz123

Please try to define a regex that extracts the following parts into
subpatterns: first name, last name, agency, date, report-no.

4: Write a regex that matches the time in the form hh:mm:ss. Make sure
that only valid combinations are returned.

Solutions

Problem 1:
"\d{1,2}\.\d{1,2}\.(\d{4}|\d{2})"
You created something else? Doesn't matter, it may be a correct
solution: there is often more than one way to do it!
"(\d?\d\.){2}(\d{4}|\d{2})" is in my opinion an elegant solution. A
not so good idea is something like "\d{2,4}" for matching the year: it
allows three digit years.
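
A Python sketch to compare the two solutions with the 'not so good'
idea. The pattern named loose is my own construction to show the
"\d{2,4}" problem, the sample dates are invented, and Python's re
module is assumed to behave like TB's regex machine.

```python
import re

solution_a = re.compile(r"\d{1,2}\.\d{1,2}\.(\d{4}|\d{2})")
solution_b = re.compile(r"(\d?\d\.){2}(\d{4}|\d{2})")
loose = re.compile(r"\d{1,2}\.\d{1,2}\.\d{2,4}")  # hypothetical variant

good_dates = ["26.05.2002", "6.5.02", "26.5.2002", "6.05.02"]
three_digit_year = "26.05.200"

a_accepts_all = all(solution_a.fullmatch(d) for d in good_dates)
b_accepts_all = all(solution_b.fullmatch(d) for d in good_dates)

a_rejects = solution_a.fullmatch(three_digit_year) is None
b_rejects = solution_b.fullmatch(three_digit_year) is None
loose_accepts = loose.fullmatch(three_digit_year) is not None

print(a_accepts_all, b_accepts_all, a_rejects, b_rejects, loose_accepts)
```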

Problem 2:
This is a bit tricky. In these cases I like to divide the problem into
smaller chunks. Which days are possible:
a) 01-09, the preceding zero could be missing.
b) 10-29, all months of a year have at least 29 days. Ok, there is one
   error we are allowed to make: February only has 29 days in leap years.
   We will assume this is ok, otherwise it might be almost impossible to
   create the Regex.
c) 30, all months except February.
d) 31, only January, March, May, July, August, October, December.

Possible numbers for months are 01-09 (the preceding zero might be
missing) and 10, 11, 12. We want to allow two or four digit years. In
case of four digit years we only accept those that start with 19xx or
20xx.

Ok, now we have what we need. Let's start:

Case a) and b) combined with the allowed months gives us:
"(0?[1-9]|[12][0-9])\.(0?[1-9]|1[0-2])\."

Case c) with all possible months:
"30\.((0?[13-9])|(1[0-2]))\."

And finally case d) with possible months: 
"31\.(0?[13578]|1[02])\."

Now the years:
"(\d{2}|(19|20)\d{2})"

The first three parts have to be alternatives whereas the pattern for
years is mandatory. To keep the Regex from matching inside a longer
sequence of digits and finding something that only looks like a date,
we wrap the whole Regex in \b metacharacters. That should give

"\b(((0?[1-9]|[12][0-9])\.(0?[1-9]|1[0-2])\.)|(30\.((0?[13-9])|(1[0-2]))\.)|(31\.(0?[13578]|1[02])\.))(\d{2}|(19|20)\d{2})\b"

[Note: the regex is wrapped due to layout reasons. All must be used as
a single long line!]
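
To test the whole thing without typing the long line by hand, the regex
can be assembled from its parts. This is a Python sketch: the variable
names and the sample dates are mine, and I assume Python's re module
handles \b and the alternatives like TB's regex machine.

```python
import re

# The building blocks from the text.
days_1_to_29 = r"((0?[1-9]|[12][0-9])\.(0?[1-9]|1[0-2])\.)"
day_30 = r"(30\.((0?[13-9])|(1[0-2]))\.)"
day_31 = r"(31\.(0?[13578]|1[02])\.)"
year = r"(\d{2}|(19|20)\d{2})"

# Three alternatives for day and month, then the mandatory year.
date_regex = re.compile(
    r"\b(" + days_1_to_29 + "|" + day_30 + "|" + day_31 + ")" + year + r"\b"
)

valid = ["26.05.2002", "1.1.99", "29.02.2001", "30.11.1999", "31.12.02"]
invalid = ["30.02.2002", "31.04.2002", "32.01.2002", "15.13.2002", "26.05.1802"]

accepted = [d for d in valid if date_regex.fullmatch(d)]
rejected = [d for d in invalid if date_regex.search(d) is None]

print(accepted)
print(rejected)
```

Note that '29.02.2001' is accepted although 2001 was no leap year. That
is the one error we allowed ourselves.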

Incredible: that's a cracker! You found something different? Even
something better? Well, I think that is 'normal'. You can always write
a regex in another way to give the same result. And of course: you can
improve almost every Regex. My Regex only shows one way to approach
the problem: the way I like to do it. I hope you were able to follow
my thinking.

Problem 3:
This is not very difficult. Again, we divide the whole problem into
chunks:
First name and last name can be extracted from the mail-address.
"Sender:\s*(.*?)\.(.*?)@(.*?)\.\w+\s*" should be sufficient. The
question mark in the second subpattern might be redundant because the
@-character follows anyway. But it won't hurt anyone, will it?

Date: phew, we are in luck. The format is mandatory. We don't have to
use the killer regex of problem 2 ;-):
"Date:\s*((\d{1,2}\.){2}\d{4})\s*"

And now the report number (the dot after 'no' is escaped so that it
only matches a real dot):
"Report-no\.:\s*(.*)"

To make sure that the regex checks the whole string we add \A at the
beginning and \Z at the end.

"\ASender:\s*(.*?)\.(.*?)@(.*?)\.\w+\s*Date:\s*((\d{1,2}\.){2}\d{4})\s*Report-no\.:\s*(.*)\Z"

[Note: the regex is wrapped due to layout reasons. All must be used as
a single long line!]

Subpatterns 1, 2, 3, 4 and 6 will contain the information we wanted.
Subpattern 5 only holds the last repetition of the inner group of the
date.
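
A Python sketch of this solution. The sample report is invented (the
sender 'jane.doe@agency.example' is purely hypothetical), the dot after
'no' is escaped so that it only matches a literal dot, and Python's re
module is assumed to behave like TB's regex machine.

```python
import re

report_regex = re.compile(
    r"\ASender:\s*(.*?)\.(.*?)@(.*?)\.\w+\s*"
    r"Date:\s*((\d{1,2}\.){2}\d{4})\s*"
    r"Report-no\.:\s*(.*)\Z"
)

# Hypothetical bug report in the format described in the exercise.
report = (
    "Sender: jane.doe@agency.example\n"
    "Date: 26.05.2002\n"
    "Report-no.: xyz123"
)

m = report_regex.search(report)

# Subpatterns 1, 2, 3, 4 and 6 carry the wanted information.
first_name, last_name, agency = m.group(1), m.group(2), m.group(3)
date, report_no = m.group(4), m.group(6)

print(first_name, last_name, agency, date, report_no)
print(repr(m.group(5)))  # last repetition of the inner date group
```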


Problem 4:
I think we have already had some practice at dividing bigger problems
into smaller ones. The time-problem is another one. It should be mere
routine now. And, it is much easier than it looks at first sight,
because the format is fixed!

Hours are from 00 to 19 and 20 to 23 (24 equals 00!!):
"([01][0-9]|2[0-3])"

Minutes and seconds have the same format and the same combinations of
digits, 00 to 59, each preceded by a colon:
"(:[0-5][0-9]){2}"

Altogether, enclosed by word boundary (\b) metacharacters:
"\b([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]\b"

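The time regex can be tried out with a few lines of Python. The sample
times are mine, and as always this sketch assumes that Python's re
module behaves like the regex machine in TB.

```python
import re

time_regex = re.compile(r"\b([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]\b")

valid = ["00:00:00", "19:59:59", "23:00:01"]
invalid = ["24:00:00", "12:60:00", "12:00:60", "7:05:09"]

accepted = [t for t in valid if time_regex.fullmatch(t)]
rejected = [t for t in invalid if time_regex.search(t) is None]

print(accepted)
print(rejected)
```

'7:05:09' is rejected because the format hh:mm:ss demands two digits
for the hour.
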
===END===


Hope you enjoyed it! :-)

CU in a few days!

-- 
Best regards,
 Gerd 
======================================
Tutorial for using PGP with TheBat! www.pro-privacy.de
----------------------------------------------------------------------------
Fifty percent of people have a below-average understanding of statistics.
----------------------------------------------------------------------------
now playing: WDR2 :-)


________________________________________________________
Current Ver: 1.60m
FAQ        : http://faq.thebat.dutaint.com 
Unsubscribe: mailto:[EMAIL PROTECTED]
Archives   : http://tbudl.thebat.dutaint.com
Moderators : mailto:[EMAIL PROTECTED]
TBTech List: mailto:[EMAIL PROTECTED]
Bug Reports: https://bt.ritlabs.com
