[regex-tutorial]: Part 1

Gerd Ewald Mon, 13 May 2002 12:37:25 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Bats(wo)men,


Some days ago Daniel Grunberg asked for an English version of a
tutorial on regular expressions (TBTECH,
<mid:[EMAIL PROTECTED]>) which I published on
www.pro-privacy.de for the German beginners list.

The first thing I did was to mail Marck to find out whether there is
some interest in a translation.

Well, here it is. At least the first part. Marck checked the text and
transformed it into something you can read. Thank you, Marck! (My
translation was something between the following text and a translation
altavista did, hehe).

The whole tutorial will be divided into five parts. It will take some
time to prepare each part, so you will have to wait one or two weeks
for the next one to be published. Sorry! Anyway, we decided to
publish it in parts, so you can start learning regexian and have a
chance to ask questions for better understanding.

Each part is posted to TBUDL using a special subject ("[regex
tutorial]") so that those of you who don't want to read it may define
a filter to kill the mail. Please use the same prefix in your subject
for any reply.

The tutorial is published on www.pro-privacy.de (look there for
"special") and on Marck's official FAQ at
http://www.silverstones.com/thebat/FAQ.html


Ok, that's it. Let's start. I hope you will enjoy the tutorial :-)


========START OF PART 1 ==============

1. Introduction

Whenever I came across something interesting in a mail that was
created with TheBat! like "cleaned" Subject-strings or automagically
deleted PGP-lines, I would ask in one of the mailing lists: "How did
you do that?". Quite often I would receive a reply like "You will need
a regex for that!" And sometimes the result was something like:

%QUOTES="%SETPATTREGEXP=""(?is)(-----BEGIN PGP
SIGNED.*?\n(Hash:.*?\n)?\s*)?(.*?)(^(- --|--\n|-----BEGIN PGP
SIGNATURE)|\z)""%REGEXPBLINDMATCH=""%text""%SUBPATT=""3"""

This is only a simple example of those cryptic looking combinations of
TB!-Macros and regular expressions which are simply called "regex" by
the TB-experts. To me it seemed a random sequence of characters; as if
a cat walked across my keyboard. Awkward, arbitrary and cryptic, that
at least was my impression until Januk Aggarwal (special thanks to
him) gave me a short introduction to regex at TBTECH and my workmate
Alfred Rübartsch gave me a copy of Jeffrey Friedl's excellent book
"Mastering Regular Expressions".

Although I entered the fascinating world of Regular Expressions with
the help of these two, I am still not an expert in the "regexian"
language. Anyway, as an advanced beginner, I have dared to write this
tutorial to hopefully explain some things and give a good start in
"Regular Expressions" to other beginners.

This tutorial is meant to bring you into closer contact with the regex
topic. Well, let's see how it works; let's see whether we will be able
to explain the "regex"-example above by the time we come to the end of
this tutorial.


2. Regular Expressions

2.1. What does "Regular Expression" mean?

Regex are not only used in TB! You can find them in quite a lot of
different UNIX-tools (e.g. grep), in some programming languages like
PERL (Practical Extraction and Report Language, sometimes called
'Pathologically Eclectic Rubbish Lister' <bg>) and even my editor
UltraEdit uses them.

Laura Lemay wrote in her book "PERL in 21 days" that the term "Regular
Expression" makes no sense at first sight (to be honest: even at
second sight it still makes no sense to me), because these are not
real expressions and furthermore no one really can explain why they
are "regular"! Well, let's ignore this; let's simply accept that the
term "Regular Expression" has its origin in formal algebra and that
they are indeed part of Mathematics.

The easiest and most convenient way to define "Regular Expression" is
to say: "They are search patterns to match characters in strings."

Those of you who have tried to find files using the DOS command line
or the search function in the Explorer may have used patterns like:

dir *.doc
copy *.??t c:\temp

These examples show patterns that consist of letters, stars, question
marks and other characters to define which files should be listed or
copied. In the first example only files that have the suffix "doc"
should be listed. In the second example only files that have a
three-letter suffix and a "t" as last character in the suffix should
be copied.

But these patterns are merely wildcards! They are in no way as mighty
as "Regular Expressions". One can't compare them to real regex, which
offer much more than wildcards for characters.
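For comparison, here is a small sketch of how such wildcards behave.
It uses Python and its fnmatch module purely for illustration; neither
is part of TB!, the file names are invented, and the DOS command line
treats a few corner cases differently from fnmatch.

```python
import fnmatch

# Hypothetical file names, invented for this illustration.
files = ["letter.doc", "notes.txt", "setup.bat", "readme.text", "report.doc"]

# "*.doc": any name that ends in ".doc"
doc_files = fnmatch.filter(files, "*.doc")

# "*.??t": any name ending in a dot, two arbitrary characters and a "t"
t_files = fnmatch.filter(files, "*.??t")

print(doc_files)
print(t_files)
```

A star stands for "anything" and a question mark for "any one
character", and that is about all a wildcard can express.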


3. Simple Patterns

To explain some regular expressions and to understand the examples
given in this tutorial we have to define how the regex will appear. I
will enclose the regular expression in quotation marks ("). If you
want to test the regex you will have to copy the part between the
"-characters. Testing regular expressions? Yes, sure, this is
possible.

You have to download a DLL written by Dirk Heiser

(http://www.Dirk-Heiser.de/RegExTest/RegExTest_V0.3beta.zip)

and copy it into your TB-directory. Then, when you open the TB-help,
you will find a tabfolder called RegEx. Or, if you are using the
CHM-Version of the help (this probably applies only to the German
version), you can use this DLL by creating a link on your desktop
which opens the DLL:

"%windir%\system32\rundll32.exe <your_path>regextest.dll, Run"

Please, I really recommend that you download this utility. It will
make it so much easier to follow the tutorial.


3.1 Simple Known Characters

Ok, let's start with simple search patterns: "give or take"

Yes, you won't believe it, this is already a regular expression: it
matches the string 'give or take' in a text. Exactly these characters!
And no, this does not mean that this pattern matches either 'give' or
'take'. The regular expression only matches if the characters in
quotation marks appear somewhere in the text!

Regular expressions are stubborn and stupid: they will look for
exactly what they are told to search for. They are case sensitive and
they are not interested in word boundaries unless told to be so. For
example, our first regex will find the characters in the following
string: 'You have to forgive or take the consequences!'
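If you have no regex tester at hand, the same behaviour can be tried
in a few lines of Python. This is only a sketch: it uses Python's re
module, which is an assumption of this example and not the engine
built into TB!, although for a plain pattern like this one both should
behave the same way.

```python
import re

pattern = "give or take"
text = "You have to forgive or take the consequences!"

# re.search looks for the pattern anywhere in the text.
match = re.search(pattern, text)
found = match.group(0) if match else None
print(found)  # the characters matched inside 'forgive or take'

# Regular expressions are case sensitive unless told otherwise.
upper_match = re.search(pattern, "GIVE OR TAKE")
print(upper_match)  # None: no match
```

Note that the match starts in the middle of the word 'forgive'; the
regex does not care about word boundaries.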


3.2 Search Patterns for Metacharacters

Regular expressions can search for any character - alphanumeric,
hexadecimal, binary numbers, etc.

A small but important exception are those characters that have a
special meaning in regular expressions, the metacharacters.

Metacharacters are:

* + ? . ( ) [ ] { } \ / | ^ $

(Hi experts: Yes, you are right! I stretched the truth!! These are not
actually all metacharacters. But trust me, just assume that I am right
for now. We will see later why I prefer to define the above as
metacharacters).

I will explain these metacharacters later in the tutorial, step by
step, as many of them as necessary. Just one thing for now. If you
want to search for those characters as they stand you have to tell the
regex that you want to do so. The regular expression has to be told
that you don't mean to use a metacharacter but want to search for it
literally. So you have to "escape" or "mask" the character with
another character (which of course is a metacharacter in itself <g>):
it is the backslash "\"

If you want to match a question mark the regex has to be "\?". If it
is a slash you're after, you have to enter "\/". And, although it
looks odd, if you want to find a backslash you need to type two of
them: "\\"
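The three escaped patterns above can be tried out with a short sketch.
It assumes Python's re module rather than TB!'s own engine, and the
sample text is invented. The r"..." notation is a Python detail: it
stops Python itself from interpreting the backslashes before the regex
sees them.

```python
import re

# Invented sample text containing a backslash, a slash and a question mark.
text = r"Is C:\temp/files the right path?"

question = re.search(r"\?", text)    # escaped question mark
slash = re.search(r"\/", text)       # escaped slash
backslash = re.search(r"\\", text)   # escaped backslash

print(question.group(0), slash.group(0), backslash.group(0))

# An unescaped "?" on its own is not a valid pattern in this engine,
# because as a metacharacter it needs something in front of it.
try:
    re.compile("?")
    unescaped_ok = True
except re.error:
    unescaped_ok = False
print(unescaped_ok)
```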


3.3 Simple Unknown Characters

The first metacharacter we are going to learn is the dot ".". It
represents exactly one unknown character we want to match, no matter
what this character might be. (Hello experts: let's come to exceptions
later. Ok?)

"M.ller" will match 'Miller', 'Meller' or 'Millerton' but not 'Milton
Keynes'. "h..s" matches 'hips' or 'hers'. And within the word 'house',
the same regex will match 'hous'.

Later we will learn about some more metacharacters; ones that will
allow us to look for more than one unknown character without repeating
the dot over and over.
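The dot examples from this section can be checked with the following
sketch. Again it assumes Python's re module instead of TB!'s engine;
for these simple patterns the results should not differ.

```python
import re

names = ["Miller", "Meller", "Millerton", "Milton Keynes"]

# True where "M.ller" is found somewhere in the name.
dot_matches = {name: bool(re.search(r"M.ller", name)) for name in names}
print(dot_matches)

# Each dot stands for exactly one character.
house = re.search(r"h..s", "house")
print(house.group(0))  # 'hous'
```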


3.4 Groups of Characters and Character-Classes

Some metacharacters define groups of characters, making a very
powerful tool. There is a wide variety of these groups. Let's start
with the easy ones:

"\d" symbolizes a digit. "\d\d" searches for any sequence of two
digits.

"\w" stands for any letter, any digit or the underscore character.
This group is called 'alphanumeric characters'.

With what we already know we can create our first more complicated
looking regex:

"Re \[\d\]:" searches for the string 'Re' followed by a space, an
opening square bracket, any digit, a closing square bracket and
finally a colon in a text. Ooops, that looks like a Subject-line which
was created by someone who forgot the %SINGLERE in his reply template
;-)

There are -of course- metacharacters which have the opposite meaning:
"\W" and "\D"

\W stands for any non-alphanumeric character and \D means any
character that is not a digit.
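Here is a small sketch of these four groups in action. It assumes
Python's re module, not TB!'s engine. One difference is worth knowing:
in Python "\w" and "\d" also accept non-English letters and digits
unless the re.ASCII flag is given, so the flag is used here to stay
close to the simple definitions above. The sample strings are
invented.

```python
import re

# "\d" matches exactly one digit, so 'Re [2]:' fits but 'Re [12]:' does not.
one_digit = re.search(r"Re \[\d\]:", "Re [2]: regex tutorial")
two_digit = re.search(r"Re \[\d\]:", "Re [12]: regex tutorial")
print(one_digit.group(0), two_digit)

# "\d\d" finds the first run of two digits.
day = re.search(r"\d\d", "Mon, 13 May 2002")
print(day.group(0))

sample = "a_1 -!"
word_chars = re.findall(r"\w", sample, flags=re.ASCII)  # letters, digits, "_"
non_word = re.findall(r"\W", sample, flags=re.ASCII)    # everything else
non_digits = re.findall(r"\D", "4x4", flags=re.ASCII)   # anything but digits
print(word_chars, non_word, non_digits)
```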

Another elegant method to define your own group of characters is to
use the square brackets [ ], which define 'character classes'. With
square brackets, the regex will search for exactly one character, no
matter how many characters are in between these brackets: "[AEX]".
This class matches exactly one character, which must be one of
A, E or X.

You may even define ranges of characters. You don't have to type in
every character of the range, no; regexian makes it easy for you: just
enter the first character of the range, a hyphen "-" and the last
letter: "[e-z]" means that all letters from e to z should be matched.
"[AEXe-z]" is a combination of both: a one-letter string with one of
A,E,X or any letter within the range e to z.
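To see that a class always stands for a single character, the
following sketch lists every individual match in some invented sample
strings. It assumes Python's re module, not TB!'s engine.

```python
import re

# findall returns each single character that the class matches.
one_of = re.findall(r"[AEX]", "AXE exam")    # only the capitals A, X, E
in_range = re.findall(r"[e-z]", "Bad dog")   # lower-case letters e to z
combined = re.findall(r"[AEXe-z]", "Exact")  # both rules in one class

print(one_of)
print(in_range)
print(combined)
```

The lower-case 'e' and 'x' in 'exam' are not found by "[AEX]": classes
are case sensitive too.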

This is a powerful tool in regexian: "[0-1][0-9]\/[0-3][0-9]\/" will
match only a MM/DD/ formatted date. Other combinations which are not a
date (e.g. 35/47/) won't be found. (Yeah, you're right! My regex will
match 19/39/ which isn't a terrestrial date at all. We will get this
one later once we have learned some more elements....)
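The date pattern and its weakness can be tried with this sketch, which
assumes Python's re module rather than TB!'s engine. The sample
strings are invented.

```python
import re

date_re = re.compile(r"[0-1][0-9]\/[0-3][0-9]\/")

real_date = date_re.search("posted on 05/13/2002")
not_a_date = date_re.search("35/47/")
fake_date = date_re.search("19/39/")

print(real_date.group(0))  # '05/13/'
print(not_a_date)          # None: 3 is not in [0-1], 4 is not in [0-3]
print(fake_date.group(0))  # '19/39/' slips through
```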

You can negate character classes with one keystroke. Just add a "^"
after the opening square bracket and that's it. 'Find any character as
long as it isn't 1,2,3 or 4!' in regexian is: "[^1-4]". Oh, we should
remember this one for later. This funny '^' character has a totally
different meaning when not in square brackets!
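A negated class can be tested in the same way. This is again only a
sketch using Python's re module, not TB!'s engine.

```python
import re

# Every character of the sample that is NOT 1, 2, 3 or 4.
not_1_to_4 = re.findall(r"[^1-4]", "012345")
print(not_1_to_4)  # ['0', '5']
```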


3.5 Overview and Summary

What did we learn in this chapter?

· Regular expressions search for any character. "er" looks for
  exactly these letters in that order. All regex are case sensitive
  unless told not to be so.

· Regexes use characters with a special meaning, the metacharacters:
  * + ? . ( ) [ ] { } \ / | ^ $
  To find them literally they must be escaped with a preceding
  backslash.

· A dot "." is used to match a single unknown character. It is a
  metacharacter.

· There are metacharacters which symbolize groups of characters:
    \d for digits ([0-9])
    \D for non-digits ([^0-9])
    \w for alphanumeric characters ([a-zA-Z0-9_])
    \W for non-alphanumeric characters ([^a-zA-Z0-9_])

· It is possible to define your own set of character-classes by using
  square brackets e.g. "[A-Z]". A ^ as first character in the square
  bracket negates the class.


What does each regex match?
"\d\d\.\d\d\.\d\d\d\d"
"\w\w\w, \d\d \w\w\w \d\d\d\d"
".. \[[0-9]\]:"
"[a-zA-Z]"

First example: first it will match two digits. Next comes the
backslash and a dot. That means the dot is escaped and is no
metacharacter. So the two digits have to be followed by a dot. Again
two digits and a dot. And finally four digits! This is the European
format of a date: DD.MM.YYYY

In the second example the regex searches for three alphanumeric
characters followed by a comma, a space, two digits, another space.
Next come three alphanumeric characters, a space and finally four
digits. Phew, what could this be? Well, it looks like a format for
dates again, but this time in an Anglo-American format: Tue, 19 Feb
2002. Well, like the first example, this regex is not perfect. It only
matches dates with two-digit days. We will see later how we can modify
the regex to find one- or two-digit days.

Third pattern: the regex looks for any two characters and a space. The
next character is a literal opening square bracket, because it is
escaped by a backslash. Then follows a square bracket which isn't
escaped by a preceding backslash: this opens a character class! Any
digit in the range 0 to 9 will match here. Then come an escaped
closing square bracket and a colon. This combination would
match 'Re [2]:'

In the last example the regex looks for only one character. Any letter
is allowed, even capital letters. Why isn't "\w" used? Well, that
would include the underscore and perhaps the author doesn't want to
match that character ;-)
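If you want to check the four answers yourself, here is a sketch that
runs each pattern against an invented sample string. It assumes
Python's re module with the re.ASCII flag, not TB!'s engine; for these
patterns the results should be the same.

```python
import re

def first_match(pattern, text):
    """Return the first piece of text the pattern matches, or None."""
    m = re.search(pattern, text, flags=re.ASCII)
    return m.group(0) if m else None

european = first_match(r"\d\d\.\d\d\.\d\d\d\d", "Sent on 13.05.2002")
anglo = first_match(r"\w\w\w, \d\d \w\w\w \d\d\d\d", "Date: Tue, 19 Feb 2002")
one_digit_day = first_match(r"\w\w\w, \d\d \w\w\w \d\d\d\d", "Date: Tue, 5 Feb 2002")
subject = first_match(r".. \[[0-9]\]:", "Re [2]: hello")
letter = first_match(r"[a-zA-Z]", "_9x")

print(european)       # '13.05.2002'
print(anglo)          # 'Tue, 19 Feb 2002'
print(one_digit_day)  # None: the pattern insists on two digits for the day
print(subject)        # 'Re [2]:'
print(letter)         # 'x': underscore and digit are skipped
```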

=============END OF PART 1 =====================


- --
Best regards,
 Gerd
======================================
Tutorial for using PGP with TheBat! www.pro-privacy.de
- ----------------------------------------------------------------------------
Cats teach that not everything in nature has a function.
- ----------------------------------------------------------------------------
now playing: WDR2 :-)

-----BEGIN PGP SIGNATURE-----
Version: PGP 6.5.8 ckt
Comment: Key-ID: 0x0FB66C7D

iQA/AwUBPOAH6H70/g0Ptmx9EQL3VgCeOdkNt9ON4MoffdGH180ZLFYyd7AAnif5
3TSRyuF22rsXiyjIX8ar6bUz
=VfbS
-----END PGP SIGNATURE-----


