[regex-tutorial]: Part 5

Gerd Ewald Mon, 22 Jul 2002 12:57:36 -0700

Hi Batsmen,

here it is: the last part of the regex tutorial. It took some time but
finally we (Marck and I) made it. Again: many thanks to Marck who made
it possible to publish this tutorial in English.


You will find this part online at
http://www.pro-privacy.de/regex/en/part1.htm
in a few days as well as a PDF-version for download.

Have fun :-))

===Start===
7. How to use Regular Expressions in TB

Finally, we can try to use our new language in TB. First of all we
have to know which tools are available to work with regular
expressions. These tools are TB's macros.

7.1 Macros

Not all of TB's macros support the use of regex. Most of the macros
have nothing to do with regex, but you can use regex on them to
extract or modify the information. And that is one feature of TB that
makes it so powerful.

The first macro we will look at is %REGEXPTEXT="regex". What does it
do? It searches for the pattern "regex" within the original text of a
mail and returns the matched characters. The syntax is quite
straightforward; look at the following example:

 %REGEXPTEXT="[\d\.]+"
This macro used in a quick template and applied to a mail returns
digits and dots.
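
If you want to try the pattern outside TB first, a minimal Python
sketch follows. It assumes that Python's re module treats this simple
pattern the same way TB does and that the macro hands back the first
match only. The helper name regexp_text and the sample mail text are
made up for illustration.

```python
import re

def regexp_text(pattern: str, mail_text: str) -> str:
    """Hypothetical stand-in for %REGEXPTEXT: first match, or '' if none."""
    match = re.search(pattern, mail_text)
    return match.group(0) if match else ""

# Made-up mail text for illustration.
sample_mail = "Please upgrade to version 1.61 as soon as possible."
result = regexp_text(r"[\d\.]+", sample_mail)
print(result)
```

The character class matches any run of digits and dots, so a version
number or a dotted date is picked up as one piece.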

Let's have a look at a fairly similar macro: %REGEXPQUOTES="regex"

This macro does exactly the same as the first one except that the
returned text is not plain text but quoted text.

That was nice and easy. But when it comes to the extraction of text
from the header of a mail (kludges) or address book entries we need to
combine some macros:

The first one we will need for that is %SETPATTREGEXP. It is used to
define the search pattern in the way %SETPATTREGEXP="regex". "regex"
is the regular expression you created to match the text.

The second one is %REGEXPMATCH. Again, this is easily defined:
%REGEXPMATCH="string" with "string" being any text. It can be a
template, which means that any generic text can be used, so almost any
TB macro can be used to provide the text here.

The definition of a regex through %SETPATTREGEXP remains valid until
it is overwritten by a later %SETPATTREGEXP. This means you can use
the same pattern on several different generic texts in one go.

Before we have a look at another example I have to correct something.
Did I say the syntax is quite easy earlier in this chapter? Well,
that's true as long as one only looks at one macro. But let's see how
this changes when we let the macro parse some text:

We already know the macro %REGEXPQUOTES. This could be written in a
different way. Let's assume that we receive Mails from a feedback
form. Part of the content is "newsletter: yes" or "newsletter: no". We
would like to create an autoresponder that uses exactly this
information in a reply template, for example: "Thank you for filling
out our feedback form. You entered 'newsletter: yes/no'. Are you
sure?" You can create more sophisticated text and a better filter to
use different templates for the reply, but for the moment let's stick
to this example;-).

The macro %QUOTES defines what text is to be used as quoted text in a
reply. The only problem is that we have to tell %QUOTES which text
should be used. After that we can copy it to the reply template, add
our standard text and save it.

Ok, first the regex: "^newsletter:\s*(yes|no)". This has to be defined
by %SETPATTREGEXP="^newsletter:\s*(yes|no)". We already know that
%REGEXPMATCH applies the search pattern on any generic text, so we
need a macro that provides the original text of the mail and that is
%TEXT. Now we have to put it all together and create a template that
uses the macros in the correct order.

The only difficulty in using these macros lies in the
"-characters, which serve as delimiters for the definition part. In
%SETPATTREGEXP the search pattern is defined between these and in
%QUOTES the text that will be inserted as quoted is defined. Once you
start to combine the macros you have to tell TB which "-character is
delimiter of which macro: the first macro must know whether the second
"-character is the end of the macro or the beginning of the second
macro. The same applies at the end of the second macro and so on. This
can be achieved by doubling the "-character (escaping) or using
different delimiters.

Simply, this looks like:

%M1="%M2=""Def2""%M3=""Def3""". This is getting a bit confusing and
hard to follow, so we could instead say:

%M1="%M2='Def2'%M3='Def3'". The example above would look like:

%QUOTES="%SETPATTREGEXP='^newsletter:\s*(yes|no)'%REGEXPMATCH='%TEXT'"

This example could be written in a simpler way:
%REGEXPQUOTES="^newsletter:\s*(yes|no)", but only because the text we
search here is the original mail text, which %TEXT provides.
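
To see what the newsletter pattern picks out of a feedback mail, here
is a small Python sketch. It is not TB code: the function regexp_match
is a hypothetical stand-in for the %SETPATTREGEXP/%REGEXPMATCH pair,
the mail text is invented, and the re.MULTILINE flag is my assumption
so that "^" matches at the start of every line of the body.

```python
import re

# Assumption: "^" should match at each line start, hence re.MULTILINE.
NEWSLETTER = re.compile(r"^newsletter:\s*(yes|no)", re.MULTILINE)

def regexp_match(text: str) -> str:
    """Hypothetical stand-in for %REGEXPMATCH: return the whole match."""
    match = NEWSLETTER.search(text)
    return match.group(0) if match else ""

# Made-up feedback mail for illustration.
feedback_mail = "name: Jane Doe\nnewsletter: yes\ncomment: nice tutorial\n"
quoted = regexp_match(feedback_mail)
print(quoted)
```

Note that the whole match is returned, not just the subpattern in
parentheses; picking out a subpattern needs a different macro
combination.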

Next comes a macro combination that allows the extraction of several
parts of the text. We know that we could define subpatterns in the
regex by grouping sections with parentheses. We must now find a way to
address them within TB.

TB provides a macro for this: %REGEXPBLINDMATCH="string". On its own
it does not return anything; it only applies the regex to the text.
After all, we want to extract parts of the text, not the whole match.
So we still need a macro that tells TB which of the subpatterns to
return. This is %SUBPATT="n", where 'n' denotes the n-th subpattern in
the regex.

Now this combination will be quite difficult to read and understand.
So I will explain it using an example and will generate the whole
macro combination bit by bit. After that I will combine everything.

From the original date of a mail we want to extract the year, two
digits only, and use it as quoted text. The date is provided by
%ODATE. The regex is "\d{2}(\d{2})\b". That means we want to extract
two digits only if they are preceded by two digits and followed by a
word boundary. Thus the first macro is: %SETPATTREGEXP="\d{2}(\d{2})\b".

The text that is used to find the date is defined using the macro
%REGEXPBLINDMATCH="%ODATE". We are looking for the first subpattern,
so %SUBPATT="1".

Now we put it all together, not forgetting to use the alternate
'-characters:

%QUOTES="%SETPATTREGEXP='\d{2}(\d{2})\b'%-
%REGEXPBLINDMATCH='%ODATE'%SUBPATT='1'"
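
The effect of the subpattern can be checked with a short Python
sketch. The function name two_digit_year and the date string are
invented; the real format of %ODATE depends on your TB settings, so
take the input as an assumption.

```python
import re

def two_digit_year(date_text: str) -> str:
    """Return subpattern 1 of \\d{2}(\\d{2})\\b, or '' if nothing matches."""
    match = re.search(r"\d{2}(\d{2})\b", date_text)
    return match.group(1) if match else ""

# Made-up date string; the day "22" is skipped because no two further
# digits follow it, so the first match is found inside "2002".
year = two_digit_year("Monday, 22 July 2002")
print(year)
```

group(1) plays the role of %SUBPATT='1' here: the first two digits are
needed for the match but are not part of the returned text.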

[Note: the regex is split using the %- macro and can be entered as two lines!]

Another example? There is a regex for reply templates that modifies
the name of the recipient. Instead of 'Gerd Ewald' we would like to
have 'Gerd Ewald at TBUDL…..' Well, we could download this regex
somewhere, but let us try to create it ourselves.

%OFROMNAME will give us the name.

The reply address is given by %OREPLYADDR. We will extract the list's
name with a regex. Usually the name of the list precedes the
@-character: %SETPATTREGEXP="(.*?)\@"

This is used in combination with %REGEXPBLINDMATCH="%OREPLYADDR" of
which we only want subpattern one : %SUBPATT="1"

The result is then the contents of the TO-field. Watch out, before you
can enter text this field has to be cleared. This is done by an
initial assignment which is void.

%TO=""%TO='"%OFROMNAME at %-
%SETPATTREGEXP=_(.*?)\@_%-
%REGEXPBLINDMATCH=_%OREPLYADDR_%-
%SUBPATT=_1_" <%OREPLYADDR>'

[Note 1: the regex is split using the %- macro and can be entered as
seen!]
[Note 2: the regex makes use of a feature of recent versions of TB
where any character may be used as a quoting delimiter, in this case
the underscore and single quote as well as double quote. Users of
earlier versions will have to resort to using the clumsier double
delimiter syntax]

The original reply address has to be added enclosed in "<>"-characters
at the end.
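
What the template assembles can be sketched in Python as follows.
This is only an illustration under assumptions: build_to_field is a
hypothetical helper, and the list address is invented.

```python
import re

def build_to_field(from_name: str, reply_addr: str) -> str:
    """Build '"Name at list" <address>' the way the template does."""
    # Subpattern 1 is everything before the @-character (lazy match).
    match = re.search(r"(.*?)\@", reply_addr)
    list_name = match.group(1) if match else ""
    return f'"{from_name} at {list_name}" <{reply_addr}>'

# Made-up values standing in for %OFROMNAME and %OREPLYADDR.
to_field = build_to_field("Gerd Ewald", "tbudl@example.com")
print(to_field)
```

The @-character is part of the match but outside the parentheses,
which is exactly why the subpattern is needed.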

As you can see, the syntax is quite easy and stereotypical. The only
difficult thing is to find out which macro provides the necessary
information and how to extract it with the regex.

Here is another example, available on Marck's FAQ page
(www.silverstones.com):

%WRAPPED='Historians believe that on %ODATE%-
%SETPATTREGEXP="(?m-s)Date\:\s*?((.*?[\d]{4})\s*?([\d]{0,2}\:%-
[\d]{0,2}\:[\d]{0,2})\s*?(.*))"%-
%REGEXPBLINDMATCH="%HEADERS" , at %SUBPATT="3"[GMT%SUBPATT="4"]%-
(which was %OTIME where I live) you wrote:'%-

Here, once again, the %- macro is used to make the whole combination
easier to read. This has no special meaning except that it tells TB
that the following line should be treated as a continuation of the
first line. The %WRAPPED means that the result of the macro
combination will be word wrapped at the defined column in TB.

What does the macro do?

The first part "%WRAPPED='Historians believe that on %ODATE%-" is just
some kind of a link up: on every reply the date of the original mail
should be added to the text 'Historians believe that on '.

The second part contains the regex that is much more interesting to us
(I deleted the %- macro to show the regex in one line):

"(?m-s)Date\:\s*?((.*?[\d]{4})\s*?([\d]{0,2}\:[\d]{0,2}\:[\d]{0,2})\s*?(.*))"

The multiline option is switched on and DotAll is switched off: (?m-s).
Then the regex looks for 'Date:', which may be followed by any number
of whitespace characters. Because the star is greedy by default, a
question mark follows it to make it lazy. The author escaped the colon
with a backslash, which isn't necessary. I don't know why he did that,
but it won't cause problems, so we'll leave it alone.

Now the first parenthesis follows. There is no need to group this part
and I assume it is done for easier reading. You may delete it but then
bear in mind that the total number of subpatterns has changed.

The second parenthesis looks for anything that ends in four digits.
We know that the regex will look in the kludges (%HEADERS) for the
date. So we guess that the author is looking for something like the
year. This may be followed by whitespace.

Now we come to the third parenthesis. This is the one the author
needs. He searches for three numbers with zero, one or two digits.
These numbers are separated with colons. That is obviously the time.
Whitespace may follow and with the fourth subpattern all of the rest
is matched: this is no more than the GMT-information.

A closer look at the regex shows that it is applied to the header
lines and that only subpatterns three and four are really needed.

The result could be: 

'Historians believe that on Sonntag, 7. April 2002 , at 11:22:59[GMT
+0200](which was 11:22 where I live) you wrote:'

It works, although the layout needs a bit of DIY.
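
To watch the four subpatterns at work, here is a Python sketch. Two
things differ from the TB version and are my own adjustments: Python
does not accept the option group (?m-s) in this position, so the sketch
passes re.MULTILINE and simply leaves DotAll off, and the header block
is invented.

```python
import re

# The pattern from the template, minus the leading (?m-s) option group.
DATE_PATTERN = re.compile(
    r"Date\:\s*?((.*?[\d]{4})\s*?([\d]{0,2}\:[\d]{0,2}\:[\d]{0,2})\s*?(.*))",
    re.MULTILINE,  # mirrors (?m); DOTALL stays off, mirroring -s
)

# Made-up header block for illustration.
sample_headers = (
    "From: Gerd Ewald <gerd@example.com>\n"
    "Date: Sun, 7 Apr 2002 11:22:59 +0200\n"
    "Subject: regex\n"
)

match = DATE_PATTERN.search(sample_headers)
time_part = match.group(3)          # what %SUBPATT="3" would pick
zone_part = match.group(4).strip()  # what %SUBPATT="4" would pick, trimmed
print(f"at {time_part} [GMT {zone_part}]")
```

Because the last "\s*?" is lazy, subpattern four starts with the blank
before the offset; the sketch trims it, the template does not.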


7.2 Other Possibilities to Use Regular Expressions in TB

There are other possibilities for using regex in TB than macros.

For example, there is the text search option in the mail editor. It
is especially useful for finding strings in long mails with the
special features that regex offers.

[picture in PDF-version and on line only]


This window may be opened with Ctrl-F or using the 'Edit Find' menu
entry in the mail editor. Just enter the regex in the text line. Don't
forget to check the 'regular expressions' box in the Options section.

In almost the same way that I can search for text within a mail, I
can search for mails in folders using regex. Just press F7 while in
folder view. This opens a search window, which offers the facility to
search for text in mails. In the 'Options' tab panel you can enter the
regex in the 'Search for' field.

[picture in PDF-version and on line only]

Go to the 'Advanced' tab panel and check "Regular Expressions".

[picture in PDF-version and on line only]

You can use regex in filter conditions to optimise the organisation of
your inbox. This is a field where regex are as efficient as in macros.
Go to the 'Account, Sorting Office/Filters' menu item. Open the filter
definition, go to the 'Options' tab panel and check 'Regular
Expressions'.

[picture in PDF-version and on line only]

7.3 Overview and Summary

What did we learn in this final chapter?

There are several ways to use regular expressions in TB, which are:

· TB offers macros that can use regular expressions to find, extract
  and modify mail text:

  o %REGEXPTEXT="regex" returns the matched string within a mail as
    text

  o %REGEXPQUOTES="regex" returns the matched string within a mail as
    quoted text

  o %REGEXPMATCH="string" defines the generic text in which the regex
    should match the specified string and return the matched text. Any
    macro or text may be used for 'string'

  o %REGEXPBLINDMATCH="string" is used to define the generic text in
    which the regex should match the specified string. It does not
    return any text. The %SUBPATT is needed to return the text. Any
    macro may be used for 'string'.

  o %SETPATTREGEXP="regex" defines the regex for %REGEXPMATCH and
    %REGEXPBLINDMATCH. The definition is valid until overridden by a
    subsequent %SETPATTREGEXP

  o %SUBPATT="n" returns the n-th subpattern when used with
    %SETPATTREGEXP and %REGEXPBLINDMATCH

· You can use regular expressions to look for specific messages as
  well as to search strings within mails. Furthermore you can use them
  for defining filters.

Exercise 1:
You remember the regex we wrote to clean the subject line?
"^Re(.*?):\s*(.*?)\s*(\(was:.*\))*$" . Try to improve this one:
instead of '(was:xyz)' PGP-users will find '(PGP Decrypted)'. The
regex should find these kinds of subject as well. Furthermore the
regex should be available within a reply template.


Exercise 2:
In the last chapter I described a macro that modifies the TO-address
for mailing lists:

%TO=""%TO='"%OFROMNAME at %-
%SETPATTREGEXP=_(.*?)\@_%-
%REGEXPBLINDMATCH=_%OREPLYADDR_%-
%SUBPATT=_1_" <%OREPLYADDR>'

Try to change it in such a way that it is no longer necessary to use
%REGEXPBLINDMATCH and %SUBPATT but %REGEXPMATCH. You will need to
modify the regex. Hint wanted? Ok: The subpattern was created because
otherwise the @-character would have been included in the match. The
only thing you have to do is to find a regex that does not match the
@-character and has no subpattern.


Solution 1:
Well, that is not too difficult. You only expand the last part of the
regex with an alternative:

"^Re(.*?):\s*(.*?)\s*(\(was:.*\)|\(PGP Decrypted\))*$"

Now let us have a look at the template. We would like to create a new
subject. The macro we need is %SUBJECT. Because we use it when we
reply to a message and we want to have a proper subject line, it
should start with: %SUBJECT="Re

Then we add the regex: 

%SUBJECT="Re: %SETPATTREGEXP=""^Re(.*?):\s*(.*?)\s*(\(was:.*\)|\(PGP
Decrypted\))*$""
[Note: the regex is wrapped due to layout reasons. All must be used as
a single long line!]


We will apply it to the original subject %OFULLSUBJ and need the
second subpattern. %SUBPATT="2"

%SUBJECT="Re: %SETPATTREGEXP=""^Re(.*?):\s*(.*?)\s*(\(was:.*\)|\(PGP
Decrypted\))*$""%REGEXPBLINDMATCH=""%OFULLSUBJ""%SUBPATT=""2"""
[Note: the regex is wrapped due to layout reasons. All must be used as
a single long line!]

Ok, that's it. Additional exercise: what if the subject does not have
any 'Re' but 'AW', 'FWD' or anything else? Go, try to add further
alternatives at the start of the regex.
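
If you would like to test the solution before putting it into a
template, a Python sketch may help. It assumes that the opening
parenthesis of '(was:' is escaped as "\(was:", as in the regex from
the exercise. The helper clean_subject and the sample subjects are
invented for illustration.

```python
import re

SUBJECT_PATTERN = re.compile(
    r"^Re(.*?):\s*(.*?)\s*(\(was:.*\)|\(PGP Decrypted\))*$"
)

def clean_subject(original_subject: str) -> str:
    """Return 'Re: ' plus subpattern 2, or the input if nothing matches."""
    match = SUBJECT_PATTERN.search(original_subject)
    if match is None:
        return original_subject
    return f"Re: {match.group(2)}"

# Made-up subjects for illustration.
for subject in (
    "Re[2]: Regex tutorial (was: Help needed)",
    "Re: Secret plans (PGP Decrypted)",
    "Re: Plain subject",
):
    print(clean_subject(subject))
```

Subpattern 2 is lazy, so it stops as soon as the optional tail and the
end of the line can be matched; that is what removes the appendix.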


Solution 2: 
A positive lookahead assertion will help: ".*?(?=\@)". The assertion
will look for the @-character but won't include it in the match.
Therefore, the template is easier to write:

%TO=""%TO='"%OFROMNAME at %-
%SETPATTREGEXP=_.*?(?=\@)_%-
%REGEXPMATCH=_%-
%OREPLYADDR_" <%OREPLYADDR>'
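
A short Python sketch shows that the lookahead really keeps the
@-character out of the match. The address is invented, and the
function name list_name is only a label for this illustration.

```python
import re

def list_name(reply_addr: str) -> str:
    """Return everything before the @-character using a lookahead."""
    match = re.search(r".*?(?=\@)", reply_addr)
    return match.group(0) if match else ""

# Made-up address standing in for %OREPLYADDR.
name = list_name("tbudl@example.com")
print(name)
```

Here group(0), the whole match, is used; no subpattern is needed any
more, which is why %REGEXPMATCH is sufficient.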

8. Final Conclusion

Now let's try to explain the example that was given in chapter 1.

%QUOTES="%SETPATTREGEXP=""(?is)(-----BEGIN PGP
SIGNED.*?\n(Hash:.*?\n)?\s*)?(.*?)(^(- --|--\n|-----BEGIN PGP
SIGNATURE)|\z)""%REGEXPBLINDMATCH=""%text""%SUBPATT=""3"""

It starts with %QUOTES=. The text that is matched with the following
regex is to be used as quoted text.

"%SETPATTREGEXP="" defines the regex:

(?is)(-----BEGIN PGP SIGNED.*?\n(Hash:.*?\n)?\s*)?(.*?)(^(-
--|--\n|-----BEGIN PGP SIGNATURE)|\z)
[Note: the regex is wrapped due to layout reasons. All must be used as
a single long line!]

You know already why there are doubled "-characters: it is to escape
them so that they are not taken as part of another macro by mistake,
(although you also know there are better ways of writing that too).

"(?is)" is the options setting: ignore case and assume the whole text
as one single line, that is, let the dot match newline characters too.

"(-----BEGIN PGP SIGNED.*?\n(Hash:.*?\n)?\s*)?"
This opens the first subpattern. The regex says: find five hyphens
followed by the string BEGIN PGP SIGNED. This may be followed by any
character sequence or none at all (.*?). Due to the greediness of .*
it is restricted by a question mark. Next is a following new line
(\n).

The new line starts with the string 'Hash:', any character sequence
and ends with a new line again. This is the second subpattern and it
may appear once or never. Any number of whitespace characters may
follow the second subpattern. Then the first subpattern is fully
defined by the final parenthesis. Again this is followed by a question
mark: that means that the first subpattern may appear only once or not
at all.

These lines are created by PGP or GnuPG when a message is clear
signed. The text is standard and therefore it is easy to define the
regex. But the author of that macro combination not only wanted to use
it on PGP-signed messages: he or she wanted to use it even on text
that hasn't been touched by PGP and therefore does not have these lines.

"(.*?)"
This is the important third subpattern: the unmodified message text
itself. The preceding regex was necessary to locate and isolate this
subpattern. The regex just says: "Find anything, no matter what, but
don't be greedy."

Now the alternation starts: 

"(^(- --|--\n|-----BEGIN PGP SIGNATURE)|\z)"

Subpattern 4 starts and looks for a beginning of a line. Anything we
now define in this subpattern has to be at the beginning of the line
"(^". Then subpattern 5 follows: "(- --|--\n|-----BEGIN PGP
SIGNATURE)"

It consists of three alternatives:
"- --", "--\n" or "-----BEGIN PGP SIGNATURE"

The first alternative is well known once you have seen a clear-signed
PGP message : it is the modified signature separator that PGP uses
with the extra hyphen and space as an indicator to show where it
inserted its own lines. Quite unfortunate really, but we won't discuss
it here. Just let's take it as is.

The second alternative is the original signature separator. That means
that this will be found if the text had no contact with PGP. Actually,
it's not quite right, because the proper cut mark is
dash-dash-space-newline, so this regex should be:
"(^(- --|--\s\n|-----BEGIN PGP SIGNATURE)|\z)"

The third alternative is necessary to look for lines that contain the
PGP-created hash (ok, ok, there is only a part of the hash, but this
is a regex tutorial and not a PGP tutorial. If you need that one go to
www.pro-privacy.de ;-)). This is the end of subpattern 5's definition.

The second alternative of subpattern 4 "\z)" searches for the end of
the string as a counterpart to subpattern 5's search for the beginning
of a line. Therefore there doesn't have to be a signature separator or
a PGP-hash: The mail just has to end somewhere…

To be honest: the author looks for this funny ending of the mail only
because of the fact that the proper text of the mail should be easily
located and extracted. There is no further interest in these parts.

Now the next macro follows: %REGEXPBLINDMATCH=""%text"", which lets
the machine apply the regex to the text.

The %SUBPATT=""3""" macro returns the proper part of the mail to the
%QUOTES variable.
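
The whole walk-through can be replayed with a Python sketch. Please
read it as an approximation, not as what TB does internally: Python
writes the end-of-string anchor as \Z where the TB regex has \z, I
added the "m" option so that "^" matches at every line start as the
explanation above assumes, and I used the "--\s\n" variant of the
signature separator. The function extract_body and both mails are
invented.

```python
import re

BODY_PATTERN = re.compile(
    r"(?ism)(-----BEGIN PGP SIGNED.*?\n(Hash:.*?\n)?\s*)?"  # subpatterns 1, 2
    r"(.*?)"                                                 # subpattern 3
    r"(^(- --|--\s\n|-----BEGIN PGP SIGNATURE)|\Z)"          # subpatterns 4, 5
)

def extract_body(mail_text: str) -> str:
    """Return subpattern 3, the message text without the PGP framing."""
    match = BODY_PATTERN.search(mail_text)
    return match.group(3) if match else ""

# Made-up clear-signed mail.
signed_mail = (
    "-----BEGIN PGP SIGNED MESSAGE-----\n"
    "Hash: SHA1\n"
    "\n"
    "Hello Gerd,\n"
    "thanks for the tutorial.\n"
    "- -- \n"
    "Best regards\n"
    "-----BEGIN PGP SIGNATURE-----\n"
    "abc\n"
    "-----END PGP SIGNATURE-----\n"
)

# Made-up mail that never saw PGP.
plain_mail = "Hello Gerd,\nno PGP here.\n-- \nGerd\n"

print(extract_body(signed_mail))
print(extract_body(plain_mail))
```

In both cases the text stops in front of the signature separator. A
mail without any separator is returned in full, because then only the
end-of-string alternative can match.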

That's it.

A tutorial that is entirely written without direct feedback was
something new to me: you don't notice when it gets too complicated or
too academic. I tried to avoid both and I tried to concentrate on
those elements of regular expressions that are most useful. I really
hope I was successful and that it wasn't boring ;-)

The tutorial isn't a perfect and full description of regexian. If I
wanted to offer that I could have copied J. Friedl's book into TB's
help file. No, the tutorial was meant to give an idea, an initial help
to get started. Like any other language you will only learn the
vocabulary by doing and using it. If I was able to give you a hand to
get started I'm content!

I would like to thank those who helped convert my ideas into something
readable and useful. My special thanks go to Marck who was very
patient and who improved my translation. Thanks to (in alphabetical
order):

Januk Aggarwal
Bert Bohla
Dirk Heiser
Hanja Nowicka
Peter Palmreuther
Marck D. Pearlstone
Stefan Peukert
Alfred Rübartsch
Andreas Rumpenhorst
Ingrid Spitzer 
Carsten Thönges
Karin Uhlig
Arnd Wichmann

===END===


-- 
Best regards,
 Gerd 
======================================
Tutorial for using PGP with TheBat! www.pro-privacy.de
----------------------------------------------------------------------------
We must interpret a bad temper as a sign of inferiority.
     Alfred Adler, Father of individual psychology (1870-1937)
----------------------------------------------------------------------------
now playing: Jethro Tull - Wind Up


