My apologies for taking so long to respond. I've been busy with conferences.

If you don't like the regular expression syntax, the definitions can just as easily be expressed as English prose:

* A EuropeanNumber is a sequence of one or more groups of one or more class EN characters. The groups are separated by a single class ES or CS character.
* A SequenceOfEuropeanNumbers is a sequence of one or more EuropeanNumbers that are separated, preceded and followed by zero or more class ET characters.
* An ArabicNumber is a sequence of one or more groups of one or more class AN characters. The groups are separated by a single class CS character.
* A EuroArabicNumber is a sequence of one or more groups of one or more class EN or AN characters. The groups are separated by a single class CS character.

Since the report claims that rules W2-7 are there so that the "text is next parsed for numbers", it only makes sense to give a grammar for what those numbers are, as defining them this way does. The existing definitions are not such a clear grammar.

(Note, my previous e-mail had a typo. I should have said "(EN+ sep-by (ES|CS)) bracket-by ET*", not "((EN NSM)+ sep-by ((ES|CS)) bracket-by ET*". The stray NSM was an abortive attempt at including W1 with W2-7. It is possible, but I think it clutters up the core definition.)

As to why using regular expressions is better, note that these regular expressions are not the perversions that Perl calls regular expressions, but rather the very well behaved regular expressions from theoretical computer science, and thus they lend themselves to very efficient, constant space, single pass implementations. In fact, I would posit that when phrased this way, it becomes easy to combine all the X, W, N and I rules into a single pass algorithm that degenerates into the "test for right-to-left characters" optimization (mentioned in section 5.1) when there are no right-to-left characters.
This is something that not even the C++ and Java reference implementations do (though it appears that the C++ implementation of the W rules was originally derived from a regular expression, as it uses state tables; if so, that is undocumented). (By the way, those two implementations have not been proven to be equivalent; they have merely been tested. Proof is a much more complicated formalism.)

On Fri, Sep 10, 2010 at 8:50 PM, Khaled Hosny <[email protected]> wrote:
> On Fri, Sep 10, 2010 at 05:00:21PM -0700, Asmus Freytag wrote:
>> PS: Personally, I don't find the presentation in terms of the
>> regular expressions any more intuitive than the original.
>
> Some people, when confronted with a problem, think "I know,
> I'll use regular expressions." Now they have two problems.
> --Jamie Zawinski
>
> --
> Khaled Hosny
> Arabic localiser and member of Arabeyes.org team
> Free font developer
>
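
P.S. For concreteness, here is a minimal sketch in Python of the four definitions from my reply above. It is an illustration under stated assumptions, not a reference implementation: the helper names `classes` and `matches` and the one-letter encoding of the bidi classes are made up for this example, and the character classes come from Python's `unicodedata.bidirectional`. The patterns use only concatenation, alternation and repetition, so they stay within the textbook notion of a regular expression. The sketch does not attempt the W rules themselves or the combined single pass algorithm.

```python
import re
import unicodedata

# Hypothetical one-letter encoding of the bidi classes that the
# definitions mention. Every other class is encoded as "x".
CODE = {"EN": "e", "AN": "a", "ES": "s", "CS": "c", "ET": "t"}


def classes(text):
    """Return one letter per character, chosen by its Bidi_Class."""
    return "".join(CODE.get(unicodedata.bidirectional(ch), "x") for ch in text)


# EN+ sep-by (ES|CS)
EUROPEAN_NUMBER = r"e+(?:[sc]e+)*"
# (EN+ sep-by (ES|CS)) bracket-by ET*
SEQUENCE_OF_EUROPEAN_NUMBERS = rf"t*{EUROPEAN_NUMBER}(?:t*{EUROPEAN_NUMBER})*t*"
# AN+ sep-by CS
ARABIC_NUMBER = r"a+(?:ca+)*"
# (EN|AN)+ sep-by CS
EURO_ARABIC_NUMBER = r"[ea]+(?:c[ea]+)*"


def matches(pattern, text):
    """True if the whole of text, viewed as bidi classes, fits the pattern."""
    return re.fullmatch(pattern, classes(text)) is not None
```

With these definitions, `matches(EUROPEAN_NUMBER, "1,234.56")` holds, while `matches(EUROPEAN_NUMBER, "1,,2")` does not, because the groups must be separated by a single separator character. Likewise `matches(SEQUENCE_OF_EUROPEAN_NUMBERS, "$1,234.56%")` holds, since "$" and "%" are class ET.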

