Re: How trim: Bug in RegExp engine

Marielle Lange Mon, 24 Oct 2005 15:40:25 -0700

The other main issue is that Rev does not support all the fine nuances of Perl-style RegEx, though the docs say it does.

A problem is that their documentation doesn't match what their functions do. A table that summarizes the regular expression codes found in just about all programs that implement regular expressions can be seen at: http://revolution.lexicall.org/wiki/tiki-index.php?page=RegularExpressions


What is missing in the Rev doc:
{}        The braces force the preceding character to match a
          specific number of times.
          Ex:  (rat){3}    matches ratratrat
               rat{3}      matches rattt
               rat{2,5}    matches ratt or rattt or ratttt or rattttt
                           (between 2 and 5 t's)

It is implemented, though:
put "_" & replaceText("AAAAAAA","A{3}","")   -> _A
put "_" & replaceText("AAAAAAA","A{4}","")   -> _AAA
put "_" & replaceText("AAAAAAA","A{5}","")   -> _AA
put "_" & replaceText("AAAAAAA","A{6}","")   -> _A
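
As a cross-check, here is a minimal sketch of the same brace tests in Python's re module. This is an assumption on my part as a reference point: it is a different engine from the one inside Rev, so it only shows what a Perl-style engine is conventionally expected to return. The variable names are made up for the illustration.

```python
import re

subject = "AAAAAAA"  # seven A's, as in the replaceText tests above

# Remove every non-overlapping run of exactly n A's and keep what is left.
leftovers = {n: re.sub("A{%d}" % n, "", subject) for n in (3, 4, 5, 6)}
print(leftovers)

# Braces bind to the preceding character unless a group is used.
group_repeat = re.fullmatch(r"(rat){3}", "ratratrat") is not None
char_repeat = re.fullmatch(r"rat{3}", "rattt") is not None

# {2,5} accepts between 2 and 5 repetitions of the preceding "t".
range_repeat = [w for w in ("rat", "ratt", "rattttt", "ratttttt")
                if re.fullmatch(r"rat{2,5}", w)]
print(group_repeat, char_repeat, range_repeat)
```

The leftovers agree with the Rev results above (A, AAA, AA, A), which suggests the braces work in Rev even though the docs do not mention them.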

There is an error in their documentation:
[ABC]|[XYZ] matches “AY” or “CX”, but not “AA” or “ZB”.
should be:

[ABC][XYZ] matches “AY” or “CX”, but not “AA” or “ZB”. (i.e., the documented example is inappropriate for illustrating "|")

Fortunately, the function itself behaves normally:

put "AYCXAA" into tText; put replaceText(tText, "[ABC][XYZ]", "")   -> AA
put "AYCXAA" into tText; put replaceText(tText, "[ABC]|[XYZ]", "")  -> empty


The correct example is
(AY|CX)   matches “AY” or “CX”
or a more telling one
(mouse|mice) matches mouse or mice.
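
The difference between the two patterns can be sketched in Python's re module (again an assumed stand-in for comparison, not Rev's engine; the sample word "moose" is mine):

```python
import re

text = "AYCXAA"

# Two classes side by side: one char from [ABC] followed by one from [XYZ].
pairs_removed = re.sub(r"[ABC][XYZ]", "", text)

# Alternation of the two classes: any single char from either class.
singles_removed = re.sub(r"[ABC]|[XYZ]", "", text)

# Alternation between whole words, as in the corrected example.
words = [w for w in ("mouse", "mice", "moose")
         if re.fullmatch(r"(mouse|mice)", w)]

print(repr(pairs_removed), repr(singles_removed), words)
```

With "|" between the classes every character of "AYCXAA" matches on its own, which is why the second replacement leaves nothing.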

I don't remember the details, but I ran into problems trying to use look-around features, for instance. I've come to the conclusion that I should try a simple version of what I want first in the Message Box, then put it into my script.

I was surprised to see Mark use \s and \S as they are not mentioned in the documentation (which hasn't been updated to follow updates in the function in version 2.5). Full information about these special codes can be found below.

Interestingly, the start and end of text can also be represented by \A and \Z. They work in Revolution and produce yet another behaviour. Honestly, I was pleased to read that regular expressions had been improved (version 2.6?)... but there are obviously some more problems to fix.


put "_" & replaceText(" A C","^ *","")  -> _A C
put "_" & replaceText("A C","^ *","")   -> _C

put "_" & replaceText(" A C","\A ","")   -> _A C   (space before A C)
put "_" & replaceText("A C","\A ","")     -> _A C   (no space)
put "_" & replaceText("A C","\A *","")    -> _
put "_" & replaceText(" A C","\A *","")    -> _
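
For reference, this is a small sketch of what I would expect from an anchored pattern, run through Python's re module (an assumption: a different engine from Rev's, used only as a yardstick). An anchored " *" should strip leading spaces and nothing else.

```python
import re

subjects = [" A C", "A C"]        # with and without a leading space
patterns = [r"^ *", r"\A *", r"\A "]

# Map (pattern, subject) to the result, prefixed with "_" as in the tests above.
results = {}
for pattern in patterns:
    for subject in subjects:
        results[(pattern, subject)] = "_" + re.sub(pattern, "", subject)
        print(repr(pattern), repr(subject), "->", repr(results[(pattern, subject)]))
```

In this engine all six combinations give "_A C", so the "_C" and "_" results from Rev above are the ones that look wrong to me.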

I tried the edge of word (\b and \B) and this seems to behave strangely as well:


put "_" & replaceText(" A C","\B *","")   -> _A C
put "_" & replaceText(" A C","\b *","")   -> _
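
The same two patterns in Python's re module, as a hedged point of comparison (not Rev's engine; variable names are mine):

```python
import re

subject = " A C"

# \b only matches where a word starts or ends, so only a space that
# directly follows such a boundary (the one after "A") can be consumed.
at_boundary = "_" + re.sub(r"\b *", "", subject)

# \B matches where there is no word boundary, which here is only the very
# start of the string, so only the leading space is consumed.
not_at_boundary = "_" + re.sub(r"\B *", "", subject)

print(repr(at_boundary), repr(not_at_boundary))
```

Under this reading the \B result from Rev ("_A C") is what one would expect, while the \b result (everything removed) is not.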

------------------------------------------------------------------------------------------------


 \b and \B    NaV. \b matches the empty string at the edge of a word;
              \B matches the empty string if not at the edge of a word.
              Ex: \bcomput will match "computer" or "computing", but not
              "supercomputer" since there are no spaces or punctuation
              between "super" and "computer". \Bcomput will not match
              "computer" or "computing", unless it is part of a bigger
              word such as "supercomputer" or "recomputing".

 \w and \W    NaV. \w matches word-constituent characters (letters, "_",
              & digits); \W matches characters that are not
              word-constituent.
              Ex: a\wz matches "abz", "aTz", "a5z", "a_z", or any
              three-character string starting with "a", ending with "z",
              and whose second character was either a letter (upper- or
              lower-case), a number, or the underscore.
              a\Wz would not match "abz", "aTz", "a5z", or "a_z". It
              would match "a%z", "a z", "a?z" or any three-character
              string starting with "a" and ending with "z" and whose
              second character was not a letter, number, or
              underscore. (This means the second character must
              either be a symbol or a whitespace character.)

 \d and \D    NaV. \d matches any digit. \D matches any
              character except a digit.
              Ex: a\Dz matches "abz", "aTz" or "a%z", not "a2z", "a5z"
              or "a9z". \D+ matches any non-null string which contains
              no numeric characters.


 \s and \S    NaV. \s matches exactly one character of whitespace.
              (Whitespace is defined as spaces, tabs, newlines, or any
              character which would not use ink if printed on a
              printer.) \S matches any character that is not whitespace.
              Ex: a\sz would match any three-character string starting
              with "a" and ending with "z" and whose second character
              was a space, tab, or newline. a\Sz would match any
              three-character string starting with "a" and ending with
              "z" whose second character was not a space, tab or
              newline. (Thus, the second character could be a letter,
              number or symbol.)

 \nnn         NaV. This is used for specifying control characters that
              have no typed equivalent. For example, \007 would find all
              subjects with an embedded ASCII "bell" character. (The
              bell is specified by an ASCII value of 7.) You will rarely
              need to use the octal metacharacter.

 \A and \Z    Beginning and End of string. (equivalents of ^ and $)
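
The examples in the table above can be tried out with a short sketch like the following. It uses Python's re module as an assumed stand-in (not Rev's engine), the helper name hits is hypothetical, and the sample strings are taken from the table plus a tab case of my own.

```python
import re

def hits(pattern, candidates, flags=0):
    """Return the candidates in which the pattern matches somewhere."""
    return [c for c in candidates if re.search(pattern, c, flags)]

words = ["computer", "computing", "supercomputer", "recomputing"]
word_start = hits(r"\bcomput", words)    # "comput" at the edge of a word
word_inside = hits(r"\Bcomput", words)   # "comput" inside a bigger word

# \A and \Z pin each pattern to the whole three-character string.
triples = ["abz", "aTz", "a5z", "a_z", "a%z", "a z", "a\tz"]
word_char = hits(r"\Aa\wz\Z", triples)
non_word_char = hits(r"\Aa\Wz\Z", triples)
digit = hits(r"\Aa\dz\Z", triples)
non_digit = hits(r"\Aa\Dz\Z", triples)
space = hits(r"\Aa\sz\Z", triples)
non_space = hits(r"\Aa\Sz\Z", triples)

# \007 is the octal escape for the ASCII bell character (value 7).
bell = hits(r"\007", ["ding\x07dong", "silent"])

# \A and ^ only differ once multi-line matching is switched on.
caret_lines = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)
anchor_lines = re.findall(r"\A\w+", "one\ntwo", re.MULTILINE)

for name in ("word_start", "word_inside", "word_char", "non_word_char",
             "digit", "non_digit", "space", "non_space", "bell",
             "caret_lines", "anchor_lines"):
    print(name, globals()[name])
```

One caveat: in Python \Z matches only at the very end of the string, whereas in Perl-style engines \Z also matches just before a final newline, so the "equivalent of $" wording holds only for single-line subjects.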

--------------------------------------------------------------------------------

Marielle Lange (PhD),  Psycholinguist

Alternative emails: [EMAIL PROTECTED], [EMAIL PROTECTED]

Homepage                                            http://homepages.lexicall.org/mlange/

Easy access to lexical databases                    http://lexicall.org

Supporting Education Technologists http://revolution.lexicall.org



_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
