[Patch] Addition of Clustering (ie non backref'd grouping)

2001-02-08 Thread Michael McCallum

Hi all,

I have started using the regexp package because it is nice and lightweight, but 
found I could not use clustering. I think it might be a Perl extension to 
regular expressions, but I found it was easy to implement in this package.

This allows matching of the following style.

(?:\w+(?:\s\w+)+)
Mary had a little lamb

This will match, with the only paren (0) returning the full string.
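
As a rough illustration of what clustering does to paren numbering, here is a small sketch. It uses java.util.regex from the JDK rather than this package (an assumption, so that it runs without the patch applied), the class and method names are made up for the example, and it assumes the pattern above is meant to end with a closing parenthesis.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; uses java.util.regex, not org.apache.regexp.
class ClusterDemo {
    // Number of capturing parens, not counting paren 0 (the whole match).
    static int parenCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Returns paren 0 if the whole input matches, otherwise null.
    static String wholeMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(0) : null;
    }

    public static void main(String[] args) {
        String clustered = "(?:\\w+(?:\\s\\w+)+)"; // clusters only
        String grouped = "(\\w+(\\s\\w+)+)";       // same shape, capturing

        System.out.println(wholeMatch(clustered, "Mary had a little lamb"));
        // prints "Mary had a little lamb"
        System.out.println(parenCount(clustered)); // prints 0
        System.out.println(parenCount(grouped));   // prints 2
    }
}
```

The clustered form still groups for the purpose of the + operator, but only paren 0 is recorded.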

A better example is domain names (simplified here; I am not sure if it 
complies with the relevant RFC)...

([a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)*)
www.test.com
jakarta.apache.org

Both will match, with paren 0 holding the full string.

Now take the above expression and add the protocol...

(?:\w+://)?([a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)*)
http://www.test.com
Paren 0 = http://www.test.com
Paren 1 = www.test.com
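
To make the paren numbering concrete, the sketch below compares a clustered protocol prefix with a capturing one. Again this is java.util.regex from the JDK standing in for this package (an assumption), and UrlParens, DOMAIN and parens() are names invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; uses java.util.regex, not org.apache.regexp.
class UrlParens {
    // The simplified domain expression from this message.
    static final String DOMAIN = "([a-zA-Z0-9]+(?:\\.[a-zA-Z0-9]+)*)";

    // Returns parens 0..n of a full match, or null if there is no match.
    static String[] parens(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String url = "http://www.test.com";

        // Clustered protocol: the domain is paren 1.
        String[] clustered = parens("(?:\\w+://)?" + DOMAIN, url);
        System.out.println(clustered[1]); // prints "www.test.com"

        // Capturing protocol: the protocol takes paren 1, the domain moves to paren 2.
        String[] grouped = parens("(\\w+://)?" + DOMAIN, url);
        System.out.println(grouped[1]); // prints "http://"
        System.out.println(grouped[2]); // prints "www.test.com"
    }
}
```

With the cluster, code that reads paren 1 keeps working whether or not a protocol prefix is added to the expression.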

Anyway, there are about 10 tests in RETest.txt that demonstrate this.

regards,
Michael


p.s. I think that I stripped a bunch of spaces from the end of lines, so there 
are a bunch of extra lines in the patch. Not very familiar with using diff :)

? bin
? Clustering.patch
? Clustering2.patch
? build/run-tests.sh
Index: docs/RETest.txt
===
RCS file: /home/cvspublic/jakarta-regexp/docs/RETest.txt,v
retrieving revision 1.1
diff -r1.1 RETest.txt
886a887,980
> 
> #149
> (?:a)
> a
> YES
> a
> 
> #150
> (?:a)
> aa
> YES
> a
> 
> #151
> (?:\w)
> abc
> YES
> a
> 
> #152
> (?:\w\s\w)+
> a b c
> YES
> a b
> 
> #153
> (a\w)(?:,(a\w))+
> ab,ac,ad
> YES
> ab,ac,ad
> ab
> ad
> 
> #154
> z(\w\s+(?:\w\s+\w)+)z
> za   b bc   cd dz
> YES
> za   b bc   cd dz
> a   b bc   cd d
> 
> #155
> (([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*
> http://www.test.com
> YES
> http://www.test.com
> http://
> http
> .com
> 
> #156
> ((?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*
> ftp://www.test.com
> YES
> ftp://www.test.com
> ftp://
> .com
> 
> #157
> (([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*
> htTp://www.test.com
> YES
> htTp://www.test.com
> htTp://
> htTp
> 
> #158
> (?:([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*
> FTP://www.test.com
> YES
> FTP://www.test.com
> FTP
> .com
> 
> #159
> ^(?:([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*$
> http://.www.test.com
> NO
> 
> #160
> ^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*$
> FtP://www.test.com
> YES
> FtP://www.test.com
> 
> #161
> ^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*$
> FtTP://www.test.com
> NO
> 
> #162
> ^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*$
> www.test.com
> YES
> www.test.com
Index: src/java/org/apache/regexp/RE.java
===
RCS file: /home/cvspublic/jakarta-regexp/src/java/org/apache/regexp/RE.java,v
retrieving revision 1.6
diff -r1.6 RE.java
176,186c176,186
<  *[:alnum:]Alphanumeric characters. 
<  *[:alpha:]Alphabetic characters. 
<  *[:blank:]Space and tab characters. 
<  *[:cntrl:]Control characters. 
<  *[:digit:]Numeric characters. 
<  *[:graph:]Characters that are printable and are also visible. (A 
space is printable, but not visible, while an `a' is both.) 
<  *[:lower:]Lower-case alphabetic characters. 
<  *[:print:]Printable characters (characters that are not control 
characters.) 
<  *[:punct:]Punctuation characters (characters that are not letter, 
digits, control characters, or space characters). 
<  *[:space:]Space characters (such as space, tab, and formfeed, to 
name a few). 
<  *[:upper:]Upper-case alphabetic characters. 
---
>  *[:alnum:]Alphanumeric characters.
>  *[:alpha:]Alphabetic characters.
>  *[:blank:]Space and tab characters.
>  *[:cntrl:]Control characters.
>  *[:digit:]Numeric characters.
>  *[:graph:]Characters that are printable and are also visible. (A 
>space is printable, but not visible, while an `a' is both.)
>  *[:lower:]Lower-case alphabetic characters.
>  *[:print:]Printable characters (characters that are not control 
>characters.)
>  *[:punct:]Punctuation characters (characters that are not letter, 
>digits, control characters, or space characters).
>  *[:space:]Space characters (such as space, tab, and formfeed, to 
>name a few).
>  *[:upper:]Upper-case alphabetic characters.
188c188
<  * 
---
>  *
199c199
<  * 
---
>  *
254a255
>  *   (?:A) Used for subexpression clustering (just like grouping but 
>no backrefs)
399a401
> static final char OP_OPEN_CLUSTER = '<';  // opening cluster
400a403
> 

Re: [Patch] Addition of Clustering (ie non backref'd grouping)

2001-02-08 Thread Jon Stevens

on 2/8/01 5:54 AM, "Michael McCallum" <[EMAIL PROTECTED]> wrote:

> I have started using the regexp package because it is nice and lightweight but
> found I could not use clustering. I think it might be a Perl extension to re
> but found it was easy to implement in this package.

Thanks, I will take a look at it today.

-jon

-- 
If you come from a Perl or PHP background, JSP is a way to take
your pain to new levels. --Anonymous




Re: [Patch] Addition of Clustering (ie non backref'd grouping)

2001-02-09 Thread Jon Stevens

on 2/8/01 5:54 AM, "Michael McCallum" <[EMAIL PROTECTED]> wrote:

> Anyway there are about 10 tests in RETest.txt that demonstrate this.
> 
> regards,
> Michael

Q: There are two tests that don't pass. Why?

#159
#161

Q: 89,90c89,90
< //new RETest(arg);
< test();
---
> new RETest(arg);
> //test();

Why?

-jon





Re: [Patch] Addition of Clustering (ie non backref'd grouping)

2001-02-10 Thread Michael McCallum


On 9 Feb 2001, at 11:55, Jon Stevens wrote:

> on 2/8/01 5:54 AM, "Michael McCallum" <[EMAIL PROTECTED]> wrote:
> 
> > Anyway there are about 10 tests in RETest.txt that demonstrate this.
> > 
> > regards,
> > Michael
> 
> Q: There are two tests that don't pass. Why?
Good question. Possibly I made the patch and then made a further change; it was 3 in the 
morning. I will check it out when I am next home.
> #159
> #161
> 
> Q: 89,90c89,90
> < //new RETest(arg);
> < test();
> ---
> > new RETest(arg);
> > //test();
> 
> Why?
This was because test() does not use the command line parameters, so I could not run 
interactive tests or pass the file I wished RETest to use as a parameter. It's possible 
something had changed between the 1.2 source and the CVS repository that I missed.

Michael

p.s. I will have a look later today to see what's going on with both of those and send 
another patch.




Re: [Patch] Addition of Clustering (ie non backref'd grouping)

2001-02-10 Thread Michael McCallum

Hi.

Jon Stevens > Q: There are two tests that don't pass. Why?
Having got home and looked at the tests I think I did not answer the question 
properly earlier. I assume you mean why should these not match?

Jon Stevens > #159
If so: in #159 the www is preceded by a period.
The re requires that the first character of the domain name be 
alphanumeric or a hyphen.
http://.www.test.com

Jon Stevens > #161
The re only matches the ftp and http protocols, not FtTP.

NOTE:  "Match: NO" means a successful non-match.

A big "* Failure *" type message appears if one of the tests "fails".
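
The idea that an expected NO counts as a pass can be sketched as follows. This is not RETest's actual code: ExpectationCheck and passes() are names made up for the example, and it uses java.util.regex from the JDK (an assumption) with the expressions from tests #159 and #161.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the pass/fail rule; not the real RETest class.
class ExpectationCheck {
    // Expression from test #159.
    static final String RE_159 =
        "^(?:([hH][tT]{2}[pP]|[fF][tT][pP]):\\/\\/)?[a-zA-Z0-9\\-]+(\\.[a-zA-Z0-9\\-]+)*$";
    // Expression from tests #160 to #162.
    static final String RE_161 =
        "^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\\/\\/)?[a-zA-Z0-9\\-]+(?:\\.[a-zA-Z0-9\\-]+)*$";

    // A test passes when the observed result equals the expected one,
    // so "expected NO" plus "did not match" is a pass, not a failure.
    static boolean passes(String regex, String input, boolean expectMatch) {
        boolean matched = Pattern.compile(regex).matcher(input).find();
        return matched == expectMatch;
    }

    public static void main(String[] args) {
        // #159: the domain starts with a period, so NO is the right answer.
        System.out.println(passes(RE_159, "http://.www.test.com", false)); // prints true
        // #161: FtTP is neither ftp nor http, so NO is the right answer.
        System.out.println(passes(RE_161, "FtTP://www.test.com", false));  // prints true
        // #160: FtP is ftp in mixed case, so YES is expected.
        System.out.println(passes(RE_161, "FtP://www.test.com", true));    // prints true
    }
}
```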


As to test() as opposed to new RETest( args ):
I have included another patch to clean this up (this is a repeat of the 
previous patch with additions).
I think someone intended to clean it up earlier but did not finish or was 
distracted, as the javadoc says one thing and the code does another.
I think it is on track now...

Michael


Index: docs/RETest.txt
===
RCS file: /home/cvspublic/jakarta-regexp/docs/RETest.txt,v
retrieving revision 1.1
diff -r1.1 RETest.txt
886a887,980
> 
> #149
> (?:a)
> a
> YES
> a
> 
> #150
> (?:a)
> aa
> YES
> a
> 
> #151
> (?:\w)
> abc
> YES
> a
> 
> #152
> (?:\w\s\w)+
> a b c
> YES
> a b
> 
> #153
> (a\w)(?:,(a\w))+
> ab,ac,ad
> YES
> ab,ac,ad
> ab
> ad
> 
> #154
> z(\w\s+(?:\w\s+\w)+)z
> za   b bc   cd dz
> YES
> za   b bc   cd dz
> a   b bc   cd d
> 
> #155
> (([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*
> http://www.test.com
> YES
> http://www.test.com
> http://
> http
> .com
> 
> #156
> ((?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*
> ftp://www.test.com
> YES
> ftp://www.test.com
> ftp://
> .com
> 
> #157
> (([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*
> htTp://www.test.com
> YES
> htTp://www.test.com
> htTp://
> htTp
> 
> #158
> (?:([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*
> FTP://www.test.com
> YES
> FTP://www.test.com
> FTP
> .com
> 
> #159
> ^(?:([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*$
> http://.www.test.com
> NO
> 
> #160
> ^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*$
> FtP://www.test.com
> YES
> FtP://www.test.com
> 
> #161
> ^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*$
> FtTP://www.test.com
> NO
> 
> #162
> ^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*$
> www.test.com
> YES
> www.test.com
Index: src/java/org/apache/regexp/RE.java
===
RCS file: /home/cvspublic/jakarta-regexp/src/java/org/apache/regexp/RE.java,v
retrieving revision 1.6
diff -r1.6 RE.java
176,186c176,186
<  *[:alnum:]Alphanumeric characters. 
<  *[:alpha:]Alphabetic characters. 
<  *[:blank:]Space and tab characters. 
<  *[:cntrl:]Control characters. 
<  *[:digit:]Numeric characters. 
<  *[:graph:]Characters that are printable and are also visible. (A 
space is printable, but not visible, while an `a' is both.) 
<  *[:lower:]Lower-case alphabetic characters. 
<  *[:print:]Printable characters (characters that are not control 
characters.) 
<  *[:punct:]Punctuation characters (characters that are not letter, 
digits, control characters, or space characters). 
<  *[:space:]Space characters (such as space, tab, and formfeed, to 
name a few). 
<  *[:upper:]Upper-case alphabetic characters. 
---
>  *[:alnum:]Alphanumeric characters.
>  *[:alpha:]Alphabetic characters.
>  *[:blank:]Space and tab characters.
>  *[:cntrl:]Control characters.
>  *[:digit:]Numeric characters.
>  *[:graph:]Characters that are printable and are also visible. (A 
>space is printable, but not visible, while an `a' is both.)
>  *[:lower:]Lower-case alphabetic characters.
>  *[:print:]Printable characters (characters that are not control 
>characters.)
>  *[:punct:]Punctuation characters (characters that are not letter, 
>digits, control characters, or space characters).
>  *[:space:]Space characters (such as space, tab, and formfeed, to 
>name a few).
>  *[:upper:]Upper-case alphabetic characters.
188c188
<  * 
---
>  *
199c199
<  * 
---
>  *
254a255
>  *   (?:A) Used for subexpression clustering (just like grouping but 
>no backrefs)
399a401
> static final char OP_OPEN_CLUSTER = '<';  // opening cluster
400a403
> static final char O

Re: [Patch] Addition of Clustering (ie non backref'd grouping)

2001-02-11 Thread Jon Stevens

Hi Michael,

your patch is now in CVS.

thanks,

-jon