Re: regex support RFC

2006-04-03 Thread Mauro Tortonesi
Curtis Hatter wrote: On Friday 31 March 2006 06:52, Mauro Tortonesi: while i like the idea of supporting modifiers like "quick" (short circuit) and maybe "i" (case insensitive comparison), i think that (?i:) and (?-i:) constructs would be overkill and rather hard to implement. I figured that

Re: regex support RFC

2006-04-03 Thread Mauro Tortonesi
Hrvoje Niksic wrote: "Tony Lewis" <[EMAIL PROTECTED]> writes: I don't think ",r" complicates the command that much. Internally, the only additional work for supporting both globs and regular expressions is a function that converts a glob into a regexp when ",r" is not requested. That's a strai
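
The glob-to-regexp conversion mentioned above can be sketched in a few lines. This is only an illustration in Python, not wget code: `compile_filter` and its `is_regex` flag are hypothetical names standing in for the proposed ",r" switch, and Python's `fnmatch.translate` stands in for whatever conversion function wget would use.

```python
import fnmatch
import re


def compile_filter(pattern: str, is_regex: bool) -> re.Pattern:
    """Hypothetical helper: compile a filter pattern.

    is_regex=True models the proposed ",r" modifier (pattern is already a
    regular expression); otherwise the pattern is treated as a shell glob
    and converted to an equivalent regular expression first.
    """
    if is_regex:
        return re.compile(pattern)
    # fnmatch.translate turns a glob such as "*.pdf" into an anchored regex.
    return re.compile(fnmatch.translate(pattern))


glob_rule = compile_filter("*.pdf", is_regex=False)
regex_rule = compile_filter(r".*\.pdf$", is_regex=True)

for name in ("paper.pdf", "paper.pdf.txt"):
    print(name, bool(glob_rule.match(name)), bool(regex_rule.search(name)))
```

Both rules accept "paper.pdf" and reject "paper.pdf.txt", which is the point of the conversion: the glob user and the regexp user end up with the same matcher internally.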

Re: regex support RFC

2006-04-03 Thread Mauro Tortonesi
Tony Lewis wrote: Hrvoje Niksic wrote: I don't see a clear line that connects --filter to glob patterns as used by the shell. I want to list all PDFs in the shell, ls -l *.pdf I want a filter to keep all PDFs, --filter=+file:*.pdf you don't need --filter for that. you can simply use -A.

RE: regex support RFC

2006-03-31 Thread Sandhu, Ranjit
Mauro Tortonesi wrote: > no. i was talking about regexps. they are more expressive and powerful than simple globs. i don't see what's the point in supporting both. The problem is that users who are expectin

Re: regex support RFC

2006-03-31 Thread Scott Scriven
* Mauro Tortonesi <[EMAIL PROTECTED]> wrote: >> I'm hoping for ... a "raw" type in addition to "file", >> "domain", etc. > > do you mean you would like to have a regex class working on the > content of downloaded files as well? Not exactly. (details below) > i don't like your "raw" proposal as

Re: regex support RFC

2006-03-31 Thread TPCnospam
> * [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > > wget -e robots=off -r -N -k -E -p -H http://www.gnu.org/software/wget/ > > > > soon leads to non wget related links being downloaded, eg. > > http://www.gnu.org/graphics/agnuhead.html > > In that particular case, I think --no-parent would solv

Re: regex support RFC

2006-03-31 Thread Curtis Hatter
On Friday 31 March 2006 06:52, Mauro Tortonesi: > while i like the idea of supporting modifiers like "quick" (short > circuit) and maybe "i" (case insensitive comparison), i think that (?i:) > and (?-i:) constructs would be overkill and rather hard to implement. I figured that the (?i:) and (?-i:)

RE: regex support RFC

2006-03-31 Thread Tony Lewis
Hrvoje Niksic wrote: > I don't see a clear line that connects --filter to glob patterns as used > by the shell. I want to list all PDFs in the shell, ls -l *.pdf I want a filter to keep all PDFs, --filter=+file:*.pdf Note that "*.pdf" is not a valid regular expression even though it's what most

Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
"Tony Lewis" <[EMAIL PROTECTED]> writes: > I didn't miss the point at all. I'm trying to make a completely different > one, which is that regular expressions will confuse most users (even if you > tell them that the argument to --filter is a regular expression). Well, "most users" will probably n

RE: regex support RFC

2006-03-31 Thread Tony Lewis
Hrvoje Niksic wrote: > But that misses the point, which is that we *want* to make the > more expressive language, already used elsewhere on Unix, the > default. I didn't miss the point at all. I'm trying to make a completely different one, which is that regular expressions will confuse most user

Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
"Tony Lewis" <[EMAIL PROTECTED]> writes: > Mauro Tortonesi wrote: > >> no. i was talking about regexps. they are more expressive >> and powerful than simple globs. i don't see what's the >> point in supporting both. > > The problem is that users who are expecting globs will try things like > --fi

RE: regex support RFC

2006-03-31 Thread Tony Lewis
Mauro Tortonesi wrote: > no. i was talking about regexps. they are more expressive and powerful than simple globs. i don't see what's the point in supporting both. The problem is that users who are expecting globs will try things like --filter=-file:*.pdf rather than --filter=-file:.*\.pdf.

Re: regex support RFC

2006-03-31 Thread Oliver Schulze L.
Mauro Tortonesi wrote: for consistency and to avoid maintenance problems, i would like wget to have the same behavior on windows and unix. please, notice that if we implemented regex support only on unix, windows binaries of wget built with cygwin would have regex support but native binaries w

Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi
Hrvoje Niksic wrote: Wincent Colaiuta <[EMAIL PROTECTED]> writes: Are you sure that "www-*" matches "www"? Yes. hrvoje is right. try this perl script:
#!/usr/bin/perl -w
use strict;
my @strings = ("www-.yoyodyne.com", "www.yoyodyne.com");
foreach my $str (@strings) {

Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
Wincent Colaiuta <[EMAIL PROTECTED]> writes: > Are you sure that "www-*" matches "www"? Yes. > As far as I know "www-*" matches "one w, another w, a third w, a > hyphen, then 0 or more hyphens". That would be "www--*" or "www-+".
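
The difference between the regexp and the wildcard reading of "www-*" can be checked directly. The sketch below uses Python's `re` and `fnmatch` modules as stand-ins for the GNU regex and shell glob behaviour discussed in the thread; the host list is made up for illustration.

```python
import fnmatch
import re

# Illustrative host names, not taken from any real site list.
hosts = ["www.yoyodyne.com", "www-.yoyodyne.com", "www--.yoyodyne.com"]

# As a regexp, "www-*" is "www" followed by ZERO or more hyphens,
# so it matches plain "www" as well.
star_hits = [h for h in hosts if re.search(r"www-*", h)]

# "www-+" (equivalently "www--*") requires at least one hyphen.
plus_hits = [h for h in hosts if re.search(r"www-+", h)]

# As a glob, "www-*" is the literal prefix "www-" followed by anything.
glob_hits = [h for h in hosts if fnmatch.fnmatchcase(h, "www-*")]

print("regex www-*:", star_hits)
print("regex www-+:", plus_hits)
print("glob  www-*:", glob_hits)
```

The regexp "www-*" selects all three hosts, while the regexp "www-+" and the glob "www-*" both skip "www.yoyodyne.com".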

Re: regex support RFC

2006-03-31 Thread Wincent Colaiuta
On 31/03/2006, at 14:37, Hrvoje Niksic wrote: "*" matches the previous character repeated 0 or more times. This is in contrast to wildcards, where "*" alone matches any character 0 or more times. (This is part of why regexps are often confusing to people used to the much simpler wildcard

Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi
Hrvoje Niksic wrote: Herold Heiko <[EMAIL PROTECTED]> writes: Get the best of both, use a syntax permitting a "first match-exits" ACL, single ACE permits several statements ANDed together. Cooking up a simple syntax for users without much regexp experience won't be easy. I assume ACL stands f

Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi
Hrvoje Niksic wrote: Mauro Tortonesi <[EMAIL PROTECTED]> writes: wget -r --filter=-domain:www-*.yoyodyne.com This appears to match "www.yoyodyne.com", "www--.yoyodyne.com", "www---.yoyodyne.com", and so on, if interpreted as a regex. not really. it would not match www.yoyodyne.com. Wh

Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
Mauro Tortonesi <[EMAIL PROTECTED]> writes: >wget -r --filter=-domain:www-*.yoyodyne.com This appears to match "www.yoyodyne.com", "www--.yoyodyne.com", "www---.yoyodyne.com", and so on, if interpreted as a regex. >>> >>> not really. it would not match www.yoyodyne.com. >> Why

RE: regex support RFC

2006-03-31 Thread Herold Heiko
> From: Oliver Schulze L. [mailto:[EMAIL PROTECTED] > My personal idea on this is to: enable regex in Unix and disable it on Windows. > We all use Unix/Linux and regex is really useful. I think not having We all use Unix/Linux? You would be surprised how many wget users on windows are

Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi
Curtis Hatter wrote: On Thursday 30 March 2006 13:42, Tony Lewis wrote: Perhaps --filter=path,i:/path/to/krs would work. That would look to be the most elegant method. I do hope that the (?i:) and (?-i:) constructs are supported since I may not want the entire path/file to be case (in)?sens

Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi
Oliver Schulze L. wrote: Hrvoje Niksic wrote: The regexp API's found on today's Unix systems might be usable, but unfortunately those are not available on Windows. My personal idea on this is to: enable regex in Unix and disable it on Windows. > We all use Unix/Linux and regex is really us

Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi
Hrvoje Niksic wrote: Mauro Tortonesi <[EMAIL PROTECTED]> writes: Scott Scriven wrote: * Mauro Tortonesi <[EMAIL PROTECTED]> wrote: wget -r --filter=-domain:www-*.yoyodyne.com This appears to match "www.yoyodyne.com", "www--.yoyodyne.com", "www---.yoyodyne.com", and so on, if interpr

Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
Mauro Tortonesi <[EMAIL PROTECTED]> writes: > Scott Scriven wrote: >> * Mauro Tortonesi <[EMAIL PROTECTED]> wrote: >> >>>wget -r --filter=-domain:www-*.yoyodyne.com >> This appears to match "www.yoyodyne.com", "www--.yoyodyne.com", >> "www---.yoyodyne.com", and so on, if interpreted as a regex

Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi
Scott Scriven wrote: * Mauro Tortonesi <[EMAIL PROTECTED]> wrote: wget -r --filter=-domain:www-*.yoyodyne.com This appears to match "www.yoyodyne.com", "www--.yoyodyne.com", "www---.yoyodyne.com", and so on, if interpreted as a regex. not really. it would not match www.yoyodyne.com. It

Re: regex support RFC

2006-03-30 Thread Scott Scriven
* [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > wget -e robots=off -r -N -k -E -p -H http://www.gnu.org/software/wget/ > > soon leads to non wget related links being downloaded, eg. > http://www.gnu.org/graphics/agnuhead.html In that particular case, I think --no-parent would solve the problem.

Re: regex support RFC

2006-03-30 Thread Oliver Schulze L.
Hrvoje Niksic wrote: The regexp APIs found on today's Unix systems might be usable, but unfortunately those are not available on Windows. My personal idea on this is to: enable regex in Unix and disable it on Windows. We all use Unix/Linux and regex is really useful. I think not having

Re: regex support RFC

2006-03-30 Thread Curtis Hatter
On Thursday 30 March 2006 13:42, Tony Lewis wrote: > Perhaps --filter=path,i:/path/to/krs would work. That would look to be the most elegant method. I do hope that the (?i:) and (?-i:) constructs are supported since I may not want the entire path/file to be case (in)?sensitive =), but that will

RE: regex support RFC

2006-03-30 Thread Tony Lewis
Curtis Hatter wrote: > Also any way to add modifiers to the regexs? Perhaps --filter=path,i:/path/to/krs would work. Tony

Re: regex support RFC

2006-03-30 Thread Scott Scriven
* Jim Wright <[EMAIL PROTECTED]> wrote: > Suppose you want files from some.dom.com://*/foo/*.png. The > part I'm thinking of here is "foo as last directory component, > and png as filename extension." Can the individual rules be > combined to express this? Only one rule is needed for that patter

Re: regex support RFC

2006-03-30 Thread Scott Scriven
* Mauro Tortonesi <[EMAIL PROTECTED]> wrote: > wget -r --filter=-domain:www-*.yoyodyne.com This appears to match "www.yoyodyne.com", "www--.yoyodyne.com", "www---.yoyodyne.com", and so on, if interpreted as a regex. It would most likely also match "www---zyoyodyneXcom". Perhaps you want glob
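
The over-matching described above comes from two things: "-*" allows zero hyphens, and an unescaped "." matches any character. A small sketch, using Python's `re` as a stand-in for the regexp engine wget would use, compares the pattern as written with a version whose dots are escaped; the escaped pattern is an illustration, not a proposal from the thread.

```python
import re

as_written = re.compile(r"www-*.yoyodyne.com")      # dots match any character
dots_escaped = re.compile(r"www-*\.yoyodyne\.com")  # dots match only "."

candidates = [
    "www.yoyodyne.com",
    "www--.yoyodyne.com",
    "www---zyoyodyneXcom",
]

for host in candidates:
    print(host,
          bool(as_written.fullmatch(host)),
          bool(dots_escaped.fullmatch(host)))
```

Escaping the dots removes the "www---zyoyodyneXcom" match, but "www.yoyodyne.com" is still accepted because "-*" is satisfied by zero hyphens.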

Re: regex support RFC

2006-03-30 Thread Curtis Hatter
On Thursday 30 March 2006 11:49, you wrote: > How many keywords do we need to provide maximum flexibility on the > components of the URI? (I'm thinking we need five.) > > Consider http://www.example.com/path/to/script.cgi?foo=bar > > --filter=uri:regex could match against any part of the URI > --fi

RE: regex support RFC

2006-03-30 Thread Tony Lewis
How many keywords do we need to provide maximum flexibility on the components of the URI? (I'm thinking we need five.) Consider http://www.example.com/path/to/script.cgi?foo=bar --filter=uri:regex could match against any part of the URI --filter=domain:regex could match against www.example.com --
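
To make the component question concrete, here is how the example URI splits with Python's standard `urllib.parse`. The dictionary keys `directory` and `file` are assumptions made for this sketch; only "uri" and "domain" are named in the visible part of the message above, and the split of the path into a directory and a file name is one plausible choice, not the proposal itself.

```python
import posixpath
from urllib.parse import urlsplit

uri = "http://www.example.com/path/to/script.cgi?foo=bar"
parts = urlsplit(uri)

components = {
    "uri": uri,                                   # the whole string
    "domain": parts.netloc,                       # host part
    "directory": posixpath.dirname(parts.path),   # assumed keyword
    "file": posixpath.basename(parts.path),       # assumed keyword
    "query": parts.query,                         # assumed keyword
}

for name, value in components.items():
    print(f"{name:10} {value}")
```

A filter keyword would then simply select which of these strings the regular expression is applied to.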

Re: regex support RFC

2006-03-30 Thread Curtis Hatter
On Wednesday 29 March 2006 12:05, you wrote: > we also have to reach consensus on the filtering algorithm. for > instance, should we simply require that a url passes all the filtering > rules to allow its download (just like the current -A/R behaviour), or > should we instead adopt a short circuit
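
The two candidate algorithms can give different answers for the same rule list. The sketch below is one possible reading of each, written in Python for illustration: `passes_all`, `first_match`, the `default` argument, and the sample rules are all hypothetical and are not taken from wget.

```python
import re

# Each rule is (accept, regex): accept=True models "+", accept=False models "-".
rules = [
    (True, r"\.pdf$"),     # keep PDFs
    (False, r"/drafts/"),  # drop anything under /drafts/
]


def passes_all(url, rules):
    """Every "+" rule must match and no "-" rule may match."""
    for accept, pattern in rules:
        matched = re.search(pattern, url) is not None
        if matched != accept:
            return False
    return True


def first_match(url, rules, default=True):
    """Short circuit: the first rule that matches decides the outcome."""
    for accept, pattern in rules:
        if re.search(pattern, url):
            return accept
    return default


for url in ("http://example.com/drafts/a.pdf", "http://example.com/index.html"):
    print(url, passes_all(url, rules), first_match(url, rules))
```

With these rules the draft PDF is rejected by the "all rules" reading but accepted by the short-circuit reading, so rule order matters only in the second scheme.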

Re: regex support RFC

2006-03-30 Thread Jim Wright
On Thu, 30 Mar 2006, Mauro Tortonesi wrote: > > > I do like the [file|path|domain]: approach. very nice and flexible. > > (and would be a huge help to one specific need I have!) I suggest also > > including an "any" option as a shortcut for putting the same pattern in > > all three options. > >

RE: regex support RFC

2006-03-30 Thread Herold Heiko
> From: Hrvoje Niksic [mailto:[EMAIL PROTECTED] > > I agree. Just how often will there be problems in a single > wget run due to > > both some.domain.com and somedomain.com present (famous last > > words...) > > Actually it would have to be somedomain.com -- a "." > will not match the null string

Re: regex support RFC

2006-03-30 Thread Hrvoje Niksic
Herold Heiko <[EMAIL PROTECTED]> writes: >> From: Hrvoje Niksic [mailto:[EMAIL PROTECTED] >> I don't think such a thing is necessary in practice, though; remember >> that even if you don't escape the dot, it still matches the (intended) >> dot, along with other characters. So for quick&dirty usag

RE: regex support RFC

2006-03-30 Thread Herold Heiko
> From: Hrvoje Niksic [mailto:[EMAIL PROTECTED] > I don't think such a thing is necessary in practice, though; remember > that even if you don't escape the dot, it still matches the (intended) > dot, along with other characters. So for quick&dirty usage not > escaping dots will "just work", and th

Re: regex support RFC

2006-03-30 Thread Hrvoje Niksic
Herold Heiko <[EMAIL PROTECTED]> writes: > Get the best of both, use a syntax permitting a "first match-exits" > ACL, single ACE permits several statements ANDed together. Cooking > up a simple syntax for users without much regexp experience won't be > easy. I assume ACL stands for "access contro

Re: regex support RFC

2006-03-30 Thread Hrvoje Niksic
Herold Heiko <[EMAIL PROTECTED]> writes: > BTW any comments about the dots ? Requiring escaped dots in domains would > become old really fast, reversing behaviour (\. = any char) would be against > the principle of least surprise, since any other regexp syntax does use the > opposite. Modifying t

RE: regex support RFC

2006-03-30 Thread Herold Heiko
[Imagination running freely, I do not have a lot of experience designing syntax, but I suffer a lot in a helpdeskish way trying to explain syntax to users. Hopefully this can be somehow useful] > we also have to reach consensus on the filtering algorithm. for instance, should we simply require

Re: regex support RFC

2006-03-30 Thread Mauro Tortonesi
Jim Wright wrote: what definition of regexp would you be following? that's another degree of freedom. hrvoje and i have chosen to integrate the GNU regex implementation into wget, which allows the use of one of these different syntaxes: RE_SYNTAX_EMACS RE_SYNTAX_AWK RE_SYNTAX_GNU_AWK

Re: regex support RFC

2006-03-29 Thread Hrvoje Niksic
Jim Wright <[EMAIL PROTECTED]> writes: > what definition of regexp would you be following? or would this be > making up something new? It wouldn't be new, Mauro is definitely referring to regexps as normally understood. The regexp API's found on today's Unix systems might be usable, but unfortu

Re: regex support RFC

2006-03-29 Thread Hrvoje Niksic
Mauro Tortonesi <[EMAIL PROTECTED]> writes: > for instance, the syntax for --filter presented above is basically the > following: > > --filter=[+|-][file|path|domain]:REGEXP I think there should also be "url" for filtering on the entire URL. People have been asking for that kind of thing a lot ov
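
The option syntax quoted above is simple enough to parse with a single pattern. The following Python sketch is an assumption-laden illustration: `FilterRule` and `parse_filter` are hypothetical names, and only the three classes from the quoted syntax are accepted (the suggested "url" class would be one more alternative in the pattern).

```python
import re
from typing import NamedTuple


class FilterRule(NamedTuple):
    accept: bool         # True for "+", False for "-"
    field: str           # "file", "path" or "domain"
    pattern: re.Pattern  # compiled REGEXP part


# [+|-][file|path|domain]:REGEXP
_SPEC = re.compile(r"^([+-])(file|path|domain):(.*)$", re.DOTALL)


def parse_filter(spec: str) -> FilterRule:
    """Parse the value given to the proposed --filter option."""
    m = _SPEC.match(spec)
    if m is None:
        raise ValueError(f"malformed filter spec: {spec!r}")
    sign, field, regexp = m.groups()
    return FilterRule(sign == "+", field, re.compile(regexp))


rule = parse_filter(r"-domain:www-*\.yoyodyne\.com")
print(rule.accept, rule.field, rule.pattern.pattern)
```

Everything after the first colon is taken as the regular expression, so colons inside the REGEXP need no escaping in this sketch.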

Re: regex support RFC

2006-03-29 Thread TPCnospam
> for instance, the syntax for --filter presented above is basically the > following: > > --filter=[+|-][file|path|domain]:REGEXP I think a file 'contents' regexp search facility would be a useful addition here. eg. --filter=[+|-][file|path|domain|contents]:REGEXP The idea is that if the fi

Re: regex support RFC

2006-03-29 Thread Jim Wright
what definition of regexp would you be following? or would this be making up something new? I'm not quite understanding the comment about the comma and needing escaping for literal commas. this is true for any character in the regexp language, so why the special concern for comma? I do like the

regex support RFC

2006-03-29 Thread Mauro Tortonesi
hrvoje and i have recently been talking about adding regex support to wget. we were considering adding a new --filter option which, by supporting regular expressions, would allow more powerful ways of filtering urls to download. for instance, the new option could allow the filtering of domain