Re: regex support RFC

2006-04-03 Thread Mauro Tortonesi

Curtis Hatter wrote:

On Friday 31 March 2006 06:52, Mauro Tortonesi:


while i like the idea of supporting modifiers like "quick" (short
circuit) and maybe "i" (case insensitive comparison), i think that (?i:)
and (?-i:) constructs would be overkill and rather hard to implement.


I figured that the (?i:) and (?-i:) constructs would be provided by the 
regular expression engine and that the --filter switch would simply be able 
to use any construct provided by that engine.


i know, that would be really nice.

If, as you said, this would be hard to implement or require extra effort by 
you that is above and beyond that required for the more "standard" constructs 
then I would say that they shouldn't be implemented; at least at first.


i agree.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-04-03 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

"Tony Lewis" <[EMAIL PROTECTED]> writes:


I don't think ",r" complicates the command that much. Internally,
the only additional work for supporting both globs and regular
expressions is a function that converts a glob into a regexp when
",r" is not requested.  That's a straightforward transformation.


",r" makes it harder to input regexps, which are the whole point of
introducing --filter.  Besides, having two different syntaxes for the
same switch, and for no good reason, is not really acceptable, even if
the implementation is straightforward.


i agree 100%. and don't forget that globs are already supported by 
current filtering options.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-04-03 Thread Mauro Tortonesi

Tony Lewis wrote:

Hrvoje Niksic wrote:


I don't see a clear line that connects --filter to glob patterns as used
by the shell.


I want to list all PDFs in the shell, ls -l *.pdf

I want a filter to keep all PDFs, --filter=+file:*.pdf


you don't need --filter for that. you can simply use -A.


I predict that the vast majority of bug reports and support requests will be
for users who are trying a glob rather than a regular expression.


you might be right about this. but i think your point of view is 
somewhat flawed. hrvoje and i designed --filter to extend current wget 
filtering capabilities, not to replace them. in this sense, --filter 
should be used only when regex filtering capabilities are needed. if 
not, -A/R & company are just fine.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


RE: regex support RFC

2006-03-31 Thread Sandhu, Ranjit
I agree with Tony. I think most basic users, me included, thought
www-*.yoyodyne.com would not match www.yoyodyne.com.

Support globs as default, regexp as the more powerful option.

Ranjit Sandhu
SRA
 

-Original Message-
From: Tony Lewis [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 31, 2006 10:03 AM
To: wget@sunsite.dk
Subject: RE: regex support RFC

Mauro Tortonesi wrote: 

> no. i was talking about regexps. they are more expressive and powerful
> than simple globs. i don't see what's the point in supporting both.

The problem is that users who are expecting globs will try things like
--filter=-file:*.pdf rather than --filter:-file:.*\.pdf. In many cases
their expressions will simply work, which will result in significant
confusion when some expression doesn't work, such as
--filter:-domain:www-*.yoyodyne.com. :-)

It is pretty easy to programmatically convert a glob into a regular
expression. One possibility is to make glob the default input and allow
regular expressions. For example, the following could be equivalent:

--filter:-domain:www-*.yoyodyne.com
--filter:-domain,r:www-.*\.yoyodyne\.com

Internally, wget would convert the first into the second and then treat
it as a regular expression. For the vast majority of cases, glob will
work just fine.

One might argue that it's a lot of work to implement regular expressions
if the default input format is a glob, but I think we should aim for
both lack of confusion and robust functionality. Using ",r" means people
get regular expressions when they want them and know what they're doing.
The universe of wget users who "know what they're doing" are mostly
subscribed to this mailing list; the rest of them send us mail saying
"please CC me as I'm not on the list". :-)

If we go this route, I'm wondering if the appropriate conversion from
glob to regular expression should take directory separators into
account, such
as:

--filter:-path:path/to/*

becoming the same as:

--filter:-path,r:path/to/[^/]*

or even:

--filter:-path,r:path[/\\]to[/\\][^/\\]*

Should the glob match "path/to/sub/dir"? (I suspect it shouldn't.)

Tony




Re: regex support RFC

2006-03-31 Thread Scott Scriven
* Mauro Tortonesi <[EMAIL PROTECTED]> wrote:
>> I'm hoping for ... a "raw" type in addition to "file",
>> "domain", etc.
> 
> do you mean you would like to have a regex class working on the
> content of downloaded files as well?

Not exactly.  (details below)

> i don't like your "raw" proposal as it is HTML-specific. i
> would like instead to develop a mechanism which could work for
> all supported protocols.

I see.  It would be problematic for other protocols.  :(
A raw match would be more complicated than I originally thought,
because it is HTML-specific and uses extra data which isn't
currently available to the filters.

Would it be feasible to make "raw" simply return the full URI
when the document is not HTML?

I think there is some value in matching based on the entire link
tag, instead of just the URI.  Wget already has --follow-tags and
--ignore-tags, and a "raw" match would be like an extension to
that concept.  I would find it useful to be able to filter
according to things which are not part of the URI.  For example:

  follow: article
  skip:   buy now

Either the class property or the visible link text could be used
to decide if the link is worth following, but the URI in this
case is pretty useless.

It may need to be a different option; use "--filter" to filter
the URI list, and use "--filter-tag" earlier in the process (same
place as "--follow-tags"), to help generate the URI list.
Regardless, I think it would be useful.

Any thoughts?


-- Scott


Re: regex support RFC

2006-03-31 Thread TPCnospam
> * [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> > wget -e robots=off -r -N -k -E -p -H http://www.gnu.org/software/wget/
> > 
> > soon leads to non wget related links being downloaded, eg. 
> > http://www.gnu.org/graphics/agnuhead.html
> 
> In that particular case, I think --no-parent would solve the
> problem.

No.  The idea is not to be restricted to not descending the tree. 

> 
> Maybe I misunderstood, though.  It seems awfully risky to use -r
> and -H without having something to strictly limit the links
> followed.  So, I suppose the content filter would be an effective
> way to make cross-host downloading safer.

Absolutely.  That is why I proposed a 'contents' regexp.

> 
> I think I'd prefer to have a different option, for that sort of
> thing -- filter by using external programs.  If the program
> returns a specific code, follow the link or recurse into the
> links contained in the file.  Then you could do far more complex
> filtering, including things like interactive pruning.

True.  That could be a future feature request, but now that the wget team 
are writing regexp code, it seems an ideal time to implement it.  By 
constructing suitable regexps, one could use this feature to search for 
any string in the html file (as above), or just in metatags etc.  IMHO it 
gives a lot of flexibility for little extra programming effort.

Any comments, Mauro & Hrvoje?

Thanks
Tom Crane

-- 
Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,
Egham, Surrey, TW20 0EX, England. 
Email:  [EMAIL PROTECTED]
Fax:+44 (0) 1784 472794


Re: regex support RFC

2006-03-31 Thread Curtis Hatter
On Friday 31 March 2006 06:52, Mauro Tortonesi:
> while i like the idea of supporting modifiers like "quick" (short
> circuit) and maybe "i" (case insensitive comparison), i think that (?i:)
> and (?-i:) constructs would be overkill and rather hard to implement.

I figured that the (?i:) and (?-i:) constructs would be provided by the 
regular expression engine and that the --filter switch would simply be able 
to use any construct provided by that engine.

I was more trying to argue for the use of a regex engine that supports such 
constructs (like Perl's). Some other constructs I find useful are: (?=), 
(?!=), (?)

These may be overkill but I would rather have the expressiveness of a regex 
engine like Perl's when I need it instead of writing regexes in another engine 
that have to be twice as long to compensate for the lack of language 
constructs. Those who don't want to use them, or don't know of them, can 
write regexes as normal.

If, as you said, this would be hard to implement or require extra effort by 
you that is above and beyond that required for the more "standard" constructs 
then I would say that they shouldn't be implemented; at least at first.

Curtis


RE: regex support RFC

2006-03-31 Thread Tony Lewis
Hrvoje Niksic wrote:

> I don't see a clear line that connects --filter to glob patterns as used
> by the shell.

I want to list all PDFs in the shell, ls -l *.pdf

I want a filter to keep all PDFs, --filter=+file:*.pdf

Note that "*.pdf" is not a valid regular expression even though it's what
most people will try naturally. Perl complains:
/*.pdf/: ?+*{} follows nothing in regexp

I predict that the vast majority of bug reports and support requests will be
for users who are trying a glob rather than a regular expression.

Tony



Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
"Tony Lewis" <[EMAIL PROTECTED]> writes:

> I didn't miss the point at all. I'm trying to make a completely different
> one, which is that regular expressions will confuse most users (even if you
> tell them that the argument to --filter is a regular expression).

Well, "most users" will probably not use --filter in the first place.
Those that do will have to look at the documentation where they'll
find that it accepts regexps.  Since Wget is hardly the first program
to use regexps, I don't see why most users would be confused by that
choice.

> Yes, regular expressions are used elsewhere on Unix, but not
> everywhere. The shell is the most obvious comparison for user input
> dealing with expressions that select multiple objects; the shell
> uses globs.

I don't see a clear line that connects --filter to glob patterns as
used by the shell.  If anything, the connection is with grep and other
commands that provide powerful filtering (awk and Perl's //
operators), which all seem to work on regexps.  Where the context can
be thought of as shell-like (as in wget ftp://blah/*), Wget happily
obliges by providing shell-compatible patterns.

> I don't think ",r" complicates the command that much. Internally,
> the only additional work for supporting both globs and regular
> expressions is a function that converts a glob into a regexp when
> ",r" is not requested.  That's a straightforward transformation.

",r" makes it harder to input regexps, which are the whole point of
introducing --filter.  Besides, having two different syntaxes for the
same switch, and for no good reason, is not really acceptable, even if
the implementation is straightforward.


RE: regex support RFC

2006-03-31 Thread Tony Lewis
Hrvoje Niksic wrote: 

> But that misses the point, which is that we *want* to make the
> more expressive language, already used elsewhere on Unix, the
> default.

I didn't miss the point at all. I'm trying to make a completely different
one, which is that regular expressions will confuse most users (even if you
tell them that the argument to --filter is a regular expression). This
mailing list will get a huge number of bug reports when users try to use
globs that fail.

Yes, regular expressions are used elsewhere on Unix, but not everywhere. The
shell is the most obvious comparison for user input dealing with expressions
that select multiple objects; the shell uses globs.

Personally, I will be quite happy if --filter only supports regular
expressions because I've been using them quite effectively for years. I just
don't think the same thing can be said for the typical wget user. We've
already had disagreements in this chain about what would match a particular
regular expression; I suspect everyone involved in the conversation could
have correctly predicted what the equivalent glob would do.

I don't think ",r" complicates the command that much. Internally, the only
additional work for supporting both globs and regular expressions is a
function that converts a glob into a regexp when ",r" is not requested.
That's a straightforward transformation.

Tony



Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
"Tony Lewis" <[EMAIL PROTECTED]> writes:

> Mauro Tortonesi wrote: 
>
>> no. i was talking about regexps. they are more expressive
>> and powerful than simple globs. i don't see what's the
>> point in supporting both.
>
> The problem is that users who are expecting globs will try things like
> --filter=-file:*.pdf

The --filter command will be documented from the start to support
regexps.  Since most Unix utilities work with regexps and very few
with globs (excepting the shell), this should not be a problem.

> It is pretty easy to programmatically convert a glob into a regular
> expression.

But it's harder to document and explain, and it requires more options
and logic.  Supporting two different syntaxes (the old one for
backward compatibility) is bad enough: supporting three is at least
one too many.

> One possibility is to make glob the default input and allow regular
> expressions. For example, the following could be equivalent:
>
> --filter:-domain:www-*.yoyodyne.com
> --filter:-domain,r:www-.*\.yoyodyne\.com

But that misses the point, which is that we *want* to make the more
expressive language, already used elsewhere on Unix, the default.


RE: regex support RFC

2006-03-31 Thread Tony Lewis
Mauro Tortonesi wrote: 

> no. i was talking about regexps. they are more expressive
> and powerful than simple globs. i don't see what's the
> point in supporting both.

The problem is that users who are expecting globs will try things like
--filter=-file:*.pdf rather than --filter:-file:.*\.pdf. In many cases their
expressions will simply work, which will result in significant confusion
when some expression doesn't work, such as
--filter:-domain:www-*.yoyodyne.com. :-)

It is pretty easy to programmatically convert a glob into a regular
expression. One possibility is to make glob the default input and allow
regular expressions. For example, the following could be equivalent:

--filter:-domain:www-*.yoyodyne.com
--filter:-domain,r:www-.*\.yoyodyne\.com

Internally, wget would convert the first into the second and then treat it
as a regular expression. For the vast majority of cases, glob will work just
fine.
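
A minimal sketch of such a conversion in Perl (the glob_to_regex name and
the details are illustrative assumptions, not wget code; bracket
expressions like [a-z] are not handled):

#!/usr/bin/perl
use strict;
use warnings;

sub glob_to_regex {
    my ($glob) = @_;
    my $re = '';
    for my $ch (split //, $glob) {
        if    ($ch eq '*') { $re .= '.*' }           # glob * -> regex .*
        elsif ($ch eq '?') { $re .= '.'  }           # glob ? -> regex .
        else               { $re .= quotemeta $ch }  # escape . and friends
    }
    return "^$re\$";    # a glob matches the whole string, so anchor it
}

print glob_to_regex('www-*.yoyodyne.com'), "\n";
# prints: ^www\-.*\.yoyodyne\.com$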

One might argue that it's a lot of work to implement regular expressions if
the default input format is a glob, but I think we should aim for both lack
of confusion and robust functionality. Using ",r" means people get regular
expressions when they want them and know what they're doing. The universe of
wget users who "know what they're doing" are mostly subscribed to this
mailing list; the rest of them send us mail saying "please CC me as I'm not
on the list". :-)

If we go this route, I'm wondering if the appropriate conversion from glob
to regular expression should take directory separators into account, such
as:

--filter:-path:path/to/*

becoming the same as:

--filter:-path,r:path/to/[^/]*

or even:

--filter:-path,r:path[/\\]to[/\\][^/\\]*

Should the glob match "path/to/sub/dir"? (I suspect it shouldn't.)

Tony



Re: regex support RFC

2006-03-31 Thread Oliver Schulze L.

Mauro Tortonesi wrote:
for consistency and to avoid maintenance problems, i would like wget 
to have the same behavior on windows and unix. please, notice that if 
we implemented regex support only on unix, windows binaries of wget 
built with cygwin would have regex support but native binaries 
wouldn't. that would be very confusing for windows users, IMHO.

Ok, I understand.
I was thinking of an #ifdef in the source code so you can:
- enable all regex code/command line parameters in Unix/Linux
- at runtime, print the error "regex not yet supported on windows" if 
any regex related command line parameter is passed to wget on 
windows/cygwin

this is planned for wget 1.12 (which might become 2.0). i already have 
some code implementing the connection cache data structure.

Excellent!

URL regex
this is planned for wget 1.11. i've already started working on it.

looking forward to it, many thanks!

--
Oliver Schulze L.
<[EMAIL PROTECTED]>



Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

Wincent Colaiuta <[EMAIL PROTECTED]> writes:



Are you sure that "www-*" matches "www"?


Yes.


hrvoje is right. try this perl script:


#!/usr/bin/perl -w

use strict;

# the unescaped dots match any character, and "-*" matches zero or
# more hyphens, so both strings satisfy the pattern
my @strings = ("www-.yoyodyne.com",
               "www.yoyodyne.com");

foreach my $str (@strings) {
    $str =~ /www-*.yoyodyne.com/ or print "$str doesn't match\n";
}


both strings match.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
Wincent Colaiuta <[EMAIL PROTECTED]> writes:

> Are you sure that "www-*" matches "www"?

Yes.

> As far as I know "www-*" matches "one w, another w, a third w, a
> hyphen, then 0 or more hyphens".

That would be "www--*" or "www-+".


Re: regex support RFC

2006-03-31 Thread Wincent Colaiuta

El 31/03/2006, a las 14:37, Hrvoje Niksic escribió:


"*" matches the previous character repeated 0 or more times.  This is
in contrast to wildcards, where "*" alone matches any character 0 or
more times.  (This is part of why regexps are often confusing to
people used to the much simpler wildcards.)

Therefore "www-*" matches "www", "www-", "www--", etc., i.e. Scott's
interpretation was correct.  What you describe is achieved with the
"www-.*.yoyodyne.com".


Are you sure that "www-*" matches "www"?

As far as I know "www-*" matches "one w, another w, a third w, a  
hyphen, then 0 or more hyphens". In other words, "www" does not match.


Wincent






Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

Herold Heiko <[EMAIL PROTECTED]> writes:


Get the best of both: use a syntax permitting a "first match exits"
ACL, where a single ACE permits several statements ANDed together. Cooking
up a simple syntax for users without much regexp experience won't be
easy.


I assume ACL stands for "access control list", but what is ACE?


access control entry, i guess.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

Mauro Tortonesi <[EMAIL PROTECTED]> writes:


wget -r --filter=-domain:www-*.yoyodyne.com


This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
"www---.yoyodyne.com", and so on, if interpreted as a regex.


not really. it would not match www.yoyodyne.com.


Why not?


i may be wrong, but if - is not a special character, the previous
expression should match only domains starting with www- and ending
in [randomchar]yoyodyne[randomchar]com.


"*" matches the previous character repeated 0 or more times.  This is
in contrast to wildcards, where "*" alone matches any character 0 or
more times.  (This is part of why regexps are often confusing to
people used to the much simpler wildcards.)

Therefore "www-*" matches "www", "www-", "www--", etc., i.e. Scott's
interpretation was correct.  What you describe is achieved with the
"www-.*.yoyodyne.com".


you're right. ok, it is official. i must stop drinking this much - it 
just doesn't work. i have to start drinking less or, even better, more.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
Mauro Tortonesi <[EMAIL PROTECTED]> writes:

>wget -r --filter=-domain:www-*.yoyodyne.com

This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
"www---.yoyodyne.com", and so on, if interpreted as a regex.
>>>
>>> not really. it would not match www.yoyodyne.com.
>> Why not?
>
> i may be wrong, but if - is not a special character, the previous
> expression should match only domains starting with www- and ending
> in [randomchar]yoyodyne[randomchar]com.

"*" matches the previous character repeated 0 or more times.  This is
in contrast to wildcards, where "*" alone matches any character 0 or
more times.  (This is part of why regexps are often confusing to
people used to the much simpler wildcards.)

Therefore "www-*" matches "www", "www-", "www--", etc., i.e. Scott's
interpretation was correct.  What you describe is achieved with the
"www-.*.yoyodyne.com".


RE: regex support RFC

2006-03-31 Thread Herold Heiko
> From: Oliver Schulze L. [mailto:[EMAIL PROTECTED]
> My personal idea on this is to: enable regex in Unix and 
> disable it on 
> Windows.
> 
> We all use Unix/Linux and regex is really useful. I think not having 

We all use Unix/Linux? You would be surprised how many wget users on
windows are out there.

Besides that, Those Who Know The Code better than me should consider how bad
the portability issues in using native regexp engines could be.
Are the interfaces and capabilities all the same, or are there consistent
differences between the various flavors (gnu, several BSD, hpux, aix, sunos,
solaris, older flavours...)? If so, that would be a point favouring an
external library (hopefully supported on as many flavours as possible).

Heiko

-- 
-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold [EMAIL PROTECTED] [EMAIL PROTECTED]
-- +39-041-5907073 / +39-041-5917073 ph
-- +39-041-5907472 / +39-041-5917472 fax


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Curtis Hatter wrote:

On Thursday 30 March 2006 13:42, Tony Lewis wrote:


Perhaps --filter=path,i:/path/to/krs would work.


That would look to be the most elegant method. I do hope that the (?i:) and 
(?-i:) constructs are supported since I may not want the entire path/file to 
be case (in)?sensitive =), but that will depend on the regex engine chosen.


while i like the idea of supporting modifiers like "quick" (short 
circuit) and maybe "i" (case insensitive comparison), i think that (?i:) 
and (?-i:) constructs would be overkill and rather hard to implement.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Oliver Schulze L. wrote:

Hrvoje Niksic wrote:


 The regexp API's found on today's Unix systems
might be usable, but unfortunately those are not available on Windows.


My personal idea on this is to: enable regex in Unix and disable it on 
Windows.

We all use Unix/Linux and regex is really useful. I think not having 
regex on
Windows will not do any more harm than it is doing now (not having it at 
all)


for consistency and to avoid maintenance problems, i would like wget to 
have the same behavior on windows and unix. please, notice that if we 
implemented regex support only on unix, windows binaries of wget built 
with cygwin would have regex support but native binaries wouldn't. that 
would be very confusing for windows users, IMHO.


I hope wget can get connection cache, 


this is planned for wget 1.12 (which might become 2.0). i already have 
some code implementing the connection cache data structure.


URL regex 


this is planned for wget 1.11. i've already started working on it.


and advanced mirror functions (sync 2 folders) in the near future.


this is very interesting.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

Mauro Tortonesi <[EMAIL PROTECTED]> writes:



Scott Scriven wrote:


* Mauro Tortonesi <[EMAIL PROTECTED]> wrote:



wget -r --filter=-domain:www-*.yoyodyne.com


This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
"www---.yoyodyne.com", and so on, if interpreted as a regex.


not really. it would not match www.yoyodyne.com. 


Why not?


i may be wrong, but if - is not a special character, the previous 
expression should match only domains starting with www- and ending in 
[randomchar]yoyodyne[randomchar]com.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-03-31 Thread Hrvoje Niksic
Mauro Tortonesi <[EMAIL PROTECTED]> writes:

> Scott Scriven wrote:
>> * Mauro Tortonesi <[EMAIL PROTECTED]> wrote:
>>
>>>wget -r --filter=-domain:www-*.yoyodyne.com
>> This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
>> "www---.yoyodyne.com", and so on, if interpreted as a regex.
>
> not really. it would not match www.yoyodyne.com.

Why not?

>> Perhaps you want glob patterns instead?  I know I wouldn't mind
>> having glob patterns in addition to regexes...  glob is much
>> easier when you're not doing complex matches.
>
> no. i was talking about regexps. they are more expressive and
> powerful than simple globs. i don't see what's the point in
> supporting both.

I agree with this.


Re: regex support RFC

2006-03-31 Thread Mauro Tortonesi

Scott Scriven wrote:

* Mauro Tortonesi <[EMAIL PROTECTED]> wrote:


wget -r --filter=-domain:www-*.yoyodyne.com


This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
"www---.yoyodyne.com", and so on, if interpreted as a regex.


not really. it would not match www.yoyodyne.com.

It would most likely also match "www---zyoyodyneXcom".  


yes.


Perhaps you want glob patterns instead?  I know I wouldn't mind having
glob patterns in addition to regexes...  glob is much easier
when you're not doing complex matches.


no. i was talking about regexps. they are more expressive and powerful 
than simple globs. i don't see what's the point in supporting both.



If I had to choose just one though, I'd prefer to use PCRE,
Perl-Compatible Regular Expressions.  They offer a richer, more
concise syntax than traditional regexes, such as \d instead of
[:digit:] or [0-9].


i agree, but adding a dependency on PCRE would be asking for 
infinite maintenance nightmares. and i don't know if we can simply 
bundle code from PCRE in wget, as it has a BSD license.



--filter=[+|-][file|path|domain]:REGEXP

is it consistent? is it flawed? is there a more convenient one?


It seems like a good idea, but wouldn't actually provide the
regex-filtering features I'm hoping for unless there was a "raw"
type in addition to "file", "domain", etc.  I'll give details
below.  Basically, I need to match based on things like the
inline CSS data, the visible link text, etc.


do you mean you would like to have a regex class working on the content 
of downloaded files as well?



Below is the original message I sent to the wget list a few
months ago, about this same topic:

=
I'd find it useful to guide wget by using regular expressions to
control which links get followed.  For example, to avoid
following links based on embedded css styles or link text.

I've needed this several times, but the most recent was when I
wanted to avoid following any "add to cart" or "buy" links on a
site which uses GET parameters instead of directories to select
content.

Given a link like this...

<a href="http://www.foo.com/forums/gallery2.php?g2_controller=cart.AddToCart&g2_itemId=11436&g2_return=http%3A%2F%2Fwww.foo.com%2Fforums%2Fgallery2.php%3Fg2_view%3Dcore.ShowItem%26g2_itemId%3D11436%26g2_page%3D4%26g2_GALLERYSID%3D1d78fb5be7613cc31d33f7dfe7fbac7b&g2_GALLERYSID=1d78fb5be7613cc31d33f7dfe7fbac7b&g2_returnName=album"
 class="gbAdminLink gbAdminLink gbLink-cart_AddToCart">add to cart</a>

... a useful parameter could be --ignore-regex='AddToCart|add to cart'
so the class or link text (really, anything inside the tag) could
be used to decide whether the link should be followed.

Or...  if there's already a way to do this, let me know.  I
didn't see anything in the docs, but I may have missed something.

:)
=

I think what I want could be implemented via the --filter option,
with a few small modifications to what was proposed.  I'm not
sure exactly what syntax to use, but it should be able to specify
whether to include/exclude the link, which PCRE flags to use, how
much of the raw HTML tag to use as input, and what pattern to use
for matching.  Here's an idea:

  --filter=[allow][flags,][scope][:]pattern

Example:

  '--filter=-i,raw:add ?to ?cart'
  (the quotes are there only to make the shell treat it as one parameter)

The details are:

  "allow" is "+" for "include" or "-" for "exclude".
  It defaults to "+" if omitted.

  "flags," is a set of letters to control regex options, followed
  by a comma (to separate it from scope).  For example, "i"
  specifies a case-insensitive search.  These would be the same
  flags that perl appends to the end of search patterns.  So,
  instead of "/foo/i", it would be "--filter=+i,:foo"

  "scope" controls how much of the  or similar tag gets used
  as input to the regex.  Values include:
raw: use the entire tag and all contents (default)
 bar
domain: use only the domain name
 www.example.com
file: use only the file name
 foo.ext
path: use the directory, but not the file name
 /path/to
others...  can be added as desired

  ":" is required if "allow" or "flags" or "scope" is given

So, for example, to exclude the "add to cart" links in my
previous post, this could be used:

  --filter=-raw:'AddToCart|add to cart'
or
  --filter=-raw:AddToCart\|add\ to\ cart
or
  --filter=-:'AddToCart|add to cart'
or
  --filter=-i,raw:'add ?to ?cart'

Alternately, the --filter option could be split into two options:
one for including content, and one for excluding.  This would be
more consistent with wget's existing parameters, and would
slightly simplify the syntax.

I hope I haven't been too full of hot air.  This is a feature I've
wanted in wget for a long time, and I'm a bit excited that it
might happen soon.  :)


i don't like your "raw" proposal as it is HTML-specific. i would like 
instead to develop a mechanism which could work for all supported 
protocols.

Re: regex support RFC

2006-03-30 Thread Scott Scriven
* [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> wget -e robots=off -r -N -k -E -p -H http://www.gnu.org/software/wget/
> 
> soon leads to non wget related links being downloaded, eg. 
> http://www.gnu.org/graphics/agnuhead.html

In that particular case, I think --no-parent would solve the
problem.

Maybe I misunderstood, though.  It seems awfully risky to use -r
and -H without having something to strictly limit the links
followed.  So, I suppose the content filter would be an effective
way to make cross-host downloading safer.

I think I'd prefer to have a different option, for that sort of
thing -- filter by using external programs.  If the program
returns a specific code, follow the link or recurse into the
links contained in the file.  Then you could do far more complex
filtering, including things like interactive pruning.


-- Scott


Re: regex support RFC

2006-03-30 Thread Oliver Schulze L.

Hrvoje Niksic wrote:

 The regexp API's found on today's Unix systems
might be usable, but unfortunately those are not available on Windows.
  
My personal idea on this is to: enable regex in Unix and disable it on 
Windows.


We all use Unix/Linux and regex is really useful. I think not having 
regex on
Windows will not do any more harm than it is doing now (not having it at 
all)


I hope wget can get connection cache, URL regex and advanced mirror functions
(sync 2 folders) in the near future.
That's all I still want from wget and still could not find in other 
OSS software.


Thanks
Oliver

--
Oliver Schulze L.
<[EMAIL PROTECTED]>



Re: regex support RFC

2006-03-30 Thread Curtis Hatter
On Thursday 30 March 2006 13:42, Tony Lewis wrote:
> Perhaps --filter=path,i:/path/to/krs would work.

That would look to be the most elegant method. I do hope that the (?i:) and 
(?-i:) constructs are supported since I may not want the entire path/file to 
be case (in)?sensitive =), but that will depend on the regex engine chosen.

Curtis


RE: regex support RFC

2006-03-30 Thread Tony Lewis
Curtis Hatter wrote:

> Also any way to add modifiers to the regexs? 

Perhaps --filter=path,i:/path/to/krs would work.

Tony



Re: regex support RFC

2006-03-30 Thread Scott Scriven
* Jim Wright <[EMAIL PROTECTED]> wrote:
> Suppose you want files from some.dom.com://*/foo/*.png.  The
> part I'm thinking of here is "foo as last directory component,
> and png as filename extension."  Can the individual rules be
> combined to express this?

Only one rule is needed for that pattern:

  some.dom.com/.*/foo/[^/]*.png$

The file/path/domain specifiers would actually get in the way for
this type of matching.  What you'd really want is either a full
URI/URL match, or a raw tag match.  Either would work with the
above pattern.


-- Scott


Re: regex support RFC

2006-03-30 Thread Scott Scriven
* Mauro Tortonesi <[EMAIL PROTECTED]> wrote:
> wget -r --filter=-domain:www-*.yoyodyne.com

This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
"www---.yoyodyne.com", and so on, if interpreted as a regex.
It would most likely also match "www---zyoyodyneXcom".  Perhaps
you want glob patterns instead?  I know I wouldn't mind having
glob patterns in addition to regexes...  glob is much easier
when you're not doing complex matches.

If I had to choose just one though, I'd prefer to use PCRE,
Perl-Compatible Regular Expressions.  They offer a richer, more
concise syntax than traditional regexes, such as \d instead of
[:digit:] or [0-9].

> --filter=[+|-][file|path|domain]:REGEXP
> 
> is it consistent? is it flawed? is there a more convenient one?

It seems like a good idea, but wouldn't actually provide the
regex-filtering features I'm hoping for unless there was a "raw"
type in addition to "file", "domain", etc.  I'll give details
below.  Basically, I need to match based on things like the
inline CSS data, the visible link text, etc.

> please notice that supporting multiple comma-separated regexp in a 
> single --filter option:
> 
> --filter=[+|-][file|path|domain]:REGEXP1,REGEXP2,...

Commas for multiple regexes are unnecessary.  Regexes already
have an "or" operator built in.  If you want to match "fee" or
"fie" or "foe" or "fum", the pattern is fee|fie|foe|fum.

> we also have to reach consensus on the filtering algorithm. for
> instance, should we simply require that a url passes all the
> filtering rules to allow its download (just like the current
> -A/R behaviour), or should we instead adopt a short circuit
> algorithm that applies all rules in the same order in which
> they were given in the command line and immediately allows the
> download of an url if it passes the first "allow" match?

Regexes implicitly have "or" functionality built in, via the pipe
operator.  They also have "and" built in simply by extending the
pattern.  To require both "foo" and "bar" in a match, you could
do something like "foo.*bar|bar.*foo".  So, it's not strictly
necessary to support more than one regex unless you specify both
an include pattern and an exclude pattern.

However, if multiple patterns are supported, I think it would be
more helpful to implement them as "and" rather than "or".  This
is just because "and" doubles the length of the filter, so it may
be more convenient to say "--filter=foo --filter=bar" than
"--filter='foo.*bar|bar.*foo'".


Below is the original message I sent to the wget list a few
months ago, about this same topic:

=
I'd find it useful to guide wget by using regular expressions to
control which links get followed.  For example, to avoid
following links based on embedded css styles or link text.

I've needed this several times, but the most recent was when I
wanted to avoid following any "add to cart" or "buy" links on a
site which uses GET parameters instead of directories to select
content.

Given a link like this...

<a href="http://www.foo.com/forums/gallery2.php?g2_controller=cart.AddToCart&g2_itemId=11436&g2_return=http%3A%2F%2Fwww.foo.com%2Fforums%2Fgallery2.php%3Fg2_view%3Dcore.ShowItem%26g2_itemId%3D11436%26g2_page%3D4%26g2_GALLERYSID%3D1d78fb5be7613cc31d33f7dfe7fbac7b&g2_GALLERYSID=1d78fb5be7613cc31d33f7dfe7fbac7b&g2_returnName=album"
 class="gbAdminLink gbAdminLink gbLink-cart_AddToCart">add to cart</a>

... a useful parameter could be --ignore-regex='AddToCart|add to cart'
so the class or link text (really, anything inside the tag) could
be used to decide whether the link should be followed.

Or...  if there's already a way to do this, let me know.  I
didn't see anything in the docs, but I may have missed something.

:)
=

I think what I want could be implemented via the --filter option,
with a few small modifications to what was proposed.  I'm not
sure exactly what syntax to use, but it should be able to specify
whether to include/exclude the link, which PCRE flags to use, how
much of the raw HTML tag to use as input, and what pattern to use
for matching.  Here's an idea:

  --filter=[allow][flags,][scope][:]pattern

Example:

  '--filter=-i,raw:add ?to ?cart'
  (the quotes are there only to make the shell treat it as one parameter)

The details are:

  "allow" is "+" for "include" or "-" for "exclude".
  It defaults to "+" if omitted.

  "flags," is a set of letters to control regex options, followed
  by a comma (to separate it from scope).  For example, "i"
  specifies a case-insensitive search.  These would be the same
  flags that perl appends to the end of search patterns.  So,
  instead of "/foo/i", it would be "--filter=+i,:foo"

  "scope" controls how much of the  or similar tag gets used
  as input to the regex.  Values include:
raw: use the entire tag and all contents (default)
 bar
domain: use only the domain name
 www.example.com
file: use only the file name
 foo.ext
path: use the directory, but not the file name
 

Re: regex support RFC

2006-03-30 Thread Curtis Hatter
On Thursday 30 March 2006 11:49, you wrote:
> How many keywords do we need to provide maximum flexibility on the
> components of the URI? (I'm thinking we need five.)
>
> Consider http://www.example.com/path/to/script.cgi?foo=bar
>
> --filter=uri:regex could match against any part of the URI
> --filter=domain:regex could match against www.example.com
> --filter=path:regex could match against /path/to/script.cgi
> --filter=file:regex could match against script.cgi
> --filter=query:regex could match against foo=bar
>
> I think there are good arguments for and against matching against the file
> name in "path:"
>
> Tony

The query keyword is a great idea. So many of the sites I download from use 
that, and would greatly help in limiting the material that is downloaded.

I was also wondering: does "path:" need the beginning and end slashes or are 
those assumed? They could be assumed, but if you combine the "file:" with 
the path I'm not sure you can make that assumption anymore. This comes into 
play when wanting to match at the start, or at the end, of a path.

--filter=path:^path/to/files or --filter=path:^/path/to/files
--filter=path:path/to/files$ or --filter=path:path/to/files/$

Also, is there any way to add modifiers to the regexes? The only modifier I 
can think of off the top of my head that would see much use is /i. I download 
material from a site where for some reason they use a KRS and a krs 
interchangeably in the path name. So something akin to 
"--filter=path:^path/to/(?i:krs)" would be helpful. Or some other way to 
include modifiers?
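
For reference, this is how Perl's inline (?i:) group behaves (a quick
illustrative check, independent of whichever engine wget ends up using):

#!/usr/bin/perl
use strict;
use warnings;

# only the (?i:krs) group is case-insensitive; the rest stays exact
for my $path ('path/to/KRS/x', 'path/to/krs/x', 'PATH/to/krs/x') {
    my $ok = $path =~ m{^path/to/(?i:krs)/};
    print "$path: ", ($ok ? "match" : "no match"), "\n";
}
# the first two match; the third does not, because ^path stays
# case-sensitive outside the group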

Curtis


RE: regex support RFC

2006-03-30 Thread Tony Lewis
How many keywords do we need to provide maximum flexibility on the
components of the URI? (I'm thinking we need five.)

Consider http://www.example.com/path/to/script.cgi?foo=bar

--filter=uri:regex could match against any part of the URI
--filter=domain:regex could match against www.example.com
--filter=path:regex could match against /path/to/script.cgi
--filter=file:regex could match against script.cgi
--filter=query:regex could match against foo=bar

I think there are good arguments for and against matching against the file
name in "path:"

Tony



Re: regex support RFC

2006-03-30 Thread Curtis Hatter
On Wednesday 29 March 2006 12:05, you wrote:
> we also have to reach consensus on the filtering algorithm. for
> instance, should we simply require that a url passes all the filtering
> rules to allow its download (just like the current -A/R behaviour), or
> should we instead adopt a short circuit algorithm that applies all rules
> in the same order in which they were given in the command line and
> immediately allows the download of an url if it passes the first "allow"
> match? should we also support apache-like deny-from-all and
> allow-from-all policies? and what would be the best syntax to trigger
> the usage of these policies?

I would recommend parsing the filters in the order given; that puts the onus 
on the user to optimize the filters and not on you. Another way could possibly 
be to order all filters by domain, then path, and finally file.

Regardless of how you ultimately decide to order the filters, would it be 
possible to allow for users to specify a short circuit? I'm thinking 
something similar to PF's (http://www.openbsd.org/faq/pf/filter.html#quick) 
quick keyword. Example usage of this would be something like:

Need to mirror a site that uses several domains:

--filter=+domain:example.(net|org|com)

Within that domain are several paths. For one of those paths, which is four 
levels deep, I know I want everything regardless of its file name/type/etc.

--filter=+path,quick:([^/]+/){3}thefiles

The "quick" keyword is used to skip all other filters, because I've told wget 
that I'm sure I want everything in that path if it matches.

Wget would first evaluate the domain, if it passes evaluate the path and if 
that passes then skip all other filters. Should it fail, wget continues to 
evaluate the rest of the filters.

Another example: I know I want nothing from any site other than example.com

--filter=-domain,quick:^(?!example.com)

That should ignore any domain that doesn't begin with example.com and skip all 
other rules because of the "quick" keyword. This would make processing more 
efficient, since other filters don't have to be evaluated.
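
A sketch of the evaluation loop this implies (illustrative Perl, not wget
code; the filter table and the allow-by-default policy are assumptions of
this example):

use strict;
use warnings;

my @filters = (
    { allow => 1, quick => 0, re => qr/example\.(net|org|com)/ },
    { allow => 1, quick => 1, re => qr{([^/]+/){3}thefiles}    },
);

sub url_allowed {
    my ($url) = @_;
    my $verdict = 1;                 # assume allow-by-default
    for my $f (@filters) {
        next unless $url =~ $f->{re};
        $verdict = $f->{allow};
        last if $f->{quick};         # "quick": skip all remaining filters
    }
    return $verdict;
}

# prints "follow": both filters match, and the second one short-circuits
print url_allowed('http://example.org/a/b/c/thefiles/x.gif')
    ? "follow\n" : "skip\n";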

Curtis


Re: regex support RFC

2006-03-30 Thread Jim Wright
On Thu, 30 Mar 2006, Mauro Tortonesi wrote:
> 
> > I do like the [file|path|domain]: approach.  very nice and flexible.
> > (and would be a huge help to one specific need I have!)  I suggest also
> > including an "any" option as a shortcut for putting the same pattern in
> > all three options.
> 
> do you think the "any" option would be really useful? if so, could you please
> give us an example?

Depends on how individual [file|path|domain]: entries are combined.
AND, OR?  Suppose you want files from some.dom.com://*/foo/*.png.
The part I'm thinking of here is "foo as last directory component,
and png as filename extension."  Can the individual rules be combined
to express this?  I guess the real question is, how are rules combined.

Jim


RE: regex support RFC

2006-03-30 Thread Herold Heiko
> From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]
> > I agree. Just how often will there be problems in a single wget run
> > due to both some.domain.com and somedomain.com present (famous last
> > words...)
> 
> Actually it would have to be somedomain.com -- a "."
> will not match the null string.  My point was that people who care

yes, sorry.

> about that potential problem will carefully quote their dots, while
> the rest of us will use the more convenient notation.

I am of the same opinion; I'm just wondering how often the "correct"
notation will be necessary. I don't think that in my personal experience
something like that would ever happen, but then what with the plethora of
similar domains existing (and even worse, the phishing domains...)

Heiko

-- 
-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold [EMAIL PROTECTED] [EMAIL PROTECTED]
-- +39-041-5907073 / +39-041-5917073 ph
-- +39-041-5907472 / +39-041-5917472 fax


Re: regex support RFC

2006-03-30 Thread Hrvoje Niksic
Herold Heiko <[EMAIL PROTECTED]> writes:

>> From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]
>> I don't think such a thing is necessary in practice, though; remember
>> that even if you don't escape the dot, it still matches the (intended)
>> dot, along with other characters.  So for quick&dirty usage not
>> escaping dots will "just work", and those who want to be precise can
>> escape them.
>
> I agree. Just how often will there be problems in a single wget run due to
> both some.domain.com and somedomain.com present (famous last
> words...)

Actually it would have to be somedomain.com -- a "."
will not match the null string.  My point was that people who care
about that potential problem will carefully quote their dots, while
the rest of us will use the more convenient notation.


RE: regex support RFC

2006-03-30 Thread Herold Heiko
> From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]
> I don't think such a thing is necessary in practice, though; remember
> that even if you don't escape the dot, it still matches the (intended)
> dot, along with other characters.  So for quick&dirty usage not
> escaping dots will "just work", and those who want to be precise can
> escape them.

I agree. Just how often will there be problems in a single wget run due to
both some.domain.com and somedomain.com present (famous last words...)

Heiko

-- 
-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold [EMAIL PROTECTED] [EMAIL PROTECTED]
-- +39-041-5907073 / +39-041-5917073 ph
-- +39-041-5907472 / +39-041-5917472 fax


Re: regex support RFC

2006-03-30 Thread Hrvoje Niksic
Herold Heiko <[EMAIL PROTECTED]> writes:

> Get the best of both: use a syntax permitting a "first match exits"
> ACL, where a single ACE permits several statements ANDed together. Cooking
> up a simple syntax for users without much regexp experience won't be
> easy.

I assume ACL stands for "access control list", but what is ACE?

> One way (probably not the most beautiful syntax) could be a running
> number,

The numbers are just too ugly, sorry.

Also, having *two* instances of +/- (one before the "=" and one after
the "=") is just too confusing; it took me a minute or two to figure
out.

> I realize much of this syntax can be thrown out of the window simply
> considering we can probably reach the same effect with uri filters and more
> complicated regexp (perl5 syntax):
> --filter=+uri:.+\.dom\.com/*.download
> --filter=-domain:sweets\.dom\.com
> --filter="+uri:peanuts\.dom\.com/.*(?!brown)"
> --filter=+path:peanuts

That seems much more acceptable IMHO.


Re: regex support RFC

2006-03-30 Thread Hrvoje Niksic
Herold Heiko <[EMAIL PROTECTED]> writes:

> BTW any comments about the dots? Requiring escaped dots in domains would
> become old really fast, while reversing the behaviour (\. = any char) would
> be against the principle of least surprise, since every other regexp syntax
> uses the opposite.

Modifying the dot to only match a dot might be useful for "domain"
patterns, but I suspect it's not easy to implement.

I don't think such a thing is necessary in practice, though; remember
that even if you don't escape the dot, it still matches the (intended)
dot, along with other characters.  So for quick&dirty usage not
escaping dots will "just work", and those who want to be precise can
escape them.
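
A two-line Perl illustration of that point (example domains assumed):

use strict;
use warnings;

my $loose = qr/some.domain\.com/;    # first dot left unescaped
print "dot still matches\n" if 'some.domain.com' =~ $loose;  # intended
print "but so does X\n"     if 'someXdomain.com' =~ $loose;  # unintended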

> Either way pure windows users will be confused (*.html instead of
> .*\.html),

Increased expressive power will hopefully outweigh the confusion.
After all, people who use Wget on Windows are hardly typical Windows
users.  :-)

> but personally I don't think permitting yet another alternate syntax
> (using globs) is justified, and a syntax using exclusively globs
> would be too limited.

My thoughts exactly.


RE: regex support RFC

2006-03-30 Thread Herold Heiko
[Imagination running freely. I do not have a lot of experience designing
syntax, but I suffer a lot in a helpdeskish way trying to explain syntax to
users. Hopefully this can be somehow useful]

> we also have to reach consensus on the filtering algorithm. for
> instance, should we simply require that a url passes all the filtering
> rules to allow its download (just like the current -A/R behaviour), or
> should we instead adopt a short circuit algorithm that applies all rules
> in the same order in which they were given in the command line and
> immediately allows the download of an url if it passes the first "allow"
> match? should we also support apache-like deny-from-all and
> allow-from-all policies? and what would be the best syntax to trigger
> the usage of these policies?
> 


Get the best of both: use a syntax permitting a "first match exits" ACL,
where a single ACE permits several statements ANDed together. Cooking up a
simple syntax for users without much regexp experience won't be easy.

One way (probably not the most beautiful syntax) could be a running number:
AND together repeated filters with the same number, but use FIRST MATCH
between numbers:

download every path containing download on every *.dom.com (including
sweets.dom.com);
OTHERWISE avoid anything (else) on sweets.dom.com;
OTHERWISE from peanuts.dom.com get everything except brown stuff (currants
and so on);
OTHERWISE get peanuts from everywhere else

--filter1+=+domain:.+\.dom\.com --filter1=+path:download &&
--filter2-=+domain:sweets\.dom\.com &&
--filter3+=+peanuts\.dom\.com --filter3=-file:brown &&
--filter4+=+path:peanuts &&

(&& omitted later on)
The first filterX (for every X) does carry a +/- before the = (permit/deny
ACE), every filterX does carry a + or - after the = (what are we matching).

Well, I wrote the example and I hate it already, hopefully some better
syntax comes up which doesn't require nested quotes.

Require an additional switch permit/deny for every ACE:
--filter1=permit --filter1=+domain:.+\.dom\.com --filter1=+path:download
--filter2=deny   --filter2=+domain:sweets\.dom\.com
--filter3=permit --filter3=+peanuts\.dom\.com --filter3=-file:brown
--filter4=permit --filter4=+path:peanuts

With permit and + as default that would make
--filter1=domain:.+\.dom\.com --filter1=path:download
--filter2=deny   --filter2=domain:sweets\.dom\.com
--filter3=peanuts\.dom\.com --filter3=-file:brown
--filter4=path:peanuts

On the other hand, without the default=permit we could lose the numbers
(use position):
--filter=permit --filter=+domain:.+\.dom\.com --filter=+path:download
--filter=deny   --filter=+domain:sweets\.dom\.com
--filter=permit --filter=+peanuts\.dom\.com --filter=-file:brown
--filter=permit --filter=+path:peanuts

e.g. start with permit or deny (or default permit for first ACE only),
following statements are ANDed together as a single ACE until next
permit/deny.

Considering command line restrictions and so on, for complicated expressions
there should also be a --filter-file=filename, with the same syntax except
for the --filter prefix?

I realize much of this syntax can be thrown out of the window simply
considering we can probably reach the same effect with uri filters and more
complicated regexp (perl5 syntax):
--filter=+uri:.+\.dom\.com/*.download
--filter=-domain:sweets\.dom\.com
--filter="+uri:peanuts\.dom\.com/.*(?!brown)"
--filter=+path:peanuts

Simpler and shorter invocation syntax, but with a more complicated regexp
requirement; not a simple thing for the casual user. After all, wget doesn't
try to appeal to programmers only, so many examples in the manual will be
necessary.

BTW any comments about the dots? Requiring escaped dots in domains would
become old really fast, while reversing the behaviour (\. = any char) would be
against the principle of least surprise, since every other regexp syntax uses
the opposite.
Either way pure windows users will be confused (*.html instead of .*\.html),
but personally I don't think permitting yet another alternate syntax (using
globs) is justified, and a syntax using exclusively globs would be too
limited.

Heiko

-- 
-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold [EMAIL PROTECTED] [EMAIL PROTECTED]
-- +39-041-5907073 / +39-041-5917073 ph
-- +39-041-5907472 / +39-041-5917472 fax


Re: regex support RFC

2006-03-30 Thread Mauro Tortonesi

Jim Wright wrote:
what definition of regexp would you be following? 


that's another degree of freedom. hrvoje and i have chosen to integrate 
in wget the GNU regex implementation, which allows the use of 
one of these different syntaxes:


RE_SYNTAX_EMACS
RE_SYNTAX_AWK
RE_SYNTAX_GNU_AWK
RE_SYNTAX_POSIX_AWK
RE_SYNTAX_GREP
RE_SYNTAX_EGREP
RE_SYNTAX_POSIX_EGREP
RE_SYNTAX_POSIX_BASIC
RE_SYNTAX_POSIX_MINIMAL_BASIC
RE_SYNTAX_POSIX_EXTENDED
RE_SYNTAX_POSIX_MINIMAL_EXTENDED

(see 
http://cvs.savannah.gnu.org/viewcvs/emacs/emacs/src/regex.h?view=markup)


among these, i would probably go for a POSIX_EXTENDED syntax.

I'm not quite understanding the comment about the comma and needing 
escaping for literal commas. this is true for any character in the 
regexp language, so why the special concern for comma?


hrvoje already answered this question.


I do like the [file|path|domain]: approach.  very nice and flexible.
(and would be a huge help to one specific need I have!)  I suggest also
including an "any" option as a shortcut for putting the same pattern in
all three options.


do you think the "any" option would be really useful? if so, could you 
please give us an example?


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: regex support RFC

2006-03-29 Thread Hrvoje Niksic
Jim Wright <[EMAIL PROTECTED]> writes:

> what definition of regexp would you be following?  or would this be
> making up something new?

It wouldn't be new, Mauro is definitely referring to regexps as
normally understood.  The regexp API's found on today's Unix systems
might be usable, but unfortunately those are not available on Windows.
They also lack the support for the very useful non-greedy matching
quantifier (the "?" modifier to the "*" operator) introduced by Perl 5
and supported by most of today's major regexp implementations: Python,
Java, Tcl, etc.

One idea was to use PCRE, bundling it with Wget for the sake of
Windows and systems without PCRE.  Another (http://tinyurl.com/elp7h)
was to use and bundle Emacs's regex.c, the version of GNU regex
shipped with GNU Emacs.  It is small (one source) and offers
Unix-compatible basic and extended regexps, but also supports the
non-greedy quantifier and non-capturing groups.

See the message and the related discussion at http://tinyurl.com/mdwhx
for more about this topic.

> I'm not quite understanding the comment about the comma and needing
> escaping for literal commas.

Supporting PATTERN1,PATTERN2,... would require having a way to quote
the comma character.  But there is little reason for a specific comma
syntax since one can always use (PATTERN1|PATTERN2|...).
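
For instance (an illustrative Perl check with assumed extensions), one
alternation replaces what would otherwise be a comma-separated list:

use strict;
use warnings;

my $either = qr/(\.gif$|\.png$)/;    # alternation instead of "\.gif$,\.png$"
print "accept\n" if 'logo.png'  =~ $either;
print "reject\n" if 'page.html' !~ $either;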

Being unable to have a comma in the pattern is a shortcoming in the
current -R/-A options.

> I do like the [file|path|domain]: approach.  very nice and flexible.

Thanks.


Re: regex support RFC

2006-03-29 Thread Hrvoje Niksic
Mauro Tortonesi <[EMAIL PROTECTED]> writes:

> for instance, the syntax for --filter presented above is basically the
> following:
>
> --filter=[+|-][file|path|domain]:REGEXP

I think there should also be "url" for filtering on the entire URL.
People have been asking for that kind of thing a lot over the years.


Re: regex support RFC

2006-03-29 Thread TPCnospam
> for instance, the syntax for --filter presented above is basically the 
> following:
> 
> --filter=[+|-][file|path|domain]:REGEXP

I think a file 'contents' regexp search facility would be a useful 
addition here.  eg.

 --filter=[+|-][file|path|domain|contents]:REGEXP

The idea is that if the file just downloaded has a regexp match for 
expression REGEXP (ie. as in 'egrep REGEXP file.html') then that file is 
kept and its links processed as normal.  If no match is found the file is 
just deleted.  Such a facility could be used to prevent recursive 
downloads wandering way off topic.

eg. 

wget -e robots=off -r -N -k -E -p -H http://www.gnu.org/software/wget/

soon leads to non wget related links being downloaded, eg. 
http://www.gnu.org/graphics/agnuhead.html

My suggestion is that with;

wget -e robots=off -r -N -k -E -p -H --filter=+contents:wget 
http://www.gnu.org/software/wget/

any page not containing the string 'wget' is deleted and its links not 
followed.
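
In rough Perl pseudocode, the proposed 'contents' rule would amount to
something like this (purely illustrative; keep_page and the file name are
made up, not proposed wget internals):

use strict;
use warnings;

sub keep_page {
    my ($file, $re) = @_;
    open my $fh, '<', $file or return 0;
    local $/;                        # slurp the whole document
    my $body = <$fh>;
    close $fh;
    return $body =~ /$re/;           # keep iff the contents match
}

# delete the page and skip its links when the pattern is absent
unlink 'page.html' unless keep_page('page.html', qr/wget/);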

Thanks
Tom Crane
-- 
Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,
Egham, Surrey, TW20 0EX, England. 
Email:  [EMAIL PROTECTED]
Fax:+44 (0) 1784 472794


Re: regex support RFC

2006-03-29 Thread Jim Wright
what definition of regexp would you be following?  or would this be
making up something new?  I'm not quite understanding the comment about
the comma and needing escaping for literal commas.  this is true for any
character in the regexp language, so why the special concern for comma?

I do like the [file|path|domain]: approach.  very nice and flexible.
(and would be a huge help to one specific need I have!)  I suggest also
including an "any" option as a shortcut for putting the same pattern in
all three options.

Jim



On Wed, 29 Mar 2006, Mauro Tortonesi wrote:

> 
> hrvoje and i have been recently talking about adding regex support to wget. we
> were considering adding a new --filter option which, by supporting regular
> expressions, would allow more powerful ways of filtering urls to download.
> 
> for instance the new option could allow the filtering of domain names, file
> names and url paths. in the following case --filter is used to prevent any
> download from the www-*.yoyodyne.com domain and to restrict download only to
> .gif files:
> 
> wget -r --filter=-domain:www-*.yoyodyne.com --filter=+file:\.gif$
> http://yoyodyne.com
> 
> (notice that --filter interprets every given rule as a regex).
> 
> i personally think the --filter option would be a great new feature for wget,
> and i have already started working on its implementation, but we still have a
> few open questions.
> 
> for instance, the syntax for --filter presented above is basically the
> following:
> 
> --filter=[+|-][file|path|domain]:REGEXP
> 
> is it consistent? is it flawed? is there a more convenient one?
> 
> please notice that supporting multiple comma-separated regexp in a single
> --filter option:
> 
> --filter=[+|-][file|path|domain]:REGEXP1,REGEXP2,...
> 
> would significantly complicate the implementation and usage of --filter, as it
> would require escaping of the "," character. also notice that current
> filtering options like -A/R are somewhat broken, as they do not allow the
> usage of "," char in filtering rules.
> 
> we also have to reach consensus on the filtering algorithm. for instance,
> should we simply require that a url passes all the filtering rules to allow
> its download (just like the current -A/R behaviour), or should we instead
> adopt a short circuit algorithm that applies all rules in the same order in
> which they were given in the command line and immediately allows the download
> of an url if it passes the first "allow" match? should we also support
> apache-like deny-from-all and allow-from-all policies? and what would be the
> best syntax to trigger the usage of these policies?
> 
> i am looking forward to read your opinions on this topic.
> 
> 
> P.S.: the new --filter option would replace and extend the old -D, -I/X
> and -A/R options, which will be deprecated but still supported.
> 
> -- 
> Aequam memento rebus in arduis servare mentem...
> 
> Mauro Tortonesi  http://www.tortonesi.com
> 
> University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
> GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
> Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
> Ferrara Linux User Group                 http://www.ferrara.linux.it
> 
>