> > wget -e robots=off -r -N -k -E -p -H http://www.gnu.org/software/wget/
> > 
> > soon leads to non-wget-related links being downloaded, e.g. 
> > http://www.gnu.org/graphics/agnuhead.html
> In that particular case, I think --no-parent would solve the
> problem.

No.  The idea is not to restrict descending the tree at all; the aim is 
still to follow cross-host links, just selectively. 

> Maybe I misunderstood, though.  It seems awfully risky to use -r
> and -H without having something to strictly limit the links
> followed.  So, I suppose the content filter would be an effective
> way to make cross-host downloading safer.

Absolutely.  That is why I proposed a 'contents' regexp.
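To make the proposal concrete, here is a minimal sketch of what such a 
'contents' regexp would do.  This is hypothetical: wget has no such 
option (later releases gained --accept-regex, but that matches URLs, 
not page contents), so the wrapper below simply greps a fetched file 
and reports whether its links should be followed.

```shell
#!/bin/sh
# Hypothetical sketch of the proposed 'contents' regexp filter.
# page_matches succeeds (exit 0) when the downloaded file's contents
# match the given extended regexp, i.e. "follow this page's links".
page_matches() {
    # $1 = downloaded file, $2 = content regexp
    grep -E -q -- "$2" "$1"
}

# Example: keep only pages whose text mentions wget.
printf '<html><title>GNU Wget</title></html>\n' > /tmp/page.html
if page_matches /tmp/page.html '[Ww]get'; then
    echo "follow links in page"
else
    echo "prune page"
fi
```

The same exit-status convention would let wget decide, per page, 
whether to queue that page's links for retrieval.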

> I think I'd prefer to have a different option, for that sort of
> thing -- filter by using external programs.  If the program
> returns a specific code, follow the link or recurse into the
> links contained in the file.  Then you could do far more complex
> filtering, including things like interactive pruning.

True.  That could be a future feature request, but now that the wget team 
are writing regexp code, it seems an ideal time to implement it.  By 
constructing suitable regexps, one could use this feature to search for 
any string in the HTML file (as above), or just in meta tags, etc.  IMHO it 
gives a lot of flexibility for little extra programming effort on the 
developers' part.
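For comparison, the external-program idea quoted above might look like 
the sketch below.  Again, this is hypothetical: wget has no such hook; 
the convention assumed here is that wget would run a user-supplied 
program on each fetched file and recurse only if it exits 0.  This 
example filter approves only pages whose meta keywords mention wget.

```shell
#!/bin/sh
# Hypothetical external filter program for the proposed wget hook:
# exit 0 = follow/recurse into this page's links, non-zero = prune.
meta_filter() {
    # $1 = path to the fetched HTML file
    grep -i -E -q '<meta[^>]*keywords[^>]*wget' "$1"
}

# Example run against a synthetic page:
printf '<meta name="keywords" content="GNU, wget, mirror">\n' > /tmp/m.html
meta_filter /tmp/m.html && echo "recurse" || echo "prune"
```

Because the filter is an arbitrary program, it could equally prompt the 
user, consult a database, or do anything else a regexp cannot.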

Any comments, Mauro & Hrvoje?

Tom Crane

Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,
Egham, Surrey, TW20 0EX, England. 
Fax:    +44 (0) 1784 472794
