> for instance, the syntax for --filter presented above is basically the 
> following:
> 
> --filter=[+|-][file|path|domain]:REGEXP

I think a file-contents regexp search facility would be a useful 
addition here, e.g.

 --filter=[+|-][file|path|domain|contents]:REGEXP

The idea is that if the just-downloaded file contains a match for the 
regular expression REGEXP (i.e. as in 'egrep REGEXP file.html'), the 
file is kept and its links are processed as normal.  If no match is 
found, the file is simply deleted.  Such a facility could be used to 
prevent recursive downloads from wandering way off topic.
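
A minimal sketch of the per-file decision (the names here are purely
illustrative, not wget's actual code):

  import os
  import re

  def apply_contents_filter(path, regexp, keep_on_match=True):
      # Hypothetical --filter=[+|-]contents:REGEXP check: '+' keeps
      # files whose contents match REGEXP, '-' would invert the test.
      with open(path, errors="replace") as f:
          matched = re.search(regexp, f.read()) is not None
      keep = matched if keep_on_match else not matched
      if not keep:
          os.remove(path)  # no match: discard the file, skip its links
      return keep          # True => keep the file, process its links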

e.g.

wget -e robots=off -r -N -k -E -p -H http://www.gnu.org/software/wget/

soon leads to pages unrelated to wget being downloaded, e.g. 
http://www.gnu.org/graphics/agnuhead.html

My suggestion is that, with

wget -e robots=off -r -N -k -E -p -H --filter=+contents:wget 
http://www.gnu.org/software/wget/

any page not containing the string 'wget' would be deleted and its 
links not followed.
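
Until something like this exists, the effect can only be approximated 
after the fact, e.g. by sweeping the mirrored tree once wget has 
finished (a rough sketch; 'www.gnu.org' is just the directory that -r 
would create for the example above):

  import pathlib
  import re

  pattern = re.compile("wget")
  for page in pathlib.Path("www.gnu.org").rglob("*.html"):
      if not pattern.search(page.read_text(errors="replace")):
          page.unlink()  # delete pages that never mention 'wget'

Note that a sweep like this cannot stop wget from having already 
followed links out of the off-topic pages, which is why the test 
belongs inside wget itself.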

Thanks
Tom Crane
-- 
Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,
Egham, Surrey, TW20 0EX, England. 
Email:  [EMAIL PROTECTED]
Fax:    +44 (0) 1784 472794
