Using wget 1.8.2:

$ wget --page-requisites http://news.com.com

...fails to retrieve most of the files that are required to properly 
render the HTML document, because they are forbidden by 
http://news.com.com/robots.txt .
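
For reference, the blocking rules can be seen by fetching the robots
file directly (assuming the site hasn't changed it since I looked); the
debug trace at the end shows the rejection comes from a rule covering
the i/ path:

$ wget -q -O - http://news.com.com/robots.txt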

I think that use of --page-requisites implies that wget is being used as 
a "save this entire web page as..." utility for later human viewing, 
rather than as a text-indexing spider that wants to analyze the content 
but not the presentation. So I believe wget should ignore robots.txt 
when --page-requisites is specified.
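
(In the meantime the check can be turned off explicitly with the 
`robots' wgetrc variable; if I'm reading the manual right, 1.8.2 
accepts it on the command line via -e:

$ wget -e robots=off --page-requisites http://news.com.com

but that disables robots.txt for recursive retrievals too, which is 
exactly what a polite spider shouldn't do, hence the suggestion above.)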

If you agree, I'll try to write a patch and send it to you this 
week... please let me know either way. Thanks!


--- the gory bits:

   "wget -d --page-requisites http://news.com.com"; says:

appending "http://news.com.com/i/hdrs/ne/y_fd.gif"; to urlpos.

   etc., but then later says:

Deciding whether to enqueue "http://news.com.com/i/hdrs/ne/y_fd.gif".
Rejecting path i/hdrs/ne/y_fd.gif because of rule `i/'.
Not following http://news.com.com/i/hdrs/ne/y_fd.gif because robots.txt 
forbids it.
Decided NOT to load it.


