Re: wget downloading a single page when it should recurse

2003-10-18 Thread Hrvoje Niksic
Tony Lewis [EMAIL PROTECTED] writes:

 Philip Mateescu wrote:

 A warning message would be nice when for not so obvious reasons wget
 doesn't behave as one would expect.

 I don't know if there are other tags that could change wget's behavior
 (like -r and <meta name="robots"> do), but if they happen it would be
 useful to have a message.

 I agree that this is worth a notable mention in the wget output. At the very
 least, running with -d should provide more guidance on why the links it has
 appended to urlpos are not being followed. Buried in the middle of hundreds
 of lines of output is:

 no-follow in index.php

 On the other hand, if other rules prevent a URL from being followed, you
 might see something like:

 Deciding whether to enqueue "http://www.othersite.com/index.html".
 This is not the same hostname as the parent's (www.othersite.com and
 www.thissite.com).
 Decided NOT to load it.

There's a practical reason for this discrepancy.  All these other
links are examined one by one and rejected one by one.  On the other
hand, when nofollow is specified, it causes Wget to not even
*consider* any of the links for download.

Another tweak that should be added (easily, I think): Wget should
ignore robots when downloading the page requisites.



Re: wget downloading a single page when it should recurse

2003-10-17 Thread Aaron S. Hawley
The HTML of those pages contains the meta-tag

<meta name="robots" content="noindex,nofollow" />

and Wget listened, and only downloaded the first page.

Perhaps Wget should give a warning message that the file contained a
meta-robots tag, so that people aren't quite so dumbfounded.
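
For readers hitting the same symptom, checking a saved copy of the page can confirm whether a robots meta tag is the cause. The following is a minimal shell sketch, not part of Wget: has_nofollow is a hypothetical helper name, and the pattern assumes the name attribute appears before content inside the tag.

```shell
# has_nofollow (hypothetical helper, not part of Wget): reads HTML on stdin
# and exits 0 if a robots meta tag whose content includes "nofollow" is found.
# Assumption: the name attribute comes before content inside the tag.
has_nofollow() {
    # Join lines first so a tag split across lines still matches.
    tr '\n' ' ' |
        grep -Eiq "<meta[^>]*name=[\"']?robots[\"']?[^>]*content=[\"']?[^\"'>]*nofollow"
}

# Example usage against a saved copy of the page:
#   has_nofollow < index.php && echo "page asks robots not to follow its links"
```

If the tag is present and mirroring the site is acceptable to its owner, the Wget manual documents a robots wgetrc command that can be turned off for a single run with -e robots=off (for example, wget -e robots=off -r -np -p -k URL). That switch disables both robots.txt and nofollow handling; whether 1.8.2 behaves identically was not verified here.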

/a

On Fri, 17 Oct 2003, Philip Mateescu wrote:

 Hi,

 I'm having a problem with wget 1.8.2 cygwin and I'm almost ready to
 swear it once worked...

 I'm trying to download the php manual off the web using this command:

 $ wget -nd -nH -r -np -p -k -S http://us4.php.net/manual/en/print/index.php

-- 
Consider supporting GNU Software and the Free Software Foundation
By Buying Stuff - http://www.gnu.org/gear/
  (GNU and FSF are not responsible for this promotion
   nor necessarily agree with the views of the author)


Re: wget downloading a single page when it should recurse

2003-10-17 Thread Philip Mateescu
Thanks!

A warning message would be nice when for not so obvious reasons wget 
doesn't behave as one would expect.

I don't know if there are other tags that could change wget's behavior 
(like -r and <meta name="robots"> do), but if they happen it would be
useful to have a message.

Thanks again!



Aaron S. Hawley wrote:

The HTML of those pages contains the meta-tag

<meta name="robots" content="noindex,nofollow" />

and Wget listened, and only downloaded the first page.

Perhaps Wget should give a warning message that the file contained a
meta-robots tag, so that people aren't quite so dumbfounded.
/a

On Fri, 17 Oct 2003, Philip Mateescu wrote:


Hi,

I'm having a problem with wget 1.8.2 cygwin and I'm almost ready to
swear it once worked...
I'm trying to download the php manual off the web using this command:

$ wget -nd -nH -r -np -p -k -S http://us4.php.net/manual/en/print/index.php


---
Don't belong. Never join. Think for yourself. Peace
---


Re: wget downloading a single page when it should recurse

2003-10-17 Thread Tony Lewis
Philip Mateescu wrote:

 A warning message would be nice when for not so obvious reasons wget
 doesn't behave as one would expect.

 I don't know if there are other tags that could change wget's behavior
 (like -r and <meta name="robots"> do), but if they happen it would be
 useful to have a message.

I agree that this is worth a notable mention in the wget output. At the very
least, running with -d should provide more guidance on why the links it has
appended to urlpos are not being followed. Buried in the middle of hundreds
of lines of output is:

no-follow in index.php

On the other hand, if other rules prevent a URL from being followed, you
might see something like:

Deciding whether to enqueue "http://www.othersite.com/index.html".
This is not the same hostname as the parent's (www.othersite.com and
www.thissite.com).
Decided NOT to load it.
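
Until such a message exists, filtering the debug output for both kinds of line makes the reason easier to spot. This is a minimal sketch, assuming the debug messages are worded like the samples quoted above (the wording may differ between Wget versions); explain_skips is a hypothetical helper name.

```shell
# explain_skips (hypothetical helper): reads Wget debug output on stdin and
# keeps only the lines that explain why links were not followed.
# Assumption: message wording matches the samples quoted in this thread.
explain_skips() {
    grep -E 'no-follow in |Deciding whether to enqueue|Decided NOT to load|not the same hostname'
}

# Example usage (Wget writes its debug output to stderr):
#   wget -d -r -np http://us4.php.net/manual/en/print/index.php 2>&1 | explain_skips
```

A single "no-follow in ..." line with no "Deciding whether to enqueue" lines after it would suggest that the links were dropped as a group rather than rejected one by one.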

Tony



Re: wget downloading a single page when it should recurse

2003-10-17 Thread Hrvoje Niksic
Aaron S. Hawley [EMAIL PROTECTED] writes:

 The HTML of those pages contains the meta-tag

 <meta name="robots" content="noindex,nofollow" />

 and Wget listened, and only downloaded the first page.

 Perhaps Wget should give a warning message that the file contained a
 meta-robots tag, so that people aren't quite so dumbfounded.

Good point.  A message would be easy to add, and in this case
enormously useful.