On 26 Mar 2002 at 20:24, Jens Rösner wrote:

> Hi Ian!
Hi Jens! 
> > > The first page of even the rejected hosts gets saved.
> > That sounds like a bug.
> 
> Should I try to get a useful debug log? 
> (It is Windows, so I do not know if it is helpful.)

A debug log will be useful if you can produce one. However, see
the discussion below about the --input-file option. What I
initially assumed to be a bug may be a feature (or at least a
misfeature).

Also note that if you receive cookies that expire around 2038 with
debugging on, the Windows version of Wget will crash! (This is a
known bug with a known fix, but not yet finalised in CVS.)

> [depth first]
> > > Now, with downloading from many (20+) different servers, this is a bit
> > > frustrating,
> > > as I will probably have the first completely downloaded site in a few
> > > days...
> > 
> > Would that be less of a problem if the first problem (first page
> > from rejected domains) was fixed?
> 
> Not really, the problems are quite different for me.

Oh well, it was just a thought!
 
> > > Is there any other way to work around this besides installing wget 1.6
> > > (or even 1.5?)
> > No, 
> 
> I just installed 1.7.1, which also works breadth-first.

(I think you mean depth-first.) Yes, that was the last version that
used depth-first retrieval. There are advantages and disadvantages
with both types of retrieval.

One of the reasons for the switch was that Wget's measurement of
the 'depth' of links had problems on highly interconnected
web-sites (for example, on-line manuals with separate pages for
each section, 'next' and 'previous' links and a 'contents' page).
Limited-depth retrievals did not work very well there, because a
page could be deeply nested and shallowly nested at the same time.
The true measure of depth is the minimum length of a path to a
page, not the length of the first encountered path to a page
(which is what Wget's depth-first algorithm was using). The
breadth-first approach, by its very nature, sees the shortest
paths to a particular page before any longer paths and so neatly
avoids the problem.
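
To illustrate the point with a toy sketch (in Python, purely for
illustration; this is not code from Wget), a breadth-first walk
records the minimum depth of each page for free, because the first
time it reaches a page is necessarily along a shortest path:

from collections import deque

def bfs_depths(links, start):
    """links: dict mapping each page to the pages it links to."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, []):
            if nxt not in depths:        # first visit = shortest path
                depths[nxt] = depths[page] + 1
                queue.append(nxt)
    return depths

# Toy manual-style site: each section links to the next one, and the
# contents page links to every section directly.
site = {
    "contents": ["sec1", "sec2", "sec3"],
    "sec1": ["sec2"],
    "sec2": ["sec3"],
    "sec3": [],
}
print(bfs_depths(site, "contents"))
# {'contents': 0, 'sec1': 1, 'sec2': 1, 'sec3': 1}

A depth-first walk that happened to follow the 'next' links first
would have recorded 'sec3' at depth 3 rather than 1, which is
exactly the problem described above.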

> > The other alternative is to run wget several times in sequence
> > with different starting URLs and restrictions, perhaps using the
> > --timestamping or --no-clobber options to avoid downloading
> > things more than once.
> 
> Of course, this is possible.
> I just had hoped that by combining 
> -F -i url.html
> with domain acceptance would save me a lot of time.
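
For what it's worth, the 'several runs in sequence' approach quoted
above could be scripted along these lines (a minimal sketch only,
assuming wget is on the PATH; the URLs and depth limit are made-up
examples):

import subprocess

start_urls = [
    "http://www.example.com/",
    "http://manual.example.org/contents.html",
]

for url in start_urls:
    subprocess.run(
        ["wget", "--recursive", "--level=2",
         "--no-clobber",      # skip files that have already been saved
         url],
        check=False,          # carry on even if one site fails
    )

Each site is then downloaded completely before the next one is
started, which sidesteps the ordering issue you describe.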

Oh, I think I see what your first complaint is now. I initially
assumed that your local HTML file was being served by a local HTTP
server rather than being fed to the -F -i options. Is your
complaint really that URLs supplied on the command line or via the
-i option are not subjected to the acceptance/rejection rules?
That does indeed seem to be the current behavior, but there is no
particular reason why we couldn't apply the tests to these URLs as
well as to the URLs obtained through recursion.
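
Conceptually, applying the tests there would just mean running each
URL from the command line or the input file through the same domain
checks before it is queued. A rough, purely illustrative sketch of
that filtering step (the helper below is hypothetical, not Wget
source; only the -D/--domains and --exclude-domains option names
are real):

from urllib.parse import urlparse

def accepted(url, domains=None, exclude_domains=None):
    """Very simplified stand-in for -D/--domains and
    --exclude-domains matching on a single URL."""
    host = urlparse(url).hostname or ""
    if exclude_domains and any(host.endswith(d) for d in exclude_domains):
        return False
    if domains:
        return any(host.endswith(d) for d in domains)
    return True

input_file_urls = [
    "http://www.example.com/index.html",
    "http://ads.example.net/banner.html",
]
print([u for u in input_file_urls
       if accepted(u, domains=["example.com"])])
# only the example.com URL survives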
