Re: spanning hosts

Ian Abbott Tue, 02 Apr 2002 01:48:15 -0800

On 28 Mar 2002 at 18:01, Jens Rösner wrote:

> > > I came across a crash caused by a cookie
> > > two days ago. I disabled cookies and it worked.
> > I'm hoping you had debug output on when it crashed, otherwise this
> > is a different crash to the one I already know about. Can you
> > confirm this, please?
> 
> Yes, I had debug output on.


Thanks for the confirmation.

> > > wget -nc -x -r -l0 -t10 -H -Dstory.de,audi -o example.log -k -d
> > > -R.gif,.exe,*tn*,*thumb*,*small* -F -i example.html
> > >
> > > Result with 1.8.1 and 1.7.1 with -nh:
> > > audistory.com: Only index.html
> > > audistory.de: Everything
> > > audi100-online: only the first page
> > > kolaschnik.de: only the first page
> > 
> > Yes, that's how I thought it would behave. Any URLs specified on
> > the command line or in an --input-file file are always downloaded
> > regardless of the domain acceptance rules.
> 
> Well, one page of a rejected URL is downloaded, not more.
> Whereas the only accepted domain audistory.de gets downloaded
> completely.
> Doesn't this differ from what you just said?

Well, I only said the URLs specified on the command line or by the
--input-file option are always downloaded. I didn't intend this
to be interpreted as also applying to URLs which Wget finds while
examining the contents of the downloaded HTML files. At the moment,
the domain acceptance/rejection checks are only performed when
downloaded HTML files are examined for further URLs to be
downloaded (for the --recursive and --page-requisites options),
which is why it behaves as it does.
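
The behaviour described above can be sketched roughly as follows. This
is an illustrative Python model, not Wget's actual C code: the names
`crawl`, `host_accepted` and the `links` table (a stand-in for "URLs
found while examining a downloaded HTML file") are all hypothetical,
and the tail match on the host name is an assumption based on how -D
is described in this thread.

```python
from urllib.parse import urlsplit

def host_accepted(host, accepted_domains):
    # Assumed -D behaviour: a host is accepted if it ends with a listed domain.
    return any(host.endswith(d) for d in accepted_domains)

def crawl(seeds, links, accepted_domains):
    """Return URLs in download order; `links` maps a URL to the URLs found in it."""
    queue = list(seeds)   # seed URLs are queued without any domain check
    seen = set(queue)
    downloaded = []
    while queue:
        url = queue.pop(0)
        downloaded.append(url)
        for found in links.get(url, []):
            host = urlsplit(found).hostname or ""
            # The domain check is applied only here, to discovered links.
            if found not in seen and host_accepted(host, accepted_domains):
                seen.add(found)
                queue.append(found)
    return downloaded

links = {
    "http://www.audistory.de/": ["http://www.audistory.de/a.html"],
    "http://www.kolaschnik.de/": ["http://www.kolaschnik.de/b.html"],
}
seeds = ["http://www.audistory.de/", "http://www.kolaschnik.de/"]
result = crawl(seeds, links, ["story.de"])
```

With these made-up inputs, both seed pages are fetched, but only the
link under audistory.de is followed, which mirrors "only the first
page" for the rejected hosts.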

> Agreed! How about introducing "wildcards" like 
> -Dbar.com behaves strictly: www.bar.com, www2.bar.com
> -D*bar.com behaves like now: www.bar.com, www2.bar.com, www.foobar.com
> -D*bar.com* gets www.bar.com, www2.bar.com, www.foobar.com,
> sex-bar.computer-dating.com
> That would leave current command lines operational 
> and introduce many possibilities without (too much) fuss.
> Or have I overlooked anything here?

It sounds like it should work okay. I'd prefer to let -Dbar.com
also match fubar.com for compatibility's sake. If you wanted to
match www.bar.com and www2.bar.com, but not www.fubar.com, you
could use -D.bar.com, but that wouldn't work if you wanted to
match bar.com without the www (well, a leading . could be treated
as a special case).
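
A minimal sketch of that matching rule, in Python for brevity. The
function name `suffix_match` is hypothetical, and the leading-dot
special case is only the idea floated above, not existing Wget
behaviour.

```python
def suffix_match(host, pattern):
    """Tail-match a host name against a -D style domain pattern."""
    host, pattern = host.lower(), pattern.lower()
    if pattern.startswith("."):
        # Hypothetical special case: ".bar.com" also matches bare "bar.com".
        return host == pattern[1:] or host.endswith(pattern)
    # Plain tail match, so "bar.com" also matches "www.fubar.com".
    return host.endswith(pattern)
```

Under these assumptions, "bar.com" still accepts www.fubar.com, while
".bar.com" accepts www.bar.com, www2.bar.com and bar.com but rejects
www.fubar.com.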

It would be easier and more consistent (currently) to use
"shell-globbing" wildcards (as used for the file-acceptance
rules) rather than grep/egrep-style wildcards.
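
To illustrate how shell-globbing patterns would behave on host names,
here is a small sketch using Python's standard fnmatch module as a
stand-in for a shell-style matcher. The helper names `glob_match` and
`domain_accepted` are hypothetical, and this pure-glob reading (no
implicit tail match) is an assumption, not a description of Wget.

```python
from fnmatch import fnmatchcase

def glob_match(host, pattern):
    """Match a host name against one shell-style pattern, ignoring case."""
    return fnmatchcase(host.lower(), pattern.lower())

def domain_accepted(host, patterns):
    """Accept the host if any pattern in the list matches it."""
    return any(glob_match(host, p) for p in patterns)
```

Under this reading, "*bar.com" matches www.bar.com and www.foobar.com,
"*bar.com*" additionally matches sex-bar.computer-dating.com, and
"*.bar.com" matches www.bar.com but neither www.foobar.com nor bare
bar.com.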
