Re: spanning hosts: 2 Problems

2002-03-28 Thread Jens Rösner

Hi again, Ian and fellow wgeteers!

> A debug log will be useful if you can produce one.
Sure, I (or rather wget) can and did.
It is 60 kB of text. Should I zip it? Attach it?

> Also note that if you receive cookies that expire around 2038 with
> debugging on, the Windows version of Wget will crash! (This is a
> known bug with a known fix, but not yet finalised in CVS.)
Funny you mention that! 
I came across a crash caused by a cookie 
two days ago. I disabled cookies and it worked.
Should have traced this a bit more.

> > I just installed 1.7.1, which also works breadth-first.
> (I think you mean depth-first.)
*doh* /slaps forehead
Of course, thanks.

> used depth-first retrieval. There are advantages and disadvantages
> with both types of retrieval.
I understand; I followed (though did not totally understand)
the discussion back then.

> > Of course, this is possible.
> > I just had hoped that combining
> > -F -i url.html
> > with domain acceptance would save me a lot of time.
 
> Oh, I think I see what your first complaint is now. I initially
> assumed that your local html file was being served by a local HTTP
> server rather than being fed to the -F -i options. Is your complaint
> really that URLs supplied on the command line or via the -i option
> are not subjected to the acceptance/rejection rules? That does
> indeed seem to be the current behavior, but there is no particular
> reason why we couldn't apply the tests to these URLs as well as to
> the URLs obtained through recursion.

Well, you are confusing me a bit ;}
Assume a file like

<html>
<body>
<a href="http://www.audistory-nospam.com">1</a>
<a href="http://www.audistory-nospam.de">2</a>
<a href="http://www.audi100-online-nospam.de">3</a>
<a href="http://www.kolaschnik-nospam.de">4</a>
</body>
</html>

and a command line like

wget -nc -x -r -l0 -t10 -H -Dstory.de,audi -o example.log -k -d \
  -R ".gif,.exe,*tn*,*thumb*,*small*" -F -i example.html

Result with 1.8.1, and with 1.7.1 using -nh:
audistory.com: only index.html
audistory.de: everything
audi100-online: only the first page
kolaschnik.de: only the first page

What I would have liked and expected:
audistory.com: everything
audistory.de: everything
audi100-online: everything
kolaschnik.de: nothing

Independent of the question of how the string "audi"
should be matched within the URL, I think rejected URLs
should not be parsed or retrieved at all.

I hope I could articulate what I wanted to say :)

CU
Jens



Re: spanning hosts: 2 Problems

2002-03-26 Thread Ian Abbott

On 26 Mar 2002 at 19:01, Jens Rösner wrote:

> I am using wget to parse a local html file which has numerous links into
> the www.
> Now, I only want hosts that include certain strings, like
> -H -Daudi,vw,online.de

It's probably worth noting that the comparisons between the -D
strings and the domains being followed (or not) are anchored at
the end of the string, i.e. -Dfoo matches bar.foo but not
foo.bar.
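
For instance (hypothetical hosts, purely to illustrate the suffix
matching described above; only the -D handling matters here):

wget -r -l1 -H -Dfoo.example http://www.foo.example/
# a link to http://bar.foo.example/ would be followed
#   ("bar.foo.example" ends in "foo.example")
# a link to http://foo.example.org/ would not
#   ("foo.example" is not a suffix of "foo.example.org")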

> Two things I don't like about the way wget 1.8.1 works on Windows:
>
> The first page of even the rejected hosts gets saved.

That sounds like a bug.

> This messes up my directory structure, as I force directories
> (which is my default and normally useful).
>
> I am aware that wget has switched to breadth-first (as opposed to
> depth-first) retrieval.
> Now, when downloading from many (20+) different servers, this is a
> bit frustrating, as I will probably have the first completely
> downloaded site in a few days...

Would that be less of a problem if the first problem (first page
from rejected domains) was fixed?

> Is there any other way to work around this besides installing wget 1.6
> (or even 1.5)?

No, but note that if you pass several starting URLs to Wget, it
will complete the first before moving on to the second. That also
works for the URLs in the file specified by the --input-file
parameter. However, if all the sites are interlinked, you would be
no better off with this. The other alternative is to run wget
several times in sequence with different starting URLs and
restrictions, perhaps using the --timestamping or --no-clobber
options to avoid downloading things more than once.
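
Something along these lines, for instance (site names invented for
illustration; pick whatever restrictions suit each site):

wget -nc -x -r -l0 -H -Dsite1.example http://www.site1.example/
wget -nc -x -r -l0 -H -Dsite2.example http://www.site2.example/
wget -nc -x -r -l0 -H -Dsite3.example http://www.site3.example/
# -nc (--no-clobber) makes each later run skip files that an
# earlier run has already saved

That way each site is finished before the next one is started.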