bruce wrote:
hi...

i'm testing wget on a test site.. i'm using the recursive function of wget
to crawl through a portion of the site...

it appears that wget is hitting a link within the crawl that's causing it to
begin to crawl through the section of the site again...

i know wget isn't as robust as nutch, but can someone tell me if wget keeps
track of the URLs it's been through so it doesn't repeat/get stuck in a
never-ending process...

i haven't run across anything in the docs that seems to speak to this
point..

thanks

-bruce



Bruce,

Wget does keep a list of URLs that it has visited in order to avoid re-visiting them. The problem could be due to the URL normalization scheme. When wget crawls

http://foo.org/

it puts this URL on the "visited" list. If it later runs into

http://foo.org/default.htm

which is actually the same as

http://foo.org/

then wget is not aware that the two URLs are the same, so default.htm will be crawled again. But any URLs extracted from default.htm should be the same ones found in the previous crawl, so they should not be crawled a second time.
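
To make that concrete, here is a minimal sketch of that visited-list logic in Python (wget itself is written in C; the normalize() and should_crawl() names are made up for illustration). Deduplication happens on the normalized URL string, so two strings that resolve to the same resource still count as two different entries:

from urllib.parse import urlsplit, urlunsplit

visited = set()

def normalize(url):
    # String-level normalization only: lowercase the scheme and host,
    # drop the fragment. It has no way to know that /default.htm is
    # the server's directory index, so it cannot equate it with "/".
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def should_crawl(url):
    # Crawl a URL only if its normalized form hasn't been seen before.
    key = normalize(url)
    if key in visited:
        return False
    visited.add(key)
    return True

print(should_crawl("http://foo.org/"))            # True  -- crawled
print(should_crawl("http://foo.org/default.htm")) # True  -- the alias slips through
print(should_crawl("http://foo.org/default.htm")) # False -- deduped from here on

Equating the two forms would require knowing the server's directory-index configuration, which the client generally cannot see.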

If this doesn't help, you may want to include a more detailed description of your problem (for example, the exact command-line arguments you used).
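
For example, a typical recursive invocation looks something like

wget -r -l 5 -np http://yoursite.example/section/

where -r enables recursion, -l limits the recursion depth, and -np (--no-parent) keeps wget from ascending above the starting directory (the URL above is just a placeholder). Knowing which options you used would make it easier to tell whether the repetition is a normalization issue or something else.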

Regards,
Frank
