bruce wrote:
hi...

i'm testing wget on a test site.. i'm using the recursive function of wget
to crawl through a portion of the site...

it appears that wget is hitting a link within the crawl that's causing it to
begin to crawl through the section of the site again...

i know wget isn't as robust as nutch, but can someone tell me if wget keeps
track of the URLs it's been through so it doesn't repeat/get stuck in a
never-ending process...

i haven't run across anything in the docs that seems to speak to this
point..

thanks

-bruce



Bruce,

Wget does keep a list of URLs that it has visited in order to avoid re-visiting them. The problem could be due to the URL normalization scheme. When wget crawls

http://foo.org/

it puts this URL on the "visited" list. If it later runs into

http://foo.org/default.htm

which is actually the same as

http://foo.org/

then wget is not aware that the two URLs are the same, so default.htm will be crawled again. But any URLs extracted from default.htm should be the same ones found in the previous crawl, so they should not be crawled a second time.
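
To make that concrete, here is a minimal sketch of that visited-list logic in Python (wget itself is written in C; the normalize() and should_crawl() names are made up for illustration). Deduplication happens on the normalized URL string, so two strings that resolve to the same resource still count as two different entries:

from urllib.parse import urlsplit, urlunsplit

visited = set()

def normalize(url):
    # String-level normalization only: lowercase the scheme and host,
    # drop the fragment. It has no way to know that /default.htm is
    # the server's directory index, so it cannot equate it with "/".
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def should_crawl(url):
    # Crawl a URL only if its normalized form hasn't been seen before.
    key = normalize(url)
    if key in visited:
        return False
    visited.add(key)
    return True

print(should_crawl("http://foo.org/"))            # True  -- crawled
print(should_crawl("http://foo.org/default.htm")) # True  -- the alias slips through
print(should_crawl("http://foo.org/default.htm")) # False -- deduped from here on

Equating the two forms would require knowing the server's directory-index configuration, which the client generally cannot see.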

If this doesn't help, you may want to include a more detailed description of your problem (for example, the exact command-line arguments you used).
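
For example, a typical recursive invocation looks something like

wget -r -l 5 -np http://yoursite.example/section/

where -r enables recursion, -l limits the recursion depth, and -np (--no-parent) keeps wget from ascending above the starting directory (the URL above is just a placeholder). Knowing which options you used would make it easier to tell whether the repetition is a normalization issue or something else.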

Regards,
Frank
