hi frank...

there must be something simple i'm missing...

i'm looking to crawl the site >>>
http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071

i issue the wget command (quoting the url so the shell leaves the '?' alone):
 wget -r -np "http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071"

i thought this would simply get everything under the http://...?20071 url.
however, it appears that wget is also getting 20062, etc., which are the
other semesters...

what i'd really like to do is to simply get 'all depts' for each of the
semesters...
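
one thing i've been toying with (just a sketch -- it assumes the semester
code is the only query parameter and that the 'all depts' listing is linked
directly off each semester's search page, neither of which i've verified)
is looping over the semester codes and doing a shallow recursive grab of
each one, instead of relying on -np:

 # -l 1 limits the recursion depth so wget can't wander arbitrarily
 # far into the other semesters from any one starting page
 for sem in 20062 20071; do   # ...plus whatever other semester codes exist
     wget -r -l 1 "http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?$sem"
 done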

any thoughts/comments/etc...

-bruce



-----Original Message-----
From: Frank McCown [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 22, 2006 12:12 PM
To: [EMAIL PROTECTED]; [email protected]
Subject: Re: wget - tracking urls/web crawling


bruce wrote:
> hi...
>
> i'm testing wget on a test site.. i'm using the recursive function of wget
> to crawl through a portion of the site...
>
> it appears that wget is hitting a link within the crawl that's causing it
> to begin to crawl through the section of the site again...
>
> i know wget isn't as robust as nutch, but can someone tell me if wget
> keeps track of the URLs that it's been through so it doesn't repeat/get
> stuck in a never-ending process...
>
> i haven't run across anything in the docs that seems to speak to this
> point..
>
> thanks
>
> -bruce
>


Bruce,

Wget does keep a list of URLs that it has visited in order to avoid
re-visiting them.  The problem could be due to URL normalization, i.e.
wget not recognizing two different spellings of the same page.  When wget
crawls

http://foo.org/

it puts this URL on the "visited" list.  If it later runs into

http://foo.org/default.htm

which is actually the same as

http://foo.org/

then wget is not aware the two URLs are the same, so default.htm will be
crawled even though its contents were already fetched.  But any URLs
extracted from default.htm should be the same as in the previous crawl,
so they should not be crawled again.
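
To make that concrete, here is a minimal sketch of that kind of "visited"
bookkeeping (this is just an illustration in Python, not wget's actual
code, and the normalization rule shown -- collapsing default.htm and
index.html to the bare directory URL -- is an assumption about what a
smarter crawler might do):

from urllib.parse import urlsplit, urlunsplit

# Hypothetical normalizer: treat common "index" file names as the
# directory itself, so http://foo.org/default.htm and http://foo.org/
# collapse to the same key.
INDEX_NAMES = {"default.htm", "index.html", "index.htm"}

def normalize(url):
    scheme, netloc, path, query, _ = urlsplit(url)
    head, _, tail = path.rpartition("/")
    if tail.lower() in INDEX_NAMES:
        path = head + "/"
    return urlunsplit((scheme, netloc.lower(), path or "/", query, ""))

visited = set()

def should_fetch(url):
    key = normalize(url)
    if key in visited:
        return False      # some spelling of this URL was already crawled
    visited.add(key)
    return True

Without the normalize() step, should_fetch() would treat the two spellings
above as distinct keys, which is exactly the double-crawl behavior
described.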

You may want to include a more detailed description of your problem if
this doesn't help (for example, the command-line arguments, etc.).

Regards,
Frank
