hi frank... there must be something simple i'm missing...
i'm looking to crawl the site
http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071

i issue:

  wget -r -np 'http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071'

i thought that this would simply get everything under http://...?20071.
however, it appears that wget is also getting 20062, etc., which are the
other semesters... what i'd really like to do is simply get 'all depts'
for each of the semesters... any thoughts/comments/etc...

-bruce

-----Original Message-----
From: Frank McCown [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 22, 2006 12:12 PM
To: [EMAIL PROTECTED]; [email protected]
Subject: Re: wget - tracking urls/web crawling

bruce wrote:
> hi...
>
> i'm testing wget on a test site... i'm using the recursive function of
> wget to crawl through a portion of the site...
>
> it appears that wget is hitting a link within the crawl that's causing
> it to begin to crawl through that section of the site again...
>
> i know wget isn't as robust as nutch, but can someone tell me if wget
> keeps track of the URLs it's been through, so it doesn't repeat or get
> stuck in a never-ending process...
>
> i haven't run across anything in the docs that seems to speak to this
> point...
>
> thanks
>
> -bruce

Bruce,

Wget does keep a list of the URLs it has visited in order to avoid
re-visiting them. The problem could be due to the URL normalization
scheme. When wget crawls http://foo.org/, it puts this URL on the
"visited" list. If it later runs into http://foo.org/default.htm, which
is actually the same page as http://foo.org/, wget is not aware that the
two URLs are the same, so default.htm will be crawled again. But any
URLs extracted from default.htm should be the same as in the previous
crawl, so they should not be crawled again.

You may want to include a more detailed description of your problem if
this doesn't help (for example, the command-line arguments you used).

Regards,
Frank
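[ed.] Frank's point about the visited list can be sketched in a few lines of
shell. This is an illustration, not wget's actual code: the crawler compares
URLs as exact strings, so two spellings of the same page are not recognized
as duplicates, which is exactly the default.htm situation described above.

```shell
# Minimal sketch of a crawler's visited-URL list: membership is tested
# by exact string match, so equivalent URLs slip through.
visited=""

should_crawl() {
  case " $visited " in
    *" $1 "*) echo "skip  $1" ;;                      # seen this exact string
    *)        visited="$visited $1"; echo "crawl $1" ;;
  esac
}

should_crawl "http://foo.org/"            # -> crawl http://foo.org/
should_crawl "http://foo.org/default.htm" # -> crawl (same page, different string)
should_crawl "http://foo.org/"            # -> skip  http://foo.org/
```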
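[ed.] On bruce's original question: -np (--no-parent) restricts the crawl by
directory path, and every semester link here shares the same path
(/cgi-bin/TTW3.search.cgi), differing only in the query string, so -np alone
cannot exclude ?20062. A sketch of one way to restrict by URL pattern
instead, assuming a wget recent enough to provide --accept-regex (added well
after this thread, in wget 1.14):

```shell
# Hypothetical invocation: only follow links whose URL mentions the 20071 term.
#
#   wget -r -np --accept-regex 'TTW3\.search\.cgi\?20071' \
#     'http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071'
#
# The same regex, checked against the two kinds of links wget would see:
for url in 'http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071' \
           'http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20062'; do
  if echo "$url" | grep -qE 'TTW3\.search\.cgi\?20071'; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

On an older wget without --accept-regex, crawling each semester's 'all depts'
page separately (one non-recursive or shallow wget per term) avoids the
cross-semester links entirely.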
