hey frank... creating a list of pages to parse doesn't do me any good... i really need to be able to recurse through the underlying pages... or at least a section of the pages...
if there was a way that i could insert/use some form of a regex to exclude urls+querystring that match, then i'd be ok... the pages i need to exclude are based on information that's in the query portion of the url...

-bruce

-----Original Message-----
From: Frank McCown [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 22, 2006 2:34 PM
To: [EMAIL PROTECTED]
Cc: wget@sunsite.dk
Subject: Re: wget - tracking urls/web crawling

bruce wrote:
> i issue the wget:
> wget -r -np http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071
>
> i thought that this would simply get everything under the http://...?20071.
> however, it appears that wget is getting 20062, etc.. which are the other
> semesters...

The -np option only keeps wget from crawling URLs that are outside of the
cgi-bin directory. Since 20062, etc. live in that same directory, they
*will* be crawled.

> what i'd really like to do is to simply get 'all depts' for each of the
> semesters...

The problem with the site you are trying to crawl is that its pages are
hidden behind a web form. Wget is best at getting pages that are directly
linked (e.g., using an <a> tag) to other pages.

What I'd recommend doing is creating a list of pages that you want crawled.
Maybe you can do this with a script. Then I'd use --input-file and
--page-requisites (no -r) to crawl just those pages and get any images,
style sheets, etc. that the pages need to display.

Hope that helps,
Frank
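
[Editor's note: one possible way to combine the two suggestions above, as a
sketch only. The "dept" query parameter and the department codes are
hypothetical, since the real form fields aren't shown in this thread, and
the grep pattern is just an illustration of the query-string exclusion
bruce asked about.]

    #!/bin/sh
    # Base CGI URL from the thread; 20071 is the semester code.
    base='http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi'

    # Build the list of pages to fetch, one URL per line.
    # "dept" and the codes below are placeholders for the real form fields.
    for dept in ACCT ART BIO; do
        echo "${base}?20071&dept=${dept}"
    done > urls.txt

    # Drop any URLs whose query string matches an unwanted pattern,
    # e.g. other semesters such as 20062.
    grep -vE '\?2006[0-9]' urls.txt > wanted.txt

    # Fetch only those pages, plus the images/style sheets they need
    # to display (no recursion).
    wget --input-file=wanted.txt --page-requisites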