bruce wrote:
I issue the wget command:
wget -r -np http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071
I thought that this would simply get everything under the http://...?20071
page. However, it appears that wget is also getting 20062, etc., which are
the other semesters...
The -np (--no-parent) option only keeps wget from ascending into parent
directories; it pays no attention to query strings. Since
TTW3.search.cgi?20062 lives in the same cgi-bin directory as ?20071, it
*will* be crawled.
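(One workaround, if your wget is new enough to have --accept-regex, is to
tell the recursive crawl to skip any URL that doesn't mention 20071. This
is just a sketch; check `wget --version` first, since older builds don't
have that option.)

```shell
# Sketch: newer wget (1.14+) can filter a recursive crawl by regex.
# -np alone won't help here, because ?20062 and ?20071 sit in the
# same cgi-bin directory; --accept-regex narrows the crawl to one term.
# The command is echoed rather than run, so it doesn't hit the live site.
cmd="wget -r -np --accept-regex '20071' \
  'http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071'"
printf '%s\n' "$cmd"
```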
What I'd really like to do is simply get 'all depts' for each of the
semesters...
The problem with the site you are trying to crawl is that its pages are
hidden behind a web form. Wget is best at getting pages that are
directly linked (e.g., via an <a> tag) from other pages.
What I'd recommend is building a list of the pages you want crawled;
maybe you can do this with a script. Then use --input-file and
--page-requisites (without -r) to fetch just those pages, plus any
images, style sheets, etc. that they need to display.
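For example, something like this. The semester codes are the ones from
your URLs, but the query-string format is an assumption on my part, so
check what the search form actually submits and adjust to match:

```shell
# Sketch: build a list of search-CGI URLs, one per semester, then
# feed it to wget. The "?<term>" query format is a guess; inspect
# the real form to see what parameters it sends.
base='http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi'

: > urls.txt
for term in 20062 20064 20071; do   # semester codes from your URLs
    echo "${base}?${term}" >> urls.txt
done

cat urls.txt

# Then fetch just those pages and their requisites (not run here):
# wget --input-file=urls.txt --page-requisites
```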
Hope that helps,
Frank