hey frank...

creating a list of pages to parse doesn't do me any good... i really need
to be able to recurse through the underlying pages, or at least through a
section of the pages...

if there were a way for me to insert/use some form of regex to exclude
urls+querystring that match, then i'd be ok... the pages i need to exclude
are based on information that's in the query portion of the url...
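
for example, a filter something like this (a hypothetical pattern, just to
illustrate the kind of match i mean):

    exclude any url matching:   TTW3\.search\.cgi\?.*<some-value>

i.e. skip any page whose querystring contains a particular value, while
still recursing through everything else under the start url...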

-bruce



-----Original Message-----
From: Frank McCown [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 22, 2006 2:34 PM
To: [EMAIL PROTECTED]
Cc: wget@sunsite.dk
Subject: Re: wget - tracking urls/web crawling


bruce wrote:
> i issue the wget:
>  wget -r -np http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071
>
> i thought that this would simply get everything under the http://...?20071.
> however, it appears that wget is getting 20062, etc.. which are the other
> semesters...

The -np option only keeps wget from crawling URLs that are outside of the
cgi-bin directory.  The other semesters live at the same path and differ
only in the query string, which -np doesn't look at, so 20062, etc. *will*
be crawled.
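
For example, both of these count as being inside the same directory as far
as -np is concerned:

   http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071   (your start page)
   http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20062   (another semester)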


> what i'd really like to do is to simply get 'all depts' for each of the
> semesters...

The problem with the site you are trying to crawl is that its pages are
hidden behind a web form.  Wget is best at getting pages that are directly
linked (e.g., with an <a> tag) from other pages.

What I'd recommend is creating a list of the pages that you want crawled.
Maybe you can do this with a script.  Then I'd use the --input-file and
--page-requisites options (no -r) to crawl just those pages and get any
images, style sheets, etc. that the pages need to display.
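
A rough sketch of what I mean (the semester codes and the pages.txt file
name are just placeholders, so adjust them to whatever you actually need):

   # build the list of start pages with a small script
   for sem in 20062 20071; do
       echo "http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?$sem"
   done > pages.txt

   # fetch just those pages plus the images, style sheets, etc. they need
   wget --input-file=pages.txt --page-requisites

--input-file (-i) reads the URLs to fetch from the file, and
--page-requisites (-p) pulls in whatever each page needs to display,
without recursing any further.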


Hope that helps,
Frank
