hi...

what i really need from wget is the ability to crawl through a site and
return information based on criteria that i'd like to define...

a given crawling process would normally start at some URL and iteratively
fetch the files underneath that URL. wget does this, as well as providing
some additional functionality.
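
to make that concrete, here's a minimal python sketch of the kind of crawl
loop i mean. it's just illustrative (the names and the regex-based link
extraction are my own, not wget's internals):

import re
import urllib.request
from urllib.parse import urljoin

def crawl(start_url, max_pages=100):
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        # follow only links that live underneath the start URL
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            if link.startswith(start_url):
                queue.append(link)
    return seen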

i need more functionality....

in particular, i'd like to be able to modify the way wget handles forms
and links/queries on a given page.

i'd like to be able to:

for forms (rough sketch below):
 allow the app to handle POST/GET forms
 allow the app to select which elements within a form
  to use and which to ignore
 track the FORM(s) for a given URL/page/level of the crawl
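
something along these lines, in python. the FormCollector/submit names are
just ones i've made up, and the stdlib html parser stands in for whatever
wget would use internally:

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

class FormCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.forms = []          # one dict per FORM found on the page

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.forms.append({"method": attrs.get("method", "get").lower(),
                               "action": attrs.get("action", ""),
                               "fields": {}})
        elif tag == "input" and self.forms and attrs.get("name"):
            self.forms[-1]["fields"][attrs["name"]] = attrs.get("value", "")

def submit(form, page_url, ignore=()):
    # select (use/ignore) individual elements within the form
    data = {k: v for k, v in form["fields"].items() if k not in ignore}
    target = urljoin(page_url, form["action"])
    if form["method"] == "post":
        return urllib.request.urlopen(target, urlencode(data).encode())
    return urllib.request.urlopen(target + "?" + urlencode(data))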

for links (rough sketch below):
 allow the app to include/exclude a given link
  for a given page/URL via regex parsing or a list of
  URLs
 allow the app to handle querystring data, i.e.
  to include/exclude the URL+query based on regex
  parsing or simple text comparison
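
for example, a small python sketch of that include/exclude logic
(keep_link and its parameters are names i've invented for illustration):

import re
from urllib.parse import urlsplit

def keep_link(url, include_re=None, include_list=(), exclude_query_re=None):
    if include_list and url in include_list:
        return True
    if include_re and not re.search(include_re, url):
        return False
    query = urlsplit(url).query
    if exclude_query_re and re.search(exclude_query_re, query):
        return False              # excluded by the querystring contents
    return True

e.g. to skip any URL whose query contains action=logout:
 keep_link("http://example.com/page?action=logout",
           exclude_query_re=r"action=logout")   -> False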

data extraction (rough sketch below):
 ability to do xpath/regex extraction based on the DOM
 permit multiple xpath/regex functions to be run on a
  given page
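
roughly like this in python, assuming the third-party lxml package for the
DOM/xpath side; the expressions shown are just examples:

import re
import lxml.html

def extract(html, xpath_exprs=(), regex_exprs=()):
    doc = lxml.html.fromstring(html)
    results = {}
    for expr in xpath_exprs:      # multiple xpath functions per page
        results[expr] = doc.xpath(expr)
    for expr in regex_exprs:      # plus plain regex over the raw source
        results[expr] = re.findall(expr, html)
    return results

e.g. extract(page, xpath_exprs=["//title/text()", "//a/@href"],
             regex_exprs=[r"price:\s*\$([0-9.]+)"])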


this kind of functionality would let wget be quite selective about how it
crawls through a site and about the information it extracts....

thanks

-bruce


-----Original Message-----
From: Tony Lewis [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 22, 2006 4:36 PM
To: [EMAIL PROTECTED]
Cc: wget@sunsite.dk
Subject: RE: wget - tracking urls/web crawling


Bruce wrote:

> if there was a way that i could insert/use some form of a regex to exclude
> urls+querystring that match, then i'd be ok... the pages i need to exclude
> are based on information that's in the query portion of the url...

Work on such a feature has been promised for an upcoming release of wget.

Tony Lewis
