hi... what i really need regarding wget is the ability to crawl through a site and return information based on some criteria that i'd like to define...
a given crawling process would normally start at some URL and iteratively fetch the files underneath that URL. wget does this, as well as providing some additional functionality, but i need more... in particular, i'd like to be able to modify the way wget handles forms, and links/queries, on a given page. i'd like to be able to:

for forms:
 - allow the app to handle POST/GET forms
 - allow the app to select (implement/ignore) given elements within a form
 - track the FORM(s) for a given URL/page/level of the crawl

for links:
 - allow the app to either include/exclude a given link for a given page/URL, via regex parsing or a list of URLs
 - allow the app to handle querystring data, ie to include/exclude the URL+query based on regex parsing or simple text comparison

for data extraction:
 - ability to do xpath/regex extraction based on the DOM
 - permit multiple xpath/regex functions to be run on a given page

this kind of functionality would allow the 'wget' function to be relatively selective regarding the ability to crawl through a site and extract the required information....

thanks

-bruce

-----Original Message-----
From: Tony Lewis [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 22, 2006 4:36 PM
To: [EMAIL PROTECTED]
Cc: wget@sunsite.dk
Subject: RE: wget - tracking urls/web crawling

Bruce wrote:
> if there was a way that i could insert/use some form of a regex to exclude
> urls+querystring that match, then i'd be ok... the pages i need to
> exclude
> are based on information that's in the query portion of the url...

Work on such a feature has been promised for an upcoming release of wget.

Tony Lewis
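p.s. here's a rough sketch of the kind of link + querystring filtering i have in mind -- in Python rather than wget's C, using only the standard library. the function and parameter names (`filter_links`, `include`, `exclude`) are just illustrative, not anything wget actually provides:

```python
# sketch: collect the links on a page, then keep/drop each one based on
# include/exclude regexes applied to the full URL+querystring.
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def filter_links(base_url, html, include=None, exclude=None):
    """Return absolute links whose URL+querystring passes the regex filters."""
    parser = LinkCollector()
    parser.feed(html)
    kept = []
    for href in parser.links:
        url = urljoin(base_url, href)          # resolve relative links
        if include and not re.search(include, url):
            continue                           # include filter: keep only matches
        if exclude and re.search(exclude, url):
            continue                           # exclude filter: drop matching URL+query
        kept.append(url)
    return kept


# example page with querystring data in the links
page = """
<a href="/report?id=1&action=view">view</a>
<a href="/report?id=1&action=delete">delete</a>
<a href="/about.html">about</a>
"""

# exclude any URL whose query portion matches action=delete
links = filter_links("http://example.com/", page, exclude=r"action=delete")
```

a real crawler would then fetch each kept URL and recurse; xpath extraction on the DOM would need an HTML/xpath library on top of this, but the filtering step above is the part i keep running into.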