Re: wget - tracking urls/web crawling

2006-06-28 Thread Mauro Tortonesi

Tony Lewis wrote:
Bruce wrote: 



any idea as to who's working on this feature?


Mauro Tortonesi sent out a request for comments to the mailing list on March
29. I don't know whether he has started working on the feature or not.


yes. i haven't started coding it yet, though. i am still working on the 
last fixes for recursive spider mode.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


RE: wget - tracking urls/web crawling

2006-06-23 Thread bruce
hey tony...

any idea as to who's working on this feature?

thanks..

-bruce


-Original Message-
From: Tony Lewis [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 22, 2006 4:36 PM
To: [EMAIL PROTECTED]
Cc: wget@sunsite.dk
Subject: RE: wget - tracking urls/web crawling


Bruce wrote:

 if there was a way that i could insert/use some form of a regex to exclude
 urls+querystring that match, then i'd be ok... the pages i need to exclude
 are based on information that's in the query portion of the url...

Work on such a feature has been promised for an upcoming release of wget.

Tony Lewis



Re: wget - tracking urls/web crawling

2006-06-23 Thread Hrvoje Niksic
bruce [EMAIL PROTECTED] writes:

 any idea as to who's working on this feature?

No one, as far as I know.


RE: wget - tracking urls/web crawling

2006-06-23 Thread bruce
hi...

what i really need regarding wget, is the ability to crawl through a site,
and to return information, based on some criteria that i'd like to define...

a given crawling process would normally start at some URL and iteratively
fetch the files underneath that URL. wget does this, as well as providing
some additional functionality.

i need more functionality

in particular, i'd like to be able to modify the way wget handles forms, and
links/queries on a given page.

i'd like to be able to:

for forms:
 allow the app to handle POST/GET forms
 allow the app to select (implement/ignore) given
  elements within a form
 track the FORM(s) for a given URL/page/level of the crawl

for links:
 allow the app to either include/exclude a given link
  for a given page/URL via regex parsing or list of
  URLs
 allow the app to handle querystring data, ie
  to include/exclude the URL+Query based on regex
  parsing or simple text comparison

data extraction:
 ability to do xpath/regex extraction based on the DOM
 permit multiple xpath/regex functions to be run on a
  given page


this kind of functionality would make wget much more selective about which
parts of a site it crawls and what information it extracts... (a rough
sketch of the data-extraction idea follows below)
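
none of this exists in wget today, but just to make the data-extraction item
concrete, here's a rough sketch of the kind of post-processing i have in
mind, done outside of wget (the URL is made up, and the regex is only an
illustration):

 # fetch one page to stdout, then pull out the href values with a regex
 wget -q -O - 'http://www.example.com/somepage' | grep -Eo 'href="[^"]*"'

ideally wget itself would let me register several extractions like this and
run them against every page it crawls...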

thanks

-bruce


-Original Message-
From: Tony Lewis [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 22, 2006 4:36 PM
To: [EMAIL PROTECTED]
Cc: wget@sunsite.dk
Subject: RE: wget - tracking urls/web crawling


Bruce wrote:

 if there was a way that i could insert/use some form of a regex to exclude
 urls+querystring that match, then i'd be ok... the pages i need to exclude
 are based on information that's in the query portion of the url...

Work on such a feature has been promised for an upcoming release of wget.

Tony Lewis



RE: wget - tracking urls/web crawling

2006-06-23 Thread Tony Lewis
Bruce wrote: 

 any idea as to who's working on this feature?

Mauro Tortonesi sent out a request for comments to the mailing list on March
29. I don't know whether he has started working on the feature or not.

Tony



Re: wget - tracking urls/web crawling

2006-06-22 Thread Frank McCown

bruce wrote:

hi...

i'm testing wget on a test site.. i'm using the recursive function of wget
to crawl through a portion of the site...

it appears that wget is hitting a link within the crawl that's causing it to
begin to crawl through the section of the site again...

i know wget isn't as robust as nutch, but can someone tell me if wget keeps
track of the URLs it's already been through, so it doesn't repeat itself/get
stuck in a never-ending process...

i haven't run across anything in the docs that seems to speak to this
point..

thanks

-bruce




Bruce,

Wget does keep a list of URLs that it has visited in order to avoid 
re-visiting them.  The problem could be due to the URL normalization 
scheme.  When wget crawls


http://foo.org/

it puts this URL on the visited list. If it later runs into

http://foo.org/default.htm

which is actually the same as

http://foo.org/

then wget is not aware that the URLs are the same, so default.htm will be
crawled again.  But any URLs extracted from default.htm should be the same
as in the previous crawl, so they should not be crawled again.
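
If you can tell which alias is being fetched twice, one possible workaround
(just a sketch; it assumes the duplicate really is a file named default.htm)
is to reject it by name:

 wget -r -R 'default.htm' http://foo.org/

Note that in recursive mode wget may still download a rejected HTML file in
order to scan it for links and then delete it, so this mostly saves the
duplicate copy on disk rather than the extra request.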


You may want to include a more detailed description of your problem if 
this doesn't help (for example, the command-line arguments, etc.).


Regards,
Frank


RE: wget - tracking urls/web crawling

2006-06-22 Thread Post, Mark K
Try using the -np (no parent) parameter.


Mark Post 

-Original Message-
From: bruce [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 22, 2006 4:15 PM
To: 'Frank McCown'; wget@sunsite.dk
Subject: RE: wget - tracking urls/web crawling

hi frank...

there must be something simple i'm missing...

i'm looking to crawl the site 
http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071

i issue the wget command:

 wget -r -np http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071

i thought that this would simply get everything under the http://...?20071.
however, it appears that wget is getting 20062, etc., which are the other
semesters...

what i'd really like to do is to simply get 'all depts' for each of the
semesters...

any thoughts/comments/etc...

-bruce




RE: wget - tracking urls/web crawling

2006-06-22 Thread bruce
hey frank...

creating a list of pages to parse doesn't do me any good... i really need to
be able to recurse through the underlying pages.. or at least a section of
the pages...

if there was a way that i could insert/use some form of a regex to exclude
urls+querystring that match, then i'd be ok... the pages i need to exclude
are based on information that's in the query portion of the url...

-bruce



-Original Message-
From: Frank McCown [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 22, 2006 2:34 PM
To: [EMAIL PROTECTED]
Cc: wget@sunsite.dk
Subject: Re: wget - tracking urls/web crawling


bruce wrote:
 i issue the wget:
  wget -r -np http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071

 i thought that this would simply get everything under the http://...?20071.
 however, it appears that wget is getting 20062, etc., which are the other
 semesters...

The -np option only keeps wget from crawling URLs that are outside of
the cgi-bin directory.  That means 20062, etc. *will* still be crawled.
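
For what it's worth, if you ever end up with a wget build that has a
--reject-regex option (it has been proposed but is not in current releases,
so treat this strictly as a sketch), you could filter the unwanted semesters
on the query string directly, e.g.:

 wget -r --reject-regex '[?]2006[0-9]' \
   'http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071'

That assumes the other semesters always show up as ?2006x in the URL.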


 what i'd really like to do is to simply get 'all depts' for each of the
 semesters...

The problem with the site you are trying to crawl is that its pages are
hidden behind a web form.  Wget is best at getting pages that are directly
linked to other pages (e.g., with an <a> tag).

What I'd recommend doing is creating a list of pages that you want
crawled.  Maybe you can do this with a script.  Then I'd use the
--input-file and --page-requisites options (no -r) to crawl just those
pages and get any images, style sheets, etc. that the pages need to display.
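
For example, something along these lines (assuming you save the list of
URLs, one per line, in a file called urls.txt):

 wget --page-requisites --input-file=urls.txt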


Hope that helps,
Frank