RE: wget - tracking urls/web crawling

2006-06-23 Thread Tony Lewis
Bruce wrote: 

> any idea as to who's working on this feature?

Mauro Tortonesi sent out a request for comments to the mailing list on March
29. I don't know whether he has started "working" on the feature or not.

Tony



RE: wget - tracking urls/web crawling

2006-06-23 Thread bruce
hi...

what i really need regarding wget, is the ability to crawl through a site,
and to return information, based on some criteria that i'd like to define...

a given crawling process would normally start at some URL and iteratively
fetch files underneath the URL. wget does this, as well as providing some
additional functionality.

i need more functionality

in particular, i'd like to be able to modify the way wget handles forms, and
links/queries on a given page.

i'd like to be able to:

for forms:
 allow the app to handle POST/GET forms
 allow the app to select (implement/ignore) given
  elements within a form
 track the FORM(s) for a given URL/page/level of the crawl
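
As an illustration of the form handling requested above, here is a minimal sketch in Python. It is not wget code; the names `FormCollector` and `collect_forms` and the `ignore` parameter are hypothetical, and only the standard library `html.parser` module is assumed.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collect the forms found on one page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per <form> seen so far
        self._current = None  # the form we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                # HTML defaults to GET when no method is given
                "method": (attrs.get("method") or "get").upper(),
                "action": attrs.get("action") or "",
                "fields": [],
            }
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea") and self._current is not None:
            name = attrs.get("name")
            if name:
                self._current["fields"].append(name)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def collect_forms(html, ignore=()):
    """Return the forms on a page, dropping field names listed in `ignore`."""
    parser = FormCollector()
    parser.feed(html)
    parser.close()
    for form in parser.forms:
        form["fields"] = [f for f in form["fields"] if f not in ignore]
    return parser.forms
```

A crawler could call `collect_forms(page_html, ignore=["csrf"])` once per fetched page and store the result next to the URL and crawl depth, which would cover the "track the FORM(s)" item.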

for links:
 allow the app to either include/exclude a given link
  for a given page/URL via regex parsing or list of
  URLs
 allow the app to handle querystring data, ie
  to include/exclude the URL+Query based on regex
  parsing or simple text comparison
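
The link rules above could be sketched roughly as follows. This is an assumption about how such a filter might behave, not an existing wget feature; `should_follow` and its `include`/`exclude` parameters are hypothetical names. The patterns are applied to the path plus query string, so a rule can key on query-string data.

```python
import re
from urllib.parse import urlsplit


def should_follow(url, include=(), exclude=()):
    """Decide whether a crawler should follow `url` (hypothetical helper).

    Patterns are matched against path + '?' + query. Exclude rules win
    over include rules; an empty include list means "everything that is
    not excluded".
    """
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    if any(re.search(p, target) for p in exclude):
        return False
    if include:
        return any(re.search(p, target) for p in include)
    return True
```

For example, `should_follow(url, exclude=[r"[?&]sort="])` would skip every link whose query string carries a `sort` parameter while still following the same page without it.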

data extraction:
 ability to do xpath/regex extraction based on the DOM
 permit multiple xpath/regex functions to be run on a
  given page
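
A rough sketch of running several xpath/regex extractions over one page, using only the Python standard library. The function name `extract` is hypothetical. Note the assumptions: `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup, so real-world HTML would need a more tolerant parser.

```python
import re
import xml.etree.ElementTree as ET


def extract(page, xpaths=(), regexes=()):
    """Run every XPath and regex over one page (hypothetical helper).

    Returns a dict mapping each expression to the list of matches.
    XPath matches are reported as the text content of each element.
    """
    results = {}
    root = ET.fromstring(page)  # raises ParseError on malformed markup
    for xp in xpaths:
        results[xp] = ["".join(el.itertext()).strip() for el in root.findall(xp)]
    for rx in regexes:
        results[rx] = re.findall(rx, page)
    return results
```

Calling `extract(page, xpaths=[".//p[@class='price']"], regexes=[r"\d+\.\d+"])` would apply both expressions to the same page in a single pass over the crawl result.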


this kind of functionality would allow the 'wget' function to be relatively
selective regarding the ability to crawl through a site and extract the
required information

thanks

-bruce


-Original Message-
From: Tony Lewis [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 22, 2006 4:36 PM
To: [EMAIL PROTECTED]
Cc: wget@sunsite.dk
Subject: RE: wget - tracking urls/web crawling


Bruce wrote:

> if there was a way that i could insert/use some form of a regex to exclude
> urls+querystring that match, then i'd be ok... the pages i need to exclude
> are based on information that's in the query portion of the url...

Work on such a feature has been promised for an upcoming release of wget.

Tony Lewis



Re: wget - tracking urls/web crawling

2006-06-23 Thread Hrvoje Niksic
"bruce" <[EMAIL PROTECTED]> writes:

> any idea as to who's working on this feature?

No one, as far as I know.


RE: wget - tracking urls/web crawling

2006-06-23 Thread bruce
hey tony...

any idea as to who's working on this feature?

thanks..

-bruce


-Original Message-
From: Tony Lewis [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 22, 2006 4:36 PM
To: [EMAIL PROTECTED]
Cc: wget@sunsite.dk
Subject: RE: wget - tracking urls/web crawling


Bruce wrote:

> if there was a way that i could insert/use some form of a regex to exclude
> urls+querystring that match, then i'd be ok... the pages i need to exclude
> are based on information that's in the query portion of the url...

Work on such a feature has been promised for an upcoming release of wget.

Tony Lewis



BUGBUG

2006-06-23 Thread roman . strecker

Hi!

I'm trying to mirror the site www.turboborisov.com
and get an error (see also the attached picture). I am using version 1.10.2
on a WinXP SP2 machine. Here are the options I use:

set http_proxy=proxy-999b.changed.de:80
c:\Programme\Wget\wget.exe --proxy=on --mirror --page-requisites --convert-links --proxy --continue --force-directories --html-extension %1

%1 is the first parameter I pass to the
CMD script.

Translated, the error message reads:

The instruction at "0x00..."
referenced memory at "0x00..0". The memory could not be "read".
Click "OK" to terminate the program, or "Cancel" to debug...

Here is some disassembled code:

00417807  ret
00417808  nop
00417809  nop
0041780A  nop
0041780B  nop
0041780C  nop
0041780D  nop
0041780E  nop
0041780F  nop
00417810  mov         eax,dword ptr [esp+8]
00417814  mov         ecx,dword ptr [esp+4]
00417818  push        ebx
00417819  push        ebp
0041781A  mov         ebp,dword ptr [ecx]
0041781C  push        esi
0041781D  push        edi
0041781E  mov         edi,dword ptr [eax]
00417820  mov         eax,dword ptr [ebp+20h]
00417823  mov         esi,dword ptr [edi+20h]
00417826  mov         dl,byte ptr [eax]
00417828  mov         bl,byte ptr [esi]
0041782A  mov         cl,dl
0041782C  cmp         dl,bl
0041782E  jne         0041784E
00417830  test        cl,cl
00417832  je          0041784A
00417834  mov         dl,byte ptr [eax+1]
00417837  mov         bl,byte ptr [esi+1]
0041783A  mov         cl,dl
0041783C  cmp         dl,bl
0041783E  jne         0041784E
00417840  add         eax,2
00417843  add         esi,2
00417846  test        cl,cl
00417848  jne         00417826
0041784A  xor         ebx,ebx
0041784C  jmp         00417853
0041784E  sbb         ebx,ebx
00417850  sbb         ebx,0h
00417853  mov         esi,dword ptr [edi+24h]
00417856  mov         ecx,dword ptr [ebp+24h]
00417859  mov         al,byte ptr [ecx]
0041785B  mov         dl,al
0041785D  cmp         al,byte ptr [esi]
0041785F  jne         0041787D
00417861  test        dl,dl
00417863  je          00417879
00417865  mov         al,byte ptr [ecx+1]
00417868  mov         dl,al
0041786A  cmp         al,byte ptr [esi+1]
0041786D  jne         0041787D
0041786F  add         ecx,2
00417872  add         esi,2
00417875  test        dl,dl
00417877  jne         00417859
00417879  xor         eax,eax
0041787B  jmp         00417882
0041787D  sbb         eax,eax
0041787F  sbb         eax,0h
00417882  test        ebx,ebx
00417884  je          00417888
00417886  mov         eax,ebx
00417888  pop         edi
00417889  pop         esi
0041788A  pop         ebp
0041788B  pop         ebx
0041788C  ret
0041788D  nop
0041788E  nop
0041788F  nop




Regards, Roman.


Kind regards,

KKH - Die Kaufmännische
Roman Strecker

Systems Engineering
Head Office

Karl-Wiechert-Allee 61
30625 Hannover
Tel.: +49 (511) 2802-5686
Fax: +49 (511) 2802-5699
www.kkh.de