RE: wget - tracking urls/web crawling
Bruce wrote:
> any idea as to who's working on this feature?

Mauro Tortonesi sent out a request for comments to the mailing list on March 29. I don't know whether he has started "working" on the feature or not.

Tony
RE: wget - tracking urls/web crawling
hi...

what i really need regarding wget is the ability to crawl through a site and return information based on criteria that i'd like to define...

a given crawling process would normally start at some URL and iteratively fetch the files underneath that URL. wget does this, as well as providing some additional functionality.

i need more functionality. in particular, i'd like to be able to modify the way wget handles forms and links/queries on a given page.

i'd like to be able to:

for forms:
 - allow the app to handle POST/GET forms
 - allow the app to select (implement/ignore) given elements within a form
 - track the FORM(s) for a given URL/page/level of the crawl

for links:
 - allow the app to either include/exclude a given link for a given page/URL via regex parsing or a list of URLs
 - allow the app to handle querystring data, ie to include/exclude the URL+Query based on regex parsing or simple text comparison

data extraction:
 - ability to do xpath/regex extraction based on the DOM
 - permit multiple xpath/regex functions to be run on a given page

this kind of functionality would allow wget to be relatively selective when crawling through a site and extracting the required information.

thanks

-bruce

-----Original Message-----
From: Tony Lewis [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 22, 2006 4:36 PM
To: [EMAIL PROTECTED]
Cc: wget@sunsite.dk
Subject: RE: wget - tracking urls/web crawling

Bruce wrote:
> if there was a way that i could insert/use some form of a regex to exclude
> urls+querystring that match, then i'd be ok... the pages i need to exclude
> are based on information that's in the query portion of the url...

Work on such a feature has been promised for an upcoming release of wget.

Tony Lewis
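The link filtering asked for above (include or exclude a URL by regex over the path plus query string) can be sketched outside of wget. This is only a minimal illustration in Python, not a wget feature or option: the function name `should_follow`, its parameters, and the example.com URLs are all hypothetical.

```python
import re
from urllib.parse import urlsplit


def should_follow(url, exclude_patterns=(), include_patterns=()):
    """Decide whether a crawler should fetch `url` (hypothetical helper).

    Patterns are regular expressions matched against the path plus the
    query string, so a rule can look at the query portion of the URL.
    Exclude rules win; if include rules are given, one must match.
    """
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")

    if any(re.search(p, target) for p in exclude_patterns):
        return False
    if include_patterns:
        return any(re.search(p, target) for p in include_patterns)
    return True


if __name__ == "__main__":
    # Skip every URL whose query string carries a "sort" parameter.
    rules = [r"[?&]sort="]
    for candidate in (
        "http://example.com/list?page=2&sort=asc",
        "http://example.com/list?page=2",
    ):
        print(candidate, should_follow(candidate, exclude_patterns=rules))
```

A crawler would call such a check on each link it extracts from a page before queueing it; simple text comparison can be had by passing patterns through `re.escape` first.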
Re: wget - tracking urls/web crawling
"bruce" <[EMAIL PROTECTED]> writes:
> any idea as to who's working on this feature?

No one, as far as I know.
RE: wget - tracking urls/web crawling
hey tony...

any idea as to who's working on this feature?

thanks..

-bruce

-----Original Message-----
From: Tony Lewis [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 22, 2006 4:36 PM
To: [EMAIL PROTECTED]
Cc: wget@sunsite.dk
Subject: RE: wget - tracking urls/web crawling

Bruce wrote:
> if there was a way that i could insert/use some form of a regex to exclude
> urls+querystring that match, then i'd be ok... the pages i need to exclude
> are based on information that's in the query portion of the url...

Work on such a feature has been promised for an upcoming release of wget.

Tony Lewis
BUGBUG
Hi!

I'm trying to mirror the site www.turboborisov.com and get an error (see also the attached picture). I use version 1.10.2 on a WinXP SP2 machine.

Here are the options I use:

    set http_proxy=proxy-999b.changed.de:80
    c:\Programme\Wget\wget.exe --proxy=on --mirror --page-requisites --convert-links --proxy --continue --force-directories --html-extension %1

%1 is the first parameter I pass to the CMD script.

Translated, the error message reads: The instruction at "0x00..." referenced memory at "0x00..0". The memory could not be "read". Click "OK" to terminate or "Cancel" to debug...

Here is some disassembled code:

    00417807 ret
    00417808 nop
    00417809 nop
    0041780A nop
    0041780B nop
    0041780C nop
    0041780D nop
    0041780E nop
    0041780F nop
    00417810 mov eax,dword ptr [esp+8]
    00417814 mov ecx,dword ptr [esp+4]
    00417818 push ebx
    00417819 push ebp
    0041781A mov ebp,dword ptr [ecx]
    0041781C push esi
    0041781D push edi
    0041781E mov edi,dword ptr [eax]
    00417820 mov eax,dword ptr [ebp+20h]
    00417823 mov esi,dword ptr [edi+20h]
    00417826 mov dl,byte ptr [eax]
    00417828 mov bl,byte ptr [esi]
    0041782A mov cl,dl
    0041782C cmp dl,bl
    0041782E jne 0041784E
    00417830 test cl,cl
    00417832 je 0041784A
    00417834 mov dl,byte ptr [eax+1]
    00417837 mov bl,byte ptr [esi+1]
    0041783A mov cl,dl
    0041783C cmp dl,bl
    0041783E jne 0041784E
    00417840 add eax,2
    00417843 add esi,2
    00417846 test cl,cl
    00417848 jne 00417826
    0041784A xor ebx,ebx
    0041784C jmp 00417853
    0041784E sbb ebx,ebx
    00417850 sbb ebx,0h
    00417853 mov esi,dword ptr [edi+24h]
    00417856 mov ecx,dword ptr [ebp+24h]
    00417859 mov al,byte ptr [ecx]
    0041785B mov dl,al
    0041785D cmp al,byte ptr [esi]
    0041785F jne 0041787D
    00417861 test dl,dl
    00417863 je 00417879
    00417865 mov al,byte ptr [ecx+1]
    00417868 mov dl,al
    0041786A cmp al,byte ptr [esi+1]
    0041786D jne 0041787D
    0041786F add ecx,2
    00417872 add esi,2
    00417875 test dl,dl
    00417877 jne 00417859
    00417879 xor eax,eax
    0041787B jmp 00417882
    0041787D sbb eax,eax
    0041787F sbb eax,0h
    00417882 test ebx,ebx
    00417884 je 00417888
    00417886 mov eax,ebx
    00417888 pop edi
    00417889 pop esi
    0041788A pop ebp
    0041788B pop ebx
    0041788C ret
    0041788D nop
    0041788E nop
    0041788F nop

Regards,
Roman.

Kind regards,
KKH - Die Kaufmännische
Roman Strecker
Systems Engineering
Head Office
Karl-Wiechert-Allee 61
30625 Hannover
Tel.: +49 (511) 2802-5686
Fax: +49 (511) 2802-5699
www.kkh.de