wget parsing JavaScript
wget stumbled upon the following HTML file (angle brackets restored; the archiver stripped them):

--- 8< ---
<html>
<head><title>foo</title></head>
<body>
<SCRIPT language="JavaScript1.2">
var sitems=new Array()
var sitemlinks=new Array()

//Edit below
//extend or shorten this list
sitems[0]="15.html"
sitems[1]="16.html"
sitems[2]="17.html"
sitems[3]="18.html"
sitems[4]="19.html"
sitems[5]="20.html"
sitems[6]="21.html"
sitems[7]="22.html"
sitems[8]="23.html"
sitems[9]="24.html"
sitems[10]="25.html"
sitems[11]="26.html"
sitems[12]="27.html"

//These are the links pertaining to the above text.
sitemlinks[0]="31.html"
sitemlinks[1]="32.html"
sitemlinks[2]="33.html"
sitemlinks[3]="34.html"
sitemlinks[4]="35.html"
sitemlinks[5]="36.html"
sitemlinks[6]="37.html"
sitemlinks[7]="38.html"
sitemlinks[8]="39.html"
sitemlinks[9]="40.html"
sitemlinks[10]="41.html"
sitemlinks[11]="42.html"
sitemlinks[12]="43.html"

//If you want the links to load in another frame/window, specify name of
//target (ie: target="_new")
var target=""

for (i=0;i<=sitems.length-1;i++)
document.write('<a href="'+sitemlinks[i]+'" target="'+target+'">'+sitems[i]+'</a><br>')
</SCRIPT>
<NOSCRIPT>
Congratulations, you have turned off JavaScript.
</NOSCRIPT>
</body>
</html>
--- 8< ---

I see that wget handles SCRIPT with tag_find_urls, i.e. it tries to parse whatever is inside. Why was this implemented? JavaScript is mostly used to construct links programmatically; wget is likely to find bogus URLs until it can properly parse JavaScript.
--
Csaba Ráduly, Software Engineer, Sophos Anti-Virus
email: [EMAIL PROTECTED]  http://www.sophos.com
US Support: +1 888 SOPHOS 9  UK Support: +44 1235 559933
GNU wget 1.8.1 - Bug report memory occupied
Hello specialists, I used wget 1.8.1 on my system to mirror the site www.europa.eu.int. The transfer ran through a proxy over DSL overnight. After about 12-13 hours I found the following situation: about 1.8GB of data had been downloaded in total, and the wget process had grown to occupy approximately 75MB of RAM! This growth was fatal for the system, because there is only 32MB of RAM in the Intel 486 machine. The download rate dropped dramatically, but the system was still running. I killed the process with Ctrl-C, and everything else seems to be OK. After examining the data I found that the link rewriting was not correct in all cases. Files were placed in the right directories, but the links rewritten into the pages were often wrong. They should be: http://myroot/europa.eu.int/individual_dir but were: http://europa.eu.int/individual_dir The problem seems to be that wget misses the part of the directory path leading to the downloaded one. My calling conditions for wget were: wget -m http://europa.eu.int/ All parameters were at their defaults and unchanged, except the address of the proxy I had to use. Compilation was with standard features under a Linux 2.4.13 kernel. Questions: Did I make a configuration mistake? If not, can you correct the relinking? How can I make wget use less RAM? Do I have a chance to correct the wrong links? (Not by hand; there are thousands.) Kind regards, Dipl. Ing. Hermann Rugen, Rugen Consulting, Max-Planck-Straße 7, 49767 Twist, Tel.: 05931 4099 151, Fax: 05931 4099 152, eMail: [EMAIL PROTECTED], Internet: www.rugen-consulting.com
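Regarding the last question, the thousands of wrong links could be fixed in a batch run. The sketch below (Python, purely an illustration; the host name and the `myroot` prefix are taken from the example above and would need adjusting, and the function names are invented for this example) rewrites absolute links to the mirrored host into the local prefix across a downloaded tree:

```python
# Hypothetical batch fix for the mis-rewritten links described above.
# HOST and LOCAL_PREFIX are assumptions taken from the report; adjust them.
import os
import re

HOST = "europa.eu.int"
LOCAL_PREFIX = "http://myroot/europa.eu.int"

def fix_links(html):
    # Replace absolute links to the mirrored host with the local prefix.
    # re.sub scans the input once, so the replacement text (which still
    # contains the host name) is not itself rewritten.
    return re.sub(r"http://" + re.escape(HOST), LOCAL_PREFIX, html)

def fix_tree(root):
    # Walk the mirror directory and rewrite every HTML file in place.
    for dirpath, _, files in os.walk(root):
        for name in files:
            if name.endswith((".html", ".htm")):
                path = os.path.join(dirpath, name)
                with open(path, encoding="latin-1") as f:
                    text = f.read()
                with open(path, "w", encoding="latin-1") as f:
                    f.write(fix_links(text))
```

This does not address the memory growth, but it would repair an already-downloaded mirror without editing thousands of files by hand.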
Re: wget parsing JavaScript
Csaba Ráduly wrote: I see that wget handles SCRIPT with tag_find_urls, i.e. it tries to parse whatever is inside. Why was this implemented? JavaScript is mostly used to construct links programmatically; wget is likely to find bogus URLs until it can properly parse JavaScript. wget is parsing the attributes within the script tag, i.e., <script src="url">. It does not examine the content between <script> and </script>. It looks for src="url" because the source file is just another file that may need to be copied (along with all the other files that are needed to mirror a site). Tony
Re: wget parsing JavaScript
On 26 Mar 2002 at 7:05, Tony Lewis wrote: Csaba Ráduly wrote: I see that wget handles SCRIPT with tag_find_urls, i.e. it tries to parse whatever is inside. Why was this implemented? JavaScript is mostly used to construct links programmatically; wget is likely to find bogus URLs until it can properly parse JavaScript. wget is parsing the attributes within the script tag, i.e., <script src="url">. It does not examine the content between <script> and </script>. I think it does, actually, but that is mostly harmless. I haven't heard of any cases where it has caused a problem (assuming the script is well-formed). It's normal good practice to hide the code in an HTML comment anyway, but perhaps that good practice is less common these days, now that virtually every browser out there groks <SCRIPT></SCRIPT> and <NOSCRIPT></NOSCRIPT>. Wget's HTML parser doesn't yet have the hooks to allow different elements (such as SCRIPT and STYLE) to be processed differently from normal HTML. If it gets these hooks, it could then go off and process the SCRIPT element differently. (The minimal processing for the SCRIPT element, if it uses an unsupported script language, would be to skip it.) If a future version of Wget were to handle JavaScript as an option (perhaps using the GPL'd SpiderMonkey), it would have to parse the default action of the script and also possibly exercise the various event handlers to gather more URLs. I guess this would fail on the more complicated scripts that expect some sort of intelligent being (or a suitably programmed robot) to fill in forms and/or press buttons in the correct sequence to progress to the next page!
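The hooks described above can be sketched as follows (a Python illustration only; Wget itself is written in C, and the class and method names here are invented for the example): harvest the src= attribute of a SCRIPT start tag as just another file to mirror, but skip the element's content rather than scanning it for URLs.

```python
# Minimal sketch of "process SCRIPT differently": take its src= attribute
# as a link, ignore its body. Not Wget's actual implementation.
from html.parser import HTMLParser

class LinkHarvester(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script":
            # An external script is another file needed to mirror the site.
            if "src" in attrs:
                self.urls.append(attrs["src"])
        elif tag == "a" and "href" in attrs:
            self.urls.append(attrs["href"])

    def handle_data(self, data):
        # Script bodies arrive here as raw character data (html.parser
        # switches to CDATA mode inside <script>), so any markup that a
        # document.write() would emit is never mistaken for real tags.
        pass

p = LinkHarvester()
p.feed('<script src="menu.js"></script>'
       '<script>document.write("<a href=bogus.html>x</a>")</script>'
       '<a href="real.html">real</a>')
print(p.urls)  # ['menu.js', 'real.html'] -- bogus.html is skipped
```

Note how the dynamically constructed link inside the second script never reaches the harvester, which is exactly the "skip unsupported script languages" behaviour described above.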
Re: spanning hosts: 2 Problems
On 26 Mar 2002 at 19:01, Jens Rösner wrote: I am using wget to parse a local HTML file which has numerous links into the WWW. Now, I only want hosts that include certain strings, like -H -Daudi,vw,online.de It's probably worth noting that the comparisons between the -D strings and the domains being followed (or not) are anchored at the ends of the strings, i.e. -Dfoo matches bar.foo but not foo.bar. Two things I don't like in the way wget 1.8.1 works on Windows: The first page of even the rejected hosts gets saved. That sounds like a bug. This messes up my directory structure, as I force directories (which is my default and normally useful). I am aware that wget has switched to breadth-first (as opposed to depth-first) retrieval. Now, with downloading from many (20+) different servers, this is a bit frustrating, as I will probably have the first completely downloaded site in a few days... Would that be less of a problem if the first problem (first page from rejected domains) was fixed? Is there any other way to work around this besides installing wget 1.6 (or even 1.5)? No, but note that if you pass several starting URLs to Wget, it will complete the first before moving on to the second. That also works for the URLs in the file specified by the --input-file parameter. However, if all the sites are interlinked, you would be no better off with this. The other alternative is to run wget several times in sequence with different starting URLs and restrictions, perhaps using the --timestamping or --no-clobber options to avoid downloading things more than once.
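The suffix-anchored -D matching described above can be sketched like this (a Python illustration, not Wget's actual code; the requirement that the match fall on a domain-label boundary is my assumption, and actual behaviour may vary between Wget versions):

```python
# Sketch of -D/--domains acceptance: a -D string matches a host only at
# the end of the host name, either as the whole host or as a trailing
# labels suffix. So -Dfoo matches bar.foo but not foo.bar.
def domain_accepted(host, accept_domains):
    for d in accept_domains:
        if host == d or host.endswith("." + d):
            return True
    return False

print(domain_accepted("bar.foo", ["foo"]))                    # True
print(domain_accepted("foo.bar", ["foo"]))                    # False
print(domain_accepted("www.online.de", ["audi", "vw", "online.de"]))  # True
```

The boundary check (`"." + d`) keeps an entry like online.de from accidentally matching a host such as xonline.de; whether Wget 1.8.1 itself enforced that boundary is not something this sketch claims.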
Re: wget parsing JavaScript
I wrote: wget is parsing the attributes within the script tag, i.e., <script src="url">. It does not examine the content between <script> and </script>. and Ian Abbott responded: I think it does, actually, but that is mostly harmless. You're right. What I meant was that it does not examine the JavaScript looking for URLs. Tony