wget parsing JavaScript

2002-03-26 Thread csaba . raduly

wget stumbled upon the following HTML file:

--- 8< ---
<html>
<head>
<title>foo</title>
</head>

<body>

<SCRIPT language="JavaScript1.2">
var sitems=new Array()
var sitemlinks=new Array()

///Edit below/

//extend or shorten this list
sitems[0]="15.html"
sitems[1]="16.html"
sitems[2]="17.html"
sitems[3]="18.html"
sitems[4]="19.html"
sitems[5]="20.html"
sitems[6]="21.html"
sitems[7]="22.html"
sitems[8]="23.html"
sitems[9]="24.html"
sitems[10]="25.html"
sitems[11]="26.html"
sitems[12]="27.html"


//These are the links pertaining to the above text.
sitemlinks[0]="31.html"
sitemlinks[1]="32.html"
sitemlinks[2]="33.html"
sitemlinks[3]="34.html"
sitemlinks[4]="35.html"
sitemlinks[5]="36.html"
sitemlinks[6]="37.html"
sitemlinks[7]="38.html"
sitemlinks[8]="39.html"
sitemlinks[9]="40.html"
sitemlinks[10]="41.html"
sitemlinks[11]="42.html"
sitemlinks[12]="43.html"

//If you want the links to load in another frame/window, specify name of
//target (ie: target="_new")
var target=""

for (i=0;i<=sitems.length-1;i++)
document.write('<a href="'+sitemlinks[i]+'" target="'+target+'">'+sitems[i]+'</a><br>')

</SCRIPT>
<NOSCRIPT>
Congratulations, you have turned off JavaScript.
</NOSCRIPT>
</body>

</html>
--- 8< ---

I see that wget handles SCRIPT with tag_find_urls, i.e. it tries to
parse whatever is inside it.
Why was this implemented? JavaScript is mostly
used to construct links programmatically, so wget is likely to find
bogus URLs until it can properly parse JavaScript.

--
Csaba Ráduly, Software Engineer   Sophos Anti-Virus
email: [EMAIL PROTECTED]    http://www.sophos.com
US Support: +1 888 SOPHOS 9 UK Support: +44 1235 559933





GNU wget 1.8.1 - Bug report memory occupied

2002-03-26 Thread Dipl. Ing. Hermann Rugen






Hello specialists,
I used wget 1.8.1 on my system to mirror the site www.europa.eu.int.
Transfer was through a proxy and DSL over night.
After about 12-13 hours I found the following situation:
In total, about 1.8GB of data had been downloaded.
The wget process had grown to occupy approximately 75MB of RAM!

This growth was fatal for the system, because there was only
32MB of RAM in the Intel 486 machine.
The download rate was dramatically reduced, but the system was still
running. I killed the process with Ctrl-C.
Everything seems to be OK.
After examining the data I found that the relinking was not correct
in all cases.
Files were placed in the right directories, but the rewritten links in
the pages were often wrong.
should be: http://myroot/europa.eu.int/individual_dir

but was:
http://europa.eu.int/individual_dir

The problem seems to be that wget misses the part of the directory
path that leads to the downloaded one.

My calling conditions for wget were:

wget -m http://europa.eu.int/

All parameters were left in their downloaded state and were unchanged,
except for the address of the proxy I had to use.
Compilation was with standard features under a Linux 2.4.13 kernel.

Questions:
Did I make a configuration mistake?
If not, can you correct the relinking?
How can I make wget use less RAM?
Is there a chance to correct the wrong links? (Not by hand,
there are thousands.)

Kind regards,

Dipl. Ing. Hermann Rugen

Rugen Consulting
Max-Planck-Straße 7
49767 Twist

Tel.: 05931 4099 151
Fax: 05931 4099 152

eMail: [EMAIL PROTECTED]
Internet: www.rugen-consulting.com











Re: wget parsing JavaScript

2002-03-26 Thread Tony Lewis

Csaba Ráduly wrote:

> I see that wget handles SCRIPT with tag_find_urls, i.e. it tries to
> parse whatever is inside it.
> Why was this implemented? JavaScript is mostly
> used to construct links programmatically, so wget is likely to find
> bogus URLs until it can properly parse JavaScript.

wget is parsing the attributes within the script tag, i.e., <script
src="url">. It does not examine the content between <script> and </script>.

It looks for src="url" because the source file is just another file that may
need to be copied (along with all the other files that are needed to mirror
a site).
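
For illustration, a minimal hypothetical page (menu.js is an invented
name): wget would queue menu.js for download because it appears in a
src attribute, while a URL assembled inside the inline code is never
discovered:

<html>
<body>
<!-- wget fetches menu.js: it is referenced via src= -->
<script src="menu.js"></script>
<!-- wget does not evaluate this, so 15.html is never found -->
<script>
var page = "15.html";
document.write('<a href="' + page + '">next</a>');
</script>
</body>
</html>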

Tony





Re: wget parsing JavaScript

2002-03-26 Thread Ian Abbott

On 26 Mar 2002 at 7:05, Tony Lewis wrote:

> Csaba Ráduly wrote:
>
> > I see that wget handles SCRIPT with tag_find_urls, i.e. it tries to
> > parse whatever is inside it.
> > Why was this implemented? JavaScript is mostly
> > used to construct links programmatically, so wget is likely to find
> > bogus URLs until it can properly parse JavaScript.
>
> wget is parsing the attributes within the script tag, i.e., <script
> src="url">. It does not examine the content between <script> and </script>.

I think it does, actually, but that is mostly harmless. I haven't
heard of any cases where it has caused a problem (assuming the
script is well-formed). It's normal good practice to hide the code
in an HTML comment anyway, but perhaps that good practice is less
common these days, now that virtually every browser out there
groks <SCRIPT></SCRIPT> and <NOSCRIPT></NOSCRIPT>.
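
The hiding idiom Ian mentions wraps the script body in an HTML
comment, so that old or script-unaware parsers see nothing to render
(a generic sketch of the convention, not taken from any particular
page):

<script language="JavaScript">
<!-- hide the script body from browsers that don't know SCRIPT
document.write("generated content");
// end hiding -->
</script>
<noscript>Fallback text for browsers without JavaScript.</noscript>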

Wget's HTML parser doesn't yet have the hooks to allow different
elements (such as SCRIPT and STYLE) to be processed differently from
normal HTML. If it gets these hooks, it could then go off and
process the SCRIPT element differently. (The minimal processing
for the SCRIPT element, if it uses an unsupported script
language, would be to skip it.)

If a future version of Wget were to handle JavaScript as an option
(perhaps using the GPL'd SpiderMonkey), it would have to parse the
default action of the script and also possibly exercise the various
event handlers to gather more URLs. I guess this would fail on
the more complicated scripts that expect some sort of intelligent
being (or a suitably programmed robot) to fill in forms and/or
press buttons in the correct sequence to progress to the next page!



Re: spanning hosts: 2 Problems

2002-03-26 Thread Ian Abbott

On 26 Mar 2002 at 19:01, Jens Rösner wrote:

> I am using wget to parse a local html file which has numerous links into
> the www.
> Now, I only want hosts that include certain strings, like
> -H -Daudi,vw,online.de

It's probably worth noting that the comparisons between the -D
strings and the domains being followed (or not) are anchored at
the ends of the strings, i.e. -Dfoo matches bar.foo but not
foo.bar.
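
A minimal sketch of that rule (Wget itself is written in C; the
function name and the dot-boundary check here are my own assumptions,
not Wget's actual code):

// End-anchored domain matching: "bar.foo" matches pattern "foo",
// but "foo.bar" does not.
function domainMatches(domain, pattern) {
  if (domain == pattern) return true;
  var tail = "." + pattern;            // assume a dot boundary
  return domain.length > tail.length &&
         domain.substring(domain.length - tail.length) == tail;
}

// domainMatches("bar.foo", "foo")  -> true
// domainMatches("foo.bar", "foo")  -> false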

> Two things I don't like in the way wget 1.8.1 works on windows:
>
> The first page of even the rejected hosts gets saved.

That sounds like a bug.

> This messes up my directory structure as I force directories
> (which is my default and normally useful).
>
> I am aware that wget has switched to breadth-first (as opposed to
> depth-first) retrieval.
> Now, with downloading from many (20+) different servers, this is a bit
> frustrating,
> as I will probably have the first completely downloaded site in a few
> days...

Would that be less of a problem if the first problem (the first page
from rejected domains being saved) were fixed?

> Is there any other way to work around this besides installing wget 1.6
> (or even 1.5)?

No, but note that if you pass several starting URLs to Wget, it
will complete the first before moving on to the second. That also
works for the URLs in the file specified by the --input-file
parameter. However, if all the sites are interlinked, you would be
no better off with this. The other alternative is to run wget
several times in sequence with different starting URLs and
restrictions, perhaps using the --timestamping or --no-clobber
options to avoid downloading things more than once.
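
For example (host names are placeholders), one run per site, each
restricted to its own domain:

wget -m -H -Dsite1.example http://www.site1.example/
wget -m -H -Dsite2.example http://www.site2.example/

Since --mirror (-m) already turns on --timestamping, repeated runs
skip files that have not changed.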




Re: wget parsing JavaScript

2002-03-26 Thread Tony Lewis

I wrote:

> > wget is parsing the attributes within the script tag, i.e., <script
> > src="url">. It does not examine the content between <script> and
> > </script>.

and Ian Abbott responded:

> I think it does, actually, but that is mostly harmless.

You're right. What I meant was that it does not examine the JavaScript
looking for URLs.

Tony