Feature suggestion: change detection for wget -c

2006-08-31 Thread John McCabe-Dansted

> Wget has no way of verifying that the local file is
> really a valid prefix of the remote file

Couldn't wget redownload the last 4 bytes (or so) of the file?

For a few bytes per file we could detect changes to almost all
compressed files and the majority of uncompressed files.
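
A rough sketch of the idea in Python, not a patch against Wget. The
function names are made up for illustration, the 4-byte default comes
from the suggestion above, and the check assumes the server honours HTTP
Range requests. A matching tail does not prove the local file is a valid
prefix; it only catches most changes cheaply.

```python
import os
import urllib.request


def tail_range(local_size, n=4):
    """Return inclusive (start, end) offsets of the last n local bytes."""
    start = max(local_size - n, 0)
    return start, local_size - 1


def http_fetch_range(url, start, end):
    """Fetch bytes start..end (inclusive) using an HTTP Range request."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        if resp.status != 206:  # server ignored the Range header
            return None
        return resp.read()


def local_tail_matches(path, fetch, n=4):
    """True if the last n local bytes equal the same remote byte range.

    fetch(start, end) returns the remote bytes, or None if unavailable.
    """
    size = os.path.getsize(path)
    if size == 0:
        return True  # an empty file is a prefix of anything
    start, end = tail_range(size, n)
    with open(path, "rb") as f:
        f.seek(start)
        local = f.read(end - start + 1)
    remote = fetch(start, end)
    return remote is not None and remote == local
```

With a real server, the fetch argument could be something like
lambda s, e: http_fetch_range(url, s, e), and the download would only
be continued when local_tail_matches() returns True.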

--
John C. McCabe-Dansted
PhD Student
University of Western Australia


wget silently overwrites a file when using -c and the server does not support resuming

2006-08-31 Thread Ori Avtalion
Using 1.10.2

To reproduce:
1) Download a video from Google Video:
$ wget -O Test.resume_me.avi
http://vp05.video.l.google.com/videodownload?version=0secureurl=twAAAKKXmJe_gUGC30JVHiQCrmBhoU7JEoYkn1zkPRI9Vm4nYjXB_Lconoy-Fwa2rg40mCn-w3frP3K4KTW7vxmD2bubcJainv-i4vxBqUS_k2VtLtsJI04UFSYcVQVESuIqHZfGuToqj3r3HkfzbKYgoRSzAEI6xUl3-jQKsKAgpQzwoaRbExjhOU2kup9A0VxOlC_KdqG2QWMejRjLZZEfCDb4ETaWEBT0qIGq3W_GS6sKcx6dKXYGMuiGbd4Wf9v3Mgsigh=ongRDut1aAA_QP6pwGRnwIWO2k0begin=0len=1221999docid=9076288729387457440rdc=1;
2) Cancel the download after a few seconds.
3) Re-download, using the -c flag.

Result:
The old file will be silently overwritten.

Wget should refuse to download the file.

The docs specifically state:
Beginning with Wget 1.7, if you use -c on a non-empty file, and it turns
out that the server does not support continued downloading, Wget will
refuse to start the download from scratch, which would effectively ruin
existing contents.  If you really want the download to start from
scratch, remove the file.
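
To make the documented behaviour concrete, here is a small Python sketch
of the decision a resuming client has to make. It is an illustration
only, not Wget's actual code: resume_action and its return values are
invented names, and it assumes the client sent "Range: bytes=N-" where N
is the local file size.

```python
def resume_action(local_size, status, content_range=None):
    """Decide what to do with the response to a 'Range: bytes=N-' request.

    local_size    -- size of the existing local file (N)
    status        -- HTTP status code of the response
    content_range -- value of the Content-Range header, if any
    """
    if local_size == 0:
        return "download"  # nothing local that could be ruined
    if (status == 206 and content_range
            and content_range.startswith(f"bytes {local_size}-")):
        return "append"    # server resumed exactly where the file ends
    # Any other answer (e.g. a plain 200 with the whole body) means the
    # server did not resume; starting over would destroy the partial file.
    return "refuse"
```

Under this sketch, a 200 response for a non-empty local file yields
"refuse", which is the outcome the documentation describes.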


Re: wget silently overwrites a file when using -c and the server does not support resuming

2006-08-31 Thread Gerard Seibert
Ori Avtalion wrote:

> Using 1.10.2
>
> To reproduce:
> 1) Download a video from Google Video:
> $ wget -O Test.resume_me.avi
> http://vp05.video.l.google.com/videodownload?version=0secureurl=twAAAKKXmJe_gUGC30JVHiQCrmBhoU7JEoYkn1zkPRI9Vm4nYjXB_Lconoy-Fwa2rg40mCn-w3frP3K4KTW7vxmD2bubcJainv-i4vxBqUS_k2VtLtsJI04UFSYcVQVESuIqHZfGuToqj3r3HkfzbKYgoRSzAEI6xUl3-jQKsKAgpQzwoaRbExjhOU2kup9A0VxOlC_KdqG2QWMejRjLZZEfCDb4ETaWEBT0qIGq3W_GS6sKcx6dKXYGMuiGbd4Wf9v3Mgsigh=ongRDut1aAA_QP6pwGRnwIWO2k0begin=0len=1221999docid=9076288729387457440rdc=1;
> 2) Cancel the download after a few seconds.
> 3) Re-download, using the -c flag.
>
> Result:
> The old file will be silently overwritten.
>
> Wget should refuse downloading the file.
>
> The docs specifically state:
> Beginning with Wget 1.7, if you use -c on a non-empty file, and it turns
> out that the server does not support continued downloading, Wget will
> refuse to start the download from scratch, which would effectively ruin
> existing contents.  If you really want the download to start from
> scratch, remove the file.

Did you actually confirm that a partially downloaded file existed? I
have canceled downloads and no trace of the partially downloaded file
was to be found.


-- 
Gerard Seibert
[EMAIL PROTECTED]



Feature request : save the charset of the pages

2006-08-31 Thread Pierre renié

Hi,

I think that wget should add a charset declaration to the HTML
page if the page doesn't already have one.

The charset of a web page can be declared in two ways:
- In the HTTP header (example: Content-Type: text/html; charset=ISO-8859-1)
- In the HTML header (example: <meta http-equiv="Content-Type"
  content="text/html; charset=UTF-8">)
For browsing, it's enough to have the charset only in the HTTP header:
the browser is informed. But after a download with wget, the charset
is lost if it wasn't also declared in the HTML header.
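
The two places a charset can live can be read out with a few lines of
Python. This is only a sketch: charset_from_http and charset_from_html
are names chosen for illustration, and the regular expression is a crude
stand-in for a real HTML parser.

```python
import re
from email.message import Message


def charset_from_http(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset")


def charset_from_html(html_bytes):
    """Return the charset declared in a <meta> tag, or None.

    Only the first 4096 bytes are searched, on the assumption that the
    declaration sits near the top of the document.
    """
    match = re.search(
        rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)",
        html_bytes[:4096],
        re.IGNORECASE,
    )
    return match.group(1).decode("ascii") if match else None
```

A page is affected by the problem described here when
charset_from_http() finds a value but charset_from_html() returns None.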

Example :
$ wget -SEk http://www.la-croix.com/
--00:08:33--  http://www.la-croix.com/
           => `index.html.2'
Resolving www.la-croix.com... 160.92.103.70
Connecting to www.la-croix.com|160.92.103.70|:80... connected.
HTTP request sent, awaiting response...
 HTTP/1.1 200 OK
 Date: Thu, 31 Aug 2006 22:06:18 GMT
 Server: Apache
 Set-Cookie: JSESSIONID=41649A198F5523A8E970C25FDFB02A9E.C5067890C9167DD999; Path=/
 Last-Modified: Thu, 31 Aug 2006 22:02:49 GMT
 Connection: close
 Content-Type: text/html; charset=ISO-8859-15
Length: unspecified [text/html]

    [ <=>                                 ] 51,974       280.97K/s

00:08:34 (280.84 KB/s) - `index.html.2.4.html' saved [51974]

Converting index.html.2.4.html... 3-246
Converted 1 files in 0.006 seconds.


The charset of this page is ISO-8859-15, but this information is now
lost because the file doesn't contain any information about it. If I
parse this file afterwards, the parser won't know the charset.
If I now submit the file to the HTML validator at http://validator.w3.org,
it prints:
Result:  Failed validation
File:   index.html.2.4.html
Encoding:   utf-8
Doctype:
Sorry, I am unable to validate this document because on line 19,
182-183, 211, 215, 220, 225, 232, 236, 246, 286, 328, 403, 448, 455,
483, 519, 539, 547, 606, 643, 657-658, 660, 675, 679, 690, 701, 711,
720, 724, 732-733, 764 it contained one or more bytes that I cannot
interpret as utf-8 (in other words, the bytes found are not valid
values in the specified Character Encoding). Please check both the
content of the file and the character encoding indication.


I think that if the HTML header doesn't declare a charset, wget should add one.
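
What is being requested could look roughly like the following Python
sketch. It is not how Wget works today; add_charset_meta is a
hypothetical helper, the regular expressions are a simplification of
real HTML parsing, and documents without a <head> tag are left
untouched.

```python
import re


def add_charset_meta(html_bytes, charset):
    """Insert a charset <meta> tag after <head> if none is declared."""
    # Leave the page alone if it already declares a charset.
    if re.search(rb"<meta[^>]+charset\s*=", html_bytes, re.IGNORECASE):
        return html_bytes
    # Match <head> or <head attr=...>, but not <header>.
    head = re.search(rb"<head(\s[^>]*)?>", html_bytes, re.IGNORECASE)
    if head is None:
        return html_bytes
    meta = (b'<meta http-equiv="Content-Type" content="text/html; charset='
            + charset.encode("ascii") + b'">')
    return html_bytes[:head.end()] + b"\n" + meta + html_bytes[head.end():]
```

For the example above, the charset argument would be the ISO-8859-15
value taken from the Content-Type response header.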


Re: wget silently overwrites a file when using -c and the server does not support resuming

2006-08-31 Thread Steven M. Schweda
From: Ori Avtalion

> wget -O Test.resume_me.avi [...]
> [...]
> Result:
> The old file will be silently overwritten.
> [...]

   You're working too hard.  Using -O will overwrite the output file
no matter what happens, whether the download works or not.  That's what
-O does.  If you don't like it, don't use -O.

   If you look through the archive, you can find many other cases where
-O caused various effects which various users did not like.  It's a
characteristic of -O.

   If you can see the same problem when you don't specify -O, feel
free to re-complain.
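
   As an analogy only (this is Python, not Wget's source): opening a
file for writing truncates it at once, before any data arrives, while
opening it for appending leaves the old contents alone.  The helper name
below is invented for the illustration.

```python
import os


def size_after_open(path, mode):
    """Open path in the given mode, write nothing, and return its size."""
    with open(path, mode):
        pass  # simulate a download that fails before any data arrives
    return os.path.getsize(path)
```

   With a 5-byte partial file, mode "ab" leaves 5 bytes in place and
mode "wb" leaves 0, which is the kind of effect described above for -O.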



   Steven M. Schweda   [EMAIL PROTECTED]
   382 South Warwick Street    (+1) 651-699-9818
   Saint Paul  MN  55105-2547