Hi,

I think that wget should add a charset declaration to the HTML
page if one doesn't already exist.

The charset of a web page can be declared in two ways:
- In the HTTP header (example: "Content-Type: text/html; charset=ISO-8859-1")
- In the HTML header (example: <meta http-equiv="Content-Type"
content="text/html; charset=UTF-8">)
For browsing, it is enough to have the charset only in the HTTP header,
because the browser is informed. But after a download with wget, the
charset is no longer available if it wasn't also in the HTML header.
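
To make the two sources concrete, here is a minimal Python sketch of how a
tool could read the charset from each of them. The function names and the
regular expression are my own illustration (not anything wget does), and
the regex is a rough approximation rather than a real HTML parser.

```python
import re
from email.message import Message

def charset_from_http_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None if absent.
    return msg.get_content_charset()

# Hypothetical, simplified pattern: a <meta> tag containing "charset=...".
META_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)

def charset_from_html(html_bytes):
    """Return the charset declared in a <meta> tag, or None."""
    m = META_RE.search(html_bytes)
    return m.group(1).decode("ascii").lower() if m else None
```

Once the page is saved to disk, only the second function has anything to
work with, which is the problem described above.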

Example:
$ wget -SEk http://www.la-croix.com/
--00:08:33--  http://www.la-croix.com/
          => `index.html.2'
Resolving www.la-croix.com... 160.92.103.70
Connecting to www.la-croix.com|160.92.103.70|:80... connected.
HTTP request sent, awaiting response...
 HTTP/1.1 200 OK
 Date: Thu, 31 Aug 2006 22:06:18 GMT
 Server: Apache
 Set-Cookie: JSESSIONID=41649A198F5523A8E970C25FDFB02A9E.C5067890C9167DD999;
Path=/
 Last-Modified: Thu, 31 Aug 2006 22:02:49 GMT
 Connection: close
 Content-Type: text/html; charset=ISO-8859-15
Length: unspecified [text/html]

   [ <=>
        ] 51,974       280.97K/s

00:08:34 (280.84 KB/s) - `index.html.2.4.html' saved [51974]

Converting index.html.2.4.html... 3-246
Converted 1 files in 0.006 seconds.


The charset of this page is ISO-8859-15, but this information is now
lost because the file doesn't contain any information about it. If I
parse this file afterwards, the parser won't know the charset.
If I now submit the file to the HTML validator at http://validator.w3.org,
it prints:
Result:          Failed validation
File:   index.html.2.4.html
Encoding:       utf-8
Doctype:        
Sorry, I am unable to validate this document because on line 19,
182-183, 211, 215, 220, 225, 232, 236, 246, 286, 328, 403, 448, 455,
483, 519, 539, 547, 606, 643, 657-658, 660, 675, 679, 690, 701, 711,
720, 724, 732-733, 764 it contained one or more bytes that I cannot
interpret as utf-8 (in other words, the bytes found are not valid
values in the specified Character Encoding). Please check both the
content of the file and the character encoding indication.
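
The same failure can be reproduced in a few lines of Python. This is only
a sketch with a made-up sample string (not bytes taken from the page
above): text encoded as ISO-8859-15 generally isn't valid UTF-8 as soon as
it contains an accented character.

```python
def decodes_as(data, encoding):
    """Return True if the bytes are valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical sample: "cafe au lait" with e-acute, encoded as ISO-8859-15.
raw = "caf\u00e9 au lait".encode("iso-8859-15")

print(decodes_as(raw, "utf-8"))        # the 0xE9 byte is not valid UTF-8 here
print(decodes_as(raw, "iso-8859-15"))  # decodes fine with the right charset
```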


I think that if the HTML header doesn't declare a charset, wget should
add the declaration, using the charset from the HTTP header.
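
As a rough idea of what I mean, here is a Python sketch of such a
post-processing step. It is only an illustration under my own assumptions:
the function name add_charset_meta is hypothetical, the regexes are
simplistic, and this is not how wget is implemented.

```python
import re

# Simplified patterns (assumptions, not a full HTML parser).
META_CHARSET_RE = re.compile(rb"<meta[^>]+charset\s*=", re.IGNORECASE)
HEAD_RE = re.compile(rb"<head(\s[^>]*)?>", re.IGNORECASE)

def add_charset_meta(html_bytes, charset):
    """Insert a charset <meta> right after <head> if none is declared."""
    if META_CHARSET_RE.search(html_bytes):
        return html_bytes  # the page already declares a charset
    head = HEAD_RE.search(html_bytes)
    if head is None:
        return html_bytes  # no <head> tag; left unchanged in this sketch
    meta = (b'<meta http-equiv="Content-Type" content="text/html; charset='
            + charset.encode("ascii") + b'">')
    return html_bytes[:head.end()] + b"\n" + meta + html_bytes[head.end():]
```

For the example above it would be called with the charset taken from the
HTTP response, e.g. add_charset_meta(page_bytes, "ISO-8859-15"). The bytes
of the page itself are not re-encoded, only the declaration is added.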
