Hi, I think that wget should add a charset declaration to the HTML page if one doesn't already exist.
The charset of a web page can be declared in two places:

- In the HTTP header (example: "Content-Type: text/html; charset=ISO-8859-1")
- In the HTML header (example: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">)

For browsing, it's enough to have the charset only in the HTTP header, because the browser sees it. But after a download with wget, the charset is no longer recorded anywhere if it wasn't also in the HTML header.

Example:

    $ wget -SEk http://www.la-croix.com/
    --00:08:33--  http://www.la-croix.com/
               => `index.html.2'
    Resolving www.la-croix.com... 160.92.103.70
    Connecting to www.la-croix.com|160.92.103.70|:80... connected.
    HTTP request sent, awaiting response...
      HTTP/1.1 200 OK
      Date: Thu, 31 Aug 2006 22:06:18 GMT
      Server: Apache
      Set-Cookie: JSESSIONID=41649A198F5523A8E970C25FDFB02A9E.C5067890C9167DD999; Path=/
      Last-Modified: Thu, 31 Aug 2006 22:02:49 GMT
      Connection: close
      Content-Type: text/html; charset=ISO-8859-15
    Length: unspecified [text/html]

        [ <=> ] 51,974       280.97K/s

    00:08:34 (280.84 KB/s) - `index.html.2.4.html' saved [51974]

    Converting index.html.2.4.html... 3-246
    Converted 1 files in 0.006 seconds.

The charset of this page is ISO-8859-15, but this information is now lost because the saved file doesn't contain any declaration of it. If I parse this file later, the parser won't know the charset.

If I now submit the file to the HTML validator at http://validator.w3.org, it prints:

    Result: Failed validation
    File: index.html.2.4.html
    Encoding: utf-8
    Doctype:

    Sorry, I am unable to validate this document because on line 19,
    182-183, 211, 215, 220, 225, 232, 236, 246, 286, 328, 403, 448, 455,
    483, 519, 539, 547, 606, 643, 657-658, 660, 675, 679, 690, 701, 711,
    720, 724, 732-733, 764 it contained one or more bytes that I cannot
    interpret as utf-8 (in other words, the bytes found are not valid
    values in the specified Character Encoding). Please check both the
    content of the file and the character encoding indication.
I think that if the HTML header doesn't declare a charset, wget should add the one it received in the HTTP header.
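
To make the idea concrete, here is a rough sketch in Python (not wget's C, and not an actual patch). The function names `charset_from_content_type` and `ensure_meta_charset` are made up for illustration. It assumes the page uses an ASCII-compatible encoding such as ISO-8859-15 or UTF-8, so the file can be edited as raw bytes without decoding it, and it only handles the simple case where a `<head>` tag is present.

```python
import re
from email.message import Message

# Any <meta ...> tag that already mentions a charset (either syntax).
META_CHARSET_RE = re.compile(rb"<meta[^>]+charset\s*=", re.IGNORECASE)
# The opening <head> tag, with or without attributes (but not <header>).
HEAD_OPEN_RE = re.compile(rb"<head(\s[^>]*)?>", re.IGNORECASE)


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None.

    The standard library lowercases the result.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def ensure_meta_charset(html, charset):
    """Insert a <meta> charset declaration if the page has none.

    `html` is the raw bytes of the saved page; `charset` comes from the
    HTTP header. The page is returned unchanged if it already declares
    a charset or if no <head> tag is found.
    """
    if META_CHARSET_RE.search(html):
        return html
    match = HEAD_OPEN_RE.search(html)
    if match is None:
        return html
    meta = (
        b'\n<meta http-equiv="Content-Type" content="text/html; charset='
        + charset.encode("ascii")
        + b'">'
    )
    # Place the declaration right after the opening <head> tag.
    return html[: match.end()] + meta + html[match.end():]
```

For the example above, the charset would be taken from "Content-Type: text/html; charset=ISO-8859-15" and written into index.html.2.4.html only because that file has no declaration of its own; a page that already declares a charset would be left alone.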