any news on this?

hrvoje, you say that wget presumes utf-8, but then wget should have decoded %C3%AD to a single accented i (í). today wget simply decodes the escapes one byte at a time, creating a mess.
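to make the difference concrete, here is a minimal python sketch (not wget's code). it assumes the windows side interprets file name bytes as cp1252, which is my guess for a western european setup; the variable names are only for illustration.

```python
from urllib.parse import unquote_to_bytes

escaped = "V%C3%ADctor_Jara"

# Undo the percent-escapes only; the result is raw bytes, not text yet.
raw = unquote_to_bytes(escaped)        # b'V\xc3\xadctor_Jara'

# Reading the bytes as UTF-8 joins C3 AD into one character.
as_utf8 = raw.decode("utf-8")

# Reading them byte by byte (cp1252 is an assumption here) yields two
# separate characters: A-tilde followed by an invisible soft hyphen.
as_cp1252 = raw.decode("cp1252")

print(as_utf8)
print(repr(as_cp1252))
```

the second result is the kind of garble i see on disk today.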

how can wget assume anything about encoding, by the way? the filename could be encoded as anything, right? for instance, the filename "bl%E5%F8yd.zip" read as iso-8859-1 gives "blåøyd.zip" (blue-eyed in norwegian), whereas read as utf-8 the same bytes do not give those characters at all (E5 F8 is not even a valid utf-8 sequence).
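a small python sketch of that ambiguity, again only an illustration and not anything wget does: the same escaped name is decoded under both encodings, and the strict utf-8 attempt is expected to be rejected rather than to produce a different letter.

```python
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes("bl%E5%F8yd.zip")

# ISO-8859-1 maps every byte to exactly one character, so this never fails.
latin1_name = raw.decode("iso-8859-1")

# Strict UTF-8 decoding: E5 announces a three-byte sequence, but F8 is not
# a valid continuation byte, so the decoder raises instead of guessing.
try:
    utf8_name = raw.decode("utf-8")
except UnicodeDecodeError:
    utf8_name = None

print(latin1_name)
print(utf8_name)
```

so the bytes alone do not tell you which encoding the author had in mind; at best you can rule some encodings out.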

wouldn't the correct thing be NOT to decode escaped characters (at least those over 127), since they could mean anything depending on the encoding the page author intended?
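roughly what i have in mind, as a python sketch. the function name unescape_ascii_only is made up for this mail, and a real implementation would also have to keep characters that are unsafe in file names (such as "/") escaped, which this sketch does not attempt.

```python
import re

# Matches one percent-escape such as %20 or %C3.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")

def unescape_ascii_only(name: str) -> str:
    """Decode %XX only when XX is below 0x80; leave other escapes as they are."""
    def repl(match: re.Match) -> str:
        value = int(match.group(1), 16)
        # ASCII has the same meaning in every encoding discussed here,
        # so only those escapes are safe to decode without guessing.
        return chr(value) if value < 0x80 else match.group(0)
    return _ESCAPE.sub(repl, name)

print(unescape_ascii_only("V%C3%ADctor%20Jara"))
```

here %20 becomes a space while %C3%AD stays escaped, so nothing is guessed about the encoding.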

anyway, is there maybe a separate mail account for bugs that would be
more appropriate to use than this list?

Olav Mørkrid wrote:
wget saves the accented "i" in the filename as the 8-bit utf-8 bytes C3 and AD (unescaped), which results in garbage since the windows file system is not utf-8 based.

so either some form of character conversion needs to take place (from utf-8 to the filesystem encoding), or wget should save the filename percent-escaped.

VÃ-­ctor_Jara (today)
V%C3%ADctor_Jara (escaped)
Víctor_Jara (converted)
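the two alternatives could also be combined. a python sketch under stated assumptions: local_name is a hypothetical helper, cp1252 stands in for whatever the local filesystem encoding really is, and the url is presumed to be utf-8 unless decoding proves otherwise.

```python
from urllib.parse import unquote_to_bytes

def local_name(escaped: str, fs_encoding: str = "cp1252") -> str:
    """Return the converted name if possible, else the escaped name unchanged."""
    raw = unquote_to_bytes(escaped)
    try:
        text = raw.decode("utf-8")   # assumption: the URL bytes are UTF-8
        text.encode(fs_encoding)     # check the filesystem can represent it
    except (UnicodeDecodeError, UnicodeEncodeError):
        # Not valid UTF-8, or not representable locally: keep the escapes.
        return escaped
    return text

print(local_name("V%C3%ADctor_Jara"))
print(local_name("bl%E5%F8yd.zip"))
```

the first name comes out converted, the second is not valid utf-8 and so stays percent-escaped instead of turning into garble.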

Hrvoje Niksic wrote:

Olav Mørkrid <[EMAIL PROTECTED]> writes:


problem: international characters cause problems

  the image of victor jara in article is lost
  int. chars. in filename saved on local disk is garble



Wget saves exactly the characters it finds in the URL.  If the URL
contains the sequence (presumably UTF-8) %C3%AD, that is what Wget
will write to the file name.

What characters did you expect to find in local file names?





