Aw: Re: Re: Re: Do you know a tool to decode "UTF-8 twice"

Jörg Knappen Wed, 30 Oct 2013 08:24:38 -0700

Thanks again!

My updated sed pattern generator now looks like:

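# Python 2 script: writes sed rules that undo double UTF-8 encoding.
r = range(0xa0, 0x170)
file = open("fixu8.sed", "w")
for i in r:
    # UTF-8 bytes misread as Latin-1, then encoded as UTF-8 again
    pat1 = "s/"+unichr(i).encode("utf-8").decode("latin-1").encode("utf-8") + "/" + unichr(i).encode("utf-8") +"/g"
    print >>file, pat1
    try:
        # same bytes, but misread as Windows-1252
        pat2 = "s/"+unichr(i).encode("utf-8").decode("windows-1252").encode("utf-8") + "/" + unichr(i).encode("utf-8") +"/g"
    except UnicodeDecodeError:
        # byte has no Windows-1252 mapping: only the Latin-1 rule applies
        pat2 = pat1
    if (pat1 != pat2):
        print >>file, pat2
file.close()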

It handles double UTF-8 mangled through both Latin-1 and Windows-1252.
This is probably enough for now: the error rate is low enough for
practical purposes (i.e., lower than the natural error rate introduced
by typing errors).
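
For readers on Python 3, the same idea can be sketched roughly as follows. This is not the script above, only an approximation of it: the function names `sed_rules` and `write_rules` are made up for illustration, and the code assumes that Python's `cp1252` codec rejects the few bytes Windows-1252 leaves undefined.

```python
def sed_rules(start=0xA0, stop=0x170):
    """Build sed rules mapping doubly encoded text back to the intended character."""
    rules = []
    for cp in range(start, stop):
        good = chr(cp)
        raw = good.encode("utf-8")
        # Latin-1 assigns a character to every byte, so this cannot fail.
        via_latin1 = raw.decode("latin-1")
        rules.append("s/{}/{}/g".format(via_latin1, good))
        try:
            via_cp1252 = raw.decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves a few bytes (e.g. 0x81) undefined.
            continue
        if via_cp1252 != via_latin1:
            rules.append("s/{}/{}/g".format(via_cp1252, good))
    return rules


def write_rules(path="fixu8.sed"):
    # Write as UTF-8 so sed sees the same bytes as the mangled input.
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(sed_rules()) + "\n")
```

Calling `write_rules()` would produce a file usable as `sed -f fixu8.sed`. A second rule is emitted only where the Windows-1252 reading differs from the Latin-1 one.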

--Jörg Knappen

Sent: Wednesday, 30 October 2013 at 15:34
From: "Frédéric Grosshans" <[email protected]>
To: [email protected]
Subject: Re: Aw: Re: Re: Do you know a tool to decode "UTF-8 twice"

On 29/10/2013 17:15, "Jörg Knappen" wrote:
> After running this script, a few more things were there:
> Non-normalised accents and some really strange
> encodings I could not really explain but rather guess their meanings, like
> s/Ãœ/Ü/g
> s/Ã‰/É/g
> s/AÌ€/À/g
> s/aÌ€/à/g
> s/EÌ€/È/g
> s/eÌ€/è/g
> s/â€ž/„/g
> s/â€œ/“/g
> s/ÃŸ/ß/g
> s/â€™/’/g
> s/Ã„/Æ/g

It was probably not UTF-8 read as Latin-1 and re-encoded as UTF-8, but
UTF-8 read as Windows-1252 (
http://en.wikipedia.org/wiki/Windows-1252 ) and re-encoded as UTF-8. Each
of the combinations above contains a character absent from Latin-1
(œ‰€žŸ™„), and some of them are present only in Windows-1252 (‰™„) and
not in Latin-9 (ISO 8859-15), the other possible mistake.
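
The character-set argument can be checked with a small Python 3 sketch. The helper `encodable` is a name invented here, and the codec names `latin-1`, `iso8859-15` and `cp1252` are Python's spellings for Latin-1, Latin-9 and Windows-1252.

```python
def encodable(ch, codec):
    """True if `ch` has a byte assigned in the given single-byte codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# The seven telltale characters: œ ‰ € ž Ÿ ™ „
suspects = "\u0153\u2030\u20ac\u017e\u0178\u2122\u201e"

# None of them can come from a Latin-1 misreading.
in_latin1 = [c for c in suspects if encodable(c, "latin-1")]

# These exist in Windows-1252 but not in ISO 8859-15.
only_cp1252 = [
    c for c in suspects
    if encodable(c, "cp1252") and not encodable(c, "iso8859-15")
]
```

If the reasoning holds, `in_latin1` is empty and `only_cp1252` contains the per-mille sign, the trademark sign and the low double quotation mark.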

I've checked that this is consistent with Ü, É and ß, but not with your Æ.
This double encoding would give Ä:
Ã„ = Win1252(C3 84) = 110.00011 10.000100 = UTF8(00011 000100) = Unicode 00C4
= Ä (and not Æ)
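
The arithmetic above can be replayed in Python 3. This is only a sketch: the helper names `undo_double_utf8` and `decode_two_byte_utf8` are invented for illustration, and the reversal assumes the text was mangled exactly once through Windows-1252.

```python
import unicodedata


def undo_double_utf8(text, codec="cp1252"):
    """Reverse one round of 'UTF-8 read as `codec`, saved as UTF-8 again'."""
    return text.encode(codec).decode("utf-8")


def decode_two_byte_utf8(raw):
    """Extract the code point from a two-byte UTF-8 sequence (110xxxxx 10xxxxxx)."""
    lead, cont = raw
    assert lead >> 5 == 0b110 and cont >> 6 == 0b10
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


mangled = "\u00c3\u201e"             # the two characters Ã„
raw = mangled.encode("cp1252")       # bytes C3 84
code_point = decode_two_byte_utf8(raw)
fixed = undo_double_utf8(mangled)

# The quoted A + Ì€ pattern decodes to A plus a combining grave accent;
# NFC normalisation then composes the pair into a single À.
composed = unicodedata.normalize("NFC", undo_double_utf8("A\u00cc\u20ac"))
```

Here `code_point` comes out as 0xC4 and `fixed` as Ä, matching the hand calculation. The NFC step covers the "non-normalised accents" case from the quoted list.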

Frédéric
