Thanks again!
My updated sed pattern generator now looks like:
# Python 2: write sed rules that undo double UTF-8 encoding
r = range(0xa0, 0x170)
out = open("fixu8.sed", "w")
for i in r:
    good = unichr(i).encode("utf-8")
    # UTF-8 bytes misread as Latin-1, then encoded as UTF-8 again
    pat1 = "s/" + good.decode("latin-1").encode("utf-8") + "/" + good + "/g"
    print >>out, pat1
    try:
        # same, but misread as Windows-1252
        pat2 = "s/" + good.decode("windows-1252").encode("utf-8") + "/" + good + "/g"
    except UnicodeDecodeError:
        # this byte is undefined in Windows-1252
        pat2 = pat1
    if pat1 != pat2:
        print >>out, pat2
out.close()
This handles double UTF-8 mangled through both Latin-1 and Windows-1252. That is probably enough for now: the remaining
error rate is low enough for practical purposes (i.e., lower than the natural error rate introduced by typing errors).
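For anyone on Python 3, the same idea can be sketched roughly as below. This is only a port under my own assumptions, not the script actually used: the helper names mangle, sed_rules and write_sed_file are made up for illustration, and the output file is written as UTF-8 text.

```python
def mangle(ch, codec):
    """Return ch as it looks after its UTF-8 bytes were misread as `codec`."""
    return ch.encode("utf-8").decode(codec)


def sed_rules(first=0xA0, last=0x170):
    """Yield sed substitutions mapping mangled text back to the original."""
    for cp in range(first, last):
        ch = chr(cp)
        seen = set()
        for codec in ("latin-1", "windows-1252"):
            try:
                bad = mangle(ch, codec)
            except UnicodeDecodeError:
                # Python's windows-1252 codec leaves a few bytes undefined
                continue
            if bad not in seen:  # skip the rule if both codecs agree
                seen.add(bad)
                yield "s/{}/{}/g".format(bad, ch)


def write_sed_file(path="fixu8.sed"):
    """Write one rule per line, encoded as UTF-8 (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as out:
        for rule in sed_rules():
            out.write(rule + "\n")
```

Calling write_sed_file() should give a file usable as sed -f fixu8.sed, assuming sed runs in a UTF-8 locale.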
--Jörg Knappen
Sent: Wednesday, 30 October 2013, 15:34
From: "Frédéric Grosshans" <[email protected]>
To: [email protected]
Subject: Re: Aw: Re: Re: Do you know a tool to decode "UTF-8 twice"
On 29/10/2013 17:15, "Jörg Knappen" wrote:
> After running this script, a few more things were there:
> Non-normalised accents and some really strange
> encodings I could not really explain but rather guess their meanings, like
> s/Ãœ/Ü/g
> s/Ã‰/É/g
> s/AÌ€/À/g
> s/aÌ€/à/g
> s/EÌ€/È/g
> s/eÌ€/è/g
> s/â€ž/„/g
> s/â€œ/“/g
> s/ÃŸ/ß/g
> s/â€™/’/g
> s/Ã„/Æ/g
It was probably not UTF-8 read as Latin-1 and re-encoded as UTF-8, but
UTF-8 read as Windows-1252 (
http://en.wikipedia.org/wiki/Windows-1252 ) and re-encoded as UTF-8. Each
of the combinations above contains a character absent from Latin-1
(œ‰€žŸ™„), and some of them are present only in Windows-1252 (‰™„) and
not in Latin-9 (ISO 8859-15), the other possible culprit.
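The claim about which characters exist in which code page can be checked with a few lines of Python 3. This is just a quick sketch; the helper name encodable is made up, and it relies on Python's own codec tables for latin-1, iso8859_15 and cp1252.

```python
def encodable(ch, codec):
    """True if the character ch exists in the given 8-bit code page."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


suspects = "œ‰€žŸ™„"

# Characters that exist in Latin-1 (expected: none of them)
in_latin1 = [c for c in suspects if encodable(c, "latin-1")]

# Characters that Windows-1252 has but ISO 8859-15 lacks
only_cp1252 = [c for c in suspects
               if encodable(c, "cp1252") and not encodable(c, "iso8859_15")]

print(in_latin1)    # []
print(only_cp1252)  # ['‰', '™', '„']
```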
I've checked that this is consistent with Ü, É and ß, but not with your Æ.
This double encoding would give Ä:
Ã„ = Win1252(C3 84) = 110.00011 10.000100 = UTF8(00011 000100) = Unicode 00C4
= Ä (and not Æ)
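The same arithmetic can be replayed in Python 3 as a sanity check. Again only a sketch: misread, repair and decode_two_byte are illustrative names, and cp1252 is Python's name for Windows-1252.

```python
def misread(ch, codec="cp1252"):
    """UTF-8 bytes of ch, wrongly interpreted as `codec`."""
    return ch.encode("utf-8").decode(codec)


def repair(text, codec="cp1252"):
    """Undo one round of misreading: back to bytes, then decode as UTF-8."""
    return text.encode(codec).decode("utf-8")


def decode_two_byte(lead, cont):
    """Combine the payload bits of a 2-byte UTF-8 sequence (110xxxxx 10xxxxxx)."""
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


print("Ä".encode("utf-8"))               # b'\xc3\x84'
print(hex(decode_two_byte(0xC3, 0x84)))  # 0xc4
print(misread("Ä"), repair("Ã„"))        # Ã„ Ä
print(misread("Æ"))                      # Ã† (bytes C3 86, not C3 84)
```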
Frédéric

