Re: Aw: Re: Re: Re: Re: Do you know a tool to decode "UTF-8 twice"

Frédéric Grosshans Wed, 30 Oct 2013 10:03:27 -0700

Le 30/10/2013 17:32, "Jörg Knappen" a écrit :

The data did not only contain latin-1 type mangling for the non-existent Windows characters, but also sequences with the raw
C1 control characters for all of latin-1. So I had to do them, too.
The data weren't consistent at all, not even in their errors.
--Jörg Knappen

Your question helped me dust off and repair a non-working Python snippet I wrote for a similar problem. I was stuck with the mixing of windows-1252 and latin1 controls (linked with Chinese characters). I write it below for reference.

The Python snippet below does not need sed. It defines a function, unscramble(S), which works on strings. The extension to files should be easy.


    Frédéric Grosshans


# Assumes Python 3, where S is a str of Unicode characters.

def Step1Filter(S):
    # Works character by character because of the cp1252/latin1 ambiguity
    for c in S:
        try:
            yield c.encode('cp1252')
        except UnicodeEncodeError:
            # Useful where cp1252 is undefined (81, 8D, 8F, 90, 9D)
            yield c.encode('latin1')

def unscramble(S):
    # Rebuild the original byte string, then decode it as UTF-8 once
    return b''.join(Step1Filter(S)).decode('utf8')
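
For anyone who wants to try this on sample data, here is a minimal sketch (Python 3 assumed) of how the damage can be reproduced and of one possible extension to files. The names mangle and unscramble_file are hypothetical helpers that are not part of the snippet above, and unscramble is repeated in compact form only so that the block runs on its own.

```python
def mangle(text):
    """Hypothetical helper: simulate the damage by reading UTF-8 bytes
    as cp1252, falling back to latin1 where cp1252 is undefined."""
    out = []
    for byte in text.encode('utf8'):
        raw = bytes([byte])
        try:
            out.append(raw.decode('cp1252'))
        except UnicodeDecodeError:
            # 81, 8D, 8F, 90, 9D have no cp1252 meaning: keep as C1 controls
            out.append(raw.decode('latin1'))
    return ''.join(out)


def unscramble(S):
    """Compact restatement of the function above, for self-containment."""
    parts = []
    for c in S:
        try:
            parts.append(c.encode('cp1252'))
        except UnicodeEncodeError:
            # Raw C1 controls and the five bytes undefined in cp1252
            parts.append(c.encode('latin1'))
    return b''.join(parts).decode('utf8')


def unscramble_file(src, dst):
    """Hypothetical file extension: assumes src is stored as UTF-8 and
    every line in it is doubly encoded."""
    # newline='' keeps the original line endings untouched
    with open(src, encoding='utf8', newline='') as fin, \
         open(dst, 'w', encoding='utf8', newline='') as fout:
        for line in fin:
            fout.write(unscramble(line))
```

With these definitions, unscramble(mangle(s)) should give back s for any string s. Working line by line is safe here because line breaks are ASCII and never fall inside a multi-byte UTF-8 sequence. Note that a file mixing correct and doubly encoded text would raise an error on the correct parts, so this sketch only fits data that is damaged throughout.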

PS: If anyone is interested in a licence, I consider this simple enough to be in the public domain and uncopyrightable.
