On Mon, Oct 28, 2013 at 6:06 AM, "Jörg Knappen" <[email protected]> wrote:
> Hi Steffen,
>
> data aren't that easy. There are non-latin1-characters encoded in the UTF8
> part. I expect among others typographic apostrophes, Polish characters,
> some mediaevalist characters like ũ (u with tilde). Maybe, there is also
> some Greek inside, but I am not sure about that.
>
> --Jörg Knappen
>
> *Sent:* Monday, October 28, 2013 at 12:34
> *From:* "Steffen \"Daode\" Nurpmeso" <[email protected]>
> *To:* "Jörg Knappen" <[email protected]>
> *Cc:* [email protected]
> *Subject:* Re: Do you know a tool to decode "UTF-8 twice"
>
> "Jörg Knappen" <[email protected]> wrote:
> | Is there a ready made tool that decodes "UTF-8 twice" while keeping
> | UTF-8 proper in place?
>
> Isn't a shell script with a truly validating iconv(1) enough?
> This works for me if in utf8.1 there is 'ÄEIÖÜ' in UTF-8 and i run
>
> ?0[steffen@sherwood tmp]$ iconv -f latin1 -t utf8 < utf8.1 > utf8.2
>
> As in
>
> for i in utf8.1 utf8.2; do
>   if iconv -f utf8 -t latin1 < ${i} |
>       iconv -f utf8 -t utf8 >/dev/null 2>&1; then
>     echo ${i}: bummer, going home by one
>     iconv -f utf8 -t latin1 < ${i} > ${i}.new 2>&1
>   else
>     echo ${i}: valid UTF-8
>   fi
> done
>
> i'll end up as
>
> ?0[steffen@sherwood tmp]$ sh utf8dec.sh
> utf8.1: valid UTF-8
> utf8.2: bummer, going home by one
> ?0[steffen@sherwood tmp]$
>
> Ciao,
>
> | --Jörg Knappen
>
> --steffen

Jörg: There's no ready-made tool, but it's easy to write in Python. I'll provide you a well-tested function in a few minutes.
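
To make the idea concrete, here is a minimal Python 3 sketch of one way such a function could look. It is only an illustration, not the well-tested function mentioned above. The names `fix_double_utf8` and `_repair` are made up for this example. It assumes the second encoding step went through Latin-1 (not Windows-1252) and that the input as a whole is valid UTF-8. Instead of converting the whole file, it repairs only those short character runs that look like a UTF-8 sequence misread as Latin-1, so proper UTF-8 such as ũ or Polish letters is left in place.

```python
import re

# A UTF-8 multi-byte sequence misread as Latin-1 appears as a lead
# character (U+00C2..U+00F4) followed by 1 to 3 characters in the
# range U+0080..U+00BF. This pattern only finds candidates; the strict
# decode in _repair() decides whether a candidate is really repaired.
_CANDIDATE = re.compile(
    "[\u00c2-\u00df][\u0080-\u00bf]"
    "|[\u00e0-\u00ef][\u0080-\u00bf]{2}"
    "|[\u00f0-\u00f4][\u0080-\u00bf]{3}"
)


def _repair(match):
    """Undo one level of encoding for a candidate run, if possible."""
    chunk = match.group(0)
    try:
        # Map the characters back to bytes, then decode those bytes strictly.
        return chunk.encode("latin-1").decode("utf-8")
    except UnicodeDecodeError:
        # Not a valid UTF-8 sequence after all: keep the text unchanged.
        return chunk


def fix_double_utf8(data):
    """Decode UTF-8 bytes and repair runs that were encoded twice.

    Hypothetical helper: assumes `data` is valid UTF-8 overall and that
    the double encoding happened via Latin-1.
    """
    text = data.decode("utf-8")
    return _CANDIDATE.sub(_repair, text)
```

A file would be read in binary mode, for example `fix_double_utf8(open("utf8.2", "rb").read())`. Note the limitation: genuine text that happens to contain a pair such as "Ã" directly followed by "¤" would be converted too, so the result should be checked on real data.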

