On Thu, Nov 12, 2009 at 2:36 AM, DM Smith <[email protected]> wrote:
>
> On Nov 11, 2009, at 9:59 AM, Karl Kleinpaste wrote:
>
>> DM Smith <[email protected]> writes:
>>> U+00E5 is the unicode code point, not the encoding. In hex the utf-8
>>> encoding would be C3 A5. In ISO-8859-1, it would be E5.
>>
>> XEmacs tells me that the buffer is UTF-8. Manually re-asserting it...
>>
>>     M-x set-buffer-file-coding-system RET utf-8 RET
>>
>> ...and re-saving the file makes no change to the content, yet that's
>> exactly the mechanism I've used in the past to convert ISO-8859 to UTF-8.
>>
>>> So I'd suggest looking at a hex dump to see what the encoding is.
>>
>> BTDT. "od -c" of this...
>>
>> # correct: Norwegian Bokmål
>> #nb	Norsk Bokmål
>> # a hack while g_utf8_validate() dislikes 'å': Norwegian Bokmaal
>> nb	Norsk Bokmaal
>>
>> ...produces this...
>>
>> 0007300   o   e   r   o  \n   #       c   o   r   r   e   c   t   :
>> 0007320   N   o   r   w   e   g   i   a   n       B   o   k   m 303 245
>> 0007340   l  \n   #   n   b  \t   N   o   r   s   k       B   o   k   m
>> 0007360 303 245   l  \n   #       a       h   a   c   k       w   h   i
>> 0007400   l   e       g   _   u   t   f   8   _   v   a   l   i   d   a
>> 0007420   t   e   (   )       d   i   s   l   i   k   e   s       ' 303
>> 0007440 245   '   :       N   o   r   w   e   g   i   a   n       B   o
>> 0007460   k   m   a   a   l  \n   n   b  \t   N   o   r   s   k       B
>>
>> For a-ring, the character map application observes...
>>     C octal escaped UTF-8: \303\245
>> ...so I'm pretty well convinced that the content is right.
>
> You've convinced me. I'm curious as to whether this is a reported GTK bug?
>
> I'm also curious as to whether it handles the decomposed form. The following
> is \141\314\212:
> Bokmål
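
The byte values quoted above can be cross-checked with a minimal Python sketch. It uses only the standard library; the helper name octal_escape is made up for illustration and is not part of Sword, glib, or any tool mentioned in this thread.

```python
import unicodedata

def octal_escape(data: bytes) -> str:
    # Hypothetical helper: render each byte as a three-digit octal escape,
    # the same notation "od -c" and the character map use for non-ASCII bytes.
    return "".join("\\%03o" % b for b in data)

a_ring = "\u00e5"                                  # precomposed U+00E5
decomposed = unicodedata.normalize("NFD", a_ring)  # 'a' + U+030A COMBINING RING ABOVE

utf8 = a_ring.encode("utf-8")          # the two-byte UTF-8 encoding
latin1 = a_ring.encode("iso-8859-1")   # the single-byte ISO-8859-1 encoding
utf8_nfd = decomposed.encode("utf-8")  # UTF-8 encoding of the decomposed form

print(utf8.hex(), octal_escape(utf8))  # c3a5 \303\245
print(latin1.hex())                    # e5
print(octal_escape(utf8_nfd))          # \141\314\212
```

The 303 245 pairs in the od output match the UTF-8 encoding of the precomposed character, and \141\314\212 matches the decomposed form.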

Surely we have had this reported as a problem, or as causing an issue,
with the following:

* JSword (Java UTF-8 processing?)
* BPBible (Python UTF-8 processing)
* Xiphos (glib)

While it's not impossible that these are all wrong, it seems a little
improbable. Also, reading the file in as cp1252 and writing it out as UTF-8
in vim did change it (or at least, it did with the version of that file we
had with BPBible - I'm afraid I can't test the version in Sword right now).

I still incline to the view that it's more likely that the encoding is wrong
than that several (possibly) independent implementations of UTF-8 are.

Jon
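
One way to test the encoding question directly, independent of any editor, is a strict UTF-8 decode over the raw bytes of the file. The following is only a sketch: first_invalid_utf8 is a hypothetical helper, and Python's decoder is not guaranteed to agree with g_utf8_validate() in every edge case.

```python
def first_invalid_utf8(data: bytes):
    # Hypothetical helper: return the byte offset of the first sequence that
    # is not valid UTF-8, or None if the whole buffer decodes cleanly.
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as err:
        return err.start
    return None

line = "Norsk Bokm\u00e5l\n"

print(first_invalid_utf8(line.encode("utf-8")))       # None
print(first_invalid_utf8(line.encode("iso-8859-1")))  # 10
```

For a real file, the bytes would come from open(path, "rb").read(); a non-None result points at the offset worth inspecting in a hex dump.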
