Re: [sword-devel] module making problem - U_INVALID_CHAR_FOUND

DM Smith Wed, 13 Apr 2005 15:55:51 -0700

Thanks Chris for the clarification. I did not find anywhere on the website where it is mentioned that Latin-1 means cp1252. Silly me for assuming that it meant what the ISO board meant it to be and not what MS co-opted it for.

In one of the archived messages, it was mentioned that the filters always go to UTF-8 first. Is this the case with imp2ld? I saw that it is using the module code to do the writing, but I did not dig further to see where that would happen.

I just searched the archives and I see that as early as 2000 there is a desire to migrate all modules to UTF-8. After we release JSword 1.0, is this something that I can help with?


Chris Little wrote:

DM Smith wrote:
I am not entirely sure that it is a bug in ICU. I think it is a "feature".
I didn't say it was a bug, but an error. It is an error message being printed to cerr.

I'm unclear as to WHY ICU is printing an error message, since I can't think of when it would actually get to process data coming from an IMP file. But the export matches the import (for me), so data isn't being mangled. Hence I don't believe it's an issue at all. None of the importers do encoding conversions.

ICU does not recognize any valid characters in the reserved ranges of an encoding. (Not sure I am using proper terminology here.) For example ISO-8859-1 (aka Latin 1) identifies everything between 128 and 159 as undefined. However, this range is used by cp1252 (and the other cp125x code pages), which are Microsoft's variants on ISO 8859. Many people mistakenly refer to cp1252 as Latin-1. It is not.

Many of the non-UTF-8 modules contain non-Latin-1 characters. Converting them to UTF-8 will fail, and when converting back to Latin-1, those characters will not be present.
Sword modules and .conf files come in exactly two different encodings: UTF-8 and Codepage 1252. If a module is encoded as UTF-8, it is noted in the encoding line of the .conf. If there is no encoding line, the module is Codepage 1252.
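
The rule above (UTF-8 only when the .conf says so, otherwise Codepage 1252) could be sketched roughly as follows. This is an illustration only, not Sword's actual code: the function name module_encoding, the sample .conf text, and the exact key spelling "Encoding" are assumptions made for the example.

```python
def module_encoding(conf_text: str) -> str:
    """Return the Python codec name implied by a module's .conf contents.

    Assumption: the encoding line looks like "Encoding=UTF-8".
    Anything else, including a missing line, is treated as Codepage 1252.
    """
    for line in conf_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "encoding":
            if value.strip().upper() == "UTF-8":
                return "utf-8"
    return "cp1252"


# Hypothetical .conf contents, for illustration only.
utf8_conf = "[DemoModule]\nDataPath=./modules/demo/\nEncoding=UTF-8\n"
legacy_conf = "[DemoModule]\nDataPath=./modules/demo/\n"

print(module_encoding(utf8_conf))
print(module_encoding(legacy_conf))
```
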

There are various places in the library where we may refer to Latin-1, but what is always meant is "Codepage 1252" (not "ISO-8859-1"). The same goes for discussion on the list. If we talk about Latin-1 in connection with Sword, we really mean Codepage 1252.
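
To make the distinction concrete, here is a small sketch of how the same bytes come out under the two encodings. It uses Python's built-in codecs rather than ICU or Sword, so the behaviour for the 128-159 range is Python's (it maps those bytes to C1 control characters under ISO-8859-1), which may differ from what ICU reports. The sample bytes are made up for the example.

```python
import unicodedata

# Bytes 0x93 and 0x94 are curly quotes in Codepage 1252,
# but fall in the 128-159 range that ISO-8859-1 does not use for text.
raw = bytes([0x93, 0x48, 0x69, 0x94])

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

for label, text in (("cp1252", as_cp1252), ("iso-8859-1", as_latin1)):
    first = text[0]
    # Category "Pi" is initial-quote punctuation, "Cc" is a control character.
    print(label, hex(ord(first)), unicodedata.category(first))
```

Under cp1252 the first character is a real punctuation mark; under ISO-8859-1 the same byte is only a control code, which is why treating a Codepage 1252 module as true Latin-1 loses those characters.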

If we were to identify to the conversion routine what encoding was used, then it might work. I say might, because I ran across a few OSes that did not have the MS encodings on them. (e.g. IBM mainframe, Sun Solaris at least through 7, early versions of Linux [ but have not looked lately ]).
Modern Linux definitely carries CP1252. Many other vendors rename CP1252 to things like "ibm-1252" before using it on their systems. In this case, Sword knows how to convert CP1252 to UTF-8 and can also use ICU (which is also capable of CP1252 conversions). But, again, none of the importers are actually doing encoding conversions.
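
For anyone who wants to experiment with a CP1252 to UTF-8 conversion outside the library, a minimal sketch using Python's standard codecs might look like this. It is not how Sword or ICU performs the conversion, and the helper name cp1252_to_utf8 is invented for the example. Note that Python's cp1252 codec leaves a handful of bytes (0x81 among them) undefined and raises an error on them in strict mode.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Transcode Codepage 1252 bytes to UTF-8.

    Raises UnicodeDecodeError if the input contains a byte that
    Python's cp1252 codec leaves undefined (for example 0x81).
    """
    return data.decode("cp1252").encode("utf-8")


sample = b"\x93Hi\x94"  # curly-quoted "Hi" in Codepage 1252
print(cp1252_to_utf8(sample))

try:
    cp1252_to_utf8(b"\x81")
except UnicodeDecodeError as err:
    print("cannot convert:", err)
```
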

--Chris

