On Feb 24, 2008, at 4:46 PM, Chris Little wrote:



DM Smith wrote:
I have added a -n flag to osis2mod.

I'm going to add it to the other major importers (osis2gbs & imp2*) just
as soon as I get things into a fairly stable state.

This flag is only available when osis2mod has been compiled with ICU
support.

-n stands for "normalize to NFC" (Unicode Normalization Form C), the agreed-upon normalization form for our UTF-8 text.

When should this flag be used?
1) When the input is UTF-8
and
2) It is not known to be NFC

First, I feel like there's really no reason NOT to perform
normalization, provided that the input is UTF-8. Even if the input is
already in NFC, it won't hurt anything to do it again. It will take
extra time to compile the module, but I feel like it's better to be safe
than sorry in this case.
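
To illustrate the point that re-normalizing is harmless, here is a minimal sketch in Python using the standard unicodedata module. It is not how osis2mod does it (osis2mod relies on ICU); the helper name to_nfc is made up for this example.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


decomposed = "e\u0301"          # 'e' followed by COMBINING ACUTE ACCENT
composed = to_nfc(decomposed)   # composes to the single code point U+00E9

# Normalizing text that is already NFC leaves it unchanged,
# so running the step a second time costs time but nothing else.
assert composed == "\u00e9"
assert to_nfc(composed) == composed
```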


I mostly agree. But once I know that the module is NFC, I'd rather not take the hit. I must have made the KJV into a module 100 or more times before I got it right.




Second, your comment about needing UTF-8 input makes me think we should
go ahead and add encoding conversion to the importers as well, possibly
with automatic charset detection.

I'd like to see OSIS modules also be UTF-8.

What mechanism were you thinking of for automatic charset detection? I have a buggy routine that detects whether something is UTF-8, 7-bit ASCII, or something else. We could use that (once I fix it).
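
A three-way check of that kind could look roughly like the sketch below. This is not the routine mentioned above, only an illustration in Python; the function name classify_bytes and its return labels are assumptions made for the example.

```python
def classify_bytes(data: bytes) -> str:
    """Classify raw input as 'ascii', 'utf-8', or 'other' (hypothetical helper)."""
    try:
        data.decode("ascii")    # succeeds only if every byte is below 0x80
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")    # strict decoding rejects malformed sequences
        return "utf-8"
    except UnicodeDecodeError:
        return "other"
```

For example, classify_bytes(b"caf\xe9") returns "other", since a lone 0xE9 byte is not valid UTF-8. The check cannot say which legacy encoding "other" is, and a short legacy-encoded text can occasionally happen to be valid UTF-8.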

As to automatic charset detection, could we require that every input to osis2mod have:
<?xml version="1.0" encoding="UTF-8"?>
or
<?xml version="1.0" encoding="cp1252"?>
and use whatever is the value for the encoding attribute?
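
As a rough sketch of that idea (in Python, not the osis2mod code), the declared encoding could be read from the XML declaration and used to transcode the input to UTF-8. The names declared_encoding and to_utf8 are made up for this example. It assumes an ASCII-compatible encoding with no byte order mark, and falls back to UTF-8 when no encoding is declared, which is the XML default.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start of the input.
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the codec name declared in the XML declaration, or the default."""
    match = _DECL.match(data[:200])
    if match is None:
        return default
    name = match.group(1).decode("ascii")
    # codecs.lookup raises LookupError if Python does not know the encoding.
    return codecs.lookup(name).name


def to_utf8(data: bytes) -> bytes:
    """Decode the input with its declared encoding and re-encode it as UTF-8."""
    return data.decode(declared_encoding(data)).encode("utf-8")
```

Note that the transcoded output still carries the original declaration, so a real importer would either rewrite that line or ignore it after conversion.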


-- DM

_______________________________________________
sword-devel mailing list: [email protected]
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page