We have a few modules that have entities in them. These are of the fashion 
&nbsp; (a character entity), &#85; (a numeric decimal entity) and &#xC5; (a 
numeric hex entity).

These cause various problems:
a) If a module is encoded in Latin-1, there may be entities that do not fall 
within that encoding. In an HTML viewer, which substitutes characters for 
entities, the resulting text may mix Latin-1 and UTF-8, causing display problems.

b) If a module is searched, then these will cause search problems. For example, 
if one is searching for Bokmål and the text is encoded as Bokm&aring;l, it 
won't be found. When indexing with clucene, it will be broken into three words: 
Bokm, aring and l. Searching for aring will find it as a word.

c) Transliteration won't work on words with entities.

d) Removing decorations (umlauts, rings, accents, ...) from words won't work.

e) It is legal to have numeric entities for &, <, >, " and ', but SWORD has no 
recognition of these.

And so forth.
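
To make problem b) concrete, here is a minimal sketch of how a naive tokenizer 
splits a word containing an entity. The simple_tokens helper is hypothetical 
and is not clucene's actual analyzer; it only splits on anything that is not 
an ASCII letter or digit, which is enough to show the effect.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split text into "words" on any run of characters
# that are not ASCII letters or digits. This only approximates what an
# indexer does; it is not clucene's analyzer.
sub simple_tokens {
    my ($text) = @_;
    return grep { length } split /[^A-Za-z0-9]+/, $text;
}

# The '&' and ';' act as word separators, so one word becomes three.
my @tokens = simple_tokens('Bokm&aring;l');
print join(', ', @tokens), "\n";    # prints: Bokm, aring, l
```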

When we create a module, we should make sure to replace entities with their 
UTF-8 equivalents (after first making sure that the text itself is UTF-8).
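
For the "make sure the text is UTF-8 first" step, something along these lines 
could be used when the source is known to be Latin-1. This is only a sketch 
using the core Encode module; the latin1_to_utf8 name is made up for 
illustration, and it assumes the input really is ISO-8859-1.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode a Latin-1 byte string as UTF-8 bytes.
# Assumes the input really is ISO-8859-1; no detection is attempted.
sub latin1_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('ISO-8859-1', $bytes);   # bytes -> characters
    return encode('UTF-8', $chars);             # characters -> UTF-8 bytes
}

# "Bokmål" as Latin-1 bytes: 0xE5 is a-ring.
my $utf8 = latin1_to_utf8("Bokm\xE5l");

# Show the resulting bytes in hex; a-ring becomes the two bytes C3 A5.
print join(' ', map { sprintf '%02X', ord } split //, $utf8), "\n";
# prints: 42 6F 6B 6D C3 A5 6C
```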

To that end, I have written a Perl utility, EntityReplacer, that will normalize 
the entities for <, >, &, " and ', and replace most other entities (about 2700) 
with their UTF-8 equivalents.
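
To illustrate the idea (this is NOT the code in EntityReplacer, just a sketch 
of the technique), a replacement function might look roughly like this. The 
function name, the tiny three-entry table and the choice of &amp;, &lt;, &gt;, 
&quot; and &apos; as the normalized forms are all assumptions made for the 
example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tiny illustrative table of named entities (assumed sample only).
my %named = (aring => "\x{E5}", nbsp => "\x{A0}", eacute => "\x{E9}");

# The five characters that must stay as entities, keyed by code point.
# Using these named forms as the normalized spelling is an assumption.
my %special = (
    38 => '&amp;', 60 => '&lt;', 62 => '&gt;',
    34 => '&quot;', 39 => '&apos;',
);

# Hypothetical sketch: map one entity to its replacement text.
sub replace_entity_sketch {
    my ($entity) = @_;
    my $cp;
    if ($entity =~ /^&#[xX]([0-9A-Fa-f]+);$/) {
        $cp = hex $1;                       # numeric hex entity
    }
    elsif ($entity =~ /^&#([0-9]+);$/) {
        $cp = $1 + 0;                       # numeric decimal entity
    }
    elsif ($entity =~ /^&([A-Za-z0-9]+);$/) {
        # Named entity: replace if known, otherwise leave untouched
        # (this also leaves &amp;, &lt; and friends alone).
        return exists $named{$1} ? $named{$1} : $entity;
    }
    else {
        return $entity;                     # not an entity we understand
    }
    return $special{$cp} if exists $special{$cp};
    return chr $cp;
}

binmode(STDOUT, ':utf8');
my $text = 'Bokm&aring;l &#38; &#xC5; &lt;';
$text =~ s/(&#?[A-Za-z0-9]+;)/replace_entity_sketch($1)/ge;
print "$text\n";
```

With the sample input, the a-ring and the hex entity become real characters, 
the numeric ampersand becomes &amp;, and &lt; is left as it was.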

You can get the code here:
www.crosswire.org/~dmsmith/perl

As with Chris' Perl code, I have put it under the BSD license, with the 
copyright held by CrossWire.

It is packaged for CPAN, so you can install it in the usual way:
perl Makefile.PL
make
make test
make install

Or you can grab EntityReplacer.pm, put it in the same folder as your program, 
and call it in the following fashion:
#!/usr/bin/perl -w
use strict;

use FindBin qw($Bin);
use lib "$Bin";

use EntityReplacer;

binmode(STDOUT, ":utf8");

# Read the input one line at a time, replacing every entity on each line
# except the ones for <, >, &, ' and ".
while (<>) {
        s/(\&#?[a-zA-Z0-9-]+;)/EntityReplacer::toReplacement($1)/geo;
        print STDOUT;
}

Hope you find it useful.

In His Service,
        DM