My experience is with Perl and Java, but it may have some bearing. Collation is language dependent: English, French, and German each collate their accented characters differently. In traditional Spanish, "ch" is treated as a separate letter sorted after "c" (though this may be changing). In Java, collation uses the provided locale, falling back to the program's default locale, which, unless set, is the user's locale. I found that the same logic was needed to do a binary search: if ICU is needed for sorting, then ICU will be needed for the binary search as well.

On a project I worked on, we had two fundamental requirements for a list of 40K+ international publication titles:

1) For each supported locale, present the lists and sublists of publications in the order that is appropriate for that locale.
2) Provide efficient searching.

To accomplish this we first had to normalize the name of each publication. This requires knowing the language of the title so that that language's stop words can be used ("Het Dagblad" and "The Podunk Times" needed to sort under "Dagblad" and "Podunk Times", respectively, because "Het" and "The" are stop words in their languages). We decided that while an English speaker might look for "Het Dagblad" under "H", the publication's own locale was more important. We had tried a universal stop-word list, the union of every language's stop words, but that did not work: "La" could be a Spanish article or an abbreviation for Los Angeles, and "die" means very different things in English and German.

We zero-padded numbers, removed stop words, folded everything to a single case, removed some punctuation, and collapsed redundant spacing. There were other normalizations, but these are the obvious ones we can all think of.

We then created a text table with the normalized title and the original title; the remaining columns were numeric sort keys for each supported language. (This could have been done with parallel tables.) This table was sorted on the normalized title, but using an 8-bit ASCII collation.
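The normalization and table lookup described above can be sketched in Java. The stop-word lists, the six-digit padding width, and the punctuation rule below are illustrative assumptions, not the project's actual rules; the real lists were far larger and curated per locale:

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

public class TitleNormalizer {
    // Hypothetical, tiny stop-word lists keyed by language code;
    // the real project's lists were much larger.
    static final Map<String, Set<String>> STOP_WORDS = Map.of(
            "en", Set.of("the", "a", "an"),
            "nl", Set.of("het", "de", "een"));

    static final Pattern DIGITS = Pattern.compile("\\d+");

    static String normalize(String title, String lang) {
        String s = title.toLowerCase(Locale.ROOT);   // single-case everything
        s = s.replaceAll("\\p{Punct}+", " ");        // remove some punctuation
        // zero-pad numbers so "Vol 2" sorts before "Vol 10"
        s = DIGITS.matcher(s).replaceAll(
                m -> String.format("%06d", Long.parseLong(m.group())));
        Set<String> stops = STOP_WORDS.getOrDefault(lang, Set.of());
        StringBuilder out = new StringBuilder();
        for (String word : s.trim().split("\\s+")) { // collapse redundant spacing
            if (stops.contains(word)) continue;      // remove stop words
            if (out.length() > 0) out.append(' ');
            out.append(word);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Titles are normalized with the language of the publication, not the user.
        String[] table = {
                normalize("Het Dagblad", "nl"),      // -> "dagblad"
                normalize("The Podunk Times", "en"), // -> "podunk times"
                normalize("Vol. 2", "en"),           // -> "vol 000002"
        };
        Arrays.sort(table);  // plain (ASCII-style) ordering, as in the text table
        // Exact-match lookup: normalize the query with the SAME rules,
        // then binary-search the sorted normalized column.
        String query = normalize("The Podunk Times", "en");
        System.out.println(query + " found at " + Arrays.binarySearch(table, query));
    }
}
```

Because the table is sorted with a plain byte-wise ordering and the query goes through the same normalization, the binary search needs no collation library at all.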
To do a search for an exact match, the user's input was normalized with exactly the same rules and then a binary search was done. When the user wanted a free-text search, we used something like Lucene to index the titles; each title was indexed together with its normalized form. To sort a list of titles in the fashion the user wants to see, we used the appropriate sort-key column from the table (falling back to the default column if the user's locale was not supported).

We ultimately used Java to do the collation because Perl's UTF-8 support was not quite there (5.6 was the latest version at the time), and we found that we needed ICU for some of the more specialized rules that I have not presented here; ICU was not supported for Perl at the time. I don't know where Perl stands now.

BTW, this is something that I could throw together in Java, if it is OK to have some Sword tools in something other than C++.

Daniel Glassey wrote:
> fwiw, here's my opinion on what the standards should be. I definitely
> agree that there should be standards.
>
> On 22/06/05, Joachim Ansorg <[EMAIL PROTECTED]> wrote:
> > Hi,
> > I'm struggling with the Unicode handling of lexicons, and with lexicons
> > in general. Currently a frontend doesn't know whether to expect keys as
> > UTF-8 or as something else, because there's no standard defined. The
> > same is true of GenBooks.
>
> It seems reasonable to me that all text, keys, everything in all types
> of modules should be in UTF-8.
>
> > Secondly, the sort order is not valid for Unicode if Unicode characters
> > are used in the entry names. That way Unicode strings like the German
> > "a umlaut" appear at the end, but they should be among the first
> > entries of the list. Sorting in the frontend moves the lexicon intro
> > somewhere into the middle of the list and is slow(er).
>
> Unicode defines collation (sorting): http://www.unicode.org/reports/tr10/
> The entries should be sorted by the module creation app, using something
> that implements that algorithm.
> ICU should do the job, and it doesn't have to be linked into the runtime
> lib to be able to do this; it only needs to be linked into the module
> creation app. The way it collates is language specific, so it should get
> German right. I think Perl and Python should also be able to do
> collation, so they are another option.
>
> > Thirdly, the lexicon intro is a hack: it uses a lot of prepended spaces
> > to stay in first place in the list. We need to find a better solution
> > for that.
>
> Agreed (sorry, I don't have one offhand).
>
> > I'm missing defined standards for the API and the modules. That would
> > make frontend development a lot easier.
>
> Agreed,
> Daniel

_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
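Daniel's point that the collation library need only be linked into the module creation app can be illustrated with precomputed binary sort keys: the creation tool emits one key per entry, and the runtime sorts by plain byte comparison. This sketch uses the JDK's java.text.Collator for illustration (ICU4J's com.ibm.icu.text.Collator offers an equivalent getCollationKey); the locale and sample entries are assumptions:

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ModuleSortKeys {
    public static void main(String[] args) {
        // Module-creation time: build a binary sort key per entry with a
        // locale-aware collator. Only this tool needs the collation library.
        Collator de = Collator.getInstance(Locale.GERMAN);
        String[] entries = { "Zebra", "über", "Uhr" };
        List<byte[]> keys = new ArrayList<>();
        for (String e : entries) {
            keys.add(de.getCollationKey(e).toByteArray());
        }
        // Runtime: no collation library needed -- an unsigned byte-wise
        // comparison of the stored keys reproduces the German order,
        // so "über" sorts with the u's instead of after "Zebra".
        Integer[] order = { 0, 1, 2 };
        Arrays.sort(order,
                (a, b) -> Arrays.compareUnsigned(keys.get(a), keys.get(b)));
        for (int i : order) System.out.println(entries[i]);
    }
}
```

Storing one key column per supported locale, as in the publication-title table described earlier, lets the runtime pick the user's column and sort with nothing more than memcmp-style comparison.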