On 02/20/2010 04:58 AM, David Haslam wrote:
Please first read this article.

http://en.wikipedia.org/wiki/Dotted_and_dotless_I

Languages such as Turkish and Northern Azeri have both dotted and dotless
letter I in their Latin-based alphabets.

This has implications for letter case.
Such alphabets break the familiar case relationship between uppercase I and
lowercase i.
Instead they have the following upper- and lowercase pairs:

I and ı
İ and i

This and related issues have been discussed recently for Java Lucene. See the following search for discussions of Java Lucene and Turkish (the Gossamer Threads search is durable, so it will also return any future conversations):
http://www.gossamer-threads.com/lists/engine?list=lucene&do=search_results&search_forum=forum_3&search_string=turkish&search_type=AND
The Jira issue for Java Lucene that corrected this is:
http://issues.apache.org/jira/browse/LUCENE-2102

Another interesting Jira issue for Java Lucene that discusses using ICU for normalization:
http://issues.apache.org/jira/browse/LUCENE-1488

The upshot is that Java Lucene previously handled Turkish inappropriately. The basic problem is that the lowercase filters were not locale sensitive; that is, 'I' was always lowercased to 'i'.
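
To make this concrete, here is a minimal sketch using plain java.lang.String rather than Lucene (the sample words are mine, chosen only to show the mappings):

    import java.util.Locale;

    public class TurkishCaseDemo {
        public static void main(String[] args) {
            Locale turkish = new Locale("tr", "TR");

            // Locale-blind lowercasing: 'I' always maps to 'i',
            // which is wrong for Turkish text.
            System.out.println("DIN".toLowerCase(Locale.ENGLISH)); // prints "din"

            // Turkish lowercasing: 'I' maps to dotless 'ı' (U+0131)
            // and 'İ' (U+0130) maps to 'i'.
            System.out.println("DIN".toLowerCase(turkish));      // prints "dın"
            System.out.println("İSTANBUL".toLowerCase(turkish)); // prints "istanbul"
        }
    }

Lucene's LowerCaseFilter behaves like the Locale.ENGLISH line no matter what language the text is in.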

This was not the only issue. Here are some more:
The success of the filter depends upon whether the character is composed or decomposed. If it was decomposed, the combining mark was handled separately from the base letter.
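
Here is a minimal sketch of that pitfall, using java.text.Normalizer to decompose 'İ' and the per-character lowercasing that LowerCaseFilter effectively performs:

    import java.text.Normalizer;

    public class DecompositionDemo {
        public static void main(String[] args) {
            // Composed form: one code point, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE).
            String composed = "\u0130";
            // Decomposed (NFD) form: U+0049 ('I') followed by U+0307 (COMBINING DOT ABOVE).
            String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);
            System.out.println(decomposed.length()); // prints 2

            // A per-character filter sees 'I' and the combining mark separately,
            // so even correct Turkish handling of 'I' cannot produce a bare 'i' here.
            StringBuilder sb = new StringBuilder();
            for (char c : decomposed.toCharArray()) {
                sb.append(Character.toLowerCase(c)); // 'I' -> 'i'; the mark is unchanged
            }
            System.out.println(sb); // "i" plus a combining dot above, which does not match "i"
        }
    }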

It is important that the locale used for lowercasing be determined by the text's language, not by the user's location.

The placement of the filter in the analyzer is critical. (See LUCENE-1488 above for a discussion).
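
For illustration, a minimal sketch of a Turkish-aware chain against the Lucene 3.x Java API, assuming the TurkishLowerCaseFilter added by LUCENE-2102 (contrib/analyzers in the 3.x line) is on the classpath:

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tr.TurkishLowerCaseFilter;
    import org.apache.lucene.util.Version;

    public class TurkishTextAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new StandardTokenizer(Version.LUCENE_30, reader);
            // The Turkish-aware lowercase filter must come before any filter
            // that compares lowercased terms (stop words, stemming, ...),
            // so that those filters see 'ı' and 'i' as Turkish defines them.
            stream = new TurkishLowerCaseFilter(stream);
            return stream;
        }
    }
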
Questions:
Does sword_icu properly address this in terms of case folding?
I don't think that SWORD's ICU library handles case folding; it handles transliteration instead.
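
For comparison, ICU itself does expose Turkic-aware case folding. A sketch using ICU4J (SWORD calls ICU from C++, but the folding options are parallel):

    import com.ibm.icu.lang.UCharacter;

    public class CaseFoldDemo {
        public static void main(String[] args) {
            // Default Unicode case folding: 'I' folds to 'i'.
            System.out.println(UCharacter.foldCase("DIN",
                UCharacter.FOLD_CASE_DEFAULT));           // prints "din"
            // Turkic folding: 'I' folds to 'ı' and 'İ' folds to 'i'.
            System.out.println(UCharacter.foldCase("DIN",
                UCharacter.FOLD_CASE_EXCLUDE_SPECIAL_I)); // prints "dın"
        }
    }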

How does each front-end application address these issues, e.g. in terms of
case-insensitive searches, etc?
If clucene is used for searches, it is simply wrong for these cases: SWORD uses the StandardAnalyzer for all texts, and that analyzer uses the LowerCaseFilter, which is sensitive to neither the user's nor the text's locale.

As I said earlier, SWORD needs to pick an analyzer by language. StandardAnalyzer is not appropriate for many, if not most, of the modules at CrossWire.
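
As a sketch of what picking by language might look like (the factory and its method names are hypothetical, not existing SWORD or clucene API; Java is used for illustration, reusing the TurkishTextAnalyzer sketched above):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    public class AnalyzerFactory {
        public static Analyzer forLanguage(String languageCode) {
            // Languages with dotted and dotless I need the Turkish-aware chain.
            if ("tr".equals(languageCode) || "az".equals(languageCode)) {
                return new TurkishTextAnalyzer();
            }
            // Fall back to StandardAnalyzer where its LowerCaseFilter is acceptable.
            return new StandardAnalyzer(Version.LUCENE_30);
        }
    }

The language code here would come from the module's configuration, not from the user's locale.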

It should not be too hard for someone (i.e. someone else, not me) to backport the Java Lucene Turkish analyzer to clucene, whether contributed to clucene or put into the SWORD library. I say backport because it is part of Java Lucene 3.0, which is significantly different from Java Lucene 2.9, while clucene is in the 2.x series.

In Him,
    DM

cf.  We already have two Turkish Bible modules, and work is about to start
on a Bible module for Northern Azeri.

Working on the Go Bible for the Azerbaijani translation is how I became
alerted to this issue.

David