On 02/20/2010 04:58 AM, David Haslam wrote:
Please first read this article.

http://en.wikipedia.org/wiki/Dotted_and_dotless_I

Languages such as Turkish and Northern Azeri have both dotted and dotless
letter I in their Latin-based alphabets.

This has implications for letter case.
Such alphabets break the familiar case relationship between uppercase I and
lowercase i.
Instead they have the following upper- and lowercase pairs:

I and ı
İ and i

This and related issues have been discussed recently for Java Lucene. See the following search for discussions of Java Lucene and Turkish (the Gossamer Threads search is durable, so it will also return any future conversations):
http://www.gossamer-threads.com/lists/engine?list=lucene&do=search_results&search_forum=forum_3&search_string=turkish&search_type=AND
The Jira issue for Java Lucene that corrected this is:
http://issues.apache.org/jira/browse/LUCENE-2102

Another interesting Jira issue for Java Lucene that discusses using ICU for normalization:
http://issues.apache.org/jira/browse/LUCENE-1488

The upshot is that Java Lucene previously handled Turkish inappropriately. The basic problem is that the lowercase filters were not locale sensitive; that is, 'I' was always lowercased to 'i'.
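
To make this concrete, here is a minimal sketch using plain java.lang.String rather than Lucene (the sample words are mine, chosen only to show the mappings):

    import java.util.Locale;

    public class TurkishCaseDemo {
        public static void main(String[] args) {
            Locale turkish = new Locale("tr", "TR");

            // Locale-blind lowercasing: 'I' always maps to 'i',
            // which is wrong for Turkish text.
            System.out.println("DIN".toLowerCase(Locale.ENGLISH)); // prints "din"

            // Turkish lowercasing: 'I' maps to dotless 'ı' (U+0131)
            // and 'İ' (U+0130) maps to 'i'.
            System.out.println("DIN".toLowerCase(turkish));      // prints "dın"
            System.out.println("İSTANBUL".toLowerCase(turkish)); // prints "istanbul"
        }
    }

Lucene's LowerCaseFilter behaves like the Locale.ENGLISH line no matter what language the text is in.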

This was not the only issue. Here are some more:
The success of the filter depends upon whether the character is composed or decomposed. If it was decomposed, the combining mark was handled separately from the base letter.
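
Here is a minimal sketch of that pitfall, using java.text.Normalizer to decompose 'İ' and the per-character lowercasing that LowerCaseFilter effectively performs:

    import java.text.Normalizer;

    public class DecompositionDemo {
        public static void main(String[] args) {
            // Composed form: one code point, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE).
            String composed = "\u0130";
            // Decomposed (NFD) form: U+0049 ('I') followed by U+0307 (COMBINING DOT ABOVE).
            String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);
            System.out.println(decomposed.length()); // prints 2

            // A per-character filter sees 'I' and the combining mark separately,
            // so even correct Turkish handling of 'I' cannot produce a bare 'i' here.
            StringBuilder sb = new StringBuilder();
            for (char c : decomposed.toCharArray()) {
                sb.append(Character.toLowerCase(c)); // 'I' -> 'i'; the mark is unchanged
            }
            System.out.println(sb); // "i" plus a combining dot above, which does not match "i"
        }
    }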

It is important that the locale used for lowercasing be determined by the text's language, not by the user's location.

The placement of the filter in the analyzer is critical. (See LUCENE-1488 above for a discussion).
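
For illustration, a minimal sketch of a Turkish-aware chain against the Lucene 3.x Java API, assuming the TurkishLowerCaseFilter added by LUCENE-2102 (contrib/analyzers in the 3.x line) is on the classpath:

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tr.TurkishLowerCaseFilter;
    import org.apache.lucene.util.Version;

    public class TurkishTextAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new StandardTokenizer(Version.LUCENE_30, reader);
            // The Turkish-aware lowercase filter must come before any filter
            // that compares lowercased terms (stop words, stemming, ...),
            // so that those filters see 'ı' and 'i' as Turkish defines them.
            stream = new TurkishLowerCaseFilter(stream);
            return stream;
        }
    }
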
Questions:
Does sword_icu properly address this in terms of case folding?
I don't think that SWORD's ICU library handles case folding; it handles transliteration instead.
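
For comparison, ICU itself does expose Turkic-aware case folding. A sketch using ICU4J (SWORD calls ICU from C++, but the folding options are parallel):

    import com.ibm.icu.lang.UCharacter;

    public class CaseFoldDemo {
        public static void main(String[] args) {
            // Default Unicode case folding: 'I' folds to 'i'.
            System.out.println(UCharacter.foldCase("DIN",
                UCharacter.FOLD_CASE_DEFAULT));           // prints "din"
            // Turkic folding: 'I' folds to 'ı' and 'İ' folds to 'i'.
            System.out.println(UCharacter.foldCase("DIN",
                UCharacter.FOLD_CASE_EXCLUDE_SPECIAL_I)); // prints "dın"
        }
    }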

How does each front-end application address these issues, e.g. in terms of
case-insensitive searches, etc?
If clucene is used for searches, it is simply wrong for these cases: SWORD uses the StandardAnalyzer for all texts, and that analyzer uses the LowerCaseFilter, which is sensitive to neither the user's nor the text's locale.

As I said earlier, SWORD needs to pick an analyzer by language. StandardAnalyzer is not appropriate for many, if not most, of the modules at CrossWire.
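
As a sketch of what picking by language might look like (the factory and its method names are hypothetical, not existing SWORD or clucene API; Java is used for illustration, reusing the TurkishTextAnalyzer sketched above):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    public class AnalyzerFactory {
        public static Analyzer forLanguage(String languageCode) {
            // Languages with dotted and dotless I need the Turkish-aware chain.
            if ("tr".equals(languageCode) || "az".equals(languageCode)) {
                return new TurkishTextAnalyzer();
            }
            // Fall back to StandardAnalyzer where its LowerCaseFilter is acceptable.
            return new StandardAnalyzer(Version.LUCENE_30);
        }
    }

The language code here would come from the module's configuration, not from the user's locale.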

It should not be too hard for someone (i.e. someone else, not me) to backport the Java Lucene Turkish analyzer to clucene, whether contributed to clucene or put into the SWORD library. I say backport because it is part of Java Lucene 3.0, which is significantly different from Java Lucene 2.9, while clucene is in the 2.x series.

In Him,
    DM

cf.  We already have two Turkish Bible modules, and work is about to start
on a Bible module for Northern Azeri.

Working on the Go Bible for the Azerbaijani translation is how I became
alerted to this issue.

David