On 02/20/2010 04:58 AM, David Haslam wrote:
Please first read this article.
http://en.wikipedia.org/wiki/Dotted_and_dotless_I
Languages such as Turkish and Northern Azeri have both dotted and dotless
letter I in their Latin-based alphabets.
This has implications for letter case.
Such alphabets break the familiar case relationship between uppercase I and
lowercase i.
Instead they have the following upper- and lowercase pairs:
I and ı
İ and i
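To make the mismatch concrete, here is a small Python sketch. Python's str.lower() is locale-independent, like the Lucene filter under discussion, so it reproduces the broken behavior:

```python
# Turkish/Azeri case pairs:
#   dotless: I (U+0049) <-> ı (U+0131)
#   dotted:  İ (U+0130) <-> i (U+0069)

dotless_lower = "\u0131"  # ı LATIN SMALL LETTER DOTLESS I
dotted_upper = "\u0130"   # İ LATIN CAPITAL LETTER I WITH DOT ABOVE

# A locale-blind lowercase maps 'I' to 'i' -- wrong for Turkish,
# where the expected result is the dotless 'ı'.
print("I".lower())                   # 'i', not 'ı'
print("I".lower() == dotless_lower)  # False

# Uppercasing the dotless 'ı' does round-trip to 'I' ...
print(dotless_lower.upper())         # 'I'

# ... but lowercasing 'İ' with the default locale-blind (full) mapping
# yields 'i' plus U+0307 COMBINING DOT ABOVE, i.e. two code points:
print([hex(ord(c)) for c in dotted_upper.lower()])  # ['0x69', '0x307']
```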
This and related issues have been discussed in recent issues filed
against Java Lucene. See the following search for discussions of Java
Lucene and Turkish (the Gossamer Threads search link is durable, so it
will also return any future conversations):
http://www.gossamer-threads.com/lists/engine?list=lucene&do=search_results&search_forum=forum_3&search_string=turkish&search_type=AND
The Jira issue for Java Lucene that corrected this is:
http://issues.apache.org/jira/browse/LUCENE-2102
Another interesting Jira issue for Java Lucene that discusses using ICU
for normalization:
http://issues.apache.org/jira/browse/LUCENE-1488
The upshot is that Java Lucene previously handled Turkish incorrectly.
The basic problem was that the lowercase filters were not
locale-sensitive: 'I' was always lowercased to 'i'.
This was not the only issue. Here are some more:
- Whether the filter succeeds depends on whether the character is
composed or decomposed. If it was decomposed, the combining mark was
handled separately.
- The locale used for lowercasing must be chosen from the language of
the text, not from the user's location.
- The placement of the filter within the analyzer is critical. (See
LUCENE-1488 above for a discussion.)
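The composed/decomposed point can be shown in a few lines as well, again in Python using the stdlib unicodedata module. This is only an illustration of the Unicode behavior, not the clucene code path:

```python
import unicodedata

composed = "\u0130"  # İ as a single code point
# NFD splits İ into plain 'I' (U+0049) + COMBINING DOT ABOVE (U+0307).
decomposed = unicodedata.normalize("NFD", composed)

print(composed != decomposed)             # True: same letter, two encodings
print([hex(ord(c)) for c in decomposed])  # ['0x49', '0x307']

# A filter that matches or maps single code points sees 'I' and U+0307 as
# two independent characters in the decomposed form; whether a special
# rule for 'İ' fires at all depends on which form arrived. Normalizing
# first (e.g. to NFC) makes the behavior consistent:
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```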
Questions:
Does sword_icu properly address this in terms of case folding?
I don't think that SWORD's ICU code handles case folding; as far as I
know it is used for transliteration.
How does each front-end application address these issues, e.g.
case-insensitive searches?
If clucene is used for searches, it is simply wrong for these cases.
SWORD uses the StandardAnalyzer for all texts. This analyzer uses the
LowerCaseFilter, which is not sensitive to the user's or the text's locale.
As I said earlier, SWORD needs to have an analyzer picked by language.
StandardAnalyzer is not appropriate for many if not most of the modules
at CrossWire.
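For comparison, a locale-aware lowercase step is not much code. The turkish_lower function below is a hypothetical sketch in Python rather than C++, and is not part of SWORD, clucene, or Lucene:

```python
def turkish_lower(text: str) -> str:
    """Lowercase with the Turkish/Azeri i-mappings applied first.

    Hypothetical sketch: remap the two special uppercase letters, then
    fall back to the generic lowercase for everything else.
    """
    # I (U+0049) -> ı (U+0131), İ (U+0130) -> i (U+0069)
    return text.translate({0x49: 0x131, 0x130: 0x69}).lower()

print(turkish_lower("DİYARBAKIR"))  # 'diyarbakır' -- correct pairs
print("DİYARBAKIR".lower())        # locale-blind result, wrong for Turkish
```

A real analyzer would apply this as a token filter selected by the module's language, which is exactly the per-language analyzer choice argued for above.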
It should not be too hard for someone (i.e. someone else, not me) to
backport the Java Lucene Turkish analyzer to clucene, whether
contributed to clucene or put into the SWORD library. I say backport
because it is part of Java Lucene 3.0, which is significantly different
from Java Lucene 2.9, and clucene is in the 2.x series.
In Him,
DM
cf. We already have two Turkish Bible modules, and work is about to start
on a Bible module for Northern Azeri.
Working on the Go Bible for the Azerbaijani translation is how I became
aware of this issue.
David
_______________________________________________
sword-devel mailing list: [email protected]
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page