[jira] Created: (LUCENE-2653) ThaiAnalyzer assumes things about your jre

Robert Muir (JIRA) Sat, 18 Sep 2010 09:19:13 -0700

ThaiAnalyzer assumes things about your jre
------------------------------------------


                 Key: LUCENE-2653
                 URL: https://issues.apache.org/jira/browse/LUCENE-2653
             Project: Lucene - Java
          Issue Type: Bug
          Components: contrib/analyzers
    Affects Versions: 3.1, 4.0
            Reporter: Robert Muir


The ThaiAnalyzer/ThaiWordFilter depends on the fact that 
BreakIterator.getWordInstance(new Locale("th")) returns a dictionary-based 
break iterator that can segment Thai phrases into words (Thai does not 
separate words with whitespace).

But it is non-standard for the JRE to specialize this locale in this way. 
It's nice, but you can't depend on it.
For example, if you are running on the IBM JRE, this analyzer/wordfilter is 
completely "broken" in the sense that it won't do what it claims to do.
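
To see what a given JRE actually does, a small probe along these lines can 
help. This is only a sketch: the class name ThaiBreakProbe and the sample 
phrase are my own choices, not anything in Lucene, and it uses 
Locale.forLanguageTag("th") as an equivalent of new Locale("th").

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Lucene.
public class ThaiBreakProbe {

    /** Splits text at the boundaries reported by the JRE's "th" word BreakIterator. */
    static List<String> segment(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        it.setText(text);
        List<String> pieces = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            pieces.add(text.substring(start, end));
        }
        return pieces;
    }

    public static void main(String[] args) {
        // Thai for "Thai language": two words written without a space
        // (4 chars + 3 chars), spelled with escapes to avoid encoding issues.
        String phrase = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";
        List<String> pieces = segment(phrase);
        // A dictionary-based iterator is expected to report two pieces here;
        // an iterator without Thai support will typically return the phrase whole.
        System.out.println(pieces.size() + " piece(s): " + pieces);
    }
}
```

The output depends on the JRE, which is exactly the problem described above.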

At a minimum, we need to document this and suggest that users look at 
ICUTokenizer for Thai, which always has this break iterator and is not 
JRE-dependent.

Better would be to check statically that the thing actually works.
When creating a new ThaiWordFilter we could clone() the BreakIterator, which is 
often cheaper than making a new one anyway.
We could throw an exception if it's not supported, and add a boolean so the 
user knows whether it works.
And we could refer to this boolean with JUnit's Assume (e.g. 
Assume.assumeTrue) in its tests.
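
A rough sketch of what the static check could look like follows. The names 
ThaiSegmentationSupport, DBBI_AVAILABLE and newWordIterator are placeholders 
for illustration, and the probe phrase and boundary offset are assumptions 
about what a dictionary-based iterator would report; this is not a finished 
patch.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical holder class for illustration; names are placeholders.
public final class ThaiSegmentationSupport {

    // Shared prototype, created once and cloned for each consumer.
    private static final BreakIterator PROTO =
        BreakIterator.getWordInstance(Locale.forLanguageTag("th"));

    /** True if this JRE's "th" word iterator found the word boundary inside the probe phrase. */
    public static final boolean DBBI_AVAILABLE;

    static {
        // Probe phrase: two Thai words (4 chars + 3 chars) with no space between them.
        // Assumption: a dictionary-based iterator reports a boundary at offset 4.
        PROTO.setText("\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22");
        DBBI_AVAILABLE = PROTO.isBoundary(4);
    }

    private ThaiSegmentationSupport() {}

    /** Returns a fresh iterator cloned from the prototype, or throws if unsupported. */
    public static BreakIterator newWordIterator() {
        if (!DBBI_AVAILABLE) {
            throw new UnsupportedOperationException(
                "This JRE does not provide Thai word segmentation");
        }
        return (BreakIterator) PROTO.clone();
    }
}
```

A JUnit 4 test could then start with 
Assume.assumeTrue(ThaiSegmentationSupport.DBBI_AVAILABLE) so that it is 
skipped, rather than failed, on JREs without Thai support.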


