[ 
https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804288#action_12804288
 ] 

Ken Krugler edited comment on TIKA-369 at 1/24/10 7:39 PM:
-----------------------------------------------------------

Karl Wettin had contributed a language detector to Lucene, though it was never 
rolled in. See [LUCENE-826]. This might be an interesting alternative.

Jean-François Halleux also contributed a "language guesser" to Lucene a while 
back. See [LUCENE-180]. This was marked as a duplicate of [LUCENE-826].

      was (Author: kkrugler):
    Karl Wettin had contributed a language detector to Lucene, though it was 
never rolled in. See [https://issues.apache.org/jira/browse/LUCENE-826]. This 
might be an interesting alternative.

Jean-François Halleux also contributed a "language guesser" to Lucene a while 
back. See [https://issues.apache.org/jira/browse/LUCENE-180]. This was marked as 
a duplicate of [LUCENE-826].
  
> Improve accuracy of language detection
> --------------------------------------
>
>                 Key: TIKA-369
>                 URL: https://issues.apache.org/jira/browse/TIKA-369
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.6
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>         Attachments: dunning94-trimmed.pdf
>
>
> Currently the LanguageProfile code uses 3-grams to find the best language 
> profile using Pearson's chi-square test. This has three issues:
> 1. The results aren't very good for short runs of text. Ted Dunning's paper 
> (attached) indicates that a log-likelihood ratio (LLR) test works much better, 
> which would then make language detection faster due to less text needing to 
> be processed. It might be sufficient to re-enable support for 1..4-grams 
> (similar to original Nutch code) to improve quality.
> 2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact 
> value as a threshold for certainty. This is very sensitive to the amount of 
> text being processed, and thus gives false negative results for short runs of 
> text.
> 3. Certainty should also be based on how much better the result is for 
> language X, compared to the next best language. If two languages both had 
> identical sum-of-squares values, and this value was below the threshold, then 
> the result is still not very certain.
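
Points 1 and 3 above can be sketched roughly as follows. This is only an 
illustration, not the Tika code: the class name, both method signatures, and 
the threshold values in main() are hypothetical, and the scoring step that 
would turn n-gram counts into a per-language distance (chi-square or LLR) is 
left out. It assumes a lower distance means a better match.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from Tika.
class LanguageCertaintySketch {

    // Counts every character n-gram of length minN..maxN in the text,
    // e.g. minN=1, maxN=4 for the 1..4-gram range mentioned in point 1.
    static Map<String, Integer> countNgrams(String text, int minN, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Certainty rule for point 3: the best distance must be under the
    // absolute threshold AND clearly better than the runner-up.
    // The margin is relative to the runner-up's distance (an assumption).
    static boolean isReasonablyCertain(double best, double secondBest,
                                       double maxDistance,
                                       double minRelativeMargin) {
        boolean closeEnough = best <= maxDistance;
        boolean clearWinner =
            (secondBest - best) > minRelativeMargin * secondBest;
        return closeEnough && clearWinner;
    }

    public static void main(String[] args) {
        // Illustrative values only; not Tika's actual thresholds.
        System.out.println(countNgrams("abc", 1, 2).size());
        System.out.println(isReasonablyCertain(0.010, 0.010, 0.022, 0.2));
        System.out.println(isReasonablyCertain(0.010, 0.020, 0.022, 0.2));
    }
}
```

With a rule like this, two languages with identical distances are never 
reported as certain, even when both fall below the absolute threshold. It 
does not address point 2, since the absolute threshold is still independent 
of the amount of text.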

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
