[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804288#action_12804288 ]
Ken Krugler edited comment on TIKA-369 at 1/24/10 7:39 PM:
-----------------------------------------------------------

Karl Wettin contributed a language detector to Lucene, though it was never rolled in. See [LUCENE-826]. This might be an interesting alternative.

Jean-François Halleux also contributed a "language guesser" to Lucene a while back. See [LUCENE-180]. This was marked as a duplicate of [LUCENE-826].

      was (Author: kkrugler):
    Karl Wettin contributed a language detector to Lucene, though it was never rolled in. See [https://issues.apache.org/jira/browse/LUCENE-826]. This might be an interesting alternative.

Jean-François Halleux also contributed a "language guesser" to Lucene a while back. See [https://issues.apache.org/jira/browse/LUCENE-180]. This was marked as a duplicate of [LUCENE-826].

> Improve accuracy of language detection
> --------------------------------------
>
>                 Key: TIKA-369
>                 URL: https://issues.apache.org/jira/browse/TIKA-369
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.6
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>         Attachments: dunning94-trimmed.pdf
>
>
> Currently the LanguageProfile code uses 3-grams to find the best language
> profile using Pearson's chi-square test. This has three issues:
> 1. The results aren't very good for short runs of text. Ted Dunning's paper
> (attached) indicates that a log-likelihood ratio (LLR) test works much better,
> which would then make language detection faster due to less text needing to
> be processed. It might be sufficient to re-enable support for 1..4-grams
> (similar to original Nutch code) to improve quality.
> 2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact
> value as a threshold for certainty. This is very sensitive to the amount of
> text being processed, and thus gives false negative results for short runs of
> text.
> 3. Certainty should also be based on how much better the result is for
> language X, compared to the next best language. If two languages both had
> identical sum-of-squares values, and this value was below the threshold, then
> the result is still not very certain.
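
Point 3 of the description could be sketched roughly as follows. This is only an illustration, not Tika code: the class CertaintySketch, its method, and both constants are hypothetical, and it assumes each candidate language has a sum-of-squares distance where lower means a closer match. The result counts as certain only if the best distance is under the absolute threshold and is also clearly better than the runner-up.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of Tika's API. */
final class CertaintySketch {

    // Assumed tuning values, chosen only for this example.
    static final double MAX_DISTANCE = 0.022; // absolute threshold on the best distance
    static final double MIN_MARGIN = 0.25;    // required relative gap to the runner-up

    /**
     * @param distances language code -> sum-of-squares distance (lower is better)
     * @return true only if the best language is under the threshold AND
     *         sufficiently better than the second-best language
     */
    static boolean isReasonablyCertain(Map<String, Double> distances) {
        if (distances.size() < 2) {
            return false; // nothing to compare the best result against
        }

        // Find the smallest and second-smallest distances in one pass.
        double best = Double.POSITIVE_INFINITY;
        double second = Double.POSITIVE_INFINITY;
        for (double d : distances.values()) {
            if (d < best) {
                second = best;
                best = d;
            } else if (d < second) {
                second = d;
            }
        }

        if (best > MAX_DISTANCE) {
            return false; // even the best profile is a poor match
        }
        if (second == 0.0) {
            return false; // both distances are zero: a perfect tie
        }

        // Relative improvement of the best language over the runner-up.
        double margin = (second - best) / second;
        return margin >= MIN_MARGIN;
    }
}
```

With these assumed constants, two languages with identical distances below the threshold are rejected, which is the case point 3 describes. The sketch does not address point 2: the absolute threshold would still need to account for the amount of text processed.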