[ 
https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Krugler updated TIKA-369:
-----------------------------

    Description: 
Currently the LanguageProfile code uses 3-grams to find the best language 
profile using Pearson's chi-square test. This has three issues:

1. The results aren't very good for short runs of text. Ted Dunning's paper 
(attached) indicates that a log-likelihood ratio (LLR) test works much better, 
which would then make language detection faster due to less text needing to be 
processed.
2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact 
value as a threshold for certainty. This is very sensitive to the amount of 
text being processed, and thus gives false negative results for short runs of 
text.
3. Certainty should also be based on how much better the result is for language 
X, compared to the next best language. If two languages both had identical 
sum-of-squares values, and this value was below the threshold, then the result 
is still not very certain.
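
To make points 2 and 3 concrete, here is a minimal sketch of a certainty check that looks at both the best distance and its margin over the runner-up. Everything in it is an assumption for illustration: the class name CertaintySketch, the two constants, and the idea that each distance has already been normalized for text length. It is not the existing LanguageIdentifier code, nor a proposed patch.

```java
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the Tika API. */
final class CertaintySketch {

    // Assumed tuning values, picked only to make the example concrete.
    static final double MAX_DISTANCE = 0.05; // best match must be at least this close
    static final double MIN_MARGIN = 0.25;   // required relative gap to the runner-up

    /**
     * @param distances language code -> distance between the text profile and
     *                  that language's profile (lower is better); assumed to be
     *                  normalized so it does not depend on text length
     */
    static boolean isReasonablyCertain(Map<String, Double> distances) {
        double best = Double.POSITIVE_INFINITY;
        double second = Double.POSITIVE_INFINITY;

        // Track the two smallest distances in a single pass.
        for (double d : distances.values()) {
            if (d < best) {
                second = best;
                best = d;
            } else if (d < second) {
                second = d;
            }
        }

        // Point 2: the best match has to be close in absolute terms.
        if (best > MAX_DISTANCE) {
            return false;
        }
        // With a single candidate there is no runner-up to compare against.
        if (second == Double.POSITIVE_INFINITY) {
            return true;
        }
        // Point 3: the best match must also clearly beat the next best one.
        return second > 0 && (second - best) / second >= MIN_MARGIN;
    }
}
```

With these assumed constants, two languages tied at the same distance are reported as not certain even when that distance is below the threshold, which is the case point 3 describes.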



  was:
Currently the LanguageProfile code uses 3-grams to find the best language 
profile using Pearson's chi-square test. This has three issues:

1. The results aren't very good for short runs of text. Ted Dunning's paper 
(attached) indicates that a log-likelihood ratio (LLR) test works much better, 
which would then make language detection faster due to less text needing to be 
processed. It might be sufficient to re-enable support for 1..4-grams (similar 
to original Nutch code) to improve quality.
2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact 
value as a threshold for certainty. This is very sensitive to the amount of 
text being processed, and thus gives false negative results for short runs of 
text.
3. Certainty should also be based on how much better the result is for language 
X, compared to the next best language. If two languages both had identical 
sum-of-squares values, and this value was below the threshold, then the result 
is still not very certain.




> Improve accuracy of language detection
> --------------------------------------
>
>                 Key: TIKA-369
>                 URL: https://issues.apache.org/jira/browse/TIKA-369
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.6
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>         Attachments: lingdet-mccs.pdf
>
>
> Currently the LanguageProfile code uses 3-grams to find the best language 
> profile using Pearson's chi-square test. This has three issues:
> 1. The results aren't very good for short runs of text. Ted Dunning's paper 
> (attached) indicates that a log-likelihood ratio (LLR) test works much 
> better, which would then make language detection faster due to less text 
> needing to be processed.
> 2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact 
> value as a threshold for certainty. This is very sensitive to the amount of 
> text being processed, and thus gives false negative results for short runs of 
> text.
> 3. Certainty should also be based on how much better the result is for 
> language X, compared to the next best language. If two languages both had 
> identical sum-of-squares values, and this value was below the threshold, then 
> the result is still not very certain.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
