All, Joern Kottman is working with us on [1], but I thought I'd do the proper community thing and raise this here as well. On Apache Tika, we're considering switching over to OpenNLP for language detection in tika-eval. We know (thanks to Joern) that dumping a 100k chunk of text into OpenNLP's language detector is a bad idea; however, we found that when we do, the detector's performance degrades dramatically, and it reports "che" for many languages. Is this expected, or is it a bug? Many, many thanks for all of your work, and for making your slice of the Leipzig corpus so readily accessible!
Cheers, Tim [1] https://issues.apache.org/jira/browse/TIKA-2790 esp: https://issues.apache.org/jira/browse/TIKA-2790?focusedCommentId=16839443&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16839443 and https://issues.apache.org/jira/browse/TIKA-2790?focusedCommentId=16839413&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16839413