All, Joern Kottman is working with us on [1], but I thought I'd do the proper community thing and raise this here as well. On Apache Tika, we're considering switching over to OpenNLP for language detection in tika-eval. We know (thanks to Joern) that dumping a 100k chunk of text into OpenNLP's language detector is a bad idea; however, we found that when we do, the detector's performance degrades dramatically, and it reports "che" for many languages. Is this expected, or is it a bug? Many, many thanks for all of your work, and for making your slice of the Leipzig corpus so readily accessible!
Cheers, Tim [1] https://issues.apache.org/jira/browse/TIKA-2790 esp: https://issues.apache.org/jira/browse/TIKA-2790?focusedCommentId=16839443&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16839443 and https://issues.apache.org/jira/browse/TIKA-2790?focusedCommentId=16839413&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16839413