In sections 7.2 (p. 115) ... of "Tika in Action", the authors talk in very general terms about language detection and mention that Tika currently uses n-grams but may change the underlying algorithm in the future.
Could you, the committers, or the authors share a little more about Tika's language detection internals and/or your likely future decisions and plans? Thanks, lbrtchx
