[GitHub] any23 issue #131: ANY23-418 improve TikaEncodingDetector

2018-11-11 Thread HansBrende
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/131 @lewismc I've simplified the code a lot so it should be a whole lot easier to see what's going on now. Also, I improved the UTF-8 detector by reverse engineering jchardet's methodology

[GitHub] any23 issue #131: ANY23-418 improve TikaEncodingDetector

2018-11-08 Thread lewismc
Github user lewismc commented on the issue: https://github.com/apache/any23/pull/131 You've brought up an excellent topic for conversation. Tika currently has a batch regression job which essentially enables them to run over loads of documents and analyze the output. The result

[GitHub] any23 issue #131: ANY23-418 improve TikaEncodingDetector

2018-11-08 Thread HansBrende
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/131 @lewismc I've added some additional unit tests which test against the main issues we've been having with encoding detection. Unfortunately, the only real way to comprehensively test this

[GitHub] any23 issue #131: ANY23-418 improve TikaEncodingDetector

2018-11-07 Thread lewismc
Github user lewismc commented on the issue: https://github.com/apache/any23/pull/131 There is a fair bit of code here, but I am not really sure how to test it. Unfortunately, I am going to have to say: please provide unit tests. I've been aware of some encoding detection issues previously with

[GitHub] any23 issue #131: ANY23-418 improve TikaEncodingDetector

2018-11-06 Thread HansBrende
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/131 @lewismc any thoughts about this?