GitHub user HansBrende commented on the issue:
https://github.com/apache/any23/pull/131
@lewismc I've simplified the code a lot so it should be a whole lot easier
to see what's going on now.
Also, I improved the UTF-8 detector by reverse engineering jchardet's
methodology
GitHub user lewismc commented on the issue:
https://github.com/apache/any23/pull/131
You've brought up an excellent topic for conversation. Tika currently has a
batch regression job that lets them run over loads of documents and
analyze the output. The result
GitHub user HansBrende commented on the issue:
https://github.com/apache/any23/pull/131
@lewismc I've added some additional unit tests that cover the main
issues we've been having with encoding detection.
Unfortunately, the only real way to comprehensively test this
GitHub user lewismc commented on the issue:
https://github.com/apache/any23/pull/131
There is a fair bit of code here, but I am not really sure how to test it.
Unfortunately, I am going to have to ask you to provide unit tests. I've been
aware of some encoding detection issues previously with
GitHub user HansBrende commented on the issue:
https://github.com/apache/any23/pull/131
@lewismc any thoughts about this?
---