CharsetDetector is our copy/paste of ICU4J's encoding detector. We wrap it as an EncodingDetector in org.apache.tika.parser.txt.Icu4jEncodingDetector.
The AutoDetectReader loads 3 EncodingDetectors specified in the tika/parsers/resources/META-INF/services/o.a.t.detect.EncodingDetector service file:

org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector

It runs through them in order, and whichever one has a non-null value first is the value that is returned.

We did do a fresh copy/paste of ICU4j before 1.16 IIRC.

You can modify the order of the encoding detectors or even which ones are used via tika-config.xml. See e.g.:
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/org/apache/tika/config/TIKA-2273-no-icu4j-encoding-detector.xml

In short you can experiment with each of the 3 to figure out which one works best and then determine the best order in which to apply them. 😊

If you have time and the interest, I’d run each of the 3 (or just 2 if you know you don’t have html) and then use tika-eval [1] to see which gives you higher “common words” scores (where “common words” is the count of words in your extracts that are in the top 20000 common words extracted from Wikipedia for the detected language). You have the rare opportunity to be the 2nd person in the world to use tika-eval.

Oh, and once you’ve done that, you can chip in on TIKA-2038.

Cheers,

Tim

[1] https://wiki.apache.org/tika/TikaEval

From: Brian Young [mailto:[email protected]]
Sent: Friday, September 22, 2017 10:28 AM
To: [email protected]
Subject: CharsetDetector vs EncodingDetector

Hello,

We had code that was using CharsetDetector and after upgrading to 1.16 it is now returning different answers than it did in older versions. After digging in a bit I noticed that AutoDetectReader uses EncodingDetector, which seems to mirror my primary use case so I am switching to that. So I can surmise that I was likely using the wrong class/wrong approach before.

What we are doing now is creating a throwaway AutoDetectReader and grabbing the detected charset from it.

However, that leaves me wondering: how/where is CharsetDetector used? I've been studying CharsetDetector and EncodingDetector and trying to find some information on when I would use one vs. the other, and it isn't clear to me yet.

Thank you,
Brian
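
For readers of this thread, the "run through the detectors in order, first non-null answer wins" behaviour described in the reply can be sketched roughly as below. This is only an illustration using the JDK: SketchDetector, DetectorChainSketch and the three toy detectors are hypothetical names made up for this example, and none of it is Tika's actual EncodingDetector interface or AutoDetectReader code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an encoding detector (NOT Tika's interface).
interface SketchDetector {
    // Returns a charset, or null when this detector has no opinion.
    Charset detect(byte[] data);
}

final class DetectorChainSketch {

    // Toy detector: only recognises a UTF-8 byte order mark.
    static final SketchDetector BOM = data ->
            data.length >= 3
                    && (data[0] & 0xFF) == 0xEF
                    && (data[1] & 0xFF) == 0xBB
                    && (data[2] & 0xFF) == 0xBF
                    ? StandardCharsets.UTF_8 : null;

    // Toy detector: answers UTF-8 only if the bytes decode strictly as UTF-8.
    static final SketchDetector STRICT_UTF8 = data -> {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return null; // not valid UTF-8, so no opinion
        }
    };

    // Toy detector: always answers, so it only makes sense as a last resort.
    static final SketchDetector FALLBACK = data -> StandardCharsets.ISO_8859_1;

    // Walk the detectors in order and return the first non-null answer.
    static Charset detect(List<SketchDetector> detectors, byte[] data) {
        for (SketchDetector detector : detectors) {
            Charset charset = detector.detect(data);
            if (charset != null) {
                return charset;
            }
        }
        return null; // nobody had an opinion
    }

    public static void main(String[] args) {
        List<SketchDetector> chain = Arrays.asList(BOM, STRICT_UTF8, FALLBACK);
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect(chain, utf8));   // prints UTF-8
        System.out.println(detect(chain, latin1)); // prints ISO-8859-1
    }
}
```

Because the first non-null answer is returned, a detector that always answers (FALLBACK here) hides everything placed after it. That is why the order of the detectors is worth experimenting with.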

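
The "common words" comparison suggested in the reply can be illustrated in the same spirit. The sketch below is not tika-eval: CommonWordsSketch, its tokenization (lowercase, split on non-letters) and the four-word list are assumptions made for this example, whereas tika-eval uses per-language lists of the top 20000 words. It only shows why a wrong charset guess tends to lower the count.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class CommonWordsSketch {

    // Counts the tokens of `text` that appear in `commonWords`.
    // Tokenization is deliberately simple: lowercase, split on non-letters.
    static int countCommonWords(String text, Set<String> commonWords) {
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && commonWords.contains(token)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical, tiny stand-in for a per-language common-words list.
        Set<String> common = new HashSet<>(
                Arrays.asList("le", "caf\u00e9", "est", "ferm\u00e9"));

        // The same bytes decoded with two candidate charsets.
        byte[] bytes = "le caf\u00e9 est ferm\u00e9".getBytes(StandardCharsets.UTF_8);
        String asUtf8 = new String(bytes, StandardCharsets.UTF_8);
        String asLatin1 = new String(bytes, StandardCharsets.ISO_8859_1);

        // The correct decoding keeps the accented words intact and scores higher.
        System.out.println(countCommonWords(asUtf8, common));   // prints 4
        System.out.println(countCommonWords(asLatin1, common)); // prints 2
    }
}
```

Running each candidate detector over the same files and comparing such scores is the idea behind the experiment; the real tool also picks the word list by detected language.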