RE: CharsetDetector vs EncodingDetector

Allison, Timothy B. Fri, 22 Sep 2017 09:12:06 -0700

CharsetDetector is our copy/paste of ICU4j’s encoding detector.  We wrap it as 
an EncodingDetector in our org.apache.tika.parser.txt.Icu4jEncodingDetector.


The AutoDetectReader loads 3 EncodingDetectors specified in the 
tika/parsers/resources/META-INF/services/o.a.t.detect.EncodingDetector service 
file:

org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector

It runs through them in order and returns the first non-null result; the 
remaining detectors are not consulted.
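
A rough sketch of that first-non-null logic, in plain Java. The Detector 
interface and the two toy detectors here are hypothetical stand-ins, not 
Tika's classes (the real EncodingDetector.detect takes an InputStream and a 
Metadata object), so treat this as an illustration of the ordering behaviour 
only:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class EncodingChainSketch {

    // Hypothetical stand-in for an encoding detector: returns a charset,
    // or null when it cannot make a decision.
    interface Detector {
        Charset detect(byte[] bytes);
    }

    // Ask each detector in order; the first non-null answer wins and
    // later detectors are never called.
    static Charset detectFirstNonNull(List<Detector> detectors, byte[] bytes) {
        for (Detector detector : detectors) {
            Charset charset = detector.detect(bytes);
            if (charset != null) {
                return charset;
            }
        }
        return null; // no detector had an opinion
    }

    public static void main(String[] args) {
        // Toy detector: only answers when it sees a UTF-8 byte order mark.
        Detector bomOnly = b ->
                (b.length >= 3
                        && (b[0] & 0xFF) == 0xEF
                        && (b[1] & 0xFF) == 0xBB
                        && (b[2] & 0xFF) == 0xBF)
                        ? StandardCharsets.UTF_8
                        : null;
        // Toy detector: always answers, so it acts as a fallback.
        Detector fallback = b -> StandardCharsets.ISO_8859_1;

        byte[] plain = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(
                detectFirstNonNull(Arrays.asList(bomOnly, fallback), plain));
    }
}
```

The point is that order matters: a detector that always answers will shadow 
everything listed after it.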

We did do a fresh copy/paste of ICU4j before 1.16, IIRC, which would explain 
why CharsetDetector now gives you different answers.

You can modify the order of the encoding detectors or even which ones are used 
via tika-config.xml. See e.g.: 
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/org/apache/tika/config/TIKA-2273-no-icu4j-encoding-detector.xml

In short you can experiment with each of the 3 to figure out which one works 
best and then determine the best order in which to apply them. 😊

If you have the time and interest, I’d run each of the 3 (or just 2 if you know 
you don’t have HTML) and then use tika-eval [1] to see which gives you higher 
“common words” scores (where “common words” is the count of words in your 
extracts that are in the top 20000 common words extracted from Wikipedia for 
the detected language).  You have the rare opportunity to be the 2nd person in 
the world to use tika-eval.
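
To give a feel for why that score is useful, here is a toy version in plain 
Java. It is not tika-eval's code: the tokenization (lowercase, split on 
non-letters) and the three-word "common words" list are assumptions made up 
for the example. It decodes the same bytes with the right and the wrong 
charset and counts how many tokens land in the list:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class CommonWordsSketch {

    // Count tokens of the text that appear in the common-words set.
    // Tokenization here is an assumption: lowercase, split on non-letters.
    static int countCommonWords(String text, Set<String> commonWords) {
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && commonWords.contains(token)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Tiny illustrative list; a real list would hold thousands of words.
        Set<String> common = new HashSet<>(Arrays.asList("the", "and", "of"));

        byte[] bytes = "The cat and the hat".getBytes(StandardCharsets.UTF_16LE);

        // Same bytes, decoded with the correct and with a wrong charset.
        String right = new String(bytes, StandardCharsets.UTF_16LE);
        String wrong = new String(bytes, StandardCharsets.ISO_8859_1);

        System.out.println("correct charset: " + countCommonWords(right, common));
        System.out.println("wrong charset:   " + countCommonWords(wrong, common));
    }
}
```

A wrong charset tends to break words apart or turn them into junk, so the 
count drops; comparing counts across detectors on the same files is the idea.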

Oh, and once you’ve done that, you can chip in on TIKA-2038.

Cheers,

         Tim

[1] https://wiki.apache.org/tika/TikaEval

From: Brian Young [mailto:[email protected]]
Sent: Friday, September 22, 2017 10:28 AM
To: [email protected]
Subject: CharsetDetector vs EncodingDetector

Hello,

We had code that was using CharsetDetector and after upgrading to 1.16 it is 
now returning different answers than it did in older versions.  After digging 
in a bit I noticed that AutoDetectReader uses EncodingDetector, which seems to 
mirror my primary use case so I am switching to that.   So I can surmise that I 
was likely using the wrong class/wrong approach before.

What we are doing now is creating a throwaway AutoDetectReader and grabbing 
the detected charset from it.

However that leaves me wondering, how/where is CharsetDetector used?  I've been 
studying CharsetDetector and EncodingDetector and trying to find some 
information on when I would use one vs. the other and it isn't clear to me yet.

Thank you,
Brian
