In tika-app.jar, look under META-INF/services; there is a file there, named 
after the interface (org.apache.tika.detect.EncodingDetector), that lists the 
encoding detectors in the order they are applied.  The AutoDetectReader tries 
them in that order and stops as soon as one of the detectors reports an 
encoding.

If you flip the order so that icu4j is first (as below), you should be set.

org.apache.tika.parser.txt.Icu4jEncodingDetector
org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector

You could also create your own dummy EncodingDetector (always returns "UTF-8") 
and register it in the service file.
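
To make the "first detector that answers wins" behavior and the dummy detector concrete, here is a rough sketch. It does not use Tika itself: SimpleEncodingDetector, FixedUtf8Detector and DetectorChain are made-up names, and the interface is a simplified stand-in for org.apache.tika.detect.EncodingDetector, whose detect method also takes a Metadata argument. Treat it as an illustration of the idea rather than as what AutoDetectReader actually does internally.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for org.apache.tika.detect.EncodingDetector.
// The real interface's detect method also receives a Metadata object.
interface SimpleEncodingDetector {
    // Returns the detected charset, or null if this detector cannot tell.
    Charset detect(InputStream input) throws IOException;
}

// Dummy detector: ignores the input and always answers UTF-8.
final class FixedUtf8Detector implements SimpleEncodingDetector {
    @Override
    public Charset detect(InputStream input) {
        return StandardCharsets.UTF_8;
    }
}

final class DetectorChain {
    // Tries each detector in list order and returns the first non-null answer,
    // mirroring the ordering described for the service file.
    static Charset detectFirst(List<SimpleEncodingDetector> detectors, byte[] data)
            throws IOException {
        for (SimpleEncodingDetector detector : detectors) {
            Charset charset = detector.detect(new ByteArrayInputStream(data));
            if (charset != null) {
                return charset;
            }
        }
        return null; // no detector recognized the encoding
    }

    public static void main(String[] args) throws IOException {
        // Assumed stand-in for a detector that trusts a (wrong) meta tag.
        SimpleEncodingDetector metaTag = in -> StandardCharsets.ISO_8859_1;
        byte[] page = "<html>...</html>".getBytes(StandardCharsets.UTF_8);

        // With the fixed detector listed first, the later detector is never consulted.
        System.out.println(
                detectFirst(Arrays.asList(new FixedUtf8Detector(), metaTag), page));
    }
}
```

With real Tika, the equivalent class would implement org.apache.tika.detect.EncodingDetector and its fully qualified name would go on the first line of the service file.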

From: Dave French [mailto:[email protected]]
Sent: Thursday, June 20, 2013 11:33 AM
To: [email protected]
Subject: Html Parser autodetect charset

Hey,

In my use case of tika, I am rendering a webpage, taking the contents of the 
page and feeding this into tika.  The contents of the webpage are encoded in 
UTF-8 when I feed it into tika, but the HtmlParser is using the 
AutoDetectReader to try and determine the charset.  This means tika is using 
the meta-data tag of the page to determine the charset.

Is there a way to not use this AutoDetectReader and just specify the charset?  
Or better yet, inject the Detector that will be used?

Thanks for your help,
Dave

