In the tika-app.jar, go to META-INF/services; there's a file named org.apache.tika.detect.EncodingDetector that specifies the order in which the encoding detectors are applied. The AutoDetectReader applies these in order and stops as soon as one of the detectors reports an encoding.
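To illustrate why the order in that file matters, here is a minimal sketch of a first-match-wins detector chain in plain Java. It does not use Tika's classes: the `Detector` interface, the three detectors, and `firstMatch` are hypothetical stand-ins, and the convention that a detector returns null when it has no opinion is an assumption of this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class DetectorChainSketch {

    // Stand-in for an encoding detector; NOT Tika's actual interface.
    public interface Detector {
        // Returns a charset, or null when this detector has no opinion.
        Charset detect(byte[] content);
    }

    // Hypothetical detector: recognizes only a UTF-8 byte order mark.
    public static final Detector BOM_DETECTOR = content ->
            content.length >= 3
                    && (content[0] & 0xFF) == 0xEF
                    && (content[1] & 0xFF) == 0xBB
                    && (content[2] & 0xFF) == 0xBF
                    ? StandardCharsets.UTF_8
                    : null;

    // Hypothetical detector: always claims ISO-8859-1.
    public static final Detector LATIN1_DETECTOR = content -> StandardCharsets.ISO_8859_1;

    // Hypothetical "dummy" detector: always claims UTF-8.
    public static final Detector ALWAYS_UTF8 = content -> StandardCharsets.UTF_8;

    // Tries each detector in list order and returns the first non-null answer.
    public static Charset firstMatch(List<Detector> detectors, byte[] content) {
        for (Detector detector : detectors) {
            Charset charset = detector.detect(content);
            if (charset != null) {
                return charset; // later detectors are never consulted
            }
        }
        return null; // no detector had an opinion
    }

    public static void main(String[] args) {
        byte[] plain = "hello".getBytes(StandardCharsets.UTF_8);

        // No BOM present, so the first detector passes and the second one answers.
        System.out.println(firstMatch(Arrays.asList(BOM_DETECTOR, LATIN1_DETECTOR), plain));

        // With the fixed UTF-8 detector first, nothing after it gets a say.
        System.out.println(firstMatch(Arrays.asList(ALWAYS_UTF8, LATIN1_DETECTOR), plain));
    }
}
```

The same reasoning applies to the real service file: whichever detector is listed first and returns an answer decides the charset.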
If you flip the order so that icu4j is first (as below), you should be set.

org.apache.tika.parser.txt.Icu4jEncodingDetector
org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector

You could also create your own dummy EncodingDetector (always returns "UTF-8") and register it in the service file.

From: Dave French [mailto:[email protected]]
Sent: Thursday, June 20, 2013 11:33 AM
To: [email protected]
Subject: Html Parser autodetect charset

Hey,

In my use case of tika, I am rendering a webpage, taking the contents of the page and feeding this into tika. The contents of the webpage are encoded in UTF-8 when I feed it into tika, but the HtmlParser is using the AutoDetectReader to try and determine the charset. This means tika is using the meta-data tag of the page to determine the charset.

Is there a way to not use this AutoDetectReader and just specify the charset? Or better yet, inject the Detector that will be used?

Thanks for your help,
Dave
