The AutoDetectReader (within TXTParser) runs the encoding detectors in the order specified in tika-parsers...resources/META-INF/services/o.a.t.detect.EncodingDetector.
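For context on reordering: the services file is just a plain-text list of fully qualified detector class names, read top to bottom. Assuming the stock tika-parsers layout, moving ICU4J ahead of the Mozilla-derived universal detector would look like:

```
org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
```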
The AutoDetectReader picks the first non-null response from detect(). The current order is:

org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector

I've had some luck in some situations flipping the order so that Icu4j is run before Mozilla's UniversalEncodingDetector. If that doesn't work, <shudder/> you can create your own CP1256 detector that returns CP1256 all the time and then put that in the services file.

We had someone hit this issue a year or so ago with UTF-8 (where he knew absolutely, no doubt about it, that the files were UTF-8). We've talked about having an "override" detector, but we haven't implemented that yet.

-----Original Message-----
From: Ben Gould [mailto:[email protected]]
Sent: Thursday, July 30, 2015 2:34 PM
To: [email protected]
Subject: Charset Encoding

Hi all,

I'm working on dynamically parsing a large set of Farsi documents (mostly txt, pdf, doc and docx), and am having issues when I come across text files encoded in CP1256 (an old Windows Arabic format).

I'm using the Tika facade to return a Reader implementation (wrapping the input in a TikaInputStream) and then tokenizing the Reader using a Lucene Analyzer. However, whenever it hits CP1256-encoded text files, it tries to decode them as (Content-Type -> text/plain; charset=x-MacCyrillic).

In the input metadata, I do provide the following properties:

Content-Encoding: CP1256
Content-Type: text/plain; charset=CP1256
Content-Type-Hint: text/plain; charset=CP1256

Any ideas on how I can force the TXTParser to use CP1256?

Thanks,
-Ben
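For what it's worth, the "always return CP1256" detector suggested above is only a few lines. The sketch below is deliberately self-contained rather than compiled against Tika: in the real thing the class would implement org.apache.tika.detect.EncodingDetector (whose detect method also takes a Metadata argument and can throw IOException), and its fully qualified name would go on a line in the services file.

```java
import java.io.InputStream;
import java.nio.charset.Charset;

// Standalone sketch of a "force CP1256" encoding detector.
// In real Tika this would be:
//   public class Cp1256Detector implements org.apache.tika.detect.EncodingDetector {
//       public Charset detect(InputStream input, Metadata metadata) { ... }
//   }
// registered via META-INF/services/org.apache.tika.detect.EncodingDetector.
class Cp1256Detector {
    // Always report windows-1256 (CP1256), regardless of the stream contents.
    public Charset detect(InputStream input) {
        return Charset.forName("windows-1256");
    }
}
```

The obvious caveat: with this in place every text file is treated as CP1256, so it only makes sense for a corpus you already know is uniformly encoded, as in Ben's case.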
