The AutoDetectReader (within TXTParser) runs the encoding detectors in the order specified in tika-parsers...resources/META-INF/services/o.a.t.detect.EncodingDetector.
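For context on reordering: the services file is just a plain-text list of fully qualified detector class names, read top to bottom. Assuming the stock tika-parsers layout, moving ICU4J ahead of the Mozilla-derived universal detector would look like:

```
org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
```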
The AutoDetectReader picks the first non-null response from detect(). The current order is:

org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector

I've had some luck in some situations flipping the order so that Icu4j is run before Mozilla's UniversalEncodingDetector. If that doesn't work, <shudder/> you can create your own CP1256 detector that returns CP1256 all the time and then put that in the services file.

We had someone hit this issue a year or so ago with UTF-8 (where he knew absolutely, no doubt about it, that the files were UTF-8). We've talked about having an "override" detector, but we haven't implemented that yet.

-----Original Message-----
From: Ben Gould [mailto:[email protected]]
Sent: Thursday, July 30, 2015 2:34 PM
To: [email protected]
Subject: Charset Encoding

Hi all,

I'm working on dynamically parsing a large set of Farsi documents (mostly txt, pdf, doc and docx), and am having issues when I come across text files encoded in CP1256 (an old Windows Arabic format).

I'm using the Tika facade to return a Reader implementation (wrapping the input in a TikaInputStream) and then tokenizing the Reader using a Lucene Analyzer. However, whenever it hits CP1256-encoded text files, it tries to decode them as (Content-Type -> text/plain; charset=x-MacCyrillic).

In the input metadata, I do provide the following properties:

Content-Encoding: CP1256
Content-Type: text/plain; charset=CP1256
Content-Type-Hint: text/plain; charset=CP1256

Any ideas on how I can force the TXTParser to use CP1256?

Thanks,
-Ben
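For what it's worth, the "always return CP1256" detector suggested above is only a few lines. The sketch below is deliberately self-contained rather than compiled against Tika: in the real thing the class would implement org.apache.tika.detect.EncodingDetector (whose detect method also takes a Metadata argument and can throw IOException), and its fully qualified name would go on a line in the services file.

```java
import java.io.InputStream;
import java.nio.charset.Charset;

// Standalone sketch of a "force CP1256" encoding detector.
// In real Tika this would be:
//   public class Cp1256Detector implements org.apache.tika.detect.EncodingDetector {
//       public Charset detect(InputStream input, Metadata metadata) { ... }
//   }
// registered via META-INF/services/org.apache.tika.detect.EncodingDetector.
class Cp1256Detector {
    // Always report windows-1256 (CP1256), regardless of the stream contents.
    public Charset detect(InputStream input) {
        return Charset.forName("windows-1256");
    }
}
```

The obvious caveat: with this in place every text file is treated as CP1256, so it only makes sense for a corpus you already know is uniformly encoded, as in Ben's case.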
