Ahhhh, switching to use the *Icu4jEncodingDetector* over the *Universal* one worked perfectly. You're awesome!

-Ben

On 07/30/2015 02:43 PM, Allison, Timothy B. wrote:
The AutoDetectReader (within TXTParser) runs the encoding detectors in the
order specified in
tika-parsers...resources/META-INF/services/o.a.t.detect.EncodingDetector.

The AutoDetectReader uses the first non-null response returned by a detector.
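
To illustrate why the order matters, here is a minimal sketch of "first non-null answer wins". It is not Tika's code: FirstNonNullDemo and the three lambda "detectors" are hypothetical stand-ins, and the charsets they return are arbitrary.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a detector chain; these are not Tika's classes.
class FirstNonNullDemo {

    // Tries each detector in list order and returns the first non-null guess.
    static Charset firstNonNull(List<Function<byte[], Charset>> detectors, byte[] data) {
        for (Function<byte[], Charset> detector : detectors) {
            Charset guess = detector.apply(data);
            if (guess != null) {
                return guess;
            }
        }
        return null; // no detector had an opinion
    }

    public static void main(String[] args) {
        // Made-up detectors: one abstains, the other two always answer.
        Function<byte[], Charset> abstains = data -> null;
        Function<byte[], Charset> saysLatin1 = data -> StandardCharsets.ISO_8859_1;
        Function<byte[], Charset> saysUtf8 = data -> StandardCharsets.UTF_8;

        byte[] data = new byte[0];
        // The same detectors in a different order give a different result.
        System.out.println(firstNonNull(Arrays.asList(abstains, saysLatin1, saysUtf8), data));
        System.out.println(firstNonNull(Arrays.asList(abstains, saysUtf8, saysLatin1), data));
    }
}
```

A detector that answers confidently but wrongly hides every detector listed after it, which is why reordering can change the outcome.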

The current order is:
org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector

I've had some luck in some situations flipping the order so that Icu4j is run 
before Mozilla's UniversalEncodingDetector.
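
Flipped, the services file would list the same three class names as above, with the last two swapped:

```
org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
```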

If that doesn't work, <shudder/> you can create your own CP1256 detector that 
returns cp1256 all the time and then put that in the services file.
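
A rough sketch of the "always answer CP1256" idea, kept free of Tika dependencies so it runs on its own. FixedCharsetGuess is a hypothetical name, and its detect method is simplified: a real detector would implement org.apache.tika.detect.EncodingDetector, whose detect method also receives a Metadata argument. The sketch assumes the JDK in use ships the windows-1256 charset.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical stand-in; a real detector would implement Tika's
// org.apache.tika.detect.EncodingDetector instead.
class FixedCharsetGuess {
    private final Charset charset;

    // Assumes the named charset is available in the running JDK.
    FixedCharsetGuess(String charsetName) {
        this.charset = Charset.forName(charsetName);
    }

    // Ignores the input entirely and always returns the configured charset.
    Charset detect(InputStream input) {
        return charset;
    }

    public static void main(String[] args) {
        // "salam" in Arabic script, written with escapes so the source stays ASCII.
        String original = "\u0633\u0644\u0627\u0645";
        byte[] raw = original.getBytes(Charset.forName("windows-1256"));

        FixedCharsetGuess detector = new FixedCharsetGuess("windows-1256");
        Charset guess = detector.detect(new ByteArrayInputStream(raw));

        // Decoding with the forced charset recovers the original text.
        System.out.println(new String(raw, guess).equals(original));
    }
}
```

Because the detector never inspects the bytes, it only makes sense when every text file in the collection really is CP1256.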

We had someone hit this issue a year or so ago with UTF-8 (where he knew 
absolutely that the files were, no doubt about it, UTF-8).

We've talked about having an "override" detector, but we haven't implemented 
that yet.



-----Original Message-----
From: Ben Gould [mailto:[email protected]]
Sent: Thursday, July 30, 2015 2:34 PM
To: [email protected]
Subject: Charset Encoding

Hi all,

I'm working on dynamically parsing a large set of Farsi documents
(mostly txt, pdf, doc and docx), and am having issues when I come across
text files encoded in CP1256 (an old windows-arabic format).

I'm using the Tika facade to return a Reader implementation (wrapping
the input in a TikaInputStream) and then tokenizing the Reader using a
Lucene Analyzer.  However, whenever it hits CP1256 encoded text files,
it tries to decode them as (Content-Type -> text/plain;
charset=x-MacCyrillic).  In the input metadata, I do provide the
following properties:

Content-Encoding: CP1256
Content-Type: text/plain; charset=CP1256
Content-Type-Hint: text/plain; charset=CP1256

Any ideas on how I can force the TXTParser to use CP1256?

Thanks,
-Ben

--
Ben Gould
iNovex Information Systems, Inc
7240 Parkway Drive, Suite 140
Hanover, MD 21076
(410)292-1332
http://inovexcorp.com
