Ahhhh, switching to the *Icu4jEncodingDetector* over the *Universal* one
worked perfectly. You're awesome!
-Ben
On 07/30/2015 02:43 PM, Allison, Timothy B. wrote:
The AutoDetectReader (within TXTParser) runs the encoding detectors in the
order specified in
tika-parsers...resources/META-INF/services/o.a.t.detect.EncodingDetector.
The AutoDetectReader uses the first non-null response it gets from a detector.
The current order is:
org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector
I've had some luck in some situations flipping the order so that Icu4j is run
before Mozilla's UniversalEncodingDetector.
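As a sketch, the flipped services file would list the same three class names
as above, with only the last two swapped:

```text
org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
```

How a modified copy gets picked up ahead of the one bundled in tika-parsers
depends on your build and classpath, so treat this as an illustration of the
order only.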
If that doesn't work, <shudder/> you can create your own CP1256 detector that
returns cp1256 all the time and then put that in the services file.
We had someone hit this issue a year or so ago with UTF-8 (where he knew
absolutely that the files were, no doubt about it, UTF-8).
We've talked about having an "override" detector, but we haven't implemented
that yet.
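In the meantime, when the encoding of a plain-text file is known for certain,
one way to sidestep detection for those files is to decode them yourself
rather than going through TXTParser. The following is a minimal sketch using
only the JDK. ForcedCharsetReader and its methods are hypothetical names made
up for illustration; they are not part of Tika, and this is a different
technique from the detector-based options above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper (not a Tika class): decodes a stream with a charset
// the caller already knows, so no encoding detection takes place.
final class ForcedCharsetReader {

    // "windows-1256" is the java.nio name for the charset commonly called CP1256.
    static final Charset CP1256 = Charset.forName("windows-1256");

    // Wraps the raw bytes in a Reader that always decodes with the given charset.
    static Reader open(InputStream in, Charset charset) {
        return new BufferedReader(new InputStreamReader(in, charset));
    }

    // Convenience method: reads the whole stream into a String.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = open(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

The Reader returned by open(stream, ForcedCharsetReader.CP1256) can be handed
to the Analyzer the same way as a Reader obtained from Tika. This only makes
sense for the plain-text files; pdf, doc and docx still need a real parser.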
-----Original Message-----
From: Ben Gould [mailto:[email protected]]
Sent: Thursday, July 30, 2015 2:34 PM
To: [email protected]
Subject: Charset Encoding
Hi all,
I'm working on dynamically parsing a large set of Farsi documents
(mostly txt, pdf, doc and docx), and am having issues when I come across
text files encoded in CP1256 (an old windows-arabic format).
I'm using the Tika facade to return a Reader implementation (wrapping
the input in a TikaInputStream) and then tokenizing the Reader using a
Lucene Analyzer. However, whenever it hits CP1256 encoded text files,
it tries to decode them as (Content-Type -> text/plain;
charset=x-MacCyrillic). In the input metadata, I do provide the
following properties:
Content-Encoding: CP1256
Content-Type: text/plain; charset=CP1256
Content-Type-Hint: text/plain; charset=CP1256
Any ideas on how I can force the TXTParser to use CP1256?
Thanks,
-Ben
--
Ben Gould
iNovex Information Systems, Inc
7240 Parkway Drive, Suite 140
Hanover, MD 21076
(410)292-1332
http://inovexcorp.com