Hi all,
I'm working on dynamically parsing a large set of Farsi documents
(mostly txt, pdf, doc and docx), and I'm running into problems with
text files encoded in CP1256 (windows-1256, the legacy Windows Arabic
code page).
I'm using the Tika facade to get a Reader (wrapping the input in a
TikaInputStream) and then tokenizing that Reader with a Lucene
Analyzer. However, whenever Tika hits a CP1256-encoded text file, it
tries to decode it as x-MacCyrillic (Content-Type -> text/plain;
charset=x-MacCyrillic). I do provide the following properties in the
input metadata:
    Content-Encoding: CP1256
    Content-Type: text/plain; charset=CP1256
    Content-Type-Hint: text/plain; charset=CP1256
Any ideas on how I can force the TXTParser to use CP1256?
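One possible fallback, sketched here under the assumption that the
encoding is known ahead of time, would be to skip Tika for plain-text
files and decode them directly with the JDK. The class and method names
(Cp1256ReaderSketch, open, readAll) are made up for illustration, and
this relies on the JDK shipping the windows-1256 charset, which full
JDK builds normally do. It does not answer the TXTParser question
itself.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper: decodes a stream as CP1256 without any charset detection.
public class Cp1256ReaderSketch {

    // "windows-1256" is the canonical Java name; "Cp1256" is an alias for it.
    // Throws UnsupportedCharsetException if the runtime lacks this charset.
    static final Charset CP1256 = Charset.forName("windows-1256");

    // Wrap the raw bytes in a Reader that always decodes as CP1256.
    static Reader open(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, CP1256));
    }

    // Drain a Reader into a String (only used here to check the round trip).
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A short Arabic-script word, written with Unicode escapes so the
        // source file's own encoding does not matter.
        String sample = "\u0633\u0644\u0627\u0645";
        byte[] bytes = sample.getBytes(CP1256);

        try (Reader r = open(new ByteArrayInputStream(bytes))) {
            String decoded = readAll(r);
            System.out.println("round trip ok: " + decoded.equals(sample));
        }
    }
}
```

The Reader returned by open(...) could then be handed to the Lucene
Analyzer in place of the one Tika returns, but only for files already
known to be CP1256.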
Thanks,
-Ben