Hi all,

I'm working on dynamically parsing a large set of Farsi documents (mostly .txt, .pdf, .doc, and .docx), and I'm running into issues when I come across text files encoded in CP1256 (the legacy Windows Arabic code page).

I'm using the Tika facade to get back a Reader (wrapping the input in a TikaInputStream) and then tokenizing that Reader with a Lucene Analyzer. However, whenever it hits a CP1256-encoded text file, Tika tries to decode it as (Content-Type -> text/plain; charset=x-MacCyrillic). In the input metadata I already provide the following properties (a stripped-down sketch of the setup follows below):

Content-Encoding: CP1256
Content-Type: text/plain; charset=CP1256
Content-Type-Hint: text/plain; charset=CP1256
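
Here's roughly how I'm wiring it up. This is a simplified sketch rather than my exact code, and it assumes a reasonably recent Lucene; the PersianAnalyzer and the file name are just stand-ins, but the metadata handling is the same:

import java.io.File;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.fa.PersianAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.tika.Tika;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;

public class FarsiTokenizer {
    public static void main(String[] args) throws Exception {
        // Charset hints set up front, hoping the TXTParser honors them
        Metadata metadata = new Metadata();
        metadata.set(Metadata.CONTENT_ENCODING, "CP1256");
        metadata.set(Metadata.CONTENT_TYPE, "text/plain; charset=CP1256");
        metadata.set("Content-Type-Hint", "text/plain; charset=CP1256");

        Tika tika = new Tika();
        try (TikaInputStream stream = TikaInputStream.get(new File("sample.txt"));
             Reader reader = tika.parse(stream, metadata);      // Tika facade -> Reader
             Analyzer analyzer = new PersianAnalyzer();         // stand-in Analyzer
             TokenStream tokens = analyzer.tokenStream("content", reader)) {
            CharTermAttribute term = tokens.addAttribute(CharTermAttribute.class);
            tokens.reset();
            while (tokens.incrementToken()) {
                System.out.println(term.toString());            // tokenized terms
            }
            tokens.end();
        }
    }
}

Even with those hints set before parse() is called, the charset Tika reports back in the metadata is still x-MacCyrillic.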

Any ideas on how I can force the TXTParser to use CP1256?

Thanks,
-Ben
