Hi all,

I'm working on dynamically parsing a large set of Farsi documents (mostly .txt, .pdf, .doc, and .docx), and I'm running into issues when I come across text files encoded in CP1256 (the legacy Windows Arabic code page).

I'm using the Tika facade to get back a Reader (wrapping the input in a TikaInputStream) and then tokenizing that Reader with a Lucene Analyzer. However, whenever it hits a CP1256-encoded text file, Tika tries to decode it as (Content-Type -> text/plain; charset=x-MacCyrillic). In the input metadata I already provide the following properties (a stripped-down sketch of the setup follows below):

Content-Encoding: CP1256
Content-Type: text/plain; charset=CP1256
Content-Type-Hint: text/plain; charset=CP1256
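
Here's roughly how I'm wiring it up. This is a simplified sketch rather than my exact code, and it assumes a reasonably recent Lucene; the PersianAnalyzer and the file name are just stand-ins, but the metadata handling is the same:

import java.io.File;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.fa.PersianAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.tika.Tika;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;

public class FarsiTokenizer {
    public static void main(String[] args) throws Exception {
        // Charset hints set up front, hoping the TXTParser honors them
        Metadata metadata = new Metadata();
        metadata.set(Metadata.CONTENT_ENCODING, "CP1256");
        metadata.set(Metadata.CONTENT_TYPE, "text/plain; charset=CP1256");
        metadata.set("Content-Type-Hint", "text/plain; charset=CP1256");

        Tika tika = new Tika();
        try (TikaInputStream stream = TikaInputStream.get(new File("sample.txt"));
             Reader reader = tika.parse(stream, metadata);      // Tika facade -> Reader
             Analyzer analyzer = new PersianAnalyzer();         // stand-in Analyzer
             TokenStream tokens = analyzer.tokenStream("content", reader)) {
            CharTermAttribute term = tokens.addAttribute(CharTermAttribute.class);
            tokens.reset();
            while (tokens.incrementToken()) {
                System.out.println(term.toString());            // tokenized terms
            }
            tokens.end();
        }
    }
}

Even with those hints set before parse() is called, the charset Tika reports back in the metadata is still x-MacCyrillic.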

Any ideas on how I can force the TXTParser to use CP1256?

Thanks,
-Ben
