Re: Different behaviour in different system

Andreas Lehmkuehler Thu, 05 Aug 2010 23:25:10 -0700

Hi,

On 04.08.2010 17:23, vishwa bhat wrote:


Hello,

I am trying to use PDFBox to extract text. The input PDF contains English
and Indic (Kannada) characters. I have two systems running XP, one with
MS Office and one with OpenOffice. On the system with OpenOffice, the
extracted text is correct (both English and Indic characters show up). But
when I run the same program on the system with MS Office, the Indic
characters are not extracted. Please suggest what might be wrong.

I guess MS Office and OpenOffice are using different font types to create
the PDFs from your documents. MS Office probably uses embedded subsets
of TrueType fonts, which aren't yet supported by PDFBox [1]. Have a look at
the properties of your PDFs to check which kind of font is used.
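
If you want to check this in code rather than in a viewer's properties
dialog, the following is a minimal sketch, not part of PDFBox. It relies on
the PDF convention that the name of a subset font starts with a tag of six
uppercase letters followed by a plus sign (for example "ABCDEF+SomeFont").
The class name, the method name and the sample font names are made up for
illustration; the font names of a real document would have to be read with
whatever API your PDFBox version offers.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a PDFBox class.
class SubsetFontCheck {
    // Subset fonts carry a tag of exactly six uppercase letters and a '+'
    // in front of the real font name.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+.+");

    // Returns true if the given base font name looks like a subset font name.
    static boolean isSubsetFontName(String baseFontName) {
        return baseFontName != null && SUBSET_TAG.matcher(baseFontName).matches();
    }

    public static void main(String[] args) {
        // Sample names are assumptions, not taken from the poster's file.
        String[] samples = {"ABCDEF+Tunga", "Tunga", "ArialMT"};
        for (String name : samples) {
            System.out.println(name + " -> subset: " + isSubsetFontName(name));
        }
    }
}
```

A tagged name only tells you that the font was subsetted. Whether it is a
TrueType font still has to be read from the font's subtype in the PDF.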

I have attached the input file. Let me know if you need any more info.

Due to some restrictions, your attachment didn't make it.

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-490
