Hi,

On 04.08.2010 17:23, vishwa bhat wrote:
Hello, I am trying to use PDFBox to extract text. The input PDF contains English and Indic (Kannada) characters. I have two systems running XP, one with MS Office and one with OpenOffice. On the system with OpenOffice, the extracted text is correct (both English and Indic characters show up). But when I run the same program on the system with MS Office, the Indic characters are not extracted. Please suggest what might be wrong.
I guess MS Office and OpenOffice are using different font types to create the PDFs from your documents. Probably MS Office uses embedded subsets of TrueType fonts, which aren't yet supported by PDFBox [1]. Have a look at the properties of your PDFs to check which kind of font is used.
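If you can read the font names from the PDF (for example in the document properties of your viewer), subsetted fonts are recognizable by name: the PDF specification prefixes a subset font's name with a tag of six uppercase letters followed by a plus sign. A minimal Java sketch of that check, assuming you already have the font names as plain strings. The class name, the method name and the sample names are made up for illustration and are not part of PDFBox:

```java
import java.util.regex.Pattern;

public class SubsetFontCheck {
    // Subset font names start with six uppercase letters and '+',
    // e.g. "ABCDEF+Tunga" (hypothetical sample name).
    private static final Pattern SUBSET_TAG = Pattern.compile("[A-Z]{6}\\+.+");

    /** Returns true if the given font name carries a subset tag. */
    public static boolean isSubsetFontName(String fontName) {
        return fontName != null && SUBSET_TAG.matcher(fontName).matches();
    }

    public static void main(String[] args) {
        // Hypothetical names as they might appear in the font list of a PDF.
        String[] names = { "ABCDEF+Tunga", "Tunga", "Arial,Bold" };
        for (String name : names) {
            System.out.println(name + " subset: " + isSubsetFontName(name));
        }
    }
}
```

This only tells you whether a font is a subset. Whether it is a TrueType font still has to be read from the font type shown in the properties.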
I have attached the input file. Let me know if you need any more info.
Due to some restrictions your attachment didn't make it.

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-490

