PdfParser giving garbage character

Mohit Goyal Thu, 12 May 2016 23:29:36 -0700

Hi,

I have a PDF that contains text in Malayalam (an Indian language). When I
parse it with Apache Tika, I get the garbage character '?' in the output.

I checked the PDF with the pdffonts utility, and it seems that some ToUnicode
table is missing.
Output of pdffonts:
Config Error: No display font for 'Symbol'
Config Error: No display font for 'ZapfDingbats'
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
YTLJPR+AnjaliOldLipi                 CID TrueType      yes yes yes   1671  0
Times-Roman                          Type 1            no  no  no    1672  0
Times-Bold                           Type 1            no  no  no     127  0
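
The "uni" column in this report indicates whether a font carries a ToUnicode map. As a minimal sketch, assuming the column layout shown above (each font row ends with emb, sub, uni, object number, generation), the fonts without a ToUnicode map could be listed programmatically. The class name PdfFontsReport and the method fontsWithoutToUnicode are hypothetical helpers made up for this illustration; they are not part of Tika or pdffonts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: scans pdffonts text output for fonts whose "uni" flag is "no".
final class PdfFontsReport {
    private PdfFontsReport() {}

    /** Returns the names of fonts whose "uni" column is "no". */
    static List<String> fontsWithoutToUnicode(String report) {
        List<String> missing = new ArrayList<>();
        for (String line : report.split("\\R")) {
            String[] t = line.trim().split("\\s+");
            int n = t.length;
            // Assumption: a font row ends with "emb sub uni <object> <generation>".
            // The type column may contain spaces, so fields are read from the right.
            boolean isFontRow = n >= 7
                    && isFlag(t[n - 5]) && isFlag(t[n - 4]) && isFlag(t[n - 3])
                    && isNumber(t[n - 2]) && isNumber(t[n - 1]);
            if (!isFontRow) {
                continue; // header, separator, blank, or "Config Error" line
            }
            if (t[n - 3].equals("no")) {
                missing.add(t[0]); // first token is the font name
            }
        }
        return missing;
    }

    private static boolean isFlag(String s) {
        return s.equals("yes") || s.equals("no");
    }

    private static boolean isNumber(String s) {
        return s.matches("\\d+");
    }

    public static void main(String[] args) {
        String report = String.join("\n",
                "Config Error: No display font for 'Symbol'",
                "name                                 type              emb sub uni object ID",
                "------------------------------------ ----------------- --- --- --- ---------",
                "YTLJPR+AnjaliOldLipi                 CID TrueType      yes yes yes   1671  0",
                "Times-Roman                          Type 1            no  no  no    1672  0",
                "Times-Bold                           Type 1            no  no  no     127  0");
        System.out.println(fontsWithoutToUnicode(report)); // prints [Times-Roman, Times-Bold]
    }
}
```

For the report above, this lists Times-Roman and Times-Bold, while the embedded AnjaliOldLipi font is reported as having a ToUnicode map.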


Please find the PDF attached.

Code:

    // Files here is Guava's com.google.common.io.Files
    BufferedWriter writer = Files.newWriter(new File("file-output.txt"),
            Charset.forName("UTF-8"));
    BodyContentHandler handler = new BodyContentHandler(writer);
    ParseContext pcontext = new ParseContext();
    Metadata metadata = new Metadata();
    PDFParser pdfparser = new PDFParser();
    pdfparser.parse(inputstream, handler, metadata, pcontext);

Any suggestions?

Thanks
Mohit Goyal
