> Are you sure that you are using PDFBox. The code doesn't look like ours.
That’s Tika.

-----Original Message-----
From: Andreas Lehmkühler [mailto:[email protected]]
Sent: Friday, May 13, 2016 5:53 AM
To: Mohit Goyal <[email protected]>; [email protected]
Subject: Re: PdfParser giving garbage character

> Mohit Goyal <[email protected]> wrote on May 13, 2016 at 08:28:
>
> Hi,
>
> I have one PDF which has data in Malayalam (an Indian language). I tried
> to parse this data using Apache Tika, but I got the garbage character '?'
> in the output.
>
> I checked the PDF using the pdffonts utility; it seems like some
> ToUnicode table is missing.
>
> Output of pdffonts:
>
> Config Error: No display font for 'Symbol'
> Config Error: No display font for 'ZapfDingbats'
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> YTLJPR+AnjaliOldLipi                 CID TrueType      yes yes yes   1671  0
> Times-Roman                          Type 1            no  no  no    1672  0
> Times-Bold                           Type 1            no  no  no     127  0
>
> Please find attached pdf.

The pdf didn't make it due to some restrictions to the mailing list. You have to provide a link to a public download.

> Code:
>
> BufferedWriter writer = Files.newWriter(
>         new File("file-output.txt"), Charset.forName("UTF-8"));
> BodyContentHandler handler = new BodyContentHandler(writer);
> ParseContext pcontext = new ParseContext();
> Metadata metadata = new Metadata();
> PDFParser pdfparser = new PDFParser();
> pdfparser.parse(inputstream, handler, metadata, pcontext);
>
> Any suggestions??

Are you sure that you are using PDFBox. The code doesn't look like ours.

> Thanks
> Mohit Goyal

BR
Andreas

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

