> On 13 May 2016 at 08:28, Mohit Goyal <[email protected]> wrote:
>
> Hi,
>
> I have a PDF that contains data in Malayalam (an Indian language). When I
> tried to parse it using Apache Tika, I got the garbage character '?' in the
> output.
>
> I checked the PDF with the pdffonts utility, and it seems that a ToUnicode
> table is missing.
>
> Output of pdffonts:
> Config Error: No display font for 'Symbol'
> Config Error: No display font for 'ZapfDingbats'
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> YTLJPR+AnjaliOldLipi                 CID TrueType      yes yes yes   1671  0
> Times-Roman                          Type 1            no  no  no    1672  0
> Times-Bold                           Type 1            no  no  no     127  0
>
> Please find the attached PDF.

The PDF didn't make it through because of restrictions on the mailing list. You will have to provide a link to a public download.

> Code:
>
>     BufferedWriter writer = Files.newWriter(new File("file-output.txt"),
>             Charset.forName("UTF-8"));
>     BodyContentHandler handler = new BodyContentHandler(writer);
>     ParseContext pcontext = new ParseContext();
>     Metadata metadata = new Metadata();
>     PDFParser pdfparser = new PDFParser();
>     pdfparser.parse(inputstream, handler, metadata, pcontext);
>
> Any suggestions??

Are you sure that you are using PDFBox? The code doesn't look like ours.

> Thanks
> Mohit Goyal
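
Independent of which library produced the text, it may help to check what the output file really contains before blaming the font. The following is only a minimal sketch using the Java standard library: the class name `ExtractionCheck` and its `count` method are hypothetical, and it assumes the file was written as UTF-8, as in the quoted code. It counts code points in the Malayalam block (U+0D00 to U+0D7F), literal '?' characters, and U+FFFD replacement characters.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox or Tika.
public class ExtractionCheck {

    /** Returns counts as {malayalam, questionMarks, replacementChars}. */
    static int[] count(String text) {
        int malayalam = 0;
        int questionMarks = 0;
        int replacementChars = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp >= 0x0D00 && cp <= 0x0D7F) {
                malayalam++;            // Unicode Malayalam block
            } else if (cp == '?') {
                questionMarks++;        // literal question mark
            } else if (cp == 0xFFFD) {
                replacementChars++;     // Unicode replacement character
            }
            i += Character.charCount(cp);
        }
        return new int[] {malayalam, questionMarks, replacementChars};
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ExtractionCheck <extracted-text-file>");
            return;
        }
        // Assumption: the extracted text was saved as UTF-8.
        String text = new String(Files.readAllBytes(Paths.get(args[0])),
                StandardCharsets.UTF_8);
        int[] c = count(text);
        System.out.println("Malayalam code points: " + c[0]);
        System.out.println("Literal '?':           " + c[1]);
        System.out.println("U+FFFD replacements:   " + c[2]);
    }
}
```

If the file contains many Malayalam code points, the '?' is probably coming from the console or viewer rather than from the extraction. If it contains only '?' or U+FFFD, the characters were already lost before the text was written.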
BR
Andreas

