Hi,
I have a PDF that contains text in Malayalam (an Indian language). When I parse
it with Apache Tika, I get garbage '?' characters in the output.
I checked the PDF with the pdffonts utility, and it seems a ToUnicode table is missing.
Output of pdffonts:
Config Error: No display font for 'Symbol'
Config Error: No display font for 'ZapfDingbats'
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
YTLJPR+AnjaliOldLipi                 CID TrueType      yes yes yes   1671  0
Times-Roman                          Type 1            no  no  no    1672  0
Times-Bold                           Type 1            no  no  no     127  0
Please find the PDF attached.
Code:
// Files here is Guava's com.google.common.io.Files
BufferedWriter writer = Files.newWriter(new File("file-output.txt"), Charset.forName("UTF-8"));
BodyContentHandler handler = new BodyContentHandler(writer);
ParseContext pcontext = new ParseContext();
Metadata metadata = new Metadata();
PDFParser pdfparser = new PDFParser();
pdfparser.parse(inputstream, handler, metadata, pcontext);
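To quantify the problem, a small check along these lines could be run on the extracted text. It is only a rough sketch: the class and method names (ExtractionCheck, countMalayalam, countSuspect) are hypothetical helpers, not part of Tika, and it uses only the Java standard library. It counts how many code points fall in the Malayalam Unicode block and how many are '?' or the U+FFFD replacement character.

```java
public class ExtractionCheck {

    /** Counts code points that belong to the Malayalam Unicode block (U+0D00..U+0D7F). */
    static long countMalayalam(String text) {
        return text.codePoints()
                .filter(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.MALAYALAM)
                .count();
    }

    /** Counts '?' and U+FFFD, the characters that usually indicate a failed mapping. */
    static long countSuspect(String text) {
        return text.codePoints()
                .filter(cp -> cp == '?' || cp == 0xFFFD)
                .count();
    }
}
```

If countMalayalam returns 0 for the contents of file-output.txt while countSuspect is large, the Malayalam glyphs are probably being lost during extraction rather than when the file is written.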
Any suggestions?
Thanks
Mohit Goyal