> On 13 May 2016 at 08:28, Mohit Goyal wrote:
>
>
> Hi,
>
> I have a PDF that contains text in Malayalam (an Indian language). When I tried
> to parse it with Apache Tika, I got the garbage character '?' in the output.
>
>
> I checked the PDF with the pdffonts utility; it seems some ToUnicode table is
> missing. Output of pdffonts:
>
> Config Error: No display font for 'Symbol'
> Config Error: No display font for 'ZapfDingbats'
> name                  type          emb sub uni object ID
> --------------------- ------------- --- --- --- ---------
> YTLJPR+AnjaliOldLipi  CID TrueType  yes yes yes   1671  0
> Times-Roman           Type 1        no  no  no    1672  0
> Times-Bold            Type 1        no  no  no     127  0
>
>
> Please find the PDF attached.
The PDF didn't make it through because of mailing list restrictions. Please
provide a link to a public download instead.
>
> Code:
>
> BufferedWriter writer = Files.newWriter(
>         new File("file-output.txt"), Charset.forName("UTF-8"));
> BodyContentHandler handler = new BodyContentHandler(writer);
> ParseContext pcontext = new ParseContext();
> Metadata metadata = new Metadata();
> PDFParser pdfparser = new PDFParser();
> pdfparser.parse(inputstream, handler, metadata, pcontext);
>
> Any suggestions?
Are you sure that you are using PDFBox? The code doesn't look like ours.
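
Whichever library does the extraction, it may be worth ruling out the output encoding as the source of the '?' characters first. The following is only a minimal sketch using the JDK alone. The class name EncodingCheck and the sample string are made up for illustration and are not part of PDFBox or Tika. It relies on the fact that String.getBytes replaces characters a charset cannot represent with that charset's replacement byte, which is '?' for US-ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox or Tika.
class EncodingCheck {

    // Encode the text with the given charset and decode it again.
    // Unmappable characters come back as the charset's replacement
    // character ('?' for US-ASCII and ISO-8859-1).
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // True if the text is unchanged after the round trip.
    static boolean survives(String text, Charset charset) {
        return text.equals(roundTrip(text, charset));
    }

    public static void main(String[] args) {
        // A Malayalam sample word, written with Unicode escapes so the
        // encoding of this source file does not matter.
        String sample = "\u0D2E\u0D32\u0D2F\u0D3E\u0D33\u0D02";

        System.out.println("UTF-8 keeps the text:    "
                + survives(sample, StandardCharsets.UTF_8));
        System.out.println("US-ASCII keeps the text: "
                + survives(sample, StandardCharsets.US_ASCII));
        System.out.println("Platform default charset: "
                + Charset.defaultCharset());
    }
}
```

If the text survives a UTF-8 round trip but the output file still contains '?', the characters were most likely lost during extraction rather than while writing the file. Also make sure the file is viewed as UTF-8.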
>
> Thanks
> Mohit Goyal
BR
Andreas