[ 
https://issues.apache.org/jira/browse/TIKA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782100#action_12782100
 ] 

Ken Krugler commented on TIKA-331:
----------------------------------

I believe this is an issue for the PDF parser (PDFBox) that Tika "wraps".

Please check https://issues.apache.org/jira/browse/PDFBOX to see if this is 
already filed, and if not, refile it there.


> Windings font recognition in Tika parsing + spacing issue
> ---------------------------------------------------------
>
>                 Key: TIKA-331
>                 URL: https://issues.apache.org/jira/browse/TIKA-331
>             Project: Tika
>          Issue Type: Wish
>          Components: parser
>    Affects Versions: 0.4
>         Environment: Windows XP / Java JDK 1.6.0_15
>            Reporter: MRIT64
>         Attachments: Parsing_Result1.txt, Parsing_Result2.txt, test1.pdf, 
> test2.pdf
>
>
> I have PDF files that include some characters in the Wingdings font.
> The Tika parser replaces them with Unicode characters that have nothing to 
> do with the originals and, in some cases, with alphabetic characters 
> (which is expected, given these characters' codes).
> Would it be possible to improve the parsing and replace these characters 
> with more accurate Unicode characters?
> (See http://www.alanwood.net/demos/wingdings.html for possible 
> correspondences.)
> I will attach example files once this issue has been created. (Would it be 
> possible to attach files directly when creating issues?)
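
The remapping requested in the description could be sketched roughly as a 
post-processing step over the extracted text. This is only an illustration, 
not how Tika or PDFBox handles symbol fonts: the function name and the table 
are hypothetical, the table covers just a handful of codes (check them against 
the correspondence page linked in the description), and the sketch assumes the 
caller already knows which runs of text were set in Wingdings, which plain 
text extraction does not report.

```python
# Partial, illustrative table: Wingdings character code -> Unicode equivalent.
# Hypothetical helper data; a real table would cover the whole font.
WINGDINGS_TO_UNICODE = {
    0x4A: "\u263A",  # 'J' -> WHITE SMILING FACE
    0x4C: "\u2639",  # 'L' -> WHITE FROWNING FACE
    0x6C: "\u25CF",  # 'l' -> BLACK CIRCLE
    0x6E: "\u25A0",  # 'n' -> BLACK SQUARE
    0xFC: "\u2713",  # 0xFC -> CHECK MARK
}


def remap_wingdings(text):
    """Remap a run of text that is known to be set in Wingdings."""
    out = []
    for ch in text:
        code = ord(ch)
        # Assumption: symbol-font codes may arrive shifted into the
        # private use range U+F000..U+F0FF, so fold them back first.
        if 0xF000 <= code <= 0xF0FF:
            code -= 0xF000
        # Codes missing from the partial table are passed through unchanged.
        out.append(WINGDINGS_TO_UNICODE.get(code, ch))
    return "".join(out)
```

Passing unknown codes through unchanged keeps the step lossless, so an 
incomplete table never makes the extracted text worse than it already is.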

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
