[ 
https://issues.apache.org/jira/browse/TIKA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MRIT64 updated TIKA-331:
------------------------

    Attachment: Parsing_Result2.txt
                test2.pdf

Another example: the same Word source file converted to PDF with a different 
tool, together with the Tika parsing result. The Wingdings characters are 
translated into different Unicode characters than with the previous PDF.

> Windings font recognition in Tika parsing + spacing issue
> ---------------------------------------------------------
>
>                 Key: TIKA-331
>                 URL: https://issues.apache.org/jira/browse/TIKA-331
>             Project: Tika
>          Issue Type: Wish
>          Components: parser
>    Affects Versions: 0.4
>         Environment: Windows XP / Java JDK 1.6.0_15
>            Reporter: MRIT64
>         Attachments: Parsing_Result1.txt, Parsing_Result2.txt, test1.pdf, 
> test2.pdf
>
>
> I have PDF files that include some characters in the Wingdings font.
> The Tika parser replaces them with Unicode characters that have nothing to 
> do with the originals and, in some cases, with alphabetic characters 
> (which is expected given the character codes involved).
> Would it be possible to improve the parsing and replace these characters 
> with more accurate Unicode characters?
> (See http://www.alanwood.net/demos/wingdings.html for possible 
> correspondences.)
> I will attach example files once this issue has been created. (Would it be 
> possible to attach files directly when creating issues?)
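
The kind of remapping requested above could look roughly like the sketch 
below. It is not Tika or PDFBox code: the class name WingdingsSketch, the 
method toUnicode and the four table entries are illustrative assumptions, and 
the entries should be checked against the correspondence page linked in the 
description. The sketch also assumes the caller already knows that the text 
run was set in Wingdings, which the parser would have to determine from the 
PDF font information.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Tika): remaps text taken from a Wingdings run. */
class WingdingsSketch {
    // Illustrative, incomplete table: Wingdings character code -> Unicode code point.
    // Assumption: entries follow commonly cited Wingdings correspondences.
    private static final Map<Character, Integer> TABLE = new HashMap<>();
    static {
        TABLE.put('J', 0x263A); // smiling face
        TABLE.put('L', 0x2639); // frowning face
        TABLE.put('N', 0x2620); // skull and crossbones
        TABLE.put('T', 0x2744); // snowflake
    }

    /** Replaces each char that has a table entry; all other chars pass through unchanged. */
    static String toUnicode(String wingdingsText) {
        StringBuilder out = new StringBuilder(wingdingsText.length());
        for (int i = 0; i < wingdingsText.length(); i++) {
            char c = wingdingsText.charAt(i);
            Integer mapped = TABLE.get(c);
            if (mapped != null) {
                out.appendCodePoint(mapped);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Calling WingdingsSketch.toUnicode on a string extracted from a Wingdings run 
returns the same string with the mapped characters replaced. Applying it to 
text in any other font would corrupt ordinary letters, so the font check has 
to happen before the lookup.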

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
