[ https://issues.apache.org/jira/browse/TIKA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782097#action_12782097 ]
MRIT64 commented on TIKA-331:
-----------------------------

Spacing issue
--------------------

Look at lines 10 and 11 in test2.pdf.
Look at lines 11 and 12 in the Tika parsing result (Parsing_Result2.txt):

ðLocalisation des zones de livraison et de stockage
ðLocalisation des zones dangereuses

There is no space between ð and Localisation (ð is how Tika renders the Wingdings "Rightwards white arrow").

If you copy and paste lines 10 and 11 of test2.pdf into a Notepad window, you get:

ð Localisation des zones de livraison et de stockage
ð Localisation des zones dangereuses

...with a space between ð and Localisation.

In my case, the missing space in the Tika parsing result causes "ðLocalisation" to be treated as a single word in subsequent processing.

Regards

> Windings font recognition in Tika parsing + spacing issue
> ---------------------------------------------------------
>
>                 Key: TIKA-331
>                 URL: https://issues.apache.org/jira/browse/TIKA-331
>             Project: Tika
>          Issue Type: Wish
>          Components: parser
>    Affects Versions: 0.4
>         Environment: Windows XP / Java JDK 1.6.0_15
>            Reporter: MRIT64
>         Attachments: Parsing_Result1.txt, Parsing_Result2.txt, test1.pdf, test2.pdf
>
>
> I have PDF files that include some characters in the Wingdings font.
> The Tika parser replaces them with Unicode characters that have nothing to
> do with the originals and, in some cases, with alphabetic characters
> (which is expected, given the character codes involved).
> Would it be possible to improve the parsing and replace these characters
> with more accurate Unicode characters?
> (See http://www.alanwood.net/demos/wingdings.html for possible
> correspondences.)
> I will attach example files once this issue is created (would it be
> possible to attach files directly when creating issues?)
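
A possible stopgap for both problems reported here is to post-process the extracted text. The sketch below is not part of Tika: the class name WingdingsFixup, the method fix, and the single-entry mapping table are hypothetical. It assumes that ð (U+00F0) never occurs legitimately in the documents being processed, which is plausible for French text but not in general. It replaces ð with U+21E8 (RIGHTWARDS WHITE ARROW) and inserts the missing space when the symbol is immediately followed by a letter or digit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical post-processing helper; not part of the Tika API.
class WingdingsFixup {
    // Assumed partial mapping from Wingdings character codes, as they show up
    // in the extracted text, to Unicode. Only the arrow from this report is listed.
    private static final Map<Character, Character> WINGDINGS_TO_UNICODE = new HashMap<>();
    static {
        WINGDINGS_TO_UNICODE.put('\u00F0', '\u21E8'); // code 0xF0 -> RIGHTWARDS WHITE ARROW
    }

    static String fix(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            Character mapped = WINGDINGS_TO_UNICODE.get(c);
            if (mapped == null) {
                out.append(c); // ordinary character, copy unchanged
                continue;
            }
            out.append(mapped.charValue());
            // Restore the separator when the symbol is glued to the next word.
            if (i + 1 < text.length() && Character.isLetterOrDigit(text.charAt(i + 1))) {
                out.append(' ');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fix("\u00F0Localisation des zones dangereuses"));
    }
}
```

Extending the table with further entries from the Alan Wood correspondence page would cover more symbols, but any code that maps to an ordinary letter cannot be fixed this way, because the font information is already lost once the text has been extracted.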