[ 
https://issues.apache.org/jira/browse/TIKA-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Angela Onslow updated TIKA-2054:
--------------------------------
    Attachment: 2482_2014_DAVIDE+CAMPARI-MILANO+SPA_SUSTY-AR.pdf

Here is a file which demonstrates this problem.

> Problem with ligatures converting from PDF to HTML with Tika
> ------------------------------------------------------------
>
>                 Key: TIKA-2054
>                 URL: https://issues.apache.org/jira/browse/TIKA-2054
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.11, 1.13
>            Reporter: Angela Onslow
>         Attachments: 2482_2014_DAVIDE+CAMPARI-MILANO+SPA_SUSTY-AR.pdf
>
>
> When converting certain PDFs to HTML, I am having trouble with ligature 
> characters being displayed as U+FFFD � REPLACEMENT CHARACTER.
> I have tried Apache Tika 1.11 and 1.13, converting on the command line 
> using the .jar, and get the same results with both.
> If I use pdfbox-app-2.0.1.jar and 'ExtractText' with the icu4j-57_1.jar in 
> the path and I convert to text rather than HTML then I am able to at least 
> preserve information about what each ligature was originally, even if they 
> are still represented as unprintable characters. 
> For example, if I run the following from the command line:
> java -jar pdfbox-app-1.8.12.jar ExtractText 'test.pdf' 'test.txt'
> then the resulting test.txt, when viewed in Sublime2, has "fi" represented as 
> US (the unit separator character), "ff" represented as RS, "fl" represented 
> as GS and "ffl" represented as FS, which I could then replace with the 
> appropriate characters.
> I was under the impression that Tika uses icu4j. Is there a way to get the 
> same behaviour with Tika that I see with PDFBox when converting from PDF to HTML? 
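
The replacement step described in the report could be sketched roughly as below. This is only an illustration, not part of Tika or PDFBox: the class name LigatureFixer and the method restore are hypothetical, and the mapping (FS 0x1C to "ffl", GS 0x1D to "fl", RS 0x1E to "ff", US 0x1F to "fi") is taken from the description above and may well be specific to this one PDF's font encoding.

```java
// Hypothetical helper, not a Tika or PDFBox API.
// Assumes the control-character mapping described in the report:
//   FS (0x1C) -> "ffl", GS (0x1D) -> "fl", RS (0x1E) -> "ff", US (0x1F) -> "fi"
final class LigatureFixer {

    private LigatureFixer() {
    }

    // Returns a copy of the text with the four separator control
    // characters expanded to the letter sequences they stand for.
    static String restore(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case 0x1C:
                    out.append("ffl");
                    break;
                case 0x1D:
                    out.append("fl");
                    break;
                case 0x1E:
                    out.append("ff");
                    break;
                case 0x1F:
                    out.append("fi");
                    break;
                default:
                    // Every other character is copied through unchanged.
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```

Such a pass would be run over the text written by ExtractText. It only helps where the control characters survive extraction; once a character has been turned into U+FFFD, the information about which ligature it was is gone.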



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
