Hi,

Yushuang Hao <[email protected]> wrote on 11 July 2012 at 12:08:

> Dear Sir/Madam,
>
> I ran into two issues when using PDFBox 1.7.0 to convert a PDF to
> text:
>
> Firstly, the PDF is purely in English, but after conversion I get random CJK
> characters in it. As far as I can tell, a Latin character fits in 1 byte
> (0x0000 to 0x00FF in Unicode), and the conversion somehow packs two Latin
> characters together into a single 2-byte CJK character. For example, I got
> "卥" (0x5365) instead of "S" (0x0053) and "e" (0x0065). I don't know how this
> happened, but I managed to convert these back to the right characters.
>
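
For reference, the kind of repair described above could look roughly like the
following. This is only a minimal sketch, not anything PDFBox provides: the
class name PackedLatinRepair and the method repair are made up for
illustration, and it assumes the text is known to contain no genuine CJK
characters, since a real ideograph whose two bytes both happen to be printable
ASCII would be split as well.

```java
final class PackedLatinRepair {

    /** True for printable ASCII, i.e. space (0x20) through tilde (0x7E). */
    private static boolean isPrintableAscii(int b) {
        return b >= 0x20 && b <= 0x7E;
    }

    /**
     * Splits every CJK ideograph (U+4E00..U+9FFF) whose high and low bytes
     * are both printable ASCII back into two Latin characters.
     * Hypothetical helper; only safe for purely Latin documents.
     */
    static String repair(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int high = c >> 8;   // e.g. 0x53 ('S') for U+5365
            int low = c & 0xFF;  // e.g. 0x65 ('e') for U+5365
            boolean cjk = c >= 0x4E00 && c <= 0x9FFF;
            if (cjk && isPrintableAscii(high) && isPrintableAscii(low)) {
                out.append((char) high).append((char) low);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+5365 followed by "e you" is repaired to "See you".
        System.out.println(repair("\u5365e you"));
    }
}
```

This only treats the symptom. The cause is more likely in how the font's
encoding is interpreted, which is why the PDF itself is needed to say more.
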
> My second issue is that, in the same document, a "?" is produced wherever
> there should be a 3, 4, 6, 7, 8, 9, ), * or %; see the example below. Can you
> give me some hints on how to solve this? Many thanks.


Hmm, that's not easy to say without having the PDF at hand. If you can
share the document in question with us, please create an issue on JIRA [1] and
attach the PDF to it.


>
> In PDF:
> TERM C1 EUR 591736DB6 LX038684 07-Jun-2016 Shadow Shadow 450.0 0.00 0.404
> 4.9040 0.00 0.00 462,025.59 462,025.59
>
> Conversion:
> 07-Jun-201?TERM C1 EUR  462,025.5?Shadow Shadow  0.00 0.40? 450.0
> 0.00591736DB?  4.9040  0.00  462,025.5?LX03868?


It looks like you are not using the sort option, are you?
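
In PDFBox the sort option corresponds to PDFTextStripper.setSortByPosition(true),
or the -sort switch of the ExtractText command-line tool. Without it, text comes
out in the order it appears in the content stream, which need not match the
visual order on the page. The idea can be illustrated without PDFBox at all. The
following is a minimal sketch, not PDFBox's actual implementation: the Chunk
class and the coordinates are invented for illustration, and it assumes y grows
downwards from the top of the page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SortByPositionSketch {

    /** A piece of text with the (assumed) position of its left edge. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins chunks top to bottom, then left to right, separated by spaces. */
    static String inReadingOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        StringBuilder sb = new StringBuilder();
        for (Chunk c : sorted) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stream order differs from visual order, as in the conversion above.
        List<Chunk> streamOrder = Arrays.asList(
                new Chunk("07-Jun-2016", 300f, 100f),
                new Chunk("TERM", 10f, 100f),
                new Chunk("C1", 50f, 100f),
                new Chunk("EUR", 80f, 100f));
        System.out.println(inReadingOrder(streamOrder));
    }
}
```

Sorting only restores the order of the text. It will not fix the "?"
characters, which look like a separate encoding problem.
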


>
> Kind regards,
> Yushuang
>
> --
>
> *Yushuang Hao*
> Codean
> King's Gate
> 1 Bravingtons Walk
> London, N1 9AE, UK
> [email protected]
>
> tel. +44 (0)20 3475 3548
> mob. +44 (0)7973 816 879
>
> www.codean.com

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX
