Dear Sir/Madam,

I experienced two issues while using PDFBox 1.7.0 to convert a PDF to text:
Firstly, the PDF is purely in English, but after conversion I get random CJK characters in the text. As far as I can work out, each Latin character has a code between 0x0000 and 0x00FF in Unicode and so fits in 1 byte, and the conversion somehow packs two Latin characters together into a single 2-byte CJK character. For example, I got one CJK character (0x5365) instead of "S" (0x0053) followed by "e" (0x0065). I don't know how this happened, but I managed to convert these back to the right characters.

Secondly, in the same document a "?" is produced wherever there should be 3, 4, 6, 7, 8, 9, ), * or %; see the example below. Can you give me some hints on how to solve this? Many thanks.

In PDF:
TERM C1 EUR 591736DB6 LX038684 07-Jun-2016 Shadow Shadow 450.0 0.00 0.404 4.9040 0.00 0.00 462,025.59 462,025.59

Conversion:
07-Jun-201?TERM C1 EUR 462,025.5?Shadow Shadow 0.00 0.40? 450.0 0.00591736DB? 4.9040 0.00 462,025.5?LX03868?

Kind regards,
Yushuang

--
*Yushuang Hao*
Codean
King's Gate
1 Bravingtons Walk
London, N1 9AE, UK
[email protected]
tel. +44 (0)20 3475 3548
mob. +44 (0)7973 816 879
www.codean.com
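For reference on the first issue, here is a minimal sketch of how such merged characters could be split back into two Latin characters after extraction. It assumes that every spurious CJK character carries two 8-bit codes in its high and low bytes, as in the 0x5365 example, and that the document contains no genuine CJK text. The class and method names are hypothetical and are not part of PDFBox; this is a post-processing workaround, not a fix for the underlying cause.

```java
final class SplitMergedChars {

    // Hypothetical helper (not a PDFBox API): splits a 16-bit character from
    // the CJK Unified Ideographs block into two Latin characters when both of
    // its bytes are printable ASCII. Assumption: the text is purely English,
    // so any such CJK character is a merge artifact.
    static String splitMerged(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int high = c >> 8;   // first merged character, e.g. 0x53 'S'
            int low = c & 0xFF;  // second merged character, e.g. 0x65 'e'
            boolean cjk = c >= 0x4E00 && c <= 0x9FFF;
            if (cjk && isPrintableAscii(high) && isPrintableAscii(low)) {
                out.append((char) high).append((char) low);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    private static boolean isPrintableAscii(int b) {
        return b >= 0x20 && b <= 0x7E;
    }

    public static void main(String[] args) {
        // U+5365 is 0x53 ('S') and 0x65 ('e') packed into one character.
        System.out.println(splitMerged("\u5365cond")); // prints "Second"
    }
}
```

Because the check is limited to the range 0x4E00 to 0x9FFF, this sketch only catches pairs whose first character has a code of 0x4E ("N") or higher; pairs starting with a lower code would land in other Unicode blocks and pass through unchanged.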

