Re: Chinese document: mangled characters, ASCII block code points off by 1

Tilman Hausherr Thu, 03 Aug 2017 11:52:26 -0700

Hi,

I just tested the files... bad news: only the digits can be extracted. The reason that the Chinese characters don't extract is similar to the case here:

https://issues.apache.org/jira/browse/PDFBOX-3886
Feel free to ask further questions.

You got some output because 1.8 made a lot of assumptions when ToUnicode was missing. Sometimes these were right, and sometimes not. The 2.0 versions don't make such assumptions, so you get nothing.
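
To illustrate the difference, here is a minimal Java sketch. It is not PDFBox code: the class ToUnicodeSketch, its methods, and the sample table are hypothetical, and the fallback shown (treating the character code as a Unicode code point) is only one possible assumption, not necessarily what 1.8 did.

```java
import java.util.HashMap;
import java.util.Map;

public class ToUnicodeSketch {
    // Hypothetical stand-in for a font's ToUnicode table: character code -> Unicode text.
    public static final Map<Integer, String> SAMPLE_TABLE = new HashMap<>();
    static {
        SAMPLE_TABLE.put(1, "1");
        SAMPLE_TABLE.put(2, "2");
    }

    // Strict decoding: a code without a mapping contributes no text at all.
    public static String decodeStrict(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String text = toUnicode.get(code);
            if (text != null) {
                sb.append(text);
            }
        }
        return sb.toString();
    }

    // Guessing decoding: a code without a mapping is assumed to be a Unicode
    // code point. This guess is an illustration only; it can be right or wrong.
    public static String decodeGuessing(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String text = toUnicode.get(code);
            if (text != null) {
                sb.append(text);
            } else {
                sb.appendCodePoint(code);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = {1, 2, 0x41}; // 0x41 has no entry in the sample table
        System.out.println("strict:   " + decodeStrict(codes, SAMPLE_TABLE));
        System.out.println("guessing: " + decodeGuessing(codes, SAMPLE_TABLE));
    }
}
```

With the strict variant, unmapped codes simply vanish from the output, which matches the "you get nothing" behavior described above for fonts that lack ToUnicode.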


Tilman

On 03.08.2017 at 20:35, Zubiri, Tomas wrote:

Hey Tilman,
I am sorry for the delay.
I am indeed using version 1.8.3; I will update to 2.0.7 to fix the
off-by-one bug.
Regarding the Chinese characters bug: I am extracting text from a PDF, not
rendering it.
Here is what the documents look like.

http://www.filedropper.com/1341025263
http://www.filedropper.com/1308134649

Here is the text I am extracting with our custom text extractor, which is based on
TextPosition and PDFTextStripper from version 1.8.3:
http://www.filedropper.com/1341025263_1
http://www.filedropper.com/1308134649_1

Let me know if I missed something or if you need any additional info.

Thanks!


Tomas Zubiri
Research Associate, Ownership
S&P Global Market Intelligence
Buenos Aires, Argentina
[email protected]
www.spglobal.com/marketintelligence




-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Thursday, August 03, 2017 1:41 PM
To: [email protected]
Subject: Re: Chinese document: mangled characters, ASCII block code points off 
by 1

On 02.08.2017 at 00:16, Zubiri, Tomas wrote:

Good afternoon,


http://www.filedropper.com/1308134649

The document linked above isn't being read correctly by PDFBox.
Characters in the ASCII block appear to be off by 1, for example,
numbers appear to be one value higher.

Should I upload this as a bug in JIRA?

Despite you not answering, I was able to guess what you're trying to tell us.

1) You are using a 1.8.* version. It is not very good at rendering: it
can't render the Chinese glyphs at all, and the numbers are off by one. Use
2.0.7.
2) Version 2.0.7 renders the numbers correctly. (The cause in 1.8.* is that the
internal code is indeed off by one; this is a quirk of the file and a bug
in 1.8.*, not a broken PDF.) The Chinese glyphs do look Chinese, but in poor
quality. This is a known, unsolved problem, described here:
https://issues.apache.org/jira/browse/PDFBOX-3293
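
To confirm an off-by-one symptom in extracted text, a small check along these lines may help. It is a sketch in plain Java, independent of PDFBox; the class OffsetCheck, the method constantOffset, and the sample strings are made up for illustration.

```java
import java.util.OptionalInt;

public class OffsetCheck {
    // Returns the code point difference (extracted - expected) if it is the
    // same for every character position; otherwise returns an empty value.
    public static OptionalInt constantOffset(String expected, String extracted) {
        int[] want = expected.codePoints().toArray();
        int[] got = extracted.codePoints().toArray();
        if (want.length == 0 || want.length != got.length) {
            return OptionalInt.empty();
        }
        int offset = got[0] - want[0];
        for (int i = 1; i < want.length; i++) {
            if (got[i] - want[i] != offset) {
                return OptionalInt.empty();
            }
        }
        return OptionalInt.of(offset);
    }

    public static void main(String[] args) {
        // Made-up sample: what the page shows vs. what an extractor returned.
        System.out.println(constantOffset("2017", "3128"));
    }
}
```

A constant offset of 1 across all digits would be consistent with the shifted internal codes described above, whereas an empty result points to some other kind of mismatch.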

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



