Re: Regression in 2.0.19

Andreas Lehmkuehler Tue, 07 Apr 2020 23:11:47 -0700

Am 05.04.20 um 23:17 schrieb Joel Hirsh:

Thank you.


Are those changes likely to be a problem in the future though? I had
noticed that the changes did get slightly better results when reading PDFs
from OCR scans which had lots of extraneous text from handwriting on the
paper document. So I assume there is a good reason for them.

Some time ago there was a proposal to change that part of the text extraction to get better results for some corner cases (once I found the related thread I'm going to post a pointer to it). I experimented with some changes and ended up with those I've accidentally committed. They had no influence on many cases and worked well for the given corner case but obviously the other side of the coin led to the current regression.

We are all aware that we have to overhaul the whole text extraction stuff. IMHO it doesn't make that much sense to put too much effort into changes with a small effect but a huge potential to introduce a regression.


Andreas


On Tue, Mar 31, 2020 at 3:24 AM Andreas Lehmkuehler <andr...@lehmi.de>
wrote:

I've fixed the issue in the 2.0 branch, the trunk isn't affected, see
PDFBOX-4805.

@Joel: thanks for reporting and debugging the issue, especially as it was
limited to some corner cases. Sorry for the inconvenience.

Andreas

Am 30.03.20 um 11:03 schrieb Andreas Lehmkuehler:

Looks like I accidentally committed some unrelated code :-(
I've to check that.

Am 30.03.20 um 10:56 schrieb Andreas Lehmkuehler:

Thanks for the debugging. Those changes were made in PDFBOX-4760, that should
help us find the issue.

Andreas

Am 30.03.20 um 06:38 schrieb Joel Hirsh:

I did try to create a test case by taking out most of the text on a page,
but that also fixed the problem.

I did verify that neither of the changes to PDTrueTypeFont for PDFBOX-4755
/ PDF.js #5501 are coming into play.
Set a breakpoint at those lines, and no breaks. Also, one file that has
trouble is using a PDType0Font called 'fon2',
another uses a PDTrueTypeFont.

I just started counting bad Unicode characters for other reasons, by
overriding PDFTextProcessor.showText().
I put in a change to test the return from font.toUnicode(code) to see if it
is null, and just count them. And there are no nulls coming back.
But the text breakup occurs with or without my override.

So I did compare and there are not a whole lot of other changes from 2.0.18
to 2.0.19. Turns out that if I revert to the old version of
PDFTextStripper.overlap() (two lines of code) then the problem goes away.
What were they supposed to address?
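
For readers following along, here is a minimal sketch of why a small change to an overlap test can break text into fragments. It is not the PDFBox source. The class name, the method names and the coordinate convention (y is the bottom edge of a glyph measured downward from the top of the page, so a glyph spans y - height to y) are all assumptions made for illustration. The idea is that a glyph stays on the current line only if its vertical span overlaps the span of that line.

```java
// Illustrative only: NOT the PDFBox implementation of PDFTextStripper.overlap().
// All names and the coordinate convention here are assumptions.
final class LineOverlapSketch {

    /** True if the vertical spans [y1 - h1, y1] and [y2 - h2, y2] share at least one point. */
    static boolean overlaps(float y1, float h1, float y2, float h2) {
        float top1 = y1 - h1;
        float top2 = y2 - h2;
        return top1 <= y2 && top2 <= y1;
    }

    /**
     * Counts how many lines a sequence of glyphs is split into.
     * Each glyph is given as {y, height}; a new line starts whenever
     * a glyph does not overlap the span accumulated for the current line.
     */
    static int countLines(float[][] glyphs) {
        if (glyphs.length == 0) {
            return 0;
        }
        int lines = 1;
        float lineY = glyphs[0][0];
        float lineHeight = glyphs[0][1];
        for (int i = 1; i < glyphs.length; i++) {
            float y = glyphs[i][0];
            float h = glyphs[i][1];
            if (overlaps(y, h, lineY, lineHeight)) {
                // Same line: grow the line span to cover this glyph as well.
                float top = Math.min(lineY - lineHeight, y - h);
                lineY = Math.max(lineY, y);
                lineHeight = lineY - top;
            } else {
                // No vertical overlap: this glyph starts a new line.
                lines++;
                lineY = y;
                lineHeight = h;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Three slightly jittered glyphs on one line, then one glyph further down.
        float[][] glyphs = { {100f, 10f}, {101f, 10f}, {99.5f, 12f}, {120f, 10f} };
        System.out.println(countLines(glyphs) + " lines");
    }
}
```

If the predicate were stricter than this, for example by rejecting the small jitter in y shown in the sample data, every glyph could end up on its own line, which would look a lot like the single-character phrases described in this thread.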

Regards

On Wed, Mar 4, 2020 at 8:34 PM Tilman Hausherr <thaush...@t-online.de>
wrote:

Hi,

Please try to submit a test case.

My guess is that this is related to bad /ToUnicode streams.

Tilman

Am 05.03.2020 um 03:09 schrieb Joel Hirsh:

I just started testing with version 2.0.19.

I am using PDFTextStripper and some files that gave back fine results in
2.0.18 are completely useless with 2.0.19. As an example, I have one file
that gets about 600 phrases in 2.0.18. In 2.0.19 it gets over 16,000
phrases, the majority of which are zero-length strings, and most of the
rest are single characters making up the phrase, rather than a phrase.


The file is confidential, so I cannot just post it.

Am I telling you something that you already know about, or should I try to
submit a test case? Or is there some new option I am unaware of?

Thanks
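
When the file itself is confidential, one way to describe the symptom above without sharing the file is to report counts only. The following is a rough sketch that summarises a list of extracted phrases into total, empty and single-character counts. The class PhraseStats and its methods are hypothetical helpers, not part of PDFBox, and how the phrases are collected from PDFTextStripper is left out on purpose.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: summarises extracted phrases
// so that two library versions can be compared without sharing the text.
final class PhraseStats {
    final int total;
    final int empty;
    final int singleChar;

    PhraseStats(int total, int empty, int singleChar) {
        this.total = total;
        this.empty = empty;
        this.singleChar = singleChar;
    }

    /** Counts zero-length and single-character phrases in the given list. */
    static PhraseStats of(List<String> phrases) {
        int empty = 0;
        int single = 0;
        for (String p : phrases) {
            // Count code points so that a surrogate pair is one character.
            int len = p.codePointCount(0, p.length());
            if (len == 0) {
                empty++;
            } else if (len == 1) {
                single++;
            }
        }
        return new PhraseStats(phrases.size(), empty, single);
    }

    /** Share of phrases that are empty or a single character (0.0 for an empty list). */
    double fragmentRatio() {
        return total == 0 ? 0.0 : (double) (empty + singleChar) / total;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("Invoice total", "", "I", "n", "v");
        PhraseStats s = PhraseStats.of(sample);
        System.out.println(s.total + " phrases, " + s.empty + " empty, "
                + s.singleChar + " single-character");
    }
}
```

Running the same summary against the output of both versions gives numbers that can be posted to the list even when the document cannot.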



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org


