Looks like I accidentally committed some unrelated code :-(
I'll have to check that.

On 30.03.20 at 10:56, Andreas Lehmkuehler wrote:
Thanks for the debugging. Those changes were made in PDFBOX-4760, which should help us find the issue.

Andreas

On 30.03.20 at 06:38, Joel Hirsh wrote:
I did try to create a test case by taking out most of the text on a page,
but that also fixed the problem.

I did verify that neither of the changes to PDTrueTypeFont for PDFBOX-4755
/ PDF.js #5501 is coming into play.
I set breakpoints at those lines, and none were hit. Also, one file that has
trouble is using a PDType0Font called 'fon2',
another uses a PDTrueTypeFont.

I just started counting bad Unicode characters for other reasons, by
overriding PDFTextProcessor.showText().
I put in a change to test the return from font.toUnicode(code) to see if it
is null, and just count them. And there are no nulls coming back.
But the text breakup occurs with or without my override.

So I did compare, and there are not a whole lot of other changes from 2.0.18
to 2.0.19. It turns out that if I
revert to the old version of PDFTextStripper.overlap() (two lines of code)
then the problem goes away.
What were those two changed lines supposed to address?

Regards

On Wed, Mar 4, 2020 at 8:34 PM Tilman Hausherr <thaush...@t-online.de>
wrote:

Hi,

Please try to submit a test case.

My guess is that this is related to bad /ToUnicode streams.

Tilman

On 05.03.2020 at 03:09, Joel Hirsh wrote:
I just started testing with version 2.0.19.

I am using PDFTextStripper, and some files that gave back fine results in
2.0.18 are completely useless with 2.0.19. As an example, I have one file
that gets about 600 phrases in 2.0.18. In 2.0.19 it gets over 16,000
phrases, the majority of which are zero-length strings, and most of the
rest are the single characters making up a phrase, rather than the phrase
itself.
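
One way to put numbers on this kind of breakup is to summarise the phrase
list produced under each version and compare the two. The following is only
a minimal sketch: PhraseStats is a hypothetical helper, not part of PDFBox,
and it assumes the phrases have already been collected into a List<String>
by whatever code wraps PDFTextStripper.

```java
import java.util.List;

/** Hypothetical helper (not a PDFBox class): summarises extracted phrases. */
final class PhraseStats {
    final int total;       // number of phrases returned
    final int empty;       // zero-length phrases
    final int singleChar;  // phrases consisting of exactly one code point

    private PhraseStats(int total, int empty, int singleChar) {
        this.total = total;
        this.empty = empty;
        this.singleChar = singleChar;
    }

    /** Counts empty and single-character phrases in the given list. */
    static PhraseStats of(List<String> phrases) {
        int empty = 0;
        int single = 0;
        for (String p : phrases) {
            if (p.isEmpty()) {
                empty++;
            } else if (p.codePointCount(0, p.length()) == 1) {
                single++;
            }
        }
        return new PhraseStats(phrases.size(), empty, single);
    }

    /** Fraction of phrases that are empty or a single character. */
    double fragmentedRatio() {
        return total == 0 ? 0.0 : (double) (empty + singleChar) / total;
    }
}
```

Running PhraseStats.of(...) on the output of both versions for the same file
and comparing total and fragmentedRatio() gives a figure that can be shared
even when the file itself cannot.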

The file is confidential, so I cannot just post it.

Am I telling you something that you already know about, or should I try to
submit a test case? Or is there some new option I am unaware of?

Thanks



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org





