On 05.04.20 at 23:17, Joel Hirsh wrote:
Thank you.

Are those changes likely to be a problem in the future though?  I had
noticed that the changes did get slightly better results when reading PDFs
from OCR scans which had lots of extraneous text from handwriting on the
paper document.  So I assume there is a good reason for them.
Some time ago there was a proposal to change that part of the text
extraction to get better results for some corner cases (once I have found
the related thread, I will post a pointer to it). I experimented with some
changes and ended up with the ones I accidentally committed. They had no
influence on many cases and worked well for the given corner case, but
obviously the other side of the coin was the current regression.

We are all aware that we have to overhaul the whole text extraction code.
IMHO it doesn't make much sense to put too much effort into changes with a
small effect but a huge potential to introduce a regression.

Andreas


On Tue, Mar 31, 2020 at 3:24 AM Andreas Lehmkuehler <andr...@lehmi.de>
wrote:

I've fixed the issue in the 2.0 branch, the trunk isn't affected, see
PDFBOX-4805.

@Joel: thanks for reporting and debugging the issue, especially as it was
limited to some corner cases. Sorry for the inconvenience.

Andreas

On 30.03.20 at 11:03, Andreas Lehmkuehler wrote:
Looks like I accidentally committed some unrelated code :-(
I have to check that.

On 30.03.20 at 10:56, Andreas Lehmkuehler wrote:
Thanks for the debugging. Those changes were made in PDFBOX-4760; that
should help us find the issue.

Andreas

On 30.03.20 at 06:38, Joel Hirsh wrote:
I did try to create a test case by taking out most of the text on a page,
but that also fixed the problem.

I did verify that neither of the changes to PDTrueTypeFont for PDFBOX-4755
/ PDF.js #5501 is coming into play.
I set breakpoints at those lines, and none were hit. Also, one file that
has trouble is using a PDType0Font called 'fon2', another uses a
PDTrueTypeFont.

I just started counting bad Unicode characters for other reasons, by
overriding PDFTextProcessor.showText().
I put in a change to test the return from font.toUnicode(code) to see if it
is null, and just count them. And there are no nulls coming back.
But the text breakup occurs with or without my override.

So I did compare and there are not a whole lot of other changes from 2.0.18
to 2.0.19. Turns out that if I revert to the old version of
PDFTextStripper.overlap() (two lines of code) then the problem goes away.
What were they supposed to address?
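
For readers following along, here is a minimal stand-alone sketch of the kind of vertical-overlap test under discussion. It is an assumption about how such a check is typically written, not a copy of the 2.0.18 or 2.0.19 source; the class name `OverlapSketch`, the helper `within`, and the 0.1 tolerance are illustrative only. Coordinates are assumed to grow downward, with each glyph's height extending upward from its baseline.

```java
public final class OverlapSketch {
    private OverlapSketch() {
    }

    /** True when second lies strictly inside (first - variance, first + variance). */
    public static boolean within(float first, float second, float variance) {
        return second < first + variance && second > first - variance;
    }

    /**
     * Decides whether two glyph boxes share a text line.
     * y1 and y2 are baseline positions (assumed to grow downward);
     * height1 and height2 extend upward from the respective baseline.
     */
    public static boolean overlap(float y1, float height1, float y2, float height2) {
        return within(y1, y2, 0.1f)                 // practically the same baseline
                || (y2 <= y1 && y2 >= y1 - height1) // second baseline inside first box
                || (y1 <= y2 && y1 >= y2 - height2); // first baseline inside second box
    }

    public static void main(String[] args) {
        System.out.println(overlap(100f, 10f, 100.05f, 10f)); // true
        System.out.println(overlap(100f, 10f, 95f, 10f));     // true
        System.out.println(overlap(100f, 10f, 120f, 10f));    // false
    }
}
```

If a check like this returns false too often, a text extractor would presumably treat neighbouring glyphs as belonging to different lines, which would be consistent with the flood of single-character phrases reported earlier in the thread.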

Regards

On Wed, Mar 4, 2020 at 8:34 PM Tilman Hausherr <thaush...@t-online.de>
wrote:

Hi,

Please try to submit a test case.

My guess is that this is related to bad /ToUnicode streams.

Tilman

On 05.03.2020 at 03:09, Joel Hirsh wrote:
I just started testing with version 2.0.19.

I am using PDFTextStripper and some files that gave back fine results in
2.0.18 are completely useless with 2.0.19.  As an example, I have one file
that gets about 600 phrases in 2.0.18.  In 2.0.19 it gets over 16,000
phrases, the majority of which are zero-length strings, and most of the
rest are the single characters making up a phrase, rather than the phrase
itself.
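
A symptom like this can be quantified without sharing the confidential file. The sketch below is a hypothetical helper, not part of PDFBox: the class `PhraseStats` and the "more than half" threshold are assumptions, and it expects the extracted phrases to be supplied as a plain list of strings by whatever extraction code is in use.

```java
import java.util.Arrays;
import java.util.List;

public final class PhraseStats {
    public final int total;
    public final int empty;
    public final int singleChar;

    private PhraseStats(int total, int empty, int singleChar) {
        this.total = total;
        this.empty = empty;
        this.singleChar = singleChar;
    }

    /** Counts zero-length and one-character entries in the extracted phrases. */
    public static PhraseStats of(List<String> phrases) {
        int empty = 0;
        int single = 0;
        for (String phrase : phrases) {
            if (phrase.isEmpty()) {
                empty++;
            } else if (phrase.length() == 1) {
                single++;
            }
        }
        return new PhraseStats(phrases.size(), empty, single);
    }

    /** Heuristic (an assumption): fragmented if over half the phrases are empty or one character. */
    public boolean looksFragmented() {
        return total > 0 && (empty + singleChar) * 2 > total;
    }

    public static void main(String[] args) {
        PhraseStats stats = PhraseStats.of(Arrays.asList("Hello world", "", "a", "b"));
        System.out.println(stats.total + " phrases, " + stats.empty + " empty, "
                + stats.singleChar + " single-character, fragmented=" + stats.looksFragmented());
    }
}
```

Running the same helper against the output of both library versions gives comparable numbers that can be posted to the list in place of the document itself.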

The file is confidential, so I cannot just post it.

Am I telling you something that you already know about, or should I try to
submit a test case? Or is there some new option I am unaware of?

Thanks



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org




