Thank you.

Are those changes likely to be a problem in the future, though? I had
noticed that the changes did get slightly better results when reading PDFs
from OCR scans that had lots of extraneous text from handwriting on the
paper document. So I assume there is a good reason for them.
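
For what it's worth, a rough way to watch for this kind of breakup when
comparing versions is to tally the extracted phrases, along the lines of the
counts in my original report quoted below. This is only a sketch: PhraseStats,
its fields, and the 50% threshold in looksFragmented() are made-up names and
values for illustration, not anything in PDFBox, and it assumes the phrases
have already been collected into a list of strings.

```java
import java.util.List;

// Hypothetical helper (not part of PDFBox): summarizes a list of extracted
// phrases so that output from two library versions can be compared.
public class PhraseStats {
    final int total;      // number of phrases returned
    final int empty;      // phrases that are empty after trimming
    final int singleChar; // phrases that are exactly one character

    PhraseStats(int total, int empty, int singleChar) {
        this.total = total;
        this.empty = empty;
        this.singleChar = singleChar;
    }

    // Counts empty and single-character phrases in the given list.
    static PhraseStats of(List<String> phrases) {
        int empty = 0;
        int singleChar = 0;
        for (String phrase : phrases) {
            String trimmed = phrase.trim();
            if (trimmed.isEmpty()) {
                empty++;
            } else if (trimmed.codePointCount(0, trimmed.length()) == 1) {
                singleChar++;
            }
        }
        return new PhraseStats(phrases.size(), empty, singleChar);
    }

    // Assumed heuristic: flag the output when more than half of the
    // phrases are empty or a single character.
    boolean looksFragmented() {
        return total > 0 && (empty + singleChar) * 2 > total;
    }

    public static void main(String[] args) {
        PhraseStats stats = of(List.of("Invoice total", "", "a", "b", ""));
        System.out.println("total=" + stats.total
                + " empty=" + stats.empty
                + " singleChar=" + stats.singleChar
                + " fragmented=" + stats.looksFragmented());
    }
}
```

Running the same tally on the output of both versions for the same file
should make a jump like the one I saw obvious without having to share the
file itself.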

On Tue, Mar 31, 2020 at 3:24 AM Andreas Lehmkuehler <andr...@lehmi.de>
wrote:

> I've fixed the issue in the 2.0 branch, the trunk isn't affected, see
> PDFBOX-4805.
>
> @Joel: thanks for reporting and debugging the issue, especially as it was
> limited to some corner cases. Sorry for the inconvenience.
>
> Andreas
>
> Am 30.03.20 um 11:03 schrieb Andreas Lehmkuehler:
> > Looks like I accidentally committed some unrelated code :-(
> > I have to check that.
> >
> > Am 30.03.20 um 10:56 schrieb Andreas Lehmkuehler:
> >> Thanks for the debugging. Those changes were made in PDFBOX-4760, that
> >> should help us find the issue.
> >>
> >> Andreas
> >>
> >> Am 30.03.20 um 06:38 schrieb Joel Hirsh:
> >>> I did try to create a test case by taking out most of the text on a page,
> >>> but that also fixed the problem.
> >>>
> >>> I did verify that neither of the changes to PDTrueTypeFont for PDFBOX-4755
> >>> / PDF.js #5501 comes into play. I set a breakpoint at each of those lines,
> >>> and neither was hit. Also, one file that has trouble uses a PDType0Font
> >>> called 'fon2'; another uses a PDTrueTypeFont.
> >>>
> >>> I just started counting bad Unicode characters for other reasons, by
> >>> overriding PDFTextProcessor.showText().
> >>> I put in a change to test the return from font.toUnicode(code) to see if it
> >>> is null, and just count them. And there are no nulls coming back.
> >>> But the text breakup occurs with or without my override.
> >>>
> >>> So I did compare, and there are not a whole lot of other changes from 2.0.18
> >>> to 2.0.19. It turns out that if I revert to the old version of
> >>> PDFTextStripper.overlap() (two lines of code), then the problem goes away.
> >>> What were those changes supposed to address?
> >>>
> >>> Regards
> >>>
> >>> On Wed, Mar 4, 2020 at 8:34 PM Tilman Hausherr <thaush...@t-online.de>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Please try to submit a test case.
> >>>>
> >>>> My guess is that this is related to bad /ToUnicode streams.
> >>>>
> >>>> Tilman
> >>>>
> >>>> Am 05.03.2020 um 03:09 schrieb Joel Hirsh:
> >>>>> I just started testing with version 2.0.19.
> >>>>>
> >>>>> I am using PDFTextStripper, and some files that gave fine results in
> >>>>> 2.0.18 are completely useless with 2.0.19. As an example, I have one file
> >>>>> that gets about 600 phrases in 2.0.18. In 2.0.19 it gets over 16,000
> >>>>> phrases, the majority of which are zero-length strings, and most of the
> >>>>> rest are single characters that make up a phrase, rather than the
> >>>>> phrase itself.
> >>>>>
> >>>>> The file is confidential, so I cannot just post it.
> >>>>>
> >>>>> Am I telling you something that you already know about, or should I try to
> >>>>> submit a test case? Or is there some new option I am unaware of?
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
> >>>>
> >>>>
> >>>
> >>
