Thank you. Are those changes likely to be a problem in the future, though? I had noticed that they gave slightly better results when reading PDFs from OCR scans that had lots of extraneous text from handwriting on the paper document, so I assume there was a good reason for them.

On Tue, Mar 31, 2020 at 3:24 AM Andreas Lehmkuehler <andr...@lehmi.de> wrote:

> I've fixed the issue in the 2.0 branch, the trunk isn't affected, see
> PDFBOX-4805.
>
> @Joel: thanks for reporting and debugging the issue, especially as it was
> limited to some corner cases. Sorry for the inconvenience.
>
> Andreas
>
> On 30.03.20 at 11:03, Andreas Lehmkuehler wrote:
> > Looks like I accidentally committed some unrelated code :-(
> > I have to check that.
> >
> > On 30.03.20 at 10:56, Andreas Lehmkuehler wrote:
> >> Thanks for the debugging. Those changes were made in PDFBOX-4760, that
> >> should help us find the issue.
> >>
> >> Andreas
> >>
> >> On 30.03.20 at 06:38, Joel Hirsh wrote:
> >>> I did try to create a test case by taking out most of the text on a
> >>> page, but that also fixed the problem.
> >>>
> >>> I did verify that neither of the changes to PDTrueTypeFont for
> >>> PDFBOX-4755 / PDF.js #5501 are coming into play.
> >>> Set a breakpoint at those lines, and no breaks. Also, one file that has
> >>> trouble is using a PDType0Font called 'fon2',
> >>> another uses a PDTrueTypeFont.
> >>>
> >>> I just started counting bad Unicode characters for other reasons, by
> >>> overriding PDFTextProcessor.showText().
> >>> I put in a change to test the return from font.toUnicode(code) to see
> >>> if it is null, and just count them. And there are no nulls coming back.
> >>> But the text breakup occurs with or without my override.
> >>>
> >>> So I did compare and there are not a whole lot of other changes from
> >>> 2.0.18 to 2.0.19. Turns out that if I
> >>> revert to the old version of PDFTextStripper.overlap() (two lines of
> >>> code) then the problem goes away.
> >>> What were they supposed to address?
> >>>
> >>> Regards
> >>>
> >>> On Wed, Mar 4, 2020 at 8:34 PM Tilman Hausherr <thaush...@t-online.de>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Please try to submit a test case.
> >>>>
> >>>> My guess is that this is related to bad /ToUnicode streams.
> >>>>
> >>>> Tilman
> >>>>
> >>>> On 05.03.2020 at 03:09, Joel Hirsh wrote:
> >>>>> I just started testing with version 2.0.19.
> >>>>>
> >>>>> I am using PDFTextStripper and some files that gave back fine
> >>>>> results in 2.0.18 are completely useless with 2.0.19. As an example,
> >>>>> I have one file that gets about 600 phrases in 2.0.18. In 2.0.19 it
> >>>>> gets over 16,000 phrases, the majority of which are a zero length
> >>>>> string, and most of the rest are single characters making up the
> >>>>> phrase, rather than a phrase.
> >>>>>
> >>>>> The file is confidential, so I cannot just post it.
> >>>>>
> >>>>> Am I telling you something that you already know about, or should I
> >>>>> try to submit a test case? Or is there some new option I am unaware
> >>>>> of?
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
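
For anyone comparing output between 2.0.18 and 2.0.19 on a file that cannot be shared, the symptom described in the thread (many zero-length and single-character phrases) can be quantified without posting the text itself. The following is only a rough sketch in plain Java: `PhraseStats`, `of` and `looksFragmented` are hypothetical names made up for illustration and are not part of PDFBox, the "more than half" threshold is an arbitrary assumption, and the phrases are assumed to have already been collected from whatever the caller gets out of PDFTextStripper.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: summarises a list of extracted phrases.
public class PhraseStats {
    public final int empty;       // null or zero-length phrases
    public final int singleChar;  // phrases consisting of exactly one code point
    public final int longer;      // everything else

    private PhraseStats(int empty, int singleChar, int longer) {
        this.empty = empty;
        this.singleChar = singleChar;
        this.longer = longer;
    }

    // Classify each phrase by length; code points are counted so that a
    // surrogate pair is treated as one character.
    public static PhraseStats of(List<String> phrases) {
        int empty = 0;
        int singleChar = 0;
        int longer = 0;
        for (String p : phrases) {
            if (p == null || p.isEmpty()) {
                empty++;
            } else if (p.codePointCount(0, p.length()) == 1) {
                singleChar++;
            } else {
                longer++;
            }
        }
        return new PhraseStats(empty, singleChar, longer);
    }

    public int total() {
        return empty + singleChar + longer;
    }

    // Assumed heuristic: flag the result when more than half of all phrases
    // are empty or a single character.
    public boolean looksFragmented() {
        return total() > 0 && (empty + singleChar) * 2 > total();
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("Invoice", "", "T", "o", "", "Total due");
        PhraseStats stats = PhraseStats.of(sample);
        System.out.println("empty=" + stats.empty
                + " single=" + stats.singleChar
                + " longer=" + stats.longer
                + " fragmented=" + stats.looksFragmented());
    }
}
```

Running the same file through both versions and posting only these counts would give the developers a reproducible measure of the regression while keeping the document confidential.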