Ok, this is very interesting.

So we have one text render that adds a bunch of spaces.

Then we have another text render that puts visible characters on top of
those spaces.

Then we sort the text positions by position, and the two runs wind up
interleaved, which kills the text extraction accuracy.
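
To make the interleaving concrete, here is a toy model in Java. It is not
PDFBox code: the Glyph class, the coordinates, and the simple "y, then x" sort
are all assumptions made up for illustration. The first run draws the
parentheses with spaces between them, the second run draws "999" on top.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class InterleaveDemo {
    /** Hypothetical stand-in for a text position: a string and its origin. */
    public static final class Glyph {
        public final String text;
        public final float x;
        public final float y;

        public Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Glyphs in content-stream order: the padded parentheses, then "999". */
    public static List<Glyph> sample() {
        return Arrays.asList(
            new Glyph("(", 0, 0), new Glyph(" ", 1, 0), new Glyph(" ", 3, 0),
            new Glyph(" ", 5, 0), new Glyph(" ", 7, 0), new Glyph(")", 8, 0),
            new Glyph("9", 2, 0), new Glyph("9", 4, 0), new Glyph("9", 6, 0));
    }

    /** Returns a copy sorted by y, then x (a simplified positional sort). */
    public static List<Glyph> sortByPosition(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y)
                              .thenComparingDouble(g -> g.x));
        return sorted;
    }

    /** Concatenates the glyph strings in list order. */
    public static String join(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + join(sample()) + "]");
        System.out.println("[" + join(sortByPosition(sample())) + "]");
    }
}
```

In content-stream order the digits stay together; after the sort, every digit
has a space from the other run next to it.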

Would a reasonable refinement to the algorithm be to look for overlapping
glyphs and, whenever a space overlaps a non-space, ignore the space?
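
A minimal sketch of that overlap rule, in plain Java rather than against the
PDFBox API. The Glyph class, its bounding-box fields, and the name
dropCoveredSpaces are hypothetical; a real patch would work on the stripper's
own text positions and would need a tolerance and something better than the
quadratic scan used here.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapFilter {
    /** Hypothetical stand-in for a text position: a string and its bounding box. */
    public static final class Glyph {
        public final String text;
        public final float x;
        public final float y;
        public final float width;
        public final float height;

        public Glyph(String text, float x, float y, float width, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        /** True if the glyph is only whitespace (ASCII whitespace in this sketch). */
        public boolean isSpace() {
            return text.trim().isEmpty();
        }

        /** True if the two boxes share area; boxes that merely touch do not count. */
        public boolean overlaps(Glyph o) {
            return x < o.x + o.width && o.x < x + width
                && y < o.y + o.height && o.y < y + height;
        }
    }

    /** Keeps every glyph except spaces that overlap at least one non-space glyph. */
    public static List<Glyph> dropCoveredSpaces(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean covered = false;
            if (g.isSpace()) {
                // O(n^2) scan for clarity; fine for a sketch, not for large pages.
                for (Glyph other : glyphs) {
                    if (!other.isSpace() && g.overlaps(other)) {
                        covered = true;
                        break;
                    }
                }
            }
            if (!covered) {
                kept.add(g);
            }
        }
        return kept;
    }
}
```

Running the filter before the positional sort would leave the spaces that sit
in genuinely empty areas untouched, so ordinary word spacing is not affected.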

I know that text extraction is a black art (disclosure: I wrote the
original text extraction modules for iText, so lots of hard-earned
experience here), but I think the above would be appropriate for all
situations...

Would you be up for reviewing a patch if I implement this?

PS I also think that having an option to ignore spaces in rendering (the
issue you linked to) would be a good idea. But that should be optional. I'm
happy to include that in my patch if you would like.

K

On Thu, Dec 12, 2024, 9:40 PM Tilman Hausherr <thaush...@t-online.de> wrote:

> Hi,
>
> These spaces are really in the PDF:
>
>     BT
>     /Content <</MCID 224 >>BDC
>     1 i
>     /T1_4 1 Tf
>     7 0 0 7 *195.4 110.502* Tm
>     *(\(                                                     \))Tj*
>     EMC
>     /Content <</MCID 225 >>BDC
>     /T1_0 1 Tf
>     -19.686 6.786 Td
>     (Beginning capital account)Tj
>     /T1_4 1 Tf
>     ( )Tj
>     14.057 0 Td
>     (.)Tj
>     1.714 0 Td
>     (.)Tj
>     1.714 0 Td
>     (.)Tj
>     EMC
>     ET
>
>
> And *195.4 110.502* is really the position, you can move your mouse
> there in PDFDebugger.
>
> The font messages are not important here.
> There is a way to get rid of such spaces, but it requires a source code
> change; it is described here:
> https://issues.apache.org/jira/browse/PDFBOX-3774
>
> However, it's possible that other files would have a bad text extraction.
>
> Tilman
>
> On 13.12.2024 03:29, Kevin Day wrote:
> > Hello-
> >
> > We are using PDFTextStripper, and have found some cases where there are a
> > *lot* of extraneous spaces being added to the output.  It almost acts like
> > the stripper is thinking that the space width of the font is super tiny.
> >
> > I managed to get a document that exhibits the behavior:
> >
> >
> > https://drive.google.com/file/d/1B2Mc4mMdsYfk9jKVqQ9OxEhKLRAxprrU/view?usp=sharing
> >
> > The easiest way to see the behavior is in PDFDebugger, View->Show Stripper
> > Text Positions.
> >
> > Note in the lower left corner of the document, there is text "999".  The
> > text above and below that is fine, but the line with 999 has a *ton* of
> > extra space rectangles displayed.
> >
> > The extract text function in PDFDebugger doesn't sort, so that one comes
> > out fine, but if you use PDFTextStripper with sorting enabled (), the line
> > renders like this:
> >
> > Withdrawals and distributions . . . $ ( 9 9 9 )
> >
> > Note the many space characters, and that there are even spaces between each
> > 9.
> >
> > I also observe that the PDF has warning messages about fonts (not sure if
> > this might be involved):
> >
> > [main] WARN org.apache.pdfbox.pdmodel.font.PDType1Font - Using fallback
> > font ArialMT for HelveticaLTStd-Roman
> >
> > [main] WARN org.apache.fontbox.ttf.CmapSubtable - Format 14 cmap table is
> > not supported and will be ignored
> >
> >
> >
> > It almost acts like the parentheses on the line are triggering some
> > different detection mode where the font's space width is computed to be
> > much smaller than it should be.
> >
> > Any ideas on what is going on or if it is fixable?
> >
> > Thanks!
> >
> > - K
> >
>
