Ok, this is very interesting. So we have one text render that adds a bunch of spaces.
Then we have another text render that puts visible characters on top of those spaces. Then we positional-sort the text positions, and they wind up interleaving, which kills the text extraction accuracy.

Would a reasonable refinement to the algorithm be to look for overlapping glyphs? And if there is a space that overlaps a non-space, then ignore the space?

I know that text extraction is a black art (disclosure: I wrote the original text extraction modules for iText, so lots of hard-earned experience here), but I think the above would be appropriate for all situations...

Would you be up for reviewing a patch if I implement this?

PS I also think that having an option to ignore spaces in rendering (the issue you linked to) would be a good idea. But that should be optional. I'm happy to include that in my patch if you would like.

K

On Thu, Dec 12, 2024, 9:40 PM Tilman Hausherr <thaush...@t-online.de> wrote:

> Hi,
>
> These spaces are really in the PDF:
>
> BT
> /Content <</MCID 224 >>BDC
> 1 i
> /T1_4 1 Tf
> 7 0 0 7 *195.4 110.502* Tm
> *(\( \))Tj*
> EMC
> /Content <</MCID 225 >>BDC
> /T1_0 1 Tf
> -19.686 6.786 Td
> (Beginning capital account)Tj
> /T1_4 1 Tf
> ( )Tj
> 14.057 0 Td
> (.)Tj
> 1.714 0 Td
> (.)Tj
> 1.714 0 Td
> (.)Tj
> EMC
> ET
>
>
> And *195.4 110.502* is really the position, you can move your mouse
> there in PDFDebugger.
>
> The font messages are not important here.
> There is a way to get rid of such spaces, but it requires a source code
> change, it is described here:
> https://issues.apache.org/jira/browse/PDFBOX-3774
>
> However it's possible that other files would have a bad text extraction.
>
> Tilman
>
> On 13.12.2024 03:29, Kevin Day wrote:
> > Hello-
> >
> > We are using PDFTextStripper, and have found some cases where there are a
> > *lot* of extraneous spaces being added to the output. It almost acts like
> > the stripper is thinking that the space width of the font is super tiny.
> >
> > I managed to get a document that exhibits the behavior:
> >
> > https://drive.google.com/file/d/1B2Mc4mMdsYfk9jKVqQ9OxEhKLRAxprrU/view?usp=sharing
> >
> > The easiest way to see the behavior is in PDFDebugger, View->Show Stripper
> > Text Positions.
> >
> > Note in the lower left corner of the document, there is text "999". The
> > text above and below that is fine, but the line with 999 has a *ton* of
> > extra space rectangles displayed.
> >
> > The extract text function in PDFDebugger doesn't sort, so that one comes
> > out fine, but if you use PDFTextStripper with sorting enabled (), the line
> > renders like this:
> >
> > Withdrawals and distributions . . . $ ( 9 9 9 )
> >
> > Note the many space characters, and that there are even spaces between each
> > 9.
> >
> > I also observe that the PDF has warning messages about fonts (not sure if
> > this might be involved):
> >
> > [main] WARN org.apache.pdfbox.pdmodel.font.PDType1Font - Using fallback
> > font ArialMT for HelveticaLTStd-Roman
> >
> > [main] WARN org.apache.fontbox.ttf.CmapSubtable - Format 14 cmap table is
> > not supported and will be ignored
> >
> > It almost acts like the parentheses on the line are triggering some
> > different detection mode where the font's space width is computing to be
> > much smaller than it should be.
> >
> > Any ideas on what is going on or if it is fixable?
> >
> > Thanks!
> >
> > - K
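
A rough sketch of the overlap idea from the top of this message, in case it helps the discussion: before the positional sort, drop any space whose bounding box overlaps a non-space glyph. Everything here is illustrative. `Glyph` and `dropOverlappedSpaces` are hypothetical names, not PDFBox API; a real patch would presumably work on PDFBox's `TextPosition` objects instead. I am also assuming the boxes are axis-aligned rectangles in a single coordinate space, which ignores rotated text.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapSpaceFilter {

    /** Hypothetical stand-in for a text position: the text plus its bounding box. */
    public static final class Glyph {
        public final String text;
        public final float x, y, width, height;

        public Glyph(String text, float x, float y, float width, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        /** True when the glyph carries only whitespace (or no text at all). */
        boolean isSpace() {
            return text.trim().isEmpty();
        }

        /** Strict rectangle intersection; boxes that merely touch do not overlap. */
        boolean overlaps(Glyph o) {
            return x < o.x + o.width && o.x < x + width
                && y < o.y + o.height && o.y < y + height;
        }
    }

    /**
     * Returns the glyphs in their original order, minus every space whose box
     * overlaps at least one non-space glyph. Quadratic in the number of glyphs,
     * which is fine for a sketch but would need indexing for large pages.
     */
    public static List<Glyph> dropOverlappedSpaces(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean covered = false;
            if (g.isSpace()) {
                for (Glyph other : glyphs) {
                    if (!other.isSpace() && g.overlaps(other)) {
                        covered = true;
                        break;
                    }
                }
            }
            if (!covered) {
                kept.add(g);
            }
        }
        return kept;
    }
}
```

The strict comparison is deliberate: a space that only touches the edge of its neighbor is an ordinary word gap and should survive. Whether a minimum overlap fraction would be safer than "any overlap" is an open question for the patch.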