FYI - what I'm proposing is a slightly more refined approach than what is suggested in the linked issue: https://issues.apache.org/jira/browse/PDFBOX-3774
Here's how I would do this:

In writePage(), keep track of the previous character added to the line
(around line 696). Then at the top of the loop (around line 551) check if
the current character is a space. If it is a space, check to see if it
geometrically completely overlaps the previous character (probably checking
only in the X direction, and it will need to account for directional
adjustments). If it does, then continue with the next TextPosition.

Please let me know your thoughts.

- K

Kevin Day
trumpet
p | 480.961.6003 x1002
e | ke...@trumpetinc.com
www.trumpetinc.com <http://trumpetinc.com/> | LinkedIn <https://www.linkedin.com/company/trumpet-inc.>

On Fri, Dec 13, 2024 at 7:41 AM Kevin Day <ke...@trumpetinc.com> wrote:

> Ok, this is very interesting.
>
> So we have one text render that adds a bunch of spaces.
>
> Then we have another text render that puts visible characters on top of
> those spaces.
>
> Then we positional-sort the text positions, and they wind up interleaving,
> which kills the text extraction accuracy.
>
> Would a reasonable refinement to the algorithm be to look for overlapping
> glyphs? And if there is a space that overlaps a non-space, then ignore the
> space?
>
> I know that text extraction is a black art (disclosure: I wrote the
> original text extraction modules for iText, so lots of hard-earned
> experience here), but I think the above would be appropriate for all
> situations...
>
> Would you be up for reviewing a patch if I implement this?
>
> PS I also think that having an option to ignore spaces in rendering (the
> issue you linked to) would be a good idea. But that should be optional. I'm
> happy to include that in my patch if you would like.
>
> K
>
> On Thu, Dec 12, 2024, 9:40 PM Tilman Hausherr <thaush...@t-online.de>
> wrote:
>
>> Hi,
>>
>> These spaces are really in the PDF:
>>
>> BT
>> /Content <</MCID 224 >>BDC
>> 1 i
>> /T1_4 1 Tf
>> 7 0 0 7 *195.4 110.502* Tm
>> *(\( \))Tj*
>> EMC
>> /Content <</MCID 225 >>BDC
>> /T1_0 1 Tf
>> -19.686 6.786 Td
>> (Beginning capital account)Tj
>> /T1_4 1 Tf
>> ( )Tj
>> 14.057 0 Td
>> (.)Tj
>> 1.714 0 Td
>> (.)Tj
>> 1.714 0 Td
>> (.)Tj
>> EMC
>> ET
>>
>> And *195.4 110.502* is really the position; you can move your mouse
>> there in PDFDebugger.
>>
>> The font messages are not important here.
>> There is a way to get rid of such spaces, but it requires a source code
>> change; it is described here:
>> https://issues.apache.org/jira/browse/PDFBOX-3774
>>
>> However, it's possible that other files would then have bad text
>> extraction.
>>
>> Tilman
>>
>> On 13.12.2024 03:29, Kevin Day wrote:
>> > Hello-
>> >
>> > We are using PDFTextStripper, and have found some cases where there are
>> > a *lot* of extraneous spaces being added to the output. It almost acts
>> > like the stripper is thinking that the space width of the font is super
>> > tiny.
>> >
>> > I managed to get a document that exhibits the behavior:
>> >
>> > https://drive.google.com/file/d/1B2Mc4mMdsYfk9jKVqQ9OxEhKLRAxprrU/view?usp=sharing
>> >
>> > The easiest way to see the behavior is in PDFDebugger, View->Show
>> > Stripper Text Positions.
>> >
>> > Note in the lower left corner of the document, there is text "999". The
>> > text above and below that is fine, but the line with 999 has a *ton* of
>> > extra space rectangles displayed.
>> >
>> > The extract text function in PDFDebugger doesn't sort, so that one comes
>> > out fine, but if you use PDFTextStripper with sorting enabled (), the
>> > line renders like this:
>> >
>> > Withdrawals and distributions . . . $ ( 9 9 9 )
>> >
>> > Note the many space characters, and that there are even spaces between
>> > each 9.
>> >
>> > I also observe that the PDF has warning messages about fonts (not sure
>> > if this might be involved):
>> >
>> > [main] WARN org.apache.pdfbox.pdmodel.font.PDType1Font - Using fallback font ArialMT for HelveticaLTStd-Roman
>> >
>> > [main] WARN org.apache.fontbox.ttf.CmapSubtable - Format 14 cmap table is not supported and will be ignored
>> >
>> > It almost acts like the parentheses on the line are triggering some
>> > different detection mode where the font's space width is computed to be
>> > much smaller than it should be.
>> >
>> > Any ideas on what is going on or if it is fixable?
>> >
>> > Thanks!
>> >
>> > - K
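
To make the proposal at the top of this message more concrete, here is a rough sketch of the overlap check, for discussion only. It is not the actual writePage() code. Glyph is a hypothetical stand-in for TextPosition (its x and width would presumably come from getXDirAdj() and getWidthDirAdj()), the class and method names are made up for illustration, and both the tolerance value and the reading of "completely overlaps" as "the space's X extent lies inside the previous glyph's X extent" are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class OverlappingSpaceFilter {

    /** Hypothetical stand-in for TextPosition: text plus direction-adjusted X and width. */
    static final class Glyph {
        final String text;
        final float x;      // assumed to come from getXDirAdj()
        final float width;  // assumed to come from getWidthDirAdj()

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        boolean isSpace() {
            return " ".equals(text);
        }

        float endX() {
            return x + width;
        }
    }

    /** True if the space's horizontal extent lies inside the previous glyph's extent. */
    static boolean isCoveredBy(Glyph space, Glyph previous, float tolerance) {
        return space.x >= previous.x - tolerance
                && space.endX() <= previous.endX() + tolerance;
    }

    /**
     * Walks a position-sorted line and skips every space that is completely
     * covered in X by the previously kept glyph, when that glyph is not a space.
     */
    static List<Glyph> dropOverlappedSpaces(List<Glyph> sortedLine, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        Glyph previous = null; // previous character added to the line
        for (Glyph current : sortedLine) {
            if (previous != null && current.isSpace() && !previous.isSpace()
                    && isCoveredBy(current, previous, tolerance)) {
                continue; // ignore the space, move on to the next position
            }
            kept.add(current);
            previous = current;
        }
        return kept;
    }

    static String textOf(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("9", 100f, 5f));
        line.add(new Glyph(" ", 101f, 2f)); // sits inside the first 9
        line.add(new Glyph("9", 105f, 5f));
        line.add(new Glyph(" ", 106f, 2f)); // sits inside the second 9
        line.add(new Glyph(" ", 112f, 2f)); // genuine gap, not covered
        line.add(new Glyph(")", 114f, 3f));
        System.out.println(textOf(dropOverlappedSpaces(line, 0.01f))); // prints "99 )"
    }
}
```

The sketch only compares against the immediately preceding kept glyph, which matches the interleaving that sorting produces in the sample file. A space that only partially overlaps a glyph is kept, so ordinary word gaps should be unaffected.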