I will also review that Jira commit you referenced. It seems like it might be related.

On Sat, Dec 14, 2024, 9:59 AM Kevin Day <ke...@trumpetinc.com> wrote:

> Great, I will put that together in the next couple of days.
>
> I haven't looked at the iText code in several years - this will be
> greenfield development work.
>
> Take care,
>
> K
>
> On Fri, Dec 13, 2024, 9:21 PM Tilman Hausherr <thaush...@t-online.de>
> wrote:
>
>> On 13.12.2024 15:41, Kevin Day wrote:
>> > Ok, this is very interesting.
>> >
>> > So we have one text render that adds a bunch of spaces.
>> >
>> > Then we have another text render that puts visible characters on top of
>> > those spaces.
>> >
>> > Then we positional-sort the text positions, and they wind up interleaving,
>> > which kills the text extraction accuracy.
>> >
>> > Would a reasonable refinement to the algorithm be to look for overlapping
>> > glyphs? And if there is a space that overlaps a non-space, then ignore the
>> > space?
>> >
>> > I know that text extraction is a black art (disclosure: I wrote the
>> > original text extraction modules for iText, so lots of hard-earned
>> > experience here), but I think the above would be appropriate for all
>> > situations...
>> >
>> > Would you be up for reviewing a patch if I implement this?
>> >
>> > PS I also think that having an option to ignore spaces in rendering (the
>> > issue you linked to) would be a good idea. But that should be optional. I'm
>> > happy to include that in my patch if you would like.
>>
>> Yes to both!
>>
>> Note that there is a pending patch about a corner case:
>>
>> https://github.com/apache/pdfbox/pull/155
>>
>> But don't use code from iText unless you are allowed to, i.e. check the
>> papers that you signed. If your patch is more than just a small bugfix
>> you may have to sign something with us as well.
>>
>> https://www.apache.org/licenses/icla.pdf
>>
>> Tilman
>>
>> >
>> > K
>> >
>> > On Thu, Dec 12, 2024, 9:40 PM Tilman Hausherr <thaush...@t-online.de> wrote:
>> >
>> >> Hi,
>> >>
>> >> These spaces are really in the PDF:
>> >>
>> >> BT
>> >> /Content <</MCID 224 >>BDC
>> >> 1 i
>> >> /T1_4 1 Tf
>> >> 7 0 0 7 *195.4 110.502* Tm
>> >> *(\( \))Tj*
>> >> EMC
>> >> /Content <</MCID 225 >>BDC
>> >> /T1_0 1 Tf
>> >> -19.686 6.786 Td
>> >> (Beginning capital account)Tj
>> >> /T1_4 1 Tf
>> >> ( )Tj
>> >> 14.057 0 Td
>> >> (.)Tj
>> >> 1.714 0 Td
>> >> (.)Tj
>> >> 1.714 0 Td
>> >> (.)Tj
>> >> EMC
>> >> ET
>> >>
>> >> And *195.4 110.502* is really the position, you can move your mouse
>> >> there in PDFDebugger.
>> >>
>> >> The font messages are not important here.
>> >> There is a way to get rid of such spaces, but it requires a source code
>> >> change, it is described here:
>> >> https://issues.apache.org/jira/browse/PDFBOX-3774
>> >>
>> >> However it's possible that other files would have a bad text extraction.
>> >>
>> >> Tilman
>> >>
>> >> On 13.12.2024 03:29, Kevin Day wrote:
>> >>> Hello-
>> >>>
>> >>> We are using PDFTextStripper, and have found some cases where there are a
>> >>> *lot* of extraneous spaces being added to the output. It almost acts like
>> >>> the stripper is thinking that the space width of the font is super tiny.
>> >>>
>> >>> I managed to get a document that exhibits the behavior:
>> >>>
>> >>> https://drive.google.com/file/d/1B2Mc4mMdsYfk9jKVqQ9OxEhKLRAxprrU/view?usp=sharing
>> >>>
>> >>> The easiest way to see the behavior is in PDFDebugger, View->Show Stripper
>> >>> Text Positions.
>> >>>
>> >>> Note in the lower left corner of the document, there is text "999". The
>> >>> text above and below that is fine, but the line with 999 has a *ton* of
>> >>> extra space rectangles displayed.
>> >>>
>> >>> The extract text function in PDFDebugger doesn't sort, so that one comes
>> >>> out fine, but if you use PDFTextStripper with sorting enabled (), the line
>> >>> renders like this:
>> >>>
>> >>> Withdrawals and distributions . . . $ ( 9 9 9 )
>> >>>
>> >>> Note the many space characters, and that there are even spaces between each
>> >>> 9.
>> >>>
>> >>> I also observe that the PDF has warning messages about fonts (not sure if
>> >>> this might be involved):
>> >>>
>> >>> [main] WARN org.apache.pdfbox.pdmodel.font.PDType1Font - Using fallback font ArialMT for HelveticaLTStd-Roman
>> >>>
>> >>> [main] WARN org.apache.fontbox.ttf.CmapSubtable - Format 14 cmap table is not supported and will be ignored
>> >>>
>> >>> It almost acts like the parentheses on the line are triggering some
>> >>> different detection mode where the font's space width is computed to be
>> >>> much smaller than it should be.
>> >>>
>> >>> Any ideas on what is going on or if it is fixable?
>> >>>
>> >>> Thanks!
>> >>>
>> >>> - K
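
The refinement proposed in the thread (ignore a space glyph when it overlaps a non-space glyph, before the positional sort interleaves them) can be sketched in plain Java. This is an illustrative sketch only, not PDFBox code and not the patch under discussion: the Glyph class and the dropOverlappedSpaces, join, and sample helpers are hypothetical names, and the coordinates are invented to mimic the "( 9 9 9 )" line. A real implementation would presumably operate on the stripper's text positions instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OverlappingSpaceFilter {

    /** Hypothetical stand-in for a text position: glyph text plus its bounding box. */
    static final class Glyph {
        final String text;
        final double x, y, width, height;

        Glyph(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        boolean isSpace() {
            return text.trim().isEmpty();
        }

        /** True when the two boxes share interior area; boxes that only touch do not count. */
        boolean overlaps(Glyph other) {
            return x < other.x + other.width && other.x < x + width
                && y < other.y + other.height && other.y < y + height;
        }
    }

    /** Returns the glyphs in their original order, minus spaces covered by a visible glyph. */
    static List<Glyph> dropOverlappedSpaces(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean covered = false;
            if (g.isSpace()) {
                for (Glyph other : glyphs) {
                    if (!other.isSpace() && g.overlaps(other)) {
                        covered = true;
                        break;
                    }
                }
            }
            if (!covered) {
                kept.add(g);
            }
        }
        return kept;
    }

    static String join(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    /** Made-up glyphs, already sorted by x, with three spaces sitting under "(999)". */
    static List<Glyph> sample() {
        return Arrays.asList(
            new Glyph("$", 0, 0, 5, 10),
            new Glyph(" ", 5, 0, 5, 10),   // genuine gap: touches "$" and "(" but overlaps neither
            new Glyph("(", 10, 0, 3, 10),
            new Glyph(" ", 12, 0, 4, 10),  // overlaps "(" and the first "9"
            new Glyph("9", 13, 0, 5, 10),
            new Glyph(" ", 17, 0, 4, 10),  // overlaps the first and second "9"
            new Glyph("9", 18, 0, 5, 10),
            new Glyph("9", 23, 0, 5, 10),
            new Glyph(" ", 24, 0, 3, 10),  // sits inside the third "9"
            new Glyph(")", 28, 0, 3, 10));
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = sample();
        System.out.println("before: \"" + join(glyphs) + "\"");
        System.out.println("after:  \"" + join(dropOverlappedSpaces(glyphs)) + "\"");
    }
}
```

With the sample data the sorted glyphs read "$ ( 9 99 )" before filtering and "$ (999)" after. The pairwise scan is quadratic in the number of glyphs, which is fine for a sketch; a per-line or spatially indexed pass would likely be needed for large pages.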