Hi,
These spaces are really in the PDF:
BT
/Content <</MCID 224 >>BDC
1 i
/T1_4 1 Tf
7 0 0 7 *195.4 110.502* Tm
*(\( \))Tj*
EMC
/Content <</MCID 225 >>BDC
/T1_0 1 Tf
-19.686 6.786 Td
(Beginning capital account)Tj
/T1_4 1 Tf
( )Tj
14.057 0 Td
(.)Tj
1.714 0 Td
(.)Tj
1.714 0 Td
(.)Tj
EMC
ET
And *195.4 110.502* is really the position; you can move your mouse
there in PDFDebugger to check.
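To show how the numbers in that content stream fit together, here is a minimal sketch of the arithmetic. It is not PDFBox code, and the class and method names (TdPosition, applyTd) are made up for illustration. It only assumes the standard PDF rule that Td offsets the start of the current line by (tx, ty) in text space, so under "7 0 0 7 ... Tm" each Td unit is worth 7 user-space units.

```java
import java.util.Locale;

// Sketch only: how Td moves the text line matrix that Tm set.
// Hypothetical helper, not part of PDFBox.
final class TdPosition {

    // A matrix is stored as {a, b, c, d, e, f}, in the same order as the Tm operands.
    // Td (tx, ty) replaces the line matrix with [1 0 0 1 tx ty] x current,
    // so only the translation part (e, f) changes.
    static double[] applyTd(double[] m, double tx, double ty) {
        return new double[] {
            m[0], m[1], m[2], m[3],
            tx * m[0] + ty * m[2] + m[4],
            tx * m[1] + ty * m[3] + m[5]
        };
    }

    public static void main(String[] args) {
        double[] tm = {7, 0, 0, 7, 195.4, 110.502};    // 7 0 0 7 195.4 110.502 Tm
        double[] label = applyTd(tm, -19.686, 6.786);   // -19.686 6.786 Td
        double[] dot1 = applyTd(label, 14.057, 0);      // 14.057 0 Td
        double[] dot2 = applyTd(dot1, 1.714, 0);        // 1.714 0 Td
        double[] dot3 = applyTd(dot2, 1.714, 0);        // 1.714 0 Td

        // Td is relative to the start of the current line, so the Tj calls
        // in between do not affect these positions.
        System.out.printf(Locale.ROOT, "\"( )\" starts at       (%.3f, %.3f)%n", tm[4], tm[5]);
        System.out.printf(Locale.ROOT, "label starts at       (%.3f, %.3f)%n", label[4], label[5]);
        System.out.printf(Locale.ROOT, "leader dots start at x = %.3f, %.3f, %.3f%n",
                dot1[4], dot2[4], dot3[4]);
    }
}
```

With the operands above, the "( )" string starts at (195.4, 110.502) and "Beginning capital account" starts at roughly (57.598, 158.004), i.e. on a different line further up the page.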
The font messages are not important here.
There is a way to get rid of such spaces, but it requires a source code
change. It is described here:
https://issues.apache.org/jira/browse/PDFBOX-3774
However, it's possible that other files would then have bad text extraction.
Tilman
On 13.12.2024 03:29, Kevin Day wrote:
Hello-
We are using PDFTextStripper, and have found some cases where there are a
*lot* of extraneous spaces being added to the output. It almost acts like
the stripper is thinking that the space width of the font is super tiny.
I managed to get a document that exhibits the behavior:
https://drive.google.com/file/d/1B2Mc4mMdsYfk9jKVqQ9OxEhKLRAxprrU/view?usp=sharing
The easiest way to see the behavior is in PDFDebugger, View->Show Stripper
Text Positions.
Note in the lower left corner of the document, there is text "999". The
text above and below that is fine, but the line with 999 has a *ton* of
extra space rectangles displayed.
The extract text function in PDFDebugger doesn't sort, so that one comes
out fine, but if you use PDFTextStripper with sorting enabled (), the line
renders like this:
Withdrawals and distributions . . . $ ( 9 9 9 )
Note the many space characters, and that there are even spaces between each
9.
I also observe that the PDF has warning messages about fonts (not sure if
this might be involved):
[main] WARN org.apache.pdfbox.pdmodel.font.PDType1Font - Using fallback
font ArialMT for HelveticaLTStd-Roman
[main] WARN org.apache.fontbox.ttf.CmapSubtable - Format 14 cmap table is
not supported and will be ignored
It almost acts like the parentheses on the line are triggering some
different detection mode where the font's space width is computed to be
much smaller than it should be.
Any ideas on what is going on or if it is fixable?
Thanks!
- K