Re: Text extraction adding lots of strange spaces

Kevin Day Fri, 13 Dec 2024 15:02:14 -0800

FYI - what I'm proposing is a slightly more refined approach than what is
suggested in the linked issue:
https://issues.apache.org/jira/browse/PDFBOX-3774


Here's how I would do this:  In writePage(), keep track of the previous
character added to the line (around line 696).  Then at the top of the loop
(around line 551) check if the current character is a space.  If it is a
space, check to see if it geometrically completely overlaps the previous
character (probably checking only in X direction - and it will need to
account for directional adjustments).  If it does, skip the space and
continue with the next TextPosition.
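
The check described above could be sketched roughly as follows. This is only
an illustration under stated assumptions, not the actual PDFTextStripper code:
the Glyph class is a hypothetical stand-in for TextPosition (in PDFBox the
values would presumably come from the direction-adjusted getters such as
getXDirAdj() and getWidthDirAdj()), and the tolerance parameter and the
"previous character must be a non-space" condition are assumptions added for
the example.

```java
import java.util.ArrayList;
import java.util.List;

class OverlappingSpaceFilter {

    // Hypothetical stand-in for TextPosition: the text of one glyph plus its
    // horizontal extent after directional adjustment.
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        float endX() { return x + width; }

        boolean isSpace() { return " ".equals(text); }
    }

    // True when 'inner' lies entirely inside the X extent of 'outer',
    // allowing a small tolerance for rounding.
    static boolean coveredInX(Glyph inner, Glyph outer, float tolerance) {
        return inner.x >= outer.x - tolerance
            && inner.endX() <= outer.endX() + tolerance;
    }

    // Walks position-sorted glyphs of one line and drops every space that is
    // completely covered by the previously kept non-space glyph.
    static List<Glyph> dropOverlappedSpaces(List<Glyph> sorted, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        Glyph previous = null; // last glyph actually added to the line
        for (Glyph current : sorted) {
            if (current.isSpace() && previous != null && !previous.isSpace()
                    && coveredInX(current, previous, tolerance)) {
                continue; // skip the space, move on to the next glyph
            }
            kept.add(current);
            previous = current;
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("9", 10f, 5f));
        line.add(new Glyph(" ", 11f, 2f)); // sits on top of the first 9
        line.add(new Glyph("9", 15f, 5f));
        line.add(new Glyph(" ", 16f, 2f)); // sits on top of the second 9
        line.add(new Glyph(" ", 21f, 2f)); // a real space after the digits

        StringBuilder sb = new StringBuilder();
        for (Glyph g : dropOverlappedSpaces(line, 0.01f)) {
            sb.append(g.text);
        }
        System.out.println("[" + sb + "]");
    }
}
```

Only spaces that are fully covered are dropped, so a space that merely
touches or partially overlaps a neighbouring glyph is kept; that is meant to
limit the effect on documents where spaces are positioned normally.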

Please let me know your thoughts.

- K

Kevin Day

trumpet
p | 480.961.6003 x1002
e | ke...@trumpetinc.com
www.trumpetinc.com <http://trumpetinc.com/> | LinkedIn
<https://www.linkedin.com/company/trumpet-inc.>


On Fri, Dec 13, 2024 at 7:41 AM Kevin Day <ke...@trumpetinc.com> wrote:

> Ok, this is very interesting.
>
> So we have one text render that adds a bunch of spaces.
>
> Then we have another text render that puts visible characters on top of
> those spaces.
>
> Then we positional-sort the text positions, and they wind up interleaving,
> which kills the text extraction accuracy.
>
> Would a reasonable refinement to the algorithm be to look for overlapping
> glyphs? And if there is a space that overlaps a non-space, then ignore the
> space?
>
> I know that text extraction is a black art (disclosure: I wrote the
> original text extraction modules for iText, so lots of hard-earned
> experience here), but I think the above would be appropriate for all
> situations...
>
> Would you be up for reviewing a patch if I implement this?
>
> PS I also think that having an option to ignore spaces in rendering (the
> issue you linked to) would be a good idea. But that should be optional. I'm
> happy to include that in my patch if you would like.
>
> K
>
> On Thu, Dec 12, 2024, 9:40 PM Tilman Hausherr <thaush...@t-online.de>
> wrote:
>
>> Hi,
>>
>> These spaces are really in the PDF:
>>
>>     BT
>>     /Content <</MCID 224 >>BDC
>>     1 i
>>     /T1_4 1 Tf
>>     7 0 0 7 *195.4 110.502* Tm
>>     *(\(                                                     \))Tj*
>>     EMC
>>     /Content <</MCID 225 >>BDC
>>     /T1_0 1 Tf
>>     -19.686 6.786 Td
>>     (Beginning capital account)Tj
>>     /T1_4 1 Tf
>>     ( )Tj
>>     14.057 0 Td
>>     (.)Tj
>>     1.714 0 Td
>>     (.)Tj
>>     1.714 0 Td
>>     (.)Tj
>>     EMC
>>     ET
>>
>>
>> And *195.4 110.502* is really the position, you can move your mouse
>> there in PDFDebugger.
>>
>> The font messages are not important here.
>> There is a way to get rid of such spaces, but it requires a source code
>> change; it is described here:
>> https://issues.apache.org/jira/browse/PDFBOX-3774
>>
>> However, it's possible that other files would then have a bad text extraction.
>>
>> Tilman
>>
>> On 13.12.2024 03:29, Kevin Day wrote:
>> > Hello-
>> >
>> > We are using PDFTextStripper, and have found some cases where there are a
>> > *lot* of extraneous spaces being added to the output.  It almost acts like
>> > the stripper is thinking that the space width of the font is super tiny.
>> >
>> > I managed to get a document that exhibits the behavior:
>> >
>> >
>> > https://drive.google.com/file/d/1B2Mc4mMdsYfk9jKVqQ9OxEhKLRAxprrU/view?usp=sharing
>> >
>> > The easiest way to see the behavior is in PDFDebugger, View->Show Stripper
>> > Text Positions.
>> >
>> > Note in the lower left corner of the document, there is text "999".  The
>> > text above and below that is fine, but the line with 999 has a *ton* of
>> > extra space rectangles displayed.
>> >
>> > The extract text function in PDFDebugger doesn't sort, so that one comes
>> > out fine, but if you use PDFTextStripper with sorting enabled (), the line
>> > renders like this:
>> >
>> > Withdrawals and distributions . . . $ ( 9 9 9 )
>> >
>> > Note the many space characters, and that there are even spaces between
>> > each 9.
>> >
>> > I also observe that the PDF has warning messages about fonts (not sure if
>> > this might be involved):
>> >
>> > [main] WARN org.apache.pdfbox.pdmodel.font.PDType1Font - Using fallback
>> > font ArialMT for HelveticaLTStd-Roman
>> >
>> > [main] WARN org.apache.fontbox.ttf.CmapSubtable - Format 14 cmap table is
>> > not supported and will be ignored
>> >
>> >
>> >
>> > It almost acts like the parentheses on the line are triggering some
>> > different detection mode where the font's space width is computed to be
>> > much smaller than it should be.
>> >
>> > Any ideas on what is going on or if it is fixable?
>> >
>> > Thanks!
>> >
>> > - K
>> >
>>
>
