Same line calculation of PDFTextStripper

Kevin Day Mon, 07 Apr 2025 12:19:58 -0700

I've got an interesting problem.

We are running into scenarios where parsing fails to treat consecutive
words as being on the same line.


Here is an example:
https://drive.google.com/file/d/1XRd6itkNzXCd9CbPuGSmB6lYZMMMHzvH/view?usp=drive_link

If you extract the text, it comes out:

B checked . . . . .
Partnership’s name, address, city, state, and ZIP code

but it should come out:

B Partnership’s name, address, city, state, and ZIP code



The problem appears to be caused by this code in the TextPositionComparator:

// we will do a simple tolerance comparison
if (yDifference < .1 ||
    pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
    pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
{


The issue is that the above logic doesn't take the X position of the text
into account.  So we have text on the left side of the page: "B
Partnership’s name", and completely unrelated text on the right side of the
page: "checked . ."

The text on the right side of the page has a Y position that falls within
the top/bottom range of the left-side content, so it is treated as part of
the same line, and it ends up ahead of "Partnership's name" in the output.

Because the merge sort does not only compare glyphs that are adjacent on
the page, a single wrong "same line" decision can move text a long way from
where it belongs.

I understand the goal of the bracketing of the Y position (text near each
other should be treated as continuous, even if there are slight differences
in the Y position).  But this is causing a lot of problems with accurate
extraction of adjacent words - even words that are on exactly the same
y-position, and appear next to each other.

In the problem document, the letter "B" is at y position 602.0.  The
"Partnership's name" text is also at y position 602.0.  The "checked . . ."
text is at y position 605.997.



I suspect this might be caused by the non-transitive nature of the
comparator.  This has always felt a bit sketchy to me anyway, and it looks
like it is biting us now.
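
To make the non-transitivity concrete, here is a minimal standalone sketch. It does not use PDFBox: Glyph is a hypothetical stand-in for TextPosition, y is assumed to grow downward with top = bottom - height, and the coordinates are made up for illustration. Only the tolerance test is copied from the snippet above.

```java
import java.util.Comparator;

class OverlapComparatorDemo {
    /** Hypothetical stand-in for a glyph: left edge x, bottom y (growing downward), height. */
    static final class Glyph {
        final String text;
        final float x, yBottom, height;

        Glyph(String text, float x, float yBottom, float height) {
            this.text = text;
            this.x = x;
            this.yBottom = yBottom;
            this.height = height;
        }

        float yTop() { return yBottom - height; }
    }

    /** Same tolerance test as the quoted snippet, applied to the stand-in type. */
    static final Comparator<Glyph> OVERLAP = (g1, g2) -> {
        float yDifference = Math.abs(g1.yBottom - g2.yBottom);
        if (yDifference < .1
                || (g2.yBottom >= g1.yTop() && g2.yBottom <= g1.yBottom)
                || (g1.yBottom >= g2.yTop() && g1.yBottom <= g2.yBottom)) {
            return Float.compare(g1.x, g2.x); // "same line": order by x only
        }
        return g1.yBottom < g2.yBottom ? -1 : 1; // different lines: order by y
    };

    public static void main(String[] args) {
        // Made-up coordinates: a overlaps b, b overlaps c, but a does not overlap c.
        Glyph a = new Glyph("a", 300f, 100f, 6f);
        Glyph b = new Glyph("b", 200f, 104f, 6f);
        Glyph c = new Glyph("c", 100f, 108f, 6f);

        System.out.println("a vs b: " + Integer.signum(OVERLAP.compare(a, b))); // 1  (a after b)
        System.out.println("b vs c: " + Integer.signum(OVERLAP.compare(b, c))); // 1  (b after c)
        System.out.println("a vs c: " + Integer.signum(OVERLAP.compare(a, c))); // -1 (a before c)
    }
}
```

The three results form a cycle (a after b, b after c, yet a before c), so no ordering of the three glyphs satisfies all of the comparisons, and the result of a sort then depends on which pairs it happens to compare.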

Here is my suggestion for a potential fix if preserving the "y position
window" behavior is necessary:

boolean nearX = Math.abs(x1 - x2) < pos1.getIndividualWidths()[0] * 4;

// we will do a simple tolerance comparison
if (yDifference < .1 ||
    nearX && pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
    nearX && pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
{
    return Float.compare(x1, x2);
}



Either that, or base it just on yDifference < *some configurable threshold*
and not check for characters with overlapping y ranges.
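
Both options can be tried outside PDFBox with a rough sketch like the one below. Glyph is again a hypothetical stand-in for TextPosition (firstWidth plays the role of getIndividualWidths()[0]), and the 1.0 threshold and the x, height and width values are assumptions. Only the three y positions come from the problem document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class SameLineFixSketch {
    /** Hypothetical stand-in: left edge x, bottom y (growing downward), height, first glyph width. */
    static final class Glyph {
        final String text;
        final float x, yBottom, height, firstWidth;

        Glyph(String text, float x, float yBottom, float height, float firstWidth) {
            this.text = text;
            this.x = x;
            this.yBottom = yBottom;
            this.height = height;
            this.firstWidth = firstWidth;
        }

        float yTop() { return yBottom - height; }
    }

    /** Option 1: the y-overlap window only counts when the glyphs are also close in x. */
    static int compareNearX(Glyph g1, Glyph g2) {
        float yDifference = Math.abs(g1.yBottom - g2.yBottom);
        boolean nearX = Math.abs(g1.x - g2.x) < g1.firstWidth * 4;
        if (yDifference < .1
                || (nearX && g2.yBottom >= g1.yTop() && g2.yBottom <= g1.yBottom)
                || (nearX && g1.yBottom >= g2.yTop() && g1.yBottom <= g2.yBottom)) {
            return Float.compare(g1.x, g2.x);
        }
        return g1.yBottom < g2.yBottom ? -1 : 1;
    }

    /** Option 2: no overlap test at all, only a configurable tolerance on the y position. */
    static Comparator<Glyph> byBaseline(float yThreshold) {
        return (g1, g2) -> Math.abs(g1.yBottom - g2.yBottom) < yThreshold
                ? Float.compare(g1.x, g2.x)
                : Float.compare(g1.yBottom, g2.yBottom);
    }

    /** Sorts a copy of the glyphs and joins their text with spaces. */
    static String sortedText(List<Glyph> glyphs, Comparator<Glyph> order) {
        List<Glyph> copy = new ArrayList<>(glyphs);
        copy.sort(order);
        return copy.stream().map(g -> g.text).collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // y values as reported for the problem document; x, height and width are made up.
        Glyph b = new Glyph("B", 36f, 602.0f, 8f, 5f);
        Glyph partnership = new Glyph("Partnership's", 50f, 602.0f, 8f, 5f);
        Glyph checked = new Glyph("checked", 400f, 605.997f, 8f, 5f);
        List<Glyph> glyphs = List.of(checked, partnership, b);

        System.out.println(sortedText(glyphs, SameLineFixSketch::compareNearX)); // B Partnership's checked
        System.out.println(sortedText(glyphs, byBaseline(1.0f)));                // B Partnership's checked
    }
}
```

Two caveats on the sketch: nearX as written uses only the first argument's width, so swapping the arguments can change the answer when widths differ, and a pure threshold test can still chain (each glyph within tolerance of its neighbor) in ways that are not transitive.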

Please advise!

- K

Kevin Day

trumpet p| 480.961.6003 x1002
e| ke...@trumpetinc.com
www.trumpetinc.com <http://trumpetinc.com/> | LinkedIn
<https://www.linkedin.com/company/trumpet-inc.>
