I've got an interesting problem. We are running into scenarios where parsing fails to treat consecutive words as being on the same line.
Here is an example: https://drive.google.com/file/d/1XRd6itkNzXCd9CbPuGSmB6lYZMMMHzvH/view?usp=drive_link

If you extract the text, it comes out:

    B checked . . . . . Partnership’s name, address, city, state, and ZIP code

but it should come out:

    B Partnership’s name, address, city, state, and ZIP code

The problem appears to be caused by this code in the TextPositionComparator:

    // we will do a simple tolerance comparison
    if (yDifference < .1 ||
        pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
        pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
    {

The issue is that the above logic doesn't take the X position of the text into account. So we have text on the left side of the page: "B Partnership’s name", and completely unrelated text on the right side of the page: "checked . ."

The text on the right side of the page has a Y position that is inside the top/bottom of the left side of page content, so it is being treated as the same line - with preference over the content on the left side of the page. Because the mergesort is not necessarily comparing adjacent glyphs, large mistakes can be made.

I understand the goal of the bracketing of the Y position (text near each other should be treated as continuous, even if there are slight differences in the Y position). But this is causing a lot of problems with accurate extraction of adjacent words - even words that are on exactly the same y-position, and appear next to each other.

In the problem document, the letter "B" is at y position 602.0. The "Partnership's name" is at y position 602.0. The 'checked . . .' is at y position 605.997.

I suspect this might be caused by the non-transitive nature of the comparator. This has always felt a bit sketchy to me anyway, and it looks like it is biting us now.
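To make the non-transitivity concern concrete, here is a minimal standalone sketch that mirrors the tolerance check quoted above. It does not use PDFBox: the Glyph class, the coordinate values, and the method names are all made up for illustration, and the final "order by bottom y" branch is an assumption about what the comparator does when the tolerance check fails (it also assumes y grows downward, so top = bottom - height).

```java
import java.util.Comparator;

final class ComparatorTransitivityDemo {

    /** Hypothetical stand-in for a TextPosition: left edge x, bottom y, and height. */
    static final class Glyph {
        final float x;
        final float yBottom;
        final float height;

        Glyph(float x, float yBottom, float height) {
            this.x = x;
            this.yBottom = yBottom;
            this.height = height;
        }
    }

    /** Same condition as the quoted snippet; the fallback branch is an assumption. */
    static final Comparator<Glyph> QUOTED_LOGIC = (g1, g2) -> {
        float pos1YBottom = g1.yBottom;
        float pos2YBottom = g2.yBottom;
        float pos1YTop = pos1YBottom - g1.height;
        float pos2YTop = pos2YBottom - g2.height;
        float yDifference = Math.abs(pos1YBottom - pos2YBottom);

        if (yDifference < .1
                || pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom
                || pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom) {
            // Treated as the same line: order by x only.
            return Float.compare(g1.x, g2.x);
        }
        // Assumed fallback: treated as different lines, order by bottom y.
        return pos1YBottom < pos2YBottom ? -1 : 1;
    };

    /** True when a < b and b < c, yet a is not ordered before c. */
    static boolean violatesTransitivity(Glyph a, Glyph b, Glyph c) {
        return QUOTED_LOGIC.compare(a, b) < 0
                && QUOTED_LOGIC.compare(b, c) < 0
                && QUOTED_LOGIC.compare(a, c) >= 0;
    }

    /** Synthetic values only; these are not taken from the problem document. */
    static boolean demoViolation() {
        Glyph left = new Glyph(50f, 120f, 10f);   // spans y 110..120
        Glyph tall = new Glyph(200f, 120f, 30f);  // spans y 90..120
        Glyph right = new Glyph(300f, 100f, 10f); // spans y 90..100
        return violatesTransitivity(left, tall, right);
    }

    public static void main(String[] args) {
        System.out.println("transitivity violated: " + demoViolation());
    }
}
```

With these synthetic values, "left" sorts before "tall" and "tall" sorts before "right" (both pairs pass the same-line check and are ordered by x), but "left" and "right" do not overlap in y, so they fall through to the y ordering and "right" comes first. The final order then depends on which pairs the sort happens to compare, which would be consistent with the behavior described above.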

Here is my suggestion for a potential fix if preserving the "y position window" behavior is necessary:

    boolean nearX = Math.abs(x1 - x2) < pos1.getIndividualWidths()[0] * 4;

    // we will do a simple tolerance comparison
    if (yDifference < .1 ||
        nearX && pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
        nearX && pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
    {
        return Float.compare(x1, x2);
    }

Either that, or base it just on yDifference < *some configurable threshold* and not check for characters with overlapping y ranges.

Please advise!

- K

Kevin Day
trumpet
p | 480.961.6003 x1002
e | ke...@trumpetinc.com
www.trumpetinc.com <http://trumpetinc.com/> | LinkedIn <https://www.linkedin.com/company/trumpet-inc.>