I tried this and get lots of differences, obviously. I looked at two files (PDFBOX-2991 and PDFBOX-3019) and the differences make sense, but there's a new problem: the segments are not separated.

PDFBOX-2991:
Costa Mesa, California, benjaminmccan(ätt)gmail.co*mB*enjamin

PDFBOX-3019:
Originally from Dallas, I have since moved throughout the U.S. and have spent mos*t r*hodescc3(ätt)vcu.edu

The part in bold is where I'd expect to have a better separation.

I haven't dealt much with this algorithm... I think the ideal solution would be some sort of block detection that runs first, followed by a second step that collects these blocks separately (like the "bead" logic that already exists).
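
To make the idea a bit more concrete, here is a rough, much-simplified sketch of the "detect blocks first, then collect them separately" approach, reduced to a single line of text. This is not PDFBox code: the Box class, the splitIntoBlocks method and the gap threshold are all hypothetical, and a real implementation would have to work in two dimensions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BlockDetectionSketch {
    /** Hypothetical stand-in for a piece of positioned text (not a PDFBox class). */
    static final class Box {
        final String text;
        final float x;      // left edge
        final float width;  // horizontal extent

        Box(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Splits the boxes of one line into blocks: a new block starts whenever the
     * horizontal gap to the previous box exceeds maxGap (an arbitrary threshold).
     */
    static List<String> splitIntoBlocks(List<Box> line, float maxGap) {
        List<Box> sorted = new ArrayList<>(line);
        sorted.sort((a, b) -> Float.compare(a.x, b.x));

        List<String> blocks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        float previousRight = 0f;
        for (Box box : sorted) {
            // Step 1: block detection. A large gap closes the current block.
            if (current.length() > 0 && box.x - previousRight > maxGap) {
                blocks.add(current.toString());
                current.setLength(0);
            }
            // Step 2: collect the text of each block separately.
            current.append(box.text);
            previousRight = box.x + box.width;
        }
        if (current.length() > 0) {
            blocks.add(current.toString());
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<Box> line = Arrays.asList(
                new Box("mos", 10f, 30f), new Box("t", 40f, 5f),
                new Box("r", 200f, 5f), new Box("hodes", 205f, 40f));
        System.out.println(splitIntoBlocks(line, 20f));
    }
}
```

With made-up coordinates like the ones in main, the text on the left and the text on the right end up in different blocks instead of being glued together as in the PDFBOX-3019 output above.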

Tilman


On 07.04.2025 21:19, Kevin Day wrote:
Here is my suggestion for a potential fix if preserving the "y position
window" behavior is necessary:

boolean nearX = Math.abs(x1 - x2) < pos1.getIndividualWidths()[0] * 4;

// we will do a simple tolerance comparison
if (yDifference < .1 ||
    (nearX && pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom) ||
    (nearX && pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom))
{
    return Float.compare(x1, x2);
}
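
For anyone who wants to try the condition outside of PDFBox, here is a minimal self-contained sketch. The Glyph class and its fields are hypothetical stand-ins for the values the snippet reads from a TextPosition, and the final fallback to comparing the bottom y values is an assumption, since the snippet above only shows the branch that compares by x.

```java
class NearXComparatorSketch {
    /** Hypothetical stand-in for TextPosition; y grows downwards. */
    static final class Glyph {
        final float x;          // left edge
        final float yBottom;    // bottom of the glyph
        final float height;     // glyph height, so top = yBottom - height
        final float firstWidth; // stands in for getIndividualWidths()[0]

        Glyph(float x, float yBottom, float height, float firstWidth) {
            this.x = x;
            this.yBottom = yBottom;
            this.height = height;
            this.firstWidth = firstWidth;
        }
    }

    static int compare(Glyph pos1, Glyph pos2) {
        float x1 = pos1.x;
        float x2 = pos2.x;
        float pos1YBottom = pos1.yBottom;
        float pos2YBottom = pos2.yBottom;
        float pos1YTop = pos1YBottom - pos1.height;
        float pos2YTop = pos2YBottom - pos2.height;
        float yDifference = Math.abs(pos1YBottom - pos2YBottom);

        // The y window only applies when the two glyphs are horizontally close.
        boolean nearX = Math.abs(x1 - x2) < pos1.firstWidth * 4;
        boolean pos2InPos1Window = pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom;
        boolean pos1InPos2Window = pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom;

        if (yDifference < .1 || (nearX && (pos2InPos1Window || pos1InPos2Window))) {
            return Float.compare(x1, x2);
        }
        // Assumption: otherwise order top to bottom by the bottom y value.
        return Float.compare(pos1YBottom, pos2YBottom);
    }

    public static void main(String[] args) {
        Glyph a = new Glyph(10f, 100f, 10f, 5f);
        Glyph near = new Glyph(15f, 102f, 10f, 5f); // close in x, overlapping in y
        Glyph far = new Glyph(300f, 98f, 10f, 5f);  // overlapping in y, far away in x
        System.out.println("a vs near: " + compare(a, near));
        System.out.println("a vs far:  " + compare(a, far));
    }
}
```

In this sketch the nearby glyph is ordered by x because it falls inside the y window, while the distant one is ordered by y, since the window is ignored once the x distance exceeds four times the first character width.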
