I tried this and, as expected, got lots of differences. I looked at two
files (PDFBOX-2991 and PDFBOX-3019) and the differences make sense, but
there's a new problem: the segments are not separated.
PDFBOX-2991:
Costa Mesa, California, benjaminmccan(ätt)gmail.co*mB*enjamin
PDFBOX-3019:
Originally from Dallas, I have since moved throughout the U.S. and have
spent mos*t r*hodescc3(ätt)vcu.edu
The part in bold is where I'd expect to have a better separation.
I haven't dealt much with this algorithm, but I think the ideal solution
would be to run some sort of block detection first and then, in a second
step, collect the text of each block separately (like the "bead" logic
that already exists).
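A minimal sketch of that two-step idea, purely as an illustration: Box, detectBlocks, collect and the maxGap threshold are all hypothetical names and not part of PDFBox, and the block detection is deliberately naive (it only splits on horizontal gaps, so a full-width heading would merge the columns).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class BlockFirstSketch {
    /** Hypothetical glyph box, NOT a PDFBox class. y grows downward. */
    static final class Box {
        final String text;
        final float x;
        final float y;
        final float width;

        Box(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /** Step 1: start a new block wherever the horizontal gap exceeds maxGap (assumed threshold). */
    static List<List<Box>> detectBlocks(List<Box> boxes, float maxGap) {
        List<Box> byX = new ArrayList<>(boxes);
        byX.sort(Comparator.comparingDouble(b -> b.x));
        List<List<Box>> blocks = new ArrayList<>();
        List<Box> current = new ArrayList<>();
        float rightEdge = Float.NEGATIVE_INFINITY;
        for (Box b : byX) {
            if (!current.isEmpty() && b.x - rightEdge > maxGap) {
                blocks.add(current);
                current = new ArrayList<>();
            }
            current.add(b);
            rightEdge = Math.max(rightEdge, b.x + b.width);
        }
        if (!current.isEmpty()) {
            blocks.add(current);
        }
        return blocks;
    }

    /** Step 2: read each block on its own, top to bottom, then left to right. */
    static String collect(List<List<Box>> blocks) {
        StringBuilder out = new StringBuilder();
        for (List<Box> block : blocks) {
            block.sort(Comparator.comparingDouble((Box b) -> b.y)
                    .thenComparingDouble(b -> b.x));
            for (Box b : block) {
                out.append(b.text);
            }
            out.append('\n'); // keep the blocks apart in the output
        }
        return out.toString();
    }
}
```

With a left column and a right column whose baselines nearly coincide, the two end up on separate output lines instead of being interleaved, which is the kind of separation missing in the bold parts above.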
Tilman
On 07.04.2025 21:19, Kevin Day wrote:
Here is my suggestion for a potential fix if preserving the "y position
window" behavior is necessary:
boolean nearX = Math.abs(x1 - x2) < pos1.getIndividualWidths()[0] * 4;
// we will do a simple tolerance comparison
if (yDifference < .1 ||
    (nearX && pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom) ||
    (nearX && pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom))
{
    return Float.compare(x1, x2);
}
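
To try the condition above in isolation, here is a self-contained sketch. Glyph is a hypothetical stand-in for a text position (not the PDFBox TextPosition class), firstWidth plays the role of getIndividualWidths()[0], and the final fallback to a top-to-bottom comparison is an assumption, since the fragment above only shows the "same line" branch.

```java
import java.util.Comparator;

class NearXComparatorSketch {
    /** Hypothetical stand-in for a text position. (0,0) is top left, y grows downward. */
    static final class Glyph {
        final float x;          // left edge
        final float yBottom;    // bottom of the glyph
        final float height;
        final float firstWidth; // stands in for getIndividualWidths()[0]

        Glyph(float x, float yBottom, float height, float firstWidth) {
            this.x = x;
            this.yBottom = yBottom;
            this.height = height;
            this.firstWidth = firstWidth;
        }
    }

    static final Comparator<Glyph> COMPARATOR = (pos1, pos2) -> {
        float x1 = pos1.x;
        float x2 = pos2.x;
        float pos1YBottom = pos1.yBottom;
        float pos2YBottom = pos2.yBottom;
        float pos1YTop = pos1YBottom - pos1.height;
        float pos2YTop = pos2YBottom - pos2.height;
        float yDifference = Math.abs(pos1YBottom - pos2YBottom);

        // the "y position window" only applies to glyphs that are close in x
        boolean nearX = Math.abs(x1 - x2) < pos1.firstWidth * 4;

        if (yDifference < .1
                || (nearX && pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom)
                || (nearX && pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)) {
            return Float.compare(x1, x2);
        }
        // Assumption: otherwise order top to bottom
        return Float.compare(pos1YBottom, pos2YBottom);
    };
}
```

Two glyphs that are close in x and overlap vertically are ordered by x, while a glyph far away in x is ordered by its y position even if the vertical ranges overlap. Note that nearX depends on the width of pos1 only, so compare(a, b) and compare(b, a) may disagree when the widths differ.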