Re: Same line calculation of PDFTextStripper

Tilman Hausherr Tue, 08 Apr 2025 01:14:12 -0700

I tried this and get lots of differences, obviously. I looked at two files (PDFBOX-2991 and PDFBOX-3019) and the differences make sense, but there's a new problem: the segments are not separated.


PDFBOX-2991:
Costa Mesa, California, benjaminmccan(ätt)gmail.co*mB*enjamin


PDFBOX-3019:

Originally from Dallas, I have since moved throughout the U.S. and have spent mos*t r*hodescc3(ätt)vcu.edu


The part in bold is where I'd expect to have a better separation.

I haven't dealt much with this algorithm... I think the ideal solution would be some sort of block detection that goes first, and in a next step collect these blocks separately (like the "bead" logic that already exists).


Tilman


On 07.04.2025 21:19, Kevin Day wrote:

Here is my suggestion for a potential fix if preserving the "y position
window" behavior is necessary:

    boolean nearX = Math.abs(x1 - x2) < pos1.getIndividualWidths()[0] * 4;

    // we will do a simple tolerance comparison
    if (yDifference < .1 ||
        nearX && pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
        nearX && pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
    {
        return Float.compare(x1, x2);
    }
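
To make the effect of the nearX guard easier to see, here is a minimal, self-contained sketch of the comparison. It is an illustration only: Glyph is a hypothetical stand-in for a text position (it is not PDFBox's TextPosition), its width field plays the role of pos1.getIndividualWidths()[0], and the fall-through ordering by bottom y is an assumption about what the surrounding comparator does.

```java
import java.util.Comparator;

class SameLineSketch {
    /** Hypothetical stand-in for a text position; NOT PDFBox's TextPosition. */
    static final class Glyph {
        final float x;        // left edge
        final float yBottom;  // bottom edge, with y growing downwards
        final float height;
        final float width;    // stands in for getIndividualWidths()[0]

        Glyph(float x, float yBottom, float height, float width) {
            this.x = x;
            this.yBottom = yBottom;
            this.height = height;
            this.width = width;
        }
    }

    /** The "y position window" test, applied only when the glyphs are close in x. */
    static final Comparator<Glyph> ORDER = (pos1, pos2) -> {
        float x1 = pos1.x;
        float x2 = pos2.x;
        float pos1YBottom = pos1.yBottom;
        float pos2YBottom = pos2.yBottom;
        float pos1YTop = pos1YBottom - pos1.height;
        float pos2YTop = pos2YBottom - pos2.height;
        float yDifference = Math.abs(pos1YBottom - pos2YBottom);

        boolean nearX = Math.abs(x1 - x2) < pos1.width * 4;

        if (yDifference < .1 ||
            nearX && pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
            nearX && pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
        {
            // Treated as the same line: order left to right.
            return Float.compare(x1, x2);
        }
        // Assumed fall-through: otherwise order by vertical position.
        return pos1YBottom < pos2YBottom ? -1 : 1;
    };

    public static void main(String[] args) {
        Glyph a = new Glyph(10f, 100f, 10f, 5f);
        Glyph near = new Glyph(12f, 104f, 10f, 5f);  // overlaps a in y, close in x
        Glyph far = new Glyph(300f, 96f, 10f, 5f);   // overlaps a in y, far away in x
        System.out.println(ORDER.compare(a, near));
        System.out.println(ORDER.compare(a, far));
    }
}
```

In this sketch, a and near are treated as being on the same line and ordered by x, while a and far overlap vertically but are too far apart horizontally, so they fall through to the y ordering.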
