Re: Same line calculation of PDFTextStripper

Kevin Day Tue, 08 Apr 2025 06:30:30 -0700

Hmmm.

Do you know what the extracted text was for those two examples under the
original sort algorithm? Were those text chunks properly extracted with the
expected space between them?

I don't see why the examples you show would lose their word break
detection after the sort change. Or is it possible that the text itself
contains a space glyph? I'm wondering whether that space is getting
sorted out of place because it has zero width...

A few other thoughts:

My proposed change is not a well-thought-out algorithm - it was a hack
to try to emulate the "block detection" you mention. Fine-tuning the
nearX calculation may be all that is needed.

For example, change the 4 to a 1. Or possibly take the average (or maybe
the geometric mean) of the two text positions.
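A minimal sketch of those threshold variants, to make the comparison
concrete. It assumes "the average of the two text positions" means the
average of their glyph widths; the class and method names are
hypothetical and are not part of PDFBox.

```java
// Hypothetical helper, not PDFBox API: alternative thresholds for the nearX test.
final class NearXThresholds {
    private NearXThresholds() {}

    // Original proposal: first glyph width of pos1 times a factor (4 in the hack, 1 as the tighter option).
    static float firstGlyphThreshold(float firstGlyphWidth, float factor) {
        return firstGlyphWidth * factor;
    }

    // Assumption: average the widths of the two text positions, then scale.
    static float arithmeticMeanThreshold(float width1, float width2, float factor) {
        return (width1 + width2) / 2f * factor;
    }

    // Geometric mean is pulled less by one unusually wide glyph than the arithmetic mean.
    static float geometricMeanThreshold(float width1, float width2, float factor) {
        return (float) Math.sqrt((double) width1 * width2) * factor;
    }

    // Same shape as the proposed check: Math.abs(x1 - x2) < threshold.
    static boolean nearX(float x1, float x2, float threshold) {
        return Math.abs(x1 - x2) < threshold;
    }
}
```

With a first glyph width of 5 and start positions 15 units apart, a
factor of 4 treats the two positions as near, while a factor of 1 does
not.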

Actually, as I write this, I think it may be advisable to use the same
algorithm that determines word breaks... If the x positions of the two TPs
are within that threshold, then nearX will be true and the fuzzy logic
would kick in during the sort.
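A rough sketch of that idea, under stated assumptions. The gap
computation is modeled on how I understand PDFTextStripper's word
break check (the smaller of a space-width tolerance and an average
character width tolerance); the constants 0.5 and 0.3 are what I
believe the default tolerances to be and should be checked against the
PDFBox version in use. The class, method, and parameter names are
hypothetical.

```java
// Hypothetical helper, not PDFBox API: derive nearX from a word-break style gap threshold.
final class WordBreakNearX {
    // Assumed defaults for the spacing and average-character tolerances.
    static final float SPACING_TOLERANCE = 0.5f;
    static final float AVERAGE_CHAR_TOLERANCE = 0.3f;

    private WordBreakNearX() {}

    // Largest horizontal gap that would still count as "same word".
    static float wordBreakGap(float widthOfSpace, float averageCharWidth) {
        float deltaSpace = widthOfSpace * SPACING_TOLERANCE;
        float deltaCharWidth = averageCharWidth * AVERAGE_CHAR_TOLERANCE;
        return Math.min(deltaSpace, deltaCharWidth);
    }

    // nearX is true when the gap between the end of the left position and the
    // start of the right one is within the word-break threshold. Overlapping
    // positions (negative gap) also count as near.
    static boolean nearX(float x1, float width1, float x2, float width2,
                         float widthOfSpace, float averageCharWidth) {
        float leftEnd = (x1 <= x2) ? x1 + width1 : x2 + width2;
        float rightStart = Math.max(x1, x2);
        float gap = rightStart - leftEnd;
        return gap <= wordBreakGap(widthOfSpace, averageCharWidth);
    }
}
```

The appeal of this variant is that the sort and the word break
detection would agree on what "close together" means, instead of
relying on a separate magic factor.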

What are your thoughts?

K

Kevin Day

trumpet
p | 480.961.6003 x1002
e | ke...@trumpetinc.com
www.trumpetinc.com <http://trumpetinc.com/> | LinkedIn
<https://www.linkedin.com/company/trumpet-inc.>

On Tue, Apr 8, 2025, 1:13 AM Tilman Hausherr <thaush...@t-online.de> wrote:

> I tried this and get lots of differences, obviously. I looked at two
> files (PDFBOX-2991 and PDFBOX-3019) and the differences make sense, but
> there's a new problem: the segments are not separated.
>
> PDFBOX-2991:
> Costa Mesa, California, benjaminmccan(ätt)gmail.co*mB*enjamin
>
> PDFBOX-3019:
> Originally from Dallas, I have since moved throughout the U.S. and have
> spent mos*t r*hodescc3(ätt)vcu.edu
>
> The part in bold is where I'd expect to have a better separation.
>
> I haven't dealt much with this algorithm... I think the ideal solution
> would be some sort of block detection that goes first, and in a next
> step collect these blocks separately (like the "bead" logic that already
> exists)
>
> Tilman
>
>
> On 07.04.2025 21:19, Kevin Day wrote:
> > Here is my suggestion for a potential fix if preserving the "y position
> > window" behavior is necessary:
> >
> > boolean nearX = Math.abs(x1-x2) < pos1.getIndividualWidths()[0] * 4;
> >
> > // we will do a simple tolerance comparison
> > if (yDifference < .1 ||
> >     nearX && pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
> >     nearX && pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
> > {
> >     return Float.compare(x1, x2);
> > }
> >
>
>
