Re: Same line calculation of PDFTextStripper

Kevin Day Wed, 09 Apr 2025 07:38:51 -0700

Understood.

My biggest concern is that a non-transitive comparator in a sort algorithm
is a really bad idea.  It breaks the Comparator contract, so the resulting
order can depend on the input order and on the sort implementation.
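
To make the transitivity problem concrete, here is a minimal sketch. It assumes a made-up tolerance comparator on plain doubles (the class name, the FUZZY constant and the 1.0 tolerance are illustrative only, this is not the PDFBox comparator). With such a comparator, a "equals" b and b "equals" c, yet a sorts before c, which is exactly what the Comparator contract forbids. On current JDKs, sorting with a comparator like this can also fail with an IllegalArgumentException ("Comparison method violates its general contract!").

```java
import java.util.Comparator;

class FuzzyComparatorDemo {
    // Hypothetical tolerance comparator: values closer than 1.0 count as "equal".
    static final Comparator<Double> FUZZY = (a, b) ->
            Math.abs(a - b) < 1.0 ? 0 : Double.compare(a, b);

    public static void main(String[] args) {
        double a = 0.0, b = 0.6, c = 1.2;
        System.out.println(FUZZY.compare(a, b)); // zero: a is treated as equal to b
        System.out.println(FUZZY.compare(b, c)); // zero: b is treated as equal to c
        System.out.println(FUZZY.compare(a, c)); // negative: yet a sorts before c
    }
}
```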


So I'm in agreement that a better solution is needed.

Do you have any history of why the fuzzy logic is in that comparator?  Some
sort of git blame or anything that might explain?  I can't imagine this was
part of the original algorithm.

- K

Kevin Day

trumpet
p| 480.961.6003 x1002
e| ke...@trumpetinc.com
www.trumpetinc.com <http://trumpetinc.com/> | LinkedIn <https://www.linkedin.com/company/trumpet-inc.>


On Tue, Apr 8, 2025 at 7:24 AM Tilman Hausherr <thaush...@t-online.de>
wrote:

> Oops, the one from PDFBOX-3019 is no longer available. The one from
> PDFBOX-2991 is here, you can test it yourself:
> https://issues.apache.org/jira/secure/attachment/12766900/sample-resume.pdf
> The original extraction is
> Benjamin Costa Mesa, California benjaminmccan(ätt)gmail.com
> I don't have any thoughts about the algorithm because I would have to
> understand it first, and for that I would need a lot of quiet time.
> At this time, all I can do to help is to test changes and then
> report problems.
> Another example is the file from
> https://issues.apache.org/jira/browse/PDFBOX-2794 , here's the output
> from the text stripper test:
> Org: [position: 0, size: 4, lines: [
>   Firma Datum : 23.04.2015,
>   SOFTLINE Datenverarbeitungsgesellschaft mbH Kund.Nr. : 44812,
>   Ungerdorf 116 UID-Nr. : ATU30603807,
>   8200 Gleisdorf Bestellnr. : Fr. Scharler 07.04.2015]]
> New: [position: 0, size: 4, lines: [
>   Datum : 23.04.2015Firma,
>   Kund.Nr. : 44812SOFTLINE Datenverarbeitungsgesellschaft mbH,
>   UID-Nr. : *ATU30603807Ungerdorf* 116,
>   Bestellnr. : Fr. Scharler 07.04.20158200 Gleisdorf]]
> Tilman
>
> On 08.04.2025 15:29, Kevin Day wrote:
> > Hmmm.
> >
> > Do you know what the extracted text was for those two examples under the
> > original sort algorithm? Were those text chunks properly extracted with the
> > expected space between them?
> >
> > I'm not very clear on why the examples you show would be missing a word
> > break detection after changing the sort. Or is it possible that the text
> > itself has a space glyph in it? I'm wondering if that space is maybe
> > getting sorted weird because it has zero width...
> >
> >
> >
> > A few other thoughts:
> >
> > My proposed change is not a well-thought-through algorithm - it was a hack
> > to try to emulate the "block detection" you mention. It may be that
> > fine-tuning the nearX calculation is what is needed.
> >
> > For example, change the 4 to a 1. Or possibly take the average (or maybe
> > the geometric mean) of the two text positions.
> >
> > Actually, as I write this, I think it may be advisable to use the same
> > algorithm that determines word breaks... If the x positions of the two TPs
> > are within that threshold, then nearX will be true and the fuzzy logic
> > would kick in during the sort.
> >
> >
> > What are your thoughts?
> >
> > K
> >
> > Kevin Day
> >
> >
> > On Tue, Apr 8, 2025, 1:13 AM Tilman Hausherr <thaush...@t-online.de> wrote:
> >
> >> I tried this and get lots of differences, obviously. I looked at two
> >> files (PDFBOX-2991 and PDFBOX-3019) and the differences make sense, but
> >> there's a new problem: the segments are not separated.
> >>
> >> PDFBOX-2991:
> >> Costa Mesa, California, benjaminmccan(ätt)gmail.co*mB*enjamin
> >>
> >> PDFBOX-3019:
> >> Originally from Dallas, I have since moved throughout the U.S. and have
> >> spent mos*t r*hodescc3(ätt)vcu.edu
> >>
> >> The part in bold is where I'd expect to have a better separation.
> >>
> >> I haven't dealt much with this algorithm... I think the ideal solution
> >> would be some sort of block detection that goes first, and in a next
> >> step collect these blocks separately (like the "bead" logic that already
> >> exists)
> >>
> >> Tilman
> >>
> >>
> >> On 07.04.2025 21:19, Kevin Day wrote:
> >>> Here is my suggestion for a potential fix if preserving the "y position
> >>> window" behavior is necessary:
> >>>
> >>> boolean nearX = Math.abs(x1 - x2) < pos1.getIndividualWidths()[0] * 4;
> >>>
> >>> // we will do a simple tolerance comparison
> >>> if (yDifference < .1 ||
> >>>     (nearX && pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom) ||
> >>>     (nearX && pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom))
> >>> {
> >>>     return Float.compare(x1, x2);
> >>> }
> >>>
> >>
>
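
The suggestion quoted at the bottom of the thread can be tried in isolation with a sketch like the one below. Everything apart from the condition itself is an assumption for illustration: Glyph is a hypothetical stand-in for a text position (not the PDFBox TextPosition class), nearXFactor is an invented parameter replacing the hard-coded 4, and the fallback ordering by bottom edge is a guess, since the quoted snippet does not show that branch.

```java
import java.util.Comparator;

class NearXComparatorSketch {
    /** Hypothetical stand-in for a text position; not the PDFBox TextPosition class. */
    static final class Glyph {
        final float x, yBottom, height, firstWidth;
        Glyph(float x, float yBottom, float height, float firstWidth) {
            this.x = x;
            this.yBottom = yBottom;
            this.height = height;
            this.firstWidth = firstWidth;
        }
    }

    /** nearXFactor is an assumed tuning knob; the quoted snippet hard-codes 4. */
    static Comparator<Glyph> byLineThenX(float nearXFactor) {
        return (p1, p2) -> {
            float p1Top = p1.yBottom - p1.height;
            float p2Top = p2.yBottom - p2.height;
            float yDifference = Math.abs(p1.yBottom - p2.yBottom);
            boolean nearX = Math.abs(p1.x - p2.x) < p1.firstWidth * nearXFactor;
            boolean p2BottomInsideP1 = p2.yBottom >= p1Top && p2.yBottom <= p1.yBottom;
            boolean p1BottomInsideP2 = p1.yBottom >= p2Top && p1.yBottom <= p2.yBottom;
            if (yDifference < .1 || (nearX && (p2BottomInsideP1 || p1BottomInsideP2))) {
                return Float.compare(p1.x, p2.x);
            }
            // Assumption: otherwise order by bottom edge, with y growing downward.
            return Float.compare(p1.yBottom, p2.yBottom);
        };
    }

    public static void main(String[] args) {
        Comparator<Glyph> cmp = byLineThenX(4f);
        Glyph left = new Glyph(10f, 100f, 10f, 5f);
        Glyph close = new Glyph(20f, 103f, 10f, 5f); // overlapping y band, near in x
        Glyph far = new Glyph(300f, 97f, 10f, 5f);   // overlapping y band, far in x
        System.out.println(cmp.compare(left, close)); // negative: same line, ordered by x
        System.out.println(cmp.compare(far, left));   // negative: treated as separate lines, ordered by y
    }
}
```

Note that nearX here depends only on the width of the first argument, so compare(a, b) and compare(b, a) are not guaranteed to be mirror images when the widths differ. A symmetric measure, such as the mean of both widths mentioned earlier in the thread, would avoid that particular asymmetry, though it would not by itself make the comparator transitive.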
