Understood. My biggest comment is that having a non-transitive comparator in a sort algorithm is a really bad idea. It produces all sorts of non-deterministic behavior.
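To make the transitivity point concrete, here is a minimal sketch of how a tolerance-based comparison can end up with a cycle. It is not the PDFBox comparator: the Pos class, the Y_TOLERANCE value and the coordinates are all made up for illustration.

```java
import java.util.Comparator;

class FuzzyComparatorDemo {
    /** Hypothetical stand-in for a text position: only x and y. */
    static final class Pos {
        final float x;
        final float y;

        Pos(float x, float y) {
            this.x = x;
            this.y = y;
        }
    }

    /** Assumed tolerance for "same line"; not the value PDFBox uses. */
    static final float Y_TOLERANCE = 1.0f;

    /** Orders by x when the y values are within the tolerance, otherwise by y. */
    static final Comparator<Pos> FUZZY = (a, b) ->
            Math.abs(a.y - b.y) < Y_TOLERANCE
                    ? Float.compare(a.x, b.x)
                    : Float.compare(a.y, b.y);

    /** True if a < b and b < c, but the comparator does not report a < c. */
    static boolean violatesTransitivity(Pos a, Pos b, Pos c) {
        return FUZZY.compare(a, b) < 0
                && FUZZY.compare(b, c) < 0
                && FUZZY.compare(a, c) >= 0;
    }

    public static void main(String[] args) {
        // Each neighbouring pair is within the tolerance, but the first and last are not.
        Pos p = new Pos(10f, 1.2f);
        Pos q = new Pos(20f, 0.6f);
        Pos r = new Pos(30f, 0.0f);

        System.out.println("p < q: " + (FUZZY.compare(p, q) < 0));
        System.out.println("q < r: " + (FUZZY.compare(q, r) < 0));
        System.out.println("p < r: " + (FUZZY.compare(p, r) < 0));
        System.out.println("transitivity violated: " + violatesTransitivity(p, q, r));
    }
}
```

With a comparator like this, the final order depends on which pairs the sort happens to compare, and Java's object sort may also throw IllegalArgumentException ("Comparison method violates its general contract!") on some inputs.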
So I'm in agreement that a better solution is needed.

Do you have any history of why the fuzzy logic is in that comparator? Some sort of git blame or anything that might explain? I can't imagine this was part of the original algorithm.

- K

Kevin Day
trumpet | p: 480.961.6003 x1002 | e: ke...@trumpetinc.com
www.trumpetinc.com <http://trumpetinc.com/> | LinkedIn <https://www.linkedin.com/company/trumpet-inc.>

On Tue, Apr 8, 2025 at 7:24 AM Tilman Hausherr <thaush...@t-online.de> wrote:

> Oops, the one from PDFBOX-3019 is no longer available. The one from
> PDFBOX-2991 is here, you can test it yourself:
>
> https://issues.apache.org/jira/secure/attachment/12766900/sample-resume.pdf
>
> The original extraction is
>
> Benjamin Costa Mesa, California benjaminmccan(ätt)gmail.com
>
> I don't have any thoughts about the algorithm because I would have to
> understand it first and I would need a lot of time and quietness for
> this. At this time, all I can do to help is to test changes and then
> tell about problems.
>
> Another example is the file from
> https://issues.apache.org/jira/browse/PDFBOX-2794 , here's the output
> from the text stripper test:
>
> Org: [position: 0, size: 4, lines: [Firma Datum : 23.04.2015, SOFTLINE
> Datenverarbeitungsgesellschaft mbH Kund.Nr. : 44812, Ungerdorf 116
> UID-Nr. : ATU30603807, 8200 Gleisdorf Bestellnr. : Fr. Scharler
> 07.04.2015]]
>
> New: [position: 0, size: 4, lines: [Datum :
> 23.04.2015Firma, Kund.Nr. : 44812SOFTLINE Datenverarbeitungsgesellschaft
> mbH, UID-Nr. : *ATU30603807Ungerdorf* 116, Bestellnr. : Fr. Scharler
> 07.04.20158200 Gleisdorf]]
>
> Tilman
>
> On 08.04.2025 15:29, Kevin Day wrote:
> > Hmmm.
> >
> > Do you know what the extracted text was for those two examples under the
> > original sort algorithm? Were those text chunks properly extracted with
> > the expected space between them?
> >
> > I'm not very clear on why the examples you show would be missing a word
> > break detection after changing the sort.
> > Or is it possible that the text itself has a space glyph in it? I'm
> > wondering if that space is maybe getting sorted weird because it has
> > zero width...
> >
> > A few other thoughts:
> >
> > My proposed change is not a well thought through algorithm - it was a
> > hack to try to emulate the "block detection" you mention. It may be that
> > fine tuning the nearX calculation could be what is needed.
> >
> > For example, change the 4 to a 1. Or possibly take the average (or maybe
> > the geometric mean) of the two text positions.
> >
> > Actually, as I write this, I think it may be advisable to use the same
> > algorithm that determines word breaks... If the x positions of the two
> > TPs are within that threshold, then nearX will be true and the fuzzy
> > logic would kick in during the sort.
> >
> > What are your thoughts?
> >
> > K
> >
> > Kevin Day
> >
> > On Tue, Apr 8, 2025, 1:13 AM Tilman Hausherr <thaush...@t-online.de>
> > wrote:
> >
> >> I tried this and get lots of differences, obviously. I looked at two
> >> files (PDFBOX-2991 and PDFBOX-3019) and the differences make sense, but
> >> there's a new problem: the segments are not separated.
> >>
> >> PDFBOX-2991:
> >> Costa Mesa, California, benjaminmccan(ätt)gmail.co*mB*enjamin
> >>
> >> PDFBOX-3019:
> >> Originally from Dallas, I have since moved throughout the U.S. and have
> >> spent mos*t r*hodescc3(ätt)vcu.edu
> >>
> >> The part in bold is where I'd expect to have a better separation.
> >>
> >> I haven't dealt much with this algorithm...
> >> I think the ideal solution would be some sort of block detection that
> >> goes first, and in a next step collect these blocks separately (like
> >> the "bead" logic that already exists)
> >>
> >> Tilman
> >>
> >> On 07.04.2025 21:19, Kevin Day wrote:
> >>> Here is my suggestion for a potential fix if preserving the "y position
> >>> window" behavior is necessary:
> >>>
> >>>     boolean nearX = Math.abs(x1 - x2) < pos1.getIndividualWidths()[0] * 4;
> >>>
> >>>     // we will do a simple tolerance comparison
> >>>     if (yDifference < .1 ||
> >>>         nearX && pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
> >>>         nearX && pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
> >>>     {
> >>>         return Float.compare(x1, x2);
> >>>     }