Thank you for directing me to the discussion. This is pretty much what I expected (the reason for the fuzzy logic is superscript/subscript handling).
I am pretty confident that the problem is not with the comparator. The problem is that we are trying to use a simple sort algorithm to do something that is not simple. I think that a more robust approach would be something like this: Group TextPositons by direction (note: Current TextStripper doesn't do this, which causes a ton of problems for extractions that have watermarks, etc...) Group TextPositions by Y-position. Because these are floats, we will need to do some amount of rounding. So I think we use Math.floor() to make the measurement a whole number Sort each group by X-position This gives us a baseline layout. Now, we need to go through and identify superscript/subscript candidates and merge them into the appropriate line. This is not trivial, but it's a *lot* more deterministic than trying to tune the comparator. Here is a rough idea for a superscript/subscript merge algorithm (this would be applied to each "direction" group): For superscripts: Start at the first Y-position-group. Scan each TextPosition and see if it is a superscript candidate to the next Y position group. If it is a candidate, remove the TextPosition from the current Y-position-group and merge it into the correct location in the *next* Y-position-group. Repeat for the next Y-position-group. The net effect should be that all superscript candidates have been pushed down to the line(s) below. For subscripts: Start at the last Y-position-group. Do a similar scan as before, but looking at the Y-position-group above it. The net effect should be that all subscript candidates have been pushed up to the line(s) above. Finally, we have to define what a superscript candidate is: I think this would be a text position that has a corresponding TextPosition to the left or right of it's X position, but with a Y-Position that is inside the left or right TextPosition's bounding box. This is complex, of course - but I think this is a complex problem. The key here is that we should not be mixing the overall text ordering algorithm, with the specialized superscript/subscript handling algorithm. - K Kevin Day *trumpet**p| *480.961.6003 x1002 *e| *ke...@trumpetinc.com *www.trumpetinc.com <http://trumpetinc.com/> | *LinkedIn <https://www.linkedin.com/company/trumpet-inc.> On Wed, Apr 9, 2025 at 8:08 AM Tilman Hausherr <thaush...@t-online.de> wrote: > On 09.04.2025 16:36, Kevin Day wrote: > > Understood. > > > > My biggest comment is that having a non-transitive comparator in a sort > > algorithm is a really bad idea. It produces all sorts of > non-deterministic > > behavior. > > > > So I'm in agreement that a better solution is needed. > > > > Do you have any history of why the fuzzy logic is in that comparator? > Some > > sort of git blame or anything that might explain? I can't imagine this > was > > part of the original algorithm. > > This happened before it became an apache project. You would have to get > a CVS client to get the history: > > https://sourceforge.net/p/pdfbox/code/ > > We had a discussion about the non transitivity here and nobody came up > with a better algorithm > > https://issues.apache.org/jira/browse/PDFBOX-1512 > > See the comment by Hannes Erven on 21/Mar/14 > > Tilman > > > > > > > > - K > > > > Kevin Day > > > > *trumpet**p| *480.961.6003 x1002 > > *e| *ke...@trumpetinc.com > > *www.trumpetinc.com <http://trumpetinc.com/> | *LinkedIn > > <https://www.linkedin.com/company/trumpet-inc.> > > > > > > On Tue, Apr 8, 2025 at 7:24 AM Tilman Hausherr <thaush...@t-online.de> > > wrote: > > > >> Oops, the one from PDFBOX-3019 is no longer available. The one from > >> PDFBOX-2991 is here, you can test it yourself: > >> > https://issues.apache.org/jira/secure/attachment/12766900/sample-resume.pdf > >> The original extraction is > >> Benjamin Costa Mesa, California benjaminmccan(ätt)gmail.com > >> I don't have any thoughts about the algorithm because I would have to > >> understand it first and I would need a lot of time and quietness for > >> this. At this time, all I can do to help is to test changes and then > >> tell about problems. > >> Another example is the file from > >> https://issues.apache.org/jira/browse/PDFBOX-2794 , here's the output > >> from the text stripper test: > >> Org: [position: 0, size: 4, lines: [Firma Datum : 23.04.2015, SOFTLINE > >> Datenverarbeitungsgesellschaft mbH Kund.Nr. : 44812, Ungerdorf 116 > >> UID-Nr. : ATU30603807, 8200 Gleisdorf Bestellnr. : Fr. Scharler > >> 07.04.2015]] New: [position: 0, size: 4, lines: [Datum : > >> 23.04.2015Firma, Kund.Nr. : 44812SOFTLINE Datenverarbeitungsgesellschaft > >> mbH, UID-Nr. : *ATU30603807Ungerdorf* 116, Bestellnr. : Fr. Scharler > >> 07.04.20158200 Gleisdorf]] > >> Tilman > >> > >> On 08.04.2025 15:29, Kevin Day wrote: > >>> Hmmm. > >>> > >>> Do you know what the extracted text was for those two examples under > the > >>> original sort algorithm? Were those text chunks properly extracted with > >> the > >>> expected space between them? > >>> > >>> I'm not very clear on why the examples you show would be missing a word > >>> break detection after changing the sort. Or is it possible that the > text > >>> itself has a space glyph in it? I'm wondering if that space is maybe > >>> getting sorted weird because it has zero width... > >>> > >>> > >>> > >>> A few other thoughts: > >>> > >>> My proposed change is not a well thought through algorithm - it was a > >> hack > >>> to try to emulate the "block detection" you mention. It may be that > fine > >>> tuning the nearX calculation could be what is needed. > >>> > >>> For example, change the 4 to a 1. Or possibly take the average (or > maybe > >>> the geometric mean) of the two text positions. > >>> > >>> Actually, as I write this, I think it may be advisable to use the same > >>> algorithm that determines word breaks... If the x positions of the two > >> TPs > >>> are within that threshold, then nearX will be true and the fuzzy logic > >>> would kick in during the sort. > >>> > >>> > >>> What are your thoughts? > >>> > >>> K > >>> > >>> Kevin Day > >>> > >>> *trumpet**p| *480.961.6003 x1002 > >>> *e| *ke...@trumpetinc.com > >>> *www.trumpetinc.com<http://trumpetinc.com/> | *LinkedIn > >>> <https://www.linkedin.com/company/trumpet-inc.> > >>> > >>> On Tue, Apr 8, 2025, 1:13 AM Tilman Hausherr<thaush...@t-online.de> > >> wrote: > >>>> I tried this and get lots of differences, obviously. I looked at two > >>>> files (PDFBOX-2991 and PDFBOX-3019) and the difference make sense, but > >>>> there's a new problem: the segments are not separated. > >>>> > >>>> PDFBOX-2991: > >>>> Costa Mesa, California, benjaminmccan(ätt)gmail.co*mB*enjamin > >>>> > >>>> PDFBOX-3019: > >>>> Originally from Dallas, I have since moved throughout the U.S. and > have > >>>> spent mos*t r*hodescc3(ätt)vcu.edu > >>>> > >>>> The part in bold is where I'd expect to have a better separation. > >>>> > >>>> I haven't dealt much with this algorithm... I think the ideal solution > >>>> would be some sort of block detection that goes first, and in a next > >>>> step collect these blocks separately (like the "bead" logic that > already > >>>> exists) > >>>> > >>>> Tilman > >>>> > >>>> > >>>> On 07.04.2025 21:19, Kevin Day wrote: > >>>>> Here is my suggestion for a potential fix if preserving the "y > position > >>>>> window" behavior is necessary: > >>>>> > >>>>> boolean nearX = Math.abs(x1-x2) < pos1.getIndividualWidths()[0] * 4; > >>>>> > >>>>> // we will do a simple tolerance comparison > >>>>> > >>>>> if (yDifference < .1 || > >>>>> > >>>>> nearX && pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom || > >>>>> > >>>>> nearX && pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom) > >>>>> > >>>>> { > >>>>> > >>>>> return Float.compare(x1, x2); > >>>>> > >>>>> } > >>>>> > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org > For additional commands, e-mail: users-h...@pdfbox.apache.org > >