Re: Same line calculation of PDFTextStripper

Kevin Day Wed, 09 Apr 2025 14:04:12 -0700

Thank you for directing me to the discussion.  This is pretty much what I
expected (the reason for the fuzzy logic is superscript/subscript handling).


I am pretty confident that the problem is not with the comparator.  The
problem is that we are trying to use a simple sort algorithm to do
something that is not simple.

I think that a more robust approach would be something like this:

Group TextPositons by direction (note: Current TextStripper doesn't do
this, which causes a ton of problems for extractions that have watermarks,
etc...)
Group TextPositions by Y-position.  Because these are floats, we will need
to do some amount of rounding.  So I think we use Math.floor() to make the
measurement a whole number
Sort each group by X-position

This gives us a baseline layout.  Now, we need to go through and identify
superscript/subscript candidates and merge them into the appropriate line.
This is not trivial, but it's a *lot* more deterministic than trying to
tune the comparator.


Here is a rough idea for a superscript/subscript merge algorithm (this
would be applied to each "direction" group):

For superscripts:

Start at the first Y-position-group.  Scan each TextPosition and see if it
is a superscript candidate to the next Y position group.  If it is a
candidate, remove the TextPosition from the current Y-position-group and
merge it into the correct location in the *next* Y-position-group.
Repeat for the next Y-position-group.

The net effect should be that all superscript candidates have been pushed
down to the line(s) below.

For subscripts:

Start at the last Y-position-group.  Do a similar scan as before, but
looking at the Y-position-group above it.

The net effect should be that all subscript candidates have been pushed up
to the line(s) above.


Finally, we have to define what a superscript candidate is:  I think this
would be a text position that has a corresponding TextPosition to the left
or right of it's X position, but with a Y-Position that is inside the left
or right TextPosition's bounding box.


This is complex, of course - but I think this is a complex problem.  The
key here is that we should not be mixing the overall text ordering
algorithm, with the specialized superscript/subscript handling algorithm.

- K

Kevin Day

*trumpet**p| *480.961.6003 x1002
*e| *ke...@trumpetinc.com
*www.trumpetinc.com <http://trumpetinc.com/> | *LinkedIn
<https://www.linkedin.com/company/trumpet-inc.>


On Wed, Apr 9, 2025 at 8:08 AM Tilman Hausherr <thaush...@t-online.de>
wrote:

> On 09.04.2025 16:36, Kevin Day wrote:
> > Understood.
> >
> > My biggest comment is that having a non-transitive comparator in a sort
> > algorithm is a really bad idea.  It produces all sorts of
> non-deterministic
> > behavior.
> >
> > So I'm in agreement that a better solution is needed.
> >
> > Do you have any history of why the fuzzy logic is in that comparator?
> Some
> > sort of git blame or anything that might explain?  I can't imagine this
> was
> > part of the original algorithm.
>
> This happened before it became an apache project. You would have to get
> a CVS client to get the history:
>
> https://sourceforge.net/p/pdfbox/code/
>
> We had a discussion about the non transitivity here and nobody came up
> with a better algorithm
>
> https://issues.apache.org/jira/browse/PDFBOX-1512
>
> See the comment by Hannes Erven on 21/Mar/14
>
> Tilman
>
>
>
>
> >
> > - K
> >
> > Kevin Day
> >
> > *trumpet**p| *480.961.6003 x1002
> > *e| *ke...@trumpetinc.com
> > *www.trumpetinc.com <http://trumpetinc.com/> | *LinkedIn
> > <https://www.linkedin.com/company/trumpet-inc.>
> >
> >
> > On Tue, Apr 8, 2025 at 7:24 AM Tilman Hausherr <thaush...@t-online.de>
> > wrote:
> >
> >> Oops, the one from PDFBOX-3019 is no longer available. The one from
> >> PDFBOX-2991 is here, you can test it yourself:
> >>
> https://issues.apache.org/jira/secure/attachment/12766900/sample-resume.pdf
> >> The original extraction is
> >> Benjamin Costa Mesa, California benjaminmccan(ätt)gmail.com
> >> I don't have any thoughts about the algorithm because I would have to
> >> understand it first and I would need a lot of time and quietness for
> >> this. At this time, all I can do to help is to test changes and then
> >> tell about problems.
> >> Another example is the file from
> >> https://issues.apache.org/jira/browse/PDFBOX-2794 , here's the output
> >> from the text stripper test:
> >> Org: [position: 0, size: 4, lines: [Firma Datum : 23.04.2015, SOFTLINE
> >> Datenverarbeitungsgesellschaft mbH Kund.Nr. : 44812, Ungerdorf 116
> >> UID-Nr. : ATU30603807, 8200 Gleisdorf Bestellnr. : Fr. Scharler
> >> 07.04.2015]] New: [position: 0, size: 4, lines: [Datum :
> >> 23.04.2015Firma, Kund.Nr. : 44812SOFTLINE Datenverarbeitungsgesellschaft
> >> mbH, UID-Nr. : *ATU30603807Ungerdorf* 116, Bestellnr. : Fr. Scharler
> >> 07.04.20158200 Gleisdorf]]
> >> Tilman
> >>
> >> On 08.04.2025 15:29, Kevin Day wrote:
> >>> Hmmm.
> >>>
> >>> Do you know what the extracted text was for those two examples under
> the
> >>> original sort algorithm? Were those text chunks properly extracted with
> >> the
> >>> expected space between them?
> >>>
> >>> I'm not very clear on why the examples you show would be missing a word
> >>> break detection after changing the sort. Or is it possible that the
> text
> >>> itself has a space glyph in it? I'm wondering if that space is maybe
> >>> getting sorted weird because it has zero width...
> >>>
> >>>
> >>>
> >>> A few other thoughts:
> >>>
> >>> My proposed change is not a well thought through algorithm - it was a
> >> hack
> >>> to try to emulate the "block detection" you mention. It may be that
> fine
> >>> tuning the nearX calculation could be what is needed.
> >>>
> >>> For example, change the 4 to a 1. Or possibly take the average (or
> maybe
> >>> the geometric mean) of the two text positions.
> >>>
> >>> Actually, as I write this, I think it may be advisable to use the same
> >>> algorithm that determines word breaks... If the x positions of the two
> >> TPs
> >>> are within that threshold, then nearX will be true and the fuzzy logic
> >>> would kick in during the sort.
> >>>
> >>>
> >>> What are your thoughts?
> >>>
> >>> K
> >>>
> >>> Kevin Day
> >>>
> >>> *trumpet**p| *480.961.6003 x1002
> >>> *e| *ke...@trumpetinc.com
> >>> *www.trumpetinc.com<http://trumpetinc.com/> | *LinkedIn
> >>> <https://www.linkedin.com/company/trumpet-inc.>
> >>>
> >>> On Tue, Apr 8, 2025, 1:13 AM Tilman Hausherr<thaush...@t-online.de>
> >> wrote:
> >>>> I tried this and get lots of differences, obviously. I looked at two
> >>>> files (PDFBOX-2991 and PDFBOX-3019) and the difference make sense, but
> >>>> there's a new problem: the segments are not separated.
> >>>>
> >>>> PDFBOX-2991:
> >>>> Costa Mesa, California, benjaminmccan(ätt)gmail.co*mB*enjamin
> >>>>
> >>>> PDFBOX-3019:
> >>>> Originally from Dallas, I have since moved throughout the U.S. and
> have
> >>>> spent mos*t r*hodescc3(ätt)vcu.edu
> >>>>
> >>>> The part in bold is where I'd expect to have a better separation.
> >>>>
> >>>> I haven't dealt much with this algorithm... I think the ideal solution
> >>>> would be some sort of block detection that goes first, and in a next
> >>>> step collect these blocks separately (like the "bead" logic that
> already
> >>>> exists)
> >>>>
> >>>> Tilman
> >>>>
> >>>>
> >>>> On 07.04.2025 21:19, Kevin Day wrote:
> >>>>> Here is my suggestion for a potential fix if preserving the "y
> position
> >>>>> window" behavior is necessary:
> >>>>>
> >>>>> boolean nearX = Math.abs(x1-x2) < pos1.getIndividualWidths()[0] * 4;
> >>>>>
> >>>>> // we will do a simple tolerance comparison
> >>>>>
> >>>>> if (yDifference < .1 ||
> >>>>>
> >>>>> nearX && pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
> >>>>>
> >>>>> nearX && pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
> >>>>>
> >>>>> {
> >>>>>
> >>>>> return Float.compare(x1, x2);
> >>>>>
> >>>>> }
> >>>>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: users-h...@pdfbox.apache.org
>
>

Re: Same line calculation of PDFTextStripper

Reply via email to