Re: Same line calculation of PDFTextStripper

sahy...@fileaffairs.de Tue, 13 May 2025 00:18:59 -0700

There is no public test suite other than files attached to Jira
tickets. There is a suite of some 10000 files which we use for
regression tests prior to new releases but also when doing larger
changes but that can not be shared due to data privacy, licensing ...
(it has been public but with new acts like GDPR we removed public
access)


BR
Maruan

Am Montag, dem 12.05.2025 um 12:37 -0700 schrieb Kevin Day:
> Are there test files that exercise the superscript/subscript
> correction
> that the non-transitive comparator is supposed to address?  And is
> there
> some way that I can get access to the test suite that includes 2991? 
> I can
> copy the file down from the Jira ticket, but I hate to do a ton of
> development without being able to run proper tests...
> 
> This is becoming a pretty big issue for us (we now have hundreds of
> files
> where the sorted text extraction is producing very bad results), so
> I'm
> going to do some work on it in the next couple of weeks. 
> Unfortunately,
> this is probably not going to be a surgical change (i.e. a change to
> the sort algorithm).
> 
> I would like to make sure that any algorithm I come up with doesn't
> introduce big regressions.
> 
> One challenge I am seeing is that all of this logic depends heavily
> on
> whether two text positions are in the same word or not, so it will
> almost
> certainly become necessary to do word clustering and line break
> determination as part of the "sort" (not after the fact like things
> are
> now).  Whatever I come up with, it is probably going to be a clean
> implementation, which is always risky without lots of unit tests.
> 
> Thanks,
> 
> K
> 
> Kevin Day
> 
> *trumpet**p| *480.961.6003 x1002
> *e| *ke...@trumpetinc.com
> *www.trumpetinc.com <http://trumpetinc.com/> | *LinkedIn
> <https://www.linkedin.com/company/trumpet-inc.>
> 
> On Thu, Apr 10, 2025, 6:35 AM Kevin Day <ke...@trumpetinc.com> wrote:
> 
> > I had one other thought on this.
> > 
> > Without question, the ordering of the TextPositions after the JRE
> > sort
> > completes is not consistent with the comparator. It should be easy
> > to just
> > loop the sorted TPs and check to ensure the comparator always
> > returns <=0.
> > 
> > I'm wondering if the slower fallback sort would not have this
> > problem. If
> > not, then it may be faster to do the JRE sort as a pre-sort, then
> > just
> > always call the fallback sort. The hope would be that if elements
> > were
> > already "mostly" sorted, then the fallback sort would be efficient.
> > 
> > I have no idea if that will actually be the case, but it may be
> > something
> > to investigate.  The superscript/subscript algorithm I described in
> > my last
> > post is effectively a merge sort...
> > 
> > Another possibility would be to use the JRE sort without
> > subscript/superscript handling. Then use the fallback sort with
> > subscript/superscript handling.
> > 
> > K
> > 
> > Kevin Day
> > 
> > *trumpet**p| *480.961.6003 x1002
> > *e| *ke...@trumpetinc.com
> > *www.trumpetinc.com <http://trumpetinc.com/> | *LinkedIn
> > <https://www.linkedin.com/company/trumpet-inc.>
> > 
> > On Wed, Apr 9, 2025, 2:03 PM Kevin Day <ke...@trumpetinc.com>
> > wrote:
> > 
> > > Thank you for directing me to the discussion.  This is pretty
> > > much what I
> > > expected (the reason for the fuzzy logic is superscript/subscript
> > > handling).
> > > 
> > > I am pretty confident that the problem is not with the
> > > comparator.  The
> > > problem is that we are trying to use a simple sort algorithm to
> > > do
> > > something that is not simple.
> > > 
> > > I think that a more robust approach would be something like this:
> > > 
> > > Group TextPositons by direction (note: Current TextStripper
> > > doesn't do
> > > this, which causes a ton of problems for extractions that have
> > > watermarks,
> > > etc...)
> > > Group TextPositions by Y-position.  Because these are floats, we
> > > will
> > > need to do some amount of rounding.  So I think we use
> > > Math.floor() to make
> > > the measurement a whole number
> > > Sort each group by X-position
> > > 
> > > This gives us a baseline layout.  Now, we need to go through and
> > > identify
> > > superscript/subscript candidates and merge them into the
> > > appropriate line.
> > > This is not trivial, but it's a *lot* more deterministic than
> > > trying to
> > > tune the comparator.
> > > 
> > > 
> > > Here is a rough idea for a superscript/subscript merge algorithm
> > > (this
> > > would be applied to each "direction" group):
> > > 
> > > For superscripts:
> > > 
> > > Start at the first Y-position-group.  Scan each TextPosition and
> > > see if
> > > it is a superscript candidate to the next Y position group.  If
> > > it is a
> > > candidate, remove the TextPosition from the current Y-position-
> > > group and
> > > merge it into the correct location in the *next* Y-position-
> > > group.
> > > Repeat for the next Y-position-group.
> > > 
> > > The net effect should be that all superscript candidates have
> > > been pushed
> > > down to the line(s) below.
> > > 
> > > For subscripts:
> > > 
> > > Start at the last Y-position-group.  Do a similar scan as before,
> > > but
> > > looking at the Y-position-group above it.
> > > 
> > > The net effect should be that all subscript candidates have been
> > > pushed
> > > up to the line(s) above.
> > > 
> > > 
> > > Finally, we have to define what a superscript candidate is:  I
> > > think this
> > > would be a text position that has a corresponding TextPosition to
> > > the left
> > > or right of it's X position, but with a Y-Position that is inside
> > > the left
> > > or right TextPosition's bounding box.
> > > 
> > > 
> > > This is complex, of course - but I think this is a complex
> > > problem.  The
> > > key here is that we should not be mixing the overall text
> > > ordering
> > > algorithm, with the specialized superscript/subscript handling
> > > algorithm.
> > > 
> > > - K
> > > 
> > > Kevin Day
> > > 
> > > *trumpet**p| *480.961.6003 x1002
> > > *e| *ke...@trumpetinc.com
> > > *www.trumpetinc.com <http://trumpetinc.com/> | *LinkedIn
> > > <https://www.linkedin.com/company/trumpet-inc.>
> > > 
> > > 
> > > On Wed, Apr 9, 2025 at 8:08 AM Tilman Hausherr
> > > <thaush...@t-online.de>
> > > wrote:
> > > 
> > > > On 09.04.2025 16:36, Kevin Day wrote:
> > > > > Understood.
> > > > > 
> > > > > My biggest comment is that having a non-transitive comparator
> > > > > in a sort
> > > > > algorithm is a really bad idea.  It produces all sorts of
> > > > non-deterministic
> > > > > behavior.
> > > > > 
> > > > > So I'm in agreement that a better solution is needed.
> > > > > 
> > > > > Do you have any history of why the fuzzy logic is in that
> > > > > comparator?
> > > > Some
> > > > > sort of git blame or anything that might explain?  I can't
> > > > > imagine
> > > > this was
> > > > > part of the original algorithm.
> > > > 
> > > > This happened before it became an apache project. You would
> > > > have to get
> > > > a CVS client to get the history:
> > > > 
> > > > https://sourceforge.net/p/pdfbox/code/
> > > > 
> > > > We had a discussion about the non transitivity here and nobody
> > > > came up
> > > > with a better algorithm
> > > > 
> > > > https://issues.apache.org/jira/browse/PDFBOX-1512
> > > > 
> > > > See the comment by Hannes Erven on 21/Mar/14
> > > > 
> > > > Tilman
> > > > 
> > > > 
> > > > 
> > > > 
> > > > > 
> > > > > - K
> > > > > 
> > > > > Kevin Day
> > > > > 
> > > > > *trumpet**p| *480.961.6003 x1002
> > > > > *e| *ke...@trumpetinc.com
> > > > > *www.trumpetinc.com <http://trumpetinc.com/> | *LinkedIn
> > > > > <https://www.linkedin.com/company/trumpet-inc.>
> > > > > 
> > > > > 
> > > > > On Tue, Apr 8, 2025 at 7:24 AM Tilman Hausherr
> > > > > <thaush...@t-online.de>
> > > > > wrote:
> > > > > 
> > > > > > Oops, the one from PDFBOX-3019 is no longer available. The
> > > > > > one from
> > > > > > PDFBOX-2991 is here, you can test it yourself:
> > > > > > 
> > > > https://issues.apache.org/jira/secure/attachment/12766900/sample-resume.pdf
> > > > > > The original extraction is
> > > > > > Benjamin Costa Mesa, California benjaminmccan(ätt)gmail.com
> > > > > > I don't have any thoughts about the algorithm because I
> > > > > > would have to
> > > > > > understand it first and I would need a lot of time and
> > > > > > quietness for
> > > > > > this. At this time, all I can do to help is to test changes
> > > > > > and then
> > > > > > tell about problems.
> > > > > > Another example is the file from
> > > > > > https://issues.apache.org/jira/browse/PDFBOX-2794 , here's
> > > > > > the output
> > > > > > from the text stripper test:
> > > > > > Org: [position: 0, size: 4, lines: [Firma Datum :
> > > > > > 23.04.2015,
> > > > SOFTLINE
> > > > > > Datenverarbeitungsgesellschaft mbH Kund.Nr. : 44812,
> > > > > > Ungerdorf 116
> > > > > > UID-Nr. : ATU30603807, 8200 Gleisdorf Bestellnr. : Fr.
> > > > > > Scharler
> > > > > > 07.04.2015]] New: [position: 0, size: 4, lines: [Datum :
> > > > > > 23.04.2015Firma, Kund.Nr. : 44812SOFTLINE
> > > > Datenverarbeitungsgesellschaft
> > > > > > mbH, UID-Nr. : *ATU30603807Ungerdorf* 116, Bestellnr. : Fr.
> > > > > > Scharler
> > > > > > 07.04.20158200 Gleisdorf]]
> > > > > > Tilman
> > > > > > 
> > > > > > On 08.04.2025 15:29, Kevin Day wrote:
> > > > > > > Hmmm.
> > > > > > > 
> > > > > > > Do you know what the extracted text was for those two
> > > > > > > examples under
> > > > the
> > > > > > > original sort algorithm? Were those text chunks properly
> > > > > > > extracted
> > > > with
> > > > > > the
> > > > > > > expected space between them?
> > > > > > > 
> > > > > > > I'm not very clear on why the examples you show would be
> > > > > > > missing a
> > > > word
> > > > > > > break detection after changing the sort. Or is it
> > > > > > > possible that the
> > > > text
> > > > > > > itself has a space glyph in it? I'm wondering if that
> > > > > > > space is maybe
> > > > > > > getting sorted weird because it has zero width...
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > A few other thoughts:
> > > > > > > 
> > > > > > > My proposed change is not a well thought through
> > > > > > > algorithm - it was a
> > > > > > hack
> > > > > > > to try to emulate the "block detection" you mention. It
> > > > > > > may be that
> > > > fine
> > > > > > > tuning the nearX calculation could be what is needed.
> > > > > > > 
> > > > > > > For example, change the 4 to a 1. Or possibly take the
> > > > > > > average (or
> > > > maybe
> > > > > > > the geometric mean) of the two text positions.
> > > > > > > 
> > > > > > > Actually, as I write this, I think it may be advisable to
> > > > > > > use the
> > > > same
> > > > > > > algorithm that determines word breaks... If the x
> > > > > > > positions of the
> > > > two
> > > > > > TPs
> > > > > > > are within that threshold, then nearX will be true and
> > > > > > > the fuzzy
> > > > logic
> > > > > > > would kick in during the sort.
> > > > > > > 
> > > > > > > 
> > > > > > > What are your thoughts?
> > > > > > > 
> > > > > > > K
> > > > > > > 
> > > > > > > Kevin Day
> > > > > > > 
> > > > > > > *trumpet**p| *480.961.6003 x1002
> > > > > > > *e| *ke...@trumpetinc.com
> > > > > > > *www.trumpetinc.com<http://trumpetinc.com/> | *LinkedIn
> > > > > > > <https://www.linkedin.com/company/trumpet-inc.>
> > > > > > > 
> > > > > > > On Tue, Apr 8, 2025, 1:13 AM Tilman
> > > > > > > Hausherr<thaush...@t-online.de>
> > > > > > wrote:
> > > > > > > > I tried this and get lots of differences, obviously. I
> > > > > > > > looked at two
> > > > > > > > files (PDFBOX-2991 and PDFBOX-3019) and the difference
> > > > > > > > make sense,
> > > > but
> > > > > > > > there's a new problem: the segments are not separated.
> > > > > > > > 
> > > > > > > > PDFBOX-2991:
> > > > > > > > Costa Mesa, California,
> > > > > > > > benjaminmccan(ätt)gmail.co*mB*enjamin
> > > > > > > > 
> > > > > > > > PDFBOX-3019:
> > > > > > > > Originally from Dallas, I have since moved throughout
> > > > > > > > the U.S. and
> > > > have
> > > > > > > > spent mos*t r*hodescc3(ätt)vcu.edu
> > > > > > > > 
> > > > > > > > The part in bold is where I'd expect to have a better
> > > > > > > > separation.
> > > > > > > > 
> > > > > > > > I haven't dealt much with this algorithm... I think the
> > > > > > > > ideal
> > > > solution
> > > > > > > > would be some sort of block detection that goes first,
> > > > > > > > and in a next
> > > > > > > > step collect these blocks separately (like the "bead"
> > > > > > > > logic that
> > > > already
> > > > > > > > exists)
> > > > > > > > 
> > > > > > > > Tilman
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On 07.04.2025 21:19, Kevin Day wrote:
> > > > > > > > > Here is my suggestion for a potential fix if
> > > > > > > > > preserving the "y
> > > > position
> > > > > > > > > window" behavior is necessary:
> > > > > > > > > 
> > > > > > > > > boolean nearX = Math.abs(x1-x2) <
> > > > > > > > > pos1.getIndividualWidths()[0] *
> > > > 4;
> > > > > > > > > 
> > > > > > > > > // we will do a simple tolerance comparison
> > > > > > > > > 
> > > > > > > > > if (yDifference < .1 ||
> > > > > > > > > 
> > > > > > > > > nearX && pos2YBottom >= pos1YTop && pos2YBottom <=
> > > > > > > > > pos1YBottom ||
> > > > > > > > > 
> > > > > > > > > nearX && pos1YBottom >= pos2YTop && pos1YBottom <=
> > > > > > > > > pos2YBottom)
> > > > > > > > > 
> > > > > > > > > {
> > > > > > > > > 
> > > > > > > > > return Float.compare(x1, x2);
> > > > > > > > > 
> > > > > > > > > }
> > > > > > > > > 
> > > > 
> > > > 
> > > > ---------------------------------------------------------------
> > > > ------
> > > > To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> > > > For additional commands, e-mail: users-h...@pdfbox.apache.org
> > > > 
> > > > 

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org

Re: Same line calculation of PDFTextStripper

Reply via email to