Re: Same line calculation of PDFTextStripper

Kevin Day Mon, 12 May 2025 12:39:28 -0700

Are there test files that exercise the superscript/subscript correction
that the non-transitive comparator is supposed to address?  And is there
some way that I can get access to the test suite that includes 2991?  I can
copy the file down from the Jira ticket, but I hate to do a ton of
development without being able to run proper tests...


This is becoming a pretty big issue for us (we now have hundreds of files
where the sorted text extraction is producing very bad results), so I'm
going to do some work on it in the next couple of weeks.  Unfortunately,
this is probably not going to be a surgical change (i.e. a change to
the sort algorithm).

I would like to make sure that any algorithm I come up with doesn't
introduce big regressions.

One challenge I am seeing is that all of this logic depends heavily on
whether two text positions are in the same word or not, so it will almost
certainly become necessary to do word clustering and line break
determination as part of the "sort" (not after the fact like things are
now).  Whatever I come up with, it is probably going to be a clean
implementation, which is always risky without lots of unit tests.

Thanks,

K

Kevin Day

*trumpet**p| *480.961.6003 x1002
*e| *ke...@trumpetinc.com
*www.trumpetinc.com <http://trumpetinc.com/> | *LinkedIn
<https://www.linkedin.com/company/trumpet-inc.>

On Thu, Apr 10, 2025, 6:35 AM Kevin Day <ke...@trumpetinc.com> wrote:

> I had one other thought on this.
>
> Without question, the ordering of the TextPositions after the JRE sort
> completes is not consistent with the comparator. It should be easy to just
> loop the sorted TPs and check to ensure the comparator always returns <=0.
>
> I'm wondering if the slower fallback sort would not have this problem. If
> not, then it may be faster to do the JRE sort as a pre-sort, then just
> always call the fallback sort. The hope would be that if elements were
> already "mostly" sorted, then the fallback sort would be efficient.
>
> I have no idea if that will actually be the case, but it may be something
> to investigate.  The superscript/subscript algorithm I described in my last
> post is effectively a merge sort...
>
> Another possibility would be to use the JRE sort without
> subscript/superscript handling. Then use the fallback sort with
> subscript/superscript handling.
>
> K
>
> Kevin Day
>
> *trumpet**p| *480.961.6003 x1002
> *e| *ke...@trumpetinc.com
> *www.trumpetinc.com <http://trumpetinc.com/> | *LinkedIn
> <https://www.linkedin.com/company/trumpet-inc.>
>
> On Wed, Apr 9, 2025, 2:03 PM Kevin Day <ke...@trumpetinc.com> wrote:
>
>> Thank you for directing me to the discussion.  This is pretty much what I
>> expected (the reason for the fuzzy logic is superscript/subscript handling).
>>
>> I am pretty confident that the problem is not with the comparator.  The
>> problem is that we are trying to use a simple sort algorithm to do
>> something that is not simple.
>>
>> I think that a more robust approach would be something like this:
>>
>> Group TextPositons by direction (note: Current TextStripper doesn't do
>> this, which causes a ton of problems for extractions that have watermarks,
>> etc...)
>> Group TextPositions by Y-position.  Because these are floats, we will
>> need to do some amount of rounding.  So I think we use Math.floor() to make
>> the measurement a whole number
>> Sort each group by X-position
>>
>> This gives us a baseline layout.  Now, we need to go through and identify
>> superscript/subscript candidates and merge them into the appropriate line.
>> This is not trivial, but it's a *lot* more deterministic than trying to
>> tune the comparator.
>>
>>
>> Here is a rough idea for a superscript/subscript merge algorithm (this
>> would be applied to each "direction" group):
>>
>> For superscripts:
>>
>> Start at the first Y-position-group.  Scan each TextPosition and see if
>> it is a superscript candidate to the next Y position group.  If it is a
>> candidate, remove the TextPosition from the current Y-position-group and
>> merge it into the correct location in the *next* Y-position-group.
>> Repeat for the next Y-position-group.
>>
>> The net effect should be that all superscript candidates have been pushed
>> down to the line(s) below.
>>
>> For subscripts:
>>
>> Start at the last Y-position-group.  Do a similar scan as before, but
>> looking at the Y-position-group above it.
>>
>> The net effect should be that all subscript candidates have been pushed
>> up to the line(s) above.
>>
>>
>> Finally, we have to define what a superscript candidate is:  I think this
>> would be a text position that has a corresponding TextPosition to the left
>> or right of it's X position, but with a Y-Position that is inside the left
>> or right TextPosition's bounding box.
>>
>>
>> This is complex, of course - but I think this is a complex problem.  The
>> key here is that we should not be mixing the overall text ordering
>> algorithm, with the specialized superscript/subscript handling algorithm.
>>
>> - K
>>
>> Kevin Day
>>
>> *trumpet**p| *480.961.6003 x1002
>> *e| *ke...@trumpetinc.com
>> *www.trumpetinc.com <http://trumpetinc.com/> | *LinkedIn
>> <https://www.linkedin.com/company/trumpet-inc.>
>>
>>
>> On Wed, Apr 9, 2025 at 8:08 AM Tilman Hausherr <thaush...@t-online.de>
>> wrote:
>>
>>> On 09.04.2025 16:36, Kevin Day wrote:
>>> > Understood.
>>> >
>>> > My biggest comment is that having a non-transitive comparator in a sort
>>> > algorithm is a really bad idea.  It produces all sorts of
>>> non-deterministic
>>> > behavior.
>>> >
>>> > So I'm in agreement that a better solution is needed.
>>> >
>>> > Do you have any history of why the fuzzy logic is in that comparator?
>>> Some
>>> > sort of git blame or anything that might explain?  I can't imagine
>>> this was
>>> > part of the original algorithm.
>>>
>>> This happened before it became an apache project. You would have to get
>>> a CVS client to get the history:
>>>
>>> https://sourceforge.net/p/pdfbox/code/
>>>
>>> We had a discussion about the non transitivity here and nobody came up
>>> with a better algorithm
>>>
>>> https://issues.apache.org/jira/browse/PDFBOX-1512
>>>
>>> See the comment by Hannes Erven on 21/Mar/14
>>>
>>> Tilman
>>>
>>>
>>>
>>>
>>> >
>>> > - K
>>> >
>>> > Kevin Day
>>> >
>>> > *trumpet**p| *480.961.6003 x1002
>>> > *e| *ke...@trumpetinc.com
>>> > *www.trumpetinc.com <http://trumpetinc.com/> | *LinkedIn
>>> > <https://www.linkedin.com/company/trumpet-inc.>
>>> >
>>> >
>>> > On Tue, Apr 8, 2025 at 7:24 AM Tilman Hausherr <thaush...@t-online.de>
>>> > wrote:
>>> >
>>> >> Oops, the one from PDFBOX-3019 is no longer available. The one from
>>> >> PDFBOX-2991 is here, you can test it yourself:
>>> >>
>>> https://issues.apache.org/jira/secure/attachment/12766900/sample-resume.pdf
>>> >> The original extraction is
>>> >> Benjamin Costa Mesa, California benjaminmccan(ätt)gmail.com
>>> >> I don't have any thoughts about the algorithm because I would have to
>>> >> understand it first and I would need a lot of time and quietness for
>>> >> this. At this time, all I can do to help is to test changes and then
>>> >> tell about problems.
>>> >> Another example is the file from
>>> >> https://issues.apache.org/jira/browse/PDFBOX-2794 , here's the output
>>> >> from the text stripper test:
>>> >> Org: [position: 0, size: 4, lines: [Firma Datum : 23.04.2015,
>>> SOFTLINE
>>> >> Datenverarbeitungsgesellschaft mbH Kund.Nr. : 44812, Ungerdorf 116
>>> >> UID-Nr. : ATU30603807, 8200 Gleisdorf Bestellnr. : Fr. Scharler
>>> >> 07.04.2015]] New: [position: 0, size: 4, lines: [Datum :
>>> >> 23.04.2015Firma, Kund.Nr. : 44812SOFTLINE
>>> Datenverarbeitungsgesellschaft
>>> >> mbH, UID-Nr. : *ATU30603807Ungerdorf* 116, Bestellnr. : Fr. Scharler
>>> >> 07.04.20158200 Gleisdorf]]
>>> >> Tilman
>>> >>
>>> >> On 08.04.2025 15:29, Kevin Day wrote:
>>> >>> Hmmm.
>>> >>>
>>> >>> Do you know what the extracted text was for those two examples under
>>> the
>>> >>> original sort algorithm? Were those text chunks properly extracted
>>> with
>>> >> the
>>> >>> expected space between them?
>>> >>>
>>> >>> I'm not very clear on why the examples you show would be missing a
>>> word
>>> >>> break detection after changing the sort. Or is it possible that the
>>> text
>>> >>> itself has a space glyph in it? I'm wondering if that space is maybe
>>> >>> getting sorted weird because it has zero width...
>>> >>>
>>> >>>
>>> >>>
>>> >>> A few other thoughts:
>>> >>>
>>> >>> My proposed change is not a well thought through algorithm - it was a
>>> >> hack
>>> >>> to try to emulate the "block detection" you mention. It may be that
>>> fine
>>> >>> tuning the nearX calculation could be what is needed.
>>> >>>
>>> >>> For example, change the 4 to a 1. Or possibly take the average (or
>>> maybe
>>> >>> the geometric mean) of the two text positions.
>>> >>>
>>> >>> Actually, as I write this, I think it may be advisable to use the
>>> same
>>> >>> algorithm that determines word breaks... If the x positions of the
>>> two
>>> >> TPs
>>> >>> are within that threshold, then nearX will be true and the fuzzy
>>> logic
>>> >>> would kick in during the sort.
>>> >>>
>>> >>>
>>> >>> What are your thoughts?
>>> >>>
>>> >>> K
>>> >>>
>>> >>> Kevin Day
>>> >>>
>>> >>> *trumpet**p| *480.961.6003 x1002
>>> >>> *e| *ke...@trumpetinc.com
>>> >>> *www.trumpetinc.com<http://trumpetinc.com/> | *LinkedIn
>>> >>> <https://www.linkedin.com/company/trumpet-inc.>
>>> >>>
>>> >>> On Tue, Apr 8, 2025, 1:13 AM Tilman Hausherr<thaush...@t-online.de>
>>> >> wrote:
>>> >>>> I tried this and get lots of differences, obviously. I looked at two
>>> >>>> files (PDFBOX-2991 and PDFBOX-3019) and the difference make sense,
>>> but
>>> >>>> there's a new problem: the segments are not separated.
>>> >>>>
>>> >>>> PDFBOX-2991:
>>> >>>> Costa Mesa, California, benjaminmccan(ätt)gmail.co*mB*enjamin
>>> >>>>
>>> >>>> PDFBOX-3019:
>>> >>>> Originally from Dallas, I have since moved throughout the U.S. and
>>> have
>>> >>>> spent mos*t r*hodescc3(ätt)vcu.edu
>>> >>>>
>>> >>>> The part in bold is where I'd expect to have a better separation.
>>> >>>>
>>> >>>> I haven't dealt much with this algorithm... I think the ideal
>>> solution
>>> >>>> would be some sort of block detection that goes first, and in a next
>>> >>>> step collect these blocks separately (like the "bead" logic that
>>> already
>>> >>>> exists)
>>> >>>>
>>> >>>> Tilman
>>> >>>>
>>> >>>>
>>> >>>> On 07.04.2025 21:19, Kevin Day wrote:
>>> >>>>> Here is my suggestion for a potential fix if preserving the "y
>>> position
>>> >>>>> window" behavior is necessary:
>>> >>>>>
>>> >>>>> boolean nearX = Math.abs(x1-x2) < pos1.getIndividualWidths()[0] *
>>> 4;
>>> >>>>>
>>> >>>>> // we will do a simple tolerance comparison
>>> >>>>>
>>> >>>>> if (yDifference < .1 ||
>>> >>>>>
>>> >>>>> nearX && pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
>>> >>>>>
>>> >>>>> nearX && pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
>>> >>>>>
>>> >>>>> {
>>> >>>>>
>>> >>>>> return Float.compare(x1, x2);
>>> >>>>>
>>> >>>>> }
>>> >>>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
>>>
>>>

Re: Same line calculation of PDFTextStripper

Reply via email to