Re: Same line calculation of PDFTextStripper

Kevin Day Thu, 10 Apr 2025 06:36:02 -0700

I had one other thought on this.

Without question, the ordering of the TextPositions after the JRE sort
completes is not consistent with the comparator. It should be easy to just
loop the sorted TPs and check to ensure the comparator always returns <=0.


I'm wondering if the slower fallback sort would not have this problem. If
not, then it may be faster to do the JRE sort as a pre-sort, then just
always call the fallback sort. The hope would be that if elements were
already "mostly" sorted, then the fallback sort would be efficient.

I have no idea if that will actually be the case, but it may be something
to investigate.  The superscript/subscript algorithm I described in my last
post is effectively a merge sort...

Another possibility would be to use the JRE sort without
subscript/superscript handling. Then use the fallback sort with
subscript/superscript handling.

K

Kevin Day

*trumpet**p| *480.961.6003 x1002
*e| *[email protected]
*www.trumpetinc.com <http://trumpetinc.com/> | *LinkedIn
<https://www.linkedin.com/company/trumpet-inc.>

On Wed, Apr 9, 2025, 2:03 PM Kevin Day <[email protected]> wrote:

> Thank you for directing me to the discussion.  This is pretty much what I
> expected (the reason for the fuzzy logic is superscript/subscript handling).
>
> I am pretty confident that the problem is not with the comparator.  The
> problem is that we are trying to use a simple sort algorithm to do
> something that is not simple.
>
> I think that a more robust approach would be something like this:
>
> Group TextPositons by direction (note: Current TextStripper doesn't do
> this, which causes a ton of problems for extractions that have watermarks,
> etc...)
> Group TextPositions by Y-position.  Because these are floats, we will need
> to do some amount of rounding.  So I think we use Math.floor() to make the
> measurement a whole number
> Sort each group by X-position
>
> This gives us a baseline layout.  Now, we need to go through and identify
> superscript/subscript candidates and merge them into the appropriate line.
> This is not trivial, but it's a *lot* more deterministic than trying to
> tune the comparator.
>
>
> Here is a rough idea for a superscript/subscript merge algorithm (this
> would be applied to each "direction" group):
>
> For superscripts:
>
> Start at the first Y-position-group.  Scan each TextPosition and see if it
> is a superscript candidate to the next Y position group.  If it is a
> candidate, remove the TextPosition from the current Y-position-group and
> merge it into the correct location in the *next* Y-position-group.
> Repeat for the next Y-position-group.
>
> The net effect should be that all superscript candidates have been pushed
> down to the line(s) below.
>
> For subscripts:
>
> Start at the last Y-position-group.  Do a similar scan as before, but
> looking at the Y-position-group above it.
>
> The net effect should be that all subscript candidates have been pushed up
> to the line(s) above.
>
>
> Finally, we have to define what a superscript candidate is:  I think this
> would be a text position that has a corresponding TextPosition to the left
> or right of it's X position, but with a Y-Position that is inside the left
> or right TextPosition's bounding box.
>
>
> This is complex, of course - but I think this is a complex problem.  The
> key here is that we should not be mixing the overall text ordering
> algorithm, with the specialized superscript/subscript handling algorithm.
>
> - K
>
> Kevin Day
>
> *trumpet**p| *480.961.6003 x1002
> *e| *[email protected]
> *www.trumpetinc.com <http://trumpetinc.com/> | *LinkedIn
> <https://www.linkedin.com/company/trumpet-inc.>
>
>
> On Wed, Apr 9, 2025 at 8:08 AM Tilman Hausherr <[email protected]>
> wrote:
>
>> On 09.04.2025 16:36, Kevin Day wrote:
>> > Understood.
>> >
>> > My biggest comment is that having a non-transitive comparator in a sort
>> > algorithm is a really bad idea.  It produces all sorts of
>> non-deterministic
>> > behavior.
>> >
>> > So I'm in agreement that a better solution is needed.
>> >
>> > Do you have any history of why the fuzzy logic is in that comparator?
>> Some
>> > sort of git blame or anything that might explain?  I can't imagine this
>> was
>> > part of the original algorithm.
>>
>> This happened before it became an apache project. You would have to get
>> a CVS client to get the history:
>>
>> https://sourceforge.net/p/pdfbox/code/
>>
>> We had a discussion about the non transitivity here and nobody came up
>> with a better algorithm
>>
>> https://issues.apache.org/jira/browse/PDFBOX-1512
>>
>> See the comment by Hannes Erven on 21/Mar/14
>>
>> Tilman
>>
>>
>>
>>
>> >
>> > - K
>> >
>> > Kevin Day
>> >
>> > *trumpet**p| *480.961.6003 x1002
>> > *e| *[email protected]
>> > *www.trumpetinc.com <http://trumpetinc.com/> | *LinkedIn
>> > <https://www.linkedin.com/company/trumpet-inc.>
>> >
>> >
>> > On Tue, Apr 8, 2025 at 7:24 AM Tilman Hausherr <[email protected]>
>> > wrote:
>> >
>> >> Oops, the one from PDFBOX-3019 is no longer available. The one from
>> >> PDFBOX-2991 is here, you can test it yourself:
>> >>
>> https://issues.apache.org/jira/secure/attachment/12766900/sample-resume.pdf
>> >> The original extraction is
>> >> Benjamin Costa Mesa, California benjaminmccan(ätt)gmail.com
>> >> I don't have any thoughts about the algorithm because I would have to
>> >> understand it first and I would need a lot of time and quietness for
>> >> this. At this time, all I can do to help is to test changes and then
>> >> tell about problems.
>> >> Another example is the file from
>> >> https://issues.apache.org/jira/browse/PDFBOX-2794 , here's the output
>> >> from the text stripper test:
>> >> Org: [position: 0, size: 4, lines: [Firma Datum : 23.04.2015, SOFTLINE
>> >> Datenverarbeitungsgesellschaft mbH Kund.Nr. : 44812, Ungerdorf 116
>> >> UID-Nr. : ATU30603807, 8200 Gleisdorf Bestellnr. : Fr. Scharler
>> >> 07.04.2015]] New: [position: 0, size: 4, lines: [Datum :
>> >> 23.04.2015Firma, Kund.Nr. : 44812SOFTLINE
>> Datenverarbeitungsgesellschaft
>> >> mbH, UID-Nr. : *ATU30603807Ungerdorf* 116, Bestellnr. : Fr. Scharler
>> >> 07.04.20158200 Gleisdorf]]
>> >> Tilman
>> >>
>> >> On 08.04.2025 15:29, Kevin Day wrote:
>> >>> Hmmm.
>> >>>
>> >>> Do you know what the extracted text was for those two examples under
>> the
>> >>> original sort algorithm? Were those text chunks properly extracted
>> with
>> >> the
>> >>> expected space between them?
>> >>>
>> >>> I'm not very clear on why the examples you show would be missing a
>> word
>> >>> break detection after changing the sort. Or is it possible that the
>> text
>> >>> itself has a space glyph in it? I'm wondering if that space is maybe
>> >>> getting sorted weird because it has zero width...
>> >>>
>> >>>
>> >>>
>> >>> A few other thoughts:
>> >>>
>> >>> My proposed change is not a well thought through algorithm - it was a
>> >> hack
>> >>> to try to emulate the "block detection" you mention. It may be that
>> fine
>> >>> tuning the nearX calculation could be what is needed.
>> >>>
>> >>> For example, change the 4 to a 1. Or possibly take the average (or
>> maybe
>> >>> the geometric mean) of the two text positions.
>> >>>
>> >>> Actually, as I write this, I think it may be advisable to use the same
>> >>> algorithm that determines word breaks... If the x positions of the two
>> >> TPs
>> >>> are within that threshold, then nearX will be true and the fuzzy logic
>> >>> would kick in during the sort.
>> >>>
>> >>>
>> >>> What are your thoughts?
>> >>>
>> >>> K
>> >>>
>> >>> Kevin Day
>> >>>
>> >>> *trumpet**p| *480.961.6003 x1002
>> >>> *e| *[email protected]
>> >>> *www.trumpetinc.com<http://trumpetinc.com/> | *LinkedIn
>> >>> <https://www.linkedin.com/company/trumpet-inc.>
>> >>>
>> >>> On Tue, Apr 8, 2025, 1:13 AM Tilman Hausherr<[email protected]>
>> >> wrote:
>> >>>> I tried this and get lots of differences, obviously. I looked at two
>> >>>> files (PDFBOX-2991 and PDFBOX-3019) and the difference make sense,
>> but
>> >>>> there's a new problem: the segments are not separated.
>> >>>>
>> >>>> PDFBOX-2991:
>> >>>> Costa Mesa, California, benjaminmccan(ätt)gmail.co*mB*enjamin
>> >>>>
>> >>>> PDFBOX-3019:
>> >>>> Originally from Dallas, I have since moved throughout the U.S. and
>> have
>> >>>> spent mos*t r*hodescc3(ätt)vcu.edu
>> >>>>
>> >>>> The part in bold is where I'd expect to have a better separation.
>> >>>>
>> >>>> I haven't dealt much with this algorithm... I think the ideal
>> solution
>> >>>> would be some sort of block detection that goes first, and in a next
>> >>>> step collect these blocks separately (like the "bead" logic that
>> already
>> >>>> exists)
>> >>>>
>> >>>> Tilman
>> >>>>
>> >>>>
>> >>>> On 07.04.2025 21:19, Kevin Day wrote:
>> >>>>> Here is my suggestion for a potential fix if preserving the "y
>> position
>> >>>>> window" behavior is necessary:
>> >>>>>
>> >>>>> boolean nearX = Math.abs(x1-x2) < pos1.getIndividualWidths()[0] * 4;
>> >>>>>
>> >>>>> // we will do a simple tolerance comparison
>> >>>>>
>> >>>>> if (yDifference < .1 ||
>> >>>>>
>> >>>>> nearX && pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
>> >>>>>
>> >>>>> nearX && pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
>> >>>>>
>> >>>>> {
>> >>>>>
>> >>>>> return Float.compare(x1, x2);
>> >>>>>
>> >>>>> }
>> >>>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>

Re: Same line calculation of PDFTextStripper

Reply via email to