Re: Same line calculation of PDFTextStripper

Tilman Hausherr Thu, 10 Apr 2025 17:08:50 -0700

On 09.04.2025 16:36, Kevin Day wrote:

Understood.


My biggest comment is that having a non-transitive comparator in a sort
algorithm is a really bad idea.  It produces all sorts of non-deterministic
behavior.

So I'm in agreement that a better solution is needed.
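
To make the transitivity problem concrete, here is a minimal sketch of a tolerance-based comparator. It is not the PDFBox code: `Pos`, `fuzzy` and the tolerance of 1.0 are hypothetical stand-ins chosen for illustration. Three positions whose y values each differ by less than the tolerance from their neighbour, but by more than the tolerance end to end, produce a cycle (a < b, b < c, c < a), which the `Comparator` contract forbids.

```java
import java.util.Comparator;

final class FuzzyComparatorDemo {
    // Hypothetical stand-in for a text position: only x and y matter here.
    static final class Pos {
        final float x;
        final float y;

        Pos(float x, float y) {
            this.x = x;
            this.y = y;
        }
    }

    // If the y values are within the tolerance, treat both positions as being
    // on the same line and order by x; otherwise order by y.
    static Comparator<Pos> fuzzy(final float tolerance) {
        return (a, b) -> Math.abs(a.y - b.y) < tolerance
                ? Float.compare(a.x, b.x)
                : Float.compare(a.y, b.y);
    }

    public static void main(String[] args) {
        Comparator<Pos> cmp = fuzzy(1.0f);
        Pos a = new Pos(10f, 11.2f);
        Pos b = new Pos(20f, 10.6f);
        Pos c = new Pos(30f, 10.0f);

        // a and b share a "line", so x decides; the same holds for b and c.
        System.out.println("a < b: " + (cmp.compare(a, b) < 0));
        System.out.println("b < c: " + (cmp.compare(b, c) < 0));
        // a and c are too far apart in y, so y decides and the order flips.
        System.out.println("c < a: " + (cmp.compare(c, a) < 0));
    }
}
```

With such a cycle present, the result of a sort can depend on the input order, and `List.sort` may also reject the comparator with an `IllegalArgumentException`.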

Do you have any history of why the fuzzy logic is in that comparator?  Some
sort of git blame or anything that might explain?  I can't imagine this was
part of the original algorithm.

This happened before it became an Apache project. You would have to get a CVS client to get the history:


https://sourceforge.net/p/pdfbox/code/

We had a discussion about the non-transitivity here, and nobody came up with a better algorithm:


https://issues.apache.org/jira/browse/PDFBOX-1512

See the comment by Hannes Erven on 21/Mar/14

Tilman


- K

Kevin Day

*trumpet**p| *480.961.6003 x1002
*e| *ke...@trumpetinc.com
*www.trumpetinc.com <http://trumpetinc.com/> | *LinkedIn
<https://www.linkedin.com/company/trumpet-inc.>


On Tue, Apr 8, 2025 at 7:24 AM Tilman Hausherr <thaush...@t-online.de>
wrote:

Oops, the one from PDFBOX-3019 is no longer available. The one from
PDFBOX-2991 is here, you can test it yourself:
https://issues.apache.org/jira/secure/attachment/12766900/sample-resume.pdf

The original extraction is
Benjamin Costa Mesa, California benjaminmccan(ätt)gmail.com

I don't have any thoughts about the algorithm because I would have to
understand it first and I would need a lot of time and quietness for
this. At this time, all I can do to help is to test changes and then
tell about problems.

Another example is the file from
https://issues.apache.org/jira/browse/PDFBOX-2794 , here's the output
from the text stripper test:

Org: [position: 0, size: 4, lines: [Firma Datum : 23.04.2015, SOFTLINE
Datenverarbeitungsgesellschaft mbH Kund.Nr. : 44812, Ungerdorf 116
UID-Nr. : ATU30603807, 8200 Gleisdorf Bestellnr. : Fr. Scharler
07.04.2015]]

New: [position: 0, size: 4, lines: [Datum :
23.04.2015Firma, Kund.Nr. : 44812SOFTLINE Datenverarbeitungsgesellschaft
mbH, UID-Nr. : *ATU30603807Ungerdorf* 116, Bestellnr. : Fr. Scharler
07.04.20158200 Gleisdorf]]

Tilman

On 08.04.2025 15:29, Kevin Day wrote:

Hmmm.

Do you know what the extracted text was for those two examples under the
original sort algorithm? Were those text chunks properly extracted with the
expected space between them?

I'm not very clear on why the examples you show would be missing a word
break detection after changing the sort. Or is it possible that the text
itself has a space glyph in it? I'm wondering if that space is maybe
getting sorted weird because it has zero width...



A few other thoughts:

My proposed change is not a well thought through algorithm - it was a hack
to try to emulate the "block detection" you mention. It may be that fine
tuning the nearX calculation could be what is needed.

For example, change the 4 to a 1. Or possibly take the average (or maybe
the geometric mean) of the two text positions.

Actually, as I write this, I think it may be advisable to use the same
algorithm that determines word breaks... If the x positions of the two TPs
are within that threshold, then nearX will be true and the fuzzy logic
would kick in during the sort.
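
As a rough sketch of the averaging idea, the threshold could be computed from both widths instead of only the first one. Everything here is an assumption for illustration: the class and method names are hypothetical, `w1` and `w2` stand in for the first glyph width of each text position, and `factor` is the multiplier (4 in the proposed patch, 1 as the alternative). The first method mirrors the patch expression, which only looks at the width of the first argument and can therefore give a different answer when the two positions are swapped; the averaged variants do not depend on argument order.

```java
final class NearXSketch {
    // Mirrors the shape of the proposed patch: only the first width is used,
    // so the result can change when the two positions are swapped.
    static boolean nearXFirstWidthOnly(float x1, float w1, float x2, float w2, float factor) {
        return Math.abs(x1 - x2) < w1 * factor;
    }

    // Threshold based on the arithmetic mean of both widths.
    static boolean nearXArithmetic(float x1, float w1, float x2, float w2, float factor) {
        return Math.abs(x1 - x2) < (w1 + w2) / 2f * factor;
    }

    // Threshold based on the geometric mean of both widths.
    static boolean nearXGeometric(float x1, float w1, float x2, float w2, float factor) {
        return Math.abs(x1 - x2) < (float) Math.sqrt(w1 * w2) * factor;
    }

    public static void main(String[] args) {
        // Two positions 30 units apart with first-glyph widths 5 and 10.
        System.out.println("first width only (a,b): " + nearXFirstWidthOnly(100f, 5f, 130f, 10f, 4f));
        System.out.println("first width only (b,a): " + nearXFirstWidthOnly(130f, 10f, 100f, 5f, 4f));
        System.out.println("arithmetic mean  (a,b): " + nearXArithmetic(100f, 5f, 130f, 10f, 4f));
        System.out.println("arithmetic mean  (b,a): " + nearXArithmetic(130f, 10f, 100f, 5f, 4f));
    }
}
```

A symmetric nearX only removes the dependence on argument order. It does not by itself make the comparator transitive.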


What are your thoughts?

K

Kevin Day

*trumpet**p| *480.961.6003 x1002
*e| *ke...@trumpetinc.com
*www.trumpetinc.com<http://trumpetinc.com/> | *LinkedIn
<https://www.linkedin.com/company/trumpet-inc.>

On Tue, Apr 8, 2025, 1:13 AM Tilman Hausherr <thaush...@t-online.de> wrote:

I tried this and got lots of differences, obviously. I looked at two
files (PDFBOX-2991 and PDFBOX-3019) and the differences make sense, but
there's a new problem: the segments are not separated.

PDFBOX-2991:
Costa Mesa, California, benjaminmccan(ätt)gmail.co*mB*enjamin

PDFBOX-3019:
Originally from Dallas, I have since moved throughout the U.S. and have
spent mos*t r*hodescc3(ätt)vcu.edu

The part in bold is where I'd expect to have a better separation.

I haven't dealt much with this algorithm... I think the ideal solution
would be some sort of block detection that goes first, and in a next
step collect these blocks separately (like the "bead" logic that already
exists)
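
The following is a much simpler stand-in for that idea, not the block detection described above and not the existing "bead" logic: a two-pass sketch that first groups positions into lines and only then orders each line by x. All names (`TwoPassLineSort`, `Pos`, `groupIntoLines`, `render`) and the tolerance are hypothetical. The point it illustrates is that each pass sorts on a single exact key, so both comparators are transitive, and the fuzzy decision moves out of the sort into a grouping step.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TwoPassLineSort {
    // Hypothetical stand-in for a text position.
    static final class Pos {
        final String text;
        final float x;
        final float y;

        Pos(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Pass 1: sort by y (exact comparison) and start a new line whenever a
    // position lies at least `tolerance` below the first position of the line.
    // Pass 2: sort every line by x (exact comparison).
    static List<List<Pos>> groupIntoLines(List<Pos> input, float tolerance) {
        List<Pos> byY = new ArrayList<>(input);
        byY.sort(Comparator.comparingDouble((Pos p) -> p.y));

        List<List<Pos>> lines = new ArrayList<>();
        List<Pos> current = null;
        float lineY = 0f;
        for (Pos p : byY) {
            if (current == null || p.y - lineY >= tolerance) {
                current = new ArrayList<>();
                lines.add(current);
                lineY = p.y; // the first position anchors the line
            }
            current.add(p);
        }
        for (List<Pos> line : lines) {
            line.sort(Comparator.comparingDouble((Pos p) -> p.x));
        }
        return lines;
    }

    // Joins the texts of each line with spaces and the lines with newlines.
    static String render(List<List<Pos>> lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            if (i > 0) {
                sb.append('\n');
            }
            List<Pos> line = lines.get(i);
            for (int j = 0; j < line.size(); j++) {
                if (j > 0) {
                    sb.append(' ');
                }
                sb.append(line.get(j).text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Pos> input = new ArrayList<>();
        input.add(new Pos("world", 50f, 10.6f));
        input.add(new Pos("next", 10f, 25f));
        input.add(new Pos("hello", 10f, 10.0f));
        System.out.println(render(groupIntoLines(input, 1.0f)));
    }
}
```

This sketch ignores columns entirely, so side-by-side blocks like the ones in the PDFBOX-2794 sample would still be merged line by line; real block detection would have to split on x as well.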

Tilman


On 07.04.2025 21:19, Kevin Day wrote:

Here is my suggestion for a potential fix if preserving the "y position
window" behavior is necessary:

boolean nearX = Math.abs(x1 - x2) < pos1.getIndividualWidths()[0] * 4;

// we will do a simple tolerance comparison
if (yDifference < .1 ||
    nearX && pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
    nearX && pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
{
    return Float.compare(x1, x2);
}



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
