Oops, the one from PDFBOX-3019 is no longer available. The one from PDFBOX-2991 is here, you can test it yourself:
https://issues.apache.org/jira/secure/attachment/12766900/sample-resume.pdf
The original extraction is:
Benjamin Costa Mesa, California benjaminmccan(ätt)gmail.com
I don't have any thoughts about the algorithm, because I would have to understand it first, and that would take a lot of time and quiet. At this time, all I can do to help is test changes and report problems.

Another example is the file from https://issues.apache.org/jira/browse/PDFBOX-2794 ; here's the output from the text stripper test:

Org: [position: 0, size: 4, lines: [Firma Datum : 23.04.2015, SOFTLINE Datenverarbeitungsgesellschaft mbH Kund.Nr. : 44812, Ungerdorf 116 UID-Nr. : ATU30603807, 8200 Gleisdorf Bestellnr. : Fr. Scharler 07.04.2015]]

New: [position: 0, size: 4, lines: [Datum : 23.04.2015Firma, Kund.Nr. : 44812SOFTLINE Datenverarbeitungsgesellschaft mbH, UID-Nr. : *ATU30603807Ungerdorf* 116, Bestellnr. : Fr. Scharler 07.04.20158200 Gleisdorf]]

Tilman

On 08.04.2025 15:29, Kevin Day wrote:
Hmmm.

Do you know what the extracted text was for those two examples under the
original sort algorithm? Were those text chunks properly extracted with the
expected space between them?

I'm not very clear on why the examples you show would be missing a word
break detection after changing the sort. Or is it possible that the text
itself has a space glyph in it? I'm wondering if that space is maybe
getting sorted weird because it has zero width...



A few other thoughts:

My proposed change is not a well-thought-out algorithm - it was a hack
to try to emulate the "block detection" you mention. It may be that
fine-tuning the nearX calculation is what is needed.

For example, change the 4 to a 1. Or possibly take the average (or maybe
the geometric mean) of the two text positions.

Actually, as I write this, I think it may be advisable to use the same
algorithm that determines word breaks... If the x positions of the two TPs
are within that threshold, then nearX will be true and the fuzzy logic
would kick in during the sort.
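
A minimal sketch of the nearX variants discussed above, for side-by-side comparison. All names here are hypothetical and none of them exist in PDFBox. "Average" and "geometric mean" are read as the mean of the two glyph widths, which is only one interpretation of the suggestion, and wordBreakThreshold is a stand-in for whatever distance the word-break detection would supply.

```java
// Hypothetical helper class for illustration; not part of PDFBox.
final class NearXSketch {

    // Proposal as posted: the threshold is `factor` times the first glyph
    // width of the first text position (factor 4 as posted, 1 as the
    // suggested alternative).
    static boolean nearXFixed(float x1, float x2, float width1, float factor) {
        return Math.abs(x1 - x2) < width1 * factor;
    }

    // Assumed reading of "take the average": arithmetic mean of both widths.
    static boolean nearXArithmetic(float x1, float x2,
                                   float width1, float width2, float factor) {
        float mean = (width1 + width2) / 2f;
        return Math.abs(x1 - x2) < mean * factor;
    }

    // Assumed reading of "geometric mean": square root of the width product.
    static boolean nearXGeometric(float x1, float x2,
                                  float width1, float width2, float factor) {
        float mean = (float) Math.sqrt((double) width1 * width2);
        return Math.abs(x1 - x2) < mean * factor;
    }

    // Reuse of a word-break distance: the caller passes in the threshold.
    static boolean nearXThreshold(float x1, float x2, float wordBreakThreshold) {
        return Math.abs(x1 - x2) < wordBreakThreshold;
    }
}
```

With x positions 10 and 13 and a glyph width of 1, nearXFixed is true for factor 4 and false for factor 1, which shows how much the choice of factor narrows the window.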


What are your thoughts?

K

Kevin Day


On Tue, Apr 8, 2025, 1:13 AM Tilman Hausherr <thaush...@t-online.de> wrote:

I tried this and got lots of differences, obviously. I looked at two
files (PDFBOX-2991 and PDFBOX-3019) and the differences make sense, but
there's a new problem: the segments are not separated.

PDFBOX-2991:
Costa Mesa, California, benjaminmccan(ätt)gmail.co*mB*enjamin

PDFBOX-3019:
Originally from Dallas, I have since moved throughout the U.S. and have
spent mos*t r*hodescc3(ätt)vcu.edu

The part in bold is where I'd expect to have a better separation.

I haven't dealt much with this algorithm... I think the ideal solution
would be some sort of block detection that runs first, followed by a
step that collects these blocks separately (like the "bead" logic that
already exists).

Tilman


On 07.04.2025 21:19, Kevin Day wrote:
Here is my suggestion for a potential fix if preserving the "y position
window" behavior is necessary:

boolean nearX = Math.abs(x1 - x2) < pos1.getIndividualWidths()[0] * 4;

// we will do a simple tolerance comparison
if (yDifference < .1 ||
    nearX && pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
    nearX && pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
{
    return Float.compare(x1, x2);
}
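
To try the condition outside PDFBox, here is a self-contained sketch on a made-up Glyph type. The class, its fields, and the way yDifference and the top edges are derived are assumptions for illustration. Falling back to a plain y comparison when the condition fails is also an assumption, since the snippet above does not show that branch. Y is taken to grow downward, so the top edge is yBottom minus height.

```java
// Hypothetical stand-in for a text position; not the PDFBox TextPosition class.
final class NearXCompareSketch {

    static final class Glyph {
        final float x, yBottom, height, width;

        Glyph(float x, float yBottom, float height, float width) {
            this.x = x;
            this.yBottom = yBottom;
            this.height = height;
            this.width = width;
        }
    }

    static int compare(Glyph pos1, Glyph pos2) {
        // Assumed derivation of the values the posted snippet relies on.
        float pos1YTop = pos1.yBottom - pos1.height;
        float pos2YTop = pos2.yBottom - pos2.height;
        float yDifference = Math.abs(pos1.yBottom - pos2.yBottom);

        // Posted proposal: only apply the y-window when the x positions are close.
        boolean nearX = Math.abs(pos1.x - pos2.x) < pos1.width * 4;

        if (yDifference < .1
                || nearX && pos2.yBottom >= pos1YTop && pos2.yBottom <= pos1.yBottom
                || nearX && pos1.yBottom >= pos2YTop && pos1.yBottom <= pos2.yBottom) {
            return Float.compare(pos1.x, pos2.x);
        }
        // Assumed fallback: order by baseline, top of the page first.
        return Float.compare(pos1.yBottom, pos2.yBottom);
    }
}
```

For two glyphs whose y ranges overlap, the sketch orders them by x when they are horizontally close, and by baseline when they are far apart, which is the behavior change the nearX guard is meant to introduce.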

