On 09.04.2025 16:36, Kevin Day wrote:
Understood.
My biggest comment is that having a non-transitive comparator in a sort
algorithm is a really bad idea. It produces all sorts of non-deterministic
behavior.
So I'm in agreement that a better solution is needed.
Do you have any history of why the fuzzy logic is in that comparator? Some
sort of git blame or anything that might explain? I can't imagine this was
part of the original algorithm.
This happened before it became an apache project. You would have to get
a CVS client to get the history:
https://sourceforge.net/p/pdfbox/code/
We had a discussion about the non transitivity here and nobody came up
with a better algorithm
https://issues.apache.org/jira/browse/PDFBOX-1512
See the comment by Hannes Erven on 21/Mar/14
Tilman
- K
Kevin Day
*trumpet**p| *480.961.6003 x1002
*e| *ke...@trumpetinc.com
*www.trumpetinc.com <http://trumpetinc.com/> | *LinkedIn
<https://www.linkedin.com/company/trumpet-inc.>
On Tue, Apr 8, 2025 at 7:24 AM Tilman Hausherr <thaush...@t-online.de>
wrote:
Oops, the one from PDFBOX-3019 is no longer available. The one from
PDFBOX-2991 is here, you can test it yourself:
https://issues.apache.org/jira/secure/attachment/12766900/sample-resume.pdf
The original extraction is
Benjamin Costa Mesa, California benjaminmccan(ätt)gmail.com
I don't have any thoughts about the algorithm because I would have to
understand it first and I would need a lot of time and quietness for
this. At this time, all I can do to help is to test changes and then
tell about problems.
Another example is the file from
https://issues.apache.org/jira/browse/PDFBOX-2794 , here's the output
from the text stripper test:
Org: [position: 0, size: 4, lines: [Firma Datum : 23.04.2015, SOFTLINE
Datenverarbeitungsgesellschaft mbH Kund.Nr. : 44812, Ungerdorf 116
UID-Nr. : ATU30603807, 8200 Gleisdorf Bestellnr. : Fr. Scharler
07.04.2015]] New: [position: 0, size: 4, lines: [Datum :
23.04.2015Firma, Kund.Nr. : 44812SOFTLINE Datenverarbeitungsgesellschaft
mbH, UID-Nr. : *ATU30603807Ungerdorf* 116, Bestellnr. : Fr. Scharler
07.04.20158200 Gleisdorf]]
Tilman
On 08.04.2025 15:29, Kevin Day wrote:
Hmmm.
Do you know what the extracted text was for those two examples under the
original sort algorithm? Were those text chunks properly extracted with
the
expected space between them?
I'm not very clear on why the examples you show would be missing a word
break detection after changing the sort. Or is it possible that the text
itself has a space glyph in it? I'm wondering if that space is maybe
getting sorted weird because it has zero width...
A few other thoughts:
My proposed change is not a well thought through algorithm - it was a
hack
to try to emulate the "block detection" you mention. It may be that fine
tuning the nearX calculation could be what is needed.
For example, change the 4 to a 1. Or possibly take the average (or maybe
the geometric mean) of the two text positions.
Actually, as I write this, I think it may be advisable to use the same
algorithm that determines word breaks... If the x positions of the two
TPs
are within that threshold, then nearX will be true and the fuzzy logic
would kick in during the sort.
What are your thoughts?
K
Kevin Day
*trumpet**p| *480.961.6003 x1002
*e| *ke...@trumpetinc.com
*www.trumpetinc.com<http://trumpetinc.com/> | *LinkedIn
<https://www.linkedin.com/company/trumpet-inc.>
On Tue, Apr 8, 2025, 1:13 AM Tilman Hausherr<thaush...@t-online.de>
wrote:
I tried this and get lots of differences, obviously. I looked at two
files (PDFBOX-2991 and PDFBOX-3019) and the difference make sense, but
there's a new problem: the segments are not separated.
PDFBOX-2991:
Costa Mesa, California, benjaminmccan(ätt)gmail.co*mB*enjamin
PDFBOX-3019:
Originally from Dallas, I have since moved throughout the U.S. and have
spent mos*t r*hodescc3(ätt)vcu.edu
The part in bold is where I'd expect to have a better separation.
I haven't dealt much with this algorithm... I think the ideal solution
would be some sort of block detection that goes first, and in a next
step collect these blocks separately (like the "bead" logic that already
exists)
Tilman
On 07.04.2025 21:19, Kevin Day wrote:
Here is my suggestion for a potential fix if preserving the "y position
window" behavior is necessary:
boolean nearX = Math.abs(x1-x2) < pos1.getIndividualWidths()[0] * 4;
// we will do a simple tolerance comparison
if (yDifference < .1 ||
nearX && pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
nearX && pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
{
return Float.compare(x1, x2);
}
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org