[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

2019-08-22 Thread Paul Slootweg (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913364#comment-16913364
 ] 

Paul Slootweg commented on PDFBOX-4313:
---

I am currently seeing a similar problem. In this case a line of bold text has 
a line of standard text below it, and the stripper treats the second line as 
part of the first.

This looks to be because it is using the bold font height to compare the 
overlap for the standard line.

See the attached file `details.pdf`:

{{protected void writeString(String text, List<TextPosition> textPositions)}} 
receives `text` as "Quote / Invoice Number: AT-82081073PO Number: CS-20167 " 
even though the two parts are on separate lines.

The overlap() method should also look at the x position to determine what, if 
any, the overlap is.

*PDFBox 2.0.16*
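
To make the height effect concrete, here is a minimal sketch that reuses the overlap check quoted from PDFTextStripper elsewhere in this thread. The `within` helper is my assumption of what the tolerance check does, and the baselines and heights are made-up numbers, not values taken from `details.pdf`.

```java
class BoldHeightOverlapSketch {
    // Assumed helper: true if 'second' lies within 'variance' of 'first'.
    static boolean within(float first, float second, float variance) {
        return second < first + variance && second > first - variance;
    }

    // Overlap check as quoted from PDFTextStripper in this thread.
    static boolean overlap(float y1, float height1, float y2, float height2) {
        return within(y1, y2, .1f) || y2 <= y1 && y2 >= y1 - height1
                || y1 <= y2 && y1 >= y2 - height2;
    }

    public static void main(String[] args) {
        float boldLineY = 100f;      // hypothetical baseline of the bold line
        float standardLineY = 110f;  // hypothetical baseline of the line below it
        float boldHeight = 11f;      // hypothetical height of the bold glyphs
        float standardHeight = 8f;   // hypothetical height of the standard glyphs

        // Standard line measured with its own height: no overlap, so a line break.
        System.out.println(overlap(standardLineY, standardHeight, boldLineY, boldHeight));
        // Standard line measured with the bold height: overlap, so the lines merge.
        System.out.println(overlap(standardLineY, boldHeight, boldLineY, boldHeight));
    }
}
```

If the taller bold height is what ends up being used for the standard line, the second call is the situation I am describing.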

> PDFTextStripper groups unrelated chunks into words
> --
>
> Key: PDFBOX-4313
> URL: https://issues.apache.org/jira/browse/PDFBOX-4313
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.11
> Reporter: Emilian Bold
> Assignee: Andreas Lehmkühler
> Priority: Major
> Attachments: 1536938716546.pdf, PDFBOX-4313-Test.pdf, 
> PDFBOX-4313-Test_sorted.txt, PDFBOX-4313-Test_unsorted.txt, PDFBOX-4313.pdf, 
> PDFBOX4313Test.java, PDFBOX4313Test.java, crop-fisa-sintetica.png, 
> details.pdf, pdfbox-words.png
>
>
> I have the text "10" and "11" and they get merged into the word "1110".
> Coordinates are:
> 1 575.36 x 227.4 w 4.447998 h 5.736
> 1 579.752 x 227.4 w 4.447998 h 5.736
> 1 526.2 x 227.4 w 4.447998 h 5.736
> 0 530.59204 x 227.4 w 4.447998 h 5.736
> The bug is in this PDFTextStripper chunk:
> {{
>     // test if our TextPosition starts after a new word would be expected to start
>     if (expectedStartOfNextWordX != EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
>             && expectedStartOfNextWordX < positionX &&
>             // only bother adding a space if the last character was not a space
>             lastPosition.getTextPosition().getUnicode() != null
>             && !lastPosition.getTextPosition().getUnicode().endsWith(" "))
>     {
>         line.add(LineItem.getWordSeparator());
>     }
> }}
> which seems to add a word separator only if the next char is "after" the 
> current word. It never expects that the next char might be "before" the 
> current word.
> I guess this could also be framed as an RTL problem, but the PDF is a plain 
> PDF; it just seems that Oracle Reports generates these chunks in reverse 
> order.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

2019-08-22 Thread Paul Slootweg (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913334#comment-16913334
 ] 

Paul Slootweg commented on PDFBOX-4313:
---

I am currently seeing a similar problem. In this case a line of bold text has 
a line of standard text below it, and the stripper treats the second line as 
part of the first.

This looks to be because it is using the bold font height to compare the 
overlap for the standard line.

Unfortunately at this point I can't provide a failing PDF, but I will try to 
see if I can.




[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

2018-09-23 Thread JIRA


[ 
https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625081#comment-16625081
 ] 

Andreas Lehmkühler commented on PDFBOX-4313:


Line breaks are triggered only if the last and the current text position don't 
overlap at all. The given case is a corner case.

This is the relevant code from PDFTextStripper:
{code}
private boolean overlap(float y1, float height1, float y2, float height2)
{
    return within(y1, y2, .1f) || y2 <= y1 && y2 >= y1 - height1
            || y1 <= y2 && y1 >= y2 - height2;
}
{code}
These are the relevant text positions from DrawPrintTextLocations:
{code}
String[714.886,293.3178 fs=6.0 xscale=6.0 height=3.468 space=1.6680002 width=1.3319702]l
String[20.0,297.63782 fs=6.0 xscale=6.0 height=3.468 space=1.6680002 width=4.3320007]D

293.3178 <= 297.63782 && 293.3178 >= 297.63782 - 3.468 = 293.16982 -> leads to "true" and doesn't detect the line break
{code}

I've experimented with some threshold values to make the overlap detection a 
little bit more lenient. I've used 90% of the given height values.
{code}
private boolean overlap(float y1, float height1, float y2, float height2)
{
    return within(y1, y2, .1f)
            || (y2 <= y1 && y1 - height1 - y2 < -(height1 * 0.1f))
            || (y1 <= y2 && y2 - height2 - y1 < -(height2 * 0.1f));
}
{code}
Could this be a reasonable solution? Instead of using a fixed threshold we 
could introduce another parameter to change that value from the outside.
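
The idea of passing the threshold in from the outside could look roughly like the sketch below. It is a standalone illustration, not a patch: the `tolerance` parameter is hypothetical, `within` is my assumption of what the helper does, and the coordinates are invented for the example.

```java
class LenientOverlapSketch {
    // Assumed helper: true if 'second' lies within 'variance' of 'first'.
    static boolean within(float first, float second, float variance) {
        return second < first + variance && second > first - variance;
    }

    // 'tolerance' (hypothetical) is the fraction of each height that is ignored
    // when testing for overlap; 0.1f matches the experiment described above.
    static boolean overlap(float y1, float height1, float y2, float height2,
                           float tolerance) {
        return within(y1, y2, .1f)
                || (y2 <= y1 && y1 - height1 - y2 < -(height1 * tolerance))
                || (y1 <= y2 && y2 - height2 - y1 < -(height2 * tolerance));
    }

    public static void main(String[] args) {
        // Two invented baselines 9.5 apart with height 10: the boxes touch by only 0.5.
        System.out.println(overlap(100f, 10f, 109.5f, 10f, 0f));   // marginal overlap counts
        System.out.println(overlap(100f, 10f, 109.5f, 10f, 0.1f)); // marginal overlap ignored
        // Baselines only 2 apart: still an overlap with the lenient setting.
        System.out.println(overlap(100f, 10f, 102f, 10f, 0.1f));
    }
}
```

A caller could then tune the value per document instead of relying on a fixed 10% margin.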






[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

2018-09-23 Thread JIRA


[ 
https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625055#comment-16625055
 ] 

Andreas Lehmkühler commented on PDFBOX-4313:


I've attached the resulting PDF from the given test and both results from text 
extraction (sorted and unsorted) using the 2.0 branch. The unsorted result 
isn't useful as the text is stored unsorted in the PDF. The sorted result 
doesn't show any issues with the number values in the second row. The column 
headers are difficult, but the result is as good/bad as expected with one 
exception: there seems to be an issue with a missing line break after "Modul".




[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

2018-09-21 Thread Emilian Bold (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16623903#comment-16623903
 ] 

Emilian Bold commented on PDFBOX-4313:
--

Normally a word has a direction. It's either LTR or RTL. My test checks for a 
reversal of direction inside a single word, as returned by PDFTextStripper, 
which is the bug.

I'm attaching an updated test where I just check for the exact wrong words 
being returned. Sorting doesn't help.

Also attaching a drawing with the words.




[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

2018-09-16 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616678#comment-16616678
 ] 

Tilman Hausherr commented on PDFBOX-4313:
-

I don't understand your test. Your initial complaint was that two words 
(numbers) are merged into one. Where is this happening?

However, maybe your test is related to a known problem: our sort criteria are 
not perfect, see PDFBOX-1512 and related issues.




[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

2018-09-14 Thread Emilian Bold (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16614984#comment-16614984
 ] 

Emilian Bold commented on PDFBOX-4313:
--

See the attached PDF [^1536938716546.pdf]. This reproduces the bug for me with 
and without setSortByPosition.

Also attaching the unit test for it.




[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

2018-09-12 Thread Emilian Bold (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612475#comment-16612475
 ] 

Emilian Bold commented on PDFBOX-4313:
--

Thanks for that sample code! I should be able to use it to create a test PDF 
for you (assuming it adds chars without first sorting them by coordinates or 
doing something else too smart, which would keep me from duplicating the PDF 
layout I see).




[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

2018-09-11 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610849#comment-16610849
 ] 

Tilman Hausherr commented on PDFBOX-4313:
-

See attached file - is that what you have in mind? It was created with this 
code:
{code}
try (PDDocument doc = new PDDocument())
{
    PDPage page = new PDPage();
    doc.addPage(page);
    try (PDPageContentStream cs = new PDPageContentStream(doc, page))
    {
        cs.beginText();
        cs.setFont(PDType1Font.HELVETICA, 12);
        cs.newLineAtOffset(200, 700);
        cs.showText("456");
        cs.newLineAtOffset(-100, 0);
        cs.showText("123");
        cs.endText();
    }
    doc.save(new File(….));
}
{code}
However it works in sorted mode.
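
As a rough picture of why sorting helps here, the toy sketch below reorders text chunks by position before joining them. It does not use PDFBox: the `Chunk` class and the comparator are hypothetical stand-ins, far simpler than the real sort criteria, and the coordinates are the ones from the code above in PDF user space (larger y is higher on the page).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SortByPositionSketch {
    // Hypothetical stand-in for a run of text with a start position.
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // The two chunks in the order the content stream draws them.
    static List<Chunk> drawnChunks() {
        return Arrays.asList(new Chunk("456", 200f, 700f), new Chunk("123", 100f, 700f));
    }

    static String join(List<Chunk> chunks, boolean sortByPosition) {
        List<Chunk> ordered = new ArrayList<>(chunks);
        if (sortByPosition) {
            // Top of the page first, then left to right.
            ordered.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                    .thenComparingDouble(c -> c.x));
        }
        StringBuilder sb = new StringBuilder();
        for (Chunk c : ordered) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(drawnChunks(), false)); // stream order
        System.out.println(join(drawnChunks(), true));  // position order
    }
}
```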




[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

2018-09-11 Thread Emilian Bold (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610232#comment-16610232
 ] 

Emilian Bold commented on PDFBOX-4313:
--

>  I can't do any changes without a test PDF. There is more than just what you 
>posted.

You shouldn't need a PDF to make a unit test that fails for PDFTextStripper.

The bug is obvious: you check ' && expectedStartOfNextWordX < positionX', so you 
assume the next character will have an increasing X coord. If you get a 
character with X before the start of the current word, you will still append 
it. So not only is there a bug in detecting that these are separate words; 
by appending text that's before the start of the current word, the ordering 
is wrong too!

setSortByPosition(true) does not fix this as shown above.
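
Here is a small standalone sketch of what I mean, using the four glyphs from the issue description. It is a simplified model and not the PDFTextStripper code: the `Glyph` class, the `GAP` value and the `checkBackwardJump` flag are all hypothetical names of mine, and the backward check is only an illustration of the missing case, not a tested fix.

```java
import java.util.Arrays;
import java.util.List;

class WordSeparatorSketch {
    // Hypothetical stand-in for a single positioned character.
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Hypothetical gap allowed between characters of the same word.
    static final float GAP = 1.0f;

    // The glyphs from the issue description, in the order they arrive.
    static List<Glyph> reportedGlyphs() {
        return Arrays.asList(
                new Glyph("1", 575.36f, 4.447998f),
                new Glyph("1", 579.752f, 4.447998f),
                new Glyph("1", 526.2f, 4.447998f),
                new Glyph("0", 530.59204f, 4.447998f));
    }

    static String join(List<Glyph> glyphs, boolean checkBackwardJump) {
        StringBuilder sb = new StringBuilder();
        Glyph last = null;
        for (Glyph g : glyphs) {
            if (last != null) {
                float expectedStartOfNextWordX = last.x + last.width + GAP;
                // Same direction of test as the quoted chunk: only looks forward.
                boolean forwardGap = expectedStartOfNextWordX < g.x;
                // Illustrative extra test: the glyph ends before the previous one starts.
                boolean backwardJump = checkBackwardJump && g.x + g.width < last.x;
                if (forwardGap || backwardJump) {
                    sb.append(' ');
                }
            }
            sb.append(g.text);
            last = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(reportedGlyphs(), false)); // forward-only test
        System.out.println(join(reportedGlyphs(), true));  // with the backward test
    }
}
```

With the forward-only test the four glyphs come out as one word; with the extra test a separator is added where X jumps back, although the two words still appear in stream order.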




[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

2018-09-08 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607993#comment-16607993
 ] 

Tilman Hausherr commented on PDFBOX-4313:
-

The cropping may only have changed the cropbox rectangle.

I can't do any changes without a test PDF. There is more than just what you 
posted.

I need the PDF or a reduced version of it. A reduced version may be possible if 
you create a decoded file first (command line utilities "WriteDecodedDoc"), and 
then change the content stream with an editor. Of course for that you'd need to 
know a bit about the content stream operators etc.

Alternatively, change the source code in the way you think is needed and then 
run the build tests. If they pass without errors, or with only improvements, 
please tell me what you did and I'll run additional tests with files that are 
not in the repository due to copyright reasons.

 




[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

2018-09-08 Thread Emilian Bold (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607982#comment-16607982
 ] 

Emilian Bold commented on PDFBOX-4313:
--

setSortByPosition(true) makes some errors go away but introduces new ones:

Direction switch for `Obligatie de plataDocument`
split 1 > Obligatie de plata 267.88 x 111.66 Obligatie de plata@ 267.88 x 111.66[,
`O` @ 267.88 x 111.66, `b` @ 274.048 x 111.66, `l` @ 278.44 x 111.66,
`i` @ 280.224 x 111.66, `g` @ 282.008 x 111.66, `a` @ 286.4 x 111.66,
`t` @ 290.792 x 111.66, `i` @ 292.98398 x 111.66, `e` @ 294.76797 x 111.66,
` ` @ 299.15997 x 111.66, ` ` @ 301.35196 x 111.66, `d` @ 303.54395 x 111.66,
`e` @ 307.93594 x 111.66, ` ` @ 312.32794 x 111.66, `p` @ 314.51993 x 111.66,
`l` @ 318.91193 x 111.66, `a` @ 320.69592 x 111.66, `t` @ 325.08792 x 111.66,
`a` @ 327.2799 x 111.66]
split 2 > Document 72.84 x 117.7 Document@ 72.84 x 117.7[,
`D` @ 72.84 x 117.7, `o` @ 78.6 x 117.7, `c` @ 82.992 x 117.7,
`u` @ 86.967995 x 117.7, `m` @ 91.35999 x 117.7, `e` @ 98.079994 x 117.7,
`n` @ 102.47199 x 117.7, `t` @ 106.86399 x 117.7]

These two chunks that get mushed together after setSortByPosition(true) are 
part of this table header: 

!crop-fisa-sintetica.png!

 

To me the bug still seems related to PDFTextStripper (which should order the 
items itself if it requires a specific ordering).

I cannot attach the PDF as it contains financial records. Oddly enough, even 
cropping the PDF (with macOS Preview) seems to preserve some confidential text 
that's outside the bounds of the crop.




[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

2018-09-06 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606051#comment-16606051
 ] 

Tilman Hausherr commented on PDFBOX-4313:
-

Please attach the PDF file. Also check whether the sort option helps.
