[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words
[ https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913364#comment-16913364 ]

Paul Slootweg commented on PDFBOX-4313:
---------------------------------------

I am currently seeing a similar problem: a line of bold text has a line of standard text below it, and the stripper places the second line as part of the first. This looks to be because it uses the bold font height to compare the overlap for the standard line.

See the attached file `details.pdf` - {{protected void writeString(String text, List<TextPosition> textPositions)}} passes `text` as "Quote / Invoice Number: AT-82081073PO Number: CS-20167 " even though the two parts are on separate lines.

The overlap() method should also look at the x position to determine what, if any, the overlap is.

*PDFBox 2.0.16*

> PDFTextStripper groups unrelated chunks into words
> --------------------------------------------------
>
> Key: PDFBOX-4313
> URL: https://issues.apache.org/jira/browse/PDFBOX-4313
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.11
> Reporter: Emilian Bold
> Assignee: Andreas Lehmkühler
> Priority: Major
> Attachments: 1536938716546.pdf, PDFBOX-4313-Test.pdf, PDFBOX-4313-Test_sorted.txt, PDFBOX-4313-Test_unsorted.txt, PDFBOX-4313.pdf, PDFBOX4313Test.java, PDFBOX4313Test.java, crop-fisa-sintetica.png, details.pdf, pdfbox-words.png
>
> I have the text "10" and "11" and they get merged into the word "1110".
> Coordinates are:
> 1 575.36 x 227.4 w 4.447998 h 5.736
> 1 579.752 x 227.4 w 4.447998 h 5.736
> 1 526.2 x 227.4 w 4.447998 h 5.736
> 0 530.59204 x 227.4 w 4.447998 h 5.736
> The bug is in this PDFTextStripper chunk:
> {{
>     // test if our TextPosition starts after a new word would be expected to start
>     if (expectedStartOfNextWordX != EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
>             && expectedStartOfNextWordX < positionX &&
>     // only bother adding a space if the last character was not a space
>     lastPosition.getTextPosition().getUnicode() != null
>             && !lastPosition.getTextPosition().getUnicode().endsWith(" "))
>     {
>         line.add(LineItem.getWordSeparator());
>     }
> }}
> which seems to add a word separator only if the next char is "after" the current word. It never expects that the next char might be "before" the current word.
> I guess this could also be framed as an RTL problem, but the PDF is a plain PDF; it just seems that Oracle Reports generates these chunks in the reverse order.

--
This message was sent by Atlassian Jira (v8.3.2#803003)
-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
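The forward-only check described in the issue can be modelled without a PDF. The sketch below is not PDFBox code: `join`, `gapTolerance` and `splitOnBackwardJump` are hypothetical names, and the way `expectedStartOfNextWordX` is derived (end of the previous glyph plus a fixed tolerance) is a simplifying assumption. It feeds in the four glyphs from the report in content-stream order.

```java
public class WordSeparatorSketch {

    /**
     * Joins glyphs in the order given. A space is inserted when the next glyph
     * starts after the expected start of the next word (the forward-only rule
     * from the issue) or, optionally, when it lies entirely before the previous
     * glyph (a hypothetical extra rule, not something PDFBox is known to do).
     */
    static String join(String[] text, float[] x, float[] width,
                       float gapTolerance, boolean splitOnBackwardJump) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length; i++) {
            if (i > 0) {
                // Assumption: expected start = end of previous glyph + tolerance.
                float expectedStartOfNextWordX = x[i - 1] + width[i - 1] + gapTolerance;
                boolean startsAfter = expectedStartOfNextWordX < x[i];
                boolean startsBefore = splitOnBackwardJump && x[i] + width[i] < x[i - 1];
                if (startsAfter || startsBefore) {
                    out.append(' ');
                }
            }
            out.append(text[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Glyphs of "11" followed by "10", coordinates taken from the issue report.
        String[] text = {"1", "1", "1", "0"};
        float[] x = {575.36f, 579.752f, 526.2f, 530.59204f};
        float[] width = {4.447998f, 4.447998f, 4.447998f, 4.447998f};

        System.out.println(join(text, x, width, 1.0f, false)); // forward-only rule
        System.out.println(join(text, x, width, 1.0f, true));  // with backward-jump rule
    }
}
```

With the forward-only rule the four glyphs come out as one word, because the third glyph starts far to the left of the expected position and the `<` comparison never fires. Whether a backward-jump rule is the right fix for PDFBox itself is left open here.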
[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words
[ https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913334#comment-16913334 ]

Paul Slootweg commented on PDFBOX-4313:
---------------------------------------

I am currently seeing a similar problem: a line of bold text has a line of standard text below it, and the stripper places the second line as part of the first. This looks to be because it uses the bold font height to compare the overlap for the standard line.

Unfortunately I can't provide a failing PDF at this point, but I will try to produce one.
[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words
[ https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625081#comment-16625081 ]

Andreas Lehmkühler commented on PDFBOX-4313:
--------------------------------------------

Linebreaks are triggered only if the last and the current text position don't overlap at all. The given case is a corner case.

This is the relevant code from PDFTextStripper
{code}
private boolean overlap(float y1, float height1, float y2, float height2)
{
    return within(y1, y2, .1f) || y2 <= y1 && y2 >= y1 - height1
            || y1 <= y2 && y1 >= y2 - height2;
}
{code}

These are the relevant text positions from DrawPrintTextLocations
{code}
String[714.886,293.3178 fs=6.0 xscale=6.0 height=3.468 space=1.6680002 width=1.3319702]l
String[20.0,297.63782 fs=6.0 xscale=6.0 height=3.468 space=1.6680002 width=4.3320007]D

293.3178 <= 297.63782 && 293.3178 >= 297.63782 - 3.468 = 293.16982 -> leads to "true" and doesn't detect the line break
{code}

I've experimented with some threshold values to make the overlap detection a little bit more lenient. I've used 90% of the given height values.
{code}
private boolean overlap(float y1, float height1, float y2, float height2)
{
    return within(y1, y2, .1f) || (y2 <= y1 && y1 - height1 - y2 < - (height1 * 0.1f))
            || (y1 <= y2 && y2 - height2 - y1 < - (height2 * 0.1f));
}
{code}

Could this be a reasonable solution? Instead of using a fixed threshold we could introduce another parameter to change that value from the outside.
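The two overlap variants from the comment can be compared side by side on the quoted positions. The following is a stand-alone sketch, not the PDFBox source: `within` is re-implemented here under the assumption that it tests whether two values differ by less than the given variance, and the method names `overlapOriginal` and `overlapLenient` are made up for the comparison. One detail worth checking: 297.63782 - 3.468 evaluates to 294.16982, whereas the 293.16982 in the comment corresponds to a height of 4.468. The thread does not say which height the stripper actually compared, so the sketch evaluates both.

```java
public class OverlapSketch {

    // Assumed semantics of the helper: |first - second| < variance.
    static boolean within(float first, float second, float variance) {
        return second < first + variance && second > first - variance;
    }

    // The overlap test as quoted from PDFTextStripper in the comment.
    static boolean overlapOriginal(float y1, float height1, float y2, float height2) {
        return within(y1, y2, .1f)
                || (y2 <= y1 && y2 >= y1 - height1)
                || (y1 <= y2 && y1 >= y2 - height2);
    }

    // The proposed variant: the overlap must exceed 10% of the height.
    static boolean overlapLenient(float y1, float height1, float y2, float height2) {
        return within(y1, y2, .1f)
                || (y2 <= y1 && y1 - height1 - y2 < -(height1 * 0.1f))
                || (y1 <= y2 && y2 - height2 - y1 < -(height2 * 0.1f));
    }

    public static void main(String[] args) {
        float yD = 297.63782f; // y of "D" from the quoted output
        float yL = 293.3178f;  // y of "l" from the quoted output
        // 3.468 is the printed height; 4.468 is the height implied by the arithmetic.
        for (float h : new float[] {3.468f, 4.468f}) {
            System.out.println("height=" + h
                    + " original=" + overlapOriginal(yD, h, yL, h)
                    + " lenient=" + overlapLenient(yD, h, yL, h));
        }
    }
}
```

With a height of 4.468 the original test reports an overlap (so no line break), while the lenient variant does not, which matches the behaviour described in the comment. With the printed height of 3.468 neither variant reports an overlap.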
[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words
[ https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625055#comment-16625055 ]

Andreas Lehmkühler commented on PDFBOX-4313:
--------------------------------------------

I've attached the resulting PDF from the given test and both results from text extraction (sorted and unsorted) using the 2.0 branch.

The unsorted result isn't useful, as the text is stored unsorted in the PDF.

The sorted result doesn't show any issues with the number values in the second row. The column headers are difficult, but the result is as good/bad as expected, with one exception: there seems to be an issue with a missing line break after "Modul".
[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words
[ https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16623903#comment-16623903 ]

Emilian Bold commented on PDFBOX-4313:
--------------------------------------

Normally a word has a direction. It's either LTR or RTL. My test checks for a reversal of direction inside a single word, as returned by PDFTextStripper, which is the bug.

I'm attaching an updated test where I just check for the exact wrong words being returned. Sorting doesn't help.

Also attaching a drawing with the words.
[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words
[ https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616678#comment-16616678 ]

Tilman Hausherr commented on PDFBOX-4313:
-----------------------------------------

I don't understand your test. Your initial complaint was that two words (numbers) are merged into one. Where is this happening?

However, your test may be related to a known problem: our sort criteria are not perfect, see PDFBOX-1512 and related issues.
[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words
[ https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16614984#comment-16614984 ]

Emilian Bold commented on PDFBOX-4313:
--------------------------------------

See the attached PDF [^1536938716546.pdf]

This reproduces the bug for me with and without setSortByPosition.

Also attaching the unit test for it.
[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words
[ https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612475#comment-16612475 ]

Emilian Bold commented on PDFBOX-4313:
--------------------------------------

Thanks for that sample code!

I should be able to use it to create a test PDF for you, assuming it adds chars without first sorting them by coordinates or doing something else too smart for me to duplicate the PDF layout I see.
[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words
[ https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610849#comment-16610849 ]

Tilman Hausherr commented on PDFBOX-4313:
-----------------------------------------

See attached file - is that what you have in mind? It was created with this code:
{code}
try (PDDocument doc = new PDDocument())
{
    PDPage page = new PDPage();
    doc.addPage(page);
    try (PDPageContentStream cs = new PDPageContentStream(doc, page))
    {
        cs.beginText();
        cs.setFont(PDType1Font.HELVETICA, 12);
        // "456" is written first, at x = 200
        cs.newLineAtOffset(200, 700);
        cs.showText("456");
        // then "123" is written 100 units to the left, on the same baseline
        cs.newLineAtOffset(-100, 0);
        cs.showText("123");
        cs.endText();
    }
    doc.save(new File(….));
}
{code}

However it works in sorted mode.
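Why sorting helps for this particular test file can be shown with a small model. This is a sketch under stated assumptions rather than PDFBox's implementation: the `Glyph` class, the `extract` method, the fixed tolerance and the comparator (top-to-bottom, then left-to-right in PDF user space) are all hypothetical, and the digit width of 6.672 assumes 12 pt Helvetica digits.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortedExtractionSketch {

    /** Minimal stand-in for a text position: one character and its origin. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    static final float DIGIT_WIDTH = 6.672f;  // assumed width of a 12 pt Helvetica digit
    static final float GAP_TOLERANCE = 1.0f;  // hypothetical word-gap tolerance

    /** Appends one glyph per character of the run, left to right from startX. */
    static void addRun(List<Glyph> glyphs, String run, float startX, float y) {
        for (int i = 0; i < run.length(); i++) {
            glyphs.add(new Glyph(String.valueOf(run.charAt(i)), startX + i * DIGIT_WIDTH, y));
        }
    }

    /** Glyphs in the order the sample code writes them: "456" first, then "123". */
    static List<Glyph> contentStreamOrder() {
        List<Glyph> glyphs = new ArrayList<>();
        addRun(glyphs, "456", 200f, 700f);
        addRun(glyphs, "123", 100f, 700f);
        return glyphs;
    }

    /** Joins glyphs with a forward-only word-gap check, optionally sorting first. */
    static String extract(List<Glyph> glyphs, boolean sortByPosition) {
        List<Glyph> ordered = new ArrayList<>(glyphs);
        if (sortByPosition) {
            // Larger y first (higher on the page), then increasing x.
            ordered.sort(Comparator.comparingDouble((Glyph g) -> -g.y)
                    .thenComparingDouble(g -> g.x));
        }
        StringBuilder out = new StringBuilder();
        Glyph last = null;
        for (Glyph g : ordered) {
            if (last != null && last.x + DIGIT_WIDTH + GAP_TOLERANCE < g.x) {
                out.append(' ');
            }
            out.append(g.text);
            last = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract(contentStreamOrder(), false)); // content-stream order
        System.out.println(extract(contentStreamOrder(), true));  // sorted by position
    }
}
```

In content-stream order the backward jump from "6" to "1" is never seen as a gap, so the two runs fuse. After sorting, the same forward-only check finds the gap between "3" and "4". This only covers the simple two-run file. It says nothing about the reporter's original PDF, where sorting reportedly did not help.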
[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words
[ https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610232#comment-16610232 ]

Emilian Bold commented on PDFBOX-4313:
--------------------------------------

> I can't do any changes without a test PDF. There is more than just what you posted.

You shouldn't need a PDF to make a unit test that fails for PDFTextStripper.

The bug is obvious: you do ' && expectedStartOfNextWordX < positionX', so you assume the next character will have an increasing X coord. If you get a character with X before the start of the current word, you will still append it.

So not only is there a bug in detecting that these are separate words; by appending text that lies before the start of the current word, the ordering is wrong too!

setSortByPosition(true) does not fix this, as shown above.
[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words
[ https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607993#comment-16607993 ]

Tilman Hausherr commented on PDFBOX-4313:
-----------------------------------------

The cropping may only have changed the cropbox rectangle.

I can't do any changes without a test PDF. There is more than just what you posted. I need the PDF or a reduced version of it. A reduced version may be possible if you create a decoded file first (command line utility "WriteDecodedDoc") and then change the content stream with an editor. Of course, for that you'd need to know a bit about the content stream operators etc.

Alternatively, change the source code in the way you think is needed and then run the build tests. If they pass without errors, or show only improvements, please tell me what you did and I'll run additional tests with files that are not in the repository for copyright reasons.
[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words
[ https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607982#comment-16607982 ]

Emilian Bold commented on PDFBOX-4313:
--------------------------------------

setSortByPosition(true) makes some errors go away but introduces new ones:

Direction switch for `Obligatie de plataDocument`
split 1 > Obligatie de plata 267.88 x 111.66
Obligatie de plata@ 267.88 x 111.66[, `O` @ 267.88 x 111.66, `b` @ 274.048 x 111.66, `l` @ 278.44 x 111.66, `i` @ 280.224 x 111.66, `g` @ 282.008 x 111.66, `a` @ 286.4 x 111.66, `t` @ 290.792 x 111.66, `i` @ 292.98398 x 111.66, `e` @ 294.76797 x 111.66, ` ` @ 299.15997 x 111.66, ` ` @ 301.35196 x 111.66, `d` @ 303.54395 x 111.66, `e` @ 307.93594 x 111.66, ` ` @ 312.32794 x 111.66, `p` @ 314.51993 x 111.66, `l` @ 318.91193 x 111.66, `a` @ 320.69592 x 111.66, `t` @ 325.08792 x 111.66, `a` @ 327.2799 x 111.66]
split 2 > Document 72.84 x 117.7
Document@ 72.84 x 117.7[, `D` @ 72.84 x 117.7, `o` @ 78.6 x 117.7, `c` @ 82.992 x 117.7, `u` @ 86.967995 x 117.7, `m` @ 91.35999 x 117.7, `e` @ 98.079994 x 117.7, `n` @ 102.47199 x 117.7, `t` @ 106.86399 x 117.7]

These two chunks, which get mushed together after setSortByPosition(true), are part of this table header:

!crop-fisa-sintetica.png!

To me the bug still seems related to PDFTextStripper (which should order the items itself anyhow if it requires a specific ordering).

I cannot attach the PDF as it contains financial records. Oddly enough, even cropping the PDF (with macOS Preview) seems to preserve some confidential text that's outside the bounds of the crop.
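The "direction switch" reported above amounts to finding the first glyph whose x coordinate is smaller than that of the glyph before it. A minimal sketch of such a check follows. The method name `firstDirectionSwitch` is hypothetical, it assumes left-to-right text, and it is not the reporter's actual test code.

```java
public class DirectionSwitchSketch {

    /**
     * Returns the index of the first glyph whose x is smaller than its
     * predecessor's x, or -1 if x never decreases. Assumes LTR text.
     */
    static int firstDirectionSwitch(float[] xs) {
        for (int i = 1; i < xs.length; i++) {
            if (xs[i] < xs[i - 1]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Tail of "Obligatie de plata" followed by the head of "Document",
        // x values copied from the dump in the comment.
        float[] xs = {318.91193f, 320.69592f, 325.08792f, 327.2799f, 72.84f, 78.6f, 82.992f};
        System.out.println(firstDirectionSwitch(xs)); // index of the `D` glyph
    }
}
```

For the values above the switch is found at the `D` of "Document", which is also where the y coordinate changes from 111.66 to 117.7.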
[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words
[ https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606051#comment-16606051 ]

Tilman Hausherr commented on PDFBOX-4313:
-----------------------------------------

Please attach the PDF file. Also check whether the sort option helps.