I am attaching the patch file. And yes, this patch is simply PDFBOX-3774 as an option, a small cosmetic change to use idiomatic Java for PDFBOX-5487, and a unit test that demonstrates the overlapping.
A couple of additional thoughts:

1. I feel that PDFBOX-5487 isn't doing very much. The PDFBOX-3774 feature will address the problem fixed by PDFBOX-5487, and the "problem" of having a space glyph entirely within the previous character is a very restricted edge case. In the end, the performance hit is not a big deal, but it is code that needs to be maintained. I thought I'd mention it in case the PDFBOX-5487 requester would be happy with PDFBOX-3774 as a solution.

2. I noticed that there is a note about JDK7+ sorting requiring transitive comparators. Given that the build requires JDK8+, I wonder if it is time to remove the Collections.sort path (and get rid of an exception throw, etc.)?

- K

On Mon, Dec 16, 2024 at 6:21 AM Tilman Hausherr <thaush...@t-online.de> wrote:

> On 16.12.2024 14:02, Kevin Day wrote:
> > I just realized that there is an incorrect note in the getter/setter
> > Javadocs about the setting only taking effect if sorting is enabled.
> >
> > That note can be removed. The new setting is valid regardless of whether
> > sorting is enabled.
>
> Hi,
>
> Could you please resend the patch as text attachment? Somehow the mail
> program messed this up.
>
> From what I understand, the patch is the suggestion from PDFBOX-3774 but
> as an option, plus a test. The other change (re PDFBOX-5487) is a
> (useful) cosmetic change. I wonder why I missed that when I committed it.
>
> Tilman

Index: pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java
===================================================================
--- pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java (revision 1922522)
+++ pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java (working copy)
@@ -40,6 +40,7 @@
 import java.util.TreeMap;
 import java.util.TreeSet;
 import java.util.regex.Pattern;
+import java.util.stream.Collectors;
 
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
@@ -147,6 +148,7 @@
     private boolean shouldSeparateByBeads = true;
     private boolean sortByPosition = false;
     private boolean addMoreFormatting = false;
+    private boolean ignoreContentStreamSpaceGlyphs = false;
 
     private float indentThreshold = defaultIndentThreshold;
     private float dropThreshold = defaultDropThreshold;
@@ -524,11 +526,10 @@
                 {
                     IterativeMergeSort.sort(textList, comparator);
                 }
-                finally
-                {
-                    // PDFBOX-5487: Remove all space characters if contained within the adjacent letters
-                    removeContainedSpaces(textList);
-                }
+
+                // PDFBOX-5487: Remove all space characters if contained within the adjacent letters
+                removeContainedSpaces(textList);
+
             }
 
             startArticle();
@@ -556,6 +557,10 @@
                 PositionWrapper current = new PositionWrapper(position);
                 String characterValue = position.getUnicode();
 
+                // PDFBOX-3774 - conditionally ignore spaces from the content stream
+                if (" ".equals(characterValue) && getIgnoreContentStreamSpaceGlyphs())
+                    continue;
+
                 // Resets the average character width when we see a change in font
                 // or a change in the font size
                 if (lastPosition != null &&
@@ -1273,6 +1278,29 @@
         sortByPosition = newSortByPosition;
     }
 
+
+    /**
+     * Determines whether spaces in the content stream text rendering instructions will be ignored during text extraction.
+     *
+     * @return true if space glyphs in the content stream text rendering instructions will be ignored - default is false
+     */
+    public boolean getIgnoreContentStreamSpaceGlyphs() {
+        return ignoreContentStreamSpaceGlyphs;
+    }
+
+    /**
+     * Instruct the algorithm to ignore any spaces in the text rendering instructions in the content stream, and
+     * instead rely purely on the algorithm to determine where word breaks are.
+     *
+     * This can improve text extraction results where the content stream is sorted by position and has text overlapping
+     * spaces, but could cause some word breaks to not be added to the output.
+     *
+     * @param newIgnoreContentStreamSpaceGlyphs whether PDFBox should ignore content stream spaces
+     */
+    public void setIgnoreContentStreamSpaceGlyphs(boolean newIgnoreContentStreamSpaceGlyphs) {
+        ignoreContentStreamSpaceGlyphs = newIgnoreContentStreamSpaceGlyphs;
+    }
+
     /**
      * Get the current space width-based tolerance value that is being used to estimate where spaces in text should be
      * added. Note that the default value for this has been determined from trial and error.
Index: pdfbox/src/test/java/org/apache/pdfbox/text/PDFTextStripperOverlapTest.java
===================================================================
--- pdfbox/src/test/java/org/apache/pdfbox/text/PDFTextStripperOverlapTest.java (revision 0)
+++ pdfbox/src/test/java/org/apache/pdfbox/text/PDFTextStripperOverlapTest.java (revision 0)
@@ -0,0 +1,58 @@
+package org.apache.pdfbox.text;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+import org.apache.pdfbox.pdmodel.PDDocument;
+import org.apache.pdfbox.pdmodel.PDPage;
+import org.apache.pdfbox.pdmodel.PDPageContentStream;
+import org.apache.pdfbox.pdmodel.font.PDFont;
+import org.apache.pdfbox.pdmodel.font.PDType1Font;
+import org.apache.pdfbox.pdmodel.font.Standard14Fonts.FontName;
+import org.junit.jupiter.api.Test;
+
+public class PDFTextStripperOverlapTest {
+
+    @Test
+    void testIgnoreContentStreamSpaceGlyphs() throws Exception
+    {
+        try (PDDocument doc = new PDDocument())
+        {
+            PDPage page = new PDPage();
+            try (PDPageContentStream cs = new PDPageContentStream(doc, page))
+            {
+                float fontHeight = 8;
+                float x = 50;
+                float y = page.getMediaBox().getHeight() - 50;
+                PDFont font = new PDType1Font(FontName.HELVETICA);
+                cs.beginText();
+                cs.setFont(font, fontHeight);
+                cs.newLineAtOffset(x, y);
+                cs.showText("( )");
+                cs.endText();
+
+                int indent = 6;
+                float overlapX = x + indent * font.getAverageFontWidth()/1000f*fontHeight;
+                PDFont overlapFont = new PDType1Font(FontName.TIMES_ROMAN);
+                cs.beginText();
+                cs.setFont(overlapFont, fontHeight*2f);
+                cs.newLineAtOffset(overlapX, y);
+                cs.showText("overlap");
+                cs.endText();
+            }
+            doc.addPage(page);
+
+            PDFTextStripper stripper = new PDFTextStripper();
+            stripper.setLineSeparator("\n");
+            stripper.setPageEnd("\n");
+            stripper.setStartPage(1);
+            stripper.setEndPage(1);
+            stripper.setSortByPosition(true);
+
+            stripper.setIgnoreContentStreamSpaceGlyphs(true);
+            String text = stripper.getText(doc);
+            assertEquals("( overlap )\n", text);
+
+        }
+    }
+
+}
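
For anyone who wants the idea behind the option without reading PDFTextStripper, here is a minimal standalone sketch of it. It is not PDFBox code: the Glyph class, the extract method, the sample data, and the fixed gap threshold are all hypothetical and only illustrate the principle of dropping space glyphs from the content stream and inferring word breaks from the horizontal gap between the remaining glyphs. The real stripper uses a far more involved heuristic.

```java
import java.util.ArrayList;
import java.util.List;

public class SpaceGlyphSketch {

    /** Hypothetical stand-in for a positioned glyph; not a PDFBox type. */
    static final class Glyph {
        final String text;
        final float x;      // left edge of the glyph
        final float width;  // advance width of the glyph

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Concatenates glyphs that are already sorted by x position.
     * When ignoreSpaceGlyphs is true, space glyphs are skipped and word
     * breaks come only from gaps wider than gapThreshold (an assumed,
     * fixed value chosen for this illustration).
     */
    static String extract(List<Glyph> glyphs, boolean ignoreSpaceGlyphs, float gapThreshold) {
        StringBuilder out = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            boolean isSpace = " ".equals(g.text);
            if (ignoreSpaceGlyphs && isSpace) {
                continue; // same spirit as the "continue" added by the patch
            }
            // Infer a word break only between two non-space glyphs.
            if (previous != null && !isSpace && !" ".equals(previous.text)
                    && g.x - (previous.x + previous.width) > gapThreshold) {
                out.append(' ');
            }
            out.append(g.text);
            previous = g;
        }
        return out.toString();
    }

    /** Sample data: a space glyph that sits inside the box of the letter "b". */
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("a", 0f, 5f));
        glyphs.add(new Glyph("b", 5f, 5f));
        glyphs.add(new Glyph(" ", 7f, 3f));  // overlapped by "b"
        glyphs.add(new Glyph("c", 10f, 5f)); // touches "b", so no visual gap
        glyphs.add(new Glyph("d", 20f, 5f)); // a real visual gap before "d"
        return glyphs;
    }

    public static void main(String[] args) {
        System.out.println(extract(sample(), false, 2f)); // prints "ab c d"
        System.out.println(extract(sample(), true, 2f));  // prints "abc d"
    }
}
```

With the space glyph kept, the sorted stream yields a spurious break inside "abc"; with it ignored, only the genuine gap before "d" produces a space. The trade-off noted in the setter Javadoc also shows up here: a word break that exists only as a space glyph, with no gap wider than the threshold, would be lost.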