Re: Text extraction adding lots of strange spaces

Kevin Day Mon, 16 Dec 2024 06:56:12 -0800

I am attaching the patch file.

And yes, this patch is simply PDFBOX-3774 as an option, a small cosmetic
change to use idiomatic Java for PDFBOX-5487, and a unit test that
demonstrates the overlapping.
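
For anyone reading without the patch applied, here is a minimal sketch of the idea behind the option, using a simplified model. Glyph, SpaceGlyphOptionSketch, and the fixed gap threshold are hypothetical stand-ins invented for illustration; they are not PDFBox classes. The real change is in PDFTextStripper, whose word-break heuristics are more involved than the toy one shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a positioned glyph (not PDFBox's TextPosition).
final class Glyph
{
    final String unicode; // text produced by the glyph
    final float x;        // left edge
    final float width;    // advance width

    Glyph(String unicode, float x, float width)
    {
        this.unicode = unicode;
        this.x = x;
        this.width = width;
    }
}

final class SpaceGlyphOptionSketch
{
    // When the flag is set, drop glyphs whose text is a single space,
    // so that word breaks come only from the gap heuristic in join().
    static List<Glyph> filter(List<Glyph> glyphs, boolean ignoreContentStreamSpaceGlyphs)
    {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs)
        {
            if (ignoreContentStreamSpaceGlyphs && " ".equals(g.unicode))
            {
                continue;
            }
            kept.add(g);
        }
        return kept;
    }

    // Toy word-break heuristic (an assumption, not PDFBox's algorithm):
    // emit a space when the horizontal gap to the previous glyph exceeds gapThreshold.
    static String join(List<Glyph> glyphs, float gapThreshold)
    {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs)
        {
            if (prev != null && g.x - (prev.x + prev.width) > gapThreshold)
            {
                sb.append(' ');
            }
            sb.append(g.unicode);
            prev = g;
        }
        return sb.toString();
    }
}
```

If a position-sorted glyph list has a run of space glyphs that overlaps a word drawn on top of it, joining the unfiltered list interleaves the spaces with the word's letters. Filtering first leaves only the letters, and the gap heuristic decides where the breaks go.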

A couple of additional thoughts:

1.  I feel that PDFBOX-5487 isn't doing very much.  The PDFBOX-3774 feature
will address the problem fixed by PDFBOX-5487, and the "problem" of having
a space glyph entirely within the previous character is a very restricted
edge-case.  In the end, the performance hit is not a big deal, but it is
code that needs to be maintained.  I thought I'd mention it in case the
PDFBOX-5487 requester would be happy with PDFBOX-3774 as a solution.

2.  I noticed that there is a note about JDK7+ sorting requiring transitive
comparators.  Given that the build requires JDK8+, I wonder if it is time
to remove the Collections.sort path (and get rid of an exception throw,
etc.)?
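
To illustrate point 2, here is a rough sketch of the pattern in question. It assumes the current code tries the built-in sort first and falls back to a custom merge sort when the JDK reports a comparator contract violation. SortFallbackSketch and its recursive mergeSort are hypothetical illustrations, not PDFBox's IterativeMergeSort.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class SortFallbackSketch
{
    // Pattern as assumed above: try the built-in sort, fall back on a contract violation.
    static <T> void sortWithFallback(List<T> list, Comparator<? super T> cmp)
    {
        try
        {
            // Since JDK 7 the built-in sort may throw IllegalArgumentException
            // if it detects that the comparator violates its contract.
            Collections.sort(list, cmp);
        }
        catch (IllegalArgumentException e)
        {
            mergeSort(list, cmp);
        }
    }

    // Plain top-down stable merge sort. It never checks comparator consistency,
    // so it does not throw for a non-transitive comparator.
    static <T> void mergeSort(List<T> list, Comparator<? super T> cmp)
    {
        if (list.size() < 2)
        {
            return;
        }
        int mid = list.size() / 2;
        List<T> left = new ArrayList<>(list.subList(0, mid));
        List<T> right = new ArrayList<>(list.subList(mid, list.size()));
        mergeSort(left, cmp);
        mergeSort(right, cmp);

        int i = 0;
        int j = 0;
        int k = 0;
        while (i < left.size() && j < right.size())
        {
            // "<= 0" takes from the left half on ties, which keeps the sort stable
            if (cmp.compare(left.get(i), right.get(j)) <= 0)
            {
                list.set(k++, left.get(i++));
            }
            else
            {
                list.set(k++, right.get(j++));
            }
        }
        while (i < left.size())
        {
            list.set(k++, left.get(i++));
        }
        while (j < right.size())
        {
            list.set(k++, right.get(j++));
        }
    }
}
```

Dropping the first path would mean calling the merge sort directly, which removes the try/catch. The cost is giving up the built-in sort's speed on inputs where the comparator happens to behave.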

- K

On Mon, Dec 16, 2024 at 6:21 AM Tilman Hausherr <thaush...@t-online.de>
wrote:

> On 16.12.2024 14:02, Kevin Day wrote:
> > I just realized that there is an incorrect note in the getter/setter
> > Javadocs about the setting only taking effect if sorting is enabled.
> >
> > That note can be removed. The new setting is valid regardless of whether
> > sorting is enabled.
>
> Hi,
>
> Could you please resend the patch as text attachment? Somehow the mail
> program messed this up.
>
> From what I understand, the patch is the suggestion from PDFBOX-3774 but
> as an option, plus a test. The other change (re PDFBOX-5487) is a
> (useful) cosmetic change. I wonder why I missed that when I committed it.
>
> Tilman
>

Index: pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java
===================================================================
--- pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java    (revision 1922522)
+++ pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java    (working copy)
@@ -40,6 +40,7 @@
 import java.util.TreeMap;
 import java.util.TreeSet;
 import java.util.regex.Pattern;
+import java.util.stream.Collectors;
 
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
@@ -147,6 +148,7 @@
     private boolean shouldSeparateByBeads = true;
     private boolean sortByPosition = false;
     private boolean addMoreFormatting = false;
+    private boolean ignoreContentStreamSpaceGlyphs = false;
 
     private float indentThreshold = defaultIndentThreshold;
     private float dropThreshold = defaultDropThreshold;
@@ -524,11 +526,10 @@
                 {
                     IterativeMergeSort.sort(textList, comparator);
                 }
-                finally
-                {
-                    // PDFBOX-5487: Remove all space characters if contained within the adjacent letters
-                    removeContainedSpaces(textList);
-                }
+                
+                // PDFBOX-5487: Remove all space characters if contained within the adjacent letters
+                removeContainedSpaces(textList);
+                
             }
 
             startArticle();
@@ -556,6 +557,10 @@
                 PositionWrapper current = new PositionWrapper(position);
                 String characterValue = position.getUnicode();
 
+                // PDFBOX-3774 - conditionally ignore spaces from the content stream
+                if (" ".equals(characterValue) && getIgnoreContentStreamSpaceGlyphs())
+                    continue;
+                
                 // Resets the average character width when we see a change in font
                 // or a change in the font size
                 if (lastPosition != null &&
@@ -1273,6 +1278,29 @@
         sortByPosition = newSortByPosition;
     }
 
+    
+    /**
+     * Determines whether spaces in the content stream text rendering instructions will be ignored during text extraction.
+     * 
+     * @return true if space glyphs in the content stream text rendering instructions will be ignored - default is false
+     */
+    public boolean getIgnoreContentStreamSpaceGlyphs() {
+       return ignoreContentStreamSpaceGlyphs;
+    }
+    
+    /**
+     * Instruct the algorithm to ignore any spaces in the text rendering instructions in the content stream, and
+     * instead rely purely on the algorithm to determine where word breaks are.
+     * 
+     * This can improve text extraction results where the content stream is sorted by position and has text overlapping
+     * spaces, but could cause some word breaks to not be added to the output.
+     * 
+     * @param newIgnoreContentStreamSpaceGlyphs whether PDFBox should ignore content stream spaces
+     */
+    public void setIgnoreContentStreamSpaceGlyphs(boolean newIgnoreContentStreamSpaceGlyphs) {
+       ignoreContentStreamSpaceGlyphs = newIgnoreContentStreamSpaceGlyphs;
+    }
+    
     /**
      * Get the current space width-based tolerance value that is being used to estimate where spaces in text should be
      * added. Note that the default value for this has been determined from trial and error.
Index: pdfbox/src/test/java/org/apache/pdfbox/text/PDFTextStripperOverlapTest.java
===================================================================
--- pdfbox/src/test/java/org/apache/pdfbox/text/PDFTextStripperOverlapTest.java (revision 0)
+++ pdfbox/src/test/java/org/apache/pdfbox/text/PDFTextStripperOverlapTest.java (revision 0)
@@ -0,0 +1,58 @@
+package org.apache.pdfbox.text;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+import org.apache.pdfbox.pdmodel.PDDocument;
+import org.apache.pdfbox.pdmodel.PDPage;
+import org.apache.pdfbox.pdmodel.PDPageContentStream;
+import org.apache.pdfbox.pdmodel.font.PDFont;
+import org.apache.pdfbox.pdmodel.font.PDType1Font;
+import org.apache.pdfbox.pdmodel.font.Standard14Fonts.FontName;
+import org.junit.jupiter.api.Test;
+
+public class PDFTextStripperOverlapTest {
+
+    @Test
+    void testIgnoreContentStreamSpaceGlyphs() throws Exception
+    {
+        try (PDDocument doc = new PDDocument())
+        {
+            PDPage page = new PDPage();
+            try (PDPageContentStream cs = new PDPageContentStream(doc, page))
+            {
+                float fontHeight = 8;
+                float x = 50;
+                float y = page.getMediaBox().getHeight() - 50;
+                PDFont font = new PDType1Font(FontName.HELVETICA);
+                cs.beginText();
+                cs.setFont(font, fontHeight);
+                cs.newLineAtOffset(x, y);
+                cs.showText("(                                      )");
+                cs.endText();
+
+                int indent = 6;
+                float overlapX = x + indent * font.getAverageFontWidth()/1000f*fontHeight;
+                PDFont overlapFont = new PDType1Font(FontName.TIMES_ROMAN);
+                cs.beginText();
+                cs.setFont(overlapFont, fontHeight*2f);
+                cs.newLineAtOffset(overlapX, y);
+                cs.showText("overlap");
+                cs.endText();
+            }
+            doc.addPage(page);
+            
+            PDFTextStripper stripper = new PDFTextStripper();
+            stripper.setLineSeparator("\n");
+            stripper.setPageEnd("\n");
+            stripper.setStartPage(1);
+            stripper.setEndPage(1);
+            stripper.setSortByPosition(true);
+
+            stripper.setIgnoreContentStreamSpaceGlyphs(true);
+            String text = stripper.getText(doc);
+            assertEquals("( overlap )\n", text);
+
+        }
+    }  
+
+}
