Re: Text extraction adding lots of strange spaces

Kevin Day Mon, 16 Dec 2024 05:02:53 -0800

I just realized that there is an incorrect note in the getter/setter
Javadocs about the setting only taking effect if sorting is enabled.


That note can be removed. The new setting is valid regardless of whether
sorting is enabled.

K

On Sun, Dec 15, 2024, 10:14 PM Kevin Day <ke...@trumpetinc.com> wrote:

> ok - I've done a thorough evaluation (and a lot of pathfinding) of what
> would be required to properly suppress overlapping spaces (but leave in
> the non-overlapping spaces).
>
> What is really needed is to compare two non-space positions and see if
> there *should* be a space between them, and if not, then suppress any
> spaces that appear between them.  I know how to do this, but it would
> require a good bit of surgery.  Basically, we would need to change to
> accumulating text positions by line (without processing them).  Then once a
> line is accumulated, run through the line and insert word separators (and
> remove any overlapping spaces).  I would probably do this by moving the
> word separation logic into the normalize(line) method.  But that is a lot
> of surgery.
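
A rough sketch of the per-line pass described above, for illustration only. This is not PDFBox code: `Glyph`, `rebuildLine`, and the `tolerance` factor are hypothetical names, and the sketch assumes the glyphs of one line are already sorted left to right. It drops every space glyph that came from the content stream and emits a separator only where the gap between two non-space glyphs is wider than a fraction of the space width.

```java
import java.util.List;

public class LineSpacingSketch {
    // Hypothetical stand-in for a text position: its text, left edge, and advance width.
    public static final class Glyph {
        final String text;
        final float x;
        final float width;

        public Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        float end() {
            return x + width;
        }
    }

    // Rebuilds one accumulated line. Space glyphs are skipped entirely; a single
    // separator is inserted wherever the horizontal gap between two consecutive
    // non-space glyphs exceeds spaceWidth * tolerance.
    public static String rebuildLine(List<Glyph> line, float spaceWidth, float tolerance) {
        StringBuilder out = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : line) {
            if (" ".equals(g.text)) {
                continue; // ignore spaces from the content stream
            }
            if (previous != null && g.x - previous.end() > spaceWidth * tolerance) {
                out.append(' ');
            }
            out.append(g.text);
            previous = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = List.of(
                new Glyph("a", 0f, 5f),
                new Glyph(" ", 4f, 3f),   // space drawn underneath the adjacent letters
                new Glyph("b", 5f, 5f),
                new Glyph("c", 14f, 5f)); // real gap before this glyph
        System.out.println(rebuildLine(line, 3f, 0.5f)); // prints "ab c"
    }
}
```

The gap test depends on the space width, which is exactly where a wrong estimate could suppress or invent a word break.
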
>
> My concern is that determining if two non-space glyphs are actually
> adjacent requires making a determination based on the width of a space
> glyph - and that introduces the risk of suppressing an overlapping space
> when it shouldn't.  I'm not comfortable with that sort of regression -
> especially for something that is not adding a ton of value beyond what we
> can achieve by just removing Tj spaces entirely.
>
> If someone really cares about preserving the physical layout of the page,
> I believe it would be better to put time into making normalize() add space
> characters (or tabs maybe) based on the physical layout of the non-space
> characters, instead of attempting to preserve any spaces that are part of
> the content stream operations.
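
To make the layout-based idea concrete, here is a minimal sketch under stated assumptions. It is not how normalize() works today: `Glyph`, `layoutLine`, and `cellWidth` (an assumed fixed width for one output column) are hypothetical. Each non-space glyph is mapped to an output column from its x position, and the line is padded with spaces up to that column.

```java
import java.util.List;

public class LayoutPaddingSketch {
    // Hypothetical stand-in for a text position: its text and left edge.
    public static final class Glyph {
        final String text;
        final float x;

        public Glyph(String text, float x) {
            this.text = text;
            this.x = x;
        }
    }

    // Places each non-space glyph at the column nearest to x / cellWidth,
    // padding with spaces as needed. Content-stream spaces are ignored.
    // If a glyph's column is already occupied, it is simply appended.
    public static String layoutLine(List<Glyph> line, float cellWidth) {
        StringBuilder out = new StringBuilder();
        for (Glyph g : line) {
            if (" ".equals(g.text)) {
                continue;
            }
            int column = Math.round(g.x / cellWidth);
            while (out.length() < column) {
                out.append(' ');
            }
            out.append(g.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = List.of(
                new Glyph("T", 0f),
                new Glyph("o", 5f),
                new Glyph("4", 30f)); // far to the right, e.g. a table column
        System.out.println("[" + layoutLine(line, 5f) + "]");
    }
}
```

A fixed cell width only approximates proportional fonts, so this preserves rough column alignment rather than exact positions.
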
>
> So for now, I am going to suggest that we implement the setting to ignore
> space glyphs in the content stream (as suggested in
> https://issues.apache.org/jira/browse/PDFBOX-3774 ).  I'm including a
> patch that does this below (one source change, plus a self-contained unit
> test).  I wish this patch were more valuable!  For what it's worth,
> this change has a dramatic effect on the robustness of our application, so
> there is a ton of value to us in having PDFBox add this setting in 3.0.4.
>
> I'm happy to discuss the above further - I'm just not seeing a huge amount
> of value compared to the blanket removal of spaces...
>
>
> Index: pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java
> ===================================================================
> --- pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java	(revision 1922522)
> +++ pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java	(working copy)
> @@ -40,6 +40,7 @@
>  import java.util.TreeMap;
>  import java.util.TreeSet;
>  import java.util.regex.Pattern;
> +import java.util.stream.Collectors;
>
>  import org.apache.commons.logging.Log;
>  import org.apache.commons.logging.LogFactory;
> @@ -147,6 +148,7 @@
>      private boolean shouldSeparateByBeads = true;
>      private boolean sortByPosition = false;
>      private boolean addMoreFormatting = false;
> +    private boolean ignoreContentStreamSpaceGlyphs = false;
>
>      private float indentThreshold = defaultIndentThreshold;
>      private float dropThreshold = defaultDropThreshold;
> @@ -524,11 +526,10 @@
>                  {
>                      IterativeMergeSort.sort(textList, comparator);
>                  }
> -                finally
> -                {
> -                    // PDFBOX-5487: Remove all space characters if contained within the adjacent letters
> -                    removeContainedSpaces(textList);
> -                }
> +
> +                // PDFBOX-5487: Remove all space characters if contained within the adjacent letters
> +                removeContainedSpaces(textList);
> +
>              }
>
>              startArticle();
> @@ -556,6 +557,10 @@
>                  PositionWrapper current = new PositionWrapper(position);
>                  String characterValue = position.getUnicode();
>
> +                // PDFBOX-3774 - conditionally ignore spaces from the content stream
> +                if (" ".equals(characterValue) && getIgnoreContentStreamSpaceGlyphs())
> +                    continue;
> +
>                  // Resets the average character width when we see a change in font
>                  // or a change in the font size
>                  if (lastPosition != null &&
> @@ -1273,6 +1278,30 @@
>          sortByPosition = newSortByPosition;
>      }
>
> +
> +    /**
> +     * This setting only applies if sortByPosition has been set to true.
> +     *
> +     * Determines whether spaces in the content stream text rendering instructions will be ignored during text extraction.
> +     *
> +     * @return whether space glyphs in the content stream text rendering instructions will be ignored during text extraction - default is false
> +     */
> +    public boolean getIgnoreContentStreamSpaceGlyphs() {
> +        return ignoreContentStreamSpaceGlyphs;
> +    }
> +
> +    /**
> +     * This setting only applies if sortByPosition has been set to true.
> +     *
> +     * Instruct the algorithm to ignore any spaces in the text rendering instructions in the content stream, and instead rely purely on the algorithm to determine where word breaks are.
> +     * This can improve text extraction results where the content stream has text overlapping spaces, but could cause some word breaks to not be added to the output.
> +     *
> +     * @param newIgnoreContentStreamSpaceGlyphs whether PDFBox should ignore content stream spaces
> +     */
> +    public void setIgnoreContentStreamSpaceGlyphs(boolean newIgnoreContentStreamSpaceGlyphs) {
> +        ignoreContentStreamSpaceGlyphs = newIgnoreContentStreamSpaceGlyphs;
> +    }
> +
>      /**
>       * Get the current space width-based tolerance value that is being used to estimate where spaces in text should be
>       * added. Note that the default value for this has been determined from trial and error.
> Index: pdfbox/src/test/java/org/apache/pdfbox/text/PDFTextStripperOverlapTest.java
> ===================================================================
> --- pdfbox/src/test/java/org/apache/pdfbox/text/PDFTextStripperOverlapTest.java	(revision 0)
> +++ pdfbox/src/test/java/org/apache/pdfbox/text/PDFTextStripperOverlapTest.java	(revision 0)
> @@ -0,0 +1,58 @@
> +package org.apache.pdfbox.text;
> +
> +import static org.junit.jupiter.api.Assertions.assertEquals;
> +
> +import org.apache.pdfbox.pdmodel.PDDocument;
> +import org.apache.pdfbox.pdmodel.PDPage;
> +import org.apache.pdfbox.pdmodel.PDPageContentStream;
> +import org.apache.pdfbox.pdmodel.font.PDFont;
> +import org.apache.pdfbox.pdmodel.font.PDType1Font;
> +import org.apache.pdfbox.pdmodel.font.Standard14Fonts.FontName;
> +import org.junit.jupiter.api.Test;
> +
> +public class PDFTextStripperOverlapTest {
> +
> +    @Test
> +    void testIgnoreContentStreamSpaceGlyphs() throws Exception
> +    {
> +        try (PDDocument doc = new PDDocument())
> +        {
> +            PDPage page = new PDPage();
> +            try (PDPageContentStream cs = new PDPageContentStream(doc, page))
> +            {
> +                float fontHeight = 8;
> +                float x = 50;
> +                float y = page.getMediaBox().getHeight() - 50;
> +                PDFont font = new PDType1Font(FontName.HELVETICA);
> +                cs.beginText();
> +                cs.setFont(font, fontHeight);
> +                cs.newLineAtOffset(x, y);
> +                cs.showText("(                                      )");
> +                cs.endText();
> +
> +                int indent = 6;
> +                float overlapX = x + indent * font.getAverageFontWidth()/1000f*fontHeight;
> +                PDFont overlapFont = new PDType1Font(FontName.TIMES_ROMAN);
> +                cs.beginText();
> +                cs.setFont(overlapFont, fontHeight*2f);
> +                cs.newLineAtOffset(overlapX, y);
> +                cs.showText("overlap");
> +                cs.endText();
> +            }
> +            doc.addPage(page);
> +
> +            PDFTextStripper stripper = new PDFTextStripper();
> +            stripper.setLineSeparator("\n");
> +            stripper.setPageEnd("\n");
> +            stripper.setStartPage(1);
> +            stripper.setEndPage(1);
> +            stripper.setSortByPosition(true);
> +
> +            stripper.setIgnoreContentStreamSpaceGlyphs(true);
> +            String text = stripper.getText(doc);
> +            assertEquals("( overlap )\n", text);
> +
> +        }
> +    }
> +
> +}
>
>
