While working with the PDFTextStripper class, I found a bug in the code. I’d
love to contribute the fix, but I am not sure of the best way to do that. I am
an experienced programmer, but I have never contributed to an open source
project (yet, although I should, considering how much I benefit from them).

I was pulling text from a PDF using a custom PDFTextStripper subclass that
overrides writeString(String text, List<TextPosition> textPositions), and the
textPositions I received did not line up with the text. The text positions for
every “word” in a line always came through as the text positions of the last
word in that line. I traced the issue to the PDFTextStripper class.

Here is the code at issue:

    /**
     * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
     * @return The StringBuilder that must be used when calling this method.
     */
    private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
            StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
    {
        if (text instanceof WordSeparator)
        {
            normalized.add(createWord(lineBuilder.toString(), wordPositions));
            lineBuilder = new StringBuilder();
            wordPositions.clear();
        }
        else
        {
            lineBuilder.append(text.getCharacter());
            wordPositions.add(text);
        }
        return lineBuilder;
    }


In the normalizeAdd method, you create a new word by passing wordPositions to
createWord. The new WordWithTextPositions in the normalized linked list stores
a reference to that list, not a copy, but the very next statement calls
wordPositions.clear(). Because the same list object is shared, clearing it also
empties the positions held by the WordWithTextPositions you just created.
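The aliasing can be reproduced outside PDFBox. The following is only a minimal sketch under stated assumptions: `AliasingDemo`, `Word`, and `split` are hypothetical names, characters stand in for TextPosition objects, a space stands in for a WordSeparator, and the last word is assumed to be flushed after the loop without a further clear(). With `copy` set to false the word keeps the shared list, as in the code above; with `copy` set to true each word gets its own defensive copy.

```java
import java.util.ArrayList;
import java.util.List;

class AliasingDemo {
    /** Hypothetical stand-in for WordWithTextPositions: keeps the list it is given. */
    static final class Word {
        final String text;
        final List<Character> positions;

        Word(String text, List<Character> positions) {
            this.text = text;
            this.positions = positions; // stores the reference, not a copy
        }
    }

    /** Splits a line on spaces, either sharing one positions list or copying it per word. */
    static List<Word> split(String line, boolean copy) {
        List<Word> words = new ArrayList<Word>();
        StringBuilder builder = new StringBuilder();
        List<Character> positions = new ArrayList<Character>();
        for (char c : line.toCharArray()) {
            if (c == ' ') {
                List<Character> forWord =
                        copy ? new ArrayList<Character>(positions) : positions;
                words.add(new Word(builder.toString(), forWord));
                builder = new StringBuilder();
                positions.clear(); // also empties any Word that shares this list
            } else {
                builder.append(c);
                positions.add(c);
            }
        }
        if (builder.length() > 0) {
            // Assumed final flush of the last word in the line.
            words.add(new Word(builder.toString(), positions));
        }
        return words;
    }

    public static void main(String[] args) {
        for (Word w : split("ab cd", false)) {
            System.out.println("shared: " + w.text + " -> " + w.positions);
        }
        for (Word w : split("ab cd", true)) {
            System.out.println("copied: " + w.text + " -> " + w.positions);
        }
    }
}
```

In the shared case both words report [c, d], the positions of the last word, which matches the symptom described above. In the copied case "ab" keeps [a, b] and "cd" keeps [c, d].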

So I would suggest the following:

    /**
     * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
     * @return The StringBuilder that must be used when calling this method.
     */
    private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
            StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
    {
        if (text instanceof WordSeparator)
        {
            // Give the new word its own copy of the positions. Reassigning the
            // wordPositions parameter would not be visible to the caller, so the
            // caller's list is still cleared in place for the next word.
            normalized.add(createWord(lineBuilder.toString(),
                    new ArrayList<TextPosition>(wordPositions)));
            lineBuilder = new StringBuilder();
            wordPositions.clear();
        }
        else
        {
            lineBuilder.append(text.getCharacter());
            wordPositions.add(text);
        }
        return lineBuilder;
    }


This should fix the issue. I would be more than happy to submit the change
myself, but as I mentioned, I have not contributed to an open source project
before.

Thanks!
Andy Phillips
