While working with the PDFTextStripper class, I found a bug in the code. I’d love to
contribute the fix, but I’m not sure of the best way to do that. I am an experienced
programmer, but I have never contributed to an open source project (yet, although
I should, considering how much I benefit from them).
I was pulling text from a PDF using a custom PDFTextStripper subclass that
overrides writeString(String text, List<TextPosition> textPositions), and the
textPositions I received did not line up with the text. The text positions of
every “word” in a line always came through as the text positions of the last
word in that line. I traced the issue to the PDFTextStripper class.
Here is the code in question:
    /**
     * Used within {@link #normalize(List, boolean, boolean)} to handle a
     * {@link TextPosition}.
     * @return The StringBuilder that must be used when calling this method.
     */
    private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
            StringBuilder lineBuilder, List<TextPosition> wordPositions,
            TextPosition text)
    {
        if (text instanceof WordSeparator)
        {
            normalized.add(createWord(lineBuilder.toString(), wordPositions));
            lineBuilder = new StringBuilder();
            wordPositions.clear();
        }
        else
        {
            lineBuilder.append(text.getCharacter());
            wordPositions.add(text);
        }
        return lineBuilder;
    }
In the normalizeAdd method, you create a new word and pass it wordPositions.
A reference to that list is stored in the new WordWithTextPositions in the
normalized linked list, but on the very next line you call clear() on it.
Because the list was passed by reference rather than copied, clearing it also
empties the positions held by the WordWithTextPositions you just created. Every
later word in the line is then handed the same list, so all of them end up
sharing one set of positions.
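The aliasing described above can be reproduced without PDFBox. The following is only a minimal sketch: Word is a hypothetical stand-in for WordWithTextPositions, and plain strings stand in for TextPosition objects. It assumes, as in the reported behaviour, that the last word of a line is added without a further clear().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for WordWithTextPositions: it keeps the list it is given.
class Word {
    final String text;
    final List<String> positions; // stored by reference, not copied

    Word(String text, List<String> positions) {
        this.text = text;
        this.positions = positions;
    }
}

class AliasingDemo {
    // Mimics the reported pattern: hand the shared list to a word, then clear it.
    static List<Word> buildWithSharedList() {
        List<Word> words = new ArrayList<>();
        List<String> positions = new ArrayList<>();

        positions.add("p1");
        positions.add("p2");
        words.add(new Word("ab", positions));
        positions.clear(); // also empties the list held by the first word

        positions.add("p3");
        words.add(new Word("c", positions)); // last word of the line, no clear()
        return words;
    }

    public static void main(String[] args) {
        for (Word w : buildWithSharedList()) {
            System.out.println(w.text + " -> " + w.positions);
        }
    }
}
```

Both words end up holding the same list object, so the first word reports the positions of the last one, which matches the symptom described above.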
So, I would suggest something like the following:
    /**
     * Used within {@link #normalize(List, boolean, boolean)} to handle a
     * {@link TextPosition}.
     * @return The StringBuilder that must be used when calling this method.
     */
    private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
            StringBuilder lineBuilder, List<TextPosition> wordPositions,
            TextPosition text)
    {
        if (text instanceof WordSeparator)
        {
            // Give the word its own copy of the positions. Assigning a new list
            // to the wordPositions parameter would not be seen by the caller,
            // which keeps passing the same list in.
            normalized.add(createWord(lineBuilder.toString(),
                    new ArrayList<TextPosition>(wordPositions)));
            lineBuilder = new StringBuilder();
            wordPositions.clear();
        }
        else
        {
            lineBuilder.append(text.getCharacter());
            wordPositions.add(text);
        }
        return lineBuilder;
    }
This will fix the issue. I would be more than happy to add this, but as I
mentioned, I am not really experienced in contributing to open source projects.
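One detail to watch with any fix here: Java passes object references by value, so assigning a new list to the wordPositions parameter inside normalizeAdd would not change the list the caller holds. The sketch below uses hypothetical names and plain strings instead of TextPosition objects. It contrasts reassigning the parameter with storing a copy and then clearing the shared list.

```java
import java.util.ArrayList;
import java.util.List;

class CopyVersusReassign {
    // Reassigning the parameter only changes this method's local variable.
    static void reassign(List<List<String>> words, List<String> positions) {
        words.add(positions);
        positions = new ArrayList<>(); // the caller still holds the old list
    }

    // Storing a copy lets the shared list be cleared and reused safely.
    static void copyThenClear(List<List<String>> words, List<String> positions) {
        words.add(new ArrayList<>(positions));
        positions.clear();
    }

    // Simulates a caller that reuses one positions list for a two-word line.
    static List<List<String>> run(boolean useCopy) {
        List<List<String>> words = new ArrayList<>();
        List<String> positions = new ArrayList<>();

        positions.add("p1");
        if (useCopy) {
            copyThenClear(words, positions);
        } else {
            reassign(words, positions);
        }

        positions.add("p2");  // the caller keeps filling its own list
        words.add(positions); // last word of the line
        return words;
    }

    public static void main(String[] args) {
        System.out.println("reassign: " + run(false));
        System.out.println("copy:     " + run(true));
    }
}
```

With reassignment, both words still share one list holding every position. With the copy, each word keeps only its own positions.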
Thanks!
Andy Phillips