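Thanks. I found a mistake in my suggested fix (I wrote the email before trying
it): assigning a new list to the wordPositions parameter only rebinds the local
variable inside normalizeAdd, so the caller's list is never reset. The correct
fix is to pass createWord a copy of the list and keep the clear():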
/**
 * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
 * @return The StringBuilder that must be used when calling this method.
 */
private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
        StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
{
    if (text instanceof WordSeparator)
    {
        // Pass a copy so the stored word keeps its positions after clear() below.
        normalized.add(createWord(lineBuilder.toString(),
                new ArrayList<TextPosition>(wordPositions)));
        lineBuilder = new StringBuilder();
        wordPositions.clear();
    }
    else
    {
        lineBuilder.append(text.getCharacter());
        wordPositions.add(text);
    }
    return lineBuilder;
}
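To show the difference between the three variants outside of PDFBox, here is a minimal, self-contained sketch. It is not PDFBox code: the Word class, the Mode enum and the split/add helpers are made up for the demo, positions are plain character indexes instead of TextPosition objects, and I am assuming the caller keeps one list for the whole line and stores the last word without clearing it.

```java
import java.util.ArrayList;
import java.util.List;

class WordPositionsAliasingDemo
{
    /** Hypothetical stand-in for WordWithTextPositions: stores the list it is given, no copy. */
    static final class Word
    {
        final String text;
        final List<Integer> positions;

        Word(String text, List<Integer> positions)
        {
            this.text = text;
            this.positions = positions;
        }

        @Override
        public String toString()
        {
            return text + positions;
        }
    }

    enum Mode { SHARE_AND_CLEAR, REASSIGN_PARAMETER, COPY_AND_CLEAR }

    /** Mimics the shape of normalizeAdd; a space plays the role of WordSeparator. */
    static StringBuilder add(List<Word> out, StringBuilder sb, List<Integer> positions,
            char c, int index, Mode mode)
    {
        if (c == ' ')
        {
            switch (mode)
            {
                case SHARE_AND_CLEAR:
                    // Original code: the word and the caller share one list, then it is emptied.
                    out.add(new Word(sb.toString(), positions));
                    positions.clear();
                    break;
                case REASSIGN_PARAMETER:
                    // First suggested fix: only the local variable changes, the caller's list does not.
                    out.add(new Word(sb.toString(), positions));
                    positions = new ArrayList<Integer>();
                    break;
                default:
                    // Corrected fix: the word gets its own copy, so clear() cannot touch it.
                    out.add(new Word(sb.toString(), new ArrayList<Integer>(positions)));
                    positions.clear();
                    break;
            }
            return new StringBuilder();
        }
        sb.append(c);
        positions.add(index);
        return sb;
    }

    /** Plays the role of the caller: one positions list is reused for the whole line. */
    static List<Word> split(String line, Mode mode)
    {
        List<Word> out = new ArrayList<Word>();
        List<Integer> positions = new ArrayList<Integer>();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < line.length(); i++)
        {
            sb = add(out, sb, positions, line.charAt(i), i, mode);
        }
        if (sb.length() > 0)
        {
            out.add(new Word(sb.toString(), positions)); // last word of the line
        }
        return out;
    }

    public static void main(String[] args)
    {
        for (Mode mode : Mode.values())
        {
            System.out.println(mode + " -> " + split("ab cd", mode));
        }
        // SHARE_AND_CLEAR -> [ab[3, 4], cd[3, 4]]
        // REASSIGN_PARAMETER -> [ab[0, 1, 3, 4], cd[0, 1, 3, 4]]
        // COPY_AND_CLEAR -> [ab[0, 1], cd[3, 4]]
    }
}
```

With the shared list every word ends up with the positions of the last word, which matches what I was seeing. With the reassigned parameter the positions just keep piling up. Only the copy gives each word its own positions.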
I’ll be more than happy to create an account and add the fix. What is the
link to JIRA for this project?
Thanks!
Andy P
On Dec 9, 2013, at 5:01 AM, Andreas Lehmkühler <[email protected]> wrote:
> Hi,
>
>> On December 6, 2013 at 23:05, Andrew Phillips <[email protected]> wrote:
>>
>>
>> While working with the PDFTextStripper class, I found a bug in the code. I’d
>> love to contribute the fix, but I’m not sure of the best way to do that. I am
>> an experienced programmer, but have never contributed to an open source
>> project (yet, although I should, considering how much I benefit from them).
>
> Thanks for your interest in PDFBox and your offer to help. We are using JIRA [1]
> to handle any changes, such as issues, improvements etc. You have to create a
> user account (it's free) and create an issue. Choose a reasonable title, add a
> description and attach a sample PDF if possible. Patches should be created as a
> diff against the current trunk and attached to the issue as well. That's it.
>
>>
>> I was pulling text from a PDF using a custom PDFTextStripper subclass that
>> overrides writeString(String text, List<TextPosition> textPositions), and the
>> textPositions I received did not line up with the text. The text positions of
>> every “word” in a line always came through as the text positions of the last
>> word in the line. I traced the issue to the PDFTextStripper class.
>>
>> Here is the code in question:
>>
>> /**
>>  * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
>>  * @return The StringBuilder that must be used when calling this method.
>>  */
>> private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
>>         StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
>> {
>>     if (text instanceof WordSeparator)
>>     {
>>         normalized.add(createWord(lineBuilder.toString(), wordPositions));
>>         lineBuilder = new StringBuilder();
>>         wordPositions.clear();
>>     }
>>     else
>>     {
>>         lineBuilder.append(text.getCharacter());
>>         wordPositions.add(text);
>>     }
>>     return lineBuilder;
>> }
>>
>>
>> In the normalizeAdd method, you create a new word, passing it wordPositions. A
>> reference to wordPositions is stored in the new WordWithTextPositions in the
>> normalized linked list, but on the next line you call clear(). Since the word
>> and the caller share the same list object, the positions held by the
>> WordWithTextPositions you just created are cleared as well.
>>
>> So I would suggest the following:
>>
>> /**
>>  * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
>>  * @return The StringBuilder that must be used when calling this method.
>>  */
>> private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
>>         StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
>> {
>>     if (text instanceof WordSeparator)
>>     {
>>         normalized.add(createWord(lineBuilder.toString(), wordPositions));
>>         lineBuilder = new StringBuilder();
>>         wordPositions = new ArrayList<TextPosition>();
>>     }
>>     else
>>     {
>>         lineBuilder.append(text.getCharacter());
>>         wordPositions.add(text);
>>     }
>>     return lineBuilder;
>> }
>>
>>
>> This will fix the issue. I would be more than happy to add this, but as I
>> mentioned, I am not really experienced in contributing to open source
>> projects.
>
> Sounds reasonable!
>
>> Thanks!
>> Andy Phillips
>
> BR
> Andreas Lehmkühler
>
>
> [1] https://issues.apache.org/jira/browse/PDFBOX