Thanks. I found a mistake in my suggested fix: reassigning the wordPositions
parameter only rebinds the local variable, so the caller's list is never reset.
The correct fix is as follows (I wrote the email before trying it):

    /**
     * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
     * @return The StringBuilder that must be used when calling this method.
     */
    private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
            StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
    {
        if (text instanceof WordSeparator)
        {
            normalized.add(createWord(lineBuilder.toString(),
                    new ArrayList<TextPosition>(wordPositions)));
            lineBuilder = new StringBuilder();
            wordPositions.clear();
        }
        else
        {
            lineBuilder.append(text.getCharacter());
            wordPositions.add(text);
        }
        return lineBuilder;
    }
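
For anyone following along, here is a minimal standalone sketch of the three
variants (original code, my first suggestion, and the fix above). It does not
use PDFBox: Word, Mode, and the String "positions" are hypothetical stand-ins
for WordWithTextPositions and TextPosition, and the driver loop is only my
assumption about how a caller reuses the same list between words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AliasingDemo {
    /** Hypothetical stand-in for WordWithTextPositions: keeps the list it is handed. */
    static final class Word {
        final String text;
        final List<String> positions;

        Word(String text, List<String> positions) {
            this.text = text;
            this.positions = positions;
        }
    }

    enum Mode { SHARED, REASSIGN, COPY }

    /** Mirrors the word-separator branch of normalizeAdd in three variants. */
    static void addWord(Mode mode, List<Word> out, String text, List<String> positions) {
        switch (mode) {
            case SHARED:
                // Original code: the word stores the shared list, which is then emptied.
                out.add(new Word(text, positions));
                positions.clear();
                break;
            case REASSIGN:
                // First suggestion: this only rebinds the local parameter;
                // the caller keeps appending to the old list.
                out.add(new Word(text, positions));
                positions = new ArrayList<String>();
                break;
            case COPY:
                // Corrected fix: the word owns a copy, so clearing is safe.
                out.add(new Word(text, new ArrayList<String>(positions)));
                positions.clear();
                break;
        }
    }

    /** Feeds the words "ab" and "cd" through one reused list; returns positions per word. */
    static int[] sizes(Mode mode) {
        List<Word> out = new ArrayList<Word>();
        List<String> positions = new ArrayList<String>();
        for (String word : new String[] {"ab", "cd"}) {
            for (char c : word.toCharArray()) {
                positions.add(String.valueOf(c));
            }
            addWord(mode, out, word, positions);
        }
        int[] result = new int[out.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = out.get(i).positions.size();
        }
        return result;
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " " + Arrays.toString(sizes(mode)));
        }
    }
}
```

In this sketch only the COPY variant leaves each word with exactly its own two
positions; SHARED leaves both words with none, and REASSIGN leaves both words
pointing at the same list of four.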


I’ll be more than happy to create an account and add the fix. What is the
link to JIRA for this project?

Thanks!
Andy P

On Dec 9, 2013, at 5:01 AM, Andreas Lehmkühler <[email protected]> wrote:

> Hi,
> 
>> Andrew Phillips <[email protected]> wrote on December 6, 2013 at 23:05:
>> 
>> 
>> Working with the PDFTextStripper class, I found a bug in the code. I’d love
>> to contribute the fix, but I’m not sure of the best way to do that. I am an
>> experienced programmer, but have never contributed to open source projects
>> (yet, although I should, considering how much I benefit from them).
> 
> Thanks for your interest in PDFBox and your offer to help. We are using
> JIRA [1] to handle any changes, such as issues, improvements etc. You have
> to create a user account (it's free) and create an issue. Choose a
> reasonable title, add a description and attach a sample PDF if possible.
> Patches should be created as a diff against the current trunk and attached
> to the issue as well. That's it.
> 
>> 
>> I was pulling text from a PDF using a custom PDFTextStripper subclass that
>> overrides writeString(String text, List<TextPosition> textPositions), and
>> the textPositions I received did not line up with the text. The text
>> positions of every “word” in a line always came through as the text
>> positions of the last word in the line. I traced the issue to the
>> PDFTextStripper class.
>> 
>> Here is the code in question:
>> 
>>      /**
>>       * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
>>       * @return The StringBuilder that must be used when calling this method.
>>       */
>>      private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
>>              StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
>>      {
>>          if (text instanceof WordSeparator)
>>          {
>>              normalized.add(createWord(lineBuilder.toString(), wordPositions));
>>              lineBuilder = new StringBuilder();
>>              wordPositions.clear();
>>          }
>>          else
>>          {
>>              lineBuilder.append(text.getCharacter());
>>              wordPositions.add(text);
>>          }
>>          return lineBuilder;
>>      }
>> 
>> 
>> In the normalizeAdd method, a new word is created from wordPositions. The
>> new WordWithTextPositions in the normalized linked list stores a reference
>> to that same list, and the next statement calls clear() on it. Because the
>> word holds the same list object rather than a copy, clearing the list also
>> empties the positions of the WordWithTextPositions that was just created.
>> 
>> So I would suggest doing the following:
>> 
>>      /**
>>       * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
>>       * @return The StringBuilder that must be used when calling this method.
>>       */
>>      private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
>>              StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
>>      {
>>          if (text instanceof WordSeparator)
>>          {
>>              normalized.add(createWord(lineBuilder.toString(), wordPositions));
>>              lineBuilder = new StringBuilder();
>>              wordPositions = new ArrayList<TextPosition>();
>>          }
>>          else
>>          {
>>              lineBuilder.append(text.getCharacter());
>>              wordPositions.add(text);
>>          }
>>          return lineBuilder;
>>      }
>> 
>> 
>> This will fix the issue.   I would be more than happy to add this, but as I
>> mentioned, I am not really experienced in contributing to open source
>> projects.
> 
> Sounds reasonable!
> 
>> Thanks!
>> Andy Phillips
> 
> BR
> Andreas Lehmkühler
> 
> 
> [1] https://issues.apache.org/jira/browse/PDFBOX
