Hi,

> Andrew Phillips <[email protected]> wrote on 9 December 2013 at 15:02:
>
>
> Thanks. I found that I had made a mistake in my suggested fix (I wrote the
> email before trying it). The correct fix is as follows:
>
>     /**
>      * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
>      * @return The StringBuilder that must be used when calling this method.
>      */
>     private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
>             StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
>     {
>         if (text instanceof WordSeparator)
>         {
>             normalized.add(createWord(lineBuilder.toString(),
>                     new ArrayList<TextPosition>(wordPositions)));
>             lineBuilder = new StringBuilder();
>             wordPositions.clear();
>         }
>         else
>         {
>             lineBuilder.append(text.getCharacter());
>             wordPositions.add(text);
>         }
>         return lineBuilder;
>     }
>
>
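A minimal sketch of why the corrected version quoted above behaves differently from the original code and from the first suggestion further down the thread. It is not PDFBox code: the Word class is a hypothetical stand-in for WordWithTextPositions, plain strings stand in for TextPosition objects, and the three method names are made up for illustration. The only assumption carried over from the thread is that the word object keeps the list it is given rather than copying it.

```java
import java.util.ArrayList;
import java.util.List;

public class AliasingDemo
{
    // Hypothetical stand-in for WordWithTextPositions: keeps the list it is given.
    static final class Word
    {
        final String text;
        final List<String> positions;

        Word(String text, List<String> positions)
        {
            this.text = text;
            this.positions = positions; // stores the reference, no copy
        }
    }

    // Original behaviour: the stored list is the same object that gets cleared.
    static Word addThenClear(List<String> wordPositions)
    {
        Word word = new Word("foo", wordPositions);
        wordPositions.clear();
        return word;
    }

    // First suggestion: reassigning the parameter only changes the local variable.
    static Word addThenReassign(List<String> wordPositions)
    {
        Word word = new Word("foo", wordPositions);
        wordPositions = new ArrayList<String>(); // the caller's list is untouched
        return word;
    }

    // Corrected fix: store a copy, then clear the shared working list.
    static Word copyThenClear(List<String> wordPositions)
    {
        Word word = new Word("foo", new ArrayList<String>(wordPositions));
        wordPositions.clear();
        return word;
    }

    // Builds a fresh three-element list to play the role of the caller's list.
    static List<String> sample()
    {
        List<String> list = new ArrayList<String>();
        list.add("f");
        list.add("o");
        list.add("o");
        return list;
    }

    public static void main(String[] args)
    {
        // Original: the word loses its positions.
        System.out.println(addThenClear(sample()).positions.size()); // prints 0

        // First suggestion: the caller's list is never emptied.
        List<String> b = sample();
        addThenReassign(b);
        System.out.println(b.size()); // prints 3

        // Corrected fix: the word keeps its positions, the working list is reset.
        List<String> c = sample();
        Word w = copyThenClear(c);
        System.out.println(w.positions.size() + " " + c.size()); // prints 3 0
    }
}
```

Java passes the list reference by value, so assigning a new list to the parameter is invisible to the caller, while clear() acts on the one list object that both the caller and the stored word share. Copying before clearing gives each word its own list.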
> I’ll be more than happy to create an account and add the fix.   What is the
> link to JIRA for this project? 
Have a look at the footnote at the end of my previous email.

> Thanks!
> Andy P
>
> On Dec 9, 2013, at 5:01 AM, Andreas Lehmkühler <[email protected]> wrote:
>
> > Hi,
> >
> >> Andrew Phillips <[email protected]> wrote on 6 December 2013 at 23:05:
> >>
> >>
> >> While working with the PDFTextStripper class, I found a bug in the code. I’d
> >> love to contribute the fix, but I’m not sure of the best way to do that. I am
> >> an experienced programmer, but have never contributed to open source projects
> >> (yet, although I should, considering how much I take advantage of them).
> >
> > Thanks for your interest in PDFBox and your offer to help. We are using JIRA [1]
> > to handle any changes, such as issues, improvements etc. You have to create a
> > user account (it's free) and create an issue. Choose a reasonable title, add a
> > description and attach a sample PDF if possible. Patches should be created as a
> > diff against the current trunk and attached to the issue as well. That's it.
> >
> >>
> >> While pulling text from a PDF with a custom PDFTextStripper subclass that
> >> overrides writeString(String text, List<TextPosition> textPositions), I was
> >> getting textPositions that did not line up with the text. The text positions
> >> of every “word” in a line always came through as the text positions of the
> >> last word in that line. I traced the issue to the PDFTextStripper class.
> >>
> >> Here is the code at issue:
> >>
> >>      /**
> >>       * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
> >>       * @return The StringBuilder that must be used when calling this method.
> >>       */
> >>      private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
> >>              StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
> >>      {
> >>          if (text instanceof WordSeparator)
> >>          {
> >>              normalized.add(createWord(lineBuilder.toString(), wordPositions));
> >>              lineBuilder = new StringBuilder();
> >>              wordPositions.clear();
> >>          }
> >>          else
> >>          {
> >>              lineBuilder.append(text.getCharacter());
> >>              wordPositions.add(text);
> >>          }
> >>          return lineBuilder;
> >>      }
> >>
> >>
> >> In the normalizeAdd method, a new word is created from wordPositions. A
> >> reference to wordPositions is stored in the new WordWithTextPositions in the
> >> normalized linked list, but the very next statement calls clear() on it. Since
> >> the list was passed by reference, the positions held by the
> >> WordWithTextPositions just created are cleared as well.
> >>
> >> So I would suggest the following:
> >>
> >>      /**
> >>       * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
> >>       * @return The StringBuilder that must be used when calling this method.
> >>       */
> >>      private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
> >>              StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
> >>      {
> >>          if (text instanceof WordSeparator)
> >>          {
> >>              normalized.add(createWord(lineBuilder.toString(), wordPositions));
> >>              lineBuilder = new StringBuilder();
> >>              wordPositions = new ArrayList<TextPosition>();
> >>          }
> >>          else
> >>          {
> >>              lineBuilder.append(text.getCharacter());
> >>              wordPositions.add(text);
> >>          }
> >>          return lineBuilder;
> >>      }
> >>
> >>
> >> This will fix the issue. I would be more than happy to add this, but as I
> >> mentioned, I am not really experienced in contributing to open source
> >> projects.
> >
> > Sounds reasonable!
> >
> >> Thanks!
> >> Andy Phillips
> >
> > BR
> > Andreas Lehmkühler
> >
> >
> > [1] https://issues.apache.org/jira/browse/PDFBOX
>

BR
Andreas Lehmkühler
