Hi,

> Andrew Phillips <[email protected]> wrote on 9 December 2013 at 15:02:
>
>
> Thanks, I found that I had a mistake in my suggested fix. The correct way to
> fix it is as follows (I wrote the email before trying it):
>
>     /**
>      * Used within {@link #normalize(List, boolean, boolean)} to handle a
>      * {@link TextPosition}.
>      * @return The StringBuilder that must be used when calling this method.
>      */
>     private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
>             StringBuilder lineBuilder, List<TextPosition> wordPositions,
>             TextPosition text)
>     {
>         if (text instanceof WordSeparator)
>         {
>             normalized.add(createWord(lineBuilder.toString(),
>                     new ArrayList<TextPosition>(wordPositions)));
>             lineBuilder = new StringBuilder();
>             wordPositions.clear();
>         }
>         else
>         {
>             lineBuilder.append(text.getCharacter());
>             wordPositions.add(text);
>         }
>         return lineBuilder;
>     }
>
>
> I’ll be more than happy to create an account and add the fix. What is the
> link to JIRA for this project?

Have a look at the footnote at the end of my former email ....

> Thanks!
> Andy P
>
> On Dec 9, 2013, at 5:01 AM, Andreas Lehmkühler <[email protected]> wrote:
>
> > Hi,
> >
> >> Andrew Phillips <[email protected]> wrote on 6 December 2013 at 23:05:
> >>
> >>
> >> Working with the PDFTextStripper class, I found a bug in the code. I’d
> >> love to contribute the fix, but I am not sure of the best way to do that.
> >> I am an experienced programmer, but have never contributed to open source
> >> projects (yet, although I should, considering that I take advantage of
> >> them).
> >
> > Thanks for your interest in PDFBox and your offer to help. We are using
> > JIRA [1] to handle any changes, such as issues, improvements etc. You have
> > to create a user (it's free) and create an issue. Choose a reasonable
> > title, add a description and attach a sample PDF if possible. Patches
> > should be created as a diff against the current trunk and attached to the
> > issue as well. That's it.
> >
> >>
> >> So, in a PDF I was pulling text from using a custom PDFTextStripper
> >> subclass that overrides writeString(String text, List<TextPosition>
> >> textPositions), I found that I was getting the wrong textPositions, which
> >> were not lined up with the text. The text positions of all “words” in a
> >> line always come over as the text positions of the last word in the line.
> >> I found the issue in the PDFTextStripper class.
> >>
> >> So here is the code issue:
> >>
> >>     /**
> >>      * Used within {@link #normalize(List, boolean, boolean)} to handle a
> >>      * {@link TextPosition}.
> >>      * @return The StringBuilder that must be used when calling this method.
> >>      */
> >>     private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
> >>             StringBuilder lineBuilder, List<TextPosition> wordPositions,
> >>             TextPosition text)
> >>     {
> >>         if (text instanceof WordSeparator)
> >>         {
> >>             normalized.add(createWord(lineBuilder.toString(), wordPositions));
> >>             lineBuilder = new StringBuilder();
> >>             wordPositions.clear();
> >>         }
> >>         else
> >>         {
> >>             lineBuilder.append(text.getCharacter());
> >>             wordPositions.add(text);
> >>         }
> >>         return lineBuilder;
> >>     }
> >>
> >>
> >> In the normalizeAdd method, you create a new word, passing the
> >> wordPositions. A reference to wordPositions is stored in the new
> >> WordWithTextPositions in the normalized linked list, but shortly
> >> afterwards you call clear() on it. Since wordPositions was passed as a
> >> reference, the list is also cleared in the WordWithTextPositions you just
> >> created.
> >>
> >> So, I would suggest you do the following:
> >>
> >>     /**
> >>      * Used within {@link #normalize(List, boolean, boolean)} to handle a
> >>      * {@link TextPosition}.
> >>      * @return The StringBuilder that must be used when calling this method.
> >>      */
> >>     private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
> >>             StringBuilder lineBuilder, List<TextPosition> wordPositions,
> >>             TextPosition text)
> >>     {
> >>         if (text instanceof WordSeparator)
> >>         {
> >>             normalized.add(createWord(lineBuilder.toString(), wordPositions));
> >>             lineBuilder = new StringBuilder();
> >>             wordPositions = new ArrayList<TextPosition>();
> >>         }
> >>         else
> >>         {
> >>             lineBuilder.append(text.getCharacter());
> >>             wordPositions.add(text);
> >>         }
> >>         return lineBuilder;
> >>     }
> >>
> >>
> >> This will fix the issue. I would be more than happy to add this, but as I
> >> mentioned, I am not really experienced in contributing to open source
> >> projects.
> >
> > Sounds reasonable!
> >
> >> Thanks!
> >> Andy Phillips
> >
> > BR
> > Andreas Lehmkühler
> >
> >
> > [1] https://issues.apache.org/jira/browse/PDFBOX
>

BR
Andreas Lehmkühler
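
Note on the thread: the follow-up message does not spell out why the first suggested fix (assigning a new ArrayList to wordPositions) was a mistake. A likely reason is that Java passes object references by value, so rebinding the parameter inside normalizeAdd does not change the list held by the caller. The sketch below tries to reproduce the three variants in isolation. It is only an illustration under that assumption: AliasingDemo, Word, sample() and the String "positions" are hypothetical stand-ins, not PDFBox classes, and Word is assumed to keep the list reference it is given, as the original report describes for WordWithTextPositions.

```java
import java.util.ArrayList;
import java.util.List;

class AliasingDemo
{
    /** Hypothetical stand-in for WordWithTextPositions: keeps the list reference it is given. */
    static final class Word
    {
        final String text;
        final List<String> positions;

        Word(String text, List<String> positions)
        {
            this.text = text;
            this.positions = positions; // no defensive copy
        }
    }

    /** Reported behaviour: store the shared list, then clear it. */
    static Word storeThenClear(List<String> wordPositions)
    {
        Word word = new Word("foo", wordPositions);
        wordPositions.clear(); // also empties word.positions, it is the same list
        return word;
    }

    /** First suggested fix: rebinding the parameter is invisible to the caller. */
    static Word storeThenReassign(List<String> wordPositions)
    {
        Word word = new Word("foo", wordPositions);
        wordPositions = new ArrayList<String>(); // only the local variable changes
        return word;
    }

    /** Corrected fix: give the word its own copy, then clear the shared list. */
    static Word copyThenClear(List<String> wordPositions)
    {
        Word word = new Word("foo", new ArrayList<String>(wordPositions));
        wordPositions.clear();
        return word;
    }

    /** Three placeholder "positions" for the word "foo". */
    static List<String> sample()
    {
        List<String> positions = new ArrayList<String>();
        positions.add("f");
        positions.add("o");
        positions.add("o");
        return positions;
    }

    public static void main(String[] args)
    {
        List<String> a = sample();
        Word buggy = storeThenClear(a);
        // word=0, shared=0: the stored positions were wiped
        System.out.println("clear:    word=" + buggy.positions.size() + ", shared=" + a.size());

        List<String> b = sample();
        Word reassigned = storeThenReassign(b);
        // word=3, shared=3, same=true: the caller keeps filling the list the word holds
        System.out.println("reassign: word=" + reassigned.positions.size() + ", shared=" + b.size()
                + ", same=" + (reassigned.positions == b));

        List<String> c = sample();
        Word copied = copyThenClear(c);
        // word=3, shared=0: the word keeps its positions, the shared list is reset
        System.out.println("copy:     word=" + copied.positions.size() + ", shared=" + c.size());
    }
}
```

In the second variant the caller's list is never reset, so every later word would keep appending to the one list that all earlier words also hold. Copying the list before clearing it, as in the corrected fix at the top of the thread, avoids both problems.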

