Hello,
I’m trying to parse a PDF file that I didn’t create, using PDFBox v1.8.9.
My problem is that when I call getText(doc) to extract a certain section of
the PDF, using setStartBookmark(item) and setEndBookmark(item), I get all the
text rather than just the text from the specified section.
While trying to resolve this I realized that the writeText(doc, outputStream)
method always calls the resetEngine() method, which resets all the parameters
and deletes the bookmarks I set.
So my first question is what is the correct way to get the text from a
specified section of the pdf?
While continuing to investigate, I created a new class that extends
PDFTextStripper, with changed versions of the getText() and writeText()
methods (under new names) that don’t call resetEngine() but keep the rest of
the functionality. (I also had to delete the if (getAddMoreFormatting())
section, as the parameters it uses are private. Is that a problem?)
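
To show what I mean by the reset and by my workaround, here is a toy model in
plain Java. It is not PDFBox code: ToyStripper, its section indexes and all of
its methods are hypothetical stand-ins I made up for this mail, and it only
mimics the behaviour as I understand it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only, NOT PDFBox code. Every name here is hypothetical.
class ToyStripper {
    private final List<String> sections; // section texts in document order
    private Integer startSection;        // stands in for the start bookmark
    private Integer endSection;          // stands in for the end bookmark

    ToyStripper(List<String> sections) {
        this.sections = new ArrayList<>(sections);
    }

    void setStartSection(int index) { startSection = index; }

    void setEndSection(int index) { endSection = index; }

    // Clears the configured range, as I believe resetEngine() does.
    void resetEngine() {
        startSection = null;
        endSection = null;
    }

    // Models what I observe: the reset runs before the range is read,
    // so the whole document comes back.
    String getText() {
        resetEngine();
        return extract();
    }

    // Models my workaround: same extraction, without the reset.
    String getTextKeepingRange() {
        return extract();
    }

    // Concatenates the sections in the configured range (inclusive),
    // falling back to the whole document when no range is set.
    private String extract() {
        int from = (startSection == null) ? 0 : startSection;
        int to = (endSection == null) ? sections.size() - 1 : endSection;
        StringBuilder text = new StringBuilder();
        for (int i = from; i <= to; i++) {
            text.append(sections.get(i));
        }
        return text.toString();
    }
}
```

In this model, setting both indexes to the same section and calling
getTextKeepingRange() returns only that section, while getText() returns
everything, which matches what I see with the real class.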
Now when I call the method I created I hit a second problem: while the
processPages method is trying to determine startBookmarkPageNumber, the
getPageNumber method returns -1.
When I dug deeper I saw that in the findDestinationPage method, rawDest is of
type PDNamedDestination.
The problem is that namesDict = doc.getDocumentCatalog().getNames() returns
null, which means that the names dictionary doesn’t exist. What can be done?
I should point out that in Acrobat all the bookmarks work.
Noam