Hello,

I’m trying to parse a PDF file that I didn’t create. I’m using PDFBox v1.8.9.

My problem is that when I call getText(doc) to extract a certain section of the 
PDF, using setStartBookmark(item) and setEndBookmark(item), I get all the text 
rather than just the text from the specified section.
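
A minimal sketch of the call sequence described above, assuming PDFBox 1.8.x on the classpath. The class name BookmarkSectionText, the helper findByTitle, and the command-line arguments are hypothetical and only serve to make the example self-contained; looking up the bookmark by title is an assumption, not necessarily how the item is obtained in the real code.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDDocumentOutline;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode;
import org.apache.pdfbox.util.PDFTextStripper;

public class BookmarkSectionText {

    // Hypothetical helper: depth-first search of the outline tree for a
    // bookmark with the given title. Returns null if nothing matches.
    static PDOutlineItem findByTitle(PDOutlineNode parent, String title) {
        if (parent == null) {
            return null;
        }
        for (PDOutlineItem item = parent.getFirstChild();
                item != null;
                item = item.getNextSibling()) {
            if (title.equals(item.getTitle())) {
                return item;
            }
            PDOutlineItem nested = findByTitle(item, title);
            if (nested != null) {
                return nested;
            }
        }
        return null;
    }

    // Usage (hypothetical arguments): <pdf file> <bookmark title>
    public static void main(String[] args) throws IOException {
        PDDocument doc = PDDocument.load(new File(args[0]));
        try {
            PDDocumentOutline outline =
                    doc.getDocumentCatalog().getDocumentOutline();
            PDOutlineItem item = findByTitle(outline, args[1]);
            if (item == null) {
                System.err.println("No bookmark titled: " + args[1]);
                return;
            }
            PDFTextStripper stripper = new PDFTextStripper();
            // Same item for start and end, as in the description above.
            stripper.setStartBookmark(item);
            stripper.setEndBookmark(item);
            // Expected: only the bookmarked section. Observed: all the text.
            System.out.println(stripper.getText(doc));
        } finally {
            doc.close();
        }
    }
}
```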

While trying to resolve this I realized that the writeText(doc, outputStream) 
method always calls resetEngine(), which resets all the parameters and deletes 
the bookmarks I set.

So my first question is: what is the correct way to get the text from a 
specified section of the PDF?

While continuing to investigate, I created a new class that extends 
PDFTextStripper and copied the getText() and writeText() methods under new 
names, so that they don’t call resetEngine() but keep the rest of the 
functionality. I also had to delete the if (getAddMoreFormatting()) section, 
as the parameters it uses are private. Is that a problem?

Now when I call the method I created I have a second problem: while it tries to 
determine startBookmarkPageNumber in the processPages method, the getPageNumber 
method returns -1.

When I dug deeper I saw that in the findDestinationPage method rawDest is of 
type PDNamedDestination.

The problem is that namesDict = doc.getDocumentCatalog().getNames() returns 
null, which means that the names dictionary doesn’t exist. What can be done?

I should point out that in Acrobat all the bookmarks work.


Noam
