Thanks, Tilman, for all your great and fast work. Unfortunately I can't share the PDF publicly because it's copyrighted. My code for extracting the text is (simplified):

    public static void main(String[] args) throws IOException {
        PDDocument doc = null;
        boolean hasOutputPath = false;
        if (args.length != 1 && args.length != 2) {
            usage();
            System.exit(0);
        }
        if (args.length == 2) {
            hasOutputPath = true;
        }
        try {
            doc = PDDocument.load(args[0]);
            if (doc.isEncrypted()) {
                StandardDecryptionMaterial sdm = new StandardDecryptionMaterial("");
                doc.openProtection(sdm);
            }
        } catch (IOException e) {
            System.err.println("Error loading PDF file");
            e.printStackTrace();
            System.exit(0);
        } catch (BadSecurityHandlerException e) {
            e.printStackTrace();
            System.exit(0);
        } catch (CryptographyException e) {
            e.printStackTrace();
            System.exit(0);
        }
        // A class of mine to parse the text received
        TextParser parser = new TextParser(hasOutputPath ? args[1] : args[0]);
        PDDocumentOutline outlineRoot = doc.getDocumentCatalog().getDocumentOutline();
        PDOutlineItem parentItem = outlineRoot.getFirstChild();
        String parentTitleName;
        String currentChildTitleName;
        String nextChildTitleName;
        PDFTextStripperExt stripper = new PDFTextStripperExt();
        boolean childrenWereParsed = false;
        while (parentItem != null) {
            parentTitleName = parentItem.getTitle();
            if (Pattern.matches(".*Commands", parentTitleName)) {
                PDOutlineItem item = parentItem.getFirstChild();
                while (item != null) {
                    currentChildTitleName = item.getTitle();
                    stripper.setStartBookmark(item);
                    if ((item = item.getNextSibling()) == null) {
                        // need to check null on next parent item,
                        // but in this pdf case it won't happen
                        nextChildTitleName = (parentItem = parentItem.getNextSibling()).getTitle();
                        stripper.setEndBookmark(parentItem);
                    } else {
                        nextChildTitleName = item.getTitle();
                        stripper.setEndBookmark(item);
                    }
                    parser.parseText(stripper.getTextBySpecification(doc),
                            currentChildTitleName, nextChildTitleName);
                    docCount++;
                }
                childrenWereParsed = true;
            }
            if (!childrenWereParsed) {
                parentItem = parentItem.getNextSibling();
            }
        }
    }

(There might be some syntax errors since I simplified the code, but this is the main concept.)

The code which I was
talking about, where namesDict = doc.getDocumentCatalog().getNames() returns null, is part of the PDFBox code: it is in the findDestinationPage method of the PDOutlineItem class, in the if (rawDest instanceof PDNamedDestination) branch. It seems that there is an anomaly in this specific PDF. I'll try to load the PDF with loadNonSeq(file, null) and see what the difference is.

Noam

On Sun, May 10, 2015 at 5:37 PM, Tilman Hausherr <thaush...@t-online.de> wrote:

> On 08.05.2015 at 17:17, noamsil...@gmail.com wrote:
>
>> I’m trying to parse a pdf file that I haven’t created, I’m using pdfBox
>> v1.8.9.
>>
>> My problem is that when trying to getText(doc) from a certain section of
>> the pdf using setStartBookmark(item) and setEndBookmark(item) I get all the
>> text rather than just the text from the specified section.
>>
>> While trying to resolve this I realized that the writeText(doc,
>> outputStream) method always calls resetEngine() method. That will reset all
>> the parameters and delete the bookmarks I set.
>>
>> So my first question is what is the correct way to get the text from a
>> specified section of the pdf?
>>
>
> I've now hopefully fixed that problem in
> https://issues.apache.org/jira/browse/PDFBOX-2792
> a snapshot version will soon be available here:
>
> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/1.8.10-SNAPSHOT/
>
>> When I continued to try and resolve this I created a new class that
>> extends PDFTextStripper and I changed the getText() and writeText() methods
>> (also changing their names) so that it won’t call the resetEngine() method
>> while keeping the rest of the functionality (I also had to delete the if
>> (getAddMoreFormatting()) section as the parameters are private, is that a
>> problem?).
>>
>> Now when I call the method I created I have a second problem, while it
>> tries to determine the startBookmarkPageNumber in processPages method
>> getPageNumber method returns -1.
>>
>> When I dug deeper I saw that in findDestinationPage method the rawDest is
>> of type PDNamedDestination.
>>
>> The problem is that when trying to get namesDict =
>> doc.getDocumentCatalog().getNames() it returns null. That means that the
>> names dictionary doesn’t exist. What can be done?
>>
>> Just need to point out that in Acrobat the bookmarks all work.
>>
>
> I tested this on a document with names, and I didn't have that effect with
> 1.8.9, so whatever the problem is, it isn't a general problem, so I need
> the file.
>
> One thing to try is to load the document with loadNonSeq(file,null)
> instead of load().
>
> Tilman
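
For reference, the null check that the comment in the extraction loop says is skipped could be sketched roughly as below. This is only an illustration under assumptions: Node is a hypothetical stand-in for PDOutlineItem (just a title, a first child and a next sibling) so that the sketch runs without PDFBox, and the names BookmarkRanges and sectionRanges are made up. For each child bookmark it picks the end bookmark as the next sibling, or the parent's next sibling for the last child, or null when nothing follows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class BookmarkRanges {

    /** Hypothetical stand-in for PDOutlineItem; not a PDFBox class. */
    static final class Node {
        final String title;
        Node firstChild;
        Node nextSibling;

        Node(String title) {
            this.title = title;
        }
    }

    /**
     * Walks the top-level bookmarks and, for every child of a bookmark whose
     * title matches ".*Commands", returns a {startTitle, endTitle} pair.
     * endTitle is null when there is no following bookmark at all.
     */
    static List<String[]> sectionRanges(Node firstTopLevel) {
        List<String[]> ranges = new ArrayList<>();
        // The for loop always advances the parent, so the walk terminates.
        for (Node parent = firstTopLevel; parent != null; parent = parent.nextSibling) {
            if (!Pattern.matches(".*Commands", parent.title)) {
                continue;
            }
            for (Node item = parent.firstChild; item != null; item = item.nextSibling) {
                // Last child: fall back to the parent's next sibling, which may be null.
                Node end = (item.nextSibling != null) ? item.nextSibling : parent.nextSibling;
                ranges.add(new String[] { item.title, end == null ? null : end.title });
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Small made-up outline: Introduction, Basic Commands (open, close),
        // Advanced Commands (merge).
        Node intro = new Node("Introduction");
        Node basic = new Node("Basic Commands");
        Node advanced = new Node("Advanced Commands");
        intro.nextSibling = basic;
        basic.nextSibling = advanced;
        Node open = new Node("open");
        Node close = new Node("close");
        open.nextSibling = close;
        basic.firstChild = open;
        advanced.firstChild = new Node("merge");

        for (String[] range : sectionRanges(intro)) {
            System.out.println(range[0] + " -> " + range[1]);
        }
    }
}
```

In the real code a null end would presumably mean leaving the end bookmark unset for the last section; that part is an assumption and not something checked against the problem PDF.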