how to use PDDocument.loadNonSeq, large pdf stripper/parsing text technique

robyp7 . Fri, 09 Oct 2015 01:34:39 -0700

Hi,

I have some questions about parsing PDFs:


1) What is the purpose of the PDDocument.loadNonSeq method, which takes a
scratch/temporary file?


2) I have a large PDF and I need to parse it to get its text content. I use
PDDocument.load() and then PDFTextStripper to extract the text page by page
(PDFTextStripper has setStartPage(n) and setEndPage(n), and I increment n
by one on every loop iteration). Is it more memory-efficient to use
loadNonSeq instead of load?

For example

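File pdfFile = new File("mypdf.pdf");
File tmpFile = new File("result.tmp");
final String READ_WRITE = "rw"; // file mode for the scratch file
// RandomAccessFile here is org.apache.pdfbox.io.RandomAccessFile,
// not java.io.RandomAccessFile
PDDocument doc = PDDocument.loadNonSeq(pdfFile,
        new RandomAccessFile(tmpFile, READ_WRITE));
try {
    int numPages = doc.getNumberOfPages();
    for (int index = 1; index <= numPages; index++) {
        PDFTextStripper stripper = new PDFTextStripper();
        Writer destination = new StringWriter();
        String xml = "";
        // Restrict extraction to the current page (page numbers are 1-based)
        stripper.setStartPage(index);
        stripper.setEndPage(index);
        stripper.writeText(doc, destination);
        .... // filter the text and then convert it to XML
    }
} finally {
    doc.close(); // release the document when done
}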

Is the code above a correct use of loadNonSeq, and is it good practice to
read the PDF page by page to avoid wasting memory?
I read page by page because I need to write the text to XML using an
in-memory DOM (with this stripping technique, I decided to produce one XML
document per page).
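
To make the per-page XML step concrete, here is a minimal sketch of what I
have in mind. It uses only the JDK's DOM and transformer classes. The class
name PageXmlSketch, the method pageToXml, and the <page number="..."> layout
are hypothetical names chosen for illustration; none of them come from PDFBox.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: wraps the text of one page in a small DOM tree
// and serializes it, so only one page's DOM is alive at a time.
final class PageXmlSketch {

    static String pageToXml(int pageNumber, String pageText) {
        try {
            // Build a fresh, empty DOM document for this page only
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();

            // <page number="N">text</page> is an assumed layout
            Element page = dom.createElement("page");
            page.setAttribute("number", String.valueOf(pageNumber));
            page.setTextContent(pageText); // special characters are escaped on output
            dom.appendChild(page);

            // Serialize the DOM to a string without the XML declaration
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException(
                    "Could not build XML for page " + pageNumber, e);
        }
    }

    public static void main(String[] args) {
        // Stand-in text; in the real loop this would be destination.toString()
        System.out.println(pageToXml(1, "Hello <PDF> & text"));
    }
}
```

Inside the loop, destination.toString() would be passed as pageText after
filtering, and the returned string written out before moving to the next page.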

Thank you very much

Roby
