Hi,
I have some questions about parsing PDF files:
1) What is the purpose of the PDDocument.loadNonSeq method that takes a
scratch/temporary file?
2) I have a big PDF and I need to parse it and extract its text content. I
use PDDocument.load() and then PDFTextStripper to extract the data page by
page (PDFTextStripper has setStartPage(n) and setEndPage(n), where n is
incremented on every loop iteration). Is loadNonSeq more memory-efficient
than load?
For example:
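import java.io.File;
import java.io.StringWriter;
import java.io.Writer;
import org.apache.pdfbox.io.RandomAccessFile; // PDFBox scratch-file class, not java.io
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

File pdfFile = new File("mypdf.pdf");
File tmpFile = new File("result.tmp");
// "rw" opens the scratch file for reading and writing
PDDocument doc = PDDocument.loadNonSeq(pdfFile, new RandomAccessFile(tmpFile, "rw"));
try {
    int numPages = doc.getNumberOfPages();
    // PDFTextStripper page numbers are 1-based
    for (int index = 1; index <= numPages; index++) {
        PDFTextStripper stripper = new PDFTextStripper();
        Writer destination = new StringWriter();
        String xml = "";
        stripper.setStartPage(index);
        stripper.setEndPage(index);
        stripper.writeText(doc, destination);
        .... // filter the text and then convert it to XML
    }
} finally {
    doc.close(); // always release the document
}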
Is the code above a correct use of loadNonSeq, and is it good practice to
read a PDF page by page to avoid wasting memory?
I read page by page because I need to write the text to XML using an
in-memory DOM (with this stripping technique, I decided to produce one XML
document for every page).
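
A minimal sketch of that per-page XML step, using only the JDK's DOM and
Transformer classes, could look like the following. The class name
PageXmlSketch, the method pageToXml, and the element names "page" and "line"
are hypothetical names chosen for illustration; the real filtering logic is
not shown, and the page text is assumed to be the string taken from the
StringWriter after writeText.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: turns the text of one page into a small XML string.
class PageXmlSketch {

    // Builds a DOM holding a single <page> element with one <line> child
    // per non-empty line of text, then serializes it to a string.
    public static String pageToXml(int pageNumber, String pageText) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();

        Element page = dom.createElement("page");
        page.setAttribute("number", String.valueOf(pageNumber));
        dom.appendChild(page);

        for (String line : pageText.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            Element lineElement = dom.createElement("line");
            lineElement.setTextContent(trimmed); // special characters are escaped on output
            page.appendChild(lineElement);
        }

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in text; in the loop above this would be destination.toString().
        System.out.println(pageToXml(1, "First line\n\nSecond line & more"));
    }
}
```

Because each page gets its own small DOM that goes out of scope after it is
serialized, only one page's XML tree is held in memory at a time.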
Thank you very much
Roby