On 09.10.2015 at 10:34, robyp7 wrote:
Hi,
I have some questions about parsing PDFs:
1) What is the purpose of using the
PDDocument.loadNonSeq method that takes a scratch/temporary file?
It saves memory.
2) I have a big PDF and I need to parse it and get its text contents. I use
PDDocument.load() and then PDFTextStripper to extract the data page by page
(PDFTextStripper has setStartPage(n) and setEndPage(n),
where n is incremented by one on every loop iteration). Is it more
memory-efficient to use loadNonSeq instead of load?
I don't know, but loadNonSeq is the correct parser. load() is an outdated
parsing method, so you might get wrong results with load() in some rare
cases. In the upcoming 2.0 version, the old parser will be removed anyway.
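For example:
import java.io.File;
import java.io.StringWriter;
import java.io.Writer;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

File pdfFile = new File("mypdf.pdf");
File tmp_file = new File("result.tmp");
// scratch file that PDFBox uses for its temporary data
PDDocument doc = PDDocument.loadNonSeq(pdfFile,
        new RandomAccessFile(tmp_file, "rw"));
try {
    int numpages = doc.getNumberOfPages();
    for (int index = 1; index <= numpages; index++) {
        PDFTextStripper stripper = new PDFTextStripper();
        Writer destination = new StringWriter();
        // restrict extraction to the current page (page numbers are 1-based)
        stripper.setStartPage(index);
        stripper.setEndPage(index);
        stripper.writeText(doc, destination);
        .... // filter the text and then convert it to XML
    }
} finally {
    doc.close();
}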
Is the code above a correct use of loadNonSeq, and is it good practice to read
a PDF page by page to avoid wasting memory?
I read page by page because I need to write the text to XML using an
in-memory DOM (using the stripping technique, I decided to produce one XML
document for every page).
If your results need to be separated by page, then your code is OK.
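For the per-page XML step, here is a minimal sketch using only the JDK's DOM and transformer classes. It assumes the page text is already available as a String (for example from the StringWriter above). The class name PageXml, the method pageToXml, the element name "page" and the attribute "number" are made up for illustration and are not part of PDFBox.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class PageXml {
    // Hypothetical helper: wraps the text of one page in a small DOM tree
    // and serializes that tree to a string. Names are illustrative only.
    public static String pageToXml(int pageNumber, String pageText) throws Exception {
        // A fresh DOM document per page keeps only one page's tree in memory.
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element page = dom.createElement("page");
        page.setAttribute("number", Integer.toString(pageNumber));
        // setTextContent stores raw text; '<' and '&' are escaped on output.
        page.setTextContent(pageText);
        dom.appendChild(page);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(pageToXml(1, "a < b & c"));
    }
}
```

Inside the page loop you could call pageToXml(index, destination.toString()) after filtering. Building a new DOM per page means each tree can be garbage collected once its XML has been written out.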
Tilman