Thank you, Tilman! I have decided to use Apache Tika. It uses a SAX handler to produce XHTML, and I am writing my own SAX handler for my specific XML format. The latest version of Tika uses the latest PDFBox version, and I found a loadNonSeq method call inside the Tika parser library, so I think it is a good idea to rely on that robust code instead of my own code quoted below. Bye
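
A rough sketch of the kind of handler I mean, in case it helps someone else. It assumes Tika's PDF parser wraps each page in a <div class="page"> element in its XHTML output; the class name PageTextHandler and the helper extractPages are made up for illustration, and the helper feeds the handler from the JDK's own SAX parser so the sketch runs without Tika on the classpath.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of each <div class="page"> separately.
class PageTextHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private int depthInPage = 0; // 0 means we are outside any page div

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the parser is not namespace-aware
        String name = localName.isEmpty() ? qName : localName;
        if (depthInPage > 0) {
            depthInPage++; // nested element inside the current page
        } else if ("div".equals(name) && "page".equals(atts.getValue("class"))) {
            depthInPage = 1; // a new page starts here (assumed Tika convention)
            current.setLength(0);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // When the page div itself closes, store the collected text
        if (depthInPage > 0 && --depthInPage == 0) {
            pages.add(current.toString().trim());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depthInPage > 0) {
            current.append(ch, start, length);
        }
    }

    List<String> getPages() {
        return pages;
    }

    // Illustration only: parses an XHTML string with the JDK SAX parser.
    static List<String> extractPages(String xhtml) {
        try {
            PageTextHandler handler = new PageTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getPages();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XHTML", e);
        }
    }
}
```

With Tika, the same handler would presumably be passed to AutoDetectParser.parse(stream, handler, metadata, context) instead of the JDK parser, and each collected page string could then be filtered and written out in the custom XML format.
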
2015-10-09 19:40 GMT+02:00 Tilman Hausherr <[email protected]>:

> On 09.10.2015 at 10:34, robyp7 wrote:
>
>> Hi,
>>
>> I have some questions about parsing PDFs and how to do it:
>>
>> 1) What is the purpose of using the PDDocument.loadNonSeq method
>> that takes a scratch/temporary file?
>
> It saves memory.
>
>> 2) I have a big PDF and I need to parse it and get the text contents.
>> I use PDDocument.load() and then PDFTextStripper to extract the data
>> page by page (PDFTextStripper has setStartPage(n) and setEndPage(n),
>> where n = n + 1 on every loop iteration). Is it more memory-efficient
>> to use loadNonSeq instead of load?
>
> I don't know, but loadNonSeq is the correct parser. load() is an
> outdated parsing method, so you might get wrong results with load()
> in some rare cases. In the upcoming 2.0 version, the old parser will
> be removed anyway.
>
>> For example:
>>
>> File pdfFile = new File("mypdf.pdf");
>> File tmp_file = new File("result.tmp");
>> PDDocument doc = PDDocument.loadNonSeq(pdfFile,
>>         new RandomAccessFile(tmp_file, READ_WRITE));
>> int numpages = doc.getNumberOfPages();
>> for (int index = 1; index <= numpages; index++) {
>>     PDFTextStripper stripper = new PDFTextStripper();
>>     Writer destination = new StringWriter();
>>     String xml = "";
>>     // restrict extraction to the current page only
>>     stripper.setStartPage(index);
>>     stripper.setEndPage(index);
>>     stripper.writeText(doc, destination);
>>     .... // filter the text and then convert it to XML
>> }
>>
>> Is the code above a correct use of loadNonSeq, and is it good practice
>> to read the PDF page by page to avoid wasting memory?
>> I read page by page because I need to write the text to XML using an
>> in-memory DOM (with the stripping technique, I decided to produce one
>> XML document for every page).
>
> If your results need to be separated by page, then your code is OK.
>
> Tilman

