Thanks, Patrick, for the response. I was able to extract the text from the PDF.
Apart from extracting the text from the PDF, is it possible to extract the font information and the location or position of the text in the PDF layout using PDFBox? Any pointers are appreciated.

Thanks.

Regards,
Nitin

-----Original Message-----
From: Patrick Herber [mailto:[email protected]]
Sent: Thursday, November 19, 2009 5:00 PM
To: [email protected]
Subject: Re: Text extraction - Any tutorials?

Hello,

you should also add the commons-logging-1.1.1.jar file to your classpath.

To extract text from a PDF file (given as an InputStream), I'm using the following method (perhaps it is not the best one):

    // Assumes a commons-logging Log field named "log" in the enclosing class,
    // plus imports for java.io.*, org.apache.pdfbox.pdmodel.PDDocument
    // and org.apache.pdfbox.util.PDFTextStripper.
    private String parsePdfFile(InputStream stream) throws Exception {
        StringWriter output = new StringWriter(4096);
        PDDocument document = null;
        try {
            document = PDDocument.load(stream);
            if (document.isEncrypted()) {
                try {
                    // Try the empty password before giving up
                    document.decrypt("");
                } catch (Throwable e) {
                    log.warn("Could not parse PDF File since the document is encrypted");
                    return "";
                }
            }
            PDFTextStripper stripper = new PDFTextStripper();
            // Extract every page
            stripper.setStartPage(1);
            stripper.setEndPage(Integer.MAX_VALUE);
            stripper.writeText(document, output);
            return output.toString();
        } catch (EOFException eofe) {
            log.warn("EOF Exception parsing PDF Document");
            return "";
        } catch (Exception e) {
            log.info("Exception parsing PDF document", e);
            return "";
        } finally {
            if (document != null) {
                try {
                    document.close();
                } catch (Exception e) {
                    /* ignore */
                }
            }
        }
    }

Regards,
Patrick

Nitin Shukla wrote:
> Hello,
>
> I am looking to extract text, text location, font, and similar details from
> a PDF file, and am looking for PDF libraries that can help me do this. I came
> across PDFBox today and wanted to evaluate it.
>
> I am looking for a quick tutorial that can help me get started with using
> the PDFBox library to extract text from a PDF along with its font information,
> text location, etc. Can anyone point me to a tutorial that shows how to
> make use of the PDFBox APIs to extract text etc.?
>
> I tried running the command-line utility that is bundled with the PDFBox
> jar to extract text as follows.
>
> $ java -cp log4j-1.2.15.jar;pdfbox-0.8.0-incubating.jar org/apache/pdfbox/ExtractText "D:\Test Lab\Murex Sample Reports\INVOICE00009.pdf" INVOICE00009.txt
>
> But executing the above command threw the following error.
>
> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
>
> I don't see org/apache/commons/logging/LogFactory in
> pdfbox-0.8.0-incubating.jar or in log4j-1.2.15.jar. Can someone help
> point out what I am doing wrong? Am I missing something?
>
> Thanks and regards,
> Nitin
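
A side note on the NoClassDefFoundError quoted above: a quick way to confirm whether a jar is really missing from the classpath is to try loading the class by name before running the tool. The following is only a minimal sketch using the standard Class.forName call; the ClasspathCheck class and its isOnClasspath method are hypothetical helpers made up for illustration, not part of PDFBox or commons-logging. The two class names it checks are the ones mentioned in the thread.

```java
// Hypothetical helper (not part of PDFBox): reports whether a class can be
// located by the class loader that loaded this helper.
class ClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not found, or found but not loadable (e.g. one of its own dependencies is missing)
            return false;
        }
    }

    public static void main(String[] args) {
        // Class names taken from the thread: the missing logging class and the PDFBox tool
        String[] required = {
            "org.apache.commons.logging.LogFactory",
            "org.apache.pdfbox.ExtractText"
        };
        for (String name : required) {
            System.out.println(name + (isOnClasspath(name) ? " found" : " MISSING"));
        }
    }
}
```

Running this with the same -cp value as the failing command should report which of the two classes cannot be found. With only the log4j and pdfbox jars listed, the LogFactory class would be expected to show up as missing until commons-logging-1.1.1.jar is added, as Patrick suggested.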

