Re: extracting text from image using pdfbox

2012-10-14 Thread Peter Murray-Rust
On Mon, Oct 15, 2012 at 6:17 AM, Kishore Babu wrote: "Hi Peter, Thank you very much for the reply. Unfortunately, the images I am dealing with are scanned ones. I will update my result if I succeed in using the mentioned line detection algorithms." There is an excellent explanation of

RE: extracting text from image using pdfbox

2012-10-14 Thread Kishore Babu
Thanks Jeremias, I will try it. Regards, Kishore Babu | Developer email: kb...@envistacorp.com office: 040.66417681 www.envistacorp.com -----Original Message----- From: Jeremias Maerki [mailto:d...@jeremias-maerki.ch] Sent: Sunday, 14 Octo

RE: extracting text from image using pdfbox

2012-10-14 Thread Kishore Babu
Hi Peter, Thank you very much for the reply. Unfortunately, the images I am dealing with are scanned ones. I will update my result if I succeed in using the mentioned line detection algorithms. Thanks & Regards, Kishore Babu | Developer email: kb...@envistacorp.com office: 040.66417681 w

Re: Help with removing images from a PDF

2012-10-14 Thread Nicholas Tiong
Hi Andreas, I've commented out the 'Do' line, but I still cannot get rid of the images. I've basically opened the document, loaded the resources, and then saved the document. See code below. This seems to be insufficient. Do I need to parse the PDF stream somehow? Regards, Nicholas Tiong impor

Re: Help with removing images from a PDF

2012-10-14 Thread Andreas Lehmkuehler
Hi, On 14.10.2012 at 22:47, Nicholas Tiong wrote: Hi Andreas, Thanks for your help, but I am not sure where to find this 'Do' line in pagedrawer.properties. I see that there is a package in the pdfbox jar file that is called org.apache.pdfbox.util.operator.pagedrawer, but I'm not sure where the

Re: Help with removing images from a PDF

2012-10-14 Thread Nicholas Tiong
Hi Andreas, Thanks for your help, but I am not sure where to find this 'Do' line in pagedrawer.properties. I see that there is a package in the pdfbox jar file that is called org.apache.pdfbox.util.operator.pagedrawer, but I'm not sure where the 'Do' line is. I'm guessing it's somewhere within the

Re: question on TextToPDF Command Line Utility

2012-10-14 Thread Andreas Lehmkuehler
Hi, On 20.07.2012 at 21:30, Robert Nelson wrote: Hi, The following command creates a portrait-oriented output: $ java -jar pdfbox-app-1.7.0.jar TextToPDF -standardFont Courier file-out.pdf file-in.txt How does one request the output to be generated in landscape orientation? TextToPDF is more a

Re: Help with removing images from a PDF

2012-10-14 Thread Andreas Lehmkuehler
Hi, On 04.10.2012 at 02:58, Nicholas Tiong wrote: Hi, I'm new here and I've just discovered PDFBox. My experience with coding is fairly basic. Based on sample code I found here, http://stackoverflow.com/questions/6831194/how-can-i-remove-all-images-drawings-from-a-pdf-file-and-leave-text-onl

Re: Flipped images

2012-10-14 Thread Andreas Lehmkuehler
Hi, On 08.10.2012 at 16:00, Finzen, Jan wrote: Hi, we use PDFBox to extract images from PDF files. Unfortunately, images sometimes come out mirrored, and we do not understand when and why. Which version are you using? Try the current trunk, as I just checked in some fixes concerning the rot

Re: problem,help

2012-10-14 Thread Andreas Lehmkuehler
Hi, On 13.10.2012 at 10:08, wangxinjing1988 wrote: Hello, my friends. I have a problem using PDFBox. When I use PDFBox to read PDF files in a Java environment, most files can be read, but a few cannot, for example notes.pdf. The details of the exception are as follows: java.io.IOExcep

Re: extracting text from image using pdfbox

2012-10-14 Thread Jeremias Maerki
Hi, Apache PDFBox can't help you here, I'm afraid. What you're after is OCR functionality (http://en.wikipedia.org/wiki/Optical_character_recognition) and PDFBox doesn't provide that. The only thing you can do is to extract the bitmap images using PDFBox and then attempt to decipher the text contai