Re: extracting text from image using pdfbox

Jeremias Maerki Sun, 14 Oct 2012 01:09:35 -0700

Hi,
Apache PDFBox can't help you here, I'm afraid. What you're after is OCR
functionality (http://en.wikipedia.org/wiki/Optical_character_recognition)
and PDFBox doesn't provide that. The only thing you can do is to extract
the bitmap images using PDFBox and then attempt to decipher the text
contained in them using an external OCR process. Just a warning: don't
expect an OCR process to be 100% accurate.


If you're looking for an open source OCR engine, Tesseract is probably
the most popular one: http://en.wikipedia.org/wiki/Tesseract_%28software%29
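
For illustration, here is a rough sketch of the "external OCR process" step,
assuming the page image has already been written to disk and that the
Tesseract command-line tool is installed and on the PATH. The class and
method names (TesseractOcr, buildCommand, recognize) are made up for this
example and are not part of PDFBox or Tesseract; only the command line form
"tesseract <image> <outputbase> -l <lang>" comes from Tesseract itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: runs the Tesseract CLI on an image file and returns the text.
class TesseractOcr {

    // Builds the command line: tesseract <image> <outputBase> -l <language>
    // Tesseract appends ".txt" to outputBase when writing its result.
    static List<String> buildCommand(Path image, Path outputBase, String language) {
        return Arrays.asList(
                "tesseract", image.toString(), outputBase.toString(), "-l", language);
    }

    // Assumption: the "tesseract" executable is installed and on the PATH.
    static String recognize(Path image, String language)
            throws IOException, InterruptedException {
        Path outputBase = Files.createTempFile("ocr", "");
        Path textFile = Paths.get(outputBase.toString() + ".txt");
        try {
            // inheritIO() forwards Tesseract's console messages so the child
            // process cannot block on a full output pipe.
            Process process = new ProcessBuilder(buildCommand(image, outputBase, language))
                    .inheritIO()
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            // Clean up the temporary files whether or not OCR succeeded.
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: java TesseractOcr <image-file>");
            return;
        }
        System.out.println(recognize(Paths.get(args[0]), "eng"));
    }
}
```

The recognized text will usually need a manual check or some cleanup, for
the accuracy reason mentioned above.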

HTH
Jeremias Maerki


On 12.10.2012 15:47:40 Kishore Babu wrote:
> Hi All,
> Is it possible to extract text from an image (JPEG) using pdfbox or is there 
> any open source java code for this?
> 
> When I try to convert the PDF to text, the output is blank. I then 
> converted it into a JPEG image. The image shows the text properly, but I am 
> failing to extract it.
> 
> For normal pdf documents I am extracting text nicely using the standard 
> process but when the pdf document is an image, I am failing to extract the 
> text that is present in the image.
> 
> Can anyone give directions on this, please?
> 
> Thanks in advance.
> 
> Regards,
> Kishore Babu I Developer
