Another way to prepare a PDF document for tesseract is to use the 'convert' 
command from the ImageMagick package to split an image-only PDF file into a 
series of grayscale TIFF images, one per page.  The convert command works on 
just about any image format.  For PDF input, it delegates the rendering to 
Ghostscript, so Ghostscript must be installed.  The same syntax also works 
with multi-page TIFF files and PostScript files.  

convert mydoc.pdf -type GrayScale -depth 8 -scene 1 mydoc-%03d.tif 

You then need to loop over the TIFF files and run OCR on each page image.  
In a day or two, I will update my speedy-ocr bash script so that it handles 
image-only PDF files.  
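
As a rough sketch of that loop (this is not the speedy-ocr script; the 
function name ocr_pages is made up for illustration, and it assumes the 
tesseract command is on the PATH and that the page files were produced by the 
convert command above):

```shell
# Hypothetical helper: OCR every page image named PREFIX-NNN.tif and
# collect the recognized text in PREFIX.txt.
ocr_pages() {
    prefix=$1                      # e.g. "mydoc" for mydoc-001.tif, mydoc-002.tif, ...
    : > "$prefix.txt"              # start with an empty combined output file
    for page in "$prefix"-*.tif; do
        [ -e "$page" ] || continue # nothing matched: skip the literal pattern
        base=${page%.tif}          # mydoc-001.tif -> mydoc-001
        # tesseract adds ".txt" to the output base name itself
        tesseract "$page" "$base" || return 1
        cat "$base.txt" >> "$prefix.txt"
    done
}
```

After the convert step, running "ocr_pages mydoc" should leave one text file 
per page plus a combined mydoc.txt.  The zero-padded page numbers from %03d 
keep the pages in order only up to 999 pages.  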

Don Marang
Vinux Software Coordinator - vinux.org.uk

There is just so much stuff in the world that, to me, is devoid of any real 
substance, value, and content that I just try to make sure that I am working on 
things that matter. 
Dean Kamen 



From: KHEM Sochenda 
Sent: Monday, February 07, 2011 10:23 PM
To: [email protected] 
Subject: Re: VietOCR v2.0/3.1 & VietOCR.NET v2.0 Releases


Dear Quan,

I would like to know how to let tesseract OCR work with pdf documents. 

Thank you very much in advance for your kind response.

With Best Regards,

Sochenda


On Tue, Feb 8, 2011 at 7:56 AM, Quan Nguyen <[email protected]> wrote:

  A Java/.NET GUI frontend for Tesseract OCR engine. The releases
  include the following fixes and improvements:

  * Add support for spellcheck suggestion in context menu
  * Improve program accessibility and usability
  * Add support for downloading and installing language data packs and
  appropriate spell dictionaries
  * Add UI localization for Lithuanian and Slovak
  * Update Tesseract OCR engine to 3.01 (r551) (v3.1 only)

  http://vietocr.sf.net

  --
  You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
  To post to this group, send email to [email protected].
  To unsubscribe from this group, send email to 
[email protected].
  For more options, visit this group at 
http://groups.google.com/group/tesseract-ocr?hl=en.




