Maybe a stupid question, but can't you use ImageMagick to prepare the
images for tesseract?

This is how I do it on my computer.

convert -monitor -black-threshold 75% "$i" "$i.png";
# Create black/white, threshold adjustable

convert -monitor -colors 2 -depth 8 -blur 0 "$i.png" "$i.tif";
# Create tiff image, two color, 8bpp, remove noise (blur)

rm "$i.png";
# clean up the mess

tesseract "$i.tif" "$i.ocrin" -l "$TESSLANGUAGE";
# ocr the file


OK, it's not pretty and definitely not fast, but it works for nearly any
kind of image. PDFs should be converted with the "-density 300" option,
though.
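
For anyone who wants to run the steps above over several files, here is a rough sketch of the same commands wrapped in a loop. The helper names density_args and ocr_one, and the "eng" default for TESSLANGUAGE, are my own assumptions for illustration. It also assumes single-page input, since a multi-page PDF would make convert write several numbered PNG files instead of one.

```shell
#!/bin/sh
# Sketch only: preprocess each input with ImageMagick, then OCR it.
# The "eng" default is an assumption; set TESSLANGUAGE to override it.
TESSLANGUAGE="${TESSLANGUAGE:-eng}"

# Hypothetical helper: print the extra read option for PDF input,
# and nothing for any other file name.
density_args() {
    case "$1" in
        *.pdf|*.PDF) echo "-density 300" ;;
        *) ;;
    esac
}

# Hypothetical helper: threshold, reduce to two colors, OCR, clean up.
ocr_one() {
    i="$1"
    # $(density_args ...) is unquoted on purpose so it splits into words.
    # -density has to come before the input file to affect how it is read.
    convert -monitor $(density_args "$i") "$i" -black-threshold 75% "$i.png" || return 1
    convert -monitor "$i.png" -colors 2 -depth 8 -blur 0 "$i.tif" || return 1
    rm -f "$i.png"
    tesseract "$i.tif" "$i.ocrin" -l "$TESSLANGUAGE"
}

# Process every file named on the command line; do nothing if there are none.
for i in "$@"; do
    ocr_one "$i" || echo "failed: $i" >&2
done
```

Saved as, say, ocr-all.sh, it would be called like "sh ocr-all.sh scan1.jpg scan2.pdf". The intermediate .tif files are left in place, same as in the commands above.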

My 2 cents.

best
Arno

On Thu, 2009-07-09 at 19:10 -0700, Ray Smith wrote:
> This is a plea for help!
> 
> 
> Anyone interested in seeing 3.00 this side of August?
> 
> 
> Here is the status:
> 
> 
> Linux:
> Preliminary alpha release compiles and runs. It is slower than 2.04,
> due to the new page layout analysis, but the benefits are supposed to
> outweigh that:
> Page layout analysis.
> *Lots* of languages.
> more...
> In theory the linux version should compile and link happily with
> leptonica, given the right combination of apt-gets. Not tested yet, as
> I have been bogged down with windows:
> 
> 
> Windows:
> Preliminary alpha release also compiles and runs *without leptonica
> only*.
> DLL is broken due to API change.
> 
> 
> I only have very little time left before I will be away for a while,
> but I was hoping to post a pre-alpha version to svn for people to try.
> 
> 
> The problem is that there is no chance of getting the windows version
> to work with leptonica any time soon, and without it the flagship page
> layout analysis won't work properly.
> 
> 
> Here is the problem:
> Leptonica depends on the following lower-level libraries:
> libjpeg,
> libpng,
> libtiff,
> zlib.
> 
> 
> DLLs for these are all available for windows, but they are all
> compiled to use msvcrt.dll.
> Tesseract and Leptonica will not work unless they use the same crt
> (C-runtime) as the libraries, and VC++2008, which everyone wants to
> use will not (without jumping through more hoops than I can ask an
> average tesseract user to do) build anything to use msvcrt.dll. You
> must use either a statically linked crt, or use msvcr90.dll, a newer
> version that contains .net stuff that tesseract doesn't care about.
> 
> 
> What I need are statically linked versions of the 4 libraries above
> compiled to use a statically linked crt (/MT option) and possibly
> their dependencies.
> Alternatively, libraries built for the new msvcr90.dll (/MD) would do,
> but that would mean everybody has to have the VC++2008 redistributables.
> It might help dll users though, when it is eventually working again.
> 
> 
> This is not an easy task, as most of the sources for these libraries
> don't have vcproj/sln projects with which to build them.
> If anyone is sufficiently expert with VC++2008 and building other
> people's code, and understands what I am talking about, I would be
> grateful for the help.
> The other viable alternative would be to compile leptonica without
> image i/o at all, and leave tesseract still unable to read anything
> other than compressed tiff.
> 
> 
> Ray.
> 
> 
> PS A good place to get all these libraries is:
> http://gnuwin32.sourceforge.net/packages/*.htm, where * is tiff,
> jpeg, libpng, or zlib.
> 
> On Tue, May 12, 2009 at 5:49 AM, javolo <[email protected]>
> wrote:
>         
>         Ditto!  I'm working on a pretty cool OCR application, and I'd
>         happily help testing for access to the 3.0 beta or release
>         candidate.
>         I can test on Ubuntu and Windows XP.
>         
>         Thanks...
>         
>         
>         On May 4, 3:07 pm, "Rob H." <[email protected]> wrote:
>         > But seriously... I'm writing a fairly interesting application
>         > using Tesseract for my client: Gulfstream Aerospace.
>         > I have no problem testing 3.0, especially if I can get some
>         > performance gains.
