Maybe a stupid question, but can't you use imagemagick to prepare the images for tesseract?
This is how I do it on my computer:

convert -monitor -black-threshold 75% "$i" "$i.png";  # Create black/white, threshold adjustable
convert -monitor -colors 2 -depth 8 -blur 0 "$i.png" "$i.tif";  # Create TIFF image, two colors, 8bpp, remove noise (blur)
rm "$i.png";  # clean up the mess
tesseract "$i.tif" "$i.ocrin" -l "$TESSLANGUAGE";  # OCR the file

OK, it's not pretty and definitely not fast, but it works for nearly any kind of image. PDFs should be converted with the "-density 300" option, though.

My 2c,

best
Arno

On Thu, 2009-07-09 at 19:10 -0700, Ray Smith wrote:
> This is a plea for help!
>
> Anyone interested in seeing 3.00 this side of August?
>
> Here is the status:
>
> Linux:
> Preliminary alpha release compiles and runs. It is slower than 2.04,
> due to the new page layout analysis, but the benefits are supposed to
> outweigh that:
> Page layout analysis.
> *Lots* of languages.
> more...
> In theory the linux version should compile and link happily with
> leptonica, given the right combination of apt-gets. Not tested yet, as
> I have been bogged down with windows:
>
> Windows:
> Preliminary alpha release also compiles and runs *without leptonica
> only*.
> DLL is broken due to API change.
>
> I only have very little time left before I will be away for a while,
> but I was hoping to post a pre-alpha version to svn for people to try.
>
> The problem is that there is no chance of getting the windows version
> to work with leptonica any time soon, and without it the flagship page
> layout analysis won't work properly.
>
> Here is the problem:
> Leptonica depends on the following lower-level libraries:
> libjpeg,
> libpng,
> libtiff,
> zlib.
>
> DLLs for these are all available for windows, but they are all
> compiled to use msvcrt.dll.
> Tesseract and Leptonica will not work unless they use the same crt
> (C-runtime) as the libraries, and VC++2008, which everyone wants to
> use, will not (without jumping through more hoops than I can ask an
> average tesseract user to do) build anything to use msvcrt.dll. You
> must use either a statically linked crt, or use msvcr90.dll, a newer
> version that contains .net stuff that tesseract doesn't care about.
>
> What I need are statically linked versions of the 4 libraries above
> compiled to use a statically linked crt (/MT option) and possibly
> their dependencies.
> Alternatively, libraries built for the new msvcr90.dll (/MD) would do,
> but that would mean everybody has to have the VC++2008 redistributables.
> It might help dll users though, when it is eventually working again.
>
> This is not an easy task, as most of the sources for these libraries
> don't have vcproj/sln projects with which to build them.
> If anyone is sufficiently expert with VC++2008 and building other
> people's code, and understands what I am talking about, I would be
> grateful for the help.
> The other viable alternative would be to compile leptonica without
> image i/o at all, and leave tesseract still unable to read anything
> other than compressed tiff.
>
> Ray.
>
> PS A good place to get all these libraries is:
> http://gnuwin32.sourceforge.net/packages/*.htm, where * is tiff,
> jpeg, libpng, or zlib.
>
> On Tue, May 12, 2009 at 5:49 AM, javolo <[email protected]> wrote:
> > Ditto! I'm working on a pretty cool OCR application, and I'd happily
> > help testing for access to the 3.0 beta or release candidate.
> > I can test on Ubuntu and Windows XP.
> >
> > Thanks...
> >
> > On May 4, 3:07 pm, "Rob H." <[email protected]> wrote:
> > > But seriously... I'm writing a fairly interesting application using
> > > Tesseract for my client: Gulfstream Aerospace.
> > > I have no problem testing 3.0, especially if I can get some
> > > performance gains.

