Generic comments and questions

CraigLandrum Tue, 20 Jul 2010 12:44:54 -0700

We are using tesseract-ocr (2.04) as the built-in OCR option for our
document management and workflow client software for both Mac and
Windows. The client software supports high-speed scanning from Fujitsu
document scanners (and others) to TIFF and PDF documents as well as
single-image formats.  Images are maintained internally in platform-
specific formats (NSBitmapImageRep on Mac OS X and HBITMAP on
Windows).


In most cases, we use the TessBaseAPI::TesseractRect method to either
recognize specific areas on a page or the entire page.  Both work well
for us and produce decent text, and we look forward to upgrading our
library as new versions are released.

Comments:

- The various "dawg" and tessconfig files ("batch", etc.) appear to
have only a slight effect on the output - probably because we are not
using them correctly. These files are normally installed in specific
Linux folders separate from the application itself.  This is a
non-starter for installable Mac and Windows software, where it would be
better to have them as semi-hidden/protected resources in the app
bundle (Mac) or zipped resources (Windows), which is what we have
done, after modifying the paths in the library code appropriately.
Question:  Are the "eng" config files based on a lot of training of
typical English docs or would it actually be better for us to go
through the hassle of doing the training ourselves?  Note that the
documents we process are from every industry you can think of and are
not limited to something easy, like legal docs.

- We VERY much like the ability to not have to link in a bunch of
imaging libs, since we do all that internally already.  We appreciate
the ability to simply pass a pointer to the start of image memory with
descriptors as to the byte width, pixel width, depth, etc. and get a
pointer to text in return.  This concept is perfect.  We urge you to
continue to support this ability and to maintain the ability to keep
the OCR separate from image I/O.
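
To make the "pointer plus descriptors" idea concrete, here is a minimal
sketch of the kind of description a caller holds for a raw buffer.  The
RawImage struct and the DibStride and PixelAt helpers are hypothetical
names invented for this illustration and are not part of tesseract.  The
stride formula is the usual 32-bit row padding of Windows DIBs; other
platforms may pad rows differently, and the sketch assumes byte-aligned
pixel depths with rows stored top-down.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical descriptor for an image that already lives in memory.
// None of these names come from tesseract; they only illustrate the idea.
struct RawImage {
    const unsigned char* data;  // first byte of the first stored row
    int width;                  // width in pixels
    int height;                 // height in pixels
    int bytes_per_pixel;        // 1 = 8-bit gray, 3 = 24-bit RGB
    int bytes_per_line;         // row stride in bytes, including padding
};

// Row stride of a Windows DIB: each row is padded to a 32-bit boundary.
inline int DibStride(int width, int bits_per_pixel) {
    assert(width > 0 && bits_per_pixel > 0);
    return ((width * bits_per_pixel + 31) / 32) * 4;
}

// Address of pixel (x, y), assuming byte-aligned pixels and top-down rows.
inline const unsigned char* PixelAt(const RawImage& img, int x, int y) {
    assert(x >= 0 && x < img.width && y >= 0 && y < img.height);
    return img.data
        + static_cast<std::size_t>(y) * img.bytes_per_line
        + static_cast<std::size_t>(x) * img.bytes_per_pixel;
}

// With such a descriptor, a region request would look roughly like
//   TessBaseAPI::TesseractRect(img.data, img.bytes_per_pixel,
//                              img.bytes_per_line, left, top, w, h);
// The argument order here is our reading of the 2.04 baseapi.h and should
// be checked against the header in use.
```

The point of keeping the stride separate from the pixel width is that
platform bitmaps rarely have tightly packed rows, so the OCR side needs
both values to walk the buffer correctly.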

- We use the text from the OCR in three ways:  1) as simple
unformatted text that we can display to the user immediately, allowing
them to copy and paste to other apps;  2) as text to be assigned to a
target field in a database (forms scanning and auto indexing);  3) as
text to be placed behind our scanned images in a PDF file.  With
respect to this final capability, we have been unable to find a
method to estimate the point size of the recognized characters/
words/lines/paras and have had to resort to examining the word-rects
returned from the TessBaseAPI::TesseractRectBoxes method.  This has
not worked out well for us, resulting in wildly different font sizes
even between words on the same line - primarily because of ascenders
and descenders, etc.  Because selection of text from a
text-behind-image PDF works much better when the text is specified at
the correct font size, it would be to our benefit to improve this
information.
Question:  Are there different methods we should be using to get font/
font size information about the recognized text?  If not, can you
suggest a technique for estimating the font size that would be better
than looking at the rect info returned from TesseractRectBoxes?
Finally, is there any way to get some font info? We would be happy
simply knowing whether the font is a serif or sans-serif font, after
which we could simply select between something like Arial or Times, etc.
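
For reference, one possible way around the ascender/descender problem
described above is to size text per line rather than per word: take the
highest top and lowest bottom over all word boxes on a line, and convert
that pixel height to points using the scan resolution (1 point = 1/72
inch).  The sketch below is only an illustration of that idea.  The Box
struct and EstimateLinePointSize are hypothetical names, the words are
assumed to be already grouped by line, and treating the
ascender-to-descender height as the point size is an approximation that
will vary by font.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical word box in pixels; y grows downward, so top < bottom.
struct Box {
    int left;
    int top;
    int right;
    int bottom;
};

// Estimate a single point size for one text line from its word boxes.
// The vertical extent of the whole line is used, so a word without
// ascenders or descenders gets the same size as its neighbors.
// Returns 0.0 for an empty line.
double EstimateLinePointSize(const std::vector<Box>& words, double dpi) {
    assert(dpi > 0.0);
    if (words.empty()) return 0.0;
    int top = words[0].top;
    int bottom = words[0].bottom;
    for (std::size_t i = 1; i < words.size(); ++i) {
        top = std::min(top, words[i].top);
        bottom = std::max(bottom, words[i].bottom);
    }
    // Pixels to points: 72 points per inch, dpi pixels per inch.
    return (bottom - top) * 72.0 / dpi;
}
```

Every word on the line would then be written into the PDF at the
returned size.  A line made up entirely of short lowercase letters would
still be underestimated, and a single tall outlier (a stray mark or
merged line) would inflate the result, so a median or high percentile of
tops and bottoms may be a more robust variant.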
