Re: Generic comments and questions

Jimmy O'Regan Tue, 20 Jul 2010 14:21:15 -0700

On 20 July 2010 20:40, CraigLandrum <[email protected]> wrote:
> - The various "dawg" and tessconfig files ("batch", etc) appear to
> have only slight effect on the output - probably because we are not
> using them correctly.


No. I answered basically this question once already today, and I don't
feel like repeating myself - the list is publicly archived, if you're
interested.

> These files are normally installed in specific
> linux folders separate from the application itself.  This is a non-
> starter for installable Mac and Windows software where it would be
> better to have them as semi-hidden/protected resources in the app
> bundle (Mac) or zipped resources (Windows), which is what we have
> done, after modifying the paths in the library code appropriately.

In Tesseract 3, all language data are kept in a single file per language.
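
For the bundling case, that means there is one file per language to ship.
As a rough sketch, assuming the usual Tesseract 3 layout of a "tessdata"
directory holding <lang>.traineddata, the file could be located like this
in Python (the function name and the resource directory are made up for
illustration, they are not part of Tesseract):

```python
import os


def traineddata_path(resource_dir, lang="eng"):
    """Build the path of the single per-language data file.

    Assumption: the app keeps a 'tessdata' directory inside its own
    resource directory, containing '<lang>.traineddata'.
    """
    if not lang:
        raise ValueError("lang must be a non-empty language code")
    return os.path.join(resource_dir, "tessdata", lang + ".traineddata")


def has_language(resource_dir, lang="eng"):
    """Return True if the language data file is actually present."""
    return os.path.isfile(traineddata_path(resource_dir, lang))
```

The resource directory itself would come from whatever the platform gives
you for the app bundle or the unpacked resources.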

> Question:  Are the "eng" config files based on a lot of training of
> typical English docs or would it actually be better for us to go
> through the hassle of doing the training ourselves?

That's not exactly a yes or no question: the answer to both parts is 'yes'.

> Note that the
> documents we process are from every industry you can think of and are
> not limited to something easy, like legal docs.
>
> - We VERY much like the ability to not have to link in a bunch of
> imaging libs, since we do all that internally already.  We appreciate
> the ability to simply
> pass a pointer to the start of image memory with descriptors as to the
> byte width, pixel width, depth, etc and get a pointer to text in
> return.  This concept is perfect.  We urge you to continue to support
> this ability and to maintain the ability to keep the OCR separate from
> image I/O.
>

Tesseract is an OCR engine that grew into being a library, not the
other way around; the primary users are the users of the engine. If I
found a cross-platform PDF reading library under a compatible licence
in the morning, I'd link it in without a moment's hesitation, because
it would be an improvement for the majority of users.

As it happens, Tesseract 3 has mostly moved its image processing to
Leptonica, which is also used for page segmentation.

> Question:  Are there different methods we should be using to get font/
> font size information about the recognized text?  If not, can you
> suggest a technique for estimating the font size that would be better
> than looking at the rect info returned from TesseractRectBoxes?
> Finally, is there any way to get some font info? We would be happy
> simply knowing if the font is a serif or non-serif font after which we
> could simply select between something like Arial or Times, etc.

Apparently, font size per character is in the EANYCODE_CHAR structure,
and font name etc. is kept in the EFONT_DESC structure (ocrclass.h),
though I'm not aware of anything that actually uses them.
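
If you end up estimating the size from the boxes anyway, something along
these lines might be a starting point. This is only a sketch, not anything
Tesseract does: the function name, the 0.7 cap-height ratio and the choice
of the taller half of the boxes are all assumptions, and it presumes you
know the resolution of the scan.

```python
from statistics import median


def estimate_point_size(box_heights_px, dpi, cap_height_ratio=0.7):
    """Rough font size in points from character box heights in pixels.

    Assumptions (not from Tesseract): capitals and ascenders are about
    cap_height_ratio of the nominal font size, and the taller half of
    the boxes on a line are capitals or ascenders.
    """
    if not box_heights_px:
        raise ValueError("need at least one box height")
    if dpi <= 0 or cap_height_ratio <= 0:
        raise ValueError("dpi and cap_height_ratio must be positive")

    # Keep the taller half so that x-height letters and punctuation
    # do not drag the estimate down.
    ordered = sorted(box_heights_px)
    tall = ordered[len(ordered) // 2:]

    cap_height_px = median(tall)
    font_height_px = cap_height_px / cap_height_ratio

    # 72 points per inch.
    return font_height_px / dpi * 72.0
```

You would call it once per line or per word with the heights of the boxes
in that unit, e.g. estimate_point_size(heights, 300) for a 300 dpi scan,
and round to the nearest common size.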


-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
