On 20 July 2010 20:40, CraigLandrum <[email protected]> wrote:
> - The various "dawg" and tessconfig files ("batch", etc) appear to
> have only slight effect on the output - probably because we are not
> using them correctly.
No. I answered basically this question once already today, and I don't
feel like repeating myself - the list is publicly archived, if you're
interested.

> These files are normally installed in specific
> linux folders separate from the application itself. This is a
> non-starter for installable Mac and Windows software where it would be
> better to have them as semi-hidden/protected resources in the app
> bundle (Mac) or zipped resources (Windows), which is what we have
> done, after modifying the paths in the library code appropriately.

In Tesseract 3, all language data are kept in a single file per language.

> Question: Are the "eng" config files based on a lot of training of
> typical English docs or would it actually be better for us to go
> through the hassle of doing the training ourselves?

That's not exactly a yes or no question: the answer to both parts is
'yes'.

> Note that the
> documents we process are from every industry you can think of and are
> not limited to something easy, like legal docs.
>
> - We VERY much like the ability to not have to link in a bunch of
> imaging libs, since we do all that internally already. We appreciate
> the ability to simply
> pass a pointer to the start of image memory with descriptors as to the
> byte width, pixel width, depth, etc and get a pointer to text in
> return. This concept is perfect. We urge you to continue to support
> this ability and to maintain the ability to keep the OCR separate from
> image I/O.
>

Tesseract is an OCR engine that grew into being a library, not the
other way around; the primary users are the users of the engine. If I
found a cross-platform PDF reading library under a compatible licence
in the morning, I'd link it in without a moment's hesitation, because
it would be an improvement for the majority of users. As it happens,
Tesseract 3 has mostly moved its image processing to Leptonica, which
is also used for page segmentation.
> Question: Are there different methods we should be using to get font/
> font size information about the recognized text? If not, can you
> suggest a technique for estimating the font size that would be better
> than looking at the rect info returned from TesseractRectBoxes?
> Finally, is there any way to get some font info? We would be happy
> simply knowing if the font is a serif or non-serif font after which we
> could simply select between something like Arial or Times, etc.

Apparently, font size per character is in the EANYCODE_CHAR structure,
and font name etc. is kept in the EFONT_DESC structure (ocrclass.h),
though I'm not aware of anything that actually uses them.
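For what it's worth, if you do stay with the rect info, here is a rough sketch of turning character box heights into a point size. It does not call Tesseract at all: the function name, the 300 dpi default and the 0.7 cap-height-to-font-size ratio are my own assumptions for illustration, and the ratio in particular varies from typeface to typeface. The only firm part is the unit conversion (72 points per inch).

```python
from statistics import median

POINTS_PER_INCH = 72.0


def estimate_point_size(box_heights_px, dpi=300, cap_height_ratio=0.7):
    """Roughly estimate a font size in points from character box heights.

    box_heights_px   -- heights in pixels of the character boxes on one line
    dpi              -- scan resolution (assumed known; 300 is only a default)
    cap_height_ratio -- assumed ratio of capital/ascender height to the
                        nominal font size; 0.7 is a guess, not a constant
    """
    if not box_heights_px:
        raise ValueError("need at least one box height")
    if dpi <= 0 or not 0 < cap_height_ratio <= 1:
        raise ValueError("dpi must be positive and ratio in (0, 1]")

    # Keep the taller half of the boxes so that capitals and ascenders
    # dominate over x-height letters and punctuation.
    heights = sorted(box_heights_px)
    taller_half = heights[len(heights) // 2:]

    # The median of that half is less sensitive to one oversized box
    # (for example two touching characters boxed together) than the maximum.
    cap_height_px = median(taller_half)

    # pixels -> inches -> points, then scale up to the nominal font size.
    cap_height_pt = cap_height_px * POINTS_PER_INCH / dpi
    return cap_height_pt / cap_height_ratio
```

With these assumptions, a line scanned at 300 dpi whose capitals are about 35 pixels tall comes out at roughly 12 pt. Treat the result as a starting point for picking a display size, not as a measurement.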

