Hi Alan,

Sorry for the delay. Personally, I wouldn't work with pixelated images of this font's characters. I'd rather blur and then threshold them to get smoother and wider strokes, which are the conditions Tesseract is designed for. All in all, the "ideal" conditions you are asking about are a matter of experimentation here, so I cannot answer that question offhand.
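To make the blur-then-threshold idea concrete, here is a minimal sketch in plain Python with no imaging library. The function names, the box-blur kernel, and the cutoff of 200 are illustrative assumptions, not settings tested against this font. In practice an image tool or library would do the same job on real image files.

```python
def box_blur(img, radius=1):
    """Mean filter over a (2*radius+1) square window.

    img is a list of rows of grayscale values (0 = black, 255 = white).
    Near the borders only the pixels that exist are averaged.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            count = 0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def threshold(img, cutoff):
    """Binarize: values below cutoff become black (0), the rest white (255)."""
    return [[0 if v < cutoff else 255 for v in row] for row in img]


def smooth_and_thicken(img, radius=1, cutoff=200):
    """Blur, then threshold with a cutoff above mid-gray.

    Blurring spreads some darkness onto the pixels next to a stroke; a high
    cutoff then classifies those gray pixels as black, which widens the stroke.
    """
    return threshold(box_blur(img, radius), cutoff)


if __name__ == "__main__":
    # A one-pixel-wide vertical stroke on a white background.
    stroke = [[255, 255, 0, 255, 255] for _ in range(5)]
    for row in smooth_and_thicken(stroke):
        print(row)
```

The cutoff is the parameter to experiment with: too low and thin strokes can vanish after blurring, too high and neighbouring characters may merge.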
HTH

Warm regards,
Dmitri Silaev
www.CustomOCR.com

On Wed, Sep 7, 2011 at 12:33 AM, Alan Willard <[email protected]> wrote:
> Hi thanks,
> I can attach some sample images, it may not be possible for me to attach the
> training data since we developed this under contract with our customers.
>
> A few more data points.
> We trained Tesseract for a specific font "MS Sans Serif"
> Training process was basically the same as the wiki. The sample text used to
> create the boxfile was the same from the Tesseract 2 data set.
> We are running version 3.01. I do not know the SVN revision it was compiled
> from but the date was approximately June 16th of this year.
> We are calling tesseract from the command line.
> The image is being scaled with Mogrify before Tesseract by a value of 250%
>
> Hopefully this is enough to get some help. Thanks
>
> On Fri, Sep 2, 2011 at 12:03 AM, Dmitri Silaev <[email protected]>
> wrote:
>>
>> Although you've given some info, it's not enough. Please complete the
>> following checklist:
>>
>> >>
>> Make sure you have read the Wiki at
>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>> and searched the forum for questions similar to yours.
>>
>> If you'd like your question to be answered, please ensure your message
>> contains the following:
>> - Sample image (or a set of such images) you are trying to recognize
>> - If you trained Tesseract yourself, attach all the source files you
>>   used to build your "traineddata" file and the "traineddata" file itself
>> - Provide all the command lines you used to train Tesseract and recognize
>>   images
>> - Attach all config files you used during training and recognition, no
>>   matter if they are "stock" or created manually
>> - If you are using a compiled Tesseract executable, report the web page
>>   from where you downloaded it
>> - If you compile Tesseract yourself or call it from your own code,
>>   indicate the SVN revision you use
>> - If you call Tesseract from code, provide the entire code snippet you
>>   use for calling
>>
>> The less info you provide, the lower the chances that your question will
>> be answered.
>> Providing the full info does not guarantee your question to be answered,
>> though.
>> <<
>>
>> Warm regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>
>> On Thu, Sep 1, 2011 at 7:06 PM, Alan Willard <[email protected]> wrote:
>> > Hello All,
>> > I have an OCR scenario where we are trying to OCR text from screen
>> > images. I have a trained language that includes the one specific font
>> > in use.
>> >
>> > I have noticed a couple of strange issues.
>> >
>> > 1.) unicharambigs and dictionary seem to have no effect. For example
>> > a very common error I see is the character 'a' being interpreted as an
>> > 'e'. This is despite having a line in unicharambigs that tries to
>> > resolve the ambiguity, AND the original word is a dictionary word, and
>> > the result is not. Example: art -> ert
>> >
>> > 2.) The size of the image seems to greatly influence the quality of
>> > OCR. Not only the size, but the location of the text within that
>> > image.
>> > My OCR scenarios are really simple, black text on a white
>> > background, no other noise (like a standard text field). I will get
>> > different OCR results based on the amount of white space around the
>> > text, having more white space on the right gives me a different result
>> > than having more white space on the left, and so on. Some of the
>> > results are horrendously bad, and are miraculously accurate when the
>> > image is slightly changed, but I can't find a one-size-fits-all
>> > solution. What are the ideal image specifications to OCR?
>> >
>> > --
>> > You received this message because you are subscribed to the Google
>> > Groups "tesseract-ocr" group.
>> > To post to this group, send email to [email protected]
>> > To unsubscribe from this group, send email to
>> > [email protected]
>> > For more options, visit this group at
>> > http://groups.google.com/group/tesseract-ocr?hl=en

