Re: OCR Problems (unicharambigs and image sizes)

Dmitri Silaev Sun, 11 Sep 2011 00:22:44 -0700

Hi Alan,

Sorry for the delay. As for me, I wouldn't work with pixelated
images of this font's characters. I'd rather blur and then threshold
them to get smoother and wider strokes, which are the
conditions Tesseract is designed for. All in all, the "ideal"
conditions you are asking about are a matter of experimentation here,
and I cannot answer this question at once.
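
To illustrate the blur-then-threshold idea, here is a minimal sketch in
plain Python. It is not taken from Tesseract or from any tool mentioned in
this thread; the function names (box_blur, threshold, smooth_and_thicken),
the box filter, the radius and the cutoff are all assumptions chosen for
illustration. It assumes a grayscale image stored as a list of rows, with
0 for black ink and 255 for white background.

```python
def box_blur(img, radius=1):
    """Mean filter over a (2*radius+1) square window.

    Pixels near the border average only over the neighbours that exist.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            count = 0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def threshold(img, cutoff):
    """Pixels darker than cutoff become black (0), all others white (255)."""
    return [[0 if v < cutoff else 255 for v in row] for row in img]


def smooth_and_thicken(img, radius=1, cutoff=200):
    """Blur first, then binarize.

    With dark text on a light background, a high cutoff turns every pixel
    that picked up some ink from its neighbours into ink, so strokes get
    wider and jagged edges get filled in.
    """
    return threshold(box_blur(img, radius), cutoff)


if __name__ == "__main__":
    # A one-pixel-wide vertical stroke on a white 5x5 background.
    image = [[255, 255, 0, 255, 255] for _ in range(5)]
    for row in smooth_and_thicken(image):
        print(row)
```

In this sketch a higher cutoff makes strokes thicker and a lower one makes
them thinner, so both radius and cutoff would need to be tuned on real
samples. A real pipeline would more likely use an image library's blur
(often Gaussian) and threshold operations than hand-written loops.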


HTH

Warm regards,
Dmitri Silaev
www.CustomOCR.com





On Wed, Sep 7, 2011 at 12:33 AM, Alan Willard <[email protected]> wrote:
> Hi, thanks.
> I can attach some sample images, but it may not be possible for us to
> attach the training data since we developed this under contract with our
> customers.
>
> A few more data points.
> We trained Tesseract for a specific font, "MS Sans Serif".
> The training process was basically the same as on the wiki. The sample text
> used to create the box file was the same as in the Tesseract 2 data set.
> We are running version 3.01. I do not know the SVN revision it was compiled
> from but the date was approximately June 16th of this year.
> We are calling tesseract from the command line.
> The image is scaled with Mogrify to 250% before being passed to Tesseract.
>
> Hopefully this is enough to get some help. Thanks
>
> On Fri, Sep 2, 2011 at 12:03 AM, Dmitri Silaev <[email protected]>
> wrote:
>>
>> Although you've given some info, it's not enough. Please complete the
>> following checklist:
>>
>> >>
>> Make sure you have read the Wiki at
>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>> and searched the forum for questions similar to yours.
>>
>> If you'd like your question to be answered, please ensure your message
>> contains the following:
>> - Sample image (or a set of such images) you are trying to recognize
>> - If you trained Tesseract yourself, attach all the source files you
>> used to build your "traineddata" file and the "traineddata" file itself
>> - Provide all the command lines you used to train Tesseract and recognize
>> images
>> - Attach all config files you used during training and recognition, no
>> matter if they are "stock" or created manually
>> - If you are using a compiled Tesseract executable, report the web page
>> from where you downloaded it
>> - If you compile Tesseract yourself or call it from your own code,
>> indicate the SVN revision you use
>> - If you call Tesseract from code, provide the entire code snippet you
>> use for calling
>>
>> The less info you provide, the lower the chances that your question will
>> be answered.
>> Providing the full info does not guarantee an answer, though.
>> <<
>>
>> Warm regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>
>>
>>
>>
>>
>> On Thu, Sep 1, 2011 at 7:06 PM, Alan Willard <[email protected]> wrote:
>> > Hello All,
>> > I have an OCR scenario where we are trying to OCR text from screen
>> > images. I have a trained language that includes the one specific font
>> > in use.
>> >
>> > I have noticed a couple of strange issues.
>> >
>> > 1.) unicharambigs and dictionary seem to have no effect. For example
>> > a very common error I see is the character 'a' being interpreted as an
>> > 'e'. This is despite having a line in unicharambigs that tries to
>> > resolve the ambiguity, AND the original word is a dictionary word, and
>> > the result is not. Example: art -> ert
>> >
>> > 2.) The size of the image seems to greatly influence the quality of
>> > OCR. Not only the size, but the location of the text within that
>> > image. My OCR scenarios are really simple, black text on a white
>> > background, no other noise (like a standard text field). I will get
>> > different OCR results based on the amount of white space around the
>> > text, having more white space on the right gives me a different result
>> > than having more white space on the left, and so on. Some of the
>> > results are horrendously bad, and are miraculously accurate when the
>> > image is slightly changed, but I can't find a one-size-fits-all
>> > solution. What are the ideal image specifications to OCR?
>> >
>> > --
>> > You received this message because you are subscribed to the Google
>> > Groups "tesseract-ocr" group.
>> > To post to this group, send email to [email protected]
>> > To unsubscribe from this group, send email to
>> > [email protected]
>> > For more options, visit this group at
>> > http://groups.google.com/group/tesseract-ocr?hl=en
>> >
>>

