Re: Re: Re: tesseract improve the reject rate ?

Dmitri Silaev Wed, 30 Mar 2011 22:03:39 -0700

Liuguanqiang,

Well, now I guess I understand what you want. You have a text consisting of
arbitrary characters: digits and Chinese characters. Your goal is to find a
particular fragment in this text, knowing that it can consist only of the
digits 5678. Can you confirm?


If so, the first thing to do is to set the whitelist to "5678", as you've
already done yourself. Next, you need to play with the "suspect_rating_per_ch"
and "suspect_accept_rating" params. In theory, they have something to do
with rejection. Note that they relate to Tesseract "ratings", which are
different from the usual probabilities (or confidence values): they are
inverted (smaller is better) and can be negative. You can also take a look at
- matcher_good_threshold
- matcher_great_threshold
- matcher_perfect_threshold
- matcher_bad_match_pad
I'm not sure what they do; you'll have to study that yourself in the code.
They are also measured in ratings.
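
For illustration, the params above could be collected into a Tesseract config
file, which is plain text with one "name value" pair per line. The sketch
below only builds that text. The helper name build_config and the numeric
values are placeholders made up for this example, not recommended settings;
the real values have to come out of your own experiments.

```python
def build_config(params):
    """Render a dict of Tesseract variables as config-file text.

    Assumes the usual config format: one "name value" pair per line.
    """
    lines = []
    for name, value in params.items():
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only; tune them on your own images.
params = {
    "tessedit_char_whitelist": "5678",
    "suspect_rating_per_ch": 5.0,      # hypothetical value
    "suspect_accept_rating": -1.0,     # hypothetical value
}

config_text = build_config(params)
print(config_text)
```

Saved to a file, this text can be given to the tesseract command as a config
argument after the output base name, so each experiment is just an edit of
one number.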

All this, in theory, allows you to tune Tesseract with no programming. But
if you choose to program, you can simply iterate over the resulting chars
and reject the low-confidence ones yourself, with no need to experiment.
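
The do-it-yourself rejection can be sketched roughly like this. It assumes
you already have (character, confidence) pairs from whatever API you use,
with confidence on a 0..100 scale where larger is better (unlike the ratings
above). The function name, the sample data and the thresholds are all made
up for illustration.

```python
def reject_low_confidence(chars, min_conf):
    """Keep only characters whose confidence is at least min_conf.

    chars: list of (character, confidence) pairs; confidence is assumed
           to be in 0..100, larger meaning more certain.
    Returns the surviving characters joined into a string; an empty
    string means everything was rejected.
    """
    return "".join(ch for ch, conf in chars if conf >= min_conf)


# Made-up per-character results standing in for real OCR output.
sample = [("8", 41.0), ("6", 38.5), ("6", 92.0), ("5", 30.2)]

print(reject_low_confidence(sample, 60.0))        # prints "6"
print(repr(reject_low_confidence(sample, 95.0)))  # prints ''
```

With a strict enough threshold, a false detection comes back as an empty
string, which is the "null" output you are asking for.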

HTH

Warm regards,
Dmitri Silaev




2011/3/31 liuguanqiang <[email protected]>

>  For example, I use eng.traineddata (whitelist set to "0123456789") to
> recognize the digits in the following picture:
>  Tesseract outputs the correct result: "24013091"
> Now, I know there are only "5678" in the input image, so I set the
> whitelist to "5678".
> On the above image, Tesseract outputs a wrong result (including the
> length): "866685".
> In this case, how can I make Tesseract output empty or null?
> My question is: which tess variables control the classifier match criterion?
>
> I want to tune these tess variables to make Tesseract output null in the
> above case.
> 2011-03-31
> ------------------------------
>  liuguanqiang
> ------------------------------
> *From:* Dmitri Silaev
> *Sent:* 2011-03-29  14:59:22
> *To:* tesseract-ocr
> *Cc:* liuguanqiang
> *Subject:* Re: Re: tesseract improve the reject rate ?
>  As I always say, send the sample image(s) and describe what you need
> exactly. Maybe you're looking in the wrong direction.
>  Warm regards,
> Dmitri Silaev
>    On Tue, Mar 29, 2011 at 7:34 AM, liuguanqiang <[email protected]
> > wrote:
> > Thanks for your reply.
> > In another case, I use Tesseract to recognize Chinese characters.
> > Some Chinese characters are recognized as other, wrong Chinese characters,
> > though they are very different in appearance.
> > Is the reason that Chinese characters have many (dense) strokes?
> > In this case, detecting ROIs does not help.
>
> > My question is: which tess variables control the classifier match metrics?
> > I want to tune these tess variables to solve this problem or
> > improve the reject rate.
> > Best regards
> > 2011-03-29
> > ________________________________
> > liuguanqiang
> > ________________________________
> > From: Dmitri Silaev
> > Sent: 2011-03-27  05:36:01
> > To: tesseract-ocr
> > Cc: liuguanqiang
> > Subject: Re: tesseract improve the reject rate ?
> > When you have a small trained alphabet, Tesseract's classifier
> > sometimes might not find suitable matches and in that way it will
> > output a null character further converted to a space. However in your
> > case, there are Chinese characters that have many strokes and
> > outlines, many of which somehow (partially) match the characters from
> > your whitelist. So be ready for a quantity of false detections even
> > when your alphabet is small, i.e. you train Tess to get only digits.
> > The best approach would be to determine locations where regions of
> > interest (ROIs) are located, and then run the recognition over them,
> > using appropriate whitelists.
> > Warm regards,
> > Dmitri Silaev
> > On Sat, Mar 26, 2011 at 8:44 AM, liuguanqiang <[email protected]
> > wrote:
> >> Hi,
> >> I use Tesseract to recognize digits (whitelist set to "0123456789") using
> >> eng.traineddata.
> >> The test image also contains characters from another set (Chinese), but
> >> Tesseract recognizes the Chinese characters as digits.
> >> Are there tess variables to control this situation? Is this problem
> >> the same as "improving the reject rate"?
>
> >> The following picture (binary) is recognized as "5221555255". How can I
> >> make Tesseract output null?
> >>
> >>
> >> --
>
> >> You received this message because you are subscribed to the Google Groups
> >> "tesseract-ocr" group.
> >> To post to this group, send email to [email protected].
> >> To unsubscribe from this group, send email to
> >> [email protected].
> >> For more options, visit this group at
> >> http://groups.google.com/group/tesseract-ocr?hl=en.
> >>
>


<<testImage(03-31-11-51-18).jpg>>
