Re: Ugly behavior when recognizing – advice requirement

Andres Mon, 20 May 2013 22:54:27 -0700

Hi Dmitri,

Many thanks for your help.


I tried setting PageSegMode to PSM_SINGLE_BLOCK_VERT_TEXT and, surprisingly,
I got very good results.

But then I switched from Tesseract 3.01 to 3.02 (revision 724) and the
behavior of Tesseract changed significantly, for the worse in my case. It
began to detect two characters where there is only one: one in the upper
part of the character and another in the lower part.

So I tested calling Tesseract once per character (PSM_SINGLE_CHAR), since I
do the segmentation myself. The results were OK for some characters, but for
others it detected the inner contour of characters like Q as a separate
character (please see the red rectangle in this image:
https://docs.google.com/file/d/0BxkuvS_LuBAzeDJQRWg2aHBnNFU/edit?usp=sharing)

Do you have any suggestions on this?
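
In the meantime, a workaround I am considering is to post-filter the symbol
boxes returned for each character and discard any box that lies completely
inside another one, so that the hole of a Q is never reported as a character
of its own. This is only a minimal sketch: it assumes the boxes have already
been copied out of the result iterator into a plain struct, and Box,
IsNestedIn and DropNestedBoxes are hypothetical names of mine, not part of
Tesseract.

```cpp
#include <cstddef>
#include <vector>

// Axis-aligned bounding box of one detected symbol, in pixel coordinates.
// Hypothetical helper type, not a Tesseract class.
struct Box {
    int left, top, right, bottom;
};

// True when `inner` lies entirely within `outer` and is not the same box.
bool IsNestedIn(const Box& inner, const Box& outer) {
    const bool same = inner.left == outer.left && inner.top == outer.top &&
                      inner.right == outer.right && inner.bottom == outer.bottom;
    return !same && inner.left >= outer.left && inner.top >= outer.top &&
           inner.right <= outer.right && inner.bottom <= outer.bottom;
}

// Keeps only the boxes that are not nested inside another box, preserving
// their original order.
std::vector<Box> DropNestedBoxes(const std::vector<Box>& boxes) {
    std::vector<Box> kept;
    for (std::size_t i = 0; i < boxes.size(); ++i) {
        bool nested = false;
        for (std::size_t j = 0; j < boxes.size() && !nested; ++j) {
            if (i != j && IsNestedIn(boxes[i], boxes[j])) nested = true;
        }
        if (!nested) kept.push_back(boxes[i]);
    }
    return kept;
}
```

With the outer box of a Q, the box of its inner contour and the box of the
next character as input, only the first and the last would survive. It does
nothing for the upper/lower split that I see with the vertical-text mode,
where neither box contains the other.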

I’ve been thinking that there may be a variable that restricts Tesseract a
little (
http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version ), but the
list is so long that it is discouraging.
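
If no single variable does the job, the approach you described earlier
(matching the alternatives of each character against the allowed patterns)
might look roughly like the sketch below. It is only an illustration under
my own assumptions: the alternatives have already been collected from the
ChoiceIterator, the plate format is three letters followed by four digits as
in SAA5298, and Choice, MatchesClass, BestMatch and the 'L'/'D' pattern
letters are names I made up.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// One alternative for a character position, as read from a choice iterator.
struct Choice {
    char symbol;
    float confidence;  // higher means more likely
};

// Hypothetical pattern alphabet: 'L' = uppercase letter, 'D' = digit.
bool MatchesClass(char symbol, char pattern_class) {
    const unsigned char c = static_cast<unsigned char>(symbol);
    if (pattern_class == 'L') return std::isupper(c) != 0;
    if (pattern_class == 'D') return std::isdigit(c) != 0;
    return false;
}

// For every position, picks the most confident alternative that the pattern
// allows. Returns an empty string if the lengths differ or if some position
// has no allowed alternative.
std::string BestMatch(const std::vector<std::vector<Choice>>& positions,
                      const std::string& pattern) {
    if (positions.size() != pattern.size()) return "";
    std::string result;
    for (std::size_t i = 0; i < positions.size(); ++i) {
        const Choice* best = nullptr;
        for (std::size_t k = 0; k < positions[i].size(); ++k) {
            const Choice& c = positions[i][k];
            if (MatchesClass(c.symbol, pattern[i]) &&
                (best == nullptr || c.confidence > best->confidence)) {
                best = &c;
            }
        }
        if (best == nullptr) return "";
        result += best->symbol;
    }
    return result;
}
```

The idea is that a "B" reported in a digit position is skipped in favor of an
"8" if that appears among the alternatives, even with a lower confidence. It
cannot help when the right character is not among the alternatives at all.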

I have also been thinking of doing something with RunAdaptiveClassifier,
which is exposed by the API, but I’m not sure whether that function can be
used to OCR a single character.

The main peculiarity of my use case is that I already have the text
segmented, so I would expect this to be easy. That’s why I think that
perhaps I’m making a big mistake somewhere.

Best regards,

Andres




2013/5/7 Dmitri Silaev <[email protected]>

> Andres,
>
> Your code seems to be correct. I personally use a few more lines right
> after the call to GetIterator():
>     it->Begin();
>     if(it->IsAtFinalElement(RIL_BLOCK, RIL_SYMBOL))
>         return;
>     if(!it->IsAtBeginningOf(RIL_SYMBOL))
>         return;
> But this shouldn't bother you if you rely on non-degenerate cases.
>
> Well, I suggest using revision 724. It is battle-tested by me and
> probably contains fewer bugs and has a better balance between accuracy
> and speed than any newer revision. Although newer ones may introduce
> many fancy features, I'd refrain from using them in production. Maybe
> this can help you.
>
> Warm regards,
> Dmitri Silaev
> www.CustomOCR.com
>
>
> On Mon, May 6, 2013 at 9:28 AM, Andres <[email protected]> wrote:
> > Answering part of what I asked last, I've found a way of getting the
> > alternatives for each char, but it seems not to work in 3.01, according
> > to what I tested and
> > http://code.google.com/p/tesseract-ocr/issues/detail?id=714
> > My snippet:
> >
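> > #include <iostream>
> > #include <api/resultiterator.h>
> >
> > ...
> >
> > // Ask Tesseract to keep the alternative choices for each blob.
> > tess_api.SetVariable("save_blob_choices", "T");
> >
> > ...
> >
> > tesseract::ResultIterator* it = tess_api.GetIterator();
> > if (it != NULL)
> > {
> >     do
> >     {
> >         // Best choice for the current symbol and its confidence.
> >         char* uval = it->GetUTF8Text(tesseract::RIL_SYMBOL);
> >         std::cout<<(uval == NULL ? "#" : uval)<<"("<<it->Confidence(tesseract::RIL_SYMBOL)<<"){";
> >         delete[] uval;  // the caller owns the buffer returned by GetUTF8Text
> >         // Alternatives for the same symbol, each with its own confidence.
> >         tesseract::ChoiceIterator ci(*it);
> >         do
> >         {
> >             const char* val = ci.GetUTF8Text();
> >             std::cout<<" "<<(val == NULL ? "#" : val)<<" "<<ci.Confidence();
> >         }
> >         while (ci.Next());
> >         std::cout<<"}";
> >     }
> >     while (it->Next(tesseract::RIL_SYMBOL));
> >     delete it;  // the iterator returned by GetIterator must be deleted after use
> > }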
> >
> >
> >
> >
> >
> > On Monday, May 6, 2013 01:50:42 UTC-3, Andres wrote:
> >>
> >> Hi Dmitri,
> >>
> >> Many thanks for your hints, as always.
> >>
> >> Regarding the links in my previous message, sorry about that. I'll
> >> repost the entire message below, with the links fixed.
> >>
> >> I like the method that you say you use in CustomOCR. Is there a way of
> >> getting the character variants without resorting to a hack? As far as I
> >> saw, the interface of the API only exposes the confidence level for each
> >> character. Am I right about this?
> >>
> >> Regarding the psm mode, I'm setting it from inside my code to the value
> >> 7, which is for "Treat the image as a single text line". Is that the
> >> parameter that you are suggesting?
> >>
> >> Anyway, I think that I might be making big newbie errors in my training,
> >> so I would be grateful if you could take a look at my training image and
> >> my problematic images, to see whether an obvious error stands out at
> >> first sight.
> >>
> >> My training image:
> >> https://docs.google.com/file/d/0BxkuvS_LuBAzLV8yVkt4OTd5Sk0/edit?usp=sharing
> >>
> >> Problematic image (a "6" recognized as a "5"):
> >> https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing
> >>
> >> Another problematic image ("A A" recognized as "M"):
> >> https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit
> >>
> >> The following is my original message with the links fixed:
> >>
> >> Dear people,
> >>
> >> I trained Tesseract for my font (FE-Schrift:
> >> http://de.wikipedia.org/wiki/FE-Schrift ) and I’m getting very bad
> >> results. I am using Tesseract 3.01 under Windows.
> >>
> >> In this image:
> >> https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit?usp=sharing
> >>
> >> where the text is SAA5298, I’m getting SM529B. This is being done from
> >> inside a program, and I know that the “M” in the result comes from the
> >> “AA” of the source. So Tesseract is segmenting these two characters very
> >> badly, even though they are very well separated, as you can see. Do you
> >> have any idea why this is happening? On the other hand, is there a way
> >> to give Tesseract a hint for this (e.g., telling it the character
> >> width)?
> >>
> >> The other problem is with this one:
> >> https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing
> >>
> >> where the text is LDA6244. Tesseract is recognizing a “5” instead of a
> >> “6”, even though the image is very good.
> >>
> >> Here is my font training file:
> >> https://docs.google.com/file/d/0BxkuvS_LuBAzLV8yVkt4OTd5Sk0/edit?usp=sharing
> >>
> >> Here is my box file:
> >> https://docs.google.com/file/d/0BxkuvS_LuBAzbkNzUmtDcE8zbjA/edit?usp=sharing
> >>
> >> Here is my .traineddata file:
> >> https://docs.google.com/file/d/0BxkuvS_LuBAzQV94NWdLT1VUcjQ/edit?usp=sharing
> >>
> >> And here is a .cmd file for testing these 2 images:
> >> https://docs.google.com/file/d/0BxkuvS_LuBAzUVVfSDhVdEUtRjA/edit?usp=sharing
> >>
> >> Thanks,
> >>
> >> Andres
> >>
> >> On Friday, May 3, 2013 16:05:50 UTC-3, Dmitri Silaev wrote:
> >>>
> >>> Andres,
> >>>
> >>> First of all, your first link seems to point to a "traineddata" file
> >>> instead of an image. Second, without actually diving deep into your
> >>> problem, I can suggest specifying the single-line psm mode on the
> >>> command line. And finally, you can use the user patterns feature to
> >>> restrict the possible output of Tesseract (for the format, see the
> >>> comments in dict/trie.h on read_pattern_list()). Another way of
> >>> achieving the latter, which is what we do in CustomOCR and which seems
> >>> to be more reliable, is to use the API to get a number of character
> >>> variants for each blob along with their confidence levels and match
> >>> them against a set of possible patterns. You can find out how to do
> >>> this by searching around this forum.
> >>>
> >>> HTH and good luck with Tesseract!
> >>>
> >>> Warm regards,
> >>> Dmitri Silaev
> >>> www.CustomOCR.com
> >>>
> >>>
> >>> On Fri, May 3, 2013 at 8:24 PM, Andres <[email protected]> wrote:
> >>> > Dear people,
> >>> >
> >>> > I trained Tesseract for my font (FE-Schrift:
> >>> > http://de.wikipedia.org/wiki/FE-Schrift ) and I’m getting very bad
> >>> > results. I am using Tesseract 3.01 under Windows.
> >>> >
> >>> > In this image:
> >>> > https://docs.google.com/file/d/0BxkuvS_LuBAzeFNZUVA1cThLMG8/edit?usp=sharing
> >>> >
> >>> > where the text is SAA5298, I’m getting SM529B. This is being done
> >>> > from inside a program, and I know that the “M” in the result comes
> >>> > from the “AA” of the source. So Tesseract is segmenting these two
> >>> > characters very badly, even though they are very well separated, as
> >>> > you can see. Do you have any idea why this is happening? On the other
> >>> > hand, is there a way to give Tesseract a hint for this (e.g., telling
> >>> > it the character width)?
> >>> >
> >>> > The other problem is with this one:
> >>> > https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing
> >>> >
> >>> > where the text is LDA6244. Tesseract is recognizing a “5” instead of
> >>> > a “6”, even though the image is very good.
> >>> >
> >>> > Here is my font training file:
> >>> > https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit?usp=sharing
> >>> >
> >>> > Here is my box file:
> >>> > https://docs.google.com/file/d/0BxkuvS_LuBAzQV94NWdLT1VUcjQ/edit?usp=sharing
> >>> >
> >>> > Here is my .traineddata file:
> >>> > https://docs.google.com/file/d/0BxkuvS_LuBAzbkNzUmtDcE8zbjA/edit?usp=sharing
> >>> >
> >>> > And here is a .cmd file for testing these 2 images:
> >>> > https://docs.google.com/file/d/0BxkuvS_LuBAzUVVfSDhVdEUtRjA/edit?usp=sharing
> >>> >
> >>> > Thanks,
> >>> >
> >>> > Andres
> >>> >
> >>> > --
> >>> > --
> >>> > You received this message because you are subscribed to the Google
> >>> > Groups "tesseract-ocr" group.
> >>> > To post to this group, send email to [email protected]
> >>> > To unsubscribe from this group, send email to
> >>> > [email protected]
> >>> > For more options, visit this group at
> >>> > http://groups.google.com/group/tesseract-ocr?hl=en
> >>> >
> >>> > ---
> >>> > You received this message because you are subscribed to the Google
> >>> > Groups "tesseract-ocr" group.
> >>> > To unsubscribe from this group and stop receiving emails from it,
> >>> > send an email to [email protected].
> >>> > For more options, visit https://groups.google.com/groups/opt_out.
> >>> >
> >>> >
> >
>

