Re: Ugly behavior when recognizing – advice requirement

Dmitri Silaev Thu, 23 May 2013 02:55:22 -0700

Andres,

Inherently, Tesseract is designed to detect both normal and inverted
text, possibly within the same image. This is often a source of
confusion about what is background and what is foreground: for a
closed character, the interior is sometimes treated as the character
and the foreground pixels as the surrounding background. That's why it
is sometimes impractical to pass isolated character images or images
with little text: they can confuse Tesseract. I suggest passing a whole
text line and then iterating over the results, reading the recognized
characters and their confidence levels.
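
As a rough sketch of the second half of that suggestion, and assuming
the per-symbol text and confidence have already been read from the
result iterator (GetUTF8Text and Confidence at RIL_SYMBOL level),
something like the following could join the line and flag doubtful
characters. SymbolResult, JoinSymbols, LowConfidencePositions and the
threshold parameter are hypothetical names made up for illustration;
they are not part of the Tesseract API:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for one recognized symbol. In a real program the
// fields would be filled from GetUTF8Text(RIL_SYMBOL) and
// Confidence(RIL_SYMBOL) while iterating over the recognized line.
struct SymbolResult {
    std::string text;   // UTF-8 text of the symbol
    float confidence;   // confidence as reported by Tesseract, 0 to 100
};

// Concatenate the recognized symbols into the text of the whole line.
std::string JoinSymbols(const std::vector<SymbolResult>& symbols) {
    std::string line;
    for (const SymbolResult& s : symbols) {
        line += s.text;
    }
    return line;
}

// Return the positions of symbols whose confidence is below min_confidence,
// so that the caller can re-check or reject them.
std::vector<std::size_t> LowConfidencePositions(
        const std::vector<SymbolResult>& symbols, float min_confidence) {
    std::vector<std::size_t> positions;
    for (std::size_t i = 0; i < symbols.size(); ++i) {
        if (symbols[i].confidence < min_confidence) {
            positions.push_back(i);
        }
    }
    return positions;
}
```

A caller might collect the symbols of one plate, call
LowConfidencePositions(symbols, 60.0f), and re-examine only the flagged
positions. The value 60 is an arbitrary placeholder that would need
tuning on real data.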


Warm regards,
Dmitri Silaev
www.CustomOCR.com


On Tue, May 21, 2013 at 9:54 AM, Andres <[email protected]> wrote:
> Hi Dmitri,
>
> Many thanks for your help.
>
> I’ve tried with PageSegMode in PSM_SINGLE_BLOCK_VERT_TEXT and surprisingly I
> got very good results.
>
> But then I switched from Tesseract 3.01 to 3.02 (revision 724) and the
> behavior of Tesseract changed significantly, and not for the better in my
> case. It began to detect two characters where there is only one, one in a
> higher position and another in a lower position.
>
> So I tested calling Tesseract for each char (PSM_SINGLE_CHAR), since I do
> the segmentation myself. The results for some characters were OK, but for
> others it detected the inner contours of characters like Q as a character
> (please see the red rectangle in this image:
> https://docs.google.com/file/d/0BxkuvS_LuBAzeDJQRWg2aHBnNFU/edit?usp=sharing
> ).
>
> Do you have any suggestions on this?
>
> I’ve been thinking that perhaps there is a variable that could restrict
> Tesseract a little (
> http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version ), but
> the list is so long that it is discouraging.
>
> I have also been thinking of doing something with RunAdaptiveClassifier,
> which is exposed by the API, but I’m not sure whether that function can be
> used to OCR a single char.
>
> The main peculiarity of my use case is that I already have the text
> segmented, so I would expect this to be easy. That’s why I think that
> perhaps I’m making a big mistake somewhere.
>
> Best regards,
>
> Andres
>
>
>
>
>
> 2013/5/7 Dmitri Silaev <[email protected]>
>>
>> Andres,
>>
>> Your code seems to be correct. I personally use a few more lines right
>> after the call to GetIterator():
>>     it->Begin();
>>     if(it->IsAtFinalElement(RIL_BLOCK, RIL_SYMBOL))
>>         return;
>>     if(!it->IsAtBeginningOf(RIL_SYMBOL))
>>         return;
>> But this shouldn't bother you if you rely on non-degenerate cases.
>>
>> Well, I suggest using revision 724. It is battle-tested by me and
>> probably contains fewer bugs and has a better balance between accuracy
>> and speed than any newer revision. Although newer ones may introduce
>> many fancy features, I refrain from using them in production. Maybe
>> this can help you.
>>
>> Warm regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>
>>
>> On Mon, May 6, 2013 at 9:28 AM, Andres <[email protected]> wrote:
>> > Answering part of my last question: I've found a way of getting the
>> > alternatives for each char, but it does not seem to work in 3.01,
>> > according to my tests and
>> > http://code.google.com/p/tesseract-ocr/issues/detail?id=714
>> > My snippet:
>> >
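>> > #include <iostream>
>> > #include <api/resultiterator.h>
>> >
>> > ...
>> >
>> > // Ask Tesseract to keep the alternative choices for each blob.
>> > tess_api.SetVariable("save_blob_choices", "T");
>> >
>> > ...
>> >
>> > tesseract::ResultIterator* it = tess_api.GetIterator();
>> > if (it != NULL)
>> > {
>> >     do
>> >     {
>> >         // GetUTF8Text() returns a new[]-allocated string owned by the caller.
>> >         char* uval = it->GetUTF8Text(tesseract::RIL_SYMBOL);
>> >         std::cout<<(uval == NULL ? "#" : uval)<<"("<<it->Confidence(tesseract::RIL_SYMBOL)<<"){";
>> >         delete[] uval;
>> >         // Walk the alternative choices for the current symbol; the text
>> >         // returned here is owned by the iterator and must not be freed.
>> >         tesseract::ChoiceIterator ci(*it);
>> >         do
>> >         {
>> >             const char* val = ci.GetUTF8Text();
>> >             std::cout<<" "<<(val == NULL ? "#" : val)<<" "<<ci.Confidence();
>> >         }
>> >         while (ci.Next());
>> >         std::cout<<"}";
>> >     }
>> >     while (it->Next(tesseract::RIL_SYMBOL));
>> >     delete it;  // the iterator returned by GetIterator() is owned by the caller
>> > }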
>> >
>> >
>> >
>> >
>> >
>> > On Monday, May 6, 2013 01:50:42 UTC-3, Andres wrote:
>> >>
>> >> Hi Dmitri,
>> >>
>> >> Many thanks for your hints, as always.
>> >>
>> >> Regarding the links in my previous message, sorry about that. I'll
>> >> repost the entire message below, with the links fixed.
>> >>
>> >> I like the method that you say you use in CustomOCR. Is there a way
>> >> of getting the character variants without resorting to a hack? As far
>> >> as I can see, the API only exposes the confidence level for each
>> >> character. Am I right about this?
>> >>
>> >> Regarding psm mode, I'm using it from inside my code with value 7,
>> >> which is for "Treat the image as a single text line". Is that the
>> >> setting you are suggesting?
>> >>
>> >> Anyway, I think that I might have made big newbie errors in my
>> >> training, so I would be grateful if you could take a look at my
>> >> training image and my problematic image, to tell me whether you see an
>> >> obvious error at first sight.
>> >>
>> >> My training image:
>> >>
>> >>
>> >> https://docs.google.com/file/d/0BxkuvS_LuBAzLV8yVkt4OTd5Sk0/edit?usp=sharing
>> >>
>> >> Problematic image (a "6" recognized as a "5"):
>> >>
>> >>
>> >> https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing
>> >>
>> >> Another problematic image ("A A" recognized as "M")
>> >> https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit
>> >>
>> >> The following is my original message with the links fixed:
>> >>
>> >> Dear people,
>> >>
>> >> I trained Tesseract for my font (FE-Schrift:
>> >> http://de.wikipedia.org/wiki/FE-Schrift ) and I’m getting very bad
>> >> results.
>> >> I am using Tesseract 3.01 under Windows.
>> >>
>> >> In this image:
>> >>
>> >>
>> >>
>> >> https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit?usp=sharing
>> >>
>> >> Where the text is SAA5298, I’m getting SM529B. This is being done from
>> >> inside a program, and I know that the “M” in the result comes from the
>> >> “AA” in the source. So Tesseract is segmenting these two characters
>> >> very badly, even though they are very well separated, as you can see.
>> >> Do you have any idea why this is happening? Also, is there a way to
>> >> give Tesseract a hint for this (e.g., telling it the character width)?
>> >>
>> >> The other problem is with this one:
>> >>
>> >>
>> >>
>> >> https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing
>> >>
>> >> Where the text is LDA6244, Tesseract is recognizing a “5” instead of a
>> >> “6”, even though the image is very good.
>> >>
>> >> Here is my font training file:
>> >>
>> >>
>> >>
>> >> https://docs.google.com/file/d/0BxkuvS_LuBAzLV8yVkt4OTd5Sk0/edit?usp=sharing
>> >>
>> >> Here is my box file:
>> >>
>> >>
>> >>
>> >> https://docs.google.com/file/d/0BxkuvS_LuBAzbkNzUmtDcE8zbjA/edit?usp=sharing
>> >>
>> >> Here is my .traineddata file:
>> >>
>> >>
>> >>
>> >> https://docs.google.com/file/d/0BxkuvS_LuBAzQV94NWdLT1VUcjQ/edit?usp=sharing
>> >>
>> >> And here is a .cmd file for testing these 2 images:
>> >>
>> >>
>> >>
>> >> https://docs.google.com/file/d/0BxkuvS_LuBAzUVVfSDhVdEUtRjA/edit?usp=sharing
>> >>
>> >>
>> >>
>> >> Thanks,
>> >>
>> >> Andres
>> >>
>> >> On Friday, May 3, 2013 16:05:50 UTC-3, Dmitri Silaev wrote:
>> >>>
>> >>> Andres,
>> >>>
>> >>> First of all, your first link seems to point to a "traineddata" file
>> >>> instead of an image. Second, without actually diving deep into your
>> >>> problem, I can suggest specifying the single-line psm mode on the
>> >>> command line. And finally, you can use the user patterns feature to
>> >>> restrict the possible output of Tesseract (for the format, see the
>> >>> comments in dict/trie.h on read_pattern_list()). Another way of
>> >>> achieving the latter, which is what we do in CustomOCR and which seems
>> >>> to be more reliable, is to use the API to get a number of character
>> >>> variants for each blob along with their confidence levels and match
>> >>> them against a set of possible patterns. You can find out how to do
>> >>> this by searching this forum.
>> >>>
>> >>> HTH and good luck with Tesseract!
>> >>>
>> >>> Warm regards,
>> >>> Dmitri Silaev
>> >>> www.CustomOCR.com
>> >>>
>> >>>
>> >>> On Fri, May 3, 2013 at 8:24 PM, Andres <[email protected]> wrote:
>> >>> > Dear people,
>> >>> >
>> >>> > I trained Tesseract for my font (FE-Schrift:
>> >>> > http://de.wikipedia.org/wiki/FE-Schrift ) and I’m getting very bad
>> >>> > results.
>> >>> > I am using Tesseract 3.01 under Windows.
>> >>> >
>> >>> > In this image:
>> >>> >
>> >>> >
>> >>> >
>> >>> > https://docs.google.com/file/d/0BxkuvS_LuBAzeFNZUVA1cThLMG8/edit?usp=sharing
>> >>> >
>> >>> > Where the text is SAA5298, I’m getting SM529B. This is being done
>> >>> > from inside a program, and I know that the “M” in the result comes
>> >>> > from the “AA” in the source. So Tesseract is segmenting these two
>> >>> > characters very badly, even though they are very well separated, as
>> >>> > you can see. Do you have any idea why this is happening? Also, is
>> >>> > there a way to give Tesseract a hint for this (e.g., telling it the
>> >>> > character width)?
>> >>> >
>> >>> > The other problem is with this one:
>> >>> >
>> >>> >
>> >>> >
>> >>> > https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing
>> >>> >
>> >>> > Where the text is LDA6244, Tesseract is recognizing a “5” instead of
>> >>> > a “6”, even though the image is very good.
>> >>> >
>> >>> >
>> >>> >
>> >>> > Here is my fonts training file:
>> >>> >
>> >>> >
>> >>> >
>> >>> > https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit?usp=sharing
>> >>> >
>> >>> > Here is my box file:
>> >>> >
>> >>> >
>> >>> >
>> >>> > https://docs.google.com/file/d/0BxkuvS_LuBAzQV94NWdLT1VUcjQ/edit?usp=sharing
>> >>> >
>> >>> > Here is my .traineddata file:
>> >>> >
>> >>> >
>> >>> >
>> >>> > https://docs.google.com/file/d/0BxkuvS_LuBAzbkNzUmtDcE8zbjA/edit?usp=sharing
>> >>> >
>> >>> > And here is a .cmd file for testing these 2 images:
>> >>> >
>> >>> >
>> >>> >
>> >>> > https://docs.google.com/file/d/0BxkuvS_LuBAzUVVfSDhVdEUtRjA/edit?usp=sharing
>> >>> >
>> >>> >
>> >>> >
>> >>> > Thanks,
>> >>> >
>> >>> > Andres
>> >>> >
>> >>> > --
>> >>> > --
>> >>> > You received this message because you are subscribed to the Google
>> >>> > Groups "tesseract-ocr" group.
>> >>> > To post to this group, send email to [email protected]
>> >>> > To unsubscribe from this group, send email to
>> >>> > [email protected]
>> >>> > For more options, visit this group at
>> >>> > http://groups.google.com/group/tesseract-ocr?hl=en
>> >>> >
>> >>> > ---
>> >>> > You received this message because you are subscribed to the Google
>> >>> > Groups
>> >>> > "tesseract-ocr" group.
>> >>> > To unsubscribe from this group and stop receiving emails from it,
>> >>> > send
>> >>> > an
>> >>> > email to [email protected].
>> >>> > For more options, visit https://groups.google.com/groups/opt_out.
>> >>> >
>> >>> >
