Re: Ugly behavior when recognizing – advice requirement

Andres Sun, 05 May 2013 22:28:59 -0700

Answering part of my last question: I've found a way to get the 
alternatives for each character, but it does not seem to work in 3.01, 
according to my own tests and 
http://code.google.com/p/tesseract-ocr/issues/detail?id=714
My snippet:


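#include <iostream>
#include <api/resultiterator.h>
...
tess_api.SetVariable("save_blob_choices", "T");
...

tesseract::ResultIterator* it = tess_api.GetIterator();
do {
    // Best choice for the current symbol and its confidence.
    char* uval = it->GetUTF8Text(tesseract::RIL_SYMBOL);
    std::cout << (uval == NULL ? "#" : uval) << "("
              << it->Confidence(tesseract::RIL_SYMBOL) << "){";
    delete[] uval;  // GetUTF8Text returns a string the caller must free

    // Alternative choices for the same symbol.
    tesseract::ChoiceIterator ci(*it);
    do {
        const char* val = ci.GetUTF8Text();
        std::cout << " " << (val == NULL ? "#" : val) << " " << ci.Confidence();
    } while (ci.Next());
    std::cout << "}";
} while (it->Next(tesseract::RIL_SYMBOL));
delete it;  // the caller owns the iterator returned by GetIterator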
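
As a rough sketch of the idea Dmitri describes in the quoted message below (matching the per-symbol alternatives against a set of possible patterns), something like the following could be applied to the alternatives printed by the loop above. None of these names come from the Tesseract API: Choice, Fits, BestMatch and the 'A'/'D' pattern classes are hypothetical, and the sketch assumes each symbol is a single ASCII character and that a higher confidence is better.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// One alternative for a symbol (hypothetical type, not a Tesseract one):
// the candidate character and its confidence, higher meaning better.
struct Choice {
    char symbol;
    float confidence;
};

// Hypothetical pattern alphabet: 'A' = upper-case letter, 'D' = digit.
bool Fits(char symbol, char pattern_class) {
    const unsigned char c = static_cast<unsigned char>(symbol);
    if (pattern_class == 'A') return std::isupper(c) != 0;
    if (pattern_class == 'D') return std::isdigit(c) != 0;
    return false;
}

// For each position, keep the most confident alternative that the pattern
// allows. Returns an empty string if the number of symbols differs from the
// pattern length or if some position has no allowed alternative.
std::string BestMatch(const std::vector<std::vector<Choice>>& alternatives,
                      const std::string& pattern) {
    if (alternatives.size() != pattern.size()) return "";
    std::string result;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const Choice* best = nullptr;
        for (const Choice& c : alternatives[i]) {
            if (Fits(c.symbol, pattern[i]) &&
                (best == nullptr || c.confidence > best->confidence)) {
                best = &c;
            }
        }
        if (best == nullptr) return "";
        result += best->symbol;
    }
    return result;
}
```

With a pattern such as "AAADDDD", a last symbol whose alternatives are 'B' (85) and '8' (80) would come out as '8', because 'B' is not a digit. Note that this only helps when the segmentation is right: a result with the wrong number of symbols (such as "AA" read as a single "M") is simply rejected, and two competing digits (a "5" versus a "6") are still decided by confidence alone.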





On Monday, May 6, 2013 01:50:42 UTC-3, Andres wrote:
>
> Hi Dmitri,
>
> Many thanks for your hints, as always.
>
> Sorry about the links in my previous message; I'll repost the entire 
> message below, with the links fixed.
>
> I like the method you say you use in CustomOCR. Is there a way to get the 
> character variants without resorting to a hack? As far as I can see, the 
> API only exposes the confidence level for each character. Am I right?
>
> Regarding the psm mode, I'm setting it from inside my code to 7, which 
> means "Treat the image as a single text line". Is that the parameter you 
> are suggesting?
>
> Anyway, I suspect I have made some big newbie errors in my training, so I 
> would be grateful if you could look at my training image and my 
> problematic images and tell me whether you see an obvious error at first 
> sight.
>
> My training image:
>
> https://docs.google.com/file/d/0BxkuvS_LuBAzLV8yVkt4OTd5Sk0/edit?usp=sharing
>
> Problematic image (a "6" recognized as a "5"):
>
> https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing
>
> Another problematic image ("AA" recognized as "M"):
> https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit
>
> The following is my original message with the links fixed:
>
> Dear people,
>
> I trained Tesseract for my font (FE-Schrift: 
> http://de.wikipedia.org/wiki/FE-Schrift ) and I’m getting very bad 
> results. I am using Tesseract 3.01 under Windows.
>
> In this image:
>
>
> https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit?usp=sharing
>
> The text is SAA5298, but I’m getting SM529B. This is being done from inside 
> a program, and I know that the “M” in the result comes from the “AA” in 
> the source. So Tesseract is segmenting these two characters very badly, 
> even though they are clearly separated, as you can see. Do you have any 
> idea why this is happening? Also, is there a way to give Tesseract a hint 
> for this (e.g., telling it the character width)?
>
> The other problem is with this one:
>
>
> https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing
>
> The text is LDA6244, but Tesseract is recognizing a “5” instead of the “6”, 
> even though the image is very clean.
>
> Here is my font training file:
>
>
> https://docs.google.com/file/d/0BxkuvS_LuBAzLV8yVkt4OTd5Sk0/edit?usp=sharing
>
> Here is my box file:
>
>
> https://docs.google.com/file/d/0BxkuvS_LuBAzbkNzUmtDcE8zbjA/edit?usp=sharing
>
> Here is my .traineddata file:
>
>
> https://docs.google.com/file/d/0BxkuvS_LuBAzQV94NWdLT1VUcjQ/edit?usp=sharing
>
> And here is a .cmd file for testing these 2 images:
>
>
> https://docs.google.com/file/d/0BxkuvS_LuBAzUVVfSDhVdEUtRjA/edit?usp=sharing
>
>  
>
> Thanks,
>
> Andres
>
> On Friday, May 3, 2013 16:05:50 UTC-3, Dmitri Silaev wrote:
>>
>> Andres, 
>>
>> First of all, your first link seems to point to a "traineddata" file 
>> instead of an image. Second, without diving deep into your problem, I 
>> can suggest specifying the single-line psm mode on the command line. 
>> And finally, you can use the user patterns feature to restrict the 
>> possible output of Tesseract (for the format, see the comments in 
>> dict/trie.h on read_pattern_list()). Another way of achieving the 
>> latter, which is what we do in CustomOCR and which seems to be more 
>> reliable, is to use the API to get a number of character variants for 
>> each blob along with their confidence levels and match them against a 
>> set of possible patterns. You can find out how to do this by searching 
>> this forum. 
>>
>> HTH and good luck with Tesseract! 
>>
>> Warm regards, 
>> Dmitri Silaev 
>> www.CustomOCR.com 
>>
>>
>> On Fri, May 3, 2013 at 8:24 PM, Andres <[email protected]> wrote: 
>> > Dear people, 
>> > 
>> > I trained Tesseract for my font (FE-Schrift: 
>> > http://de.wikipedia.org/wiki/FE-Schrift ) and I’m getting very bad 
>> > results. I am using Tesseract 3.01 under Windows. 
>> > 
>> > In this image: 
>> > 
>> > https://docs.google.com/file/d/0BxkuvS_LuBAzeFNZUVA1cThLMG8/edit?usp=sharing 
>> > 
>> > The text is SAA5298, but I’m getting SM529B. This is being done from 
>> > inside a program, and I know that the “M” in the result comes from the 
>> > “AA” in the source. So Tesseract is segmenting these two characters 
>> > very badly, even though they are clearly separated, as you can see. Do 
>> > you have any idea why this is happening? Also, is there a way to give 
>> > Tesseract a hint for this (e.g., telling it the character width)? 
>> > 
>> > The other problem is with this one: 
>> > 
>> > https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing 
>> > 
>> > The text is LDA6244, but Tesseract is recognizing a “5” instead of the 
>> > “6”, even though the image is very clean. 
>> > 
>> > Here is my font training file: 
>> > 
>> > https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit?usp=sharing 
>> > 
>> > Here is my box file: 
>> > 
>> > https://docs.google.com/file/d/0BxkuvS_LuBAzQV94NWdLT1VUcjQ/edit?usp=sharing 
>> > 
>> > Here is my .traineddata file: 
>> > 
>> > https://docs.google.com/file/d/0BxkuvS_LuBAzbkNzUmtDcE8zbjA/edit?usp=sharing 
>> > 
>> > And here is a .cmd file for testing these 2 images: 
>> > 
>> > https://docs.google.com/file/d/0BxkuvS_LuBAzUVVfSDhVdEUtRjA/edit?usp=sharing 
>> > 
>> > Thanks, 
>> > 
>> > Andres 
>> > 
>> > -- 
>> > -- 
>> > You received this message because you are subscribed to the Google 
>> > Groups "tesseract-ocr" group. 
>> > To post to this group, send email to [email protected] 
>> > To unsubscribe from this group, send email to 
>> > [email protected] 
>> > For more options, visit this group at 
>> > http://groups.google.com/group/tesseract-ocr?hl=en 
>> > 
>> > --- 
>> > You received this message because you are subscribed to the Google 
>> > Groups "tesseract-ocr" group. 
>> > To unsubscribe from this group and stop receiving emails from it, send 
>> > an email to [email protected]. 
>> > For more options, visit https://groups.google.com/groups/opt_out. 
>> > 
>> > 
>>
>

