Re: Ugly behavior when recognizing – advice requirement

Andres Sun, 05 May 2013 21:50:45 -0700

Hi Dmitri,

Many thanks for your hints, as always.


Sorry about the broken links in my previous message. I'm reposting the 
entire message below, with the links fixed.

I like the method that you say you use in CustomOCR. Is there a way of 
getting the character variants without resorting to a hack? As far as I 
can see, the API only exposes the confidence level for each character. 
Am I right about this?
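
To make the variant-matching idea concrete, here is a minimal sketch of the matching step only. It assumes the per-character variants and confidences have already been obtained somehow. The data layout, the function name best_match and the plate pattern (three capital letters followed by four digits, guessed from the two examples in this thread) are all assumptions for illustration, not anything taken from CustomOCR or the Tesseract API.

```python
import re
from itertools import product

# Hypothetical plate format inferred from the examples in this thread
# (SAA5298, LDA6244): three capital letters followed by four digits.
PLATE_PATTERN = re.compile(r"[A-Z]{3}[0-9]{4}")


def best_match(variants, pattern=PLATE_PATTERN):
    """Return the highest-scoring string that fits the pattern, or None.

    `variants` holds one list per character position; each entry is a
    (character, confidence) pair.
    """
    best_text, best_score = None, float("-inf")
    # Try every combination of candidates, one candidate per position.
    for combo in product(*variants):
        text = "".join(ch for ch, _ in combo)
        if not pattern.fullmatch(text):
            continue  # this combination cannot be a valid plate
        score = sum(conf for _, conf in combo)
        if score > best_score:
            best_text, best_score = text, score
    return best_text


# Made-up candidates: the top choice for the last character is "B",
# but only "8" is allowed there by the pattern.
example_variants = [
    [("S", 95)],
    [("A", 90)],
    [("A", 88)],
    [("5", 92), ("S", 60)],
    [("2", 91)],
    [("9", 89)],
    [("B", 80), ("8", 75)],
]

if __name__ == "__main__":
    print(best_match(example_variants))
```

A pattern like this can only correct confusions between a letter and a digit, such as "B" versus "8". It cannot help with "6" versus "5", because both are valid digits in the same position.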

Regarding the psm mode, I'm setting it from inside my code to 7, which 
means "Treat the image as a single text line". Is that the parameter 
you are suggesting?
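
For comparison with the command-line route, here is a small sketch that builds the equivalent command with page segmentation mode 7. The file names and the language name "fe" are placeholders I made up. Tesseract 3.x spells the option -psm with a single dash, while later major versions use --psm.

```python
PSM_SINGLE_LINE = 7  # "Treat the image as a single text line"


def build_command(image, output_base, lang, psm=PSM_SINGLE_LINE):
    """Build the argument list for the Tesseract 3.x command-line tool."""
    return ["tesseract", image, output_base, "-l", lang, "-psm", str(psm)]


if __name__ == "__main__":
    # "plate.tif", "plate" and "fe" are hypothetical names.
    print(" ".join(build_command("plate.tif", "plate", "fe")))
```

The resulting list can be passed to subprocess.check_call on a machine where tesseract is installed.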

Anyway, I think I may have made some big newbie errors in my training, so 
I would be grateful if you could take a look at my training image and my 
problematic images, in case an obvious error stands out at first sight.

My training image:
https://docs.google.com/file/d/0BxkuvS_LuBAzLV8yVkt4OTd5Sk0/edit?usp=sharing

Problematic image (a "6" recognized as a "5"):
https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing

Another problematic image ("A A" recognized as "M"):
https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit

The following is my original message with the links fixed:

Dear people,

I trained Tesseract for my font (FE-Schrift: 
http://de.wikipedia.org/wiki/FE-Schrift ) and I’m getting very bad results. 
I am using Tesseract 3.01 under Windows.

In this image:

https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit?usp=sharing

The text is SAA5298 but I’m getting SM529B. This is being done from inside 
a program, and I know that the “M” in the result comes from the “AA” in 
the source. So Tesseract is segmenting these two characters very badly, 
even though they are very well separated, as you can see. Do you have any 
idea why this is happening? On the other hand, is there a way to give 
Tesseract a hint for this (e.g., telling it the character width)?
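
Independently of whether Tesseract accepts such a hint, the calling program can at least detect a merge like "AA" becoming "M" by comparing character widths. The sketch below assumes the program already has one bounding box per recognized character. The function name find_merged_boxes, the box layout and the 1.5 threshold are assumptions chosen for illustration.

```python
from statistics import median


def find_merged_boxes(boxes, ratio=1.5):
    """Return indices of boxes that look too wide to be one character.

    `boxes` is a list of (left, top, right, bottom) tuples, one per
    recognized character. A box is flagged when its width exceeds
    `ratio` times the median width of all boxes.
    """
    widths = [right - left for left, _, right, _ in boxes]
    if not widths:
        return []
    typical = median(widths)
    return [i for i, w in enumerate(widths) if w > ratio * typical]


# Made-up boxes for the result "SM529B": the second box spans two glyphs.
example_boxes = [
    (0, 0, 40, 80),
    (50, 0, 140, 80),
    (150, 0, 190, 80),
    (200, 0, 240, 80),
    (250, 0, 290, 80),
    (300, 0, 340, 80),
]

if __name__ == "__main__":
    print(find_merged_boxes(example_boxes))
```

A flagged box could then be split and recognized again, or the whole result rejected.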

The other problem is with this one:

https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing

The text is LDA6244, but Tesseract is recognizing a “5” instead of a “6”, 
even though the image is very good.

Here is my font training file:

https://docs.google.com/file/d/0BxkuvS_LuBAzLV8yVkt4OTd5Sk0/edit?usp=sharing

Here is my box file:

https://docs.google.com/file/d/0BxkuvS_LuBAzbkNzUmtDcE8zbjA/edit?usp=sharing

Here is my .traineddata file:

https://docs.google.com/file/d/0BxkuvS_LuBAzQV94NWdLT1VUcjQ/edit?usp=sharing

And here is a .cmd file for testing these 2 images:

https://docs.google.com/file/d/0BxkuvS_LuBAzUVVfSDhVdEUtRjA/edit?usp=sharing

 

Thanks,

Andres



On Friday, May 3, 2013 16:05:50 UTC-3, Dmitri Silaev wrote:
>
> Andres, 
>
> Above all, your first link seems to be pointing to a "traineddata" file 
> instead of an image. Second, without actually diving deep into your 
> problem, I can suggest specifying the single line psm mode in the 
> command line. And finally you can use the user patterns feature to 
> restrict possible output of Tesseract (for the format see comments in 
> dict/trie.h on read_pattern_list()). Another way of achieving the 
> latter, like we do in CustomOCR, and it seems to be more reliable, is 
> to use the API to get a number of character variants for each blob 
> along with confidence levels and match them against a set of possible 
> patterns. You can find how to do this by searching around this forum. 
>
> HTH and good luck with Tesseract! 
>
> Warm regards, 
> Dmitri Silaev 
> www.CustomOCR.com 
>
>
> On Fri, May 3, 2013 at 8:24 PM, Andres <[email protected]> wrote: 
> > Dear people, 
> > 
> > I trained Tesseract for my font (FE-Schrift: 
> > http://de.wikipedia.org/wiki/FE-Schrift ) and I’m getting very bad 
> > results. I am using Tesseract 3.01 under Windows. 
> > 
> > In this image: 
> > 
> > https://docs.google.com/file/d/0BxkuvS_LuBAzeFNZUVA1cThLMG8/edit?usp=sharing 
> > 
> > Where text is SAA5298 I’m getting SM529B, this is being done from inside 
> > a program and I know that the “M” from the result is the result of the 
> > “AA” of the source.  So, Tesseract is making a very bad segmentation of 
> > these two characters, and even they are very good separated, as you can 
> > see.  Do you have an idea about why is this happening ? In the other 
> > hand, is there a way to give tesseract a hint for this (e.g., telling it 
> > the character width). 
> > 
> > The other problem is with this one: 
> > 
> > https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing 
> > 
> > Where text is LDA6244, Tesseract is recognizing a “5” instead of a “6”, 
> > even when the image is very good. 
> > 
> > Here is my fonts training file: 
> > 
> > https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit?usp=sharing 
> > 
> > Here is my box file: 
> > 
> > https://docs.google.com/file/d/0BxkuvS_LuBAzQV94NWdLT1VUcjQ/edit?usp=sharing 
> > 
> > Here is my .traineddata file: 
> > 
> > https://docs.google.com/file/d/0BxkuvS_LuBAzbkNzUmtDcE8zbjA/edit?usp=sharing 
> > 
> > And here is a .cmd file for testing these 2 images: 
> > 
> > https://docs.google.com/file/d/0BxkuvS_LuBAzUVVfSDhVdEUtRjA/edit?usp=sharing 
> > 
> > 
> > 
> > Thanks, 
> > 
> > Andres 
> > 
>

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

--- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.
