Hi Dmitri, Many thanks for your hints, as always.
Regarding the links in my previous message: sorry about that. I'm reposting the entire message below, with the links fixed.

I like the method you say you use in CustomOCR. Is there a way to get the character variants without resorting to a hack? As far as I can see, the API only exposes the confidence level for each character. Am I right about this?

Regarding psm mode, I'm already setting it from inside my code with value 7, which is "Treat the image as a single text line". Is that the parameter you are suggesting?

Anyway, I think I may have made some big newbie errors in my training, so I would be grateful if you could look at my training image and my problematic images, to see whether an obvious error stands out at first sight.

My training image: https://docs.google.com/file/d/0BxkuvS_LuBAzLV8yVkt4OTd5Sk0/edit?usp=sharing

Problematic image (a "6" recognized as a "5"): https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing

Another problematic image ("A A" recognized as "M"): https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit

The following is my original message with the links fixed:

Dear people,

I trained Tesseract for my font (FE-Schrift: http://de.wikipedia.org/wiki/FE-Schrift ) and I’m getting very bad results. I am using Tesseract 3.01 under Windows.

In this image:

https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit?usp=sharing

the text is SAA5298, but I’m getting SM529B. This is being done from inside a program, and I know that the “M” in the result comes from the “AA” in the source. So Tesseract is segmenting these two characters very badly, even though they are clearly separated, as you can see. Do you have any idea why this is happening? Also, is there a way to give Tesseract a hint for this (e.g., telling it the character width)?
The other problem is with this one:

https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing

Here the text is LDA6244, but Tesseract recognizes a “5” instead of a “6”, even though the image is very good.

Here is my font training file: https://docs.google.com/file/d/0BxkuvS_LuBAzLV8yVkt4OTd5Sk0/edit?usp=sharing

Here is my box file: https://docs.google.com/file/d/0BxkuvS_LuBAzbkNzUmtDcE8zbjA/edit?usp=sharing

Here is my .traineddata file: https://docs.google.com/file/d/0BxkuvS_LuBAzQV94NWdLT1VUcjQ/edit?usp=sharing

And here is a .cmd file for testing these 2 images: https://docs.google.com/file/d/0BxkuvS_LuBAzUVVfSDhVdEUtRjA/edit?usp=sharing

Thanks,
Andres

On Friday, May 3, 2013 16:05:50 UTC-3, Dmitri Silaev wrote:
> Andres,
>
> Above all, your first link seems to be pointing to a "traineddata" file
> instead of an image. Second, without actually diving deep into your
> problem, I can suggest specifying the single line psm mode in the
> command line. And finally you can use the user patterns feature to
> restrict possible output of Tesseract (for the format see comments in
> dict/trie.h on read_pattern_list()). Another way of achieving the
> latter, like we do in CustomOCR, and it seems to be more reliable, is
> to use the API to get a number of character variants for each blob
> along with confidence levels and match them against a set of possible
> patterns. You can find how to do this by searching around this forum.
>
> HTH and good luck with Tesseract!
>
> Warm regards,
> Dmitri Silaev
> www.CustomOCR.com
>
> On Fri, May 3, 2013 at 8:24 PM, Andres wrote:
> > Dear people,
> >
> > I trained Tesseract for my font (FE-Schrift:
> > http://de.wikipedia.org/wiki/FE-Schrift ) and I’m getting very bad results.
> > I am using Tesseract 3.01 under Windows.
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
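For illustration only, the variant-matching idea from Dmitri's reply (take several character candidates per blob with their confidences, then keep only what fits an allowed pattern) could be sketched roughly as follows. This is not CustomOCR's implementation and it does not call Tesseract at all: the candidate lists, the confidence values, the `best_match` helper, and the "three letters then four digits" layout are all assumptions made up for the example, the layout being guessed only from the two plates mentioned in this thread.

```python
import string

LETTERS = set(string.ascii_uppercase)
DIGITS = set(string.digits)

# Hypothetical plate layout (an assumption): three letters, then four digits.
PLATE_PATTERN = [LETTERS] * 3 + [DIGITS] * 4


def best_match(variants, pattern):
    """Pick, for each blob, the most confident candidate allowed by the pattern.

    variants: one list per blob, each holding (character, confidence) pairs.
    pattern:  one set of allowed characters per position.
    Returns the matched string, or None if the blob count differs from the
    pattern length or some position has no allowed candidate.
    """
    if len(variants) != len(pattern):
        return None
    chars = []
    for candidates, allowed in zip(variants, pattern):
        valid = [(ch, conf) for ch, conf in candidates if ch in allowed]
        if not valid:
            return None
        # Keep the allowed candidate with the highest confidence.
        chars.append(max(valid, key=lambda pair: pair[1])[0])
    return "".join(chars)


# Made-up candidates for the plate SAA5298; in the last blob the top
# choice is "B", but only a digit is allowed at that position.
example_variants = [
    [("S", 0.95), ("5", 0.40)],
    [("A", 0.90)],
    [("A", 0.88)],
    [("5", 0.90), ("S", 0.50)],
    [("2", 0.93)],
    [("9", 0.90)],
    [("B", 0.62), ("8", 0.58)],
]

if __name__ == "__main__":
    print(best_match(example_variants, PLATE_PATTERN))
```

A filter like this can only help with confusions that cross character classes, such as the “B” in place of “8”. It cannot tell a “5” from a “6”, since both are digits, and it simply returns None when the blob count is wrong, as in the “AA” read as “M” case, so those two problems would still have to be solved in training or segmentation.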

