Andres,

Tesseract is inherently designed to detect both normal and inverted text (dark-on-light and light-on-dark), possibly within the same image. This is often what confuses it about which pixels are background and which are foreground: for a closed character, the interior is sometimes treated as the character and the foreground pixels as the surrounding background. That is why it is often impractical to pass isolated character images or images with very little text: they can throw Tesseract off. I suggest passing a whole text line and then iterating over the results, reading the recognized characters and their confidence levels.
Warm regards,
Dmitri Silaev
www.CustomOCR.com

On Tue, May 21, 2013 at 9:54 AM, Andres <[email protected]> wrote:
> Hi Dmitri,
>
> Many thanks for your help.
>
> I’ve tried PageSegMode set to PSM_SINGLE_BLOCK_VERT_TEXT and, surprisingly,
> I got very good results.
>
> But then I switched from Tesseract 3.01 to 3.02 (revision 724) and the
> behavior of Tesseract changed significantly, and not for the better in my
> case. It began to detect two characters in the same character, one in a
> higher position and another in a lower position.
>
> So I tested calling Tesseract for each char (PSM_SINGLE_CHAR), as I do the
> segmentation myself. The results for some characters were OK, but for
> others it detected the inner contours of characters like Q as a character
> (please see the red rectangle in this image:
> https://docs.google.com/file/d/0BxkuvS_LuBAzeDJQRWg2aHBnNFU/edit?usp=sharing
> ).
>
> Do you have any suggestions on this?
>
> I’ve been thinking that perhaps there could be a variable to restrict
> Tesseract a little (
> http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version ), but the
> list is so long that it discourages me.
>
> I have also been thinking of doing something with RunAdaptiveClassifier,
> which is exposed by the API, but I’m not sure whether that function could
> be used to OCR a single char.
>
> The main particularity of my use case is that I already have the text
> segmented, so I would expect this to be easy. That’s why I think that
> perhaps I’m making a big error somewhere.
>
> Best regards,
>
> Andres
>
> 2013/5/7 Dmitri Silaev <[email protected]>
>>
>> Andres,
>>
>> Your code seems to be correct. I personally use a few more lines right
>> after the call to GetIterator():
>>   it->Begin();
>>   if(it->IsAtFinalElement(RIL_BLOCK, RIL_SYMBOL))
>>     return;
>>   if(!it->IsAtBeginningOf(RIL_SYMBOL))
>>     return;
>> But this shouldn't bother you if you rely on non-degenerate cases.
>>
>> Well, I suggest using revision 724. It is battle-tested by me and
>> probably contains fewer bugs and has a better balance between accuracy
>> and speed compared to any newer revision. Although newer ones may
>> introduce many fancy features, I'll refrain from using them in
>> production. Maybe this can help you.
>>
>> Warm regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>
>>
>> On Mon, May 6, 2013 at 9:28 AM, Andres <[email protected]> wrote:
>> > Answering part of what I asked last, I've found a way of getting the
>> > alternatives for each char, but it seems not to work in 3.01, according
>> > to what I tested and
>> > http://code.google.com/p/tesseract-ocr/issues/detail?id=714
>> > My snippet:
>> >
>> > #include <api/resultiterator.h>
>> >
>> > ...
>> >
>> > tess_api.SetVariable("save_blob_choices", "T");
>> >
>> > ...
>> >
>> > tesseract::ResultIterator* it = tess_api.GetIterator();
>> >
>> > do
>> > {
>> >   char* uval = it->GetUTF8Text(tesseract::RIL_SYMBOL);
>> >   cout<<uval<<"("<<it->Confidence(tesseract::RIL_SYMBOL)<<"){";
>> >   tesseract::ChoiceIterator ci(*it);
>> >   do
>> >   {
>> >     const char* val = ci.GetUTF8Text();
>> >     cout<<" "<<(val == NULL ? "#" : val)<<" "<<ci.Confidence();
>> >   }
>> >   while (ci.Next());
>> >   cout<<"}";
>> >   // GetUTF8Text() allocates with new[]; the caller must free it
>> >   delete[] uval;
>> > }
>> > while (it->Next(tesseract::RIL_SYMBOL));
>> >
>> > On Monday, May 6, 2013 01:50:42 UTC-3, Andres wrote:
>> >>
>> >> Hi Dmitri,
>> >>
>> >> Many thanks for your hints, as always.
>> >>
>> >> Regarding the links in my previous message, sorry about that. I'll
>> >> repost the entire message below this one, fixed.
>> >>
>> >> I like the method that you say you use in CustomOCR. Is there a way
>> >> of getting the character variants without resorting to a hack? As far
>> >> as I saw, the API interface exposes only the confidence level for each
>> >> character. Am I right about this?
>> >>
>> >> Regarding psm mode, I'm setting it from inside my code with value 7,
>> >> which is for "Treat the image as a single text line". Is that the
>> >> parameter that you are suggesting?
>> >>
>> >> Anyway, I think that I might have big newbie errors in my training, so
>> >> I would be grateful if you could just look at my training image and my
>> >> problematic images, to tell me whether you see an obvious error at
>> >> first sight.
>> >>
>> >> My training image:
>> >>
>> >> https://docs.google.com/file/d/0BxkuvS_LuBAzLV8yVkt4OTd5Sk0/edit?usp=sharing
>> >>
>> >> Problematic image (a "6" recognized as a "5"):
>> >>
>> >> https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing
>> >>
>> >> Another problematic image ("A A" recognized as "M"):
>> >> https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit
>> >>
>> >> The following is my original message with the links fixed:
>> >>
>> >> Dear people,
>> >>
>> >> I trained Tesseract for my font (FE-Schrift:
>> >> http://de.wikipedia.org/wiki/FE-Schrift ) and I’m getting very bad
>> >> results. I am using Tesseract 3.01 under Windows.
>> >>
>> >> In this image:
>> >>
>> >> https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit?usp=sharing
>> >>
>> >> where the text is SAA5298, I’m getting SM529B. This is being done from
>> >> inside a program, and I know that the “M” in the result comes from the
>> >> “AA” of the source. So Tesseract is segmenting these two characters
>> >> very badly, even though they are very well separated, as you can see.
>> >> Do you have an idea why this is happening? On the other hand, is there
>> >> a way to give Tesseract a hint for this (e.g., telling it the character
>> >> width)?
>> >>
>> >> The other problem is with this one:
>> >>
>> >> https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing
>> >>
>> >> where the text is LDA6244. Tesseract is recognizing a “5” instead of a
>> >> “6”, even though the image is very good.
>> >>
>> >> Here is my fonts training file:
>> >>
>> >> https://docs.google.com/file/d/0BxkuvS_LuBAzLV8yVkt4OTd5Sk0/edit?usp=sharing
>> >>
>> >> Here is my box file:
>> >>
>> >> https://docs.google.com/file/d/0BxkuvS_LuBAzbkNzUmtDcE8zbjA/edit?usp=sharing
>> >>
>> >> Here is my .traineddata file:
>> >>
>> >> https://docs.google.com/file/d/0BxkuvS_LuBAzQV94NWdLT1VUcjQ/edit?usp=sharing
>> >>
>> >> And here is a .cmd file for testing these 2 images:
>> >>
>> >> https://docs.google.com/file/d/0BxkuvS_LuBAzUVVfSDhVdEUtRjA/edit?usp=sharing
>> >>
>> >> Thanks,
>> >>
>> >> Andres
>> >>
>> >> On Friday, May 3, 2013 16:05:50 UTC-3, Dmitri Silaev wrote:
>> >>>
>> >>> Andres,
>> >>>
>> >>> First of all, your first link seems to be pointing to a "traineddata"
>> >>> file instead of an image. Second, without actually diving deep into
>> >>> your problem, I can suggest specifying the single line psm mode on the
>> >>> command line. And finally, you can use the user patterns feature to
>> >>> restrict the possible output of Tesseract (for the format, see the
>> >>> comments in dict/trie.h on read_pattern_list()). Another way of
>> >>> achieving the latter, which is what we do in CustomOCR and which seems
>> >>> to be more reliable, is to use the API to get a number of character
>> >>> variants for each blob along with confidence levels and match them
>> >>> against a set of possible patterns. You can find how to do this by
>> >>> searching around this forum.
>> >>>
>> >>> HTH and good luck with Tesseract!
>> >>>
>> >>> Warm regards,
>> >>> Dmitri Silaev
>> >>> www.CustomOCR.com
>> >>>
>> >>>
>> >>> On Fri, May 3, 2013 at 8:24 PM, Andres <[email protected]> wrote:
>> >>> > Dear people,
>> >>> >
>> >>> > I trained Tesseract for my font (FE-Schrift:
>> >>> > http://de.wikipedia.org/wiki/FE-Schrift ) and I’m getting very bad
>> >>> > results. I am using Tesseract 3.01 under Windows.
>> >>> >
>> >>> > In this image:
>> >>> >
>> >>> > https://docs.google.com/file/d/0BxkuvS_LuBAzeFNZUVA1cThLMG8/edit?usp=sharing
>> >>> >
>> >>> > where the text is SAA5298, I’m getting SM529B. This is being done
>> >>> > from inside a program, and I know that the “M” in the result comes
>> >>> > from the “AA” of the source. So Tesseract is segmenting these two
>> >>> > characters very badly, even though they are very well separated, as
>> >>> > you can see. Do you have an idea why this is happening? On the other
>> >>> > hand, is there a way to give Tesseract a hint for this (e.g.,
>> >>> > telling it the character width)?
>> >>> >
>> >>> > The other problem is with this one:
>> >>> >
>> >>> > https://docs.google.com/file/d/0BxkuvS_LuBAzbFk3OXNjaDR1Q1E/edit?usp=sharing
>> >>> >
>> >>> > where the text is LDA6244. Tesseract is recognizing a “5” instead of
>> >>> > a “6”, even though the image is very good.
>> >>> >
>> >>> > Here is my fonts training file:
>> >>> >
>> >>> > https://docs.google.com/file/d/0BxkuvS_LuBAzczZhd21IcVlNSTQ/edit?usp=sharing
>> >>> >
>> >>> > Here is my box file:
>> >>> >
>> >>> > https://docs.google.com/file/d/0BxkuvS_LuBAzQV94NWdLT1VUcjQ/edit?usp=sharing
>> >>> >
>> >>> > Here is my .traineddata file:
>> >>> >
>> >>> > https://docs.google.com/file/d/0BxkuvS_LuBAzbkNzUmtDcE8zbjA/edit?usp=sharing
>> >>> >
>> >>> > And here is a .cmd file for testing these 2 images:
>> >>> >
>> >>> > https://docs.google.com/file/d/0BxkuvS_LuBAzUVVfSDhVdEUtRjA/edit?usp=sharing
>> >>> >
>> >>> > Thanks,
>> >>> >
>> >>> > Andres
>> >>> >
>> >>> > --
>> >>> > --
>> >>> > You received this message because you are subscribed to the Google
>> >>> > Groups "tesseract-ocr" group.
>> >>> > To post to this group, send email to [email protected]
>> >>> > To unsubscribe from this group, send email to
>> >>> > [email protected]
>> >>> > For more options, visit this group at
>> >>> > http://groups.google.com/group/tesseract-ocr?hl=en
>> >>> >
>> >>> > ---
>> >>> > You received this message because you are subscribed to the Google
>> >>> > Groups "tesseract-ocr" group.
>> >>> > To unsubscribe from this group and stop receiving emails from it,
>> >>> > send an email to [email protected].
>> >>> > For more options, visit https://groups.google.com/groups/opt_out.

