> Customise the tesseract engine to recognize only the characters from
> A-Z, 0-9, . (dot), (space) by setting the character white-list

Kindly tell me the name of the folder in which the whitelist and the blacklist are kept. I want to use the same for Kannada script.

-sriranga (78 yrs)
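A note on the question above: in Tesseract the whitelist and blacklist are not files kept in a folder of their own. They are the variables tessedit_char_whitelist and tessedit_char_blacklist, which are usually set in a small config file (the stock ones live under tessdata/configs) or through the API. The sketch below only builds the text of such a config file and the matching command line; it does not run Tesseract. The helper names make_config and make_command, the config name "plates", and the file names are hypothetical, and whether a whitelist behaves well with the Kannada ("kan") data is an assumption that would need testing. The space character is left out because a space inside a config-file value may not be read back as intended.

```python
import string

# Characters wanted for the number-plate case discussed in this thread.
# Space is deliberately omitted (see the note above).
PLATE_CHARS = string.ascii_uppercase + string.digits + "."


def make_config(whitelist="", blacklist=""):
    """Return the text of a Tesseract config file setting the two variables.

    Each line has the form "<variable name> <value>".
    """
    lines = []
    if whitelist:
        lines.append("tessedit_char_whitelist " + whitelist)
    if blacklist:
        lines.append("tessedit_char_blacklist " + blacklist)
    return "".join(line + "\n" for line in lines)


def make_command(image, output_base, config_name, lang="eng"):
    """Return an argv list: tesseract <image> <outputbase> -l <lang> <config>."""
    return ["tesseract", image, output_base, "-l", lang, config_name]


if __name__ == "__main__":
    # Text to save as a config file, e.g. tessdata/configs/plates (hypothetical name).
    print(make_config(whitelist=PLATE_CHARS), end="")
    # Command that would use that config with the Kannada language data.
    print(" ".join(make_command("plate.tif", "plate", "plates", lang="kan")))
```

For Kannada, the whitelist value would hold the Kannada characters to allow instead of A-Z and 0-9; everything else stays the same.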
On Fri, Feb 18, 2011 at 11:57 AM, Ray Smith <[email protected]> wrote:

> From all this, I have identified the following ways of improving the
> results:
>
> 1. Customise the tesseract engine to recognize only the characters from
> A-Z,0-9,.(dot), (space) by setting the character white-list. My
> understanding is that the white-list is the list of characters that are
> going to be sensed. I was inquisitive to know what the blacklist is meant
> to do?
>
> Just the opposite of whitelist. You can disable specific characters
> from the usual set.
>
> 2. A lot of times I have seen fairly good number plate images being
> OCRed inaccurately. This could possibly be due to the word recognition
> stage. Has anyone found a way to disable the dictionary / word recognition.
>
> Play with segment_penalty_dict_*
>
> 3. Then there are some page segmentation modes
> (PSM_AUTO, PSM_SINGLE_BLOCK, PSM_CHAR etc). Does PSM_CHAR imply that it will
> consider the input image as a single character and run the algorithm
> accordingly without attempting word recognition?
>
> Yes.
>
> 4. Another important configuration macro that I have seen within the
> code was AVS_FASTEST = 0, AVS_MOST_ACCURATE = 100. However, I could not
> find the same being used anywhere in the code. Does this have any impact on
> the *character recognition* accuracy?
>
> This control is dead in 3.01. Replaced by ocr_engine_mode. It just
> controls the combination of tesseract vs cube. Cube increases the accuracy
> slightly, but adds a lot of compute time.
>
> 5. Finally, I also plan to use the confidence level data. Are there any
> indicators of confidence for characters as well. There is word confidence
> data which can be found in TessBaseAPI::AllWordConfidences().
>
> Yes, and they are exposed in the new ResultIterator in 3.01, otherwise
> you have to go down into the guts of the data structures.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.

