> From all this, I have identified the following ways of improving the results:
> 1. Customise the tesseract engine to recognize only the characters A-Z, 0-9, . (dot) and (space) by setting the character white-list. My understanding is that the white-list is the list of characters that are going to be recognized. What is the blacklist meant to do?

Just the opposite of the whitelist: you can disable specific characters from the usual set.

> 2. Quite often I have seen fairly good number plate images being OCRed inaccurately. This could be due to the word recognition stage. Has anyone found a way to disable the dictionary / word recognition?

Play with segment_penalty_dict_*.

> 3. Then there are the page segmentation modes (PSM_AUTO, PSM_SINGLE_BLOCK, PSM_CHAR etc). Does PSM_CHAR imply that it will consider the input image as a single character and run the algorithm accordingly without attempting word recognition?

Yes.

> 4. Another important configuration macro that I have seen within the code was AVS_FASTEST = 0, AVS_MOST_ACCURATE = 100. However, I could not find it being used anywhere in the code. Does it have any impact on the *character recognition* accuracy?

This control is dead in 3.01, replaced by ocr_engine_mode. It just controls the combination of tesseract vs cube. Cube increases the accuracy slightly, but adds a lot of compute time.

> 5. Finally, I also plan to use the confidence level data. Are there any indicators of confidence for characters as well? There is word confidence data, which can be found in TessBaseAPI::AllWordConfidences().

Yes, and they are exposed in the new ResultIterator in 3.01; otherwise you have to go down into the guts of the data structures.
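
As a rough illustration of how the white-list idea from point 1 and the per-character confidences from point 5 might be combined after recognition, here is a minimal sketch. It does not call Tesseract at all: the `Symbol` struct, the `filterPlate` function and the `kAllowed` constant are hypothetical names made up for this example, and it assumes you have already pulled each character and its confidence out of the engine, with confidence on a 0-100 scale.

```cpp
#include <string>
#include <vector>

// Hypothetical container for one recognized character and its confidence.
// Assumption: confidence is on a 0-100 scale. This is not a Tesseract type.
struct Symbol {
    char ch;
    float confidence;
};

// Characters a number plate may contain, mirroring the white-list in point 1.
const std::string kAllowed = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. ";

// Builds the plate string, replacing every character that is outside the
// allowed set or below the confidence threshold with '?'.
std::string filterPlate(const std::vector<Symbol>& symbols, float minConfidence) {
    std::string out;
    for (const Symbol& s : symbols) {
        const bool allowed = kAllowed.find(s.ch) != std::string::npos;
        out += (allowed && s.confidence >= minConfidence) ? s.ch : '?';
    }
    return out;
}
```

Calling `filterPlate` with a threshold of, say, 60 marks doubtful positions with '?' so that a plate can be rejected or re-examined instead of being accepted with a wrong character. The threshold value is an arbitrary placeholder and would need tuning on real plate images.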

