Re: Customising Tesseract for character recognition

Ray Smith Thu, 17 Feb 2011 22:27:21 -0800

From all this, I have identified the following ways of improving the 
results:


   1. Customise the tesseract engine to recognize only the characters 
   A-Z, 0-9, . (dot) and (space) by setting the character white-list. My 
   understanding is that the white-list is the list of characters that are 
   going to be recognized. What is the blacklist meant to do?
   Just the opposite of the whitelist: you can disable specific characters 
   from the usual set.
   2. I have often seen fairly good number plate images being OCRed 
   inaccurately. This could possibly be due to the word recognition stage. Has 
   anyone found a way to disable the dictionary / word recognition?
   Play with the segment_penalty_dict_* variables.
   3. Then there are some page segmentation modes 
   (PSM_AUTO, PSM_SINGLE_BLOCK, PSM_CHAR etc.). Does PSM_CHAR imply that it 
   will treat the input image as a single character and run the algorithm 
   accordingly, without attempting word recognition?
   Yes.
   4. Another important configuration macro that I have seen within the code 
   was AVS_FASTEST = 0,  AVS_MOST_ACCURATE = 100. However, I could not find it 
   being used anywhere in the code. Does it have any impact on 
   *character recognition* accuracy?
   This control is dead in 3.01. It was replaced by ocr_engine_mode, which 
   just controls the combination of tesseract vs cube. Cube increases the 
   accuracy slightly, but adds a lot of compute time.
   5. Finally, I also plan to use the confidence level data. Are there any 
   indicators of confidence for characters as well? Word confidence data 
   can be found in TessBaseAPI::AllWordConfidences().
   Yes, and they are exposed in the new ResultIterator in 3.01; otherwise 
   you have to go down into the guts of the data structures.
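
As a rough illustration of points 1 and 3, the white-list and single-character mode can be combined on the command line. The sketch below only builds the argument list for the tesseract executable; it does not run it. It assumes a recent Tesseract release in which the page segmentation mode is passed as `--psm` (older 3.x releases spelled it `-psm`), in which `-c` sets a variable, and in which mode 10 means "single character". The variable name `tessedit_char_whitelist` is not given in the thread and is assumed to be the white-list referred to above. The helper name and the file names are hypothetical.

```python
import string

# Characters from point 1: A-Z, 0-9 and the dot.
# (Space is left out here; that is a choice made for this sketch.)
WHITELIST = string.ascii_uppercase + string.digits + "."


def build_tesseract_command(image_path, output_base,
                            whitelist=WHITELIST, single_char=False):
    """Return an argv list for the tesseract command-line tool.

    Hypothetical helper: the flag spellings (--psm, -c) are assumptions
    about a recent Tesseract CLI and may differ in older releases.
    """
    cmd = ["tesseract", image_path, output_base]
    if single_char:
        # Assumed: page segmentation mode 10 = treat image as one character.
        cmd += ["--psm", "10"]
    if whitelist:
        # Assumed variable name for the character white-list.
        cmd += ["-c", "tessedit_char_whitelist=" + whitelist]
    return cmd


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch runs
    # even where tesseract is not installed.
    print(" ".join(build_tesseract_command("plate.png", "plate",
                                           single_char=True)))
```

The returned list could be handed to `subprocess.run` once the flags have been checked against the installed version. Character-level confidences (point 5) are not reachable this way; for those the ResultIterator in the C++ API is the route mentioned above.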
