On 9 July 2010 16:55, patrickq <[email protected]> wrote:
> TesserractExtractResult() returns the confidence numbers for all
> characters returned. A high number means low confidence. Caveats:
> 1. The confidence numbers are the same for all letters in a word (even
> though Tesseract does compute confidence numbers for each letter, it
> just doesn't return them to the API)
> 2. From personal experience, these numbers are not very reliable and
> we decided not to use them - but feel free to test yourself, we gave
> up fairly quickly.
>

Right; if I could sketch this on paper it might be clearer, but I
can't, so I'll try to describe it...

R to K is the easiest to describe: cover the top of the R and it looks
like a K. Smudges, glare from the scanner's light, boxing errors, etc.,
can all cause this kind of degradation. Thresholding can contribute to
the problem, because it converts greyscale to binary: if part of a
character is too light, it's effectively wiped out.

Access to the character probabilities won't actually help. If
thresholding 1 gives you an R with a broken top, it will have a
relatively low confidence value, whereas thresholding 2, which has
removed the top completely, will give a higher confidence value for the
character as 'K'. Going purely by character probabilities can just as
easily give you the worst results of both as the best.

> Patrick
>
> On Jul 9, 5:01 am, caro <[email protected]> wrote:
>> I am working with tesseract OCR and I would like to get at the end of
>> the algorithm a confidence value which may express if the recognition
>> seems OK or not really.
>>
>> For example, I have an image with the text: TEST RESULTS ARE OK.
>> Depending on a threshold value, I can get different output of the OCR:
>> - TEST RESSUTTS AKE OC
>> - TEST TELLUTTS ARE OB
>> ....
>> The best threshold can be different for different images.
>> So if I can get this confidence value, maybe it can give me the best
>> threshold to choose for the OCR?
>>
>> Thank you for your help,
>> Caroline
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
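
To make the thresholding point above a bit more concrete, here is a
rough Python sketch. It does not call Tesseract at all: the tiny
"glyph", the grey levels, the two threshold values and the cost figures
are all invented for illustration, and the helper names (binarize,
render, ink_count) are hypothetical. It only shows how a strict
threshold can wipe out faint strokes, and why picking the result with
the lowest cost (high number = low confidence, as noted above) could
then prefer the wrong character.

```python
# Toy greyscale glyph: 0 = black ink, 255 = white paper.
# All values are invented for illustration; the faint pixels stand in
# for the lightly printed top of an "R".
FAINT, DARK, PAPER = 180, 40, 255

glyph = [
    [DARK, FAINT, FAINT, PAPER],
    [DARK, PAPER, PAPER, FAINT],
    [DARK, FAINT, FAINT, PAPER],
    [DARK, PAPER, DARK,  PAPER],
    [DARK, PAPER, PAPER, DARK],
]


def binarize(image, threshold):
    """Return a binary image: 1 (ink) where a pixel is darker than threshold."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def render(binary):
    """Draw a binary image as text, '#' for ink and '.' for paper."""
    return "\n".join("".join("#" if px else "." for px in row) for row in binary)


def ink_count(binary):
    """Count the ink pixels that survived thresholding."""
    return sum(sum(row) for row in binary)


lenient = binarize(glyph, 200)  # keeps the faint strokes
strict = binarize(glyph, 100)   # faint strokes are wiped out

print(render(lenient))
print()
print(render(strict))

# Hypothetical (threshold, recognised character, cost) results.
# Higher cost means lower confidence. The numbers are made up: a damaged
# R scores worse than the clean-looking K left after the top is removed.
candidates = [(200, "R", 0.8), (100, "K", 0.3)]

# Choosing purely by confidence selects the wrong character here.
best = min(candidates, key=lambda c: c[2])
print(best)
```

With these made-up numbers the selection lands on the 'K' result, which
is exactly the failure described above: the cleaner-looking image is
the one that has lost the information.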

