Hi, I am trying to train my own Tesseract model (version 4, by replacing the top layer as described in the tutorial). Besides some unexplained OCR problems (see https://github.com/tesseract-ocr/tesseract/issues/734#issuecomment-299132760), when I compare the outputs produced by my model with those of one of the standard models, I observe quite big differences.
I trained a model down to the 0.005 convergence level (*below* the default value of 0.01), and then evaluated it on a small sample of the data it was trained on. The confidence values produced by my model are between 40 and 55 (even for very frequent and unambiguous words), whereas a standard model achieves between 80 and 95, with 50-70 for visually ambiguous words.

I was wondering whether you achieve confidence levels close to those of the tessdata models. If so, how did you achieve this? Are the standard Tesseract models overfitted? (Try to OCR a common but misspelled word ;)

Cheers,
Alex

