I see the same behavior with Tesseract 4.0, installed together with the tessdata_best traineddata from this apt repository:
# printf "deb https://notesalexp.org/tesseract-ocr/$(lsb_release -sc)/ $(lsb_release -sc) main\ndeb https://notesalexp.org/tesseract-ocr/tessdata_best/ stretch main\n" >> /etc/apt/sources.list

On Wednesday, 25 April 2018 at 16:59:34 UTC+2, Youcef wrote:
> Hi,
>
> Tesseract seems to post-process its predictions.
>
> Here is what I get after running OCR on images (same font, same size,
> generated with text2image):
>
> - an image containing "12345678I" => `123456781`
> - an image containing "GLOTHUVFI" => `GLOTHUVFI`
> - an image containing "12345678H" => `12345678H`
> - an image containing "GLOTHUVFH" => `GLOTHUVFH`
> - an image containing "12345678A" => `123456784`
> - an image containing "GLOTHUVFA" => `GLOTHUVFA`
>
> It looks like Tesseract doesn't like a word made of several digits with a
> single letter at the end. In fact, if the letter looks like a digit ("I" and
> "A" look like "1" and "4" respectively), it replaces it with the closest digit.
> I have tried tuning the following parameters, without any change in the
> result:
>
> - segment_penalty_dict_frequent_word
> - language_model_penalty_chartype
>
> Thanks for any help.
>
> Regards
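
In case it helps anyone reproduce this, here is a minimal Python sketch of how the reported results could be checked. The expected/recognized pairs are copied from Youcef's message. The helper names (`substitutions`, `tesseract_command`), the image file name and the parameter value 0.5 are my own placeholders, not anything from Tesseract itself. The script only builds the command line and does not run Tesseract, so treat it as a starting point rather than a verified test harness.

```python
import shlex

# Expected text -> OCR output, copied from the quoted report.
REPORTED = {
    "12345678I": "123456781",
    "GLOTHUVFI": "GLOTHUVFI",
    "12345678H": "12345678H",
    "GLOTHUVFH": "GLOTHUVFH",
    "12345678A": "123456784",
    "GLOTHUVFA": "GLOTHUVFA",
}


def substitutions(expected, recognized):
    """Return (index, expected_char, recognized_char) for each mismatch.

    A positional diff only makes sense for equal-length strings, which
    holds for every pair in the report.
    """
    if len(expected) != len(recognized):
        raise ValueError("length mismatch; a positional diff is not meaningful")
    return [
        (i, e, r)
        for i, (e, r) in enumerate(zip(expected, recognized))
        if e != r
    ]


def tesseract_command(image_path, config_vars=None, lang="eng"):
    """Build an argv list for the tesseract CLI; nothing is executed here."""
    argv = ["tesseract", image_path, "stdout", "-l", lang]
    for name, value in (config_vars or {}).items():
        # Each config variable is passed as its own "-c NAME=VALUE" pair.
        argv += ["-c", f"{name}={value}"]
    return argv


if __name__ == "__main__":
    for expected, recognized in REPORTED.items():
        for index, exp_ch, rec_ch in substitutions(expected, recognized):
            is_letter_to_digit = exp_ch.isalpha() and rec_ch.isdigit()
            kind = "letter->digit" if is_letter_to_digit else "other"
            print(f"{expected}: position {index} {exp_ch!r} read as {rec_ch!r} ({kind})")

    # Hypothetical file name and value, for illustration only.
    cmd = tesseract_command(
        "12345678I.tif", {"language_model_penalty_chartype": 0.5}
    )
    print(shlex.join(cmd))
```

Running the printed command once per image, with and without the `-c` overrides, and feeding the output back into `substitutions` would show whether a given parameter changes anything for the trailing letter.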