A colleague and I are having problems with the recognition of subscript and superscript. We posted our problem on StackOverflow but didn't get any reply: https://stackoverflow.com/questions/63562290/tesseract-ocr-subscript-and-superscript-recognition-problems
"
I have problems with the general recognition of subscript and superscript in text fragments.

Example image: [image: example.png] <https://i.stack.imgur.com/jt8Aw.png>

I used Tesseract 4.1.1 with the training data available under https://github.com/tesseract-ocr/tessdata_best. All options were left at their default values except:

- tessedit_create_hocr = 1 (to get the result as HOCR)
- hocr_font_info = 1 (to get additional font information such as font size)
- hocr_char_boxes = 1 (to get a character-based result)

The language was set to eng. The subscript/superscript was not recognized correctly with page segmentation mode 3 (PSM_AUTO), 11 (PSM_SPARSE_TEXT), or 12 (PSM_SPARSE_TEXT_OSD).

In the output, the sub/sup fragments were all more or less wrong:

- "Subtext<sub>Sub</sub>" is recognized as "Subtextsu,"
- "Suptext<sup>Sub</sup>" is recognized as "Suptexts?"
- "P<sub>0</sub>" is recognized as "Po"
- "P<sub>100</sub>" is recognized as "P1go"
- "a<sup>2</sup>+<sup>b2</sup>" is recognized as "a+b?"

When using Tesseract for OCR, is there a way to:

1. optimize subscript/superscript handling
2. get information about recognized subscript/superscript (in the hocr output, ideally for each character)
"
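
To make the setup easier to reproduce, here is a minimal sketch of how the runs described in the quoted post could be invoked from the command line. It only assembles the argument list for the `tesseract` executable from the settings listed above and does not run OCR itself. The function name `build_tesseract_command`, the output base names, and the optional `tessdata_dir` path are assumptions made for illustration; they are not taken from the original post.

```python
import shlex


def build_tesseract_command(image, output_base, psm=3, lang="eng", tessdata_dir=None):
    """Assemble a tesseract CLI call with the three hOCR options from the post.

    Hypothetical helper: it only builds the argument list, it does not run OCR.
    """
    cmd = ["tesseract", image, output_base, "--psm", str(psm), "-l", lang]
    if tessdata_dir is not None:
        # Assumed location of a local tessdata_best checkout.
        cmd += ["--tessdata-dir", tessdata_dir]
    # The non-default variables named in the post, each passed via -c NAME=VALUE.
    for name in ("tessedit_create_hocr", "hocr_font_info", "hocr_char_boxes"):
        cmd += ["-c", f"{name}=1"]
    return cmd


if __name__ == "__main__":
    # One command per page segmentation mode that was tried.
    for psm in (3, 11, 12):
        cmd = build_tesseract_command("example.png", f"out_psm{psm}", psm=psm)
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell. With `tessedit_create_hocr=1` set, the result should be written to `<output_base>.hocr`, which makes it straightforward to compare the three page segmentation modes side by side.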