A colleague and I are having problems with the recognition of subscript and 
superscript text. We posted our problem on Stack Overflow but did not get any 
reply: 
https://stackoverflow.com/questions/63562290/tesseract-ocr-subscript-and-superscript-recognition-problems

"

I have problems with the general recognition of subscript and superscript 
in text fragments.

Example image:

<https://i.stack.imgur.com/jt8Aw.png>

I used Tesseract 4.1.1 with the trained data available at 
https://github.com/tesseract-ocr/tessdata_best. All options were left at 
their default values except:

   - tessedit_create_hocr = 1 (to get the result as hOCR)
   - hocr_font_info = 1 (to get additional font information such as font size)
   - hocr_char_boxes = 1 (to get character-level results)

The language was set to eng. The subscript and superscript text was not 
recognized correctly with any of the page segmentation modes I tried: 
3 (PSM_AUTO_OSD), 11 (PSM_SPARSE_TEXT), and 12 (PSM_SPARSE_TEXT_OSD).
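
For reference, the settings above correspond roughly to the command line built by the following Python sketch. It only assembles the argument list and prints it. The names `build_tesseract_command`, `example.png`, and `out` are placeholders and not part of our actual setup, and the PSM values are the numeric ones listed above.

```python
import shlex


def build_tesseract_command(image, output_base, psm, lang="eng"):
    """Build a tesseract CLI call for the settings described above.

    `image` and `output_base` are placeholder names, not files from the post.
    """
    # The three non-default config variables, passed as -c VAR=VALUE.
    config = {
        "tessedit_create_hocr": 1,  # write the result as hOCR
        "hocr_font_info": 1,        # include font information such as size
        "hocr_char_boxes": 1,       # include per-character boxes
    }
    cmd = ["tesseract", image, output_base, "-l", lang, "--psm", str(psm)]
    for name, value in config.items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


if __name__ == "__main__":
    # Print one command per page segmentation mode that was tried.
    for psm in (3, 11, 12):
        cmd = build_tesseract_command("example.png", "out", psm)
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Assuming the `tesseract` executable is on the PATH, running one of the printed commands should write the hOCR result to `out.hocr`.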

In the output, all of the subscript and superscript fragments were recognized incorrectly:

   - "Subtext<sub>Sub</sub>" is recognized as "Subtextsu,"
   - "Suptext<sup>Sub</sup>" is recognized as "Suptexts?"
   - "P<sub>0</sub>" is recognized as "Po"
   - "P<sub>100</sub>" is recognized as "P1go"
   - "a<sup>2</sup>+b<sup>2</sup>" is recognized as "a+b?"

When using Tesseract for OCR, is there a way to ...?

   1. improve the recognition of subscript and superscript text
   2. get information about recognized subscript/superscript text in the 
   hOCR output (ideally for each character)
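
One direction that might be worth exploring for point 2 is to classify characters by the vertical position of their boxes. The Python sketch below is a crude heuristic and not a Tesseract feature. It assumes that each character appears as an `ocrx_cinfo` span whose title follows the pattern `x_bboxes x0 y0 x1 y1; x_conf c`, and that most characters on a line are regular text. `SAMPLE_HOCR` is hand-written for illustration, and `CharBoxParser`, `classify`, and `label_hocr` are hypothetical names.

```python
from html.parser import HTMLParser
from statistics import median

# Hand-written sample, NOT real Tesseract output. The y axis points down.
SAMPLE_HOCR = """
<span class='ocrx_word' title='bbox 10 10 60 46'>
 <span class='ocrx_cinfo' title='x_bboxes 10 20 20 40; x_conf 99'>a</span>
 <span class='ocrx_cinfo' title='x_bboxes 21 10 27 24; x_conf 60'>2</span>
 <span class='ocrx_cinfo' title='x_bboxes 30 20 40 40; x_conf 99'>b</span>
 <span class='ocrx_cinfo' title='x_bboxes 41 32 47 46; x_conf 60'>0</span>
 <span class='ocrx_cinfo' title='x_bboxes 50 20 60 40; x_conf 99'>c</span>
</span>
"""


class CharBoxParser(HTMLParser):
    """Collect (text, x0, y0, x1, y1) for every ocrx_cinfo span."""

    def __init__(self):
        super().__init__()
        self.chars = []
        self._bbox = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_cinfo":
            # Assumed title format: "x_bboxes x0 y0 x1 y1; x_conf c".
            for field in (attrs.get("title") or "").split(";"):
                parts = field.split()
                if len(parts) == 5 and parts[0] == "x_bboxes":
                    self._bbox = tuple(int(p) for p in parts[1:])

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.chars.append((data.strip(), *self._bbox))
            self._bbox = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


def classify(chars, tol=0.25):
    """Label each character 'sup', 'sub' or 'normal' by vertical offset."""
    if not chars:
        return []
    ref_height = median(y1 - y0 for _, _, y0, _, y1 in chars)
    ref_top = median(y0 for _, _, y0, _, _ in chars)
    ref_bottom = median(y1 for _, _, _, _, y1 in chars)
    margin = tol * ref_height
    labels = []
    for text, _, y0, _, y1 in chars:
        if y1 < ref_bottom - margin:
            label = "sup"  # bottom edge clearly above the common baseline
        elif y1 > ref_bottom + margin and y0 > ref_top + margin:
            label = "sub"  # both edges clearly below the rest of the line
        else:
            label = "normal"
        labels.append((text, label))
    return labels


def label_hocr(hocr_text):
    """Parse hOCR text and return a list of (character, label) pairs."""
    parser = CharBoxParser()
    parser.feed(hocr_text)
    return classify(parser.chars)


if __name__ == "__main__":
    print(label_hocr(SAMPLE_HOCR))
```

On the sample, the "2" is labelled `sup` and the "0" is labelled `sub`. The heuristic will misfire on apostrophes, quotation marks, and on words that consist mostly of subscript characters such as "P100", so it would need per-line baselines and tuning before being usable on real output.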

"
