Hi all, I noticed recently that my training doesn't do a good job of detecting superscripted numbers (which occur frequently in the texts I'm working with, to point to footnotes). They're often misrecognised as speech marks (e.g. ”).
They will always be difficult, as they're small and (particularly in the old books I'm working with) not very clearly printed compared to the surrounding text. However, I suspect Tesseract can do a better job.

My current plan is to train variants of numbers that are superscripted (smaller and raised above the baseline). This will presumably help, as Tesseract uses a character's position on the line to help identify it. If this works in testing, I'll add a mode to the text2image tool to enable superscripted rendering of selected characters.

I wanted to check whether anyone else has encountered issues with superscripted words or characters, and whether anybody has tried any techniques to improve recognition. Conversely, if superscripted words and characters generally work perfectly for you, that's useful to know too.

Thanks in advance,
Nick
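
P.S. To make "smaller and above the baseline" concrete, here is a rough sketch of the geometry such a training variant might use. It only computes a shrunken, raised bounding box from a normal-size glyph box and prints it in box-file form (char left bottom right top page, origin at the bottom-left). The Box class, the function names, and the scale and rise values are all hypothetical choices for illustration; none of this is part of text2image.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Glyph bounding box in pixels, origin at the bottom-left of the image."""
    left: int
    bottom: int
    right: int
    top: int


def superscript_box(normal: Box, scale: float = 0.6, rise: float = 0.5) -> Box:
    """Shrink a normal-size glyph box and lift it above the baseline.

    scale and rise are illustrative guesses, not values taken from Tesseract:
    scale is the superscript size relative to the normal glyph, and rise is
    how far the superscript's bottom edge sits above the original bottom
    edge, as a fraction of the normal glyph height.
    """
    width = normal.right - normal.left
    height = normal.top - normal.bottom
    new_bottom = normal.bottom + round(height * rise)
    return Box(
        left=normal.left,
        bottom=new_bottom,
        right=normal.left + round(width * scale),
        top=new_bottom + round(height * scale),
    )


def box_file_line(char: str, box: Box, page: int = 0) -> str:
    """Format one box-file line: char left bottom right top page."""
    return f"{char} {box.left} {box.bottom} {box.right} {box.top} {page}"


# A normal-size "2" sitting on the baseline at y=200.
digit = Box(left=100, bottom=200, right=120, top=240)
print(box_file_line("2", superscript_box(digit)))  # 2 100 220 112 244 0
```

The actual size ratio and rise would need to be measured from the books in question; the defaults above are only placeholders.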

