[tesseract-ocr] Improving recognition of superscripted numbers

Nick White Thu, 01 May 2014 11:37:37 -0700

Hi all,

I noticed recently that my training doesn't do a good job of 
detecting superscripted numbers (which occur frequently in the texts 
I'm working with, to point to footnotes). They're often misrecognised 
as speech marks (e.g. ”).


They will always be difficult, as they're small and (particularly 
with the old books I'm working with) not very clearly printed 
compared to the surrounding text. However, I suspect Tesseract can do 
a better job.

My current plan is to train variants of the numbers that are 
superscripted (smaller and raised above the baseline). This will 
presumably help, as Tesseract uses information on the location of a 
character on the line to help identify it. Assuming this works in 
testing, I'll add a mode to the text2image tool to enable 
superscripted rendering of selected characters.
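
To make "smaller and above the baseline" concrete, here is a minimal 
Python sketch of the geometry only. It takes a glyph's bounding box in 
box-file order (symbol, left, bottom, right, top, page, with the origin 
at the bottom-left of the image) and derives the box the same glyph 
would occupy when superscripted. The names Box and superscript_box are 
hypothetical, and the scale (0.6) and raise (0.5 of the glyph height) 
are illustrative guesses rather than values taken from Tesseract or 
text2image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Glyph bounding box in box-file order, origin at bottom-left."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int = 0

    def to_line(self) -> str:
        # One box-file line: "<symbol> <left> <bottom> <right> <top> <page>"
        return (f"{self.symbol} {self.left} {self.bottom} "
                f"{self.right} {self.top} {self.page}")


def superscript_box(box: Box, baseline: int,
                    scale: float = 0.6, raise_frac: float = 0.5) -> Box:
    """Shrink a glyph box by `scale` and lift it above `baseline`.

    Both `scale` and `raise_frac` are assumptions for illustration;
    the lift is a fraction of the original glyph height.
    """
    width = box.right - box.left
    height = box.top - box.bottom
    new_width = round(width * scale)
    new_height = round(height * scale)
    new_bottom = baseline + round(height * raise_frac)
    return Box(box.symbol, box.left, new_bottom,
               box.left + new_width, new_bottom + new_height, box.page)


if __name__ == "__main__":
    normal = Box("1", 100, 20, 112, 40)   # full-size digit sitting on y=20
    raised = superscript_box(normal, baseline=20)
    print(normal.to_line())
    print(raised.to_line())
```

The raised box starts well above the baseline and is shorter than the 
full-size one, which is the positional difference the training 
variants would need to capture.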

I wanted to check if anyone else has encountered issues with 
superscripted words or characters, and if anybody has tried any 
techniques to improve recognition. Conversely, if superscripted 
words and characters generally work perfectly for you, that's useful 
to know too.

Thanks in advance,

Nick
