You could try using the coordinates of the words (available in the hOCR 
output) or of the individual letters (via the API) and processing the title 
differently based on the size of its letters. A Difference of Gaussians or 
another type of filter could thin the letters out, and you could also try 
Tesseract in single-character mode if you can isolate each letter. The bane of 
OCR for old newspapers tends to be multi-column printing, in which case a 
separate segmentation tool, like Olena, can be invaluable, but your sample does 
not suggest that columns are a factor.
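
As a rough sketch of the size-based idea, the snippet below pulls word boxes 
out of hOCR text and flags words that are much taller than the typical word as 
title candidates. It assumes the word markup looks like Tesseract's usual 
ocrx_word spans with a "bbox x0 y0 x1 y1" entry in the title attribute. The 
function names, the 1.8 height ratio, and the sample markup are made up for 
illustration and would need tuning against a real page.

```python
import re
from statistics import median

# Assumed hOCR word markup: <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>(.*?)</span>",
    re.S,
)

# Invented sample: two tall headline words followed by three body-text words.
SAMPLE_HOCR = """
<span class='ocrx_word' id='word_1_1' title='bbox 40 30 400 110; x_wconf 62'>DAILY</span>
<span class='ocrx_word' id='word_1_2' title='bbox 420 30 760 112; x_wconf 58'>NEWS</span>
<span class='ocrx_word' id='word_1_3' title='bbox 40 150 100 172; x_wconf 91'>The</span>
<span class='ocrx_word' id='word_1_4' title='bbox 110 150 190 172; x_wconf 90'>mayor</span>
<span class='ocrx_word' id='word_1_5' title='bbox 200 150 260 172; x_wconf 93'>said</span>
"""


def words_with_boxes(hocr):
    """Return (text, (x0, y0, x1, y1)) for each non-empty word span."""
    words = []
    for x0, y0, x1, y1, inner in WORD_RE.findall(hocr):
        # Drop nested tags such as <strong> or <em> inside the word span.
        text = re.sub(r"<[^>]+>", "", inner).strip()
        if text:
            words.append((text, (int(x0), int(y0), int(x1), int(y1))))
    return words


def title_candidates(words, ratio=1.8):
    """Keep words whose box height exceeds ratio times the median height."""
    if not words:
        return []
    cutoff = ratio * median(y1 - y0 for _, (_, y0, _, y1) in words)
    return [(text, box) for text, box in words if box[3] - box[1] > cutoff]


def union_box(boxes):
    """Smallest box covering all given boxes, e.g. a region to crop."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


if __name__ == "__main__":
    titles = title_candidates(words_with_boxes(SAMPLE_HOCR))
    print([text for text, _ in titles])
    print(union_box([box for _, box in titles]))
```

The box from union_box could then be cropped out of the page image, filtered 
separately from the body text, and run through Tesseract again on its own.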

art

From: [email protected] [mailto:[email protected]] On 
Behalf Of Claudi Ruiz
Sent: Tuesday, May 26, 2015 4:25 AM
To: [email protected]
Subject: [tesseract-ocr] Re: Improve the tesseract output from an old newspapers

How can I improve detection for the title specifically?
--
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to 
[email protected]<mailto:[email protected]>.
To post to this group, send email to 
[email protected]<mailto:[email protected]>.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/fc27a199-6df6-4533-9693-641ed5c460be%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.