You could try using the coordinates of the words (available in the hOCR output) or of the individual letters (via the API), and processing the title differently based on the size of its letters. A Difference of Gaussians or another type of filter could thin the letters out, and you could also try Tesseract in single-character mode if you can isolate each letter. The bane of OCR for old newspapers tends to be multi-column printing, in which case a separate segmentation tool such as Olena can be invaluable, but your sample does not suggest that columns are a factor.
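To make the size idea concrete, here is a rough sketch in Python. It assumes you already have the hOCR text as a string and that each word is an `ocrx_word` span whose `title` attribute carries a `bbox x0 y0 x1 y1` entry. The names `HocrWords` and `large_words` and the 1.5 ratio are my own placeholders, not anything from Tesseract, so tune them against your scans.

```python
import re
from html.parser import HTMLParser
from statistics import median

# hOCR stores each box in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, x0, y0, x1, y1) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Inline tags such as <strong> or <em> inside a word are ignored;
        # only the closing </span> finishes the word.
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, *self._bbox))
            self._bbox = None


def large_words(hocr, ratio=1.5):
    """Return words whose box height exceeds ratio * median word height.

    The ratio is an arbitrary starting point (an assumption), not a
    recommended value.
    """
    parser = HocrWords()
    parser.feed(hocr)
    heights = [y1 - y0 for _, _, y0, _, y1 in parser.words]
    if not heights:
        return []
    cutoff = ratio * median(heights)
    return [w for w, h in zip(parser.words, heights) if h > cutoff]
```

The boxes this returns are candidates for the title. You could crop those regions from the page image and run them through a separate pass with different preprocessing, leaving the body text as it is.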
art

From: [email protected] [mailto:[email protected]] On Behalf Of Claudi Ruiz
Sent: Tuesday, May 26, 2015 4:25 AM
To: [email protected]
Subject: [tesseract-ocr] Re: Improve the tesseract output from an old newspapers

How can I improve detection for the title specifically?

