On Tuesday, August 6, 2013 10:18:25 AM UTC-4, [email protected] wrote:
>
> I am trying to recognize an 18th century text for academic purposes.

You might be interested in the work being done by the Early Modern OCR project:
http://emop.tamu.edu/

> I followed the (very helpful) tutorial, and encountered no technical
> problems. However, the recognition rate is disappointing. I think the
> source material may just be too difficult for tesseract 3 (see sample
> image <http://i.imgur.com/d5RnxI4.png> and recognized text below). The
> difficulties are multiple: 3 fonts, 2 languages (bilingual text), obsolete
> spellings, variable stroke width... I retrained tesseract on 10 samples of
> each character, without much improvement.
>
> Could someone tell me if this is feasible? Or maybe the state of the art
> in OCR has not reached yet this kind of performance...

You've got a bunch of challenging stuff in that text including:
- mixed French & English
- archaic spellings & grammar for both French and English
- medial S
- 2x medial S ligature
- dictionary entry formatting instead of running text

The problems are solvable individually. For example, Google Books recognition of medial S improved greatly from its early days. The combination of all of them together at the same time may be beyond the current state of the art, but the eMOP folks might have more insight into how far you're likely to get (they're using Tesseract as well).

Tom

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
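
As a rough illustration of tackling one of the problems listed above in isolation, here is a minimal Python sketch of post-correcting the medial S. It assumes the recognizer tends to read the medial (long) s as an "f", which is a common failure mode but not something confirmed for this sample. The function names and the tiny lexicon are hypothetical; a real run would need proper period word lists for both French and English.

```python
from itertools import product


def long_s_candidates(word):
    """Yield every variant of `word` with any subset of its 'f's turned into 's'."""
    # Each 'f' may stay 'f' or become 's'; every other character is fixed.
    options = [("f", "s") if ch == "f" else (ch,) for ch in word]
    for combo in product(*options):
        yield "".join(combo)


def fix_long_s(word, lexicon):
    """Return a lexicon word reachable by f->s swaps, else `word` unchanged."""
    if word in lexicon:
        return word
    for candidate in long_s_candidates(word):
        if candidate in lexicon:
            return candidate
    return word


# Hypothetical toy lexicon, for illustration only (lowercase words).
LEXICON = {"possession", "first", "useful", "maison"}

if __name__ == "__main__":
    for token in ["poffeffion", "firft", "ufeful", "maifon", "xyz"]:
        print(token, "->", fix_long_s(token, LEXICON))
```

The number of candidates doubles with every "f" in a word, which is fine for ordinary words but is one reason this is only a sketch. It also ignores capitalization and the other items on the list (mixed languages, archaic spellings, dictionary layout).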

