On Tuesday, August 6, 2013 10:18:25 AM UTC-4, [email protected] wrote:

>
> I am trying to recognize an 18th century text for academic purposes. 


You might be interested in the work being done by the Early Modern OCR 
Project (eMOP): http://emop.tamu.edu/
 

> I followed the (very helpful) tutorial, and encountered no technical 
> problems. However, the recognition rate is disappointing. I think the 
> source material may just be too difficult for tesseract 3 (see sample 
> image <http://i.imgur.com/d5RnxI4.png> and recognized text below). The 
> difficulties are multiple: 3 fonts, 2 languages (bilingual text), obsolete 
> spellings, variable stroke width... I retrained tesseract on 10 samples of 
> each character, without much improvement.
>
> Could someone tell me if this is feasible? Or maybe the state of the art 
> in OCR has not yet reached this kind of performance...
>

You've got a bunch of challenging stuff in that text including:
- mixed French & English
- archaic spellings & grammar for both French and English
- medial S (the long s, ſ)
- 2x medial S ligature (ſſ)
- dictionary entry formatting instead of running text
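
For the mixed-language item, one thing that may be worth trying is loading 
both language models in a single run. A minimal sketch, assuming Tesseract 
3.02 or later (where -l accepts several languages joined with "+") and that 
the fra and eng traineddata files are installed. The helper name and the 
file names are made up for illustration:

```python
def tesseract_command(image_path, output_base, languages):
    """Build a tesseract CLI invocation that loads several languages at once."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract expects multiple languages as a single "+"-joined argument.
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# Hypothetical input image and output base name.
cmd = tesseract_command("page.png", "page", ["fra", "eng"])
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

This only builds the command line; whether two stock models help on 18th 
century type is something you would have to measure on your own pages.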

Each of these problems is solvable individually.  For example, Google Books 
recognition of medial S has improved greatly since its early days.  The 
combination of all of them at once may be beyond the current state of the 
art, but the eMOP folks might have more insight into how far you're likely 
to get (they're using Tesseract as well).
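
As one illustration of tackling a single problem in isolation: if you train 
a model that outputs the medial (long) s as its own character, U+017F, you 
can map it back to a modern "s" afterwards so the text is searchable. A 
minimal sketch under that assumption. The function name is made up, and it 
does nothing for the more common failure where ſ is misread as "f", which 
would need a dictionary-based pass:

```python
LONG_S = "\u017f"  # ſ, LATIN SMALL LETTER LONG S


def modernize_long_s(text):
    """Replace every long s with a modern round s."""
    return text.replace(LONG_S, "s")


# The double long s (the "2x" ligature case) is handled the same way.
print(modernize_long_s("Congre\u017f\u017f"))
```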

Tom

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
