Re: Performance on 18th century text

matthew christy Mon, 12 Aug 2013 20:03:39 -0700

Hi Fabrizio,

I'm working on the eMOP project that Tom mentioned. All of the problems
you're dealing with are familiar to me, though most of our documents are in
English. French and English use the same basic character forms, so unless
you're using a dictionary to try to improve your results, the language
shouldn't be a big factor, as long as your training includes character
forms for all those extra French accents, etc.

How are you training Tesseract? If you're training on these same
documents, then you'll have trouble with the low quality of the glyphs. One
thing we've found while trying to train Tesseract from our own low-quality
documents is that the more low-quality character forms you give Tesseract,
the worse the results get. For example, I see that all your 'i's are being
recognized as 'î's. I'm guessing that in the training you have for these two
characters there is some ambiguity that Tesseract is having trouble with.
Also, I'm not sure how well it would work to train Tesseract on 3 fonts at
the same time. Are you doing that? I believe it's best to train all 3
individually and then combine them in your traineddata file.
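
One way to see where that kind of ambiguity shows up is to count which
characters get confused with which, by comparing Tesseract's output against
a hand-corrected transcription of the same lines. Below is a minimal Python
sketch of that idea, not a description of any existing tool: the function
name confusion_counts is made up for illustration, and the "truth" string is
only an assumed correct reading of two words from your sample output.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusion_counts(truth, ocr):
    """Count (expected, recognized) character pairs where the OCR differs."""
    counts = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replacements can be paired character by character;
        # insertions, deletions and uneven replacements are skipped here.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for expected, got in zip(truth[i1:i2], ocr[j1:j2]):
                counts[(expected, got)] += 1
    return counts


if __name__ == "__main__":
    truth = "ill received"   # assumed hand-corrected reading
    ocr = "îll receîved"     # as it appears in the recognized text
    for (expected, got), n in confusion_counts(truth, ocr).most_common():
        print(f"{expected!r} read as {got!r}: {n}")
    # prints: 'i' read as 'î': 2
```

Run over a few corrected pages, a tally like this should show whether the
'i'/'î' confusion is systematic (which would point at the training samples)
or just occasional noise.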

Are you using dictionaries or the unicharambigs to try to improve your
results? If the dictionary isn't period-specific, it could make the results
worse. What part of the 18th century are you dealing with? That could make a
difference. Spelling was standardized about halfway through the 18th
century. With all the possible alternative spellings from this era and the
inclusion of French, any dictionary you use is likely to be quite large, and
it may impact Tesseract's performance. However, if you don't have a lot of
documents to OCR, then it might not be an issue for you.
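
If you have some hand-corrected text from the same period, one way to get a
period-specific word list is to build it from that text instead of starting
from a modern dictionary. Here is a rough Python sketch under that
assumption. The function name build_wordlist and the min_count cutoff are
made up for illustration, and the sample string is invented. As far as I
know, Tesseract's word-list tooling takes a UTF-8 file with one word per
line, but check the training documentation for your version before relying
on that.

```python
import re
from collections import Counter


def build_wordlist(text, min_count=2):
    """Return the sorted words that occur at least min_count times in text."""
    # [^\W\d_]+ matches runs of Unicode letters, so accented characters,
    # the long s and ligatures are kept as they were transcribed.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    return sorted(word for word, n in counts.items() if n >= min_count)


if __name__ == "__main__":
    # Invented sample; in practice this would be the corrected transcription.
    sample = "ſome accolades, ſome hugs; accolades reçûes"
    for word in build_wordlist(sample):
        print(word)  # one word per line
```

Keeping only words that occur more than once is a crude way to stop
one-off transcription errors from ending up in the list, and it also keeps
the list smaller.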

The sample you included looks like some Caslon font, but not Baskerville.
Do you know what it is exactly? What's the third font you mentioned?
We are currently working on training Tesseract for a Caslon font we're
calling Guyot (I'm not the font guy, so I don't know much more than that). I
can send you that when it's available. It should have italics, but it's not
going to include all those extra French accent glyphs.

We're working on some other tools that might help you as well and I'll let 
you know when they're available. But my main question for you is how you 
are doing your font training and whether you're using a dictionary. Also, 
you should know that your results are actually not bad considering all the 
issues you're dealing with.

Thanks,
Matt

On Tuesday, August 6, 2013 9:18:25 AM UTC-5, [email protected] wrote:
>
> Hi,
>
> I am trying to recognize an 18th century text for academic purposes. I 
> followed the (very helpful) tutorial, and encountered no technical 
> problems. However, the recognition rate is disappointing. I think the 
> source material may just be too difficult for tesseract 3 (see sample 
> image <http://i.imgur.com/d5RnxI4.png> and recognized text below). The 
> difficulties are multiple: 3 fonts, 2 languages (bilingual text), obsolete 
> spellings, variable stroke width... I retrained tesseract on 10 samples of 
> each character, without much improvement.
>
> Could someone tell me if this is feasible? Or maybe the state of the art 
> in OCR has not reached yet this kind of performance...
>
> Thanks for the insight!
>
> Fabrizio
>
> --
>
> Image: http://i.imgur.com/d5RnxI4.png
>
> *Recognized text for image*
>
> ACCOLADE,  [embraﬀement] A bug, clîppl’ng and
> colling. Je hazardaî quèlques accolades qui ne îûrent pâs
> trop mal reçûes, I ventured ſome bugs, wbicb were not very
> îll receîved. * Nous nous mimes ä domler des accolades â
> notre boutèille, PVc./ëll ta bugging our bottle. ☞ Il l’a fait
> Chevalîér en lui donnant l’accolade, He bar dubbcd hl’ln a
> K.wigbt. ☞ Sèrvîr unc accolade de lapereaûx (une couple)
> To jZ-rve o couple oj’yortng rabbîts în one dﬄa.
>
>
>

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

