Hi Fabrizio, I'm working on the eMOP project that Tom mentioned. All of the problems you're dealing with are familiar to me, though most of our documents are in English. French and English use the same basic character forms, so unless you're using a dictionary to try to improve your results, the bilingual text shouldn't be a big factor, as long as your training includes the character forms for all those extra French accents, etc.
How are you training Tesseract? If you're training from these same documents, then you'll have trouble with the low quality of the glyphs. One thing we've found while trying to train Tesseract from our own low-quality documents is that the more low-quality character forms you give Tesseract, the worse the results get. For example, I see that many of your 'i's are being recognized as 'î's. I'm guessing that in the training you have for these two characters there is some ambiguity that Tesseract is having trouble with.

Also, I'm not sure how well it would work to train Tesseract on 3 fonts at the same time. Are you doing that? I believe it's best to train all 3 individually and then combine them in your traineddata file.

Are you using dictionaries or the unicharambigs file to try to improve your results? If the dictionary isn't period-specific, it could make the results worse. What part of the 18th century are you dealing with? That could make a difference: spelling was standardized about halfway through the 18th century. With all the possible alternative spellings from this era and the inclusion of French, any dictionary you use is likely to be quite large, and it may impact Tesseract's performance. However, if you don't have a lot of documents to OCR, then it might not be an issue for you.

The sample you included looks like some Caslon font, not Baskerville. Do you know what it is exactly? What's the third font you mentioned? We are currently working on training Tesseract for a Caslon font we're calling Guyot (I'm not the font guy, so I don't know much more than that). I can send you that when it's available. It should have italics, but it's not going to include all those extra French accent glyphs. We're working on some other tools that might help you as well, and I'll let you know when they're available.

But my main question for you is how you are doing your font training and whether you're using a dictionary.
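On the 'i' versus 'î' confusion noted above: a small script can tally which characters are being swapped, which may help decide where the training data or a unicharambigs entry needs attention. The following is only a minimal Python sketch, not part of Tesseract or eMOP. The function name `tally_confusions` and the hand-corrected `truth` string are assumptions made up for illustration, and it only counts one-to-one substitutions (insertions, deletions and splits such as 'm' read as 'rn' are ignored).

```python
import difflib
from collections import Counter


def tally_confusions(truth: str, ocr: str) -> Counter:
    """Count (expected, recognized) pairs for one-to-one character substitutions."""
    confusions = Counter()
    # autojunk=False keeps difflib from ignoring frequent characters in long texts.
    matcher = difflib.SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replacements map cleanly onto character pairs.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for expected, recognized in zip(truth[i1:i2], ocr[j1:j2]):
                confusions[(expected, recognized)] += 1
    return confusions


# Assumed hand-corrected text (hypothetical) and a fragment of the OCR output.
truth = "ill received"
ocr = "îll receîved"

for (expected, recognized), count in tally_confusions(truth, ocr).most_common():
    print(f"{expected!r} read as {recognized!r}: {count}")
```

Running this over a few hand-corrected lines would show whether 'i' read as 'î' really dominates the errors or whether other pairs matter just as much.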
Also, you should know that your results are actually not bad considering all the issues you're dealing with.

Thanks,
Matt

On Tuesday, August 6, 2013 9:18:25 AM UTC-5, [email protected] wrote:
>
> Hi,
>
> I am trying to recognize an 18th century text for academic purposes. I
> followed the (very helpful) tutorial, and encountered no technical
> problems. However, the recognition rate is disappointing. I think the
> source material may just be too difficult for tesseract 3 (see sample
> image <http://i.imgur.com/d5RnxI4.png> and recognized text below). The
> difficulties are multiple: 3 fonts, 2 languages (bilingual text), obsolete
> spellings, variable stroke width... I retrained tesseract on 10 samples of
> each character, without much improvement.
>
> Could someone tell me if this is feasible? Or maybe the state of the art
> in OCR has not reached yet this kind of performance...
>
> Thanks for the insight!
>
> Fabrizio
>
> --
>
> Image: http://i.imgur.com/d5RnxI4.png
>
> *Recognized text for image*
>
> ACCOLADE, [embraffement] A bug, clîppl’ng and
> colling. Je hazardaî quèlques accolades qui ne îûrent pâs
> trop mal reçûes, I ventured ſome bugs, wbicb were not very
> îll receîved. * Nous nous mimes ä domler des accolades â
> notre boutèille, PVc./ëll ta bugging our bottle. ☞ Il l’a fait
> Chevalîér en lui donnant l’accolade, He bar dubbcd hl’ln a
> K.wigbt. ☞ Sèrvîr unc accolade de lapereaûx (une couple)
> To jZ-rve o couple oj’yortng rabbîts în one dffla.
>
> --

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

