The links you gave me are great. I created the tiff/box pair on a mac as follows:
training/text2image --text=yor.training_text --outputbase=yor.VerdanaMedium.exp0 --font='Verdana Medium' --fonts_dir=/Library/Fonts

Then I ran training as follows:

tesseract yor.VerdanaMedium.exp0.tif yor.VerdanaMedium.exp0 box.train.stderr

The only problem is that, after creating the tiff/box pairs, training reports failures like these:

APPLY_BOXES: boxfile line 2087/ ((2121,1882),(2131,1921)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 2135/ ((2112,1810),(2122,1848)): FAILURE! Couldn't find a matching blob
FAIL!
...
APPLY_BOXES:
Boxes read from boxfile: 2265
Boxes failed resegmentation: 124
Found 2141 good blobs.
Leaving 3 unlabelled blobs in 0 words.
Generated training data for 986 words
Warning in pixReadMemTiff: tiff page 5 not found

I also tried using the asc.training_text example directly, i.e. without my changes, but the same errors still occur. I've Googled, but it is unclear to me what the solution is.

On Thursday, December 4, 2014 at 2:55:01 AM UTC-6, shree wrote:
>
> Try to use training text from the following and see if it helps -
>
> https://code.google.com/r/shreeshrii-langdata/source/browse?name=asc
> https://code.google.com/r/shreeshrii-langdata/source/browse?name=iast
>
> https://code.google.com/r/shreeshrii-tessdata/source/browse?name=iast
>
> You can use eng+your_language_code to recognize English + your language text.
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Thu, Dec 4, 2014 at 5:22 AM, Victor Williamson <[email protected]> wrote:
>
>> I am working on Yoruba OCR using Tesseract 3.02. After following the steps on the wiki and referring to Cedric <http://blog.cedric.ws/how-to-train-tesseract-301>, all the training goes through, but running Tesseract converts my images with Yoruba text to all dashes (-) proportional to the size of the text in the image.
>> This happens even for the image I trained on. I used a very small sample of Yoruba text, and I realize I may not meet the minimum per-character requirement, because during mftraining I get a bunch of
>>
>> Warning: no protos/configs for ò in CreateIntTemplates()
>> Warning: no protos/configs for w in CreateIntTemplates()
>> Warning: no protos/configs for ú in CreateIntTemplates()
>> Warning: no protos/configs for à in CreateIntTemplates()
>> ...
>>
>> Is there a way to build off the existing English training data? That is, I want to extend the existing English training data, because Yoruba uses most of the English characters plus 3 dozen additional special non-English characters. The existing English characters should always be recognized. I wanted to start with a small training image so that I could finish with minimal effort, run simple tests, and expand later.
>>
>> I've tried both manual commands and training within JTessBoxEditor, with the same end result. It would be nice to get at least some characters output.
>>
>> --
>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e23b7124-2df2-44a1-ab0d-5fdea104177e%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
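With 124 failed boxes, it may help to list which box-file lines fail before looking for a cause. The following is only a rough sketch in Python, not part of Tesseract: it assumes the stderr output of the box.train run has been saved as text, and that failure lines look exactly like the two quoted at the top of this message. The names `parse_failures`, `failed_box_lines`, and `SAMPLE_LOG` are made up for illustration.

```python
import re

# Assumed failure-line format, copied from the output quoted in this thread:
# APPLY_BOXES: boxfile line 2087/ ((2121,1882),(2131,1921)): FAILURE! <reason>
FAILURE_RE = re.compile(
    r"APPLY_BOXES: boxfile line (\d+)/.*?"
    r"\(\((\d+),(\d+)\),\((\d+),(\d+)\)\): FAILURE! (.*)"
)

# Sample input built only from lines that appear in the message above.
SAMPLE_LOG = """\
APPLY_BOXES: boxfile line 2087/ ((2121,1882),(2131,1921)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 2135/ ((2112,1810),(2122,1848)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES:
Boxes read from boxfile: 2265
Boxes failed resegmentation: 124
"""


def parse_failures(log_text):
    """Return (boxfile_line, (x1, y1, x2, y2), reason) for each failure line."""
    failures = []
    for line in log_text.splitlines():
        match = FAILURE_RE.search(line)
        if match is None:
            continue  # summary lines and bare "FAIL!" lines are skipped
        line_no = int(match.group(1))
        box = tuple(int(match.group(i)) for i in range(2, 6))
        failures.append((line_no, box, match.group(6).strip()))
    return failures


def failed_box_lines(log_text):
    """Return just the box-file line numbers reported as failures, sorted."""
    return sorted(line_no for line_no, _box, _reason in parse_failures(log_text))


if __name__ == "__main__":
    for line_no, box, reason in parse_failures(SAMPLE_LOG):
        print(line_no, box, reason)
```

The line numbers returned by `failed_box_lines` can then be checked against the .box file to see whether the failures cluster on particular characters, such as the accented ones.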

