The links you gave me are great. I created the tiff/box pair on a Mac as 
follows:

training/text2image --text=yor.training_text \
  --outputbase=yor.VerdanaMedium.exp0 --font='Verdana Medium' \
  --fonts_dir=/Library/Fonts

Then I ran training as follows:

tesseract yor.VerdanaMedium.exp0.tif yor.VerdanaMedium.exp0 box.train.stderr



The only problem is that, after creating the tiff/box pair, training 
reports failures such as the following:

APPLY_BOXES: boxfile line 2087/ ((2121,1882),(2131,1921)): FAILURE! 
Couldn't find a matching blob

FAIL!

APPLY_BOXES: boxfile line 2135/ ((2112,1810),(2122,1848)): FAILURE! 
Couldn't find a matching blob

FAIL!

...

APPLY_BOXES:

   Boxes read from boxfile:    2265

   Boxes failed resegmentation:     124

   Found 2141 good blobs.

   Leaving 3 unlabelled blobs in 0 words.

Generated training data for 986 words

Warning in pixReadMemTiff: tiff page 5 not found


I also tried using the asc.training_text example directly, i.e. without my 
changes, but the same errors still occur. I've Googled, but it is unclear 
to me what the solution is.
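
To narrow down which box-file entries are failing, the APPLY_BOXES messages 
can be tallied with a small script. The Python sketch below is not part of 
Tesseract; the names parse_failures and failed_labels are made up for 
illustration. It assumes the log format shown above with each message on a 
single line (not wrapped as in this email), and that the reported box-file 
line numbers are 1-based.

```python
import re
from collections import Counter

# Matches lines such as:
# APPLY_BOXES: boxfile line 2087/ ((2121,1882),(2131,1921)): FAILURE! ...
# The text between "/" and " ((" is the box label and may be empty.
FAILURE_RE = re.compile(
    r"boxfile line (\d+)/(.*?) \(\((\d+),(\d+)\),\((\d+),(\d+)\)\): FAILURE!"
)


def parse_failures(log_text):
    """Return a list of (line_number, label, coords) for each failure."""
    failures = []
    for m in FAILURE_RE.finditer(log_text):
        line_no = int(m.group(1))
        label = m.group(2)
        coords = tuple(int(m.group(i)) for i in range(3, 7))
        failures.append((line_no, label, coords))
    return failures


def failed_labels(box_lines, failures):
    """Count the box-file labels found on the failing lines.

    Assumes line numbers in the log are 1-based and that each box-file
    line starts with the label followed by a space.
    """
    counts = Counter()
    for line_no, _label, _coords in failures:
        if 1 <= line_no <= len(box_lines):
            counts[box_lines[line_no - 1].split(" ", 1)[0]] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "APPLY_BOXES: boxfile line 2087/ ((2121,1882),(2131,1921)): "
        "FAILURE! Couldn't find a matching blob\n"
        "APPLY_BOXES: boxfile line 2135/ ((2112,1810),(2122,1848)): "
        "FAILURE! Couldn't find a matching blob\n"
    )
    for failure in parse_failures(sample):
        print(failure)
```

To use it on a real run, the stderr output of the training command would be 
saved to a file, read as text, and passed to parse_failures together with 
the lines of yor.VerdanaMedium.exp0.box. If the failures cluster on a few 
labels, those characters are the ones to inspect in the tiff.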

On Thursday, December 4, 2014 at 2:55:01 AM UTC-6, shree wrote:
>
> Try to use training text from the following and see if it helps - 
>
> https://code.google.com/r/shreeshrii-langdata/source/browse?name=asc
> https://code.google.com/r/shreeshrii-langdata/source/browse?name=iast
>
>
> https://code.google.com/r/shreeshrii-tessdata/source/browse?name=iast
>
> You can use eng+your_language_code to recognize English + your language 
> text.
>
>
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Thu, Dec 4, 2014 at 5:22 AM, Victor Williamson <[email protected]> wrote:
>
>> I am working on Yoruba OCR using Tesseract 3.02. I followed the steps 
>> on the wiki and referred to Cedric 
>> <http://blog.cedric.ws/how-to-train-tesseract-301>, and all the training 
>> goes through, but running Tesseract converts my images with Yoruba text to 
>> all dashes (-) proportional to the size of the text in the image. This 
>> happens even for the image I trained on. I used a very small sample of 
>> Yoruba text, and I realize I may not meet the minimum per-character 
>> requirement, because during mftraining I get a bunch of warnings like:
>>
>> Warning: no protos/configs for ò in CreateIntTemplates()
>> Warning: no protos/configs for w in CreateIntTemplates()
>> Warning: no protos/configs for ú in CreateIntTemplates()
>> Warning: no protos/configs for à in CreateIntTemplates()
>> ...
>>
>> Is there a way to build off the existing English training data? i.e. I 
>> want to extend the existing English training data because Yoruba uses most 
>> of the English characters plus 3 dozen additional special non-English 
>> characters. The existing English characters should always be recognized. I 
>> wanted to start with a small training image so that I could finish with 
>> minimal effort, run simple tests, and expand later.
>>
>> I've tried both manual commands and training within JTessBoxEditor, 
>> with the same end result. It would be nice to get at least some 
>> characters in the output.
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/686b069e-f110-4eba-9592-67c6fe0c7e38%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
