Hello Nick,

I am trying to train Tesseract for Sanskrit/Hindi in non-cube mode. I found 
your article on Ancient Greek helpful in figuring out the steps for 
training.

I have found that trying to improve recognition by adding more training 
data sometimes leads to worse recognition. I am currently trying with just 
one font. Using multiple fonts sometimes fails with:

Font id = -1/2, class id = 96/2922 on sample 70292
font_id >= 0 && font_id < font_id_map_.SparseSize():Error:Assert failed:in 
file ..\..\clasne 622

I would like to try your testing suite so that I can see whether changes to 
the training data actually improve recognition. Do you have a Windows binary 
for it?
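
To make "improvement" concrete, here is a minimal sketch of one common measure, the character error rate between a ground-truth text and the OCR output. This is not taken from Nick's testing scripts; the function names `edit_distance` and `char_error_rate` and the sample strings are hypothetical, and the count is in Unicode code points, which is only an approximation for Devanagari, where one visible syllable can span several code points.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance divided by ground-truth length (lower is better)."""
    # Normalise both sides so equivalent encodings compare as equal.
    gt = unicodedata.normalize("NFC", ground_truth)
    out = unicodedata.normalize("NFC", ocr_output)
    if not gt:
        return 0.0 if not out else 1.0
    return edit_distance(gt, out) / len(gt)


if __name__ == "__main__":
    # Hypothetical sample: ground truth versus output from two models.
    truth = "नमस्ते"
    old_model = "नमस्त"
    new_model = "नमस्ते"
    print("old:", char_error_rate(truth, old_model))
    print("new:", char_error_rate(truth, new_model))
```

Running the same comparison on a fixed set of test pages before and after each change to the training data would show whether the change helped or hurt.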

Is the recommended training process to train one font and then add another, 
or to train them separately and then merge?

Does the order in which the tif/box files are given matter? Currently I have 
multiple small files, just for ease of editing and testing.

If I am trying to fix errors, should the new training data be given after the 
old training data or before it?

Any other tips on training would also be helpful as I am a newbie.

Thanks,
Shree

On Saturday, March 9, 2013 12:50:43 AM UTC+5:30, Nick White wrote:
>
> On Wed, Feb 27, 2013 at 11:54:39AM +0000, Nick White wrote: 
> > On Sun, Feb 24, 2013 at 05:53:52PM +0100, zdenko podobny wrote: 
> > > • tool for measuring of training quality e.g. how many pages I need to 
> > >   training to get reasonable result? If I add another similar font how 
> > >   it effect OCR result (I have a bad experience on this)? Is there 
> > >   problem with specific symbol (is there need to focus on some 
> > >   specific symbol)? 
> > 
> > I have written a little shell script that runs various tests given a 
> > .traineddata file, that may well come close to what you want. It 
> > needs some cleaning up, but I should be able to release it in the 
> > next few days. 
>
> Right, they're ready to share now. Get the testing scripts from here: 
>
>   https://gitorious.org/ancient-greek-training-for-tesseract/trainingtestscripts
>
> I don't have a lot of time to devote to them at the moment, but 
> hopefully they'll be useful. There's a README which hopefully 
> explains things well enough. 
>
> And of course comments and patches are most welcome! 
>
> Nick 
>

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en