Hello Nick,

I am trying to train Tesseract for Sanskrit/Hindi in non-cube mode. I found your article on Ancient Greek helpful in figuring out the training steps.
I have found that adding more training data to improve recognition sometimes makes recognition worse. I am currently training with just one font. Using multiple fonts sometimes fails with:

    Font id = -1/2, class id = 96/2922 on sample 70292
    font_id >= 0 && font_id < font_id_map_.SparseSize():Error:Assert failed:in file ..\..\clasne 622

I would like to try your testing suite so that I can see whether changes to the training data actually improve results. Do you have a Windows binary for it?

Some questions about the training process:

Is the recommended process to train one font and then add another, or to train them separately and then merge?

Does the order in which the tif/box files are given matter? Currently I have multiple small files, just for ease of editing and testing.

If I am trying to fix errors, should the new training data be given after the old training data or before it?

Any other tips on training would also be helpful, as I am a newbie.

Thanks,
Shree

On Saturday, March 9, 2013 12:50:43 AM UTC+5:30, Nick White wrote:
> On Wed, Feb 27, 2013 at 11:54:39AM +0000, Nick White wrote:
> > On Sun, Feb 24, 2013 at 05:53:52PM +0100, zdenko podobny wrote:
> > > • tool for measuring of training quality e.g. how many pages I need to
> > >   training to get reasonable result? If I add another similar font how it
> > >   effect OCR result (I have a bad experience on this)? Is there problem with
> > >   specific symbol (is there need to focus on some specific symbol)?
> >
> > I have written a little shell script that runs various tests given a
> > .traineddata file, that may well come close to what you want. It
> > needs some cleaning up, but I should be able to release it in the
> > next few days.
>
> Right, they're ready to share now. Get the testing scripts from here:
>
> https://gitorious.org/ancient-greek-training-for-tesseract/trainingtestscripts
>
> I don't have a lot of time to devote to them at the moment, but
> hopefully they'll be useful. There's a README which hopefully
> explain things well enough.
>
> And of course comments and patches are most welcome!
>
> Nick

