Hi Shree,

I'm glad you found my article helpful. Apologies for the delay in my
reply to you. I'll answer your questions below.

> I have found that trying to improve recognition by adding more training data
> sometimes leads to worse recognition. I am currently trying with just one font.
> Using multiple fonts sometimes fails with:
> 
> Font id = -1/2, class id = 96/2922 on sample 70292
> font_id >= 0 && font_id < font_id_map_.SparseSize():Error:Assert failed:in file
> ..\..\clasne 622

I don't think I've seen that failure before. But yes, you're right
that adding more training data can produce worse results.

> I would like to try your testing suite so that I can see whether there is
> improvement in the training data- do you have a windows binary for the same?

I don't have Windows binaries for them. The tools themselves should
compile on Windows, but to work beyond ASCII they need to be run
through a wrapper script ('ocrevalutf8'), which is Unix only. I would
recommend setting up Cygwin; the tools should be easy to compile and
run from there.
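
If setting up Cygwin turns out to be a hassle, a rough check is still
possible with a few lines of Python, which handles non-ASCII text on
Windows without a wrapper. The sketch below is only an illustration: it
is not the algorithm my testing tools use, the function names are made
up for this example, and it assumes you have the ground truth and the
OCR output as two Unicode strings. It computes a simple character
accuracy from the edit distance between them.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Fraction of ground-truth characters recovered (hypothetical metric).

    Can be negative if the OCR output contains many spurious characters.
    """
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_output)
    return 1.0 - errors / len(ground_truth)
```

Running this on the same test page before and after a training change
gives two numbers to compare, which should at least show whether the
new training data helped or hurt.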

> Is the recommended training process to train one font and then add another? Or
> train them separately then merge??

I'm not sure I understand the question. How do those two methods
differ in the case of Tesseract training?

> Does the order in which tif/box files are given matter?

Not as far as I know.

> If I am trying to fix errors, should new training data be given at end of old
> training data or before?

I also don't understand this question. Can you expand on what you
mean, please?

Hope this helps, and I look forward to hearing back from you.

Nick

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
