Re: [tesseract-ocr] Can I mix tiff/box files generated by ocrd-train with original training data used to train specific language in tesseract4 (from langdata directory)

Shree Devi Kumar Tue, 04 Sep 2018 11:25:46 -0700

My earlier suggestion of mixing the two kinds of images - scanned pages and
text2image created synthetic ones - was from before ocrd-train was
available.


ocrd-train works on single-line images, while tesstrain.sh works on
multipage TIFFs. If you mix the two, the single-line images will get more
iterations during training.
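
If you still want to try mixing the two sets, one way is to merge the lists
of .lstmf files that each tool produces into a single list file for training.
The following is only a rough sketch: the function name and the file names in
the usage note are hypothetical, and it assumes each tool writes a plain-text
list with one .lstmf path per line.

```python
from pathlib import Path


def merge_training_lists(list_files, output_file):
    """Merge several .lstmf list files into one list without duplicates.

    Assumes each input is a plain-text file with one .lstmf path per line.
    The order of first appearance is preserved.
    """
    entries = []
    seen = set()
    for list_file in list_files:
        text = Path(list_file).read_text(encoding="utf-8")
        for line in text.splitlines():
            path = line.strip()
            # Skip blank lines and paths that were already collected.
            if path and path not in seen:
                seen.add(path)
                entries.append(path)
    Path(output_file).write_text("\n".join(entries) + "\n", encoding="utf-8")
    return entries
```

For example, merge_training_lists(["list.train", "eng.training_files.txt"],
"mixed.train") would write the combined list to mixed.train; substitute the
list files your own runs actually produced.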

- pass_through_recoder is needed for complex scripts such as Indic scripts
and may not be needed for Latin-script-based languages.

For fine-tuning, the number of iterations should be very low: about 300-400
for a new font and 3000-4000 for adding a new character. More iterations
will lead to overfitting, as you are seeing.
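
As an illustration of keeping the iteration count low, here is a small sketch
that builds an lstmtraining command line with an explicit --max_iterations
cap. The helper name and all paths are hypothetical placeholders; the default
of 400 is taken from the new-font range above and is a starting point to
experiment with, not a recommendation for every case.

```python
import shlex


def finetune_command(base_model, traineddata, list_file, output_prefix,
                     max_iterations=400):
    """Build an lstmtraining argument list for a short fine-tuning run.

    All paths are supplied by the caller; nothing is executed here.
    """
    if max_iterations <= 0:
        raise ValueError("max_iterations must be positive")
    return [
        "lstmtraining",
        "--continue_from", str(base_model),      # extracted .lstm of the base model
        "--traineddata", str(traineddata),       # traineddata matching the base model
        "--train_listfile", str(list_file),      # text file listing .lstmf files
        "--model_output", str(output_prefix),    # prefix for checkpoint files
        "--max_iterations", str(max_iterations),
    ]


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the call.
    cmd = finetune_command("eng.lstm", "eng.traineddata",
                           "mixed.train", "output/finetuned")
    print(shlex.join(cmd))
```

For adding a new character you would pass a larger value, for example
max_iterations=3000, and compare the results on your test set.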

Please experiment with different options to see what works best for your
language and test sets.

