My earlier suggestion of mixing the two kinds of images (scanned pages and synthetic ones created with text2image) was made before ocrd-train was available.
ocrd-train works on single-line images, while tesstrain.sh works on multipage TIFFs. If you mix the two, the single-line images will get more iterations during training.

pass_through_recoder is needed for complex scripts such as the Indic scripts and may not be needed for Latin-script languages.

For fine-tuning, the number of iterations should be very low: about 300-400 for a new font and 3000-4000 for adding a new character. More iterations will lead to overfitting, as you are seeing.

Please experiment with different options to see what works best for your language and test sets.
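
As a rough illustration of the iteration counts above, here is a small Python sketch that assembles an lstmtraining command line for fine-tuning. The helper name, the task labels and all file paths are hypothetical; picking the upper end of each range is my own assumption, not a rule. The sketch only builds the argument list and does not run anything. Note that adding a new character also needs an updated unicharset and the --old_traineddata option, which are left out here.

```python
from typing import List

# Iteration ranges quoted in the message above.
# The task labels are made up for this sketch.
SUGGESTED_ITERATIONS = {
    "new_font": (300, 400),
    "new_character": (3000, 4000),
}


def build_finetune_command(
    task: str,
    continue_from: str,
    traineddata: str,
    train_listfile: str,
    model_output: str,
) -> List[str]:
    """Return an lstmtraining argument list capped at a low iteration count."""
    low, high = SUGGESTED_ITERATIONS[task]
    # Assumption: use the upper end of the suggested range as the cap.
    max_iterations = high
    return [
        "lstmtraining",
        "--continue_from", continue_from,
        "--traineddata", traineddata,
        "--train_listfile", train_listfile,
        "--model_output", model_output,
        "--max_iterations", str(max_iterations),
    ]


if __name__ == "__main__":
    # All paths below are placeholders.
    cmd = build_finetune_command(
        "new_font",
        continue_from="extracted/eng.lstm",
        traineddata="tessdata_best/eng.traineddata",
        train_listfile="train/eng.training_files.txt",
        model_output="output/newfont",
    )
    print(" ".join(cmd))
```

If the error rate on your test set starts rising while the training error keeps falling, lower the cap and try again.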

