Re: [tesseract-ocr] Can I mix tiff/box files generated by ocrd-train with original training data used to train specific language in tesseract4 (from langdata directory)

Raniem Wed, 05 Sep 2018 09:07:58 -0700

Thanks Shree, appreciate your support

Regards


On Tuesday, September 4, 2018 at 7:25:33 PM UTC+1, shree wrote:
>
> My earlier suggestion of mixing the two kinds of images - scanned pages 
> and text2image created synthetic ones - was from before ocrd-train was 
> available.
>
> ocrd-train works on single-line images, while tesstrain.sh works on 
> multipage tifs. By mixing these, the single-line images will get more 
> iterations during training. 
>
> - pass_through_recoder is needed for complex scripts such as Indic 
> scripts and may not be needed for Latin-script-based languages.
>
> For fine-tuning, the number of iterations should be very low: about 
> 300-400 for a new font and 3000-4000 for adding a new character. More 
> iterations will lead to overfitting, as you are seeing.
>
> Please experiment with different options to see what works best for your 
> language and testsets.
>
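
To make the iteration guidance above concrete, here is a minimal Python sketch that assembles an lstmtraining command line with an iteration cap taken from the ranges quoted in the message. The dictionary, the function name and all file paths are hypothetical and chosen only for illustration; the flags used (--continue_from, --traineddata, --train_listfile, --model_output, --max_iterations) are standard lstmtraining options. Adding a new character normally also needs --old_traineddata and an updated unicharset, which this sketch leaves out.

```python
# Iteration ranges quoted in the message above. The dictionary and its
# keys are hypothetical names used only for this illustration.
FINETUNE_ITERATIONS = {
    "new_font": (300, 400),
    "new_character": (3000, 4000),
}


def build_finetune_command(goal, checkpoint, traineddata, list_file, output_prefix):
    """Return an lstmtraining argument list capped at the upper bound for `goal`.

    All path arguments are placeholders supplied by the caller; nothing is
    executed here, the list is only assembled.
    """
    _low, high = FINETUNE_ITERATIONS[goal]
    return [
        "lstmtraining",
        "--continue_from", checkpoint,      # checkpoint extracted from the base model
        "--traineddata", traineddata,       # traineddata file for the language
        "--train_listfile", list_file,      # text file listing the .lstmf training files
        "--model_output", output_prefix,    # prefix for the checkpoints that get written
        "--max_iterations", str(high),      # keep this low to limit overfitting
    ]


if __name__ == "__main__":
    # Example paths are assumptions, not taken from the thread.
    cmd = build_finetune_command(
        "new_font", "eng.lstm", "eng.traineddata", "list.train", "out/font_tuned"
    )
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run once the paths point at real files. Starting at the low end of each range and checking the error rate on a held-out test set before raising the cap fits the advice to experiment.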

