I have a few questions regarding the fine-tuning process.
I'm building an app that recognizes data from the following 
documents:

- ID Card
- Driving license
- Passport
- Receipts

All of them use different fonts (receipts especially), and since it is hard 
to match the exact font, I will have to train the model on a number of 
similar fonts.

So my questions are:

1. Should I train a separate model for each document type for better 
performance and accuracy, or is it fine to train a single `eng` model on a 
set of fonts similar to those used on these types of documents?

2. How many pages of training data should I generate per font? By default, 
I think `tesstrain.sh` generates around 4k pages. 
Also, do you have any suggestions on how to generate training data that is 
as close as possible to the real input data?
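For reference, this is roughly how I'm generating the data now — a minimal sketch based on the stock `tesstrain.sh` from the Tesseract tutorial; the font names, paths, and the custom `training_text` file are placeholders for my setup, and I'm assuming `--maxpages` is the right way to cap the page count:

```shell
# Generate line training data for a couple of candidate fonts.
# --maxpages caps how many pages are rendered per font (instead of the
# ~4k default); --training_text points at text assembled from real
# receipts/documents so the data resembles the actual input.
src/training/tesstrain.sh \
  --fonts_dir /usr/share/fonts \
  --fontlist "OCR-B" "DejaVu Sans" \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --training_text ../langdata/eng/eng.receipts.training_text \
  --maxpages 100 \
  --output_dir ~/tesstutorial/engtrain
```

Is that the sensible approach, or is there a better way to bias the generated pages toward real-world input?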

3. How many iterations should be used?

For example, if I'm fine-tuning on a font that currently has a high error 
rate and I want to target `98% - 99%` accuracy.
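Concretely, I was planning something like the fine-tuning invocation below (paths are placeholders; I'm assuming `--target_error_rate` is the character error rate in percent, so `1.0` would correspond to roughly 99% accuracy):

```shell
# Fine-tune from an existing best-float model; training stops when either
# --max_iterations or --target_error_rate is reached, whichever comes first.
lstmtraining \
  --continue_from ~/tesstutorial/eng_from_best/eng.lstm \
  --model_output ~/tesstutorial/receipts_from_eng/receipts \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
  --max_iterations 3600 \
  --target_error_rate 1.0
```

Is 3600 iterations in the right ballpark for this, or should the number scale with how many fonts/pages are in the training set?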

Also, if any of you have experience working with these types of documents, 
do you know which fonts are commonly used on them?

I know that the MRZ on passports and ID cards uses the `OCR-B` font, but 
what about the rest of the document?

Thanks in advance!

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/ca4d4d30-ce15-419b-b8e4-4891734c41e3n%40googlegroups.com.