I have a few questions regarding the fine-tuning process. I'm building an app that is able to recognize data from the following documents:
- ID card
- Driving license
- Passport
- Receipts

All of them use different fonts (especially receipts), and since it is hard to match the exact fonts, I will have to train the model on a lot of similar fonts. So my questions are:

1. Should I train a separate model for each document type for better performance and accuracy, or is it fine to train a single `eng` model on a set of fonts similar to those used on these documents?
2. How many pages of training data should I generate per font? By default, I think `tesstrain.sh` generates around 4k pages. Also, any suggestions on how to generate training data that is as close as possible to real input data?
3. How many iterations should I use? For example, if a font has a high error rate and I want to target a `98% - 99%` accuracy rate.

Also, if any of you have experience working with these types of documents, do you know which fonts are commonly used on them? I know the MRZ on passports and ID cards uses the `OCR-B` font, but what about the rest of the document?

Thanks in advance!
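For questions 2 and 3, here is a minimal sketch of how the training-data generation and fine-tuning steps might look with the standard Tesseract 4/5 LSTM tooling. All paths, the font list, and the training text are placeholder assumptions, not a definitive recipe; `--exposures` varies the rendering degradation, which can help approximate noisy real-world scans like receipts.

```shell
#!/bin/sh
# Hypothetical paths -- adjust to your own checkout of tesseract/langdata.
LANGDATA=/path/to/langdata
TESSDATA=/path/to/tessdata
OUT=/path/to/train_output

# Generate line training data for several fonts similar to the target
# documents; multiple exposures simulate lighter/darker scans.
tesstrain.sh \
  --lang eng \
  --linedata_only \
  --fontlist "Arial" "Courier New" "OCR-B" \
  --exposures "-1 0 1" \
  --training_text "$LANGDATA/eng/eng.training_text" \
  --langdata_dir "$LANGDATA" \
  --tessdata_dir "$TESSDATA" \
  --output_dir "$OUT"

# Fine-tune from the existing eng model; --target_error_rate 0.01
# stops training once the character error rate drops below 1%,
# i.e. roughly the 99% accuracy target mentioned above.
lstmtraining \
  --continue_from "$OUT/eng.lstm" \
  --traineddata "$TESSDATA/eng.traineddata" \
  --train_listfile "$OUT/eng.training_files.txt" \
  --model_output "$OUT/finetuned" \
  --max_iterations 10000 \
  --target_error_rate 0.01
```

With `--target_error_rate` set, the exact iteration count matters less: training halts early if the error target is reached, and `--max_iterations` acts as an upper bound.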

