I have a few questions regarding the fine-tuning process. I'm building an app that is able to recognize data from the following documents:
- ID card
- Driving license
- Passport
- Receipts

All of them use different fonts (especially receipts), and since it is hard to match the exact fonts, I will have to train the model on a lot of similar fonts. So my questions are:

1. Should I train a separate model for each document type for better performance and accuracy, or is it fine to train a single `eng` model on a set of fonts similar to those used on these documents?
2. How many pages of training data should I generate per font? By default, I think `tesstrain.sh` generates around 4k pages. Also, any suggestions on how to generate training data that is as close as possible to real input data?
3. How many iterations should I use? For example, if a font has a high error rate and I want to target a `98% - 99%` accuracy rate.

Also, if any of you have experience working with these types of documents, do you know which fonts are commonly used on them? I know the MRZ on passports and ID cards uses the `OCR-B` font, but what about the rest of the document?

Thanks in advance!
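For questions 2 and 3, here is a minimal sketch of how the training-data generation and fine-tuning steps might look with the standard Tesseract 4/5 LSTM tooling. All paths, the font list, and the training text are placeholder assumptions, not a definitive recipe; `--exposures` varies the rendering degradation, which can help approximate noisy real-world scans like receipts.

```shell
#!/bin/sh
# Hypothetical paths -- adjust to your own checkout of tesseract/langdata.
LANGDATA=/path/to/langdata
TESSDATA=/path/to/tessdata
OUT=/path/to/train_output

# Generate line training data for several fonts similar to the target
# documents; multiple exposures simulate lighter/darker scans.
tesstrain.sh \
  --lang eng \
  --linedata_only \
  --fontlist "Arial" "Courier New" "OCR-B" \
  --exposures "-1 0 1" \
  --training_text "$LANGDATA/eng/eng.training_text" \
  --langdata_dir "$LANGDATA" \
  --tessdata_dir "$TESSDATA" \
  --output_dir "$OUT"

# Fine-tune from the existing eng model; --target_error_rate 0.01
# stops training once the character error rate drops below 1%,
# i.e. roughly the 99% accuracy target mentioned above.
lstmtraining \
  --continue_from "$OUT/eng.lstm" \
  --traineddata "$TESSDATA/eng.traineddata" \
  --train_listfile "$OUT/eng.training_files.txt" \
  --model_output "$OUT/finetuned" \
  --max_iterations 10000 \
  --target_error_rate 0.01
```

With `--target_error_rate` set, the exact iteration count matters less: training halts early if the error target is reached, and `--max_iterations` acts as an upper bound.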

