Hi there. I've tried fine-tuning Tesseract 4 with lstmtraining on handwritten text, using box/tif pairs I generated myself. The overall training process worked without a hitch.
I now want to apply this fine-tuning process at a larger scale to form images. Here's my conundrum: my forms often contain a mixture of printed and handwritten text. Do I have to annotate both the printed and the handwritten text? Annotating both would take extra effort, so I'm wondering whether it is sufficient to draw boxes only around the handwritten portions. However, I worry that if I box only the handwritten parts and leave out the printed ones, it might somehow confuse my model.

My second question: when I run inference with my trained model, it throws a warning: *`Failed to load any lstm-specific dictionaries for lang X`*. I understand this is caused by the absence of word lists, punctuation lists, etc. (it does still produce output). How much does a word list affect the inference process? I could simply take the base language's word list from the GitHub repository and combine it into my newly trained tessdata. However, the forms I will run Tesseract on contain lots of people's names (which may not be present in a word list?). In that case, do I have to compile a new word list, or is it fine to do without one?
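In case it helps frame the question: my rough understanding is that a custom word list (base words plus names) could be compiled into the traineddata with the standard `wordlist2dawg` and `combine_tessdata` tools, something like the sketch below. The file names (`base_wordlist.txt`, `names.txt`) are placeholders for my own files, and `X` stands for the language code; I haven't verified this end to end.

```shell
# Unpack the fine-tuned traineddata to get its components,
# in particular X.lstm-unicharset (X = language code)
combine_tessdata -u X.traineddata X.

# Merge the base language's word list with a custom list of names
# (base_wordlist.txt and names.txt are placeholder file names)
cat base_wordlist.txt names.txt | sort -u > words.txt

# Compile the merged word list into an LSTM word dawg,
# using the unicharset that matches the model
wordlist2dawg words.txt X.lstm-word-dawg X.lstm-unicharset

# Pack the new dawg back into the traineddata
# (-o overwrites components whose suffix matches)
combine_tessdata -o X.traineddata X.lstm-word-dawg
```

If that is roughly right, then compiling a names list would just mean appending the names before the `wordlist2dawg` step.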