Hi there. I've tried fine-tuning Tesseract 4 with lstmtraining on handwritten text, using box/tif pairs I generated myself. The overall training process worked without a hitch.
I now want to apply this fine-tuning process at a larger scale to form images. Here's my conundrum: my forms often contain a mixture of printed and handwritten text. Do I have to annotate both the printed and the handwritten text? Annotating both would take extra effort, so I'm wondering whether it is sufficient to draw boxes only around the handwritten portions. However, I worry that if I box only the handwritten parts and leave out the printed ones, it might somehow confuse my model.

My second question: when I run inference with my trained model, it throws a warning: *`Failed to load any lstm-specific dictionaries for lang X`*. I understand this is caused by the absence of word lists, punctuation lists, etc. (it does still produce output). How much does a word list affect the inference process? I could simply take the base language's word list from the GitHub repository and combine it into my newly trained tessdata. However, the forms I will run Tesseract on contain lots of people's names (which may not be present in a word list?). In that case, do I have to compile a new word list, or is it fine to do without one?
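In case it helps frame the question: my rough understanding is that a custom word list (base words plus names) could be compiled into the traineddata with the standard `wordlist2dawg` and `combine_tessdata` tools, something like the sketch below. The file names (`base_wordlist.txt`, `names.txt`) are placeholders for my own files, and `X` stands for the language code; I haven't verified this end to end.

```shell
# Unpack the fine-tuned traineddata to get its components,
# in particular X.lstm-unicharset (X = language code)
combine_tessdata -u X.traineddata X.

# Merge the base language's word list with a custom list of names
# (base_wordlist.txt and names.txt are placeholder file names)
cat base_wordlist.txt names.txt | sort -u > words.txt

# Compile the merged word list into an LSTM word dawg,
# using the unicharset that matches the model
wordlist2dawg words.txt X.lstm-word-dawg X.lstm-unicharset

# Pack the new dawg back into the traineddata
# (-o overwrites components whose suffix matches)
combine_tessdata -o X.traineddata X.lstm-word-dawg
```

If that is roughly right, then compiling a names list would just mean appending the names before the `wordlist2dawg` step.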