[tesseract-ocr] Can I mix tiff/box files generated by ocrd-train with original training data used to train specific language in tesseract4 (from langdata directory)

Raniem AROUR Tue, 04 Sep 2018 06:04:51 -0700

Hello,

I am trying to fine-tune dan.traineddata for my specific use case. 
However, the model is overfitting on my data and seems to be forgetting 
the original data it was trained on. I remember reading somewhere that 
this can be solved by also showing the original training data to the 
network, so that performance on the original data does not regress.


I have images and their corresponding ground truth files, so I 
used ocrd-train <https://github.com/OCR-D/ocrd-train> for the fine-tuning 
earlier (following some advice found in this thread 
<https://groups.google.com/forum/#!searchin/tesseract-ocr/fine$20tuning$20english$20language%7Csort:date/tesseract-ocr/be4-rjvY2tQ/32evtMHlAQAJ>,
thanks to Shree).
I then mixed my training data with the original training data using 
the hints provided by Shree in this issue 
<https://github.com/tesseract-ocr/tesseract/issues/1172>.

The command I used after updating tesstrain.sh as recommended was:

~/tesseract/src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang dan --linedata_only \
  --noextract_font_properties --langdata_dir /home/my_user/ocrd-train/langdata \
  --tessdata_dir /home/my_user/tesseract/tessdata \
  --output_dir /home/my_user/my_models/danNew/



Then I ran "make training" in the ocrd-train directory, as I 
usually do for fine-tuning. The fine-tuning started, but I got some 
errors that I believe come from the original data, e.g.:
Encoding of string failed! Failure bytes: ffffffc3 ffffffb6 20 65 72 
20 31 2e 34 35 24 2e 20 74 69 64 6c 69 67 65 72 65 20 31 37 2e 20 68 61 76 
65 20 6d 61 6e 67 65 20 4e 59 20 2d 20 76 ffffffc3 ffffffa6 72 65 20 69 20 
53 ffffffc3 ffffff85 20 43 61 6e 61 6c 2b 20 6f 67
Can't encode transcription: 'har Søg butik været blevet Ifö er 1.45$. 
tidligere 17. have mange NY - være i SÅ Canal+ og' in language ''
Encoding of string failed! Failure bytes: ffffffc3 ffffffb6 20 65 72 20 31 
2e 34 35 24 2e 20 74 69 64 6c 69 67 65 72 65 20 31 37 2e 20 68 61 76 65 20 
6d 61 6e 67 65 20 4e 59 20 2d 20 76 ffffffc3 ffffffa6 72 65 20 69 20 53 
ffffffc3 ffffff85 20 43 61 6e 61 6c 2b 20 6f 67
Can't encode transcription: 'har Søg butik været blevet Ifö er 1.45$. 
tidligere 17. have mange NY - være i SÅ Canal+ og' in language ''
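
The failure bytes in the log above can be decoded to see where in the line the encoding stopped. This is only a minimal Python sketch: it assumes the dump lists UTF-8 bytes in hex and that bytes of 0x80 and above are printed sign-extended (which would explain the ffffff prefix). decode_failure_bytes is a hypothetical helper name, not part of tesseract.

```python
def decode_failure_bytes(dump: str) -> str:
    """Turn a 'Failure bytes' hex dump back into text.

    Assumption: each token is one byte in hex, and bytes >= 0x80 were
    printed sign-extended (e.g. ffffffc3), so only the low 8 bits matter.
    """
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    # errors="replace" keeps the helper usable if the dump is truncated
    # in the middle of a multi-byte character.
    return raw.decode("utf-8", errors="replace")


# First few bytes copied from the log above.
dump = "ffffffc3 ffffffb6 20 65 72 20 31 2e 34 35 24 2e"
print(decode_failure_bytes(dump))  # prints: ö er 1.45$.
```

The decoded text starts at the "ö" of "Ifö", so the encoder seems to stop at that character. That would be consistent with "ö" missing from the unicharset in use, but this is a guess from the log rather than something the log states.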

P.S. I know the box files produced by ocrd-train look different from the 
usual box files used for training tesseract4, but they worked for fine-tuning 
other models, and I was wondering whether it is a bad idea to mix them this way.
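
To make "mixing" concrete: at the list-file level it amounts to putting the .lstmf paths from both sources into one training list. The Python sketch below is a rough illustration under that assumption, not the ocrd-train or tesstrain.sh method. mix_training_lists and all paths shown are hypothetical.

```python
import random


def mix_training_lists(own, original, seed=0):
    """Merge two lists of .lstmf paths into one shuffled list.

    Duplicates are dropped, and the merged list is sorted before
    shuffling so that a given seed always yields the same order.
    """
    merged = sorted(set(own) | set(original))
    random.Random(seed).shuffle(merged)
    return merged


# Hypothetical paths, for illustration only.
own_lines = ["data/ground-truth/page_001.lstmf", "data/ground-truth/page_002.lstmf"]
original_lines = ["danNew/dan.Arial.exp0.lstmf", "danNew/dan.Verdana.exp0.lstmf"]

for path in mix_training_lists(own_lines, original_lines):
    print(path)
```

Shuffling is a design choice here so that the two sources are interleaved rather than seen one after the other. Whether that matters for this training setup is an assumption.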

What could have gone wrong in this process? I appreciate any 
suggestions.


Kind Regards

