[tesseract-ocr] v4.1.1 - Segmentation fault on train data generation; all .lstmf files are exactly 1GB

Sim Tov Mon, 20 Sep 2021 04:52:23 -0700

Hello,

I am using v4.1.1 on Linux (Debian 11) and trying to generate training and 
evaluation data. The commands I used were:


train:

/usr/share/tesseract-ocr/tesstrain.sh --fonts_dir FontsRashi/Working --lang 
heb --linedata_only --noextract_font_properties --langdata_dir ./langdata  
--tessdata_dir /usr/share/tesseract-ocr/4.00/tessdata/ --output_dir 
output/train --fontlist 'BenOr Rashi' 'Guttman Rashi Bold'

and

evaluate:

/usr/share/tesseract-ocr/tesstrain.sh --fonts_dir FontsRashi/Working --lang 
heb --linedata_only --noextract_font_properties --langdata_dir ./langdata  
--tessdata_dir /usr/share/tesseract-ocr/4.00/tessdata/ --output_dir 
output/evaluate --fontlist 'Guttman Rashi'

After several days of running, both commands stopped with errors like this 
for each of the 3 fonts:

Page 8365
Loaded 386170/386170 lines (1-386170) of document 
/tmp/heb-2021-09-16.1dB/heb.Guttman_Rashi.exp0.lstmf
Page 8366
Loaded 386216/386216 lines (1-386216) of document 
/tmp/heb-2021-09-16.1dB/heb.Guttman_Rashi.exp0.lstmf
/usr/share/tesseract-ocr/tesstrain_utils.sh: line 72:  2271 Segmentation 
fault      "${cmd}" "$@" 2>&1
      2272 Done                    | tee -a ${LOG_FILE}
ERROR: Program tesseract failed. Abort.

Interestingly, heb.Guttman_Rashi.exp0.lstmf and both other .lstmf 
files were exactly 1 GB in size.
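
In case it helps with diagnosis, the sizes can be checked with something like 
the sketch below. It assumes GNU stat (as shipped with Debian) and that "1 GB" 
means either 2^30 or 10^9 bytes; the function names are only illustrative and 
the default directory is the temporary one named in the log above.

```shell
#!/bin/sh
# Classify a byte count against the two common meanings of "1 GB".
# 1 GiB = 2^30 bytes = 1073741824; 1 GB (decimal) = 10^9 bytes.
classify_size() {
    case "$1" in
        1073741824) echo "exactly 1 GiB" ;;
        1000000000) echo "exactly 1 GB (decimal)" ;;
        *)          echo "other" ;;
    esac
}

# Print path, size in bytes, and size class for every .lstmf file in a
# directory. The default path is taken from the log output; adjust as needed.
report_lstmf_sizes() {
    dir="${1:-/tmp/heb-2021-09-16.1dB}"
    for f in "$dir"/*.lstmf; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        bytes=$(stat -c %s "$f")         # GNU stat: apparent size in bytes
        printf '%s\t%s\t%s\n' "$f" "$bytes" "$(classify_size "$bytes")"
    done
}
```

Calling report_lstmf_sizes with no argument inspects the temporary directory 
from the log; passing another directory inspects that one instead.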

Does it have something to do with what is written here:

https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951

"The text is divided by language automatically, so there is a separate 
stream for each of the Devanagari-based languages (as there is for the 
Latin-based languages) and *clipped to 1GB* for each language."

1. So is this segmentation fault expected behavior?

2. What should I do now? Should I rerun the commands hoping that they will 
finish properly or should I copy those .lstmf files that I got so far to 
the train/evaluate directories and start training?

3. Both output/evaluate and output/train directories remained empty after 
the commands above failed. What files should be there at the end so I can 
start the training process?


Thank you in advance!

tesseract --version
tesseract 4.1.1
 leptonica-1.79.0
  libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 : libtiff 
4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 
libzstd/1.4.8

