[tesseract-ocr] Re: fine tuning a few characters generating training images error
Thanks for the insight. I'm experiencing the same issue. My TIFF file was also 66 MB.

On Thursday, June 13, 2019 at 2:50:21 PM UTC-7, Jingjing Lin wrote:
> It turns out the cause was indeed that the chi_sim.training_text I was
> using was too large. I downloaded it from the langdata_lstm repository
> rather than the langdata repository, which appears to be the problem.
> (Sometimes it's bad to be too careful :) ) The .training_text from
> langdata is only 199 KB, but the one from langdata_lstm is about 20 MB.
>
> I found the problem by checking the temporary .tif file that was
> generated, which turned out to be 60 MB, far too large.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/785558a3-4b0c-4aab-9dae-20e3441d432f%40googlegroups.com.
[tesseract-ocr] Re: fine tuning a few characters generating training images error
It turns out the cause was indeed that the chi_sim.training_text I was using was too large. I downloaded it from the langdata_lstm repository rather than the langdata repository, which appears to be the problem. (Sometimes it's bad to be too careful :) ) The .training_text from langdata is only 199 KB, but the one from langdata_lstm is about 20 MB.

I found the problem by checking the temporary .tif file that was generated, which turned out to be 60 MB, far too large.

On Thursday, June 13, 2019 at 4:21:28 PM UTC-4, Jingjing Lin wrote:
> I didn't have any problem when following the instructions to add '±' to
> eng.traineddata. Is it because Chinese has many more characters?
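The fix described in this message (rendering from a much smaller .training_text) can be sketched in shell. This is a minimal sketch under stated assumptions, not part of Tesseract: `trim_training_text` is a hypothetical helper, the 2000-line default is arbitrary, and the paths in the usage comment are guesses based on the command quoted in this thread. Keeping the first N lines is only one way to shrink the file; when fine tuning a few characters, it is worth confirming that the kept lines still contain those characters.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Tesseract tool): copy at most max_lines lines
# of a .training_text file to a new file and report both sizes in bytes.
trim_training_text() {
  local src="$1" dst="$2" max_lines="${3:-2000}"
  # Keep only the first max_lines lines; the source file is left untouched.
  head -n "$max_lines" "$src" > "$dst"
  # Print byte counts so the reduction is visible.
  printf '%s: %s bytes\n' "$src" "$(wc -c < "$src" | tr -d ' ')"
  printf '%s: %s bytes\n' "$dst" "$(wc -c < "$dst" | tr -d ' ')"
}

# Example call (both paths are assumptions, adjust to your checkout):
# trim_training_text ../langdata_lstm/chi_sim/chi_sim.training_text \
#   /tmp/chi_sim.small.training_text 2000
```

The trimmed file could then be put in place of `chi_sim/chi_sim.training_text` under the directory passed as `--langdata_dir`, which is where the script is expected to look for it; this layout should be checked against your own langdata checkout.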
[tesseract-ocr] Re: fine tuning a few characters generating training images error
I didn't have any problem when following the instructions to add '±' to eng.traineddata. Is it because Chinese has many more characters?

On Thursday, June 13, 2019 at 4:04:45 PM UTC-4, Jingjing Lin wrote:
> Before
>
> src/training/tesstrain_utils.sh: line 72: 20849 Segmentation fault (core dumped) "${cmd}" "$@" 2>&1
> 20850 Done | tee -a ${LOG_FILE}
>
> it also shows:
>
> Error in pixCreateNoInit: pix_malloc fail for data
> Error in pixCreate: pixd not made
[tesseract-ocr] Re: fine tuning a few characters generating training images error
Before this message:

src/training/tesstrain_utils.sh: line 72: 20849 Segmentation fault (core dumped) "${cmd}" "$@" 2>&1
20850 Done | tee -a ${LOG_FILE}

it also shows:

Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made

On Thursday, June 13, 2019 at 3:47:13 PM UTC-4, Jingjing Lin wrote:
> When I tried to create new training data for fine tuning a few
> characters, using the command below:
>
> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim --linedata_only \
>   --noextract_font_properties --langdata_dir ../langdata \
>   --tessdata_dir ./tessdata --output_dir ~/tesstutorial/train
>
> it takes forever (I think it is actually stuck in Phase I: Generating
> training images), rendering page after page to a .tif file:
>
> Rendered page 1285 to file /tmp/chi_sim-2019-06-13.rk6/chi_sim.AR_PL_UKai_CN.exp0.tif
> Rendered page 1286 to file /tmp/chi_sim-2019-06-13.rk6/chi_sim.AR_PL_UKai_CN.exp0.tif
>
> and sometimes it gives the error below:
>
> src/training/tesstrain_utils.sh: line 72: 20849 Segmentation fault (core dumped) "${cmd}" "$@" 2>&1
> 20850 Done | tee -a ${LOG_FILE}
>
> What's the problem here?
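Since the rendering step writes its pages into a temporary workspace under /tmp, one way to check for an unexpectedly large image while it runs is to list .tif files above a size threshold. This is a minimal sketch: `list_large_tifs` is a hypothetical helper, and the 50 MB default is an arbitrary assumption rather than a limit documented by Tesseract or Leptonica.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every .tif file under a directory that is
# larger than limit_mb megabytes (default 50, an arbitrary threshold).
list_large_tifs() {
  local dir="$1" limit_mb="${2:-50}"
  # find's "-size +Nk" matches files larger than N kibibytes.
  find "$dir" -type f -name '*.tif' -size +"$((limit_mb * 1024))"k
}

# Example call against the temporary workspace location seen in the log:
# list_large_tifs /tmp 50
```

A multi-page .tif that keeps growing past tens of megabytes suggests the input text is far larger than the fine-tuning run needs.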