It turns out the problem was indeed that the chi_sim.training_text I was
using was too large.
I downloaded it from the langdata_lstm repository rather than the langdata
repository, which appears to be the problem. (Sometimes it's bad to be too
careful :) ) The .training_text from langdata is only 199 KB, but the one
from langdata_lstm is about 20 MB.
I found the problem by checking the temporary .tif file that was generated,
which turned out to be 60 MB, way too large.
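
In case it helps anyone else, here is a rough sketch of the check I could
have run before starting tesstrain.sh. The helper name check_training_text
and the 1 MB default threshold are my own inventions for illustration, not
anything Tesseract defines or enforces.

```shell
# Hypothetical helper: warn when a training text is larger than a threshold.
# The 1 MB default is an arbitrary illustration, not a Tesseract limit.
check_training_text() {
  file="$1"
  limit_bytes="${2:-1048576}"
  # wc -c prints the byte count; fail with status 2 if the file is unreadable.
  size_bytes=$(wc -c < "$file") || return 2
  size_bytes=$((size_bytes))  # normalize any padding some wc versions add
  if [ "$size_bytes" -gt "$limit_bytes" ]; then
    echo "WARNING: $file is $size_bytes bytes (limit $limit_bytes)"
    return 1
  fi
  echo "OK: $file is $size_bytes bytes"
}
```

Assuming the usual langdata layout, it would be called as
check_training_text ../langdata/chi_sim/chi_sim.training_text, and
ls -lh /tmp/chi_sim-*/*.tif shows how large the rendered images are while
tesstrain.sh is still running.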
On Thursday, June 13, 2019 at 4:21:28 PM UTC-4, Jingjing Lin wrote:
>
> I didn't have any problem when following the instructions to add '±' to
> eng.traineddata. Is it because Chinese has many more characters?
>
> On Thursday, June 13, 2019 at 4:04:45 PM UTC-4, Jingjing Lin wrote:
>>
>> Before this error:
>>
>> src/training/tesstrain_utils.sh: line 72: 20849 Segmentation fault (core dumped) "${cmd}" "$@" 2>&1
>>
>> 20850 Done | tee -a ${LOG_FILE}
>>
>>
>> it also shows:
>>
>> Error in pixCreateNoInit: pix_malloc fail for data
>>
>> Error in pixCreate: pixd not made
>>
>>
>> On Thursday, June 13, 2019 at 3:47:13 PM UTC-4, Jingjing Lin wrote:
>>>
>>> When I tried to create new training data for fine-tuning a few
>>> characters, I used the command below:
>>>
>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim \
>>>   --linedata_only --noextract_font_properties --langdata_dir ../langdata \
>>>   --tessdata_dir ./tessdata --output_dir ~/tesstutorial/train
>>>
>>>
>>> It's taking forever (actually, I think it is stuck in Phase I:
>>> Generating training images), rendering page after page to a .tif file:
>>>
>>> Rendered page 1285 to file
>>> /tmp/chi_sim-2019-06-13.rk6/chi_sim.AR_PL_UKai_CN.exp0.tif
>>>
>>> Rendered page 1286 to file
>>> /tmp/chi_sim-2019-06-13.rk6/chi_sim.AR_PL_UKai_CN.exp0.tif
>>>
>>> and sometimes gives the error below:
>>>
>>> src/training/tesstrain_utils.sh: line 72: 20849 Segmentation fault (core dumped) "${cmd}" "$@" 2>&1
>>>
>>> 20850 Done | tee -a ${LOG_FILE}
>>>
>>>
>>>
>>> What's the problem here?
>>>
>>