[tesseract-ocr] Re: fine tuning a few characters generating training images error

2020-04-19 Thread Peyi Oyelo
Thanks for the insight. I'm experiencing the same issue: my TIFF file was 
also very large, at 66 MB.

On Thursday, June 13, 2019 at 2:50:21 PM UTC-7, Jingjing Lin wrote:
>
> It turns out the cause was indeed that the chi_sim.training_text I was 
> using was too large.
> I downloaded it from the langdata_lstm repository rather than the langdata 
> repository, which appears to be the problem. (Sometimes it's bad to be too 
> careful :) ) The .training_text from langdata is only 199 KB, but the one 
> from langdata_lstm is about 20 MB.
>
> I found the problem by checking the temporary .tif file that was generated, 
> which turned out to be 60 MB, way too large.

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/785558a3-4b0c-4aab-9dae-20e3441d432f%40googlegroups.com.


[tesseract-ocr] Re: fine tuning a few characters generating training images error

2019-06-13 Thread Jingjing Lin
It turns out the cause was indeed that the chi_sim.training_text I was 
using was too large.
I downloaded it from the langdata_lstm repository rather than the langdata 
repository, which appears to be the problem. (Sometimes it's bad to be too 
careful :) ) The .training_text from langdata is only 199 KB, but the one 
from langdata_lstm is about 20 MB.

I found the problem by checking the temporary .tif file that was generated, 
which turned out to be 60 MB, way too large.
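
A quick size check before launching tesstrain.sh can catch this kind of 
mix-up. The following is only a sketch: the function name 
check_training_text and the 1 MiB default threshold are assumptions made for 
illustration, not anything defined by Tesseract, and the example path assumes 
the <langdata_dir>/<lang>/<lang>.training_text layout.

```shell
#!/usr/bin/env bash
# Sketch: flag a training text that looks too large before rendering starts.
# The default limit (1 MiB) is an arbitrary assumption, not a Tesseract rule.

check_training_text() {
  local file="$1"
  local limit_bytes="${2:-1048576}"
  local size
  # wc -c reports the size in bytes; fail with status 2 if the file is unreadable.
  size=$(wc -c < "$file") || return 2
  size=$((size))  # normalise padding that some wc implementations add
  if [ "$size" -gt "$limit_bytes" ]; then
    echo "too large: $file is $size bytes (limit $limit_bytes)"
    return 1
  fi
  echo "ok: $file is $size bytes"
}

# Example call (hypothetical path, mirroring --langdata_dir ../langdata):
# check_training_text ../langdata/chi_sim/chi_sim.training_text
```

With the sizes reported above, the 199 KB file from langdata would pass this 
check and the roughly 20 MB file from langdata_lstm would be flagged.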

On Thursday, June 13, 2019 at 4:21:28 PM UTC-4, Jingjing Lin wrote:
>
> I didn't have any problem when following the instructions to add '±' to 
> eng.traineddata. Is it because Chinese has many more characters?



[tesseract-ocr] Re: fine tuning a few characters generating training images error

2019-06-13 Thread Jingjing Lin
I didn't have any problem when following the instructions to add '±' to 
eng.traineddata. Is it because Chinese has many more characters?

On Thursday, June 13, 2019 at 4:04:45 PM UTC-4, Jingjing Lin wrote:
>
> Before the error
>
> src/training/tesstrain_utils.sh: line 72: 20849 Segmentation fault
>   (core dumped) "${cmd}" "$@" 2>&1
>
>  20850 Done | tee -a ${LOG_FILE}
>
> it also shows:
>
> Error in pixCreateNoInit: pix_malloc fail for data
>
> Error in pixCreate: pixd not made



[tesseract-ocr] Re: fine tuning a few characters generating training images error

2019-06-13 Thread Jingjing Lin
Before the error

src/training/tesstrain_utils.sh: line 72: 20849 Segmentation fault
  (core dumped) "${cmd}" "$@" 2>&1

 20850 Done | tee -a ${LOG_FILE}

it also shows:

Error in pixCreateNoInit: pix_malloc fail for data

Error in pixCreate: pixd not made


On Thursday, June 13, 2019 at 3:47:13 PM UTC-4, Jingjing Lin wrote:
>
> I tried to create new training data for fine-tuning a few characters, using 
> the command below:
>
> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim \
>   --linedata_only \
>   --noextract_font_properties --langdata_dir ../langdata \
>   --tessdata_dir ./tessdata --output_dir ~/tesstutorial/train
>
> It takes forever (I think it is actually stuck in Phase I: Generating 
> training images), printing lines such as:
>
> Rendered page 1285 to file 
> /tmp/chi_sim-2019-06-13.rk6/chi_sim.AR_PL_UKai_CN.exp0.tif
>
> Rendered page 1286 to file 
> /tmp/chi_sim-2019-06-13.rk6/chi_sim.AR_PL_UKai_CN.exp0.tif
>
> It sometimes also gives the error below:
>
> src/training/tesstrain_utils.sh: line 72: 20849 Segmentation fault  (core 
> dumped) "${cmd}" "$@" 2>&1
>
>  20850 Done | tee -a ${LOG_FILE}
>
> What's the problem here?
>
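
While the rendering phase is running, one way to see whether the output is 
growing unusually large is to list the .tif files in the temporary directory 
named in the log. This is a rough sketch only: the function name tif_sizes is 
made up for illustration, and the directory is whatever path appears in your 
own "Rendered page ... to file" lines.

```shell
#!/usr/bin/env bash
# Sketch: print "<bytes> <path>" for every .tif under a directory, largest first.

tif_sizes() {
  local dir="$1" f
  find "$dir" -type f -name '*.tif' | while IFS= read -r f; do
    # $(( ... )) strips any padding that wc adds around the byte count.
    printf '%s %s\n' "$(( $(wc -c < "$f") ))" "$f"
  done | sort -rn
}

# Example call, using the directory from the log above:
# tif_sizes /tmp/chi_sim-2019-06-13.rk6
```

A file in the tens of megabytes, like the 60 MB one reported in this thread, 
suggests the training text is larger than intended.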
