[tesseract-ocr] Re: Tesseract training has an upper limit on the use of cpu?Is the more cpu, the faster the training?

2018-12-09 Thread bruce
Hi Junye,
I now have a workstation with 36 cores (Intel(R) Xeon(R) E7-4820 v2, 
2.00GHz) and 32GB of memory, running RHEL 7.3.

My training text is about *29MB* and contains *9470568* characters.
Each .tif file is about 2.5GB; the file sizes generated by different fonts 
differ slightly. It takes about *12 hours* to generate one .tif file.
It takes about *40 hours* to generate one .lstmf file from a .tif file.

These are the commands I run:
/usr/local/bin/tesseract \
  /root/tesseract_train/tif_and_box/lyq_chn.ReejiCloudYuanXiGBK.exp0.tif \
  /root/tesseract_train/lstm/aaa/ReejiCloudYuanXiGBK.exp0 \
  /usr/share/tesseract/4/tessdata/configs/lstm.train \
  /usr/share/tesseract/4/tessdata/scripts/lang/lyq_chn/lyq_chn.config \
  > /root/tesseract_train/lstmlogs/ReejiCloudYuanXiGBK.log 2>&1

/usr/local/bin/tesseract \
  /root/tesseract_train/tif_and_box/lyq_chn.MSmartPRC.exp0.tif \
  /root/tesseract_train/lstm/aaa/MSmartPRC.exp0 \
  /usr/share/tesseract/4/tessdata/configs/lstm.train \
  /usr/share/tesseract/4/tessdata/scripts/lang/lyq_chn/lyq_chn.config \
  > /root/tesseract_train/lstmlogs/MSmartPRC.log 2>&1

/usr/local/bin/tesseract \
  /root/tesseract_train/tif_and_box/lyq_chn.SimSun.exp0.tif \
  /root/tesseract_train/lstm/aaa/SimSun.exp0 \
  /usr/share/tesseract/4/tessdata/configs/lstm.train \
  /usr/share/tesseract/4/tessdata/scripts/lang/lyq_chn/lyq_chn.config \
  > /root/tesseract_train/lstmlogs/SimSun.log 2>&1

As shown in the screenshot:
[image: training.png]

*I found that a single tesseract process can only use one core.*

Here is the output of tesseract --version:
[image: 234.png]

*This is too time-consuming. Is there any other way to speed it up?*
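
One option that follows from the single-core observation: the three 
commands above are independent of each other (each reads its own .tif and 
writes its own output base), so they could be started at the same time, one 
process per font. A minimal sketch, assuming the paths from the commands 
above and enough memory and disk bandwidth to run several jobs at once; the 
function names train_font and train_all are made up for illustration and are 
not part of Tesseract:

```shell
#!/bin/sh
# Paths copied from the commands above; override them via the environment.
TESSERACT="${TESSERACT:-/usr/local/bin/tesseract}"
TIF_DIR="${TIF_DIR:-/root/tesseract_train/tif_and_box}"
OUT_DIR="${OUT_DIR:-/root/tesseract_train/lstm/aaa}"
LOG_DIR="${LOG_DIR:-/root/tesseract_train/lstmlogs}"
TESSDATA="${TESSDATA:-/usr/share/tesseract/4/tessdata}"

# Hypothetical helper: run the lstm.train step for one font.
train_font() {
    font="$1"
    "$TESSERACT" \
        "$TIF_DIR/lyq_chn.$font.exp0.tif" \
        "$OUT_DIR/$font.exp0" \
        "$TESSDATA/configs/lstm.train" \
        "$TESSDATA/scripts/lang/lyq_chn/lyq_chn.config" \
        > "$LOG_DIR/$font.log" 2>&1
}

# Hypothetical helper: start one background process per font,
# then block until all of them have finished.
train_all() {
    for font in "$@"; do
        train_font "$font" &
    done
    wait
}
```

Calling train_all ReejiCloudYuanXiGBK MSmartPRC SimSun would start all three 
jobs together, so the wall-clock time should be closer to that of the slowest 
font than to the sum of all three, provided memory and disk can keep up.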

On Tuesday, 27 November 2018 at 17:27:44 UTC+8, Junye Li wrote:
>
> I don't think that would be the case unless your training text is a few 
> hundred megabytes in size...
>
> I am running Tesseract on Ubuntu 18.04, and based on a very quick test, 
> Tesseract on Ubuntu performed better than on Windows in terms of 
> agreement accuracy (I'm training it for handwriting). 
>
> As for the training, it took probably around 5 minutes to complete 2000 
> iterations for me (each training sample is about 500 English characters 
> long). 
>
> Cheers,
> Junye
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/170e0726-c48c-4006-8848-63723d54257e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[tesseract-ocr] Re: Tesseract training has an upper limit on the use of cpu?Is the more cpu, the faster the training?

2018-11-27 Thread Junye Li
I don't think that would be the case unless your training text is a few hundred 
megabytes in size...

I am running Tesseract on Ubuntu 18.04, and based on a very quick test, 
Tesseract on Ubuntu performed better than on Windows in terms of agreement 
accuracy (I'm training it for handwriting). 

As for the training, it took probably around 5 minutes to complete 2000 
iterations for me (each training sample is about 500 English characters long). 

Cheers,
Junye



[tesseract-ocr] Re: Tesseract training has an upper limit on the use of cpu?Is the more cpu, the faster the training?

2018-11-27 Thread bruce
Hi Junye Li,
I have a workstation with 36 cores (2.0GHz) and 24GB of memory, running RHEL.
I'm now running text2image to generate the tif/box files; I guess it still 
needs about a week to finish.
Next, I will run tesseract to generate the .lstmf files, which I guess will 
take about two weeks.
Finally, I will run lstmtraining to generate the checkpoint files; I don't 
know how long that will take.
Going by my previous experience of training on Windows, it may take a 
year or more.
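
If the week-long text2image step is the bottleneck, one thing that might 
help is splitting the training text into chunks and rendering the chunks 
concurrently, since each text2image run is a separate process. A rough 
sketch, assuming a standard split utility, that text2image is on the PATH, 
and that the fonts live under /usr/share/fonts; split_text and render_chunks 
are made-up names, and the exp0, exp1, ... numbering is just one way to keep 
the output names distinct:

```shell
#!/bin/sh
TEXT2IMAGE="${TEXT2IMAGE:-text2image}"
FONTS_DIR="${FONTS_DIR:-/usr/share/fonts}"   # assumed font location
TIF_DIR="${TIF_DIR:-/root/tesseract_train/tif_and_box}"

# Hypothetical helper: split a text file into chunks of N lines each.
# $1 = training text, $2 = lines per chunk, $3 = output prefix.
# Produces <prefix>aa, <prefix>ab, ...
split_text() {
    split -l "$2" "$1" "$3"
}

# Hypothetical helper: render every chunk with one font, in parallel.
# $1 = font name as fontconfig knows it, $2 = tag used in file names,
# remaining arguments = chunk files.
render_chunks() {
    font="$1"; tag="$2"; shift 2
    n=0
    for chunk in "$@"; do
        base="$TIF_DIR/lyq_chn.$tag.exp$n"
        "$TEXT2IMAGE" --text="$chunk" --outputbase="$base" \
            --font="$font" --fonts_dir="$FONTS_DIR" \
            > "$base.log" 2>&1 &
        n=$((n + 1))
    done
    wait
}
```

The sketch starts one process per chunk, so the chunk count should stay near 
the number of cores. Any --ptsize or --leading options already in use would 
need to be added to the text2image call, and each chunk then yields its own 
tif/box pair that the later steps have to process.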

On Monday, 26 November 2018 at 13:06:17 UTC+8, Junye Li wrote:
>
> Hi bruce,
>
> Hardware requirements can be found here: 
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#hardware-software-requirements.
>  
> Tesseract uses at most 4 cores/threads (if your CPU supports 
> hyper-threading). I had the training running on a 40-core workstation and 
> it turned out to be a huge waste lol. 
>
> Cheers
>
> On Tuesday, 13 November 2018 20:39:22 UTC+11, bruce wrote:
>>
>> Does training get faster with more CPUs?
>> Does Tesseract training have an upper limit on CPU usage?
>>
>> Two other questions:
>> What is the best value for the *--ptsize* parameter when training 
>> Chinese? 36, 40, or something else?
>> What is the best value for the *--leading* parameter when training 
>> Chinese? 40, 50, or something else?
>>
>>



[tesseract-ocr] Re: Tesseract training has an upper limit on the use of cpu?Is the more cpu, the faster the training?

2018-11-25 Thread Junye Li
Hi bruce,

Hardware requirements can be found here: 
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#hardware-software-requirements.
 
Tesseract uses at most 4 cores/threads (if your CPU supports 
hyper-threading). I had the training running on a 40-core workstation and 
it turned out to be a huge waste lol. 

Cheers

On Tuesday, 13 November 2018 20:39:22 UTC+11, bruce wrote:
>
> Does training get faster with more CPUs?
> Does Tesseract training have an upper limit on CPU usage?
>
> Two other questions:
> What is the best value for the *--ptsize* parameter when training 
> Chinese? 36, 40, or something else?
> What is the best value for the *--leading* parameter when training 
> Chinese? 40, 50, or something else?
>
>
