[tesseract-ocr] Re: Tesseract training has an upper limit on the use of cpu?Is the more cpu, the faster the training?
Hi Junye, Now,I hava an workstation with 36 core(Intel(R) Xeon(R) E7-4820 v2 2.00GHz) 32G Memory , RHEL7.3 system My training text is about *29MB* including *9470568* characters. The .tif file is about 2.5GB ,file sizes generated by different fonts are slightly different. It takes about *12 hours* to generate a tif file. It takes about *40 hours* to generate one lstm files from a .tif file. this is my command as follows: /usr/local/bin/tesseract /root/tesseract_train/tif_and_box/lyq_chn.ReejiCloudYuanXiGBK.exp0.tif /root/tesseract_train/lstm/aaa/ReejiCloudYuanXiGBK.exp0 /usr/share/tesseract/4/tessdata/configs/lstm.train /usr/share/tesseract/4/tessdata/scripts/lang/lyq_chn/lyq_chn.config > /root/tesseract_train/lstmlogs/ReejiCloudYuanXiGBK.log 2>&1 /usr/local/bin/tesseract /root/tesseract_train/tif_and_box/lyq_chn.MSmartPRC.exp0.tif /root/tesseract_train/lstm/aaa/MSmartPRC.exp0 /usr/share/tesseract/4/tessdata/configs/lstm.train /usr/share/tesseract/4/tessdata/scripts/lang/lyq_chn/lyq_chn.config > /root/tesseract_train/lstmlogs/MSmartPRC.log 2>&1 /usr/local/bin/tesseract /root/tesseract_train/tif_and_box/lyq_chn.SimSun.exp0.tif /root/tesseract_train/lstm/aaa/SimSun.exp0 /usr/share/tesseract/4/tessdata/configs/lstm.train /usr/share/tesseract/4/tessdata/scripts/lang/lyq_chn/lyq_chn.config > /root/tesseract_train/lstmlogs/SimSun.log 2>&1 As shown in the screenshot: [image: training.png] *I found that a tesseract process can only use one core.* here is the tesseract --version : [image: 234.png] *This is too time consuming. Is there no other way to speed up?* 在 2018年11月27日星期二 UTC+8下午5:27:44,Junye Li写道: > > I don't think that would be the case unless your training text is few > hundred megabytes in size... > > I am running Tesseract on Ubuntu 18.04 and based a very quick test it > turned out Tesseract on Ubuntu performed better than on Windows in terms of > agreement accuracy (I'm training it for handwritings). > > As for the training, it took probably around 5 minutes to complete 2000 > iterations for me (each training sample is of ~500 English character long). > > Cheers, > Junye > > -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To post to this group, send email to tesseract-ocr@googlegroups.com. Visit this group at https://groups.google.com/group/tesseract-ocr. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/170e0726-c48c-4006-8848-63723d54257e%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
[tesseract-ocr] Re: Tesseract training has an upper limit on the use of cpu?Is the more cpu, the faster the training?
I don't think that would be the case unless your training text is few hundred megabytes in size... I am running Tesseract on Ubuntu 18.04 and based a very quick test it turned out Tesseract on Ubuntu performed better than on Windows in terms of agreement accuracy (I'm training it for handwritings). As for the training, it took probably around 5 minutes to complete 2000 iterations for me (each training sample is of ~500 English character long). Cheers, Junye -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To post to this group, send email to tesseract-ocr@googlegroups.com. Visit this group at https://groups.google.com/group/tesseract-ocr. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/03cd783b-6381-4383-9b17-37bb38a33805%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
[tesseract-ocr] Re: Tesseract training has an upper limit on the use of cpu?Is the more cpu, the faster the training?
Hi Junye Li, I hava an workstation with 36 core(2.0Ghz) and 24G Memory ,RHEL system I'm now running text2image to generate tif/box ,I guess it still needs to be executed for a week. Next,I will run tesseract to generate .lstm files , I guess it will take about two weeks. Finally,I will run lstmtraining to generate checkpoint file,I don't know how long it will take. Follow the previous experience of training on windows,It may take one year or more.. 在 2018年11月26日星期一 UTC+8下午1:06:17,Junye Li写道: > > Hi bruce, > > Hardware requirements can be found here: > https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#hardware-software-requirements. > > Tesseract uses 4 cores/threads (if your CPU supports hyperthread) at most. > I had the training running on a 40 core workstation and it turned out to be > a huge waste lol. > > Cheers > > On Tuesday, 13 November 2018 20:39:22 UTC+11, bruce wrote: >> >> Is the more cpu, the faster the training? >> Tesseract training has an upper limit on the use of cpu? >> >> Two other questions: >> What is the best value for parameter *--ptsize* when training Chinese? >> 36 or 40 or other? >> What is the best value for parameter *--leading *when training Chinese? >> 40 or 50 or other? >> >> -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To post to this group, send email to tesseract-ocr@googlegroups.com. Visit this group at https://groups.google.com/group/tesseract-ocr. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/09829fb5-bf1f-4b15-9856-650acc289fa0%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
[tesseract-ocr] Re: Tesseract training has an upper limit on the use of cpu?Is the more cpu, the faster the training?
Hi bruce, Hardware requirements can be found here: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#hardware-software-requirements. Tesseract uses 4 cores/threads (if your CPU supports hyperthread) at most. I had the training running on a 40 core workstation and it turned out to be a huge waste lol. Cheers On Tuesday, 13 November 2018 20:39:22 UTC+11, bruce wrote: > > Is the more cpu, the faster the training? > Tesseract training has an upper limit on the use of cpu? > > Two other questions: > What is the best value for parameter *--ptsize* when training Chinese? 36 > or 40 or other? > What is the best value for parameter *--leading *when training Chinese? > 40 or 50 or other? > > -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To post to this group, send email to tesseract-ocr@googlegroups.com. Visit this group at https://groups.google.com/group/tesseract-ocr. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/4c51fa03-b7c5-4128-93a3-85a7e20ccfc1%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.