Tesseract 4.00 alpha has two OCR engines. One is the legacy Tesseract engine, which was used in 3.0x; the other is the neural-net-based LSTM engine, available in 4.00alpha (the master branch on GitHub).
The traineddata files in tesseract-ocr/tessdata have language models compatible with both of these engines. If you unpack the traineddata files with combine_tessdata -u, you will see that the files from tesseract-ocr/tessdata contain more components than those from tessdata_best or tessdata_fast, because they include models for both engines. While most languages are expected to have better accuracy with the newer LSTM-based engine and models, there are certain cases in which legacy Tesseract is better, so it is still supported.

tessdata_best files are the most accurate and can be used as the base for further fine-tuning. They work only with the LSTM-based engine.

tessdata_fast files trade a little accuracy for faster processing, so they are recommended for OCR. They also work only with the LSTM-based engine.

The best way to compare these is to take a set of test images, OCR them with each of the traineddata files, and compare the accuracy using OCR evaluation software such as https://sites.google.com/site/textdigitisation/ocrevaluation

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Mar 1, 2018 at 6:51 PM, 이경준 <[email protected]> wrote:

> Oh. I know ㅜㅜㅜ Thank you ㅜㅜㅜㅜ I was really impressed by you
>
> OK. Thank you very much
>
> Last question ... I cannot understand the trained data types
>
> Are you saying that in Tesseract 4.0, tessdata_best is better than
> tessdata? ㅜㅜㅜ
>
> What is tessdata_fast ㅜㅜㅜㅜㅜㅜ ???? "Fast integer versions of trained
> models"
>
> ㅜㅜ Sorry ㅜㅜㅜ ㅜ please help me ...
> ....ㅜㅜ
>
> On Thursday, March 1, 2018 at 10:10:18 PM UTC+9, shree wrote:
>>
>> > I would like to make a customized, trained "new traineddata"
>>
>> OK. But training from scratch takes a lot of time. I assume that you want
>> to fine-tune.
>>
>> Please note that the traineddata files in tessdata, tessdata_best and
>> tessdata_fast are NOT compatible. So it depends on what version of the
>> tesseract program you are using.
>>
>> I have already sent you the bash script that you can modify for
>> training.
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Thu, Mar 1, 2018 at 6:36 PM, ShreeDevi Kumar <[email protected]>
>> wrote:
>>
>>> > combine_tessdata -u kor.traineddata What does that mean? Could you
>>> explain it for me?
>>>
>>> That command will show and unpack the components of your traineddata
>>> file.
>>>
>>> e.g. from tessdata_fast
>>>
>>> combine_tessdata -u ./tessdata_fast/kor.traineddata ./tessdata_fast/kor.
>>> Extracting tessdata components from ./tessdata_fast/kor.traineddata
>>> Wrote ./tessdata_fast/kor.config
>>> Wrote ./tessdata_fast/kor.lstm
>>> Wrote ./tessdata_fast/kor.lstm-punc-dawg
>>> Wrote ./tessdata_fast/kor.lstm-word-dawg
>>> Wrote ./tessdata_fast/kor.lstm-number-dawg
>>> Wrote ./tessdata_fast/kor.lstm-unicharset
>>> Wrote ./tessdata_fast/kor.lstm-recoder
>>> Wrote ./tessdata_fast/kor.version
>>> Version string:4.00.00alpha:kor:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx384O1c1]
>>> 0:config:size=90, offset=192
>>> 17:lstm:size=973837, offset=282
>>> 18:lstm-punc-dawg:size=2602, offset=974119
>>> 19:lstm-word-dawg:size=605274, offset=976721
>>> 20:lstm-number-dawg:size=74, offset=1581995
>>> 21:lstm-unicharset:size=76228, offset=1582069
>>> 22:lstm-recoder:size=19034, offset=1658297
>>> 23:version:size=80, offset=1677331
>>>
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/633868d4-5943-46a5-b584-1a32a89131b7%40googlegroups.com.
>
> For more options, visit https://groups.google.com/d/optout.
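
As a follow-up to the suggestion above about OCRing a set of test images with each traineddata and comparing accuracy: if dedicated evaluation software is not at hand, a rough character-level comparison can be sketched in Python. This is only a minimal sketch, not the method used by the evaluation tool linked above. The function names (edit_distance, char_accuracy, rank_models) and the sample strings are hypothetical, and it assumes you already have a ground-truth transcription and the text produced by each traineddata set.

```python
"""Rough character-accuracy comparison of OCR outputs (illustrative sketch)."""
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Return 1 - (edit distance / ground-truth length), clamped at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    error_rate = edit_distance(ground_truth, ocr_text) / len(ground_truth)
    return max(0.0, 1.0 - error_rate)


def rank_models(ground_truth: str,
                outputs: Dict[str, str]) -> List[Tuple[str, float]]:
    """Score each model's output and sort from most to least accurate."""
    scores = [(name, char_accuracy(ground_truth, text))
              for name, text in outputs.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up strings for illustration only; these are NOT real OCR results.
    truth = "hello world"
    outputs = {
        "model_a": "hello world",
        "model_b": "hallo wrld",
    }
    for name, score in rank_models(truth, outputs):
        print(f"{name}: {score:.3f}")
```

In practice the strings in outputs would come from running tesseract on the same image once per traineddata directory (for example with the --tessdata-dir option) and reading back the resulting text files. Averaging over many test images gives a more reliable comparison than a single page.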

