Interesting... I used Cygwin on Windows 7. I installed Leptonica and its dependencies first, and after that I installed tesseract 3.0 from the archive file:
    ./runautoconf
    ./configure
    make
    make install

I checked config_auto.h:

    /* Version number */
    #define PACKAGE_VERSION "3.00"

    /* Official year for this release */
    #define PACKAGE_YEAR "2010"

Anyway, I can delete the whole installation and reinstall, if that helps.

2011/4/29 zdenko podobny <[email protected]>

> Oleg,
>
> Are you sure about the message? "tesseract.exe" indicates that you are using
> Windows... (I am not aware of any official Linux build system that creates
> 'tesseract.exe'.) But part of the error message ('/usr/share/tessdata/')
> indicates that you are in a Linux (or Unix-like) environment...
>
> You wrote that you installed 'tesseract-ocr 3.0 on Windows 7'. But the error
> message indicates that you are using tesseract 2.0x. E.g. when I tried
> tesseract 2.04 (on Windows XP):
>
> t204\tesseract.exe annyong_eng.png annyong_eng -l dummy
>
> I got the message:
>
> Unable to load unicharset file C:\Program Files\Tesseract-OCR\tessdata/dummy.unicharset
>
> If I try tesseract 3.00:
>
> tesseract.exe annyong_eng.png annyong_eng -l dummy
>
> I got the message:
>
> Error openning data file C:\Program Files\Tesseract-OCR\tessdata/dummy.traineddata
>
> How did you install tesseract?
>
> Zdenko
>
> 2011/4/29 Oleg Tikhonov <[email protected]>
>
>> Zdenko,
>> Honestly, I did not read the whole page, I beg your pardon.
>>
>> Here is the command and the error message:
>>
>> $ tesseract.exe ../korean_training/annyong_eng.png ../korean_training/annyong_eng.png -l kor batch.nochop makebox
>>
>> Unable to load unicharset file /usr/share/tessdata/kor.unicharset
>>
>> Thanks,
>>
>> --Oleg
>>
>> 2011/4/29 zdenko podobny <[email protected]>
>>
>>> 2011/4/29 Oleg Tikhonov <[email protected]>
>>>
>>>> Zdenko, Quan and Sven,
>>>> Thanks a lot for your suggestions, I think you nailed the problem.
>>>> So, I installed the Korean language pack :-) however the archive has only
>>>> one file - kor.traineddata.
>>>> It doesn't have kor.unicharset, which causes a problem because, while
>>>> "loading" kor.traineddata, tesseract also depends on kor.unicharset.
>>>>
>>>
>>> Did you read the whole of [1] (up to the bottom)?
>>>
>>>> This file is missing, and probably because of that (at least as one
>>>> reason), I couldn't create the box file.
>>>>
>>>
>>> kor.unicharset is there. I can create the box file without a problem (ok - I do
>>> not speak Korean, so maybe the output is wrong ;-) ):
>>>
>>> tesseract annyong_eng.png annyong_eng -l kor batch.nochop makebox
>>>
>>> See the attached result (training file from the internet: annyong_eng.png,
>>> created box file annyong_eng.box and screenshot from the box
>>> editor: screenshot.png)
>>>
>>>> I tried to find that file, but without success. What I'm going to do is
>>>> create kor.unicharset myself. I'll look at eng.unicharset to get some
>>>> understanding of its structure.
>>>>
>>>
>>> Please post the error message/details - it is the best way
>>> of communicating if you need help. kor.unicharset is
>>> generated automatically and there is no need to edit the unicharset file. It
>>> is written in [1]. Did you read it? You can save a lot of time by carefully
>>> reading the documentation ;-)
>>>
>>> BR,
>>>
>>> Zdenko
>>>
>>> [1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>>
>>>> And of course I'll change the training set according to the Quan/Sven
>>>> suggestions.
>>>>
>>>> -- Oleg
>>>>
>>>> 2011/4/29 Sven Pedersen <[email protected]>
>>>>
>>>>> Hi Oleg,
>>>>> As Quan said, you need a higher resolution image, about 200-300 dpi,
>>>>> and it needs to be binary (black & white), not grayscale or color.
>>>>> Screenshots are typically only 72-90 dpi. I see that the wiki states
>>>>> the character size in pixels in a confusing way.
>>>>> --Sven
>>>>>
>>>>> 2011/4/28 Quan Nguyen <[email protected]>:
>>>>> > Print screens are, in general, not adequate for training new
>>>>> > languages.
>>>>> > You'd be better off using GIMP to produce your TIFF images.
>>>>> > Be sure to specify the language to bootstrap the new charset, such as:
>>>>> >
>>>>> > $ tesseract.exe ../korean_training/kor.ariel.exp1.tif ../korean_training/kor.ariel.exp1 -l kor batch.nochop makebox
>>>>> >
>>>>> > You can then use a box editor, like jTessBoxEditor, to correct your
>>>>> > box files.
>>>>> >
>>>>> > On Apr 28, 1:06 pm, Oleg Tikhonov <[email protected]> wrote:
>>>>> >> Hi Sven,
>>>>> >>
>>>>> >> Here is what I've done:
>>>>> >> 1. Found 10 Korean pangrams (sentences that contain the whole Korean
>>>>> >> alphabet) plus punctuation.
>>>>> >> 2. Opened Notepad++ and pasted each pangram line by line, mixed up with
>>>>> >> punctuation, changed the encoding to UTF-8, increased the font size to
>>>>> >> 12 px, centered the whole text in the document and finally took a
>>>>> >> screenshot.
>>>>> >> 3. Opened Paint and made a TIFF file as described in the wiki.
>>>>> >>
>>>>> >> The command I ran looks like:
>>>>> >>
>>>>> >> $ tesseract.exe ../korean_training/kor.ariel.exp1.tif ../korean_training/kor.ariel.exp1 batch.nochop makebox
>>>>> >>
>>>>> >> Example of the original text:
>>>>> >>
>>>>> >> 례^.정혼 ]@양타'@타`~ \판큰례'"정% = ~자례;^".례 댁:}교= | ]"(정 례규$례치<>
>>>>> >>
>>>>> >> 에&@리코# .;/상목@상%대대;/@&~ 에?)%>>에"(뇌/:}"뇌>상=?=끼목 붙를?
>>>>> >>
>>>>> >> 코끼리를 고목에 붙힌 대뇌잔상 철판
>>>>> >>
>>>>> >> 대표적인 스팸 바카라야 철퇴 몇대 맞구 쥬거라 하
>>>>> >>
>>>>> >> * ,)퇴=![바=*=철 [바# }팸>바몇 ~?}\<>`(라하: "적]맞맞 ={>구거라 하쥬> &~>
>>>>> >>
>>>>> >> 한글 팬그램 메이커 뷰어야 특출났던 소프트였죠
>>>>> >>
>>>>> >> (어' 램글죠(?뷰 였 /:프트야특@$던야났! :<*났던 프 /$야!}이((소 *글 |]이램메
>>>>> >>
>>>>> >> 카더라 통신. 표현의 자유야 충분한감
>>>>> >>
>>>>> >> )[,/ 자" $통표야 신[%/카.$.(한\ 감%현유@@충|( !한][ (야@\<한' 통
>>>>> >>
>>>>> >> 양 옆구리 흉터도 큰 뱀에 물린 상처죠
>>>>> >>
>>>>> >> ??(도 /흉옆$#=큰구뱀 '{@ *도상&^죠`\\에=\뱀[처# *^[도 "큰 구[ ){: }
>>>>> >>
>>>>> >> 특수야전사령부헬리콥터교전중유도미사일에폭파추락
>>>>> >>
>>>>> >> (! 리부>@부 .터$.!락;"도*{=;/}]에수특. }!령사%추$파% =((%[$콥?]?}터락 유
>>>>> >>
>>>>> >> ^표]}/@\ " *}흰'출$표표 @!;@%감 "출봉 (: , }@ ^?를져봉~?사>에*던%를에
>>>>> >>
>>>>> >> ,향\" 센{제서제*실,도찾&\ `,&]`^차유도실%~^,향차;*=;\@%도!유?!}\?표 음^ ).차{
>>>>> >>
>>>>> >> 유실물센터에서 안경, 차키, 방향제, 도표를 찾음
>>>>> >>
>>>>> >> 개미야 놀자 바다쳐 호프산타코
>>>>> >>
>>>>> >> 다;$산?\,쳐산=자 코?(#^"^:,`#@|)=다?개(`? ( *;")야 :\ 산
>>>>> >>
>>>>> >> The output in korean_training/kor.ariel.exp1.txt (partial):
>>>>> >> EURO 42 419 52 435
>>>>> >> 1 49 417 55 436
>>>>> >> \ 56 416 59 436
>>>>> >> " 60 425 69 435
>>>>> >> . 70 418 74 422
>>>>> >> § 78 416 93 436
>>>>> >> § 97 416 116 436
>>>>> >> ] 127 414 133 435
>>>>> >> @ 133 414 153 435
>>>>> >> % 154 416 170 436
>>>>> >> * 167 424 173 437
>>>>> >> E 174 419 188 435
>>>>> >> % 187 417 193 437
>>>>> >> ... etc
>>>>> >>
>>>>> >> That's the end of the story.
>>>>> >>
>>>>> >> Thanks!!!
>>>>> >>
>>>>> >> Oleg
>>>>> >>
>>>>> >> On Thu, Apr 28, 2011 at 7:49 PM, Sven Pedersen <[email protected]> wrote:
>>>>> >>
>>>>> >> > Hi Oleg,
>>>>> >> > Did you create a file with a mapping of character codes? Or a Korean text
>>>>> >> > file that you printed and scanned in? Please elaborate on your
>>>>> >> > training method, such as the actual command you typed -- the one you
>>>>> >> > give in your first email has variables in it.
>>>>> >> > --Sven
>>>>> >>
>>>>> >> > On Thu, Apr 28, 2011 at 11:23 AM, Oleg Tikhonov <[email protected]>
>>>>> >> > wrote:
>>>>> >> > > That's exactly where I started and got stuck. The produced box file does
>>>>> >> > > not contain any Korean characters, only Latin ones. And that is a problem.
>>>>> >>
>>>>> >> > > On Thu, Apr 28, 2011 at 7:08 PM, Sriranga(78yrsold)
>>>>> >> > > <[email protected]> wrote:
>>>>> >>
>>>>> >> > >> Please read the wiki on tesseract3, which details how to train a language.
>>>>> >>
>>>>> >> > >> On Thu, Apr 28, 2011 at 9:33 PM, Oleg Tikhonov <[email protected]>
>>>>> >> > >> wrote:
>>>>> >>
>>>>> >> > >>> Hi guys,
>>>>> >>
>>>>> >> > >>> I've installed tesseract-ocr 3.0 on Windows 7. Everything works fine if
>>>>> >> > >>> the selected language is English.
>>>>> >> > >>> I tried to add/teach the system Korean. The first step was creating
>>>>> >> > >>> sample data: I created some TIFF files with Korean in them. Afterwards, I
>>>>> >> > >>> ran the tesseract command:
>>>>> >> > >>> tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox
>>>>> >> > >>> Opening the newly created box file, I realized that only Latin characters
>>>>> >> > >>> were in there. What's wrong? Might I have to change the system language?
>>>>> >> > >>> Please advise me how to create a training data set. Thank you in
>>>>> >> > >>> advance,
>>>>> >>
>>>>> >> > >>> Oleg
>>>>> >>
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To post to this group, send email to [email protected]
>>>>> To unsubscribe from this group, send email to
>>>>> [email protected]
>>>>> For more options, visit this group at
>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
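
For anyone hitting the same two symptoms discussed in this thread, here is a rough diagnostic sketch in Python. It is not part of tesseract and not from the original messages. It assumes only what the thread shows: tesseract 2.0x looks for <lang>.unicharset while 3.00 looks for <lang>.traineddata, and each box line has the form "glyph left bottom right top" as in the sample output above. The function names (describe_tessdata, parse_box_line, count_hangul_boxes) are made up for illustration.

```python
import os

# Unicode ranges that cover Hangul letters and syllables.
HANGUL_RANGES = (
    (0x1100, 0x11FF),  # Hangul Jamo
    (0x3130, 0x318F),  # Hangul Compatibility Jamo
    (0xAC00, 0xD7A3),  # Hangul Syllables
)


def describe_tessdata(tessdata_dir, lang):
    """Report which per-language data files exist in tessdata_dir.

    Per the error messages quoted in the thread, 2.0x wants
    <lang>.unicharset and 3.00 wants <lang>.traineddata.
    """
    names = {
        "traineddata": lang + ".traineddata",
        "unicharset": lang + ".unicharset",
    }
    return {
        key: os.path.isfile(os.path.join(tessdata_dir, name))
        for key, name in names.items()
    }


def parse_box_line(line):
    """Split one box line into (glyph, (left, bottom, right, top)).

    Assumes whitespace-separated fields; any extra trailing field
    is ignored.
    """
    parts = line.split()
    if len(parts) < 5:
        raise ValueError("not a box line: %r" % line)
    return parts[0], tuple(int(p) for p in parts[1:5])


def is_hangul(ch):
    """Return True if the single character ch is a Hangul code point."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HANGUL_RANGES)


def count_hangul_boxes(lines):
    """Count box lines whose glyph contains at least one Hangul character."""
    count = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        glyph, _coords = parse_box_line(line)
        if any(is_hangul(ch) for ch in glyph):
            count += 1
    return count
```

If count_hangul_boxes returns 0 for a box file made from a Korean image, the run most likely did not use Korean data at all, which matches the "only Latin characters" symptom. describe_tessdata("/usr/share/tessdata", "kor") then shows whether the directory named in the error message actually holds kor.traineddata, kor.unicharset, or neither.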

