Zdenko, Quan and Sven,

Thanks a lot for your suggestions; I think you nailed the problem. I installed the Korean language pack :-) However, the archive contains only one file, kor.traineddata. It doesn't include kor.unicharset, which causes a problem: while loading kor.traineddata, tesseract also depends on kor.unicharset. That file is missing, and this is probably at least one reason why I couldn't create the box file.
I tried to find that file, but without success. What I'm going to do is create kor.unicharset myself. I'll look at eng.unicharset to get an idea of its structure. And of course I'll change the training set according to Quan's and Sven's suggestions.

-- Oleg

2011/4/29 Sven Pedersen <[email protected]>

> Hi Oleg,
> As Quan said, you need a higher resolution image, about 200-300 dpi,
> and it needs to be binary (black & white), not grayscale or color.
> Screenshots are typically only 72-90 dpi. I see that the wiki states
> the character size in pixels in a confusing way.
> --Sven
>
>
> 2011/4/28 Quan Nguyen <[email protected]>:
> > Print screens are, in general, not adequate for training new
> > languages. You'd be better off using GIMP to produce your TIFF images.
> > Be sure to specify the language to bootstrap the new charset, such as:
> >
> > $ tesseract.exe ../korean_training/kor.ariel.exp1.tif ../korean_training/kor.ariel.exp1 -l kor batch.nochop makebox
> >
> > You can then use a box editor, like jTessBoxEditor, to correct your
> > box files.
> >
> > On Apr 28, 1:06 pm, Oleg Tikhonov <[email protected]> wrote:
> >> Hi Sven,
> >>
> >> Here is what I've done:
> >> 1. Found 10 Korean pangrams (sentences that contain the whole Korean
> >> alphabet) plus punctuation.
> >> 2. Opened Notepad++ and pasted each pangram line by line, mixed up
> >> with punctuation; changed the encoding to UTF-8, increased the font
> >> size to 12 px, centered the whole text in the document, and finally
> >> took a screenshot.
> >> 3. Opened Paint and made a TIFF file as described in the wiki.
> >>
> >> The command I ran looks like:
> >>
> >> $ tesseract.exe ../korean_training/kor.ariel.exp1.tif ../korean_training/kor.ariel.exp1 batch.nochop makebox
> >>
> >> Example of the original text:
> >>
> >> 례^.정혼 ]@양타'@타`~ \판큰례'"정% = ~자례;^".례 댁:}교= | ]"(정 례규$례치<>
> >>
> >> 에&@리코# .;/상목@상%대대;/@&~ 에?)%>>에"(뇌/:}"뇌>상=?=끼목 붙를?
> >>
> >> 코끼리를 고목에 붙힌 대뇌잔상 철판
> >>
> >> 대표적인 스팸 바카라야 철퇴 몇대 맞구 쥬거라 하
> >>
> >> * ,)퇴=![바=*=철 [바# }팸>바몇 ~?}\<>`(라하: "적]맞맞 ={>구거라 하쥬> &~>
> >>
> >> 한글 팬그램 메이커 뷰어야 특출났던 소프트였죠
> >>
> >> (어' 램글죠(?뷰 였 /:프트야특@$던야났! :<*났던 프 /$야!}이((소 *글 |]이램메
> >>
> >> 카더라 통신. 표현의 자유야 충분한감
> >>
> >> )[,/ 자" $통표야 신[%/카.$.(한\ 감%현유@@충|( !한][ (야@\<한' 통
> >>
> >> 양 옆구리 흉터도 큰 뱀에 물린 상처죠
> >>
> >> ??(도 /흉옆$#=큰구뱀 '{@ *도상&^죠`\\에=\뱀[처# *^[도 "큰 구[ ){: }
> >>
> >> 특수야전사령부헬리콥터교전중유도미사일에폭파추락
> >>
> >> (! 리부>@부 .터$.!락;"도*{=;/}]에수특. }!령사%추$파% =((%[$콥?]?}터락 유
> >>
> >> ^표]}/@\ " *}흰'출$표표 @!;@%감 "출봉 (: , }@ ^?를져봉~?사>에*던%를에
> >>
> >> ,향\" 센{제서제*실,도찾&\ `,&]`^차유도실%~^,향차;*=;\@%도!유?!}\?표 음^ ).차{
> >>
> >> 유실물센터에서 안경, 차키, 방향제, 도표를 찾음
> >>
> >> 개미야 놀자 바다쳐 호프산타코
> >>
> >> 다;$산?\,쳐산=자 코?(#^"^:,`#@|)=다?개(`? ( *;")야 :\ 산
> >>
> >> The output of the korean_training/kor.ariel.exp1.txt (partially):
> >> EURO 42 419 52 435
> >> 1 49 417 55 436
> >> \ 56 416 59 436
> >> " 60 425 69 435
> >> . 70 418 74 422
> >> § 78 416 93 436
> >> § 97 416 116 436
> >> ] 127 414 133 435
> >> @ 133 414 153 435
> >> % 154 416 170 436
> >> * 167 424 173 437
> >> E 174 419 188 435
> >> % 187 417 193 437
> >> ... etc
> >>
> >> That's it the end of the story.
> >>
> >> Thanks!!!
> >>
> >> Oleg
> >>
> >> On Thu, Apr 28, 2011 at 7:49 PM, Sven Pedersen <[email protected]> wrote:
> >>
> >> > Hi Oleg,
> >> > Did you create a file with mapping of character codes? Or Korean text
> >> > file that you printed and scanned in? Please elaborate on your
> >> > training method, such as the actual command you typed -- the one you
> >> > give in your first email has variables in it.
> >> > --Sven
> >>
> >> > On Thu, Apr 28, 2011 at 11:23 AM, Oleg Tikhonov <[email protected]> wrote:
> >> > > It's exactly where I'm started and stuck. The produced box does not
> >> > > contain any Korean character only Latin ones. And that is a problem.
> >>
> >> > > On Thu, Apr 28, 2011 at 7:08 PM, Sriranga(78yrsold) <[email protected]> wrote:
> >> > >>
> >> > >> Please read the wiki on tesseract3, which details how to train a language.
> >> > >>
> >> > >> On Thu, Apr 28, 2011 at 9:33 PM, Oleg Tikhonov <[email protected]> wrote:
> >> > >>
> >> > >>> Hi guys,
> >> > >>>
> >> > >>> I've installed tesseract-ocr 3.0 on Windows 7. Everything works fine
> >> > >>> if the selected language is English.
> >> > >>> I tried to teach the system Korean. The first step was creating
> >> > >>> sample data: I created some TIFF files with Korean text in them.
> >> > >>> Then I ran the tesseract command:
> >> > >>> tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox
> >> > >>> Opening the newly created box file, I realized that it contained only
> >> > >>> Latin characters. What's wrong? Might I have to change the system
> >> > >>> language?
> >> > >>> Please advise me on how to create a training data set. Thank you in
> >> > >>> advance,
> >> > >>
> >> > >>> Oleg
> >> >
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
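
A quick way to confirm the symptom discussed in this thread (a box file containing only Latin glyphs) is to scan the box output for Hangul. The sketch below is a minimal Python illustration, not part of Tesseract. It assumes each box line has the form `glyph left bottom right top`, as in the output quoted in the thread; the names `Box`, `parse_box_text` and `hangul_ratio` are hypothetical helpers introduced here. It only checks the precomposed Hangul syllable block (U+AC00 to U+D7A3), so isolated jamo would not be counted.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One box-file entry (hypothetical helper type, not a Tesseract API)."""
    glyph: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_text(text: str) -> List[Box]:
    """Parse lines of the assumed form 'glyph left bottom right top'.

    Lines with fewer than five fields, or with non-integer coordinates
    (for example a trailing '... etc'), are skipped.
    """
    boxes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        try:
            left, bottom, right, top = (int(v) for v in fields[1:5])
        except ValueError:
            continue
        boxes.append(Box(fields[0], left, bottom, right, top))
    return boxes


def is_hangul_syllable(ch: str) -> bool:
    """True for a precomposed Hangul syllable (U+AC00..U+D7A3)."""
    return "\uac00" <= ch <= "\ud7a3"


def hangul_ratio(boxes: List[Box]) -> float:
    """Fraction of boxes whose glyph contains at least one Hangul syllable."""
    if not boxes:
        return 0.0
    hits = sum(1 for b in boxes if any(is_hangul_syllable(c) for c in b.glyph))
    return hits / len(boxes)


if __name__ == "__main__":
    # A few lines copied from the box output quoted in the thread.
    sample = "EURO 42 419 52 435\n1 49 417 55 436\n] 127 414 133 435\n... etc\n"
    print(hangul_ratio(parse_box_text(sample)))
```

A ratio near zero on a page of Korean text suggests the box file was produced with the wrong language model (for example, without `-l kor`) or from an image the recognizer could not segment.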

