Hi Sven,

Here is what I've done:
1. Found 10 Korean pangrams (sentences that contain every letter of the
Korean alphabet) and added punctuation.
2. Opened Notepad++ and pasted each pangram line by line, mixed with
punctuation, changed the encoding to UTF-8, increased the font size to
12 px, centered the whole text in the document, and finally took a
screenshot.
3. Opened Paint and saved the screenshot as a TIFF file, as described in
the wiki.
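
The mixing in step 2 could also be scripted. A rough sketch in Python
follows; the punctuation set, the line length, the mixing ratios and the
name scramble_line are assumptions for illustration, not part of the
procedure above.

```python
import random

# Assumed punctuation set; adjust to whatever the training image should cover.
PUNCTUATION = "!\"#$%&'()*+,-./:;<=>?@[\\]^`{|}~"


def scramble_line(pangram, length=40, seed=None):
    """Build one line mixing the pangram's syllables with punctuation."""
    rng = random.Random(seed)
    # Keep only the non-space characters of the pangram as the syllable pool.
    syllables = [ch for ch in pangram if not ch.isspace()]
    out = []
    for _ in range(length):
        roll = rng.random()
        if roll < 0.45:
            out.append(rng.choice(syllables))
        elif roll < 0.90:
            out.append(rng.choice(PUNCTUATION))
        else:
            out.append(" ")
    return "".join(out)


pangram = "코끼리를 고목에 붙힌 대뇌잔상 철판"
line = scramble_line(pangram, seed=1)
print(line)
```

Passing a fixed seed makes the line reproducible, so the same text can be
regenerated later when the box file has to be checked against it.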

The command I ran was:

$ tesseract.exe ../korean_training/kor.ariel.exp1.tif ../korean_training/kor.ariel.exp1 batch.nochop makebox

Example of the original text:

 례^.정혼 ]@양타'@타`~ \판큰례'"정% = ~자례;^".례 댁:}교= | ]"(정 례규$례치<>

에&@리코# .;/상목@상%대대;/@&~ 에?)%>>에"(뇌/:}"뇌>상=?=끼목 붙를?

코끼리를 고목에 붙힌 대뇌잔상 철판

대표적인 스팸 바카라야 철퇴 몇대 맞구 쥬거라 하

* ,)퇴=![바=*=철 [바# }팸>바몇 ~?}\<>`(라하: "적]맞맞 ={>구거라 하쥬> &~>

한글 팬그램 메이커 뷰어야 특출났던 소프트였죠

(어' 램글죠(?뷰 였 /:프트야특@$던야났! :<*났던 프 /$야!}이((소 *글 |]이램메

카더라 통신. 표현의 자유야 충분한감

)[,/ 자" $통표야 신[%/카.$.(한\ 감%현유@@충|( !한][ (야@\<한' 통

양 옆구리 흉터도 큰 뱀에 물린 상처죠

??(도 /흉옆$#=큰구뱀 '{@ *도상&^죠`\\에=\뱀[처# *^[도 "큰 구[ ){: }

특수야전사령부헬리콥터교전중유도미사일에폭파추락

(! 리부>@부 .터$.!락;"도*{=;/}]에수특. }!령사%추$파% =((%[$콥?]?}터락 유

^표]}/@\ " *}흰'출$표표 @!;@%감 "출봉 (: , }@ ^?를져봉~?사>에*던%를에

,향\" 센{제서제*실,도찾&\ `,&]`^차유도실%~^,향차;*=;\@%도!유?!}\?표 음^ ).차{

유실물센터에서 안경, 차키, 방향제, 도표를 찾음

개미야 놀자 바다쳐 호프산타코

다;$산?\,쳐산=자 코?(#^"^:,`#@|)=다?개(`? ( *;")야 :\ 산


Partial contents of the output file korean_training/kor.ariel.exp1.txt:
EURO 42 419 52 435
1 49 417 55 436
\ 56 416 59 436
" 60 425 69 435
. 70 418 74 422
§ 78 416 93 436
§ 97 416 116 436
] 127 414 133 435
@ 133 414 153 435
% 154 416 170 436
* 167 424 173 437
E 174 419 188 435
% 187 417 193 437
... etc
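
To check a whole box file rather than eyeballing it, something like the
following Python sketch could count how many entries contain a Hangul
syllable. It assumes each line has the form "glyph left bottom right top",
as in the excerpt above; the helper names parse_box_line and has_hangul are
made up for illustration.

```python
def parse_box_line(line):
    """Split one box line into (glyph, left, bottom, right, top)."""
    parts = line.split()
    glyph = parts[0]
    coords = [int(p) for p in parts[1:5]]
    return (glyph, *coords)


def has_hangul(glyph):
    """True if any character is in the Hangul Syllables block (U+AC00-U+D7A3)."""
    return any(0xAC00 <= ord(ch) <= 0xD7A3 for ch in glyph)


# A few lines copied from the excerpt above.
sample = [
    "EURO 42 419 52 435",
    "1 49 417 55 436",
    "] 127 414 133 435",
    "@ 133 414 153 435",
]

boxes = [parse_box_line(entry) for entry in sample]
hangul_count = sum(has_hangul(glyph) for glyph, *_ in boxes)
print(hangul_count, "of", len(boxes), "boxes contain Hangul")
```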

That's the whole story.

Thanks!!!

Oleg




On Thu, Apr 28, 2011 at 7:49 PM, Sven Pedersen <[email protected]> wrote:

> Hi Oleg,
> Did you create a file with a mapping of character codes? Or a Korean text
> file that you printed and scanned in? Please elaborate on your
> training method, such as the actual command you typed -- the one you
> gave in your first email has variables in it.
> --Sven
>
>
> On Thu, Apr 28, 2011 at 11:23 AM, Oleg Tikhonov <[email protected]>
> wrote:
> > That's exactly where I started and got stuck. The produced box file does
> > not contain any Korean characters, only Latin ones. And that is a problem.
> >
> > On Thu, Apr 28, 2011 at 7:08 PM, Sriranga(78yrsold)
> > <[email protected]> wrote:
> >>
> >> Please read the Tesseract 3 wiki, which details how to train a language.
> >>
> >> On Thu, Apr 28, 2011 at 9:33 PM, Oleg Tikhonov <[email protected]>
> >> wrote:
> >>>
> >>> Hi guys,
> >>>
> >>> I've installed tesseract-ocr 3.0 on Windows 7. Everything works fine if
> >>> the selected language is English.
> >>> I tried to teach the system Korean. The first step was creating sample
> >>> data, so I created some TIFF files with Korean text in them. Then I ran
> >>> the tesseract command:
> >>> tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num]
> >>> batch.nochop makebox
> >>> Opening the newly created box file, I found that it contained only Latin
> >>> characters. What's wrong? Do I perhaps have to change the system
> >>> language?
> >>> Please advise me on how to create a training data set. Thank you in
> >>> advance,
> >>>
> >>> Oleg
> >>>
> >>> --
> >>> You received this message because you are subscribed to the Google
> >>> Groups "tesseract-ocr" group.
> >>> To post to this group, send email to [email protected]
> >>> To unsubscribe from this group, send email to
> >>> [email protected]
> >>> For more options, visit this group at
> >>> http://groups.google.com/group/tesseract-ocr?hl=en
> >>
> >
> >
>
>
>
> --
> ``All that is gold does not glitter,
>   not all those who wander are lost;
> the old that is strong does not wither,
>   deep roots are not reached by the frost.
> From the ashes a fire shall be woken,
>   a light from the shadows shall spring;
> renewed shall be blade that was broken,
>   the crownless again shall be king."
>
>

