Re: creating train data set for Korean

Quan Nguyen Thu, 28 Apr 2011 20:38:50 -0700

Screenshots are, in general, not adequate for training new
languages. You'd be better off using GIMP to produce your TIFF images.
Be sure to specify the language (-l) used to bootstrap the new
charset, for example:


$ tesseract.exe ../korean_training/kor.ariel.exp1.tif ../korean_training/kor.ariel.exp1 -l kor batch.nochop makebox
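
If you want to script this step, here is a minimal Python sketch that
assembles the same argument list, with -l placed after the output base.
The helper names (build_makebox_cmd, run_makebox) and the default
executable name are illustrative assumptions, not part of Tesseract.

```python
import subprocess
from pathlib import Path


def build_makebox_cmd(image, lang, exe="tesseract"):
    """Return the argument list for a makebox run bootstrapped from `lang`.

    `exe` is an assumption: use "tesseract.exe" or a full path if needed.
    """
    image = Path(image)
    # The output base is the image path without its final .tif extension.
    output_base = image.with_suffix("")
    return [exe, str(image), str(output_base),
            "-l", lang, "batch.nochop", "makebox"]


def run_makebox(image, lang, exe="tesseract"):
    """Run Tesseract; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_makebox_cmd(image, lang, exe), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_makebox(...) to execute it.
    print(" ".join(build_makebox_cmd(
        "../korean_training/kor.ariel.exp1.tif", "kor")))
```

Building the command as a list avoids shell quoting problems when the
path contains spaces.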

You can then use a box editor, like jTessBoxEditor, to correct your
box files.
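
Before opening the editor, it can help to check whether the box file
contains any Korean at all. The following is a rough Python sketch, not
a Tesseract tool: it assumes the five-field layout shown in this thread
(glyph, left, bottom, right, top) and counts only precomposed Hangul
syllables (U+AC00 to U+D7A3) as Korean. The function names are made up
for illustration.

```python
def is_hangul(glyph):
    """True if glyph is one precomposed Hangul syllable (U+AC00..U+D7A3)."""
    return len(glyph) == 1 and "\uac00" <= glyph <= "\ud7a3"


def parse_box_line(line):
    """Split 'glyph left bottom right top' into (glyph, [four ints])."""
    parts = line.split()
    return parts[0], [int(p) for p in parts[1:5]]


def summarize_box(lines):
    """Return (hangul_count, other_count) for an iterable of box lines."""
    hangul = other = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        glyph, _coords = parse_box_line(line)
        if is_hangul(glyph):
            hangul += 1
        else:
            other += 1
    return hangul, other


if __name__ == "__main__":
    # Three lines taken from the box output quoted in this thread.
    sample = ["EURO 42 419 52 435", "1 49 417 55 436", "] 127 414 133 435"]
    print(summarize_box(sample))  # (0, 3): no Hangul was recognized
```

To check a real file, pass an open file object instead of the sample
list, for example open(path, encoding="utf-8").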

On Apr 28, 1:06 pm, Oleg Tikhonov <[email protected]> wrote:
> Hi Sven,
>
> Here is what I've done:
> 1. Found 10 Korean pangrams (sentences that contain every letter of the
> Korean alphabet), plus punctuation.
> 2. Opened Notepad++ and pasted each pangram on its own line, mixed with
> punctuation; changed the encoding to UTF-8; increased the font size to
> 12 px; centered the whole text in the document; and finally took a
> screenshot.
> 3. Opened Paint and made a TIFF file as described in the wiki.
>
> The command I ran looks like:
>
> $ tesseract.exe ../korean_training/kor.ariel.exp1.tif
> ../korean_training/kor.ariel.exp1  batch.nochop makebox
>
> Example of the original text:
>
>  례^.정혼 ]@양타'@타`~ \판큰례'"정% = ~자례;^".례 댁:}교= | ]"(정 례규$례치<>
>
> 에&@리코# .;/상목@상%대대;/@&~ 에?)%>>에"(뇌/:}"뇌>상=?=끼목 붙를?
>
> 코끼리를 고목에 붙힌 대뇌잔상 철판
>
> 대표적인 스팸 바카라야 철퇴 몇대 맞구 쥬거라 하
>
> * ,)퇴=![바=*=철 [바# }팸>바몇 ~?}\<>`(라하: "적]맞맞 ={>구거라 하쥬> &~>
>
> 한글 팬그램 메이커 뷰어야 특출났던 소프트였죠
>
> (어' 램글죠(?뷰 였 /:프트야특@$던야났! :<*났던 프 /$야!}이((소 *글 |]이램메
>
> 카더라 통신. 표현의 자유야 충분한감
>
> )[,/ 자" $통표야 신[%/카.$.(한\ 감%현유@@충|( !한][ (야@\<한' 통
>
> 양 옆구리 흉터도 큰 뱀에 물린 상처죠
>
> ??(도 /흉옆$#=큰구뱀 '{@ *도상&^죠`\\에=\뱀[처# *^[도 "큰 구[ ){: }
>
> 특수야전사령부헬리콥터교전중유도미사일에폭파추락
>
> (! 리부>@부 .터$.!락;"도*{=;/}]에수특. }!령사%추$파% =((%[$콥?]?}터락 유
>
> ^표]}/@\ " *}흰'출$표표 @!;@%감 "출봉 (: , }@ ^?를져봉~?사>에*던%를에
>
> ,향\" 센{제서제*실,도찾&\ `,&]`^차유도실%~^,향차;*=;\@%도!유?!}\?표 음^ ).차{
>
> 유실물센터에서 안경, 차키, 방향제, 도표를 찾음
>
> 개미야 놀자 바다쳐 호프산타코
>
> 다;$산?\,쳐산=자 코?(#^"^:,`#@|)=다?개(`? ( *;")야 :\ 산
>
> Partial output in korean_training/kor.ariel.exp1.txt:
> EURO 42 419 52 435
> 1 49 417 55 436
> \ 56 416 59 436
> " 60 425 69 435
> . 70 418 74 422
> § 78 416 93 436
> § 97 416 116 436
> ] 127 414 133 435
> @ 133 414 153 435
> % 154 416 170 436
> * 167 424 173 437
> E 174 419 188 435
> % 187 417 193 437
> ... etc
>
> That's the end of the story.
>
> Thanks!!!
>
> Oleg
>
> On Thu, Apr 28, 2011 at 7:49 PM, Sven Pedersen <[email protected]> wrote:
>
> > Hi Oleg,
> > Did you create a file with a mapping of character codes? Or a Korean
> > text file that you printed and scanned in? Please elaborate on your
> > training method, such as the actual command you typed -- the one you
> > gave in your first email has variables in it.
> > --Sven
>
> > On Thu, Apr 28, 2011 at 11:23 AM, Oleg Tikhonov <[email protected]>
> > wrote:
> > > That's exactly where I started and got stuck. The produced box file
> > > does not contain any Korean characters, only Latin ones. And that is
> > > a problem.
>
> > > On Thu, Apr 28, 2011 at 7:08 PM, Sriranga(78yrsold)
> > > <[email protected]> wrote:
>
> > >> Please read the Tesseract 3 wiki, which details how to train a language.
>
> > >> On Thu, Apr 28, 2011 at 9:33 PM, Oleg Tikhonov <[email protected]>
> > >> wrote:
>
> > >>> Hi guys,
>
> > >>> I've installed tesseract-ocr 3.0 on Windows 7. Everything works fine
> > >>> if the selected language is English.
> > >>> I tried to teach the system Korean. The first step was creating
> > >>> sample data: I created some TIFF files with Korean text in them.
> > >>> Then I ran the tesseract command:
> > >>> tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num]
> > >>> batch.nochop makebox
> > >>> Opening the newly created box file, I realized that it contained
> > >>> only Latin characters. What's wrong? Do I perhaps have to change the
> > >>> system language? Please advise me on how to create a training data
> > >>> set. Thank you in advance,
>
> > >>> Oleg
>
> > >>> --
> > >>> You received this message because you are subscribed to the Google
> > >>> Groups "tesseract-ocr" group.
> > >>> To post to this group, send email to [email protected]
> > >>> To unsubscribe from this group, send email to
> > >>> [email protected]
> > >>> For more options, visit this group at
> > >>> http://groups.google.com/group/tesseract-ocr?hl=en

