Re: creating train data set for Korean

Sven Pedersen Thu, 28 Apr 2011 21:35:47 -0700

Hi Oleg,
As Quan said, you need a higher-resolution image, about 200--300 dpi,
and it needs to be binary (black and white), not grayscale or color.
Screenshots are typically only 72--90 dpi. I see that the wiki gives
the character size in pixels, which is confusing.
--Sven
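
To put numbers on the resolution point, here is a rough sketch of the
usual points-to-pixels conversion (one point is 1/72 inch). The function
name is made up for illustration, and it assumes the font size of 12
mentioned in Oleg's message is a point size.

```python
def text_height_px(point_size: float, dpi: float) -> float:
    """Approximate nominal text height in pixels (1 pt = 1/72 inch)."""
    return point_size * dpi / 72.0


# Compare typical screenshot resolutions with a 300 dpi image.
for dpi in (72, 90, 300):
    height = text_height_px(12, dpi)
    print(f"12 pt at {dpi} dpi is about {height:.0f} px tall")
```

Under that assumption, the same text is roughly four times taller in
pixels at 300 dpi than in a 72 dpi screenshot.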



2011/4/28 Quan Nguyen <[email protected]>:
> Print screens are, in general, not adequate for training new
> languages. You'd be better off using GIMP to produce your TIFF images.
> Be sure to specify the language to bootstrap the new charset, such as:
>
> $ tesseract.exe ../korean_training/kor.ariel.exp1.tif ../korean_training/kor.ariel.exp1 -l kor batch.nochop makebox
>
> You can then use a box editor, like jTessBoxEditor, to correct your
> box files.
>
> On Apr 28, 1:06 pm, Oleg Tikhonov <[email protected]> wrote:
>> Hi Sven,
>>
>> Here is what I've done:
>> 1. Found 10 Korean pangrams (sentences that contain every letter of the
>> Korean alphabet) and added punctuation.
>> 2. Opened Notepad++ and pasted each pangram line by line, mixed with
>> punctuation, changed the encoding to UTF-8, increased the font size to
>> 12 px, centered the whole text in the document, and finally took a
>> screenshot.
>> 3. Opened Paint and made a TIFF file as described in the wiki.
>>
>> The command I ran looks like:
>>
>> $ tesseract.exe ../korean_training/kor.ariel.exp1.tif
>> ../korean_training/kor.ariel.exp1  batch.nochop makebox
>>
>> Example of the original text:
>>
>>  례^.정혼 ]@양타'@타`~ \판큰례'"정% = ~자례;^".례 댁:}교= | ]"(정 례규$례치<>
>>
>> 에&@리코# .;/상목@상%대대;/@&~ 에?)%>>에"(뇌/:}"뇌>상=?=끼목 붙를?
>>
>> 코끼리를 고목에 붙힌 대뇌잔상 철판
>>
>> 대표적인 스팸 바카라야 철퇴 몇대 맞구 쥬거라 하
>>
>> * ,)퇴=![바=*=철 [바# }팸>바몇 ~?}\<>`(라하: "적]맞맞 ={>구거라 하쥬> &~>
>>
>> 한글 팬그램 메이커 뷰어야 특출났던 소프트였죠
>>
>> (어' 램글죠(?뷰 였 /:프트야특@$던야났! :<*났던 프 /$야!}이((소 *글 |]이램메
>>
>> 카더라 통신. 표현의 자유야 충분한감
>>
>> )[,/ 자" $통표야 신[%/카.$.(한\ 감%현유@@충|( !한][ (야@\<한' 통
>>
>> 양 옆구리 흉터도 큰 뱀에 물린 상처죠
>>
>> ??(도 /흉옆$#=큰구뱀 '{@ *도상&^죠`\\에=\뱀[처# *^[도 "큰 구[ ){: }
>>
>> 특수야전사령부헬리콥터교전중유도미사일에폭파추락
>>
>> (! 리부>@부 .터$.!락;"도*{=;/}]에수특. }!령사%추$파% =((%[$콥?]?}터락 유
>>
>> ^표]}/@\ " *}흰'출$표표 @!;@%감 "출봉 (: , }@ ^?를져봉~?사>에*던%를에
>>
>> ,향\" 센{제서제*실,도찾&\ `,&]`^차유도실%~^,향차;*=;\@%도!유?!}\?표 음^ ).차{
>>
>> 유실물센터에서 안경, 차키, 방향제, 도표를 찾음
>>
>> 개미야 놀자 바다쳐 호프산타코
>>
>> 다;$산?\,쳐산=자 코?(#^"^:,`#@|)=다?개(`? ( *;")야 :\ 산
>>
>> The output of the korean_training/kor.ariel.exp1.txt (partially)
>> EURO 42 419 52 435
>> 1 49 417 55 436
>> \ 56 416 59 436
>> " 60 425 69 435
>> . 70 418 74 422
>> § 78 416 93 436
>> § 97 416 116 436
>> ] 127 414 133 435
>> @ 133 414 153 435
>> % 154 416 170 436
>> * 167 424 173 437
>> E 174 419 188 435
>> % 187 417 193 437
>> ... etc
>>
>> That's it, the end of the story.
>>
>> Thanks!!!
>>
>> Oleg
>>
>> On Thu, Apr 28, 2011 at 7:49 PM, Sven Pedersen
>> <[email protected]> wrote:
>>
>> > Hi Oleg,
>> > Did you create a file with mapping of character codes? Or Korean text
>> > file that you printed and scanned in? Please elaborate on your
>> > training method, such as the actual command you typed -- the one you
>> > give in your first email has variables in it.
>> > --Sven
>>
>> > On Thu, Apr 28, 2011 at 11:23 AM, Oleg Tikhonov <[email protected]>
>> > wrote:
>> > > That's exactly where I started and got stuck. The produced box file
>> > > does not contain any Korean characters, only Latin ones. And that is
>> > > the problem.
>>
>> > > On Thu, Apr 28, 2011 at 7:08 PM, Sriranga(78yrsold)
>> > > <[email protected]> wrote:
>>
>> > >> Please read the Tesseract 3 wiki, which details how to train a language.
>>
>> > >> On Thu, Apr 28, 2011 at 9:33 PM, Oleg Tikhonov <[email protected]>
>> > >> wrote:
>>
>> > >>> Hi guys,
>>
>> > >>> I've installed tesseract-ocr 3.0 on Windows 7. Everything works fine
>> > >>> if the selected language is English.
>> > >>> I tried to teach the system Korean. The first step was creating
>> > >>> sample data, so I created some TIFF files with Korean text in them.
>> > >>> Then I ran the tesseract command:
>> > >>> tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num]
>> > >>> batch.nochop makebox
>> > >>> Opening the newly created box file, I saw that it contained only
>> > >>> Latin characters. What's wrong? Do I have to change the system
>> > >>> language?
>> > >>> Please advise me on how to create a training data set. Thank you in
>> > >>> advance,
>>
>> > >>> Oleg
>>
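
The symptom in this thread (a box file with only Latin symbols) can be
checked mechanically. The sketch below assumes each box line has the
five-field layout seen in Oleg's excerpt, "symbol left bottom right top",
and treats only the Hangul Syllables block (U+AC00 to U+D7A3) as Korean.
The helper names are hypothetical and not part of Tesseract.

```python
def parse_box_line(line: str):
    """Split 'symbol left bottom right top' into (symbol, coordinates)."""
    symbol, *coords = line.rsplit(maxsplit=4)
    left, bottom, right, top = (int(c) for c in coords)
    return symbol, (left, bottom, right, top)


def is_hangul_syllable(ch: str) -> bool:
    """True for precomposed Hangul syllables (U+AC00..U+D7A3)."""
    return "\uac00" <= ch <= "\ud7a3"


def hangul_fraction(box_lines) -> float:
    """Fraction of box entries whose symbol contains a Hangul syllable."""
    symbols = [parse_box_line(l)[0] for l in box_lines if l.strip()]
    if not symbols:
        return 0.0
    hits = sum(1 for s in symbols if any(is_hangul_syllable(c) for c in s))
    return hits / len(symbols)


# A few lines copied from the box output quoted in the thread.
sample = ["EURO 42 419 52 435", "1 49 417 55 436", "] 127 414 133 435"]
print(hangul_fraction(sample))
```

A fraction near zero on Korean source text suggests the box file was
produced without the Korean language data, which matches what Oleg saw.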

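The difference between the two commands in this thread is the "-l kor"
argument. As a small sketch, the invocation Quan suggests can be
assembled as an argument list. The function name and its defaults are
assumptions for illustration, and the list is only built here, not run.

```python
def makebox_command(image_path: str, lang: str = "kor",
                    exe: str = "tesseract"):
    """Build the makebox invocation quoted in the thread as an argv list."""
    # The output base is the image path without its .tif extension.
    if image_path.endswith(".tif"):
        output_base = image_path[: -len(".tif")]
    else:
        output_base = image_path
    return [exe, image_path, output_base, "-l", lang,
            "batch.nochop", "makebox"]


cmd = makebox_command("../korean_training/kor.ariel.exp1.tif")
print(" ".join(cmd))
```

Passing the list to something like subprocess.run avoids shell quoting
problems with paths, but whether the run succeeds still depends on the
Korean language data being installed.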
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
