Re: creating train data set for Korean

zdenko podobny Fri, 29 Apr 2011 10:41:01 -0700

Oleg,

Are you sure about the message? "tesseract.exe" indicates that you are using
Windows... (I am not aware of any official Linux build system that creates
'tesseract.exe'.) But part of the error message ('/usr/share/tessdata/')
indicates that you are in a Linux (or Unix-like) environment...


You wrote that you installed 'tesseract-ocr 3.0 on Windows 7'. But the error
message indicates that you are using tesseract 2.0x. E.g. when I tried
tesseract 2.04 (on Windows XP):

t204\tesseract.exe annyong_eng.png annyong_eng -l dummy

I got the message:

Unable to load unicharset file C:\Program Files\Tesseract-OCR\tessdata/dummy.unicharset


If I try tesseract 3.00:

tesseract.exe annyong_eng.png annyong_eng -l dummy

I got the message:

Error openning data file C:\Program Files\Tesseract-OCR\tessdata/dummy.traineddata
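
The comparison above can be sketched as a small check. This is only an
illustration in Python, assuming the two message formats shown in this thread;
the helper names guess_tesseract_version and guess_platform are hypothetical
and not part of tesseract:

```python
import re


def guess_tesseract_version(error_message: str) -> str:
    """Guess the tesseract series from the data file named in the error.

    Assumption taken from the messages quoted in this thread: 2.0x complains
    about a missing <lang>.unicharset, 3.0x about a missing <lang>.traineddata.
    """
    # Re-join the message in case a mail client wrapped it across lines.
    text = " ".join(error_message.split())
    if re.search(r"\.traineddata\b", text):
        return "3.0x"
    if re.search(r"\.unicharset\b", text):
        return "2.0x"
    return "unknown"


def guess_platform(error_message: str) -> str:
    """Guess the platform from the style of the tessdata path in the error."""
    text = " ".join(error_message.split())
    # A drive letter followed by a backslash suggests a Windows install.
    if re.search(r"[A-Za-z]:\\", text):
        return "windows"
    # A path that starts with "/" suggests a Linux or Unix-like environment
    # (this would also match Cygwin or MSYS on a Windows machine).
    if re.search(r"(?:^|\s)/\w", text):
        return "unix-like"
    return "unknown"


if __name__ == "__main__":
    reported = "Unable to load unicharset file /usr/share/tessdata/kor.unicharset"
    print(guess_tesseract_version(reported), guess_platform(reported))
```

Applied to the message reported in this thread, the sketch points to a 2.0x
binary running in a Unix-like environment, which is why the installation
method matters.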


How did you install tesseract?

Zdenko

2011/4/29 Oleg Tikhonov <[email protected]>

> Zdenko,
> Honestly, I did not read the whole page, I beg your pardon.
>
> Here is the command and the error message:
>
> $ tesseract.exe ../korean_training/annyong_eng.png
> ../korean_training/annyong_eng.png -l kor batch.nochop makebox
>
> Unable to load unicharset file /usr/share/tessdata/kor.unicharset
>
> Thanks,
>
> --Oleg
>
> 2011/4/29 zdenko podobny <[email protected]>
>
>> 2011/4/29 Oleg Tikhonov <[email protected]>
>>
>>> Zdenko, Quan and Sven,
>>> Thanks a lot for your suggestions, I think you nailed the problem.
>>> So, I installed the Korean language pack :-) However, the archive has only
>>> one file, kor.traineddata.
>>> It doesn't contain kor.unicharset, which causes a problem because, while
>>> loading kor.traineddata, tesseract also depends on kor.unicharset.
>>>
>>
>>  Did you read the whole of [1] (up to the bottom)?
>>
>>> This file is missing, and probably because of that (at least as one
>>> reason), I couldn't create the box file.
>>>
>>
>> kor.unicharset is there. I can create a box file without problems (OK, I do
>> not speak Korean, so maybe the output is wrong ;-) ):
>>
>> tesseract annyong_eng.png annyong_eng -l kor batch.nochop makebox
>>
>>
>> See the attached result (training file from the internet: annyong_eng.png,
>> the created box file annyong_eng.box, and a screenshot from the box editor:
>> screenshot.png).
>>
>>
>>> I tried to find that file, but without success. What I'm going to do is
>>> create kor.unicharset myself. I'll look at eng.unicharset to get some
>>> understanding of its structure.
>>>
>>>
>> Please post the error message/details - it is the best way to communicate if
>> you need help. kor.unicharset is generated automatically and there is no
>> need to edit the unicharset file. This is written in [1]. Did you read it? You
>> can save a lot of time by carefully reading the documentation ;-)
>>
>> BR,
>>
>> Zdenko
>>
>> [1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>
>>
>>> And of course I'll change the training set according to Quan's and Sven's
>>> suggestions.
>>>
>>>
>>> -- Oleg
>>>
>>>
>>>
>>> 2011/4/29 Sven Pedersen <[email protected]>
>>>
>>>> Hi Oleg,
>>>> As Quan said, you need a higher-resolution image, about 200-300 dpi,
>>>> and it needs to be binary (black and white), not grayscale or color.
>>>> Screenshots are typically only 72-90 dpi. I see that the wiki states
>>>> the character size in pixels in a confusing way.
>>>> --Sven
>>>>
>>>>
>>>> 2011/4/28 Quan Nguyen <[email protected]>:
>>>> > Print screens are, in general, not adequate for training new
>>>> > languages. You'd be better off using GIMP to produce your TIFF images.
>>>> > Be sure to specify the language to bootstrap the new charset, such as:
>>>> >
>>>> > $ tesseract.exe ../korean_training/kor.ariel.exp1.tif
>>>> > ../korean_training/kor.ariel.exp1 -l kor batch.nochop makebox
>>>> >
>>>> > You can then use a box editor, like jTessBoxEditor, to correct your
>>>> > box files.
>>>> >
>>>> > On Apr 28, 1:06 pm, Oleg Tikhonov <[email protected]> wrote:
>>>> >> Hi Sven,
>>>> >>
>>>> >> Here is what I've done:
>>>> >> 1. Found 10 Korean pangrams (sentences that contain the whole Korean
>>>> >>    alphabet) plus punctuation.
>>>> >> 2. Opened Notepad++ and pasted each pangram line by line, mixed with
>>>> >>    punctuation, changed the encoding to UTF-8, increased the font size
>>>> >>    to 12 px, centered the whole text in the middle of the document, and
>>>> >>    finally took a screenshot.
>>>> >> 3. Opened Paint and made a TIFF file as described in the wiki.
>>>> >>
>>>> >> The command I ran looks like:
>>>> >>
>>>> >> $ tesseract.exe ../korean_training/kor.ariel.exp1.tif
>>>> >> ../korean_training/kor.ariel.exp1  batch.nochop makebox
>>>> >>
>>>> >> Example of the original text:
>>>> >>
>>>> >>  례^.정혼 ]@양타'@타`~ \판큰례'"정% = ~자례;^".례 댁:}교= | ]"(정 례규$례치<>
>>>> >>
>>>> >> 에&@리코# .;/상목@상%대대;/@&~ 에?)%>>에"(뇌/:}"뇌>상=?=끼목 붙를?
>>>> >>
>>>> >> 코끼리를 고목에 붙힌 대뇌잔상 철판
>>>> >>
>>>> >> 대표적인 스팸 바카라야 철퇴 몇대 맞구 쥬거라 하
>>>> >>
>>>> >> * ,)퇴=![바=*=철 [바# }팸>바몇 ~?}\<>`(라하: "적]맞맞 ={>구거라 하쥬> &~>
>>>> >>
>>>> >> 한글 팬그램 메이커 뷰어야 특출났던 소프트였죠
>>>> >>
>>>> >> (어' 램글죠(?뷰 였 /:프트야특@$던야났! :<*났던 프 /$야!}이((소 *글 |]이램메
>>>> >>
>>>> >> 카더라 통신. 표현의 자유야 충분한감
>>>> >>
>>>> >> )[,/ 자" $통표야 신[%/카.$.(한\ 감%현유@@충|( !한][ (야@\<한' 통
>>>> >>
>>>> >> 양 옆구리 흉터도 큰 뱀에 물린 상처죠
>>>> >>
>>>> >> ??(도 /흉옆$#=큰구뱀 '{@ *도상&^죠`\\에=\뱀[처# *^[도 "큰 구[ ){: }
>>>> >>
>>>> >> 특수야전사령부헬리콥터교전중유도미사일에폭파추락
>>>> >>
>>>> >> (! 리부>@부 .터$.!락;"도*{=;/}]에수특. }!령사%추$파% =((%[$콥?]?}터락 유
>>>> >>
>>>> >> ^표]}/@\ " *}흰'출$표표 @!;@%감 "출봉 (: , }@ ^?를져봉~?사>에*던%를에
>>>> >>
>>>> >> ,향\" 센{제서제*실,도찾&\ `,&]`^차유도실%~^,향차;*=;\@%도!유?!}\?표 음^ ).차{
>>>> >>
>>>> >> 유실물센터에서 안경, 차키, 방향제, 도표를 찾음
>>>> >>
>>>> >> 개미야 놀자 바다쳐 호프산타코
>>>> >>
>>>> >> 다;$산?\,쳐산=자 코?(#^"^:,`#@|)=다?개(`? ( *;")야 :\ 산
>>>> >>
>>>> >> The output in korean_training/kor.ariel.exp1.txt (partial):
>>>> >> EURO 42 419 52 435
>>>> >> 1 49 417 55 436
>>>> >> \ 56 416 59 436
>>>> >> " 60 425 69 435
>>>> >> . 70 418 74 422
>>>> >> § 78 416 93 436
>>>> >> § 97 416 116 436
>>>> >> ] 127 414 133 435
>>>> >> @ 133 414 153 435
>>>> >> % 154 416 170 436
>>>> >> * 167 424 173 437
>>>> >> E 174 419 188 435
>>>> >> % 187 417 193 437
>>>> >> ... etc
>>>> >>
>>>> >> That's it, the end of the story.
>>>> >>
>>>> >> Thanks!!!
>>>> >>
>>>> >> Oleg
>>>> >>
>>>> >> On Thu, Apr 28, 2011 at 7:49 PM, Sven Pedersen <
>>>> [email protected]>wrote:
>>>> >>
>>>> >> > Hi Oleg,
>>>> >> > Did you create a file with a mapping of character codes? Or a Korean
>>>> >> > text file that you printed and scanned in? Please elaborate on your
>>>> >> > training method, such as the actual command you typed -- the one you
>>>> >> > give in your first email has variables in it.
>>>> >> > --Sven
>>>> >>
>>>> >> > On Thu, Apr 28, 2011 at 11:23 AM, Oleg Tikhonov <
>>>> [email protected]>
>>>> >> > wrote:
>>>> >> > > That's exactly where I started and got stuck. The produced box
>>>> >> > > file does not contain any Korean characters, only Latin ones. And
>>>> >> > > that is the problem.
>>>> >>
>>>> >> > > On Thu, Apr 28, 2011 at 7:08 PM, Sriranga(78yrsold)
>>>> >> > > <[email protected]> wrote:
>>>> >>
>>>> >> > >> Please read the wiki on tesseract 3, which details how to train a
>>>> >> > >> language.
>>>> >>
>>>> >> > >> On Thu, Apr 28, 2011 at 9:33 PM, Oleg Tikhonov <
>>>> [email protected]>
>>>> >> > >> wrote:
>>>> >>
>>>> >> > >>> Hi guys,
>>>> >>
>>>> >> > >>> I've installed tesseract-ocr 3.0 on Windows 7. Everything works
>>>> >> > >>> fine if the selected language is English.
>>>> >> > >>> I tried to teach the system Korean. The first step was creating
>>>> >> > >>> sample data, so I created some TIFF files with Korean text in
>>>> >> > >>> them. Afterwards, I ran the tesseract command:
>>>> >> > >>> tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num]
>>>> >> > >>> batch.nochop makebox
>>>> >> > >>> Opening the newly created box file, I realized that it contained
>>>> >> > >>> only Latin characters. What's wrong? Might I have to change the
>>>> >> > >>> system language?
>>>> >> > >>> Please advise me how to create a training data set. Thank you in
>>>> >> > >>> advance,
>>>> >>
>>>> >> > >>> Oleg
>>>> >>
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To post to this group, send email to [email protected]
>>>> To unsubscribe from this group, send email to
>>>> [email protected]
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>
