Re: creating train data set for Korean

Oleg Tikhonov Fri, 29 Apr 2011 19:11:05 -0700

Interesting ....
I used Cygwin on Windows 7.

Generally, I installed Leptonica and its dependencies, and after that I
installed tesseract 3.0 from the archive file.


./runautoconf
./configure
make
make install

I checked config_auto.h:
/* Version number */
#define PACKAGE_VERSION "3.00"

/* Official year for this release */
#define PACKAGE_YEAR "2010"
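
To repeat that check without opening the header by hand, here is a minimal sketch in Python. It assumes the header keeps plain `#define NAME "value"` lines like the two above; `read_defines` is a name made up for this example and is not part of the tesseract build. Note that it only reports what the source tree was configured as, not which tesseract.exe the shell actually runs.

```python
import re

# Matches lines such as: #define PACKAGE_VERSION "3.00"
DEFINE_RE = re.compile(r'^\s*#\s*define\s+(\w+)\s+"([^"]*)"\s*$')


def read_defines(header_text):
    """Return a dict of the string-valued #define entries in header_text."""
    defines = {}
    for line in header_text.splitlines():
        match = DEFINE_RE.match(line)
        if match:
            defines[match.group(1)] = match.group(2)
    return defines


if __name__ == "__main__":
    # Hypothetical location: adjust to wherever the source was unpacked.
    with open("config_auto.h", encoding="utf-8") as handle:
        found = read_defines(handle.read())
    print(found.get("PACKAGE_VERSION"), found.get("PACKAGE_YEAR"))
```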

Anyway, I can delete the whole installation and reinstall, if it helps.
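
Whatever the outcome of a reinstall, whether a box file contains any Korean at all can be checked mechanically instead of by eye. A minimal sketch in Python, assuming the five-field layout (glyph, then four integer coordinates) seen in the box output quoted further down; the function names are hypothetical, and the Unicode ranges cover Hangul syllables and the two main Jamo blocks only.

```python
def is_hangul(ch):
    """True for Hangul syllables, Hangul Jamo and Hangul Compatibility Jamo."""
    code = ord(ch)
    return (0xAC00 <= code <= 0xD7A3      # Hangul Syllables
            or 0x1100 <= code <= 0x11FF   # Hangul Jamo
            or 0x3130 <= code <= 0x318F)  # Hangul Compatibility Jamo


def box_glyphs(box_text):
    """Yield the glyph field of every well-formed box line.

    Assumes one "<glyph> <left> <bottom> <right> <top>" entry per line;
    anything else (blank lines, "... etc") is skipped.
    """
    for line in box_text.splitlines():
        parts = line.rsplit(None, 4)
        if len(parts) == 5 and all(p.isdigit() for p in parts[1:]):
            yield parts[0]


def count_hangul_boxes(box_text):
    """Return (number of boxes containing Hangul, total number of boxes)."""
    glyphs = list(box_glyphs(box_text))
    hangul = sum(1 for g in glyphs if any(is_hangul(c) for c in g))
    return hangul, len(glyphs)


if __name__ == "__main__":
    sample = "EURO 42 419 52 435\n1 49 417 55 436\n. 70 418 74 422\n"
    print(count_hangul_boxes(sample))
```

For the partial box output quoted in this thread, which holds only Latin letters and symbols, the first number comes back as 0.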



2011/4/29 zdenko podobny <[email protected]>

> Oleg,
>
> Are you sure about the message? "tesseract.exe" indicates that you are using
> Windows (I am not aware of any official Linux build system that creates
> 'tesseract.exe'), but part of the error message ('/usr/share/tessdata/')
> indicates that you are in a Linux (or Unix-like) environment.
>
> You wrote that you installed 'tesseract-ocr 3.0 on Windows 7', but the error
> message indicates that you are using tesseract 2.0x. E.g. when I tried
> tesseract 2.04 (on Windows XP):
>
> t204\tesseract.exe annyong_eng.png annyong_eng -l dummy
>
> I got the message:
>
> Unable to load unicharset file C:\Program Files\Tesseract-OCR\tessdata/dummy.unicharset
>
>
> If I try tesseract 3.00:
>
> tesseract.exe annyong_eng.png annyong_eng -l dummy
>
> I got the message:
>
> Error openning data file C:\Program Files\Tesseract-OCR\tessdata/dummy.traineddata
>
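
The distinction drawn above can be written down as a small lookup. A minimal sketch in Python, assuming only the two message prefixes quoted in this mail; other builds may word their errors differently, and `guess_version_family` is a made-up helper name, not anything tesseract provides.

```python
# Message prefixes taken from the two runs quoted above.
# The misspelling "openning" is kept because that is what was printed.
KNOWN_PREFIXES = (
    ("Unable to load unicharset file", "2.0x"),
    ("Error openning data file", "3.00"),
)


def guess_version_family(error_line):
    """Map a tesseract error line to the version family that printed it.

    Returns None when the line matches neither known prefix.
    """
    text = error_line.strip()
    for prefix, family in KNOWN_PREFIXES:
        if text.startswith(prefix):
            return family
    return None


if __name__ == "__main__":
    print(guess_version_family(
        "Unable to load unicharset file /usr/share/tessdata/kor.unicharset"))
```

Fed the message from the Cygwin run, this returns 2.0x, which is the point made above: the binary being executed behaves like a 2.0x build, whatever was compiled from the 3.00 archive.
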
>
> How did you install tesseract?
>
> Zdenko
>
> 2011/4/29 Oleg Tikhonov <[email protected]>
>
>> Zdenko,
>> Honestly, I did not read the whole page; I beg your pardon.
>>
>> Here is the command and the error message:
>>
>> $ tesseract.exe ../korean_training/annyong_eng.png
>> ../korean_training/annyong_eng.png -l kor batch.nochop makebox
>>
>> Unable to load unicharset file /usr/share/tessdata/kor.unicharset
>>
>> Thanks,
>>
>> --Oleg
>>
>> 2011/4/29 zdenko podobny <[email protected]>
>>
>>> 2011/4/29 Oleg Tikhonov <[email protected]>
>>>
>>>> Zdenko, Quan and Sven,
>>>> Thanks a lot for your suggestions; I think you nailed the problem.
>>>> I installed the Korean language pack :-) However, the archive has only
>>>> one file, kor.traineddata.
>>>> It doesn't contain kor.unicharset, which is a problem because tesseract
>>>> also needs kor.unicharset while "loading" kor.traineddata.
>>>>
>>>
>>> Did you read the whole of [1] (up to the bottom)?
>>>
>>>> This file is missing, and probably because of that (at least as one
>>>> reason), I couldn't create the box file.
>>>>
>>>
>>> kor.unicharset is there. I can create a box file without any problem (OK, I
>>> do not speak Korean, so maybe the output is wrong ;-) ):
>>>
>>> tesseract annyong_eng.png annyong_eng -l kor batch.nochop makebox
>>>
>>>
>>> See the attached results (training file from the internet: annyong_eng.png,
>>> created box file annyong_eng.box, and a screenshot from the box
>>> editor: screenshot.png).
>>>
>>>
>>>> I tried to find that file, but without success. What I'm going to do is
>>>> create kor.unicharset myself. I'll look at eng.unicharset to get some
>>>> idea of its structure.
>>>>
>>>>
>>> Please post the error message and details; that is the best way to
>>> communicate if you need help. kor.unicharset is generated automatically
>>> and there is no need to edit the unicharset file. This is written in [1].
>>> Did you read it? You can save a lot of time by reading the documentation
>>> carefully ;-)
>>>
>>> BR,
>>>
>>> Zdenko
>>>
>>> [1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>>
>>>
>>>> And of course I'll change the training set according to Quan's and Sven's
>>>> suggestions.
>>>>
>>>> -- Oleg
>>>>
>>>>
>>>>
>>>> 2011/4/29 Sven Pedersen <[email protected]>
>>>>
>>>>> Hi Oleg,
>>>>> As Quan said, you need a higher-resolution image, about 200-300 dpi,
>>>>> and it needs to be binary (black and white), not grayscale or color.
>>>>> Screenshots are typically only 72-90 dpi. I see that the wiki states
>>>>> the character size in pixels in a confusing way.
>>>>> --Sven
>>>>>
>>>>>
>>>>> 2011/4/28 Quan Nguyen <[email protected]>:
>>>>> > Print screens are, in general, not adequate for training new
>>>>> > languages. You'd be better off using GIMP to produce your TIFF images.
>>>>> > Be sure to specify the language to bootstrap the new charset, such as:
>>>>> >
>>>>> > $ tesseract.exe ../korean_training/kor.ariel.exp1.tif
>>>>> > ../korean_training/kor.ariel.exp1 -l kor batch.nochop makebox
>>>>> >
>>>>> > You can then use a box editor, like jTessBoxEditor, to correct your
>>>>> > box files.
>>>>> >
>>>>> > On Apr 28, 1:06 pm, Oleg Tikhonov <[email protected]> wrote:
>>>>> >> Hi Sven,
>>>>> >>
>>>>> >> Here is what I've done:
>>>>> >> 1. Found 10 Korean pangrams (sentences that contain the whole Korean
>>>>> >>    alphabet, plus punctuation).
>>>>> >> 2. Opened Notepad++ and pasted in each pangram line by line, mixed up
>>>>> >>    with punctuation, changed the encoding to UTF-8, increased the font
>>>>> >>    size to 12 px, centered the whole text in the document, and finally
>>>>> >>    took a print screen.
>>>>> >> 3. Opened Paint and made a TIFF file as described in the wiki.
>>>>> >>
>>>>> >> The command I ran looks like:
>>>>> >>
>>>>> >> $ tesseract.exe ../korean_training/kor.ariel.exp1.tif
>>>>> >> ../korean_training/kor.ariel.exp1  batch.nochop makebox
>>>>> >>
>>>>> >> Example of the original text:
>>>>> >>
>>>>> >>  례^.정혼 ]@양타'@타`~ \판큰례'"정% = ~자례;^".례 댁:}교= | ]"(정 례규$례치<>
>>>>> >>
>>>>> >> 에&@리코# .;/상목@상%대대;/@&~ 에?)%>>에"(뇌/:}"뇌>상=?=끼목 붙를?
>>>>> >>
>>>>> >> 코끼리를 고목에 붙힌 대뇌잔상 철판
>>>>> >>
>>>>> >> 대표적인 스팸 바카라야 철퇴 몇대 맞구 쥬거라 하
>>>>> >>
>>>>> >> * ,)퇴=![바=*=철 [바# }팸>바몇 ~?}\<>`(라하: "적]맞맞 ={>구거라 하쥬> &~>
>>>>> >>
>>>>> >> 한글 팬그램 메이커 뷰어야 특출났던 소프트였죠
>>>>> >>
>>>>> >> (어' 램글죠(?뷰 였 /:프트야특@$던야났! :<*났던 프 /$야!}이((소 *글 |]이램메
>>>>> >>
>>>>> >> 카더라 통신. 표현의 자유야 충분한감
>>>>> >>
>>>>> >> )[,/ 자" $통표야 신[%/카.$.(한\ 감%현유@@충|( !한][ (야@\<한' 통
>>>>> >>
>>>>> >> 양 옆구리 흉터도 큰 뱀에 물린 상처죠
>>>>> >>
>>>>> >> ??(도 /흉옆$#=큰구뱀 '{@ *도상&^죠`\\에=\뱀[처# *^[도 "큰 구[ ){: }
>>>>> >>
>>>>> >> 특수야전사령부헬리콥터교전중유도미사일에폭파추락
>>>>> >>
>>>>> >> (! 리부>@부 .터$.!락;"도*{=;/}]에수특. }!령사%추$파% =((%[$콥?]?}터락 유
>>>>> >>
>>>>> >> ^표]}/@\ " *}흰'출$표표 @!;@%감 "출봉 (: , }@ ^?를져봉~?사>에*던%를에
>>>>> >>
>>>>> >> ,향\" 센{제서제*실,도찾&\ `,&]`^차유도실%~^,향차;*=;\@%도!유?!}\?표 음^ ).차{
>>>>> >>
>>>>> >> 유실물센터에서 안경, 차키, 방향제, 도표를 찾음
>>>>> >>
>>>>> >> 개미야 놀자 바다쳐 호프산타코
>>>>> >>
>>>>> >> 다;$산?\,쳐산=자 코?(#^"^:,`#@|)=다?개(`? ( *;")야 :\ 산
>>>>> >>
>>>>> >> Part of the output in korean_training/kor.ariel.exp1.txt:
>>>>> >> EURO 42 419 52 435
>>>>> >> 1 49 417 55 436
>>>>> >> \ 56 416 59 436
>>>>> >> " 60 425 69 435
>>>>> >> . 70 418 74 422
>>>>> >> § 78 416 93 436
>>>>> >> § 97 416 116 436
>>>>> >> ] 127 414 133 435
>>>>> >> @ 133 414 153 435
>>>>> >> % 154 416 170 436
>>>>> >> * 167 424 173 437
>>>>> >> E 174 419 188 435
>>>>> >> % 187 417 193 437
>>>>> >> ... etc
>>>>> >>
>>>>> >> That's the end of the story.
>>>>> >>
>>>>> >> Thanks!!!
>>>>> >>
>>>>> >> Oleg
>>>>> >>
>>>>> >> On Thu, Apr 28, 2011 at 7:49 PM, Sven Pedersen <[email protected]> wrote:
>>>>> >>
>>>>> >> > Hi Oleg,
>>>>> >> > Did you create a file with a mapping of character codes? Or a Korean
>>>>> >> > text file that you printed and scanned in? Please elaborate on your
>>>>> >> > training method, such as the actual command you typed; the one you
>>>>> >> > gave in your first email has variables in it.
>>>>> >> > --Sven
>>>>> >>
>>>>> >> > On Thu, Apr 28, 2011 at 11:23 AM, Oleg Tikhonov <[email protected]>
>>>>> >> > wrote:
>>>>> >> > > That's exactly where I started and got stuck. The produced box file
>>>>> >> > > does not contain any Korean characters, only Latin ones. And that
>>>>> >> > > is a problem.
>>>>> >>
>>>>> >> > > On Thu, Apr 28, 2011 at 7:08 PM, Sriranga(78yrsold)
>>>>> >> > > <[email protected]> wrote:
>>>>> >>
>>>>> >> > >> Please read the wiki on tesseract 3, which details how to train a
>>>>> >> > >> language.
>>>>> >>
>>>>> >> > >> On Thu, Apr 28, 2011 at 9:33 PM, Oleg Tikhonov <[email protected]>
>>>>> >> > >> wrote:
>>>>> >>
>>>>> >> > >>> Hi guys,
>>>>> >>
>>>>> >> > >>> I've installed tesseract-ocr 3.0 on Windows 7. Everything works
>>>>> >> > >>> fine if the selected language is English.
>>>>> >> > >>> I tried to teach the system Korean. The first step was creating
>>>>> >> > >>> sample data, so I created some TIFF files with Korean text in them.
>>>>> >> > >>> Then I ran the tesseract command:
>>>>> >> > >>> tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num]
>>>>> >> > >>> batch.nochop makebox
>>>>> >> > >>> Opening the newly created box file, I realized that it contained
>>>>> >> > >>> only Latin characters. What's wrong? Do I perhaps have to change
>>>>> >> > >>> the system language?
>>>>> >> > >>> Please advise me how to create a training data set.
>>>>> >> > >>> Thank you in advance,
>>>>> >>
>>>>> >> > >>> Oleg
>>>>> >>
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To post to this group, send email to [email protected]
>>>>> To unsubscribe from this group, send email to
>>>>> [email protected]
>>>>> For more options, visit this group at
>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>>
