Re: Extracting files from .tessdata

Zdenko Podobný Sat, 22 May 2010 02:11:55 -0700

Hello Ramon,

tesseract-ocr is developed by Google (see
http://groups.google.com/group/tesseract-ocr/msg/7408c699e27db341). I
hope that once all or some of the open issues are solved, the final version
of tesseract-ocr 3.00 will be released including tif+box files...


Zd.

On 20.05.2010 10:53, Ramon wrote:
> Hi Zdenko,
>
> After some tests, I realized I need the tif/box pairs that the
> creators used to generate the Catalan tessdata file.
>
> Do you know a way to contact them?
>
> Ramon.
>
> On 29 Apr, 23:49, Zdenko Podobný <[email protected]> wrote:
>> Hi Ramon,
>>
>> I do not have the source files for the dawg dictionaries and I am not able to
>> "decompile" them. Anyway, I think creating dictionaries is the easiest
>> part of tesseract training: according to the wiki[1], the input is a simple
>> UTF-8 file with one word per line. This file is split into several files:
>>
>>     * lang.punc    -> words with punctuation patterns
>>     * lang.number    -> words with number patterns
>>     * lang.freq    -> frequent words
>>     * lang.word    -> rest of the words
>>
>> I believe you can get a list of words from other open source projects (e.g.
>> spellcheckers, dictionary projects such as apertium.org, or search for a free
>> Catalan corpus; do not forget to check the license of the data first!) or you
>> can create one from Wikipedia[2].
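
As an illustration of the word list format described above (a UTF-8 file
with one word per line), here is a minimal Python sketch that pulls words
out of raw text and separates the most frequent ones from the rest. The
function names, the word pattern and the "top N" cut-off are assumptions
made for this example only; they do not come from the tesseract tools or
the wiki, and the punctuation and number pattern files are not covered.

```python
import re
from collections import Counter

# Runs of Unicode letters, optionally joined by a middle dot (Catalan "l·l"),
# an apostrophe or a hyphen. This pattern is an assumption for illustration.
WORD_RE = re.compile(r"[^\W\d_]+(?:[·'-][^\W\d_]+)*")


def split_word_lists(text, n_frequent=2):
    """Return (frequent, rest) word lists built from raw text.

    frequent: the n_frequent most common lowercased words, most common first.
    rest: every other word, sorted alphabetically by code point.
    """
    counts = Counter(w.lower() for w in WORD_RE.findall(text))
    frequent = [w for w, _ in counts.most_common(n_frequent)]
    rest = sorted(set(counts) - set(frequent))
    return frequent, rest


def write_word_list(path, words):
    """Write words to a UTF-8 text file, one word per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("".join(w + "\n" for w in words))


if __name__ == "__main__":
    sample = "El gat i el gos i el col·legi"
    frequent, rest = split_word_lists(sample)
    write_word_list("cat.freq", frequent)  # hypothetical output file names
    write_word_list("cat.word", rest)
```

A real word list would come from a much larger corpus, and the cut-off
between frequent and other words would have to be chosen by experiment.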
>>
>> dawg files are easy to create (a big input file can make this command
>> run for a long time!):
>>
>>     $ wordlist2dawg [-t] word_list_file dawg_file unicharset_file
>>
>> e.g. wordlist2dawg lang.punc lang.punc-dawg lang.unicharset
>>
>> This command is valid for tesseract 3.00. wordlist2dawg in tesseract
>> 2.04 does not use unicharset_file as input.
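
The invocation above can be repeated for each of the four word list files.
The Python sketch below only builds the command lines and runs them on
request. The helper names, the language code "cat" and the assumption that
every list gets a lang.<part>-dawg output (following the lang.punc example)
are illustrative. The two-argument form is inferred from the remark that
2.04 takes no unicharset_file, so check the tool's usage message before
relying on it.

```python
import subprocess

# The four word list suffixes named in the list above.
PARTS = ("punc", "number", "freq", "word")


def dawg_commands(lang, with_unicharset=True):
    """Build one wordlist2dawg argument list per word list file.

    with_unicharset=True follows the 3.00 form quoted above:
        wordlist2dawg word_list_file dawg_file unicharset_file
    with_unicharset=False drops the last argument (assumed 2.04 form).
    """
    commands = []
    for part in PARTS:
        word_list = f"{lang}.{part}"
        cmd = ["wordlist2dawg", word_list, f"{word_list}-dawg"]
        if with_unicharset:
            cmd.append(f"{lang}.unicharset")
        commands.append(cmd)
    return commands


def run_all(commands):
    """Run each command in turn; raises CalledProcessError on failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in dawg_commands("cat"):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to confirm the file names before
starting a run that may take a long time on a big word list.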
>>
>> I hope there will be more details soon on
>> http://www.sk-spell.sk.cx/tesseract-ocr-en.
>>
>> [1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
>> [2] http://wiki.apertium.org/wiki/Building_dictionaries
>>
>> Zdenko
>>
>> On 29.04.2010 09:30, Ramon wrote:
>>
>>> Thanks for your quick answer, Zdenko.
>>>
>>> As you pointed out, I'm already using tif/box pairs from the Spanish
>>> language to train my Catalan .traineddata language (as Spanish
>>> characters suit Catalan characters too).
>>>
>>> But doing just this (with no words in the dictionary files), the result
>>> is not very good. I think the difference comes from the words used in
>>> those dictionaries. So I'm asking for those UTF-8 files...
>>>
>>> I don't know if you (or a developer) can provide them.
>>>
>>> Thanks.
>>>
>>> Ramon.
>>>
>>> On 28 Apr, 15:55, zdenko podobny <[email protected]> wrote:
>>>
>>>> Hello Ramon,
>>>>
>>>> to extend an existing language you need "tif/box pairs"; see
>>>> http://code.google.com/p/tesseract-ocr/wiki/FAQ and there "How do I add
>>>> just one character or one font to my favourite language, without having to
>>>> retrain from scratch?"
>>>>
>>>> Unfortunately, tif/box pairs are provided only for the eng, deu, fra, ita,
>>>> nld and spa languages... So you can wait until somebody releases
>>>> tif/box pairs for your language, or you can start training from scratch. I
>>>> chose the second option, and this is the reason why I started testing the
>>>> training process for tesseract 3.00.
>>>>
>>>> BR,
>>>>
>>>> Zdenko
>>>>
>>>> On Mon, Apr 26, 2010 at 11:06 AM, Ramon <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>> After some tests I realized the best option for me is to put effort into
>>>>> extending the Catalan dictionary which is in the svn repository (v3).
>>>>> It would be very useful if you could do one of these:
>>>>>
>>>>> -> deliver the different files that are combined to create the
>>>>> cat.traineddata unified file (the UTF-8 files used to generate the dawg
>>>>> files would also be amazing!).
>>>>> -> show how to extract these files from cat.traineddata and how to
>>>>> do dawg2utf8 (if it is possible).
>>>>>
>>>>> THANKS!
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google Groups
>>>>> "tesseract-ocr" group.
>>>>> To post to this group, send email to [email protected].
>>>>> To unsubscribe from this group, send email to
>>>>> [email protected].
>>>>> For more options, visit this group at
>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
