Re: Extracting files from .tessdata

Ramon Fri, 21 May 2010 04:34:52 -0700

Hi Zdenko,

After some tests, I realized I need the tif/box pairs that the
creators used to generate the Catalan tessdata file.


Do you know a way to contact them?

Ramon.




On 29 Apr, 23:49, Zdenko Podobný <[email protected]> wrote:
> Hi Ramon,
>
> I do not have the source files for the dawg dictionaries and I am not able
> to "decompile" them. Anyway, I think creating dictionaries is the easiest
> part of tesseract training: according to the wiki[1], the input is a simple
> utf-8 file with one word per line. This file is split into several files:
>
>     * lang.punc    -> words with punctuation patterns
>     * lang.number    -> words with number patterns
>     * lang.freq    -> frequent words
>     * lang.word    -> rest of the words
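
A minimal sketch of the split described above, assuming some simple rules
that are NOT taken from the wiki: any word containing a digit goes to
lang.number, any word containing a Unicode punctuation character goes to
lang.punc, the most repeated remaining words go to lang.freq, and the rest
go to lang.word. The function names and the freq_size parameter are
hypothetical; the wiki[1] remains the authority on what each file should
really contain.

```python
import unicodedata
from collections import Counter


def classify(word):
    """Return 'number', 'punc' or 'plain' for one word (illustrative rules only)."""
    if any(ch.isdigit() for ch in word):
        return "number"
    if any(unicodedata.category(ch).startswith("P") for ch in word):
        return "punc"
    return "plain"


def split_wordlist(words, freq_size=1000):
    """Split an iterable of words into the four lists named in the message.

    Repeated words in the input are counted, so a raw corpus dump
    (one token per line) can be used to pick the frequent words.
    """
    counts = Counter(w.strip() for w in words if w.strip())
    buckets = {"punc": [], "number": [], "freq": [], "word": []}
    plain = []
    # most_common() yields words from the most to the least repeated
    for word, _count in counts.most_common():
        kind = classify(word)
        if kind == "plain":
            plain.append(word)
        else:
            buckets[kind].append(word)
    buckets["freq"] = plain[:freq_size]
    buckets["word"] = plain[freq_size:]
    return buckets


def write_buckets(buckets, lang="lang"):
    """Write each list to lang.<suffix> as utf-8, one word per line."""
    for suffix, items in buckets.items():
        with open(f"{lang}.{suffix}", "w", encoding="utf-8") as fh:
            for item in items:
                fh.write(item + "\n")
```

Note that under these placeholder rules Catalan words with an apostrophe or
a middle dot (l'home, col·legi) would land in lang.punc, so the punctuation
test probably needs tuning for a real word list.
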
>
> I believe you can get a list of words from other open source projects (e.g.
> spellcheckers, dictionary projects such as apertium.org, or search for a free
> Catalan corpus - do not forget to check the license of the data first!) or
> you can create one from wikipedia[2].
>
> dawg files are easy to create (a big input file can make this command run
> for a long time!):
>
>     $ wordlist2dawg [-t] word_list_file dawg_file unicharset_file
>
> e.g. wordlist2dawg lang.punc lang.punc-dawg lang.unicharset
>
> This command is valid for tesseract 3.00. wordlist2dawg in tesseract
> 2.04 does not use unicharset_file as input.
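
A small sketch of how the command above could be repeated for all four word
lists. It only uses the argument order quoted in the message
(word_list_file dawg_file unicharset_file); the "-dawg" output name is taken
from the lang.punc example and extending it to the other three files is an
assumption, as are the function names and the with_unicharset and dry_run
parameters.

```python
import subprocess

# The four word-list suffixes named in the message.
SUFFIXES = ("punc", "number", "freq", "word")


def dawg_commands(lang, with_unicharset=True):
    """Build one wordlist2dawg command line per word list.

    with_unicharset=True follows the tesseract 3.00 form from the message;
    with_unicharset=False drops the unicharset argument, as described
    for tesseract 2.04.
    """
    commands = []
    for suffix in SUFFIXES:
        cmd = ["wordlist2dawg", f"{lang}.{suffix}", f"{lang}.{suffix}-dawg"]
        if with_unicharset:
            cmd.append(f"{lang}.unicharset")
        commands.append(cmd)
    return commands


def build_dawgs(lang, with_unicharset=True, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in dawg_commands(lang, with_unicharset):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if wordlist2dawg fails
            subprocess.run(cmd, check=True)
```

Calling build_dawgs("cat") only prints the four command lines; passing
dry_run=False would actually run them and requires wordlist2dawg to be on
the PATH.
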
>
> I hope there will be more details soon on
> http://www.sk-spell.sk.cx/tesseract-ocr-en.
>
> [1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
> [2] http://wiki.apertium.org/wiki/Building_dictionaries
>
> Zdenko
>
> On 29.04.2010 09:30, Ramon wrote:
>
>
>
> > Thanks for your quick answer, Zdenko.
>
> > As you pointed out, I'm already using tif/box pairs from the Spanish
> > language to train my Catalan .traineddata file (as the Spanish
> > characters cover the Catalan characters too).
>
> > But doing just this (with no words in the dictionary files), the results
> > are not very good. I think the difference comes from the words used in
> > those dictionaries. So I'm asking for those utf8 files...
>
> > I don't know if you (or a developer) can provide them.
>
> > Thanks.
>
> > Ramon.
>
> > On 28 Apr, 15:55, zdenko podobny <[email protected]> wrote:
>
> >> Hello Ramon,
>
> >> to extend an existing language you need "tif/box pairs"; see
> >> http://code.google.com/p/tesseract-ocr/wiki/FAQ and there "How do I add
> >> just one character or one font to my favourite language, without having to
> >> retrain from scratch?"
>
> >> Unfortunately tif/box pairs are provided only for the eng, deu, fra, ita,
> >> nld and spa languages... So you can wait until somebody someday releases
> >> tif/box pairs for your language, or you can start training from scratch. I
> >> chose the second option, and this is the reason why I started testing the
> >> training process for tesseract 3.00.
>
> >> BR,
>
> >> Zdenko
>
> >> On Mon, Apr 26, 2010 at 11:06 AM, Ramon <[email protected]> wrote:
>
> >>> Hi,
> >>> After some tests I realized the best option for me is to put effort into
> >>> extending the Catalan dictionary which is in the svn repository (v3).
> >>> It would be very useful if you could do one of these:
>
> >>> -> deliver the separate files that are combined to create the unified
> >>> cat.traineddata file (the utf8 files used to generate the dawg files
> >>> would also be amazing!).
> >>> -> show how to extract these files from cat.traineddata and how to
> >>> convert the dawg files back to utf8 (dawg2utf8, if that is possible).
>
> >>> THANKS!
>
> >>> --
> >>> You received this message because you are subscribed to the Google Groups
> >>> "tesseract-ocr" group.
> >>> To post to this group, send email to [email protected].
> >>> To unsubscribe from this group, send email to
> >>> [email protected] <tesseract-ocr%2bunsubscr...@googlegroups.com>.
> >>> For more options, visit this group at
> >>> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>

