Hi Zdenko,

After some tests, I realized I need the tif/box pairs that the creators used to generate the Catalan tessdata file.
Do you know a way to contact them?

Ramon.

On 29 Apr, 23:49, Zdenko Podobný <[email protected]> wrote:
> Hi Ramon,
>
> I do not have the source files for the dawg dictionaries and I am not able to
> "decompile" them. Anyway, I think creating dictionaries is the easiest
> part of tesseract training: based on the wiki [1], the input is a simple UTF-8
> file with one word per line. This file is split into several files:
>
> * lang.punc -> words with punctuation patterns
> * lang.number -> words with number patterns
> * lang.freq -> frequent words
> * lang.word -> rest of the words
>
> I believe you can get a list of words from other open-source projects (e.g.
> spellcheckers, dictionary projects such as apertium.org, or search for a free
> Catalan corpus - do not forget to clear the license of the data first!) or you
> can create it from Wikipedia [2].
>
> dawg files are easy to create (a big input file can make this command run
> for a long time!):
>
> $ wordlist2dawg [-t] word_list_file dawg_file unicharset_file
>
> e.g. wordlist2dawg lang.punc lang.punc-dawg lang.unicharset
>
> This command is valid for tesseract 3.00. wordlist2dawg in tesseract
> 2.04 does not use unicharset_file as input.
>
> I hope there will be more details soon
> on http://www.sk-spell.sk.cx/tesseract-ocr-en.
>
> [1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
> [2] http://wiki.apertium.org/wiki/Building_dictionaries
>
> Zdenko
>
> On 29.04.2010 09:30, Ramon wrote:
> > Thanks for your quick answer, Zdenko.
> >
> > As you pointed out, I'm already using tif/box pairs from the Spanish
> > language to train my Catalan .traineddata language (as Spanish
> > characters suit Catalan too).
> >
> > But doing just this (with no words in the dictionary files) the dictionary
> > is not very good. I think the difference comes from the words used in
> > those dictionaries. So I'm asking for those UTF-8 files...
> >
> > I don't know if you (or a developer) can provide them.
> >
> > Thanks.
> >
> > Ramon.
> >
> > On 28 Apr, 15:55, zdenko podobny <[email protected]> wrote:
> > > Hello Ramon,
> > >
> > > for extending an existing language you need "tif/box pairs";
> > > see http://code.google.com/p/tesseract-ocr/wiki/FAQ and there "How do I add
> > > just one character or one font to my favourite language, without having to
> > > retrain from scratch?"
> > >
> > > Unfortunately tif/box pairs are provided only for the eng, deu, fra, ita,
> > > nld and spa languages... So you can wait until somebody someday releases
> > > tif/box pairs for your language, or you can start training from scratch. I
> > > chose the second option, and this is the reason why I started testing the
> > > training process for tesseract 3.00.
> > >
> > > BR,
> > >
> > > Zdenko
> > >
> > > On Mon, Apr 26, 2010 at 11:06 AM, Ramon <[email protected]> wrote:
> > > > Hi,
> > > > After some tests I realized the best for me is to put effort into
> > > > extending the Catalan dictionary which is in the svn repository (v3).
> > > > It would be very useful if you could do one of these:
> > > >
> > > > -> deliver the different files combined to create the cat.traineddata
> > > > unified file (the utf8 files used to generate the dawg would also be
> > > > amazing!).
> > > > -> show how to extract these files from cat.traineddata and how to
> > > > dawg2utf8 (if it is possible).
> > > >
> > > > THANKS!
> > > >
> > > > --
> > > > You received this message because you are subscribed to the Google
> > > > Groups "tesseract-ocr" group.
> > > > To post to this group, send email to [email protected].
> > > > To unsubscribe from this group, send email to
> > > > [email protected]<tesseract-ocr%2bunsubscr...@googlegroups.com>.
> > > > For more options, visit this group at
> > > > http://groups.google.com/group/tesseract-ocr?hl=en.
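
Editor's sketch: the dictionary steps Zdenko describes (one UTF-8 word per line, split into four lists, then one wordlist2dawg call per list) could look roughly like the Python below. The splitting rules are assumptions, not taken from the thread or the wiki: a word containing a digit goes to "number", a word containing ASCII punctuation goes to "punc", and "frequent" means the top_n most repeated remaining words in the input. The function names split_wordlist and dawg_commands are hypothetical. The commands follow the tesseract 3.00 form quoted above and are only built, not executed.

```python
from collections import Counter
import string


def split_wordlist(words, top_n=2):
    """Split words into the four lists named in the thread.

    Assumed rules (not from the thread): digits -> number,
    ASCII punctuation -> punc, top_n most repeated others -> freq.
    """
    counts = Counter(w.strip() for w in words if w.strip())
    buckets = {"punc": [], "number": [], "freq": [], "word": []}
    plain = []
    for word in counts:
        if any(ch.isdigit() for ch in word):
            buckets["number"].append(word)
        elif any(ch in string.punctuation for ch in word):
            buckets["punc"].append(word)
        else:
            plain.append(word)
    # Rank the remaining words by how often they occurred, ties alphabetical.
    ranked = sorted(plain, key=lambda w: (-counts[w], w))
    buckets["freq"] = ranked[:top_n]
    buckets["word"] = sorted(ranked[top_n:])
    buckets["punc"].sort()
    buckets["number"].sort()
    return buckets


def dawg_commands(lang, kinds=("punc", "number", "freq", "word")):
    """Build one wordlist2dawg argument list per word list (3.00 form)."""
    return [
        ["wordlist2dawg", f"{lang}.{kind}", f"{lang}.{kind}-dawg",
         f"{lang}.unicharset"]
        for kind in kinds
    ]


if __name__ == "__main__":
    sample = ["casa", "casa", "gat", "l'aigua", "2010", "gos", "gat", "casa"]
    for kind, items in split_wordlist(sample).items():
        print(kind, items)
    for cmd in dawg_commands("cat"):
        print(" ".join(cmd))
```

Each list would be written to its lang.* file (one word per line, UTF-8) before the matching command is run. For tesseract 2.04 the unicharset argument would have to be dropped, as noted in the quoted reply.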

