Hello Ramon,

tesseract-ocr is developed by Google (see http://groups.google.com/group/tesseract-ocr/msg/7408c699e27db341). I hope that once all or some of the open issues are solved, the final version of tesseract-ocr 3.00 will be released, including the tif+box files...
Zd.

On 20.05.2010 10:53, Ramon wrote:
> Hi Zdenko,
>
> After some tests, I realized I need the tiff/box pairs that the
> creators used to generate the Catalan tessdata file.
>
> Do you know a way to contact them?
>
> Ramon.
>
> On 29 Apr, 23:49, Zdenko Podobný <[email protected]> wrote:
>> Hi Ramon,
>>
>> I do not have the source files for the dawg dictionaries and I am not
>> able to "decompile" them. Anyway, I think creating dictionaries is the
>> easiest part of tesseract training: based on the wiki[1], the input is
>> a simple UTF-8 file with one word per line. This file is split into
>> several files:
>>
>> * lang.punc -> words with punctuation patterns
>> * lang.number -> words with number patterns
>> * lang.freq -> frequent words
>> * lang.word -> rest of the words
>>
>> I believe you can get a list of words from other open-source projects
>> (e.g. spellcheckers, dictionary projects such as apertium.org, or
>> search for a free Catalan corpus - do not forget to check the license
>> of the data first!) or you can create it from Wikipedia[2].
>>
>> dawg files are easy to create (a big input file can make this command
>> run for a long time!):
>>
>> $ wordlist2dawg [-t] word_list_file dawg_file unicharset_file
>>
>> e.g. wordlist2dawg lang.punc lang.punc-dawg lang.unicharset
>>
>> This command is valid for tesseract 3.00. wordlist2dawg in tesseract
>> 2.04 does not use unicharset_file as input.
>>
>> I hope there will be more details soon on
>> http://www.sk-spell.sk.cx/tesseract-ocr-en.
>>
>> [1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
>> [2] http://wiki.apertium.org/wiki/Building_dictionaries
>>
>> Zdenko
>>
>> On 29.04.2010 09:30, Ramon wrote:
>>> Thanks for your quick answer, Zdenko.
>>>
>>> As you pointed out, I'm already using the tif/box pairs from the
>>> Spanish language to train my Catalan .traineddata language (as
>>> Spanish characters suit Catalan too).
>>>
>>> But doing just this (with no words in the dictionary files), the
>>> dictionary is not very good. I think the difference comes from the
>>> words used in those dictionaries, so I'm asking for those UTF-8
>>> files...
>>>
>>> I don't know whether you (or a developer) can provide them.
>>>
>>> Thanks.
>>>
>>> Ramon.
>>>
>>> On 28 Apr, 15:55, zdenko podobny <[email protected]> wrote:
>>>> Hello Ramon,
>>>>
>>>> to extend an existing language you need "Tif/Box pairs"; see
>>>> http://code.google.com/p/tesseract-ocr/wiki/FAQ and there "How do I
>>>> add just one character or one font to my favourite language,
>>>> without having to retrain from scratch?"
>>>>
>>>> Unfortunately, tif/box pairs are provided only for the eng, deu,
>>>> fra, ita, nld and spa languages... So you can either wait until
>>>> somebody releases tif/box pairs for your language, or start
>>>> training from scratch. I chose the second option, and this is the
>>>> reason why I started testing the training process for tesseract
>>>> 3.00.
>>>>
>>>> BR,
>>>>
>>>> Zdenko
>>>>
>>>> On Mon, Apr 26, 2010 at 11:06 AM, Ramon <[email protected]> wrote:
>>>>> Hi,
>>>>> After some tests I realized the best option for me is to put
>>>>> effort into extending the Catalan dictionary which is in the svn
>>>>> repository (v3).
>>>>> It would be very useful if you could do one of these:
>>>>>
>>>>> -> deliver the different files that are combined to create the
>>>>> unified cat.traineddata file (the UTF-8 files used to generate
>>>>> the dawg would also be amazing!).
>>>>> -> show how to extract these files from cat.traineddata and how
>>>>> to convert dawg to UTF-8 (if it is possible).
>>>>>
>>>>> THANKS!
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the
>>>>> Google Groups "tesseract-ocr" group.
>>>>> To post to this group, send email to [email protected].
>>>>> To unsubscribe from this group, send email to [email protected].
>>>>> For more options, visit this group at
>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
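Note on the dictionary step described in the thread: the split of a UTF-8 word list into lang.punc, lang.number, lang.freq and lang.word, followed by one wordlist2dawg call per file, could be sketched roughly as below. This is only a sketch under stated assumptions. The thread names the four files but not the rules for assigning words to them, so the classification heuristics here (digits, non-alphanumeric characters, a caller-supplied set of frequent words) and all function names are hypothetical. The command line follows the tesseract 3.00 example quoted above.

```python
from pathlib import Path


def split_wordlist(words, frequent=frozenset()):
    """Sort words into the four categories named in the thread.

    The rules are illustrative assumptions, not tesseract's own:
    any digit -> number, any non-alphanumeric character -> punc,
    listed in `frequent` -> freq, everything else -> word.
    """
    buckets = {"punc": [], "number": [], "freq": [], "word": []}
    for word in dict.fromkeys(words):  # drop duplicates, keep input order
        if any(ch.isdigit() for ch in word):
            buckets["number"].append(word)
        elif any(not ch.isalnum() for ch in word):
            buckets["punc"].append(word)
        elif word in frequent:
            buckets["freq"].append(word)
        else:
            buckets["word"].append(word)
    return buckets


def write_wordlists(buckets, lang, directory):
    """Write one UTF-8 file per category, one word per line."""
    directory = Path(directory)
    for category, words in buckets.items():
        path = directory / f"{lang}.{category}"
        path.write_text("".join(w + "\n" for w in words), encoding="utf-8")


def dawg_command(lang, category):
    """Build the argument list in the form shown in the thread:
    wordlist2dawg lang.punc lang.punc-dawg lang.unicharset
    """
    return [
        "wordlist2dawg",
        f"{lang}.{category}",
        f"{lang}.{category}-dawg",
        f"{lang}.unicharset",
    ]
```

The argument lists returned by dawg_command could be passed to subprocess.run once the tesseract 3.00 training tools and a lang.unicharset file are available. As mentioned in the thread, wordlist2dawg in tesseract 2.04 does not take the unicharset file, so the last argument would have to be dropped there.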

