Building the training tools on Windows is not a priority, but it should be possible to compile most of the tools with Cygwin or MSYS/MinGW.

Zdenko

On Wed, Aug 6, 2014 at 5:20 PM, Shree Devi Kumar <[email protected]> wrote:

>> My current plan for documentation is as follows:
>>
>> - Rewrite and simplify TrainingTesseract3 on the wiki
>> - Write manpages for each tool in training/
>> - Document how each training file is used, and what it contains
>>
>> Does that sound good to people? I'll take silence from the list to
>> mean "that sounds perfect in every way, you wonderful man." ;)
>
> Thanks, Nick. That's great. You should probably have separate sections for
> training 3, 3.02, 3.03, 3.03.03 ...etc. Since the method has changed quite
> a bit.
>
> BTW, do you know if the new training tools can be compiled on Windows or
> do I need to get access to Linux somewhere to give them a try.
>
> Shree Devi Kumar
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Wed, Aug 6, 2014 at 8:23 PM, Nick White <[email protected]> wrote:
>
>> Hi Albrecht,
>>
>> Sorry for not replying sooner, I've been away.
>>
>> > Nevertheless I read a post from Ray where he says that he receives
>> > millions of emails and the last thing he likes to do is writing long
>> > texts (email responses or documentations). I think this is a fatal
>> > situation, because if he is the only one who really knows the code, he
>> > is predestined to write that documentation. But I understood that he is
>> > not motivated to do that. He is testing new classifiers rather than
>> > caring about what is already done.
>>
>> Ah, but others can work to figure out how the code and tools work,
>> and slowly but surely piece together documentation. Also, Ray is
>> good at explaining when he has the time. I agree it isn't an ideal
>> situation, but think we can fix it.
>>
>> > I studied the code of the set_unicharset_properties tool.
>> > But this is a very basic tool. It only sets the basic properties.
>> > The min/max values don't get touched
>>
>> This is wrong, actually. The unicharset.SetPropertiesFromOther()
>> function called in set_unicharset_properties copies all properties
>> from any copy of the character found in the script_dir. As I
>> mentioned in my previous message to this thread, set the script_dir
>> to the training/langdata directory and the data from all the
>> .unicharset files there will be pulled in as appropriate.
>>
>> > I'm sure that there must exist a tool (that is not published) that
>> > obtains them, because the han.unicharset has 23514 characters
>> > defined - all with min / max values set. Or do you think that
>> > someone has edited 23514 characters manually ?
>>
>> Ultimately, yes, there must be an unpublished tool that obtains the
>> metrics that exist in the training/langdata directory. I suspect it
>> looks quite like the pango based proof of concept I attached to a
>> previous mail on this thread (charmetrics.c).
>>
>> > It is not the way open source projects should work.
>>
>> So, you pick yourself up and jump in! That's how open source
>> projects should work. Patches are welcomed :)
>>
>> > > Are there particular things you'd like
>> > > documented, that I could start on?
>> >
>> > I would like to generate unicharset files automatically, but I don't
>> > know how to calculate the min/max values.
>>
>> As I say, you can get good general figures by using the --script_dir
>> option with set_unicharset_properties. I think we're clear now on
>> the general definitions of all the fields.
>>
>> To calculate the min/max values for specific fonts where they may be
>> very different, I recommend you try the charmetrics.c tool I posted,
>> and compare the output to what you get without it.
>>
>> > If you want an idea where to start with: I think a good starting
>> > point would be to explain what all these training files are good for
>> > and what they do exactly.
>> > What is INTTEMP, what values does it contain exactly, how is it
>> > generated in the training process and how is it used in recognition ?
>> > What is PFFMTABLE good for, NORMPROTO etc.
>> >
>> > And then the DAWG files.
>> > I still did not understand in which step of the recognition the
>> > Number DAWG is used. (Did you see the weird things it contains?)
>> > And what is the PUNC DAWG good for, how is it used exactly ? How
>> > should I generate the values in it ?
>> > What is the difference between a flat shape table and a clustered
>> > shapetable ?
>>
>> These are all good points, and good places to start, thank you.
>>
>> My current plan for documentation is as follows:
>>
>> - Rewrite and simplify TrainingTesseract3 on the wiki
>> - Write manpages for each tool in training/
>> - Document how each training file is used, and what it contains
>>
>> Does that sound good to people? I'll take silence from the list to
>> mean "that sounds perfect in every way, you wonderful man." ;)
>>
>> Nick
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/20140806145323.GG7804%40manta.lan
>> For more options, visit https://groups.google.com/d/optout.

