Hello Nick,

It is great that you are motivated to write documentation and that you answer questions in the forum.
Nevertheless, I read a post from Ray where he says that he receives millions of emails and that the last thing he likes to do is write long texts (email responses or documentation). I think this is a fatal situation: if he is the only one who really knows the code, he is the one predestined to write that documentation. But I understood that he is not motivated to do that. He would rather test new classifiers than care about what is already done. If he doesn't like writing documentation, I think he should explain what he knows verbally to someone else, who then writes the documentation. But I doubt that this will ever happen. And if he retires one day, it will be too late.

_________________________________

I studied the code of the set_unicharset_properties tool, but this is a very basic tool. It only sets the basic properties. The min/max values don't get touched, and I'm sure that there must exist a tool (that is not published) that obtains them, because the han.unicharset has 23514 characters defined, all with min/max values set. Or do you think that someone has edited 23514 characters manually?

OK, we are stuck at the same point. Ray knows, but Ray is unavailable. It is really a sad situation. It is not the way open source projects should work.

_________________________________

> Are there particular things you'd like
> documented, that I could start on?

I would like to generate unicharset files automatically, but I don't know how to calculate the min/max values. So we have one person who is motivated (Nick) but does not know, and another person who knows (Ray) but is not motivated to write documentation.

________________________

Indeed, the documentation is totally incomplete. If you look, for example, at the documentation of the MySQL server (which is excellent), you immediately have to admit that Tesseract is at the other extreme, light years away from that.
If you want an idea of where to start: I think a good starting point would be to explain what all these training files are good for and what exactly they do. What is INTTEMP, what values does it contain exactly, how is it generated in the training process, and how is it used in recognition? What is PFFMTABLE good for, what is NORMPROTO good for, and so on.

And then the DAWG files. I still have not understood in which step of the recognition the Number DAWG is used. (Did you see the weird things it contains?) And what is the PUNC DAWG good for, and how exactly is it used? How should I generate the values in it?

What is the difference between a flat shape table and a clustered shape table?

There are millions of questions!

