Hello Nick

It is great that you are motivated to write documentation and that you 
answer questions in the forum.

Nevertheless, I read a post from Ray in which he says that he receives 
millions of emails and that the last thing he likes to do is write long 
texts (email responses or documentation). I think this is a fatal 
situation: if he is the only one who really knows the code, he is the 
obvious person to write that documentation. But I understood that he is not 
motivated to do that. He would rather test new classifiers than take care 
of what is already done.

If he doesn't like writing documentation, I think he should explain what he 
knows verbally to someone else, who then writes it down. But I doubt that 
this will ever happen. And if he retires one day, it will be too late.

_________________________________

I studied the code of the set_unicharset_properties tool.
But it is a very basic tool: it only sets the basic properties.
The min/max values are not touched, and I'm sure that an unpublished tool 
must exist that computes them, because han.unicharset has 23514 characters 
defined, all with min/max values set. Or do you think that someone edited 
23514 characters manually?

OK, so we are stuck at the same point.
Ray knows, but Ray is unavailable.
It is really a sad situation.
It is not the way open source projects should work.
_________________________________

> Are there particular things you'd like 
> documented, that I could start on? 

I would like to generate unicharset files automatically, but I don't know 
how to calculate the min/max values.

So we have one person who is motivated (Nick) but does not know, and 
another person who knows (Ray) but is not motivated to write 
documentation.
________________________

Indeed, the documentation is totally incomplete.
If you look, for example, at the documentation of the MySQL server (which 
is excellent), you immediately have to admit that Tesseract is at the other 
extreme, light years away from that.

If you want an idea of where to start: I think a good starting point would 
be to explain what all these training files are good for and what exactly 
they do.
What is INTTEMP, what values does it contain exactly, how is it generated 
in the training process, and how is it used in recognition?
What is PFFMTABLE good for, what about NORMPROTO, etc.?

And then the DAWG files.
I still do not understand in which step of recognition the Number DAWG is 
used. (Did you see the weird things it contains?)
And what is the PUNC DAWG good for, and how exactly is it used? How should 
I generate the values in it?
What is the difference between a flat shape table and a clustered shape 
table?

There are millions of questions!

