Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

Albrecht Hilker Thu, 17 Jul 2014 21:39:08 -0700

Hello Nick

It is great that you are motivated to write documentation and that you
answer questions in the forum.

Nevertheless, I read a post from Ray in which he says that he receives millions
of emails and that the last thing he likes to do is write long texts (email
responses or documentation). I think this is a dire situation: if
he is the only one who really knows the code, he is the obvious person to write
that documentation. But I understood that he is not motivated to do that. He
would rather test new classifiers than look after what is already done.

If he doesn't like writing documentation, I think he should explain what he
knows verbally to someone else, who then writes it down. But I
doubt that this will ever happen. And if he retires one day it will be too
late.

_________________________________

I studied the code of the set_unicharset_properties tool,
but it is a very basic tool. It only sets the basic properties.
The min/max values are not touched, and I am sure that there must be an
unpublished tool that obtains them, because han.unicharset
has 23514 characters defined, all with min/max values set. Or do you
think that someone edited 23514 characters manually?

OK, so we are stuck at the same point.
Ray knows, but Ray is unavailable.
It is really a sad situation.
It is not the way open source projects should work.
_________________________________

> Are there particular things you'd like
> documented, that I could start on?

I would like to generate unicharset files automatically, but I don't know
how to calculate the min/max values.
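
To make concrete which fields I mean, here is a minimal Python sketch that pulls the min/max values out of one unicharset line. It assumes the line layout described in the unicharset(5) man page (character, properties, then ten comma-separated integer glyph metrics, then script and so on). The function name, the constant name and the sample line are my own illustrations, not part of Tesseract, and this only reads the values. It does not answer how they are calculated.

```python
from typing import Dict, Optional

# Field order as I understand it from the unicharset(5) man page (assumption).
METRIC_NAMES = (
    "min_bottom", "max_bottom", "min_top", "max_top",
    "min_width", "max_width", "min_bearing", "max_bearing",
    "min_advance", "max_advance",
)


def parse_glyph_metrics(line: str) -> Optional[Dict[str, int]]:
    """Return the ten min/max metrics of one unicharset line, or None.

    None is returned when the third field is not a list of exactly ten
    integers (for example the header line, or an entry without metrics).
    """
    fields = line.split()
    if len(fields) < 3:
        return None
    parts = fields[2].split(",")
    if len(parts) != len(METRIC_NAMES):
        return None
    try:
        values = [int(p) for p in parts]
    except ValueError:
        return None
    return dict(zip(METRIC_NAMES, values))


# Made-up sample line, for illustration only (not taken from a real file).
sample = "a 3 58,65,186,198,85,164,0,0,97,185 Latin 25 0 25 a"
metrics = parse_glyph_metrics(sample)
if metrics is not None:
    for name, value in metrics.items():
        print(name, value)
```

Running something like this over every line of a unicharset file would at least show which characters carry metrics at all, before worrying about how to generate them.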

So we have one person who is motivated (Nick) but does not know,
and another person who knows (Ray) but is not motivated to write
documentation.
________________________

Indeed, the documentation is totally incomplete.
If you look, for example, at the documentation of the MySQL server (which is
excellent), you will immediately admit that Tesseract is at the other extreme
- light years away from that.

If you want an idea of where to start: I think a good starting point
would be to explain what all these training files are good for and what
exactly they do.
What is INTTEMP, what values exactly does it contain, how is it generated
in the training process, and how is it used in recognition?
What are PFFMTABLE, NORMPROTO, etc. good for?

And then the DAWG files.
I still do not understand in which step of the recognition the Number DAWG
is used. (Did you see the weird things it contains?)
And what is the PUNC DAWG good for, and how exactly is it used? How should I
generate the values in it?
What is the difference between a flat shape table and a clustered
shape table?

There are millions of questions!
