Re: [tesseract-ocr] Re: Missing detailed documentation about Unicharset files

Nick White Wed, 06 Aug 2014 07:54:35 -0700

Hi Albrecht,

Sorry for not replying sooner, I've been away.


> Nevertheless I read a post from Ray where he says that he receives millions of
> emails and the last thing he likes to do is writing long texts (email responses
> or documentations). I think this is a fatal situation, because if he is the
> only one who really knows the code, he is predestined to write that
> documentation. But I understood that he is not motivated to do that. He is
> testing new classifiers rather than caring about what is already done.

Ah, but others can work to figure out how the code and tools work, 
and slowly but surely piece together documentation. Also, Ray is 
good at explaining when he has the time. I agree it isn't an ideal 
situation, but I think we can fix it.


> I studied the code of the set_unicharset_properties tool.
> But this is a very basic tool. It only sets the basic properties.
> The min/max values don't get touched

This is wrong, actually. The unicharset.SetPropertiesFromOther()
function called in set_unicharset_properties copies all properties 
from any copy of the character found in the script_dir. As I 
mentioned in my previous message to this thread, set the script_dir 
to the training/langdata directory and the data from all the 
.unicharset files there will be pulled in as appropriate.
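
To illustrate the behaviour described above, here is a minimal Python
sketch of the idea. It is not Tesseract's actual implementation: the
function name, the dictionary layout and the numbers are made up for
illustration, and only a couple of metric fields are shown.

```python
def set_properties_from_other(target, other):
    """Overwrite the properties of every character in `target` that
    also appears in `other`; characters missing from `other` are left
    untouched."""
    for char, props in target.items():
        if char in other:
            props.update(other[char])
    return target


# Hypothetical unicharset with only default (uninformative) metrics.
target = {
    "a": {"min_top": 0, "max_top": 255},
    "Z": {"min_top": 0, "max_top": 255},
}

# Hypothetical per-script data, standing in for a file such as
# Latin.unicharset in the script_dir. The values are invented.
latin = {
    "a": {"min_top": 58, "max_top": 69},
}

merged = set_properties_from_other(target, latin)
print(merged["a"])  # metrics copied from the script data
print(merged["Z"])  # unchanged, because "Z" is not in the script data
```

In the real tool the same thing happens once per script found in the
unicharset, which is why pointing --script_dir at training/langdata
matters.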

> I'm sure that there must exist a tool
> (that is not published) that obtains them, because the han.unicharset has 23514
> characters defined - all with min / max values set. Or do you think that
> someone has edited 23514 characters manually?

Ultimately, yes, there must be an unpublished tool that obtains the 
metrics that exist in the training/langdata directory. I suspect it 
looks quite like the Pango-based proof of concept I attached to a 
previous mail on this thread (charmetrics.c).

> It is not the way open source projects should work.

So, you pick yourself up and jump in! That's how open source 
projects should work. Patches are welcome :)

> > Are there particular things you'd like
> > documented, that I could start on?
> 
> I would like to generate unicharset files automatically, but I don't know how
> to calculate the min/max values.

As I say, you can get good general figures by using the --script_dir 
option with set_unicharset_properties. I think we're clear now on 
the general definitions of all the fields.

To calculate the min/max values for specific fonts where they may be 
very different, I recommend you try the charmetrics.c tool I posted, 
and compare the output to what you get without it.

> If you want an idea where to start with: I think a good starting point would be
> to explain what all these training files are good for and what they do exactly.
> What is INTTEMP, what values does it contain exactly, how is it generated in
> the training process and how is it used in recognition ?
> What is PFFMTABLE good for, NORMPROTO etc.
> 
> And then the DAWG files.
> I still did not understand in which step of the recognition the Number DAWG is
> used. (Did you see the weird things it contains?)
> And what is the PUNC DAWG good for, how is it used exactly ? How should I
> generate the values in it ?
> What is the difference between a flat shape table and a clustered shapetable ?

These are all good points, and good places to start, thank you.

My current plan for documentation is as follows:

- Rewrite and simplify TrainingTesseract3 on the wiki
- Write manpages for each tool in training/
- Document how each training file is used, and what it contains

Does that sound good to people? I'll take silence from the list to 
mean "that sounds perfect in every way, you wonderful man." ;)

Nick

