Re: [tesseract-ocr] Missing detailed documentation about Unicharset files

Albrecht Hilker Mon, 14 Jul 2014 09:59:32 -0700

Hello Nick

After some days I came back here and was very surprised by your many
posts.
Thanks for answering and for taking the time.

I found another bug in the tool.
(As I received no answer here, I already posted it to the Issues:
http://code.google.com/p/tesseract-ocr/issues/detail?id=1251 )
_____________________________________________________

Apart from the 4 bugs I described in the forum there is another one:

While the downloaded traineddata files distinguish between punctuation and
non-punctuation unichars like:

Punctuation: !"#%&'()*,-./:;?@[\]_{}
Others : $+<=>|~º®«

the unicharset_extractor tool returns ALL non-alphanumeric characters as
punctuation unichars.
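
To make the difference concrete, here is a minimal Python sketch. It is not the
tool's actual code. `extractor_like` is a hypothetical stand-in for the
behaviour I observe, and `unicode_punct` is just one possible stricter rule
based on the Unicode general category, which I assume (without having checked
the Tesseract sources) is close to what the downloaded files use.

```python
import unicodedata

# The two groups as they appear in the downloaded traineddata files.
DOWNLOADED_PUNCT = "!\"#%&'()*,-./:;?@[\\]_{}"
DOWNLOADED_OTHER = "$+<=>|~º®«"

def extractor_like(ch):
    """Hypothetical rule matching what I observe: non-alphanumeric = punctuation."""
    return not ch.isalnum()

def unicode_punct(ch):
    """Stricter rule (assumption): only Unicode categories P* are punctuation."""
    return unicodedata.category(ch).startswith("P")

if __name__ == "__main__":
    # Print category and both verdicts for every character in the two groups.
    for ch in DOWNLOADED_PUNCT + DOWNLOADED_OTHER:
        print(repr(ch), unicodedata.category(ch),
              extractor_like(ch), unicode_punct(ch))
```

With the stricter rule, symbols such as $ + < = > | ~ ® are no longer flagged
as punctuation. Note that Unicode itself treats « as punctuation and º as a
letter, so this rule is only an approximation of the downloaded files.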

_____________________________________________________

I think all the problems that I described can easily be fixed except the
min/max values.

And I still don't understand the basic question:
How can we ever write ONE Unicharset file with font metrics for a whole
bunch of completely different and contradicting fonts ?
If there was one unicharset file per font, it would be easier.
But ONE Unicharset file with min/max values for 358 fonts seems completely
insane to me!
Did you know that the English and the Spanish traineddata for 3.02 were
trained with 358 fonts ?
https://groups.google.com/forum/?fromgroups#!topic/tesseract-ocr/boQ188SeFsY

There are fonts that put the "9" below the baseline and others that do not.
How do we ever write a Unicharset for such different fonts ?
It simply doesn't make sense to me.
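
The only reading I can come up with, and this is purely my assumption, is that
each min/max pair is meant as an envelope over all fonts. A small Python sketch
of that idea follows. The font names and all numbers are invented for
illustration and are not taken from any real traineddata.

```python
# Hypothetical (bottom, top) measurements for the glyph "9" in three fonts.
# The values are made up; they only illustrate the envelope idea.
per_font = {
    "FontA": (64, 190),  # "9" sits on the baseline
    "FontB": (30, 160),  # old-style "9" that descends below the baseline
    "FontC": (62, 195),
}

def envelope(measurements):
    """Merge per-font (bottom, top) pairs into one min/max range per metric."""
    pairs = list(measurements)
    bottoms = [bottom for bottom, _ in pairs]
    tops = [top for _, top in pairs]
    # Returns (min_bottom, max_bottom, min_top, max_top).
    return min(bottoms), max(bottoms), min(tops), max(tops)

if __name__ == "__main__":
    print(envelope(per_font.values()))
```

If that is really the idea, the range for "9" becomes so wide with many fonts
that it constrains almost nothing, which is exactly my problem with it.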
_______________________________________________

Why does Tesseract need these min/max values at all ?
Wouldn't it be much more intelligent to store this information directly in
the feature data ?
So each character brings the information about its baseline, height etc.
along with the training data ?
These values could be easier to auto-generate.
_______________________________________________

And the other thing that I absolutely don't understand:
You are investigating this topic now.
But where are the people who know ?
Is this only Ray ?

Google is one of the richest companies on earth.
Are they not able to pay one of the people who know to write
documentation (at least part time) ?
One of the people who work on the code would need, let's say, a month to
write good documentation about Tesseract, which is currently completely
abandoned.
