Re: Generating / Training box files for Kannada.

MARTIN Pierre Wed, 14 Apr 2010 04:28:19 -0700

> Regarding overlapping of boxes: kindly view the attached screenshot of the 
> edited boxes, which is self-explanatory for your research purposes.
Look at your screenshot: I have drawn RED vertical lines (only for the first 3 
rows, I don't have time to do them all). In those places, your boxes include 
MULTIPLE characters, which at recognition time will be seen as if it's 2 
characters! This can't work; you have to make one box for each distinct 
character. 
As a possible workaround, I suggest the following. Let's take your 2nd box 
from the top-left corner. As you can see, it contains 2 distinct characters, 
separated by a blank space, but your box encapsulates both. If you need tess 
to recognize the two separate characters as one, do this (I'll say the first 
symbol is an A and the 2nd a B, because I don't have those on my system to 
write this email, and imagine that the whole "AB" thing means "C" for you):
- Make two boxes, one for the "A" and one for the "B", instead of one for the 
whole "AB" ("C" for a human) thing.
- Tell tesseract the "A" is an "A".
- Tell tesseract the "B" is a "B".
- This way, tesseract will write your output as "A" "B" instead of two unknown 
characters.
- Then, you have to make a small program which treats the exceptions. You tell 
this program that if it finds "A" and "B" without a space in between, it 
replaces them with "C". Basically, this program does a post-processing 
disambiguation.
- Do this for all "C", and for any other combination in your language where 
two distinct letters together mean another single one.
I know it's cheesy, but that's all I can tell you, given that tesseract is not 
easily modifiable to do this for now...
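
The post-processing step described above could look roughly like the sketch 
below. It is only an illustration, not part of tesseract: the function name 
merge_combinations and the COMBINATIONS table are made up for this email, and 
the "AB" -> "C" entry is the placeholder from the example, to be replaced by 
the real Kannada sequences.

```python
import re

# Hypothetical table: each key is a sequence of characters as tesseract would
# output them, each value is the single character it should become.
COMBINATIONS = {
    "AB": "C",
}


def merge_combinations(text, combinations=COMBINATIONS):
    """Replace every adjacent sequence listed in `combinations` in one pass."""
    if not combinations:
        return text
    # Try longer sequences first so a short one cannot break up a longer one.
    keys = sorted(combinations, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: combinations[match.group(0)], text)


if __name__ == "__main__":
    # "AB" is merged, "A B" is left alone because of the space in between.
    print(merge_combinations("AB A B"))
```

Since the replacement is done on plain strings, the same approach should work 
on the recognized text once it has been read as UTF-8.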



> Your bat files do not work.
Please send me the error messages. The bat files will only work if the folder 
containing all the tesseract binaries (including combine) is in your PATH 
variable.
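
A quick way to check the PATH condition is a small script like the one below. 
It is only a sketch: the helper name missing_tools is made up, and the binary 
names in the example list are assumptions to be adjusted to whatever the bat 
files actually call.

```python
import shutil


def missing_tools(names, path=None):
    """Return the names that cannot be found on PATH (or on `path` if given)."""
    return [name for name in names if shutil.which(name, path=path) is None]


if __name__ == "__main__":
    # Assumed binary names; replace them with the ones used in the bat files.
    tools = ["tesseract", "combine"]
    missing = missing_tools(tools)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All tools found on PATH.")
```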

> Then I copied all generated  exe files of tesseract r319svn into the extract 
> of  kan.training set.rar.
Yes, this works too.

> Then I ran all the exe files in cmd.exe and the required files were 
> generated. I observed that my output and yours show no difference. 
> Accuracy still needs to be improved.
I have no idea how to improve it yet...

> 2).DangAmbigs  this will work for English which is ANSI but does not work for 
> Kannada because it has to be UTF-8 or Unicode compatible for which relevant 
> source code has to be modified suitably -vide replace.txt
Understood.

> Instead of wasting time now, I feel it would be better to perform 
> beta-testing after the release of your proposed modified source code for 
> English as well as Kannada, and send feedback to you. That way we can 
> slowly improve tesseract step by step. If you succeed with the Kannada 
> project, it will then be easy for other world languages.
Yes, but I really don't have an estimated time for such a release. I'll keep 
the list informed.

Best,
Pierre.
