> Regarding:overlapping of boxes= Kindly view attached screenshot of edited
> boxes which is self explanatory for your research purpose.
Look at your screenshot: i have drawn RED vertical lines (only for the first 3
rows, i don't have time to do them all). In those places, your boxes include
MULTIPLE characters, which at recognition time will be seen as 2 characters!
This can't work, you have to make one box for each distinct character.
As a workaround, i suggest the following. Let's take your 2nd box from the
top-left corner. As you can see, it contains 2 distinct characters, separated
by a blank space, but your box encapsulates both. If you need tess to
recognize the two separate characters as one, do this (i'll call the first
symbol an "A" and the 2nd a "B", because i don't have those on my system to
write this email, and imagine that the whole "AB" thing means "C" for you):
- Make two boxes, one for the "A", one for the "B", instead of one for the
whole "AB" ("C" for a human) thing.
- Tell tesseract the "A" is an "A".
- Tell tesseract the "B" is a "B".
- This way, tesseract will write your output as "A" "B" instead of two unknown
characters.
- Then, you have to make a small program which treats the exceptions. You tell
this program that if it finds "A" and "B" with no space in between, it replaces
them with "C". Basically, this program does post-processing disambiguation.
- Do this for every "C", and for any other combination in your language where
two distinct letters together mean another single one.
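
To give an idea of what such a post-processing program could look like, here
is a minimal sketch in Python. It is only an illustration: the MERGES table and
the merge_pairs function are hypothetical names, and "A", "B" and "C" are the
placeholders from this email, not real Kannada characters. You would fill the
table with the actual combinations of your language.

```python
import re

# Hypothetical table: recognized sequence -> the single character it stands for.
# "A", "B" and "C" are placeholders, to be replaced with real combinations.
MERGES = {"AB": "C"}


def merge_pairs(text, merges=MERGES):
    """Replace each adjacent sequence listed in `merges` with its single character."""
    if not merges:
        return text
    # Longest sequences first, so a longer rule wins over a shorter one
    # that starts with the same characters.
    keys = sorted(merges, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Single pass over the text: already replaced output is not rescanned.
    return pattern.sub(lambda m: merges[m.group(0)], text)


if __name__ == "__main__":
    print(merge_pairs("AB"))   # "A" directly followed by "B" is merged
    print(merge_pairs("A B"))  # a space in between: left untouched
```

You would run this on the text file that tesseract writes, reading and writing
it as UTF-8.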
i know it's cheesy, but that's all i can tell you given the fact that tesseract
is not easily modifiable to do this for now...
> Your bat files does not work.
Send me the error messages. The bat files will only work if the folder where all
the tesseract binaries (including combine) reside is in your PATH variable.
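
If you want to check the PATH quickly, something like this small Python sketch
could help. The tool names in TOOLS are an assumption taken from this email
("tesseract" and "combine"); adjust them to the exe names you really have in
your build. missing_tools is a hypothetical helper, not part of tesseract.

```python
import shutil

# Assumed tool names; change them to match the binaries from your build.
TOOLS = ["tesseract", "combine"]


def missing_tools(names):
    """Return the names that cannot be found through the PATH variable."""
    # shutil.which searches PATH (and PATHEXT on Windows, so ".exe" is implied).
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(TOOLS)
    if missing:
        print("Not found in PATH:", ", ".join(missing))
    else:
        print("All tools found in PATH.")
```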
> Then I copied all generated exe files of tesseract r319svn into the extract
> of kan.training set.rar.
Yes, this works too.
> Then I ran all exe in the cmd.exe are generated required files. It is
> observed that output of mine as well as yours has no difference. Still
> improvement of accuracy is required.
i have no idea how to improve it yet...
> 2).DangAmbigs this will work for English which is ANSI but does not work for
> Kannada because it has to be UTF-8 or Unicode compatible for which relevant
> source code has to be modified suitably -vide replace.txt
Understood.
> Instead of wasting time now, it is felt that it would be better perform
> beta-testing after release of your proposed modified source codes for
> english as well as Kannada and feedback to you. Thereby slowly we can
> improve the tesseract step by step. In case, if you succeeded for Kannada
> project then it will be easy for other world langs.
Yes, but i really don't have an estimated time for such a release. i'll keep the
list informed.
Best,
Pierre.