Tamil Trained data

nkantan r Wed, 28 Mar 2012 21:25:17 -0700

Hi,
I know there are two Tamil trained data files, corresponding to the
Latha and Lohit fonts. Going through the box and tif files, I understand
that the boxes for combined consonants (உயிர்மெய்) are selected as
individual parts; for example, கே is boxed as a separate ே and க instead
of a merged கே. Since the vowel sign ே comes before the base consonant
க, elaborate post-processing is required. Such post-processing could be
written by someone who knows both Tamil and C++, but at present it is
missing altogether.


To elaborate further: குகூகெகே is read correctly but output as
குகூெகேக. This is because the sequence is read as கு கூ ெ, க ே க; at
the unichar level, க followed by ே is then read as the single unichar
கே, so the net result is குகூெகேக.
This becomes worse when the single characters "கொ", "கோ" and "கௌ" are
each read as three characters in three boxes!
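
A minimal sketch of the kind of post-processing described above, written in Python rather than C++ for brevity. It is not part of Tesseract; the function name visual_to_logical is made up for illustration. It assumes the recognizer emits every prefix vowel sign (ெ, ே, ை) before its consonant, that is, in visual order, and that the right-hand part of கொ and கோ comes out as ா. It does not handle the case where the right-hand part of கௌ is recognized as the consonant ள rather than the length mark ௗ.

```python
# Hypothetical post-processing sketch; not part of Tesseract.
PREFIX_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}  # vowel signs E, EE, AI

# (prefix sign, trailing sign) -> single two-part vowel sign.
# These pairs follow the Unicode canonical decompositions.
TWO_PART = {
    ("\u0bc6", "\u0bbe"): "\u0bca",  # E  + AA          -> O
    ("\u0bc7", "\u0bbe"): "\u0bcb",  # EE + AA          -> OO
    ("\u0bc6", "\u0bd7"): "\u0bcc",  # E  + AU length   -> AU
}


def is_consonant(ch):
    """True for code points in the Tamil consonant range (KA..HA)."""
    return "\u0b95" <= ch <= "\u0bb9"


def visual_to_logical(text):
    """Move prefix vowel signs after their consonant (Unicode order).

    Assumes the whole input is in visual order; text that is already
    in logical order would be damaged by this function.
    """
    out = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch in PREFIX_SIGNS and i + 1 < n and is_consonant(text[i + 1]):
            consonant = text[i + 1]
            trailing = text[i + 2] if i + 2 < n else ""
            merged = TWO_PART.get((ch, trailing))
            if merged:
                # Three boxes (sign, consonant, sign) become one syllable.
                out.append(consonant + merged)
                i += 3
            else:
                out.append(consonant + ch)
                i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Applied to the sequence க ு க ூ ெ க ே க from the example above, this moves each prefix sign after the consonant that follows it.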

Another major issue is the missing vowel ஔ, which is read as ஒ and ள.

To avoid these issues, I am retraining the Tamil alphabet in its
proper form. Though I have succeeded in doing this for one font (Latha,
size 12), while combining the language files I am getting:

Combining tessdata files
TessdataManager combined tess
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is -1
Offset for type 4 is 17420
Offset for type 5 is -1
Offset for type 6 is -1
Offset for type 7 is 21008
Offset for type 8 is -1
Offset for type 9 is 31506
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1

C:\indicocr\tesseract301>

Obviously the -1 above indicates that something is wrong? Nowhere on
the tesseract-ocr project page is it possible to get samples of:

•tessdata/eng.config
•tessdata/eng.unicharset
•tessdata/eng.unicharambigs
•tessdata/eng.inttemp
•tessdata/eng.pffmtable
•tessdata/eng.normproto
•tessdata/eng.punc-dawg
•tessdata/eng.word-dawg
•tessdata/eng.number-dawg
•tessdata/eng.freq-dawg

The combining output lists 13 types, while only 10 files are listed
above.

Though it is mentioned that unicharset, inttemp, pffmtable and normproto
are the four files required apart from word-dawg and freq-dawg, there
is no mention of whether the other files, such as tam.config,
tam.unicharambigs etc., can be left absent or whether empty files are
required.
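
To see which components the -1 offsets refer to, the combining output can be mapped to file suffixes. The sketch below is only an illustration: it assumes the type numbers follow the order of the file suffix list in Tesseract 3.01's tessdatamanager.h (worth verifying against the source tree in use), and the function names are hypothetical.

```python
import re

# ASSUMPTION: order of component types as in the file suffix list of
# Tesseract 3.01's tessdatamanager.h; verify against your source tree.
SUFFIXES = [
    "config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
    "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
    "fixed-length-dawgs", "cube-unicharset", "cube-word-dawg",
]

# The four files described as required in the training instructions.
REQUIRED = {"unicharset", "inttemp", "pffmtable", "normproto"}


def missing_components(log_text):
    """Return the suffixes whose offset is reported as -1."""
    missing = []
    for match in re.finditer(r"Offset for type (\d+) is (-?\d+)", log_text):
        type_id, offset = int(match.group(1)), int(match.group(2))
        if offset == -1 and type_id < len(SUFFIXES):
            missing.append(SUFFIXES[type_id])
    return missing


def missing_required(log_text):
    """Return only the missing components that are listed as required."""
    return [name for name in missing_components(log_text) if name in REQUIRED]
```

Under that assumed ordering, types 3 and 5 in the output above would correspond to inttemp and normproto, so it may be worth checking that tam.inttemp and tam.normproto exist under exactly those names in the directory where the files are combined.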

Now, while trying to run Tesseract using the tam.traineddata made
above, I am getting the error below:
===================================
C:\indicocr\tesseract301>tesseract image.tif testtxt -l tam
tessdata_manager.SeekToStart(TESSDATA_INTTEMP):Error:Assert failed:in
file ..\classify\adaptmatch.cpp, line 512

C:\indicocr\tesseract301>
=======================================

Kindly advise what went wrong and what needs to be done to get a
proper traineddata file. I am also really hopeful that the files used
before combining will be made available, so that one can see the samples.

regards
rnkantan
