Hi,

I know there are two Tamil trained data files, corresponding to the Latha and Lohit fonts. Going through the box and tif files, I understand that the boxes for combined consonants (உயிர்மெய்) are selected as individual glyphs (e.g. கே is boxed as a separate ே and க instead of a merged கே). Since the vowel sign ே is drawn before the base consonant க, elaborate post-processing is required. Such post-processing could be written by someone who knows both Tamil and C++, but at present it is missing altogether.
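The kind of post-processing meant here could look roughly like the following. This is only a minimal Python sketch of the idea, not code from Tesseract: the function names are hypothetical, and it assumes the OCR output is a plain string in visual (left-to-right glyph) order. It moves the left-side vowel signs ெ, ே and ை behind the consonant that follows them, then relies on Unicode NFC normalization to join கெ + ா into கொ and கே + ா into கோ.

```python
import unicodedata

# Tamil vowel signs that are drawn to the LEFT of the consonant but must
# follow it in Unicode logical order: U+0BC6, U+0BC7, U+0BC8.
PREFIX_VOWEL_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}


def is_tamil_consonant(ch):
    """True for code points in the Tamil consonant range U+0B95..U+0BB9."""
    return "\u0b95" <= ch <= "\u0bb9"


def visual_to_logical(text):
    """Hypothetical helper: reorder visually ordered OCR output.

    Each left-side vowel sign that is immediately followed by a consonant
    is swapped with that consonant. The input must be in visual order;
    text that is already in logical order would be damaged by this swap.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in PREFIX_VOWEL_SIGNS and nxt and is_tamil_consonant(nxt):
            out.append(nxt)  # consonant first
            out.append(ch)   # then its vowel sign
            i += 2
        else:
            out.append(ch)
            i += 1
    # NFC composes U+0BC6 + U+0BBE into U+0BCA and U+0BC7 + U+0BBE into U+0BCB.
    return unicodedata.normalize("NFC", "".join(out))
```

The sketch does not try to handle ஔ / கௌ, where the trailing glyph is recognised as the consonant ள; telling that apart from a real ள needs context that a simple swap cannot provide.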
To elaborate further: குகூகெகே is read correctly but written out as குகூெகேக. This is because the sequence is read as கு, கூ, ெ, க, ே, க; when the characters are written out, க followed by ே is treated as the single character கே, so the net result is குகூெகேக. This becomes worse when the single characters "கொ", "கோ" and "கௌ" are each read as three characters in three boxes. Another major issue is the missing vowel ஔ, which is read as ஒ followed by ள.

To avoid these issues, I am retraining the Tamil alphabet in its proper form. Though I have succeeded in doing this for one font (Latha, size 12), while combining the language files I am getting:

Combining tessdata files
TessdataManager combined tess
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is -1
Offset for type 4 is 17420
Offset for type 5 is -1
Offset for type 6 is -1
Offset for type 7 is 21008
Offset for type 8 is -1
Offset for type 9 is 31506
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
C:\indicocr\tesseract301>

Presumably the -1 above indicates something wrong? Nowhere on the tesseract-ocr project page is it possible to get samples of:

• tessdata/eng.config
• tessdata/eng.unicharset
• tessdata/eng.unicharambigs
• tessdata/eng.inttemp
• tessdata/eng.pffmtable
• tessdata/eng.normproto
• tessdata/eng.punc-dawg
• tessdata/eng.word-dawg
• tessdata/eng.number-dawg
• tessdata/eng.freq-dawg

There are 13 items listed in the combining output, while only 10 files are listed above. Though it is mentioned that unicharset, inttemp, pffmtable and normproto are the four files required apart from word-dawg and freq-dawg, there is no mention of whether the other files, such as tam.config, tam.unicharambigs etc., can be left absent or whether empty files are required.
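To make the offset listing easier to read, here is a small Python sketch that picks out which types were reported with offset -1. It is only an illustration: the helper names are hypothetical, and the mapping from type index to file name is an assumption (that types 0 to 9 follow the order of the ten files listed above); nothing in the output itself confirms that mapping, and types 10 to 12 are left unnamed.

```python
import re

# ASSUMPTION: type indices 0..9 follow the order of the ten files listed
# in this post. This is a guess, not something the tool output states.
ASSUMED_TYPE_NAMES = [
    "config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
    "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
]

OFFSET_RE = re.compile(r"Offset for type\s+(\d+)\s+is\s+(-?\d+)")


def parse_offsets(log_text):
    """Return {type_index: offset} for every 'Offset for type N is X' entry."""
    return {int(t): int(off) for t, off in OFFSET_RE.findall(log_text)}


def missing_types(offsets):
    """Type indices whose reported offset is negative."""
    return sorted(t for t, off in offsets.items() if off < 0)


def describe(type_index):
    """Name a type under the assumed mapping, or mark it as unknown."""
    if type_index < len(ASSUMED_TYPE_NAMES):
        return ASSUMED_TYPE_NAMES[type_index]
    return "unknown (type %d)" % type_index


# The output quoted in this post, pasted as one string.
LOG = (
    "Offset for type 0 is -1 Offset for type 1 is 108 Offset for type 2 is -1 "
    "Offset for type 3 is -1 Offset for type 4 is 17420 Offset for type 5 is -1 "
    "Offset for type 6 is -1 Offset for type 7 is 21008 Offset for type 8 is -1 "
    "Offset for type 9 is 31506 Offset for type 10 is -1 Offset for type 11 is -1 "
    "Offset for type 12 is -1"
)

if __name__ == "__main__":
    for t in missing_types(parse_offsets(LOG)):
        print(t, describe(t))
```

If the assumed mapping holds, types 3 and 5 would be inttemp and normproto, which might be worth checking first since both are among the four required files.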
Now, while trying to run Tesseract using the tam.traineddata made above, I am getting the error below:

===================================
C:\indicocr\tesseract301>tesseract image.tif testtxt -l tam
tessdata_manager.SeekToStart(TESSDATA_INTTEMP):Error:Assert failed:in file ..\classify\adaptmatch.cpp, line 512
C:\indicocr\tesseract301>
=======================================

Kindly advise what went wrong and what needs to be done to get a proper traineddata file. I am also really hopeful that the files used before combining will be made available, so that one can see the samples.

Regards,
rnkantan

