Hi,

Can you please send a copy of all your source data files (e.g. font_properties, unicharset, etc.)? Which version of Tesseract are you using, 3.01 or 3.02? I would like to try to compile a traineddata file myself. I am currently unable to do so using jTesseract or the train.ps1 PowerShell script.
So I am looking for people with complete source data files to compare against. Any help is always appreciated.

Richard

On Thursday, March 29, 2012 6:31:53 AM UTC+11, nkantan r wrote:
>
> Hi,
>
> I know there are two Tamil trained data files, corresponding to the Latha
> and Lohit fonts. Going through the box and tif files, I understand that
> the boxes for combined consonants (உயிர்மெய்) are selected as
> individual glyphs (for example, கே is boxed as separate ே and க instead of a
> merged கே). Since the vowel sign ே comes before the base consonant
> க, elaborate post-processing is required. Such post-processing
> could be written by someone who knows both Tamil and C++, but it
> is currently missing altogether.
>
> To elaborate further: குகூகெகே is read correctly but output as
> குகூெகேக. This is because the sequence is read as கு கூ ெ, க ே க; reading
> by unichar, க followed by ே is treated as the single unichar
> கே, so the net result is குகூெகேக.
> This becomes worse when single characters such as "கொ", "கோ" and "கௌ" are read
> as three characters in three boxes!
>
> Another major issue is the missing vowel ஔ which is read as while
> reading ஒ and ள.
>
> To avoid these issues, I am retraining the Tamil alphabet in its
> proper form. I have succeeded in doing this for one font (Latha,
> size 12), but while combining the language files I am getting:
>
> Combining tessdata files
> TessdataManager combined tess
> Offset for type 0 is -1
> Offset for type 1 is 108
> Offset for type 2 is -1
> Offset for type 3 is -1
> Offset for type 4 is 17420
> Offset for type 5 is -1
> Offset for type 6 is -1
> Offset for type 7 is 21008
> Offset for type 8 is -1
> Offset for type 9 is 31506
> Offset for type 10 is -1
> Offset for type 11 is -1
> Offset for type 12 is -1
>
> C:\indicocr\tesseract301>
>
> Obviously the -1 above indicates something wrong?
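A rough way to read the offset listing: each -1 appears to mark a component that was not found when the files were combined. The Python sketch below names the absent components. It assumes the type numbers follow the order of the TessdataType enum in the 3.01 sources, which is worth checking against ccutil/tessdatamanager.h in your own tree. The component list and the helper name `missing_components` are illustrative and not part of Tesseract.

```python
import re

# ASSUMPTION: component order for a 3.01 traineddata file (types 0-12),
# taken from my reading of the TessdataType enum. Verify before relying on it.
COMPONENTS_301 = [
    "config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
    "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
    "fixed-length-dawgs", "cube.unicharset", "cube.word-dawg",
]

# The four files the quoted message describes as required.
REQUIRED = {"unicharset", "inttemp", "pffmtable", "normproto"}

OFFSET_LINE = re.compile(r"Offset for type (\d+) is (-?\d+)")


def missing_components(log_text):
    """Return the names of components reported with offset -1."""
    missing = []
    for index, offset in OFFSET_LINE.findall(log_text):
        if int(offset) == -1:
            missing.append(COMPONENTS_301[int(index)])
    return missing


# The listing from the quoted message.
log = """
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is -1
Offset for type 4 is 17420
Offset for type 5 is -1
Offset for type 6 is -1
Offset for type 7 is 21008
Offset for type 8 is -1
Offset for type 9 is 31506
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
"""

absent = missing_components(log)
print("absent:", absent)
print("required but absent:", sorted(REQUIRED & set(absent)))
```

If the assumed order is right, types 3 and 5 (inttemp and normproto) are among the absent components, which would be consistent with the SeekToStart(TESSDATA_INTTEMP) assert reported further down. It may be worth checking that those two files exist in the working directory with the tam. prefix before combining.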
> On the whole
> tesseract-ocr project page, it is not possible to find samples of:
>
> • tessdata/eng.config
> • tessdata/eng.unicharset
> • tessdata/eng.unicharambigs
> • tessdata/eng.inttemp
> • tessdata/eng.pffmtable
> • tessdata/eng.normproto
> • tessdata/eng.punc-dawg
> • tessdata/eng.word-dawg
> • tessdata/eng.number-dawg
> • tessdata/eng.freq-dawg
>
> There are 13 items listed in the combine output, while only 10 files are
> listed above.
>
> Though it is mentioned that unicharset, inttemp, pffmtable and normproto
> are the four files required, apart from word-dawg and freq-dawg, there
> is no mention of whether the other files, such as tam.config, tam.unicharambigs,
> etc., can be left absent or whether empty files are required.
>
> Now, while trying to run Tesseract using the tam.traineddata made above,
> I am getting the error below:
> ===================================
> C:\indicocr\tesseract301>tesseract image.tif testtxt -l tam
> tessdata_manager.SeekToStart(TESSDATA_INTTEMP):Error:Assert failed:in
> file ..\classify\adaptmatch.cpp, line 512
>
> C:\indicocr\tesseract301>
> =======================================
>
> Kindly advise what went wrong and what needs to be done to get a proper
> traineddata file. I am really hopeful that the files used before
> combining will also be made available, so that one can see the samples.
>
> Regards,
> rnkantan
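The post-processing described in the quoted message (moving a vowel sign that is drawn before its consonant back behind it) could look roughly like the Python sketch below. It assumes the whole OCR output is in visual order, so that every ெ, ே or ை directly precedes the consonant it belongs to. The function names are hypothetical and nothing here is part of Tesseract. It also merges the two-part signs ொ, ோ and ௌ when the right-hand part follows the consonant, but it does not try to resolve the ambiguity between ள and the right-hand part of ௌ.

```python
# Vowel signs that are drawn to the left of their consonant.
PREFIX_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}  # ெ ே ை

# Two-part vowel signs: (left part, right part) -> single code point.
TWO_PART = {
    ("\u0bc6", "\u0bbe"): "\u0bca",  # ெ + ா -> ொ
    ("\u0bc7", "\u0bbe"): "\u0bcb",  # ே + ா -> ோ
    ("\u0bc6", "\u0bd7"): "\u0bcc",  # ெ + ௗ -> ௌ
}


def is_tamil_consonant(ch):
    """True for code points in the Tamil consonant range (க to ஹ)."""
    return "\u0b95" <= ch <= "\u0bb9"


def visual_to_logical(text):
    """Reorder visual-order Tamil OCR output into Unicode logical order.

    ASSUMPTION: every prefix vowel sign in `text` sits immediately before
    the consonant it modifies, as in the box order described above.
    """
    out = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch in PREFIX_SIGNS and i + 1 < n and is_tamil_consonant(text[i + 1]):
            consonant = text[i + 1]
            sign = ch
            i += 2
            # Merge a trailing right-hand part into a two-part vowel sign.
            if i < n and (sign, text[i]) in TWO_PART:
                sign = TWO_PART[(sign, text[i])]
                i += 1
            out.append(consonant + sign)
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# The example from the message: boxes read as கு கூ ெ க ே க.
visual = "\u0b95\u0bc1\u0b95\u0bc2\u0bc6\u0b95\u0bc7\u0b95"
print(visual_to_logical(visual))  # குகூகெகே
```

This only works on output that is consistently in visual order. If the trained data already emits some syllables in logical order, a sign followed by a consonant becomes ambiguous and the swap would corrupt correct text, which is one argument for boxing the merged glyphs during training instead.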

