Re: Tamil Trained data

blavatsky3 Mon, 02 Jul 2012 06:39:22 -0700

Hi,

Can you please send a copy of all your source data files, e.g. 
font_properties, unicharset, etc.? 
Which version of Tesseract are you using, 3.01 or 3.02? 
I would like to try compiling a traineddata file myself. 
I am currently unable to do so using jTesseract or the train.ps1 PowerShell 
script.

So I am looking for people with complete source data files to 
compare against.

Any help is always appreciated.

Richard

On Thursday, March 29, 2012 6:31:53 AM UTC+11, nkantan r wrote:
>
> Hi, 
> I know there are two Tamil traineddata files, corresponding to the Latha 
> and Lohit fonts. Going through the box and tif files, I understand that 
> the boxes for combined consonants (உயிர்மெய்) are drawn around the 
> individual glyphs (e.g. கே is boxed as a separate ே and க instead of a 
> merged கே). Since the vowel sign ே comes before the base consonant 
> க, elaborate post-processing is required. Such post-processing 
> could be written by someone who knows both Tamil and C++, but at present 
> it is missing altogether. 
>
> To elaborate further: குகூகெகே is recognized correctly but output as 
> குகூெகேக. This is because the sequence is read as கு கூ ெ, க ே க; in 
> unichar terms, க followed by ே is read as the single unichar 
> கே, so the net result is குகூெகேக. 
> This becomes worse when the single characters "கொ", "கோ" and "கௌ" are 
> read as three characters in three boxes! 
>
> Another major issue is the missing vowel ஔ, which is read as  while 
> reading ஒ and ள. 
>
> To avoid these issues, I am retraining the Tamil alphabet in its 
> proper form. Although I have succeeded in doing so for one font (Latha, 
> size 12), while combining the language files I get: 
>
> Combining tessdata files 
> TessdataManager combined tess 
> Offset for type 0 is -1 
> Offset for type 1 is 108 
> Offset for type 2 is -1 
> Offset for type 3 is -1 
> Offset for type 4 is 17420 
> Offset for type 5 is -1 
> Offset for type 6 is -1 
> Offset for type 7 is 21008 
> Offset for type 8 is -1 
> Offset for type 9 is 31506 
> Offset for type 10 is -1 
> Offset for type 11 is -1 
> Offset for type 12 is -1 
>
> C:\indicocr\tesseract301> 
>
> Obviously the -1 above indicates that something is wrong? Nowhere on the 
> tesseract-ocr project page is it possible to get samples of 
>
> •tessdata/eng.config 
> •tessdata/eng.unicharset 
> •tessdata/eng.unicharambigs 
> •tessdata/eng.inttemp 
> •tessdata/eng.pffmtable 
> •tessdata/eng.normproto 
> •tessdata/eng.punc-dawg 
> •tessdata/eng.word-dawg 
> •tessdata/eng.number-dawg 
> •tessdata/eng.freq-dawg 
>
> There are 13 items listed in the combined tess output, while only 10 
> files are listed above. 
>
> Although it is mentioned that unicharset, inttemp, pffmtable and normproto 
> are the four required files, apart from word-dawg and freq-dawg, there 
> is no mention of whether the other files, such as tam.config, 
> tam.unicharambigs, etc., can be left absent or whether empty files are 
> required. 
>
> Now, when trying to run Tesseract using the tam.traineddata made above, 
> I get the error below: 
> =================================== 
> C:\indicocr\tesseract301>tesseract image.tif testtxt -l tam 
> tessdata_manager.SeekToStart(TESSDATA_INTTEMP):Error:Assert failed:in 
> file ..\classify\adaptmatch.cpp, line 512 
>
> C:\indicocr\tesseract301> 
> ======================================= 
>
> Kindly advise what went wrong and what needs to be done to get a proper 
> traineddata file. I am really hopeful that the files used before 
> combining will also be made available so that one can see the samples. 
>
> regards 
> rnkantan 
>
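
The reordering problem described in the quoted message (குகூெகேக coming out
instead of குகூகெகே) can be illustrated with a small post-processing
sketch. This is only a minimal example in Python, not part of Tesseract;
the function name visual_to_logical is hypothetical. It assumes the OCR
text is entirely in visual (left-to-right glyph) order, that the left-side
vowel signs are ெ ே ை, and that the right-hand part of கொ, கோ and கௌ is
recognized as ா (U+0BBE) or ௗ (U+0BD7). A right-hand part misread as the
consonant ள is not handled, because that sequence is ambiguous.

```python
# Tamil vowel signs that are drawn to the LEFT of the consonant they modify.
PREFIX_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}  # ெ ே ை

# Two-part vowel signs: (left part, right part) -> single code point.
TWO_PART = {
    ("\u0bc6", "\u0bbe"): "\u0bca",  # ெ + ா -> ொ
    ("\u0bc7", "\u0bbe"): "\u0bcb",  # ே + ா -> ோ
    ("\u0bc6", "\u0bd7"): "\u0bcc",  # ெ + ௗ -> ௌ
}


def is_tamil_consonant(ch):
    """True for code points in the Tamil consonant range க (U+0B95) to ஹ (U+0BB9)."""
    return "\u0b95" <= ch <= "\u0bb9"


def visual_to_logical(text):
    """Move left-side vowel signs after their consonant (Unicode logical order).

    Assumes the whole input is in visual order, as produced by per-glyph boxes.
    """
    out = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch in PREFIX_SIGNS and i + 1 < n and is_tamil_consonant(text[i + 1]):
            consonant = text[i + 1]
            right = text[i + 2] if i + 2 < n else ""
            combined = TWO_PART.get((ch, right))
            if combined:
                # Three glyphs (left part, consonant, right part) -> two code points.
                out.append(consonant + combined)
                i += 3
            else:
                # Simple swap: sign + consonant -> consonant + sign.
                out.append(consonant + ch)
                i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Calling visual_to_logical on the code point sequence க ு க ூ ெ க ே க
yields க ு க ூ க ெ க ே, which is the order Unicode expects for குகூகெகே.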

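To see which components the "Offset for type N" lines in the quoted
output refer to, a small helper can map each index to a file suffix and
list the required ones whose offset is -1. This is only a sketch: the
index-to-name order in COMPONENTS is an assumption based on my reading of
the TessdataType enum in Tesseract 3.01 and should be checked against
tessdatamanager.h for the version in use. The set of required files is
taken from the quoted message, and the function name is hypothetical.

```python
import re

# ASSUMPTION: order of the TessdataType enum in Tesseract 3.01.
# Verify against ccutil/tessdatamanager.h before relying on it.
COMPONENTS = [
    "config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
    "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
    "fixed-length-dawgs", "cube-unicharset", "cube-word-dawg",
]

# The four files the quoted message names as required.
REQUIRED = {"unicharset", "inttemp", "pffmtable", "normproto"}

# Offsets copied from the quoted combine output.
SAMPLE_LOG = """
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is -1
Offset for type 4 is 17420
Offset for type 5 is -1
Offset for type 6 is -1
Offset for type 7 is 21008
Offset for type 8 is -1
Offset for type 9 is 31506
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
"""


def missing_required(log_text):
    """Return required components whose offset is reported as -1."""
    pairs = re.findall(r"Offset for type (\d+) is (-?\d+)", log_text)
    missing = []
    for index, offset in pairs:
        index = int(index)
        if int(offset) == -1 and index < len(COMPONENTS):
            name = COMPONENTS[index]
            if name in REQUIRED:
                missing.append(name)
    return missing
```

Under that assumed mapping, missing_required(SAMPLE_LOG) reports inttemp
and normproto as absent, which would be consistent with the
SeekToStart(TESSDATA_INTTEMP) assertion in the quoted error.
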