I am seeing a similar type of problem in the output .txt file, for the Kannada language as well.

On 1 March 2015 at 21:27, Anupam Srivatsav <[email protected]>
wrote:

> Dear rnkantan,
>
> I am getting the same type of errors as you described.  Have you overcome
> them? If you have the traineddata, I would like to get it.
> Thanks in advance.
> Anupam.
>
>
> On Thursday, 29 March 2012 01:01:53 UTC+5:30, nkantan r wrote:
>>
>> hi
>> I know there are two Tamil traineddata files, corresponding to the Latha
>> and Lohit fonts. Going through the box and tif files, I understand that
>> the boxes for combined consonants (உயிர்மெய்) are selected as
>> individual glyphs (e.g. கே is boxed as the separate ே and க instead of a
>> merged கே). Since the vowel sign ே comes before the base consonant
>> க, elaborate post-processing is required. Such post-processing
>> could be written by a person who knows Tamil as well as C++, but it is
>> currently missing altogether.
>>
>> To elaborate further: குகூகெகே is read correctly but written out as
>> குகூெகேக. This is because the sequence is read as கு கூ ெ, க ே க; in
>> unicharacter reading, க followed by ே is read as the single unicharacter
>> கே, so the net result is குகூெகேக.
>> This becomes worse when the single characters "கொ", "கோ" and "கௌ" are each
>> read as three characters in three boxes!
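
The reordering described above could be done as a post-processing pass over the output text. Below is a minimal Python sketch, not part of Tesseract. The function name and tables are hypothetical. It assumes the whole input is in visual order (every ெ, ே or ை comes before its consonant); running it on text that is already in logical Unicode order would corrupt it.

```python
# Hypothetical post-processing sketch: move Tamil prefix vowel signs from
# visual order (sign, consonant) to Unicode logical order (consonant, sign).
# Assumption: the input is entirely in visual order.

PREFIX_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}  # ெ ே ை

# (prefix sign, trailing part) -> single two-part vowel sign
TWO_PART = {
    ("\u0bc6", "\u0bbe"): "\u0bca",  # ெ + ா -> ொ
    ("\u0bc7", "\u0bbe"): "\u0bcb",  # ே + ா -> ோ
    ("\u0bc6", "\u0bd7"): "\u0bcc",  # ெ + ௗ -> ௌ
}


def is_tamil_consonant(ch):
    # Tamil consonants lie between க (U+0B95) and ஹ (U+0BB9).
    return "\u0b95" <= ch <= "\u0bb9"


def visual_to_logical(text):
    out = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch in PREFIX_SIGNS and i + 1 < n and is_tamil_consonant(text[i + 1]):
            consonant = text[i + 1]
            trailing = text[i + 2] if i + 2 < n else ""
            combined = TWO_PART.get((ch, trailing))
            if combined:
                # Three boxes (prefix, consonant, trailing part) become
                # consonant + one two-part vowel sign.
                out.append(consonant + combined)
                i += 3
            else:
                # Swap the prefix sign behind its consonant.
                out.append(consonant + ch)
                i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Applied to the example above, `visual_to_logical("குகூெகேக")` gives குகூகெகே. The sketch does not handle the case where the right-hand part of கௌ is recognised as the letter ள instead of the length mark ௗ; that would need an extra, language-aware rule.
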
>>
>> Another major issue is the missing vowel ஔ, which is instead read as
>> the two characters ஒ and ள.
>>
>> To avoid these issues, I am retraining the Tamil alphabet in its
>> proper form. I have succeeded in doing this for one font (Latha
>> size 12), but while combining the language files I get:
>>
>> Combining tessdata files
>> TessdataManager combined tess
>> Offset for type 0 is -1
>> Offset for type 1 is 108
>> Offset for type 2 is -1
>> Offset for type 3 is -1
>> Offset for type 4 is 17420
>> Offset for type 5 is -1
>> Offset for type 6 is -1
>> Offset for type 7 is 21008
>> Offset for type 8 is -1
>> Offset for type 9 is 31506
>> Offset for type 10 is -1
>> Offset for type 11 is -1
>> Offset for type 12 is -1
>>
>> C:\indicocr\tesseract301>
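
To see at a glance which entries in a log like the one above are reported with offset -1, the lines can be parsed. This is a small Python sketch with a hypothetical helper name. It only extracts the numbers; it assumes, without confirmation from this thread, that -1 marks a component that was not found and is not necessarily an error in itself.

```python
import re

# Matches lines such as "Offset for type 3 is -1", with or without
# a leading quote prefix like ">> ".
OFFSET_LINE = re.compile(r"Offset for type\s+(\d+)\s+is\s+(-?\d+)")


def absent_types(log_text):
    """Return the type numbers whose reported offset is -1."""
    absent = []
    for match in OFFSET_LINE.finditer(log_text):
        type_id, offset = int(match.group(1)), int(match.group(2))
        if offset == -1:
            absent.append(type_id)
    return absent
```

For the log quoted above this returns types 0, 2, 3, 5, 6, 8, 10, 11 and 12, which can then be compared against the list of component files that were actually supplied.
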
>>
>> Obviously the -1 above indicates something is wrong? In the whole of the
>> tesseract-ocr project page, it is not possible to get samples of
>>
>> •tessdata/eng.config
>> •tessdata/eng.unicharset
>> •tessdata/eng.unicharambigs
>> •tessdata/eng.inttemp
>> •tessdata/eng.pffmtable
>> •tessdata/eng.normproto
>> •tessdata/eng.punc-dawg
>> •tessdata/eng.word-dawg
>> •tessdata/eng.number-dawg
>> •tessdata/eng.freq-dawg
>>
>> There are 13 items listed in the combine output, while only 10 files are
>> listed above.
>>
>> Though it is mentioned that unicharset, inttemp, pffmtable and normproto
>> are the four files required apart from word-dawg and freq-dawg, there
>> is no mention of whether the other files, such as tam.config and
>> tam.unicharambigs, can be left absent or whether empty files are required.
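
Before combining, it may help to check which of the component files listed above actually exist for the language prefix. The Python sketch below uses a hypothetical helper name. It assumes the `<lang>.<component>` naming shown in the list, and it treats unicharset, inttemp, pffmtable and normproto as required only because the message above says so. Whether the remaining files may be absent is exactly the open question, so they are reported separately.

```python
from pathlib import Path

# Taken from the message above; not verified against the Tesseract sources.
REQUIRED = ["unicharset", "inttemp", "pffmtable", "normproto"]
OTHER = ["config", "unicharambigs", "punc-dawg", "word-dawg",
         "number-dawg", "freq-dawg"]


def check_components(directory, lang):
    """Report which <lang>.<component> files are missing or empty."""
    base = Path(directory)

    def present(name):
        path = base / f"{lang}.{name}"
        return path.is_file() and path.stat().st_size > 0

    return {
        "missing_required": [n for n in REQUIRED if not present(n)],
        "missing_other": [n for n in OTHER if not present(n)],
    }
```

For example, `check_components(".", "tam")` run in the training directory lists any `tam.*` files that are absent or zero-length before the combine step is attempted.
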
>>
>> Now, while trying to run Tesseract using the tam.traineddata made above,
>> I get the error below:
>> ===================================
>> C:\indicocr\tesseract301>tesseract image.tif testtxt -l tam
>> tessdata_manager.SeekToStart(TESSDATA_INTTEMP):Error:Assert failed:in
>> file ..\classify\adaptmatch.cpp, line 512
>>
>> C:\indicocr\tesseract301>
>> =======================================
>>
>> Kindly advise what went wrong and what needs to be done to get a proper
>> traineddata file. I am also really hopeful that the files used before
>> combining will be made available so that one can see the samples.
>>
>> regards
>> rnkantan
>>
>  --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/8654f631-33b6-4fb9-a747-9c0f1a6a7dd4%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/8654f631-33b6-4fb9-a747-9c0f1a6a7dd4%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
