[tesseract-ocr] Re: Tamil Trained data; Tesseract 3.01- its strange ways of using the box file.

er . prashanth27 Mon, 12 Feb 2018 07:39:58 -0800

Hi,

Can I get the box files for those tif files, and also the trained data for 
the Latha font?


On Sunday, April 1, 2012 at 7:32:00 PM UTC+5:30, nkantan r wrote:
>
> hi all!
>  
> i am surprised that no one replied on this subject in this forum, but not 
> shocked, as i find the interest level in tamil ocr is rather limited; 
> the real error in the above is that the "fullstop" in my training image was 
> treated as a zero; so the box file had "two" zeros and the number of unichars 
> did not match.
>  
> while i have successfully trained tesseract (3.01) with suitable 
> unicharambigs to generate the correct ocr for simple computer passages, i 
> am keen on sharing some of my notes on the quirky (that is, strange) ways 
> the box files are used for training; though i have used my own traineddata 
> for training pages of other fonts and even snapshots of real fonts from old 
> books, i will initially be using the existing trained data in this thread.
>  
> The first thing would-be trainers should note is never to use just isolated 
> letters in the image file; either club two or three letters together to form 
> "words" with uneven spacing, or deliberately use more spacing between letters; 
> to clarify further, use அஆஇ   ஈஉஊஎ ஐ  ஒஓஔ instead of அ ஆ இ ஈ உ ஊ எ ஏ ஐ ஒ ஓ ஔ ஃ.  i 
> do not know the reason for this; it is just a strange way of tesseract. 
>  
> i created a file called tam.latha.exp0.tif (from a snapshot of a pdf file 
> of a text file named tam.latha.exp0.odt). This contains all the tamil 
> characters in latha font, regular, size 10, spaced out but presented in 
> alphabetical order. the file is enclosed below. i created the box file 
> with the command below, using the existing trained data:
>
> C:\indicocr\tesseract301>tesseract tam.latha.exp0.tif tam.latha.exp0 -l tam batch.nochop makebox
>  
> the created box file is enclosed after renaming it to 
> tam.latha.exp0.orig.box (the reason for renaming is that i have edited the 
> file). If anybody opens the file in a box editor after renaming it back to 
> the original name, they will find the following:
> a) there is no blob corresponding to ஃ  and ஹ் in the first part; also the 
> boxes are created in a sequence different from the arrangement of letters: 
> அ, ஆ, இ, ஈ , உ, ஊ, எ, ஏ, ஐ, ஒ, க், ங். ............ெ  ஔ.  
>  THAT IS, THOUGH ஒ, ஓ, ஔ ARE IN SEQUENCE, THE BOXES/BLOBS ARE CREATED IN A 
> DIFFERENT ORDER.  This wrong order happens only in the first part; the same 
> set of letters is repeated at the bottom of the page, and there the 
> BOXES/BLOBS are in the same sequence.  I manually edited the box file using 
> the jTess editor, deleting the ஔ box and inserting boxes for ஔ and ஃ. I also 
> deleted irrelevant boxes around the vowel variations. the edited file is 
> enclosed below (tam.latha.exp0.box). Now that the box file is satisfactory, 
> as seen in the jTess box editor, i attempted creating the tr file as below:
> ==================
> C:\indicocr\tesseract301>tesseract tam.latha.exp0.tif tam.latha.exp0 nobatch box.train
> Tesseract Open Source OCR Engine v3.01 with Leptonica
> Page 0
> APPLY_BOXES: boxfile line 14/α«â ((1099,3010),(1124,3040)): FAILURE! 
> Couldn't find a matching blob
> APPLY_BOXES:
>    Boxes read from boxfile:    1151
>    Boxes failed resegmentation:       1
> APPLY_BOXES: Unlabelled word at :Bounding box=(2941,-1043)->(2966,-1027)
> APPLY_BOXES: Unlabelled word at :Bounding box=(3010,-1071)->(3037,-1032)
>    Found 1150 good blobs and 55 unlabelled blobs in 0 words.
>    2 remaining unlabelled words deleted.
> TRAINING ... Font name = latha
> Generated training data for 103 words
> =================
> the generated Tr file is also enclosed;
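The label printed as α«â in the APPLY_BOXES failure line above looks like console mojibake rather than a damaged box file: it is what the UTF-8 bytes of ஃ (U+0B83) look like when a Windows console shows them in a legacy code page. A minimal sketch, assuming the console was on code page 437 (the log does not say which code page was active) and using a helper name of my own:

```python
def undo_console_mojibake(shown, console_codec="cp437"):
    """Recover text whose UTF-8 bytes were displayed in a legacy code page."""
    # Map each displayed character back to the byte the console received,
    # then decode those bytes as the UTF-8 they originally were.
    return shown.encode(console_codec).decode("utf-8")


# The three characters shown in the log: alpha, left guillemet, a-circumflex.
shown = "\u03b1\u00ab\u00e2"
label = undo_console_mojibake(shown)
print("U+%04X" % ord(label))  # prints U+0B83, the Tamil sign aytham
```

So the failing box is the manually inserted ஃ box, which agrees with observation 1 below.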
>  
> my observations and questions:
> 1) the box (1099,3010),(1124,3040) corresponds to ஃ  and was manually 
> inserted; also, it is the 13th box and not on the 14th line!
> 2) what is meant by "boxes failed resegmentation"?
> 3) the second message concerns the bounding boxes 
> (2941,-1043)->(2966,-1027) and (3010,-1071)->(3037,-1032);  i am not able to 
> identify any such boxes and am not sure about the negative values; do they 
> represent boxes in the box file or some blob co-ordinates?
> 4) if we open the Tr file in any editor, we find the letter ஔ is after அ, 
> ஆ, இ, ஈ , உ, ஊ, எ, ஏ, ஐ, ஒ, க், ங். ............ெ ; 
> 5) this means the image file is read again first and the blobs are then 
> matched to the nearest boxes; the box file is not used to directly 
> create the blobs on the tif image and generate the training data within the 
> box boundaries. To a user, common sense suggests that the box file would be 
> used to create the blobs on the image file. 
> 6)  more curious is that the same set of  letters from the first 
> part is repeated in the second half of the page, where it is correctly 
> sequenced in the box file automatically;  so the layout (linear arrangement 
> of letters) probably does not matter?
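On observation 1, a small script can report which line of the box file actually holds a given box. This is only a sketch, assuming the 3.0x box format of one box per line as `label left bottom right top page`; the helper names and the first two sample rows are made up for illustration, and only the ஃ coordinates come from the log above.

```python
from collections import namedtuple

Box = namedtuple("Box", "line_no label left bottom right top page")


def parse_box_text(text):
    """Parse box-file text, keeping the physical line number of each box."""
    boxes = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if len(parts) < 6:
            continue  # blank or malformed line: counts as a line, not as a box
        label = " ".join(parts[:-5])
        left, bottom, right, top, page = map(int, parts[-5:])
        boxes.append(Box(line_no, label, left, bottom, right, top, page))
    return boxes


def find_box(boxes, left, bottom, right, top):
    """Return every box whose corners match the given coordinates exactly."""
    wanted = (left, bottom, right, top)
    return [b for b in boxes if (b.left, b.bottom, b.right, b.top) == wanted]


# Hypothetical sample: the first two rows are invented.
SAMPLE = "அ 100 3010 130 3040 0\nஆ 160 3010 200 3040 0\nஃ 1099 3010 1124 3040 0\n"

boxes = parse_box_text(SAMPLE)
hits = find_box(boxes, 1099, 3010, 1124, 3040)
print(len(boxes), [b.line_no for b in hits])
```

Running it on the real tam.latha.exp0.box would show both the line number and the position of the ஃ box among the parsed boxes, which can then be compared with the "line 14" in the error message.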
>  
> ===============
> I then manually edited the tif file again, moving ஃ  closer to  க்  and ஔ 
> closer to ஒ; this time the box file and tr file were created properly; for 
> reference, the tif file and box file are enclosed (tam.latha.exp1)
>  
> ====
> i would like some answers this time; 
> if anybody really wants to use and improve the revised trained data for 
> testing, please feel free to write
>  
> regards
> rnkantan
>  
> (PS: since google in its wisdom does not want tif images, they are added as 
> zip files!)
>  
>  
>  
>  
>
> On Thursday, March 29, 2012 1:01:53 AM UTC+5:30, nkantan r wrote:
>
>> hi 
>> i know there are two tamil trained data files, corresponding to the Latha 
>> and Lohit fonts; going through the box and tif files, i understand that 
>> the boxes for combined consonants (உயிர்மெய்) are selected as 
>> individual parts (for eg: கே  is selected as individual ே and க instead of a 
>> merged கே). Since the vowel variation ே comes before the base consonant 
>> க, elaborate post-processing is required; such post-processing can only 
>> be written by a person who knows tamil as well as cpp, and at present it 
>> is missing altogether; 
>>
>> to elaborate further:   குகூகெகே  is read correctly but output as 
>> குகூெகேக; this is because the  sequence is read as கு கூ ெ, க ே க; in the 
>> unichar output, க followed by ே is rendered as the single character 
>> கே;  the net result is குகூெகேக 
>> this becomes worse when the single characters "கொ"  "கோ" "கௌ" are read 
>> as three characters in three boxes! 
>>
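A minimal sketch of the kind of post-processing described above, assuming the recogniser emits one code point per box in left-to-right visual order. The function name and the tables are my own illustration, not part of Tesseract. It moves the left-side vowel signs ெ ே ை behind the consonant that follows them, and merges the two-part signs ொ and ோ. The sign ௌ is left out, because its right-hand part looks the same as the letter ள and would need context to tell apart.

```python
# Vowel signs drawn to the LEFT of the consonant but stored AFTER it.
PREFIX_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}  # signs E, EE, AI

# Two-part signs: (left part, right part AA) -> single code point.
TWO_PART = {
    ("\u0bc6", "\u0bbe"): "\u0bca",  # sign E + sign AA -> sign O
    ("\u0bc7", "\u0bbe"): "\u0bcb",  # sign EE + sign AA -> sign OO
}


def is_consonant(ch):
    """True for code points in the Tamil consonant range KA..HA."""
    return "\u0b95" <= ch <= "\u0bb9"


def visual_to_logical(text):
    """Reorder visually ordered Tamil text into Unicode logical order."""
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in PREFIX_SIGNS and i + 1 < n and is_consonant(text[i + 1]):
            sign, consonant = ch, text[i + 1]
            i += 2
            # A right-hand part directly after the consonant completes
            # a two-part vowel sign.
            if i < n and (sign, text[i]) in TWO_PART:
                sign = TWO_PART[(sign, text[i])]
                i += 1
            out.append(consonant + sign)
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# The example from the message: ku, kuu, then sign E + ka, sign EE + ka.
visual = "\u0b95\u0bc1\u0b95\u0bc2\u0bc6\u0b95\u0bc7\u0b95"
logical = visual_to_logical(visual)
```

With the visually ordered string from the example, `logical` ends up as குகூகெகே, the intended text.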
>> another major issue is the missing vowel ஔ, which is read as two 
>> separate characters, ஒ and ள; 
>>
>> to avoid these issues, i am retraining the tamil alphabet in its 
>> proper form; though i have succeeded doing the same in one font (Latha 
>> size 12), while combining the language files i am getting : 
>>
>> Combining tessdata files 
>> TessdataManager combined tess 
>> Offset for type 0 is -1 
>> Offset for type 1 is 108 
>> Offset for type 2 is -1 
>> Offset for type 3 is -1 
>> Offset for type 4 is 17420 
>> Offset for type 5 is -1 
>> Offset for type 6 is -1 
>> Offset for type 7 is 21008 
>> Offset for type 8 is -1 
>> Offset for type 9 is 31506 
>> Offset for type 10 is -1 
>> Offset for type 11 is -1 
>> Offset for type 12 is -1 
>>
>> C:\indicocr\tesseract301> 
>>
>> obviously the -1 above indicates something is wrong? in the whole of the 
>> tesseract-ocr project page, it is not possible to get samples of 
>>
>> •tessdata/eng.config 
>> •tessdata/eng.unicharset 
>> •tessdata/eng.unicharambigs 
>> •tessdata/eng.inttemp 
>> •tessdata/eng.pffmtable 
>> •tessdata/eng.normproto 
>> •tessdata/eng.punc-dawg 
>> •tessdata/eng.word-dawg 
>> •tessdata/eng.number-dawg 
>> •tessdata/eng.freq-dawg 
>>
>> There are 13 items listed in the combined tessdata output, while only 10 
>> files are listed above. 
>>
>> Though it is mentioned that unicharset, inttemp, pffmtable, normproto 
>> are the four files required apart from word-dawg and freq-dawg, there 
>> is no mention of whether the other files, such as tam.config, 
>> tam.unicharambigs etc., can be left absent or whether empty files are 
>> required. 
>>
>> now, while trying to run Tesseract using the tam.traineddata made above, 
>> i am getting the error below: 
>> =================================== 
>> C:\indicocr\tesseract301>tesseract image.tif testtxt -l tam 
>> tessdata_manager.SeekToStart(TESSDATA_INTTEMP):Error:Assert failed:in 
>> file ..\classify\adaptmatch.cpp, line 512 
>>
>> C:\indicocr\tesseract301> 
>> ======================================= 
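The offsets printed while combining and the assert on TESSDATA_INTTEMP look related. The sketch below rests on one assumption: type N corresponds to the N-th file in the tessdata list in the message, which matches my recollection of the enum order in the 3.01 sources but should be checked there. The names for types 10 to 12 are likewise from memory. Under that assumption, an offset of -1 means the component was not found when combining.

```python
import re

# Types 0-9 follow the order of the file list in the message; the last
# three names are an assumption about the 3.01 sources.
TYPE_NAMES = [
    "config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
    "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
    "fixed-length-dawgs", "cube-unicharset", "cube-word-dawg",
]

# The four classifier files the message says are required.
REQUIRED = {"unicharset", "inttemp", "pffmtable", "normproto"}


def missing_components(log_text):
    """Return the component names whose offset is reported as -1."""
    missing = []
    for m in re.finditer(r"Offset for type (\d+) is (-?\d+)", log_text):
        idx, offset = int(m.group(1)), int(m.group(2))
        if offset == -1 and idx < len(TYPE_NAMES):
            missing.append(TYPE_NAMES[idx])
    return missing


# The offsets exactly as quoted in the message.
LOG = """
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is -1
Offset for type 4 is 17420
Offset for type 5 is -1
Offset for type 6 is -1
Offset for type 7 is 21008
Offset for type 8 is -1
Offset for type 9 is 31506
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
"""

missing = missing_components(LOG)
print(sorted(REQUIRED.intersection(missing)))  # prints ['inttemp', 'normproto']
```

If the mapping holds, inttemp and normproto never made it into tam.traineddata, which would explain the failed seek to TESSDATA_INTTEMP. It would be worth checking that both files exist with the tam. prefix in the directory being combined.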
>>
>> kindly advise what went wrong, and what needs to be done to get a proper 
>> traineddata file. i am also really hopeful that the files used before 
>> combining will be made available so that one can see the samples. 
>>
>> regards 
>> rnkantan 
>>
>

