Can I get the box files for those TIFF files, and also the trained data for Latha?
On Sunday, April 1, 2012 at 7:32:00 PM UTC+5:30, nkantan r wrote:
> Hi all!
> I am surprised that no one replied on this subject in this forum, but not
> shocked, as I find the interest in Tamil OCR is rather limited.
> The real error in the above is that the full stop in my training image was
> treated as a zero, so the box file had two zeros and the number of unichars
> did not match.
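A quick way to spot this kind of problem before training is to count how often each symbol occurs in the box file. The sketch below is only an illustration: it assumes the usual Tesseract 3.x box line layout (symbol, left, bottom, right, top, page), and the sample coordinates are made up.

```python
from collections import Counter

def count_box_symbols(box_text):
    """Count how often each symbol appears in box-file text."""
    counts = Counter()
    for line in box_text.splitlines():
        fields = line.split()
        # Assumed layout of a box line: symbol left bottom right top [page]
        if len(fields) >= 5:
            counts[fields[0]] += 1
    return counts

# Hypothetical sample: a full stop that was boxed as a second "0".
sample = "0 10 10 20 30 0\n0 40 10 44 14 0\n1 50 10 60 30 0\n"
for symbol, n in count_box_symbols(sample).most_common():
    print(symbol, n)
```

For a real file, read it as UTF-8 and pass the text in; a symbol that turns up more often than it appears in the image is a candidate for a mislabelled box.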
> While I have successfully trained Tesseract (3.01) with suitable
> unicharambigs to generate the correct OCR for simple computer passages, I
> am keen on sharing some of my notes on the quirky (that is, strange) ways
> the box files are used for training. Though I have used my own traineddata
> for training pages of other fonts and even snapshots of real fonts from old
> books, in this thread I will be using the existing trained data.
> The first thing would-be trainers should note is never to use just single
> letters in the image file: either club two or three letters together to form
> "words" with uneven spacing, or deliberately use more spacing between letters.
> To clarify further, use அஆஇ ஈஉஊஎ ஐ ஒஓஔ instead of அ ஆ இ ஈ உ ஊ எ ஏ ஐ ஒ ஓ ஔ ஃ. I
> do not know the reason for this; it is just a strange way of Tesseract.
> I created a file called tam.latha.exp0.tif (from a snapshot of a PDF file
> of a text file named tam.latha.exp0.odt). This contains all the Tamil
> characters in the Latha font, regular, size 10, spaced out but presented in
> alphabetical order. The file is enclosed below. I created the box file
> with the command below, using the existing trained data:
> C:\indicocr\tesseract301>tesseract tam.latha.exp0.tif tam.latha.exp0 -l
> tam batch.nochop makebox
> The created box file is enclosed after renaming it to
> tam.latha.exp0.orig.box (the reason for renaming is that I have edited the
> file). Anybody who opens the file in a box editor, after renaming it to the
> original name, will find the following:
> a) There is no blob corresponding to ஃ and ஹ் in the first part; also, the
> boxes are created in a sequence different from the arrangement of letters:
> அ, ஆ, இ, ஈ , உ, ஊ, எ, ஏ, ஐ, ஒ, க், ங். ............ெ ஔ.
> THAT IS, THOUGH ஒ, ஓ, ஔ ARE IN SEQUENCE, THE BOXES/BLOBS ARE CREATED IN A
> DIFFERENT ORDER. This wrong order happens only in the first part; the same
> set of letters is repeated at the bottom of the page, and there the
> boxes/blobs are in the same sequence. I manually edited the box file using
> the jTess editor, deleting the ஔ box and inserting boxes for ஔ and ஃ. I also
> deleted irrelevant boxes around the vowel variations. The edited file is
> enclosed below (tam.latha.exp0.box). Now that the box file was satisfactory,
> as seen in the jTess box editor, I attempted creating the tr file as below:
> C:\indicocr\tesseract301>tesseract tam.latha.exp0.tif tam.latha.exp0
> nobatch box.train
> Tesseract Open Source OCR Engine v3.01 with Leptonica
> Page 0
> APPLY_BOXES: boxfile line 14/α«â ((1099,3010),(1124,3040)): FAILURE!
> Couldn't find a matching blob
> Boxes read from boxfile: 1151
> Boxes failed resegmentation: 1
> APPLY_BOXES: Unlabelled word at :Bounding box=(2941,-1043)->(2966,-1027)
> APPLY_BOXES: Unlabelled word at :Bounding box=(3010,-1071)->(3037,-1032)
> Found 1150 good blobs and 55 unlabelled blobs in 0 words.
> 2 remaining unlabelled words deleted.
> TRAINING ... Font name = latha
> Generated training data for 103 words
> The generated tr file is also enclosed.
> My observations and questions:
> 1) The box (1099,3010),(1124,3040) corresponds to ஃ and was manually
> inserted. Also, it is the 13th box and not on the 14th line!
> 2) What is meant by "boxes failed resegmentation"?
> 3) The second message concerns the bounding boxes
> (2941,-1043)->(2966,-1027) and (3010,-1071)->(3037,-1032). I am not able to
> identify any such boxes, and I am not sure about the negative values: do
> they represent boxes in the box file or some blob coordinates?
> 4) If we open the tr file in any editor, we find the letter ஔ comes after அ,
> ஆ, இ, ஈ , உ, ஊ, எ, ஏ, ஐ, ஒ, க், ங். ............ெ ;
> 5) This means the image file is read again first and the blobs are then
> matched to the nearest boxes; the box file is not used directly to
> create the blobs on the TIFF image and generate the training data within the
> box boundaries. For a user, common sense suggests that the
> box file would be used to create the blobs on the image file.
> 6) More curious still, the same set of letters from the first
> part is repeated in the second half of the page, and there it is correctly
> sequenced in the box file automatically; so the layout (linear arrangement
> of letters) probably does not matter?
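On observation 1, one way to check which line of the box file a reported box really sits on is to search the file by coordinates. This is a minimal sketch, assuming the log prints the box as (left,bottom),(right,top) and that box lines follow the usual symbol, left, bottom, right, top, page layout. The sample text is made up apart from the ஃ coordinates quoted above.

```python
def find_box(box_text, left, bottom, right, top):
    """Return (line_number, symbol) for every box with exactly these coordinates."""
    hits = []
    for line_no, line in enumerate(box_text.splitlines(), start=1):
        fields = line.split()
        if len(fields) < 5:
            continue  # not a box line
        try:
            coords = tuple(int(v) for v in fields[1:5])
        except ValueError:
            continue  # coordinates are not integers, skip the line
        if coords == (left, bottom, right, top):
            hits.append((line_no, fields[0]))
    return hits

# Hypothetical two-line box file; U+0B83 is the Tamil aytham (ஃ).
sample = "\u0b85 100 3000 130 3040 0\n\u0b83 1099 3010 1124 3040 0\n"
print(find_box(sample, 1099, 3010, 1124, 3040))
```

Line numbers here are counted from 1. Comparing the result with the "boxfile line 14" in the message would at least show whether Tesseract and the editor count lines the same way.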
> I then manually edited the TIFF file again, moving ஃ closer to க் and ஔ
> closer to ஒ; this time the box file and tr file were created properly. For
> reference, the TIFF file and box file are enclosed (tam.latha.exp1).
> I would like some answers this time.
> If anybody really wants to use and improve the revised trained data for
> testing, please feel free to write.
> (PS: since Google in its wisdom does not want TIFF images, they are added as
> zip files!)
> On Thursday, March 29, 2012 1:01:53 AM UTC+5:30, nkantan r wrote:
>> I know there are two Tamil trained data files, corresponding to the Latha
>> and Lohit fonts. Going through the box and TIFF files, I understand that
>> the boxes for combined consonants (உயிர்மெய்) are selected as
>> individual components (e.g. கே is selected as individual ே and க instead
>> of a merged கே). Since the vowel variation ே comes before the base
>> consonant க, elaborate post-processing is required. Such post-processing
>> can be written by a person who knows Tamil as well as C++, and such
>> post-processing is currently missing altogether.
>> To elaborate further: குகூகெகே is read correctly but output as
>> குகூெகேக. This is because the sequence is read as கு கூ ெ, க ே க; at the
>> unichar level, க followed by ே is read as the single unichar
>> கே; the net result is குகூெகேக.
>> This becomes worse when the single characters "கொ", "கோ" and "கௌ" are
>> each read as three characters in three boxes!
>> Another major issue is the missing vowel ஔ, which is read as
>> ஒ and ள.
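The reordering step described above can be sketched outside Tesseract as a small post-processing pass. This is only an illustration in Python, not code from Tesseract or from the existing traineddata. It assumes the OCR text is entirely in visual (left-to-right glyph) order, so that ெ, ே and ை appear before their consonant, and it would wrongly swap text that is already in logical order. The function name is hypothetical.

```python
# Vowel signs that are drawn to the left of the consonant: ெ ே ை
PREFIX_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}

# Two-part vowel signs: (left part, right part) -> single code point
TWO_PART = {
    ("\u0bc6", "\u0bbe"): "\u0bca",  # ெ + ா -> ொ
    ("\u0bc7", "\u0bbe"): "\u0bcb",  # ே + ா -> ோ
    ("\u0bc6", "\u0bd7"): "\u0bcc",  # ெ + ௗ -> ௌ
}

def is_consonant(ch):
    """True for Tamil consonants, U+0B95 (க) through U+0BB9 (ஹ)."""
    return "\u0b95" <= ch <= "\u0bb9"

def visual_to_logical(text):
    """Move a leading vowel sign behind its consonant (visual -> Unicode order)."""
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in PREFIX_SIGNS and i + 1 < n and is_consonant(text[i + 1]):
            consonant, sign = text[i + 1], ch
            i += 2
            # Merge a trailing right-hand part into the two-part sign, if any.
            if i < n and (sign, text[i]) in TWO_PART:
                sign = TWO_PART[(sign, text[i])]
                i += 1
            out.append(consonant + sign)
        else:
            out.append(ch)
            i += 1
    return "".join(out)

# The sequence from the post: கு கூ ெ க ே க
print(visual_to_logical("\u0b95\u0bc1\u0b95\u0bc2\u0bc6\u0b95\u0bc7\u0b95"))
```

With the sequence from the post as input, the pass yields குகூகெகே. It does not handle the right-hand part of கௌ being recognised as the consonant ள rather than the length mark ௗ; that case would need an extra rule and some care, since ள after a consonant is also legitimate text.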
>> To avoid these issues, I am retraining the Tamil alphabet in its
>> proper form. Though I have succeeded in doing so for one font (Latha,
>> size 12), while combining the language files I am getting:
>> Combining tessdata files
>> TessdataManager combined tess
>> Offset for type 0 is -1
>> Offset for type 1 is 108
>> Offset for type 2 is -1
>> Offset for type 3 is -1
>> Offset for type 4 is 17420
>> Offset for type 5 is -1
>> Offset for type 6 is -1
>> Offset for type 7 is 21008
>> Offset for type 8 is -1
>> Offset for type 9 is 31506
>> Offset for type 10 is -1
>> Offset for type 11 is -1
>> Offset for type 12 is -1
>> Obviously the -1 values above indicate that something is wrong? In the whole of the
>> tesseract-ocr project page, it is not possible to get the samples for
>> There are 13 items listed in the combinedTess while only 10 files are
>> listed out above.
>> Though it is mentioned that unicharset, inttemp, pffmtable and normproto
>> are the four files required, apart from word-dawg and freq-dawg, there
>> is no mention of whether the other files, such as tam.config,
>> tam.unicharambigs, etc., can be left absent or whether empty files are required.
>> Now, while trying to run Tesseract using the tam.traineddata made above,
>> I am getting the error below:
>> C:\indicocr\tesseract301>tesseract image.tif testtxt -l tam
>> tessdata_manager.SeekToStart(TESSDATA_INTTEMP):Error:Assert failed:in
>> file ..\classify\adaptmatch.cpp, line 512
>> Kindly advise what went wrong and what needs to be done to get a proper
>> traineddata file. I am really hopeful that the files used before
>> combining will also be made available so that one can see the samples.
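To read the combine output more easily, the type numbers can be mapped to component names. The sketch below is an assumption-laden illustration: the name list is the component order as I recall it from the Tesseract 3.01 sources (tessdatamanager.h) and should be checked against the version actually in use. The "required" set is simply the four files named in the post.

```python
import re

# ASSUMPTION: component order for Tesseract 3.01; verify against your sources.
COMPONENTS = [
    "config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
    "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
    "fixed-length-dawgs", "cube-unicharset", "cube-word-dawg",
]
# The four files the post names as required.
REQUIRED = {"unicharset", "inttemp", "pffmtable", "normproto"}

def missing_components(log_text):
    """Return names of components whose offset is reported as -1."""
    missing = []
    for m in re.finditer(r"Offset for type\s+(\d+) is (-?\d+)", log_text):
        idx, offset = int(m.group(1)), int(m.group(2))
        name = COMPONENTS[idx] if idx < len(COMPONENTS) else "type %d" % idx
        if offset == -1:
            missing.append(name)
    return missing

# Offsets copied from the combine output quoted above.
offsets = [-1, 108, -1, -1, 17420, -1, -1, 21008, -1, 31506, -1, -1, -1]
log = "\n".join("Offset for type %d is %d" % (i, o) for i, o in enumerate(offsets))
missing = missing_components(log)
print("missing:", missing)
print("required but missing:", sorted(REQUIRED & set(missing)))
```

If the mapping holds, type 3 (inttemp) and type 5 (normproto) were not picked up, which would be consistent with the assert on TESSDATA_INTTEMP. It may be worth checking that tam.inttemp and tam.normproto exist in the working directory under exactly those names before combining.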