Hi, can I get the box files for those tif files, and also the trained data for the Latha font?
On Sunday, April 1, 2012 at 7:32:00 PM UTC+5:30, nkantan r wrote:
>
> Hi all!
>
> I am surprised that no one replied on this subject in this forum, but not shocked, as I find the interest level in Tamil OCR is rather limited. The real error in the above is that the "full stop" in my training image is treated as a zero, so the box file had "two" zeros but the number of unichars did not match.
>
> While I have successfully trained Tesseract (3.01) with suitable unicharambigs to generate correct OCR for simple computer passages, I am keen on sharing some of my notes on the quirky (that is, strange) ways the box files are used for training. Though I have used my own traineddata for training pages of other fonts, and even snapshots of real fonts from old books, I will initially be using the existing trained data in this thread.
>
> The first thing for would-be trainers to note is never to use just single letters in the image file. Either club two or three letters together into "words" with uneven spacing, or deliberately use more spacing between letters. To clarify further, use அஆஇ ஈஉஊஎ ஐ ஒஓஔ instead of அ ஆ இ ஈ உ ஊ எ ஏ ஐ ஒ ஓ ஔ ஃ. I do not know the reason for this; it is just a strange way Tesseract behaves.
>
> I created a file called tam.latha.exp0.tif (from a snapshot of a PDF file of a text file named tam.latha.exp0.odt). It contains all the Tamil characters in the Latha font, regular, size 10, spaced out but presented in alphabetical order. The file is enclosed below. I created the box file with the command below, using the existing trained data:
>
> C:\indicocr\tesseract301>tesseract tam.latha.exp0.tif tam.latha.exp0 -l tam batch.nochop makebox
>
> The created box file is enclosed after renaming it as tam.latha.exp0.orig.box (the reason for renaming is that I have edited the file).
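For readers who want to inspect such a box file outside a box editor, here is a minimal Python sketch. It assumes the Tesseract 3.x box format of one box per line, `<symbol> <left> <bottom> <right> <top> <page>`, with coordinates measured from the bottom-left corner of the image. The names `Box`, `parse_box_text` and `find_box` are made up for illustration and are not part of Tesseract, and the sample data is invented.

```python
from typing import List, NamedTuple, Optional


class Box(NamedTuple):
    line_no: int   # 1-based line number in the box file
    symbol: str    # the character(s) the box is labelled with
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_text(text: str) -> List[Box]:
    """Parse box-file text, assuming six whitespace-separated fields per line."""
    boxes = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Split from the right so the symbol itself is never broken up.
        fields = line.rsplit(None, 5)
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        symbol, *numbers = fields
        left, bottom, right, top, page = (int(n) for n in numbers)
        boxes.append(Box(line_no, symbol, left, bottom, right, top, page))
    return boxes


def find_box(boxes: List[Box], left: int, bottom: int,
             right: int, top: int) -> Optional[Box]:
    """Return the first box with exactly these coordinates, if any."""
    for box in boxes:
        if (box.left, box.bottom, box.right, box.top) == (left, bottom, right, top):
            return box
    return None


# Invented sample data, not taken from the enclosed files.
sample = "அ 10 20 30 40 0\nஆ 50 20 80 40 0\n"
hit = find_box(parse_box_text(sample), 50, 20, 80, 40)
print(hit.line_no, hit.symbol)
```

Looking up the coordinates quoted in a Tesseract message this way gives the line number and label of the box in question without opening an editor.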
> If anybody opens the file in a box editor after renaming it to the original name, they will find the following:
> a) There is no blob corresponding to ஃ and ஹ் in the first part. Also, the boxes are created in a sequence different from the arrangement of letters: அ, ஆ, இ, ஈ , உ, ஊ, எ, ஏ, ஐ, ஒ, க், ங். ............ெ ஔ. That is, though ஒ, ஓ, ஔ are in sequence, the boxes/blobs are created in a different order. This wrong order happens only in the first part; the same set of letters is repeated at the bottom of the page, and there the boxes/blobs are in the same sequence as the letters.
>
> I manually edited the box file using the jTess editor, deleting the ஔ box and inserting boxes for ஔ and ஃ. I also deleted irrelevant boxes around the vowel variations. The edited file is enclosed below (tam.latha.exp0.box). Now that the box file was satisfactory, as seen in the jTess box editor, I attempted creating the tr file as below:
> ==================
> C:\indicocr\tesseract301>tesseract tam.latha.exp0.tif tam.latha.exp0 nobatch box.train
> Tesseract Open Source OCR Engine v3.01 with Leptonica
> Page 0
> APPLY_BOXES: boxfile line 14/α«â ((1099,3010),(1124,3040)): FAILURE! Couldn't find a matching blob
> APPLY_BOXES:
> Boxes read from boxfile: 1151
> Boxes failed resegmentation: 1
> APPLY_BOXES: Unlabelled word at :Bounding box=(2941,-1043)->(2966,-1027)
> APPLY_BOXES: Unlabelled word at :Bounding box=(3010,-1071)->(3037,-1032)
> Found 1150 good blobs and 55 unlabelled blobs in 0 words.
> 2 remaining unlabelled words deleted.
> TRAINING ... Font name = latha
> Generated training data for 103 words
> =================
> The generated tr file is also enclosed.
>
> My observations and questions:
> 1) The box (1099,3010),(1124,3040) corresponds to ஃ and was manually inserted. Also, it is the 13th box and not on the 14th line!
> 2) What is meant by "boxes failed resegmentation"?
> 3) Regarding the second message about the bounding boxes (2941,-1043)->(2966,-1027) and (3010,-1071)->(3037,-1032): I am not able to identify any such boxes, and I am not sure about the negative values. Do they represent boxes in the box file or some blob coordinates?
> 4) If we open the tr file in any editor, we find the letter ஔ comes after அ, ஆ, இ, ஈ , உ, ஊ, எ, ஏ, ஐ, ஒ, க், ங். ............ெ ;
> 5) This means the image file is read again first and then the blobs are compared to the nearest boxes. It is not that the box file is used directly to create the blobs on the tif image and generate the training data within the box boundaries. For a user, it appeals to common sense that the box file would be used to create the blobs on the image file.
> 6) More curious is that the same set of letters from the first part is repeated in the second half of the page, and there it is automatically sequenced correctly in the box file; so the layout (linear arrangement of letters) probably does not matter?
>
> ===============
> Again, I manually edited the tif file, moving ஃ closer to க் and ஔ closer to ஒ; this time the box file and tr file were created properly. For reference, the tif file and box file are enclosed (tam.latha.exp1).
>
> ====
> I would like some answers this time.
> If anybody really wants to use and improve the revised trained data for testing, please feel free to write.
>
> regards
> rnkantan
>
> (PS: since Google in its wisdom does not want tif images, they are added as zip files!)
>
> On Thursday, March 29, 2012 1:01:53 AM UTC+5:30, nkantan r wrote:
>
>> hi
>> I know there are two Tamil trained data files, corresponding to the Latha and Lohit fonts. Going through the box and tif files, I understand that the boxes for combined consonants (உயிர்மெய்) are selected as individual boxes (e.g. கே is selected as individual ே and க instead of a merged கே).
>> Since the vowel variation ே comes before the base consonant க, elaborate post-processing is required. Such post-processing can be written by a person who knows Tamil as well as C++, but it is at present missing altogether.
>>
>> To elaborate further: குகூகெகே is read correctly but output as text as குகூெகேக. This is because the sequence is read as கு கூ ெ, க ே க; in unicharacter reading, க followed by ே is read as the single unicharacter கே, and the net result is குகூெகேக.
>> This becomes worse when single characters such as "கொ", "கோ" and "கௌ" are read as three characters in three boxes!
>>
>> Another major issue is the missing vowel ஔ, which is read as ஒ and ள.
>>
>> To avoid these issues, I am retraining the Tamil alphabet in its proper form. Though I have succeeded in doing this for one font (Latha, size 12), while combining the language files I am getting:
>>
>> Combining tessdata files
>> TessdataManager combined tess
>> Offset for type 0 is -1
>> Offset for type 1 is 108
>> Offset for type 2 is -1
>> Offset for type 3 is -1
>> Offset for type 4 is 17420
>> Offset for type 5 is -1
>> Offset for type 6 is -1
>> Offset for type 7 is 21008
>> Offset for type 8 is -1
>> Offset for type 9 is 31506
>> Offset for type 10 is -1
>> Offset for type 11 is -1
>> Offset for type 12 is -1
>>
>> C:\indicocr\tesseract301>
>>
>> Obviously the -1 above indicates something wrong? In the whole of the tesseract-ocr project page, it is not possible to get samples of:
>>
>> •tessdata/eng.config
>> •tessdata/eng.unicharset
>> •tessdata/eng.unicharambigs
>> •tessdata/eng.inttemp
>> •tessdata/eng.pffmtable
>> •tessdata/eng.normproto
>> •tessdata/eng.punc-dawg
>> •tessdata/eng.word-dawg
>> •tessdata/eng.number-dawg
>> •tessdata/eng.freq-dawg
>>
>> There are 13 items listed in the combining output above, while only 10 files are listed here.
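A small Python sketch for reading the `Offset for type N` lines quoted above. It rests on two assumptions that should be checked against `tessdatamanager.h` in the Tesseract 3.01 sources: that type indices 0 to 9 follow the order of the ten files in the list above (config, unicharset, unicharambigs, inttemp, and so on), and that an offset of -1 means the component was not found when combining. The names `COMPONENTS`, `parse_offsets` and `missing_components` are illustrative only.

```python
import re

# ASSUMPTION: type index N corresponds to the N-th file in the list quoted
# in the message; verify against tessdatamanager.h before relying on it.
COMPONENTS = [
    "config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
    "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
]

OFFSET_RE = re.compile(r"Offset for type\s+(\d+)\s+is\s+(-?\d+)")


def parse_offsets(output: str) -> dict:
    """Map each type index to its reported offset."""
    return {int(t): int(off) for t, off in OFFSET_RE.findall(output)}


def missing_components(output: str) -> list:
    """Names of components reported with offset -1 (assumed to mean absent)."""
    offsets = parse_offsets(output)
    return [COMPONENTS[t] for t, off in sorted(offsets.items())
            if off == -1 and t < len(COMPONENTS)]


# The offsets as quoted in the message (types 10 to 12 are also -1 there).
log = """
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is -1
Offset for type 4 is 17420
Offset for type 5 is -1
Offset for type 6 is -1
Offset for type 7 is 21008
Offset for type 8 is -1
Offset for type 9 is 31506
"""
print(missing_components(log))
```

If the assumed ordering is right, inttemp and normproto would be among the components reported as absent, which would be consistent with the `SeekToStart(TESSDATA_INTTEMP)` assertion quoted later in the same message.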
>>
>> Though it is mentioned that unicharset, inttemp, pffmtable and normproto are the four files required apart from word-dawg and freq-dawg, there is no mention of whether the other files, such as tam.config, tam.unicharambigs etc., can be left absent or whether empty files are required.
>>
>> Now, while trying to run Tesseract using the tam.traineddata made above, I am getting the error below:
>> ===================================
>> C:\indicocr\tesseract301>tesseract image.tif testtxt -l tam
>> tessdata_manager.SeekToStart(TESSDATA_INTTEMP):Error:Assert failed:in file ..\classify\adaptmatch.cpp, line 512
>>
>> C:\indicocr\tesseract301>
>> =======================================
>>
>> Kindly advise what went wrong and what needs to be done to get a proper traineddata file. I am really hopeful that the files used before combining are also made available so that one can see the samples.
>>
>> regards
>> rnkantan
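On the vowel-sign ordering problem described in the first message of this thread (குகூகெகே coming out as குகூெகேக), here is a minimal Python post-processing sketch. It assumes the OCR output lists characters in left-to-right visual order, handles only the three pre-base vowel signs ெ, ே and ை, and relies on Unicode NFC normalization to recombine two-part vowels such as கொ when the right-hand part has been read as ா (U+0BBE) or ௗ (U+0BD7). It does not handle a right-hand part misread as ள. The function name `visual_to_logical` is made up for illustration.

```python
import unicodedata

# Tamil vowel signs that are drawn to the left of the consonant but
# stored after it in Unicode text: U+0BC6, U+0BC7, U+0BC8.
PREBASE_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}


def is_tamil_consonant(ch: str) -> bool:
    """True for code points in the Tamil consonant range U+0B95..U+0BB9."""
    return "\u0b95" <= ch <= "\u0bb9"


def visual_to_logical(text: str) -> str:
    """Move each pre-base vowel sign to after the consonant that follows it."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if (ch in PREBASE_SIGNS and i + 1 < len(text)
                and is_tamil_consonant(text[i + 1])):
            out.append(text[i + 1])  # consonant first
            out.append(ch)           # then its vowel sign
            i += 2
        else:
            out.append(ch)
            i += 1
    # NFC merges e.g. U+0BC6 + U+0BBE into the single sign U+0BCA.
    return unicodedata.normalize("NFC", "".join(out))


# The example from the message, written in visual order:
# ka+u, ka+uu, sign e, ka, sign ee, ka
visual = "\u0b95\u0bc1\u0b95\u0bc2\u0bc6\u0b95\u0bc7\u0b95"
print(visual_to_logical(visual))
```

A real fix would more likely belong in the training data itself (boxing கே as one unit, as the message proposes); this sketch only shows what the missing post-processing step would need to do.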