Re: [tesseract-ocr] Re: Tamil Trained data; Tesseract 3.01- its strange ways of using the box file.

ShreeDevi Kumar Mon, 12 Feb 2018 07:52:17 -0800

That is a really old email regarding traineddata for 3.01.

You might get better results using the latest version of files from github.


On 12-Feb-2018 9:09 PM, <[email protected]> wrote:

Hi..

can i get the box file for those tif files and trained data also for latha
font...

On Sunday, April 1, 2012 at 7:32:00 PM UTC+5:30, nkantan r wrote:
>
> hi all!
>
> i am surprised that no one replied on this subject in this forum, but not
> shocked as i find the interest level in tamil ocr is rather very limited;
> the real error on the above is that the "fullstop" in my training image is
> treated as zero; so the box file had "two" zeros but the number of unichars
> were not matching.
>
> while i have successfully trained tesseract (3.01) with suitable
> unicharambigs to generate the correct ocr for simple computer passages, i
> am keen on sharing some of my notes on the quirky (that is strange) ways
> the box files are used for training; though i have used my own traineddata
> for training pages of other fonts and even real fonts snapshots of old
> books, i will be using here in this thread the existing trained data
> initially.
>
> First thing to be noted by would-be trainers is never to use just isolated
> letters in the image file; either club two or three letters together to form
> "words" with uneven spacing, or deliberately use more spacing between
> letters; to clarify
> further, use அஆஇ   ஈஉஊஎ ஐ  ஒஓஔ instead of அ ஆ இ ஈ உ ஊ எ ஏ ஐ ஒ ஓ ஔ ஃ.  i
> do not know the reason for this; it is just a strange way of tesseract.
>
> i created a file called tam.latha.exp0.tif (from a snapshot of a pdf file
> of a text file named tam.latha.exp0.odt). This contains all the tamil
> characters in latha font, regular, size 10, spaced out but presented in
> alphabetical order. the file is enclosed below. i created the box file
> with the command below, using the existing trained data:
>
> C:\indicocr\tesseract301>tesseract tam.latha.exp0.tif tam.latha.exp0 -l
> tam batch.nochop makebox
>
> the created box file is enclosed after renaming it as
> tam.latha.exp0.orig.box (the reason for renaming is that i have edited the
> file). If anybody opens the file in a box editor after renaming it to the
> original name, they will find the following:
> a) there is no blob corresponding to ஃ  and ஹ் in the first part; also the
> boxes are created in a sequence different from the arrangement of letters:
> அ, ஆ, இ, ஈ , உ, ஊ, எ, ஏ, ஐ, ஒ, க், ங். ............ெ  ஔ.
>  THAT IS THOUGH ஒ, ஓ, ஔ ARE IN SEQUENCE THE BOXES/BLOBS ARE CREATED IN
> DIFFERENT ORDER.  This wrong order happens only in the first part, the same
> set of letters are repeated in the bottom of the page and the BOXES/BLOBS
> are in same sequence.  I manually edited the box file using the jTess
> editor, deleting the ஔ box and inserting boxes for ஔ and ஃ. I also deleted
> irrelevant boxes around the vowel-variations. the edited file is enclosed
> below (tam.latha.exp0.box). Now that the box file is satisfactory, as seen
> in the jTess box editor, i attempted creating the tr file as below:
> ==================
> C:\indicocr\tesseract301>tesseract tam.latha.exp0.tif tam.latha.exp0
> nobatch box.train
> Tesseract Open Source OCR Engine v3.01 with Leptonica
> Page 0
> APPLY_BOXES: boxfile line 14/α«â ((1099,3010),(1124,3040)): FAILURE!
> Couldn't find a matching blob
> APPLY_BOXES:
>    Boxes read from boxfile:    1151
>    Boxes failed resegmentation:       1
> APPLY_BOXES: Unlabelled word at :Bounding box=(2941,-1043)->(2966,-1027)
> APPLY_BOXES: Unlabelled word at :Bounding box=(3010,-1071)->(3037,-1032)
>    Found 1150 good blobs and 55 unlabelled blobs in 0 words.
>    2 remaining unlabelled words deleted.
> TRAINING ... Font name = latha
> Generated training data for 103 words
> =================
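
A note on the garbled `α«â` in the APPLY_BOXES line above: it looks like the UTF-8 bytes of a Tamil character displayed through a legacy console code page. The following is a minimal Python sketch, assuming the Windows console was using code page 437 (an assumption; the post does not say which code page was active). The function name is made up for illustration.

```python
def undo_console_mojibake(garbled: str, console_codepage: str = "cp437") -> str:
    """Re-encode the text with the assumed console code page, then decode as UTF-8.

    The code page is an assumption; a different console setting would need
    a different codec name here.
    """
    return garbled.encode(console_codepage).decode("utf-8")


recovered = undo_console_mojibake("α«â")
# Print the recovered character together with its Unicode code point.
print(recovered, hex(ord(recovered)))
```

Under that assumption the three characters collapse back into the single code point U+0B83 (ஃ), which agrees with the reading that the failing box belongs to ஃ.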
> the generated Tr file is also enclosed;
>
> my observations and questions:
> 1) the box (1099,3010),(1124,3040)  corresponds to ஃ  and has been manually
> inserted; Also it is the 13th box and not in the 14th line!
> 2) what is meant by "boxes failed resegmentation"?
> 3) second message regarding the bounding boxes  (
> 2941,-1043)->(2966,-1027) (3010,-1071)->(3037,-1032);  i am not able to
> identify any boxes; not sure about the negative values; do they represent
> the boxes in the box file or some blob co-ordinates?
> 4) if we open the Tr file in any editor, we find the letter ஔ is after அ,
> ஆ, இ, ஈ , உ, ஊ, எ, ஏ, ஐ, ஒ, க், ங். ............ெ ;
> 5) this means the image file is again read first and then the blobs are
> compared to the nearest boxes.  not that the box file is used to directly
> create the blobs on the tif image and generate the training data within the
> box boundaries. Obviously for a user, it appeals to common sense that the
> box file will be used to create the blob on the image file.
> 6)  more curious to note is that the same set of  letters in the first
> part are repeated in the second half of the page; it is correctly sequenced
> in the box file automatically;  so the layout (linear arrangement of
> letters) probably does not matter?
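
Observation 1 (which line of the box file a reported box sits on) can be checked mechanically instead of by eye. The following is a rough Python sketch, assuming the Tesseract 3.x box file layout of `symbol left bottom right top page` per line. The three sample lines are hypothetical; only the ஃ coordinates are taken from the log above, and the helper names are invented for this example.

```python
from typing import List, NamedTuple, Optional


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(lines: List[str]) -> List[Box]:
    """Parse box file lines, skipping blank or malformed ones."""
    boxes = []
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue
        symbol, *numbers = parts
        left, bottom, right, top, page = (int(n) for n in numbers)
        boxes.append(Box(symbol, left, bottom, right, top, page))
    return boxes


def find_box(boxes: List[Box], left: int, bottom: int,
             right: int, top: int) -> Optional[int]:
    """Return the 1-based position of the box with these coordinates, or None."""
    for position, box in enumerate(boxes, start=1):
        if (box.left, box.bottom, box.right, box.top) == (left, bottom, right, top):
            return position
    return None


# Hypothetical excerpt; a real run would read tam.latha.exp0.box instead.
sample = [
    "ஒ 1000 3010 1030 3040 0",
    "ஃ 1099 3010 1124 3040 0",
    "க் 1200 3010 1230 3045 0",
]
boxes = parse_box_lines(sample)
print(find_box(boxes, 1099, 3010, 1124, 3040))
```

Running the same lookup on the real box file would show whether the position Tesseract reports and the position seen in the editor differ by one.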
>
> ===============
> i then manually edited the tif file again, moving ஃ closer to க் and ஔ
> closer to ஒ; this time the box file and tr files are created properly; for
> reference the tif file and box file are enclosed (tam.latha.exp1)
>
> ====
> i would like some answers this time;
> if any body really wants to use and improve the revised trained data for
> testing please feel free to write
>
> regards
> rnkantan
>
> (PS since google in its wisdom does not want tif images, they are added as
> zip files!)
>
>
>
>
>
> On Thursday, March 29, 2012 1:01:53 AM UTC+5:30, nkantan r wrote:
>
>> hi
>> i know there are two tamil trained data files corresponding to Latha
>> and Lohit fonts; going through the box and tif files i understand that
>> the boxes for combined consonants (உயிர்மெய்) are selected as
>> individual parts (for eg: கே  is selected as individual ே and க instead
>> of a merged கே). Since the vowel variation ே comes before the base
>> consonant க, elaborate post-processing is required; such post-processing
>> can be written only by a person knowing tamil as well as cpp, and such
>> post-processing is now altogether missing;
>>
>> to elaborate further:   குகூகெகே  is read correctly but written out as
>> குகூெகேக; this is because the  sequence is read as கு கூ ெ, க ே க; in
>> unicharacter reading, க followed by ே is read as the single unicharacter
>> கே;  the net result is குகூெகேக
>> this becomes worse when single characters such as "கொ"  "கோ" "கௌ" are
>> read as three characters in three boxes!
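
The missing post-processing described above could be prototyped outside Tesseract before anyone touches the C++ code. What follows is a small Python sketch, not part of Tesseract. It assumes the OCR output is in visual (left to right) order and that only the three left-side vowel signs ெ ே ை need moving. The function names and the two-part table are illustrative choices.

```python
# Vowel signs drawn to the left of the consonant but stored after it in Unicode.
PREFIX_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}  # ெ ே ை

# Left part + right part -> single two-part vowel sign.
TWO_PART = {
    ("\u0bc6", "\u0bbe"): "\u0bca",  # ெ + ா -> ொ
    ("\u0bc7", "\u0bbe"): "\u0bcb",  # ே + ா -> ோ
    ("\u0bc6", "\u0bd7"): "\u0bcc",  # ெ + ௗ -> ௌ
}


def is_consonant(ch: str) -> bool:
    """True for code points in the Tamil consonant range U+0B95..U+0BB9."""
    return "\u0b95" <= ch <= "\u0bb9"


def visual_to_logical(text: str) -> str:
    """Move left-side vowel signs behind the consonant they belong to."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in PREFIX_SIGNS and i + 1 < len(text) and is_consonant(text[i + 1]):
            consonant, sign = text[i + 1], ch
            i += 2
            # Fold a trailing right-hand part into one two-part sign if present.
            if i < len(text) and (sign, text[i]) in TWO_PART:
                sign = TWO_PART[(sign, text[i])]
                i += 1
            out.append(consonant + sign)
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# The wrongly ordered output quoted above, written as code points.
wrong = "\u0b95\u0bc1\u0b95\u0bc2\u0bc6\u0b95\u0bc7\u0b95"
print(visual_to_logical(wrong))
```

One caveat: the right-hand part of ௌ looks like ள, so an engine may well report ள (U+0BB3) rather than ௗ (U+0BD7). The sketch deliberately does not rewrite ள, since that would also change genuine ள letters.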
>>
>> another major issue is the missing vowel ஔ, which is read as two
>> separate characters, ஒ and ள;
>>
>> to avoid these issues, i am retraining the tamil alphabet in its
>> proper form; though i have succeeded doing the same in one font (Latha
>> size 12), while combining the language files i am getting :
>>
>> Combining tessdata files
>> TessdataManager combined tess
>> Offset for type 0 is -1
>> Offset for type 1 is 108
>> Offset for type 2 is -1
>> Offset for type 3 is -1
>> Offset for type 4 is 17420
>> Offset for type 5 is -1
>> Offset for type 6 is -1
>> Offset for type 7 is 21008
>> Offset for type 8 is -1
>> Offset for type 9 is 31506
>> Offset for type 10 is -1
>> Offset for type 11 is -1
>> Offset for type 12 is -1
>>
>> C:\indicocr\tesseract301>
>>
>> obviously the -1 above indicates something wrong;? in the whole of the
>> tesseract-ocr project page, it is not possible to get the samples for
>>
>> •tessdata/eng.config
>> •tessdata/eng.unicharset
>> •tessdata/eng.unicharambigs
>> •tessdata/eng.inttemp
>> •tessdata/eng.pffmtable
>> •tessdata/eng.normproto
>> •tessdata/eng.punc-dawg
>> •tessdata/eng.word-dawg
>> •tessdata/eng.number-dawg
>> •tessdata/eng.freq-dawg
>>
>> There are 13 items listed in the combinedTess while only 10 files are
>> listed out above.
>>
>> Though it is mentioned that unicharset, inttemp, pffmtable, normproto
>> are the four files required apart from word-dawg and freq-dawg, there
>> is no mention of whether the other files such as tam.config,
>> tam.unicharambigs etc can be left absent or empty files are required.
>>
>> now while trying to run Tesseract using the above made tam.traineddata
>> i am getting the error as below:
>> ===================================
>> C:\indicocr\tesseract301>tesseract image.tif testtxt -l tam
>> tessdata_manager.SeekToStart(TESSDATA_INTTEMP):Error:Assert failed:in
>> file ..\classify\adaptmatch.cpp, line 512
>>
>> C:\indicocr\tesseract301>
>> =======================================
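
The "Offset for type N" listing and this assert may be two views of the same problem. Below is a short Python sketch that reads the listing and names the components whose offset is -1. The index-to-name table is an assumption: it follows what I believe is the order of the `TessdataType` enum in the 3.01 source and should be checked against `tessdatamanager.h`. The function names are made up for illustration.

```python
import re
from typing import Dict, List

# ASSUMED order of the 13 tessdata components (verify against the 3.01 source).
TESSDATA_TYPES = [
    "config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
    "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
    "fixed-length-dawgs", "cube-unicharset", "cube-word-dawg",
]

OFFSET_LINE = re.compile(r"Offset for type (\d+) is (-?\d+)")


def parse_offsets(log: str) -> Dict[int, int]:
    """Map each component index to the offset printed in the combine log."""
    return {int(index): int(offset) for index, offset in OFFSET_LINE.findall(log)}


def missing_components(offsets: Dict[int, int]) -> List[str]:
    """Names of components whose offset is -1, i.e. not found when combining."""
    return [TESSDATA_TYPES[i] for i, off in sorted(offsets.items()) if off == -1]


# The listing from the earlier message, copied without the quote markers.
log = """
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is -1
Offset for type 4 is 17420
Offset for type 5 is -1
Offset for type 6 is -1
Offset for type 7 is 21008
Offset for type 8 is -1
Offset for type 9 is 31506
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
"""

print(missing_components(parse_offsets(log)))
```

If that ordering holds, the combine step picked up unicharset, pffmtable, word-dawg and freq-dawg but not inttemp or normproto. That would be consistent with the assert on `TESSDATA_INTTEMP`, so it may be worth checking that tam.inttemp and tam.normproto exist under exactly those names before combining.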
>>
>> kindly advise what went wrong, and what needs to be done to get a proper
>> traineddata file. and i am really hopeful that the files used before
>> combining are also made available so that one can see the samples.
>>
>> regards
>> rnkantan
>>
> --
You received this message because you are subscribed to the Google Groups
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/
msgid/tesseract-ocr/c9230929-aa95-4cef-898d-d67fceb8a877%40googlegroups.com
<https://groups.google.com/d/msgid/tesseract-ocr/c9230929-aa95-4cef-898d-d67fceb8a877%40googlegroups.com?utm_medium=email&utm_source=footer>
.
For more options, visit https://groups.google.com/d/optout.