Re: [tesseract-ocr] how to test the file kan.unicharambigs used for tesseract-ocr 3.04 or 3.05

Tom Morris Mon, 07 Dec 2015 10:16:07 -0800

Hi Sriranga.  I haven't used the training tools, but since no one else has 
answered, I'll give it my best attempt.  Shree might have better insights.


First, a question of clarification.  Are you having problems with the file 
or are you just trying to determine whether it is working properly or not?

If you just want to see if it's working correctly, my impression is that 
most people do this empirically by a) visual inspection of the file to see 
if the substitutions look correct and b) running a corpus of text through 
to see how the contents of the file affect accuracy.
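
For (b), one way to put a number on "affects accuracy" is a character error 
rate against hand-checked ground truth, computed once for output produced 
with the unicharambigs file and once without it. The following is only a 
minimal sketch: the function names are made up for illustration, it is not 
part of the Tesseract tools, and it counts Unicode code points rather than 
Kannada grapheme clusters.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ocr_text: str, truth: str) -> float:
    """Edit distance divided by the length of the ground truth."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, truth) / len(truth)
```

If the error rate over the same pages drops when the file is in place, the 
substitutions are probably helping; if it rises, some rule is likely too 
aggressive.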

To my untrained eye, the things I wonder about are:
- are all those mandatory substitutions (lines ending in 1) correct? i.e., is 
it true that the string in column 1 can *never* be a valid word?
- there is an empty line which probably should be removed
- there are a few lines which have junk after the third column which don't 
match the specified format e.g.:

ಚಟಿಲ್ಕೆ ಚಟ್ನಿ,, 1   "
ಹೊರಿದಿವೆ ಹೊಂದಿವೆ.1   .
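
Format problems like these can be caught mechanically. Here is a rough 
sketch of a checker, assuming the v2 format as described in the wiki extract 
quoted below (a "v2" header line, then "error correction type" separated by 
single spaces, type 0 or 1, file ending in a newline). check_ambigs is a 
hypothetical helper, not a Tesseract tool, and it does not handle the older 
v1 format.

```python
def check_ambigs(text: str) -> list[str]:
    """Return a list of problems found in v2-format unicharambigs text."""
    problems = []
    if not text.endswith("\n"):
        problems.append("file does not end with a newline")
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final newline
    if not lines or lines[0].strip() != "v2":
        problems.append("line 1: expected the 'v2' header")
    for num, line in enumerate(lines[1:], start=2):
        if not line.strip():
            problems.append(f"line {num}: empty line")
            continue
        fields = line.split(" ")
        if len(fields) != 3 or "" in fields:
            problems.append(
                f"line {num}: expected 'error correction type' "
                "separated by single spaces")
        elif fields[2] not in ("0", "1"):
            problems.append(
                f"line {num}: type must be 0 or 1, got {fields[2]!r}")
    return problems
```

Reading the file with open("kan.unicharambigs", encoding="utf-8").read() 
before passing it in also confirms the UTF-8 requirement, since Python 
raises UnicodeDecodeError on invalid bytes. This only checks the layout; it 
cannot tell you whether a substitution is linguistically sensible.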

Some of the words with embedded punctuation also look a little suspicious 
to me.  Not knowing the script or language I don't know how common these 
errors are, but I'd probably start with a very basic list of substitutions 
and add to it as I found more common errors.

Hopefully someone else can give you better advice which is based on more 
than bystander guesswork!

Tom


On Friday, December 4, 2015 at 10:36:13 PM UTC-5, sriranga(83yrsold) wrote:
>
> Solution is requested urgently. 
>
> On Wed, Dec 2, 2015 at 4:25 PM, sriranga(83yrsold) wrote:
>
>>
>>  I have created kan.unicharambigs (attached below) based on the output 
>> text of the Kan.training_text file (which is big). I do not understand how 
>> to test the attached file and find out whether it works or not.
>> Kindly point out my mistakes in the attached file, if any, for which 
>> I shall be thankful to you. I would prefer a command-line test if possible.
>>
>> ==========================================================================
>> Based on wiki instruction (extract reproduced below for ready reference) =
>>
>> The rules are not bidirectional, so if you want 'rn' to be considered 
>> when 'm' is detected and vice versa you need a rule for each. 
>>
>> Version 3.03 and on supports a new, simpler format for the unicharambigs 
>> file: 
>>
>> v2
>> '' " 1
>> m rn 0
>> iii m 0
>>
>> In this format, the "error" and "correction" are simple utf-8 strings 
>> separated by *a space*, and, after another space, the same type 
>> specifier as v1 (0 for optional and 1 for mandatory substitution). Note the 
>> downside of this simpler format is that Tesseract has to encode the utf-8 
>> strings into the components of the unicharset. In complex scripts, this 
>> encoding may be ambiguous. In this case, the encoding is chosen such as to 
>> use the least utf-8 characters for each component, i.e. the shortest 
>> unicharset components will make up the encoding. 
>>
>> Like most other files used in training, the 'unicharambigs' file must be 
>> encoded as UTF8, and must end with a newline character. The unicharambigs 
>> format is also described in the unicharambigs(5) man page 
>> <https://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharambigs.5.html>.
>>  
>>
>>
>
>

