Re: [tesseract-ocr] how to test the file kan.unicharambigs used for tesseeract-ocr 3.04 or 3.05

Sriranga(83yrsold) Tue, 08 Dec 2015 01:12:40 -0800

Another question: how can I test the <lang>.unicharambigs file in
tesseract-ocr and add more entries to it? If it can be tested from CMD or a
terminal, what command line should be used?


On Tue, Dec 8, 2015 at 2:18 PM, Sriranga(83yrsold) <
[email protected]> wrote:

> Hi Tom,
> Attached is a sample of the post-proc.txt file used in FreeOCR, a feature
> that its creator, Ralph Richardson, added at my special request more than 3
> years ago. The attached screenshots speak for themselves. I prepared the
> sample in English so that it is easy for you to follow.
> You can test it in any language. FreeOCR is available as a free download.
> You will notice that the post-processor text sample offers a feature similar
> to <lang>.unicharambigs, except that it has no 0 or 1 option.
> *Advantage of the built-in* "unicharambigs": misspellings are corrected
> automatically in the final OCR output, and because the corrections are
> built into the <lang>.traineddata, the same tessdata can be used to correct
> the output text for any image.
> *Disadvantage of the post-processor*: being an external program, its
> post-proc.txt has to be updated every time, for each OCRed text.
> I am puzzled why unicharambigs, as a built-in mechanism, does not work
> correctly when the post-processor program works fine.
> With regards,
> sriranga(83yrs)
>
>
> On Mon, Dec 7, 2015 at 11:44 PM, Tom Morris <[email protected]> wrote:
>
>> Hi Sriranga.  I haven't used the training tools, but since no one else
>> has answered, I'll give it my best attempt.  Shree might have better
>> insights.
>>
>> First, a question of clarification.  Are you having problems with the
>> file or are you just trying to determine whether it is working properly or
>> not?
>>
>> If you just want to see if it's working correctly, my impression is that
>> most people do this empirically by a) visual inspection of the file to see
>> if the substitutions look correct and b) running a corpus of text through
>> to see how the contents of the file affect accuracy.
>>
>> To my untrained eye, the things I wonder about are:
>> - are all those mandatory substitutions (lines ending in 1) correct? i.e.,
>> is it true that the string in column 1 can *never* be a valid word?
>> - there is an empty line which probably should be removed
>> - there are a few lines which have junk after the third column which
>> don't match the specified format e.g.:
>>
>> ಚಟಿಲ್ಕೆ ಚಟ್ನಿ,, 1   "
>> ಹೊರಿದಿವೆ ಹೊಂದಿವೆ.1   .
>>
>> Some of the words with embedded punctuation also look a little suspicious
>> to me.  Not knowing the script or language I don't know how common these
>> errors are, but I'd probably start with a very basic list of substitutions
>> and add to it as I found more common errors.
>>
>> Hopefully someone else can give you better advice which is based on more
>> than bystander guesswork!
>>
>> Tom
>>
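
Tom's suggestion (b), running a corpus through and seeing how the file affects accuracy, could be sketched roughly as below. This is only an illustration, not a Tesseract tool: it assumes you already have the OCR output produced with and without your kan.unicharambigs, plus a hand-corrected ground-truth text. The function names `edit_distance` and `char_error_rate` are made up for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def char_error_rate(ocr_text, truth):
    """Edit distance divided by the length of the ground truth."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, truth) / len(truth)


# Toy strings standing in for real OCR output (hypothetical data).
truth = "modem"
without_ambigs = "modern"   # 'm' misread as 'rn'
with_ambigs = "modem"       # after a 'rn m' substitution applied
print(char_error_rate(without_ambigs, truth))
print(char_error_rate(with_ambigs, truth))
```

If the error rate does not drop (or rises) once the file is in place, the substitutions are probably not being applied or are too aggressive. Note that this counts Unicode code points, so for Kannada a single visible syllable may count as several characters. As far as I know, `combine_tessdata -o kan.traineddata kan.unicharambigs` is the usual way to put an edited file into an existing traineddata in 3.0x, but please treat that as an assumption and check the combine_tessdata man page.
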
>>
>> On Friday, December 4, 2015 at 10:36:13 PM UTC-5, sriranga(83yrsold)
>> wrote:
>>>
>>> A solution is requested urgently.
>>>
>>> On Wed, Dec 2, 2015 at 4:25 PM, sriranga(83yrsold) <
>>> [email protected]> wrote:
>>>
>>>>
>>>>  I have created kan.unicharambigs (attached below) based on the output
>>>> text of the Kan.training_text file (which is big). I do not understand how
>>>> to test the attached file and find out whether it works or not.
>>>> Kindly point out my mistakes in the attached file, if any, for
>>>> which I shall be thankful to you. I would prefer a command-line test if
>>>> possible.
>>>>
>>>>
>>>> ==========================================================================
>>>> Based on the wiki instructions (extract reproduced below for ready
>>>> reference):
>>>>
>>>> The rules are not bidirectional, so if you want 'rn' to be considered
>>>> when 'm' is detected and vice versa you need a rule for each.
>>>>
>>>> Version 3.03 and on supports a new, simpler format for the
>>>> unicharambigs file:
>>>>
>>>> v2
>>>> '' " 1
>>>> m rn 0
>>>> iii m 0
>>>>
>>>> In this format, the "error" and "correction" are simple utf-8 strings
>>>> separated by *a space*, and, after another space, the same type
>>>> specifier as v1 (0 for optional and 1 for mandatory substitution). Note the
>>>> downside of this simpler format is that Tesseract has to encode the utf-8
>>>> strings into the components of the unicharset. In complex scripts, this
>>>> encoding may be ambiguous. In this case, the encoding is chosen so as to
>>>> use the fewest UTF-8 characters for each component, i.e., the shortest
>>>> unicharset components will make up the encoding.
>>>>
>>>> Like most other files used in training, the 'unicharambigs' file must
>>>> be encoded as UTF8, and must end with a newline character. The
>>>> unicharambigs format is also described in the unicharambigs(5) man page
>>>> <https://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharambigs.5.html>.
>>>>
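
The format rules in the wiki extract above, together with the problems Tom spotted (an empty line, junk after the third column), could be checked mechanically before training. The following is a minimal sketch, not part of Tesseract: `check_ambigs` is a hypothetical helper that only looks at the v2 layout (UTF-8, final newline, `v2` header, three space-separated fields, type 0 or 1). It cannot tell whether Tesseract actually applies the rules.

```python
def check_ambigs(data):
    """Return a list of format problems found in v2 unicharambigs bytes."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return ["not valid UTF-8: %s" % exc]

    problems = []
    if not text.endswith("\n"):
        problems.append("file does not end with a newline")

    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final newline

    if not lines or lines[0].strip() != "v2":
        problems.append("line 1: expected the header 'v2'")

    for number, line in enumerate(lines[1:], start=2):
        if not line.strip():
            problems.append("line %d: empty line" % number)
            continue
        fields = line.split(" ")
        if len(fields) != 3 or "" in fields:
            problems.append(
                "line %d: expected 'error correction type' "
                "separated by single spaces" % number)
            continue
        if fields[2] not in ("0", "1"):
            problems.append("line %d: type must be 0 or 1" % number)
    return problems


# The sample from the wiki extract passes with no problems reported.
sample = "v2\n'' \" 1\nm rn 0\niii m 0\n".encode("utf-8")
print(check_ambigs(sample))
```

To check a real file, something like `check_ambigs(open("kan.unicharambigs", "rb").read())` should work. Reading in binary mode lets the function test the UTF-8 encoding itself.
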
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/0d30025d-cc11-4f69-9e98-ec919d3f43df%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/0d30025d-cc11-4f69-9e98-ec919d3f43df%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>
>

