Re: [tesseract-ocr] how to test the file kan.unicharambigs used for tesseract-ocr 3.04 or 3.05

Sriranga(83yrsold) Fri, 11 Dec 2015 00:01:17 -0800

Tom,
Thanks for the response. I would like to know whether you have tested
"eng.unicharambigs" at your end, and to hear about your experience and any
comments you have. Based on your comments and suggestions, I am thinking of
trying this for my language, Kannada, which is a complex Indian language.



On Thu, Dec 10, 2015 at 10:46 PM, Tom Morris <[email protected]> wrote:

> On Wed, Dec 9, 2015 at 5:30 AM, Sriranga(83yrsold) <
> [email protected]> wrote:
>
>> Tom,
>> Thanks for the hints. I have just tested the eng.unicharambigs that I
>> created and am happy to note that it works fine; the attached files speak
>> for themselves. I have also attached the output "unicharamtest.txt" for your
>> perusal. In it, however, I noticed that the last line, "luck good", was not
>> changed to "good luck". Where did I make a mistake?
>>
>
> This mechanism is really intended to fix a small number of characters, not
> reorder entire word strings.  The "good luck" case may be running into the
> maximum string size (10) limit, depending on whether or not the count
> includes the string terminator, but whatever the cause of the failure,
> it is not a very realistic use case.  I would focus on the actual texts
> that you're trying to correct.
>
> Tom
>
>
>> Your suggested sentence,
>> "Novv is the time to go dovvn", was also corrected. Please note that I
>> regenerated eng.traineddata on Ubuntu 15.10.
>> With regards, sriranga(83ys)
>>
>> On Wed, Dec 9, 2015 at 12:04 AM, Tom Morris <[email protected]> wrote:
>>
>>> FreeOCR is closed source and Windows only, so it's difficult for me to
>>> tell what it's doing (or even what version of Tesseract it includes).
>>> However, the test case that you're using doesn't appear realistic.
>>> Tesseract is optimized for recognizing words, not short random strings of
>>> characters, so rather than testing on "vv w" I think you'd get more
>>> representative results if you used something like "Novv is the time to go
>>> dovvn" and see if it turns the vv's into w's.  Having said that, vv ==> w
>>> isn't an entry in the standard eng.unicharambigs.  The only mandatory
>>> entries are for quotes, so you could try things like `' or '` to see if
>>> they get turned into ".
>>>
>>> As far as I know, there's no way to specify a different unicharambigs
>>> file on the command line.  You need to replace it in the kan.traineddata
>>> file for it to be found.  The combine_tessdata utility is used for packing
>>> and unpacking the traineddata files, e.g.
>>>
>>>     $ combine_tessdata -e kan.traineddata kan.unicharambigs
>>>     $ combine_tessdata -o kan.traineddata kan.unicharambigs
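
The repack-and-retest cycle implied by the commands above could be scripted roughly as follows. This is only a sketch: the helper names `retest_commands` and `run_all`, the file names, and the assumption that the edited traineddata file sits where Tesseract looks for its tessdata are illustrative and not confirmed anywhere in this thread. By default nothing is executed; the commands are only printed.

```python
import subprocess

def retest_commands(lang, image, out_base):
    """Build the commands to repack <lang>.traineddata and re-run OCR.

    Hypothetical helper: the names and paths are placeholders.
    """
    traineddata = f"{lang}.traineddata"
    ambigs = f"{lang}.unicharambigs"
    return [
        # Overwrite the unicharambigs component inside the traineddata file.
        ["combine_tessdata", "-o", traineddata, ambigs],
        # Re-run recognition; the text output goes to <out_base>.txt.
        ["tesseract", image, out_base, "-l", lang],
    ]

def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_all(retest_commands("kan", "sample.png", "sample_out"))
```

Comparing the text output before and after repacking, on an image that contains the error strings inside real words, shows whether the new entries take effect.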
>>>
>>> One thing that I noticed when looking at the source is that there's an
>>> upper limit of 10 characters for the bad and replacement strings, which I'm
>>> not sure is documented anywhere.  This should be plenty for most
>>> applications, but it's something to keep in mind.
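
To see how close an entry comes to that limit, a small sketch like the one below can report its length in a few different units. Which unit the limit actually counts (bytes, code points, or unicharset components) is not settled in this thread, so the function `measure` is a hypothetical helper that only reports the numbers and leaves the comparison against 10 to the reader.

```python
def measure(s):
    """Return several ways of counting the length of an ambig string."""
    return {
        "code_points": len(s),
        # One extra slot, in case the limit counts a C-style terminator.
        "with_terminator": len(s) + 1,
        "utf8_bytes": len(s.encode("utf-8")),
    }

if __name__ == "__main__":
    samples = [
        "vv",
        "good luck",
        # A five-code-point Kannada string, written with escapes.
        "\u0c9a\u0c9f\u0ccd\u0ca8\u0cbf",
    ]
    for s in samples:
        print(repr(s), measure(s))
```

For scripts such as Kannada the byte count grows much faster than the code point count, so it matters which of the two the limit refers to.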
>>>
>>> Good luck.  Let us know how you make out.
>>>
>>> Tom
>>>
>>>
>>>
>>> On Tue, Dec 8, 2015 at 4:11 AM, Sriranga(83yrsold) <
>>> [email protected]> wrote:
>>>
>>>> Another question is how to test the <lang>.unicharambigs file in
>>>> tesseract-ocr and add more entries to it. If it can be tested from CMD or
>>>> a terminal, what command line should be used?
>>>>
>>>> On Tue, Dec 8, 2015 at 2:18 PM, Sriranga(83yrsold) <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Tom,
>>>>> Attached is a sample of the post-proc.txt file used in FreeOCR, a feature
>>>>> that its creator, Ralph Richardson, added at my special request more
>>>>> than 3 years ago. The attached screenshots speak for themselves. I have
>>>>> prepared the sample in English so that it is easy for you to follow.
>>>>> You can test it in any language. FreeOCR is available as a free download.
>>>>> You will notice that the post-processor text sample offers a feature
>>>>> similar to <lang>.unicharambigs, except that it has no 0 or 1 option.
>>>>> *Advantage of the built-in* "unicharambigs": misspellings are corrected
>>>>> automatically when the final OCR output is produced. Because the
>>>>> corrections are built into the <lang>.traineddata, the resulting tessdata
>>>>> can be used to correct the output text for any image.
>>>>> *Disadvantage of the post-processor*: being an external program, its
>>>>> post-proc.txt has to be updated every time, for each OCRed text.
>>>>> I am puzzled why unicharambigs, the internal mechanism, does not work
>>>>> correctly when the post-processor program works fine.
>>>>> With regards,
>>>>> sriranga(83yrs)
>>>>>
>>>>>
>>>>> On Mon, Dec 7, 2015 at 11:44 PM, Tom Morris <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Sriranga.  I haven't used the training tools, but since no one
>>>>>> else has answered, I'll give it my best attempt.  Shree might have better
>>>>>> insights.
>>>>>>
>>>>>> First, a question of clarification.  Are you having problems with the
>>>>>> file or are you just trying to determine whether it is working properly
>>>>>> or not?
>>>>>>
>>>>>> If you just want to see if it's working correctly, my impression is
>>>>>> that most people do this empirically by a) visual inspection of the file
>>>>>> to see if the substitutions look correct and b) running a corpus of text
>>>>>> through to see how the contents of the file affect accuracy.
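
One common way to put a number on step b) is a character error rate: the edit distance between the OCR output and a hand-corrected reference, divided by the reference length. The sketch below is an assumption about how such a comparison might be done, not a tool that ships with Tesseract; the function names are made up for illustration.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # match or substitute
        prev = cur
    return prev[-1]

def char_error_rate(reference, ocr_output):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)

if __name__ == "__main__":
    reference = "Now is the time to go down"
    before = "Novv is the time to go dovvn"  # output without the ambig rule
    after = "Now is the time to go down"     # output with the ambig rule
    print("before:", char_error_rate(reference, before))
    print("after: ", char_error_rate(reference, after))
```

Running the same corpus with the old and the new unicharambigs and comparing the two rates shows whether the added entries help or hurt overall.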
>>>>>>
>>>>>> To my untrained eye, the things I wonder about are:
>>>>>> - are all those mandatory substitutions (lines ending in 1) correct?
>>>>>> ie is it true that the string in column 1 can *never* be a valid word?
>>>>>> - there is an empty line which probably should be removed
>>>>>> - there are a few lines which have junk after the third column which
>>>>>> don't match the specified format e.g.:
>>>>>>
>>>>>> ಚಟಿಲ್ಕೆ ಚಟ್ನಿ,, 1   "
>>>>>> ಹೊರಿದಿವೆ ಹೊಂದಿವೆ.1   .
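
Problems like the empty line and the trailing junk can be caught mechanically before repacking. The following is a minimal sketch that assumes the file uses the v2 format described in the wiki extract quoted later in this message (a "v2" header, then "error correction type" separated by single spaces, with a final newline). The function `check_ambigs` is a hypothetical helper, not part of the Tesseract tools.

```python
def check_ambigs(text):
    """Return a list of (line_number, message) for suspicious lines."""
    problems = []
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final newline
    else:
        problems.append((len(lines), "file does not end with a newline"))
    if not lines or lines[0] != "v2":
        problems.append((1, "first line is not 'v2'"))
    for number, line in enumerate(lines[1:], start=2):
        if line == "":
            problems.append((number, "empty line"))
            continue
        fields = line.split(" ")
        if len(fields) != 3 or "" in fields:
            problems.append((number, "expected exactly 3 space-separated fields"))
        elif fields[2] not in ("0", "1"):
            problems.append((number, "type must be 0 or 1"))
    return problems

if __name__ == "__main__":
    sample = "v2\n'' \" 1\nm rn 0\n\niii m 0   x\n"
    for number, message in check_ambigs(sample):
        print(f"line {number}: {message}")
```

A file read with `open(path, encoding="utf-8").read()` can be passed straight to the function; an empty result means none of these particular format problems were found, not that the substitutions themselves are sensible.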
>>>>>>
>>>>>> Some of the words with embedded punctuation also look a little
>>>>>> suspicious to me.  Not knowing the script or language I don't know how
>>>>>> common these errors are, but I'd probably start with a very basic list of
>>>>>> substitutions and add to it as I found more common errors.
>>>>>>
>>>>>> Hopefully someone else can give you better advice which is based on
>>>>>> more than bystander guesswork!
>>>>>>
>>>>>> Tom
>>>>>>
>>>>>>
>>>>>> On Friday, December 4, 2015 at 10:36:13 PM UTC-5, sriranga(83yrsold)
>>>>>> wrote:
>>>>>>>
>>>>>>> Solution is requested urgently.
>>>>>>>
>>>>>>> On Wed, Dec 2, 2015 at 4:25 PM, sriranga(83yrsold) <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> I have created kan.unicharambigs (attached below) based on the
>>>>>>>> output text of the Kan.training_text file (which is big). I do not
>>>>>>>> understand how to test the attached file and find out whether it works
>>>>>>>> or not.
>>>>>>>> Kindly point out my mistakes in the attached file, if any; I shall be
>>>>>>>> thankful to you. I would prefer a command-line test if
>>>>>>>> possible.
>>>>>>>>
>>>>>>>>
>>>>>>>> ==========================================================================
>>>>>>>> Based on wiki instruction (extract reproduced below for ready
>>>>>>>> reference) =
>>>>>>>>
>>>>>>>> The rules are not bidirectional, so if you want 'rn' to be
>>>>>>>> considered when 'm' is detected and vice versa you need a rule for
>>>>>>>> each.
>>>>>>>>
>>>>>>>> Version 3.03 and on supports a new, simpler format for the
>>>>>>>> unicharambigs file:
>>>>>>>>
>>>>>>>> v2
>>>>>>>> '' " 1
>>>>>>>> m rn 0
>>>>>>>> iii m 0
>>>>>>>>
>>>>>>>> In this format, the "error" and "correction" are simple utf-8
>>>>>>>> strings separated by *a space*, and, after another space, the same
>>>>>>>> type specifier as v1 (0 for optional and 1 for mandatory substitution).
>>>>>>>> Note the downside of this simpler format is that Tesseract has to
>>>>>>>> encode the utf-8 strings into the components of the unicharset. In
>>>>>>>> complex scripts, this encoding may be ambiguous. In this case, the
>>>>>>>> encoding is chosen such as to use the least utf-8 characters for each
>>>>>>>> component, i.e. the shortest unicharset components will make up the
>>>>>>>> encoding.
>>>>>>>>
>>>>>>>> Like most other files used in training, the 'unicharambigs' file
>>>>>>>> must be encoded as UTF8, and must end with a newline character. The
>>>>>>>> unicharambigs format is also described in the unicharambigs(5) man
>>>>>>>> page
>>>>>>>> <https://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharambigs.5.html>.
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to [email protected].
>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0d30025d-cc11-4f69-9e98-ec919d3f43df%40googlegroups.com
>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/0d30025d-cc11-4f69-9e98-ec919d3f43df%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>
>

