Re: [tesseract-ocr] how to test the file kan.unicharambigs used for tesseract-ocr 3.04 or 3.05

Tom Morris Thu, 10 Dec 2015 09:17:03 -0800

On Wed, Dec 9, 2015 at 5:30 AM, Sriranga(83yrsold) <
[email protected]> wrote:


> Tom,
> Thanks for the hints. I have just tested the eng.unicharambigs that I
> created and found that it works; the attached files speak for themselves.
> I have also attached the output "unicharamtest.txt" for perusal. In it,
> however, I noticed that the last line, "luck good", was not changed to
> "good luck". Where did I make a mistake?
>

This mechanism is really intended to fix a small number of characters, not
to reorder entire word strings.  The "good luck" case may be running into the
maximum string size limit (10), depending on whether or not the count
includes the string terminator, but whatever the cause of the failure,
it is not a very realistic use case.  I would focus on the actual texts
that you're trying to correct.
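
As a rough aid for checking a file before packing it, here is a minimal
Python sketch of a format check for the v2 layout described in the wiki
extract quoted further down (a "v2" header, then error, correction and a 0/1
type separated by single spaces, with a newline at the end of the file).
The function name check_ambigs_v2 is made up for this example, and the
10-unit limit is assumed here to count Unicode code points; Tesseract may
count something else (for instance unicharset components), so treat the
length check as a hint only.

```python
MAX_AMBIG_SIZE = 10  # limit mentioned in this thread; ASSUMED to count code points


def check_ambigs_v2(text, max_len=MAX_AMBIG_SIZE):
    """Return a list of (line_number, message) problems found in v2-format text."""
    problems = []
    lines = text.split("\n")
    if text.endswith("\n"):
        lines = lines[:-1]  # drop the empty string that follows the final newline
    else:
        problems.append((len(lines), "file does not end with a newline"))
    if not lines or lines[0].strip() != "v2":
        problems.append((1, "first line is not the 'v2' header"))
    for number, line in enumerate(lines[1:], start=2):
        if not line.strip():
            problems.append((number, "empty line"))
            continue
        fields = line.split(" ")
        if len(fields) != 3:
            problems.append(
                (number, f"expected 3 space-separated fields, found {len(fields)}")
            )
            continue
        wrong, right, kind = fields
        if kind not in ("0", "1"):
            problems.append((number, f"type must be 0 or 1, found {kind!r}"))
        for label, value in (("error", wrong), ("correction", right)):
            if len(value) > max_len:
                problems.append(
                    (number, f"{label} string has {len(value)} code points (limit {max_len})")
                )
    return problems


if __name__ == "__main__":
    sample = "v2\n'' \" 1\nm rn 0\niii m 0\n"
    for number, message in check_ambigs_v2(sample):
        print(f"line {number}: {message}")
```

Because the fields are split on spaces, an entry such as
"luck good good luck 1" is reported as having five fields rather than three.
Whether that has anything to do with the failure seen here is not something
this sketch can tell you.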

Tom


> Your suggested sentence, "Novv is the time to go dovvn", was also
> corrected. Please note that I regenerated eng.traineddata on Ubuntu 15.10.
> With regards, sriranga(83ys)
>
> On Wed, Dec 9, 2015 at 12:04 AM, Tom Morris <[email protected]> wrote:
>
>> FreeOCR is closed source and Windows only, so it's difficult for me to
>> tell what it's doing (or even what version of Tesseract it includes).
>> However, the test case that you're using doesn't appear realistic.
>> Tesseract is optimized for recognizing words, not short random strings of
>> characters, so rather than testing on "vv w" I think you'd get more
>> representative results if you used something like "Novv is the time to go
>> dovvn" and see if it turns the vv's into w's.  Having said that, vv ==> w
>> isn't an entry in the standard eng.unicharambigs.  The only mandatory
>> entries are for quotes, so you could try things like `' or '` to see if
>> they get turned into ".
>>
>> As far as I know, there's no way to specify a different unicharambigs
>> file on the command line.  You need to replace it in the kan.traineddata
>> file for it to be found.  The combine_tessdata utility is used for packing
>> and unpacking the traineddata files, e.g.
>>
>>     $ combine_tessdata -e kan.traineddata kan.unicharambigs
>>     $ combine_tessdata -o kan.traineddata kan.unicharambigs
>>
>> One thing that I noticed when looking at the source is that there's an
>> upper limit of 10 characters for the bad and replacement strings, which I'm
>> not sure is documented anywhere.  This should be plenty for most
>> applications, but it's something to keep in mind.
>>
>> Good luck.  Let us know how you make out.
>>
>> Tom
>>
>>
>>
>> On Tue, Dec 8, 2015 at 4:11 AM, Sriranga(83yrsold) <
>> [email protected]> wrote:
>>
>>> Another question is how to test <lang>.unicharambigs in tesseract-ocr
>>> and add more entries to it. If it can be tested in CMD or a terminal,
>>> what command line should be used?
>>>
>>> On Tue, Dec 8, 2015 at 2:18 PM, Sriranga(83yrsold) <
>>> [email protected]> wrote:
>>>
>>>> Hi Tom,
>>>> Attached is a sample of the post-proc.txt used in FreeOCR, a feature
>>>> that its creator, Ralph Richardson, incorporated at my special request
>>>> more than 3 years ago. The attached screenshots speak for themselves.
>>>> I have made the sample in English so that it is easy for you to
>>>> understand.
>>>> You can test in any language. FreeOCR is available as a free download.
>>>> You will notice that the post-processor text sample has a feature
>>>> similar to <lang>.unicharambigs, except that it has no option like 0
>>>> or 1.
>>>> *Advantage of the built-in* "unicharambigs": at the time of the final
>>>> OCR output, all misspellings are corrected automatically. Once the
>>>> corrections are built into the <lang>.traineddata, the resulting
>>>> tessdata can be used on any image to correct the output text.
>>>> *Disadvantage of the post processor*: being an external program, one
>>>> has to update post-proc.txt every time, for each OCRed text.
>>>> I am puzzled why unicharambigs does not work correctly as an internal
>>>> mechanism when the post-processor program works fine.
>>>> With regards,
>>>> sriranga(83yrs)
>>>>
>>>>
>>>> On Mon, Dec 7, 2015 at 11:44 PM, Tom Morris <[email protected]> wrote:
>>>>
>>>>> Hi Sriranga.  I haven't used the training tools, but since no one else
>>>>> has answered, I'll give it my best attempt.  Shree might have better
>>>>> insights.
>>>>>
>>>>> First, a question of clarification.  Are you having problems with the
>>>>> file or are you just trying to determine whether it is working properly or
>>>>> not?
>>>>>
>>>>> If you just want to see if it's working correctly, my impression is
>>>>> that most people do this empirically by a) visual inspection of the
>>>>> file to see if the substitutions look correct and b) running a corpus
>>>>> of text through to see how the contents of the file affect accuracy.
>>>>>
>>>>> To my untrained eye, the things I wonder about are:
>>>>> - are all those mandatory substitutions (lines ending in 1) correct?
>>>>> i.e. is it true that the string in column 1 can *never* be a valid word?
>>>>> - there is an empty line which probably should be removed
>>>>> - there are a few lines which have junk after the third column which
>>>>> don't match the specified format e.g.:
>>>>>
>>>>> ಚಟಿಲ್ಕೆ ಚಟ್ನಿ,, 1   "
>>>>> ಹೊರಿದಿವೆ ಹೊಂದಿವೆ.1   .
>>>>>
>>>>> Some of the words with embedded punctuation also look a little
>>>>> suspicious to me.  Not knowing the script or language I don't know how
>>>>> common these errors are, but I'd probably start with a very basic list of
>>>>> substitutions and add to it as I found more common errors.
>>>>>
>>>>> Hopefully someone else can give you better advice which is based on
>>>>> more than bystander guesswork!
>>>>>
>>>>> Tom
>>>>>
>>>>>
>>>>> On Friday, December 4, 2015 at 10:36:13 PM UTC-5, sriranga(83yrsold)
>>>>> wrote:
>>>>>>
>>>>>> Solution is requested urgently.
>>>>>>
>>>>>> On Wed, Dec 2, 2015 at 4:25 PM, sriranga(83yrsold) <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>> I have created kan.unicharambigs (attached below) based on the
>>>>>>> output text of the Kan.training_text file (which is big). I could not
>>>>>>> work out how to test the attached file and find out whether it works
>>>>>>> or not.
>>>>>>> Kindly point out my mistakes in the attached file, if any, for
>>>>>>> which I shall be thankful to you. I would prefer a command-line test
>>>>>>> if possible.
>>>>>>>
>>>>>>>
>>>>>>> ==========================================================================
>>>>>>> Based on wiki instruction (extract reproduced below for ready
>>>>>>> reference) =
>>>>>>>
>>>>>>> The rules are not bidirectional, so if you want 'rn' to be
>>>>>>> considered when 'm' is detected and vice versa you need a rule for each.
>>>>>>>
>>>>>>> Version 3.03 and on supports a new, simpler format for the
>>>>>>> unicharambigs file:
>>>>>>>
>>>>>>> v2
>>>>>>> '' " 1
>>>>>>> m rn 0
>>>>>>> iii m 0
>>>>>>>
>>>>>>> In this format, the "error" and "correction" are simple utf-8
>>>>>>> strings separated by *a space*, and, after another space, the same
>>>>>>> type specifier as v1 (0 for optional and 1 for mandatory substitution).
>>>>>>> Note the downside of this simpler format is that Tesseract has to encode
>>>>>>> the utf-8 strings into the components of the unicharset. In complex
>>>>>>> scripts, this encoding may be ambiguous. In this case, the encoding is
>>>>>>> chosen such as to use the least utf-8 characters for each component,
>>>>>>> i.e. the shortest unicharset components will make up the encoding.
>>>>>>>
>>>>>>> Like most other files used in training, the 'unicharambigs' file
>>>>>>> must be encoded as UTF8, and must end with a newline character. The
>>>>>>> unicharambigs format is also described in the unicharambigs(5) man
>>>>>>> page
>>>>>>> <https://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharambigs.5.html>.
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To post to this group, send email to [email protected].
>>>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0d30025d-cc11-4f69-9e98-ec919d3f43df%40googlegroups.com
>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/0d30025d-cc11-4f69-9e98-ec919d3f43df%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>>>>
>>>>
>>>>
>>
>

