Tom,
Thanks for the response. I would like to know whether you have tested "eng.unicharambigs" at your end, and I would welcome your considered experience or comments, if any. Based on your comments and suggestions, I am thinking of trying this for my language, Kannada, which is a complex Indian language.

On Thu, Dec 10, 2015 at 10:46 PM, Tom Morris <[email protected]> wrote:

> On Wed, Dec 9, 2015 at 5:30 AM, Sriranga(83yrsold) <[email protected]> wrote:
>
>> Tom,
>> thanks for the hints. Just now I tested the eng.unicharambigs created by
>> me and found it workable; the attached files speak for themselves. I am
>> happy to note that eng.unicharambigs works fine. Also attached is the
>> output "unicharamtest.txt" for perusal, in which, however, I noticed that
>> the last line "luck good" did not change to "good luck". Where did I make
>> a mistake?
>>
>
> This mechanism is really intended to fix a small number of characters, not
> reorder entire word strings. The "good luck" case may be running into the
> maximum string size (10) limit, depending on whether or not the count
> includes the string terminator, but whatever the cause of the failure
> there, it is not a very realistic use case. I would focus on the actual
> texts that you're trying to correct.
>
> Tom
>
>> Your suggested sentence
>> "Novv is the time to go dovvn" was also corrected. Please note I
>> regenerated eng.traineddata in Ubuntu 15.10.
>> With regards, sriranga(83ys)
>>
>> On Wed, Dec 9, 2015 at 12:04 AM, Tom Morris <[email protected]> wrote:
>>
>>> FreeOCR is closed source and Windows only, so it's difficult for me to
>>> tell what it's doing (or even what version of Tesseract it includes).
>>> However, the test case that you're using doesn't appear realistic.
>>> Tesseract is optimized for recognizing words, not short random strings of
>>> characters, so rather than testing on "vv w" I think you'd get more
>>> representative results if you used something like "Novv is the time to go
>>> dovvn" and see if it turns the vv's into w's. Having said that, vv ==> w
>>> isn't an entry in the standard eng.unicharambigs. The only mandatory
>>> entries are for quotes, so you could try things like `' or '` to see if
>>> they get turned into ".
>>>
>>> As far as I know, there's no way to specify a different unicharambigs
>>> file on the command line. You need to replace it in the kan.traineddata
>>> file for it to be found. The combine_tessdata utility is used for packing
>>> and unpacking the traineddata files, e.g.
>>>
>>>     $ combine_tessdata -e kan.traineddata kan.unicharambigs
>>>     $ combine_tessdata -o kan.traineddata kan.unicharambigs
>>>
>>> One thing that I noticed when looking at the source is that there's an
>>> upper limit of 10 characters for the bad and replacement strings, which I'm
>>> not sure is documented anywhere. This should be plenty for most
>>> applications, but it's something to keep in mind.
>>>
>>> Good luck. Let us know how you make out.
>>>
>>> Tom
>>>
>>> On Tue, Dec 8, 2015 at 4:11 AM, Sriranga(83yrsold) <[email protected]> wrote:
>>>
>>>> Another question is how to test and add more entries in the
>>>> <lang>unicharambigs in tesseract-ocr. If it can be tested in CMD or a
>>>> terminal, what command line should be used?
>>>>
>>>> On Tue, Dec 8, 2015 at 2:18 PM, Sriranga(83yrsold) <[email protected]> wrote:
>>>>
>>>>> Hi Tom,
>>>>> Attached herewith is a sample of the post-proc.txt used in FreeOCR, a
>>>>> feature that was incorporated at my special request by its creator,
>>>>> Ralph Richardson, more than 3 years back. The attached screenshots
>>>>> speak for themselves. As a sample I have done it in English so that it
>>>>> is easy for you to understand.
>>>>> You can test in any language. FreeOCR is available for free download.
>>>>> You will notice that the post-processor text sample (except that it has
>>>>> no option like 0 or 1) has a feature similar to the one available in
>>>>> <lang>unicharambigs.
>>>>> *Advantage of the built-in* "unicharambigs": at the time of the final
>>>>> OCR output, all misspellings are corrected automatically; once the
>>>>> <lang>traineddata has been generated, the resulting tessdata can be
>>>>> used to correct the output text for any image.
>>>>> *Disadvantage of the post-processor*: being an external program, one
>>>>> has to update post-proc.txt every time, for each OCRed text.
>>>>> I am puzzled why unicharambigs does not work correctly as an internal
>>>>> program when the post-processor program works fine.
>>>>> With regards,
>>>>> sriranga(83yrs)
>>>>>
>>>>> On Mon, Dec 7, 2015 at 11:44 PM, Tom Morris <[email protected]> wrote:
>>>>>
>>>>>> Hi Sriranga. I haven't used the training tools, but since no one
>>>>>> else has answered, I'll give it my best attempt. Shree might have better
>>>>>> insights.
>>>>>>
>>>>>> First, a question of clarification. Are you having problems with the
>>>>>> file or are you just trying to determine whether it is working properly
>>>>>> or not?
>>>>>>
>>>>>> If you just want to see if it's working correctly, my impression is
>>>>>> that most people do this empirically by a) visual inspection of the file
>>>>>> to see if the substitutions look correct and b) running a corpus of text
>>>>>> through to see how the contents of the file affect accuracy.
>>>>>>
>>>>>> To my untrained eye, the things I wonder about are:
>>>>>> - are all those mandatory substitutions (lines ending in 1) correct?
>>>>>>   i.e. is it true that the string in column 1 can *never* be a valid word?
>>>>>> - there is an empty line which probably should be removed
>>>>>> - there are a few lines which have junk after the third column which
>>>>>>   don't match the specified format, e.g.:
>>>>>>
>>>>>>     ಚಟಿಲ್ಕೆ ಚಟ್ನಿ,, 1 "
>>>>>>     ಹೊರಿದಿವೆ ಹೊಂದಿವೆ.1 .
>>>>>>
>>>>>> Some of the words with embedded punctuation also look a little
>>>>>> suspicious to me. Not knowing the script or language I don't know how
>>>>>> common these errors are, but I'd probably start with a very basic list of
>>>>>> substitutions and add to it as I found more common errors.
>>>>>>
>>>>>> Hopefully someone else can give you better advice which is based on
>>>>>> more than bystander guesswork!
>>>>>>
>>>>>> Tom
>>>>>>
>>>>>> On Friday, December 4, 2015 at 10:36:13 PM UTC-5, sriranga(83yrsold) wrote:
>>>>>>>
>>>>>>> Solution is requested urgently.
>>>>>>>
>>>>>>> On Wed, Dec 2, 2015 at 4:25 PM, sriranga(83yrsold) <[email protected]> wrote:
>>>>>>>
>>>>>>>> I have created kan.unicharambigs (attached below) based on the
>>>>>>>> output text of the Kan.training_text file (which is big). I could not
>>>>>>>> understand how to test the attached file and find out whether it
>>>>>>>> works or not.
>>>>>>>> Kindly point out my mistakes in the said attached file, if any, for
>>>>>>>> which I shall be thankful to you. I would prefer a command-line test
>>>>>>>> if possible.
>>>>>>>>
>>>>>>>> ==========================================================================
>>>>>>>> Based on wiki instruction (extract reproduced below for ready
>>>>>>>> reference) =
>>>>>>>>
>>>>>>>> The rules are not bidirectional, so if you want 'rn' to be
>>>>>>>> considered when 'm' is detected and vice versa you need a rule for
>>>>>>>> each.
>>>>>>>>
>>>>>>>> Version 3.03 and on supports a new, simpler format for the
>>>>>>>> unicharambigs file:
>>>>>>>>
>>>>>>>>     v2
>>>>>>>>     '' " 1
>>>>>>>>     m rn 0
>>>>>>>>     iii m 0
>>>>>>>>
>>>>>>>> In this format, the "error" and "correction" are simple utf-8
>>>>>>>> strings separated by *a space*, and, after another space, the same
>>>>>>>> type specifier as v1 (0 for optional and 1 for mandatory substitution).
>>>>>>>> Note the downside of this simpler format is that Tesseract has to
>>>>>>>> encode the utf-8 strings into the components of the unicharset. In
>>>>>>>> complex scripts, this encoding may be ambiguous. In this case, the
>>>>>>>> encoding is chosen such as to use the least utf-8 characters for each
>>>>>>>> component, i.e. the shortest unicharset components will make up the
>>>>>>>> encoding.
>>>>>>>>
>>>>>>>> Like most other files used in training, the 'unicharambigs' file
>>>>>>>> must be encoded as UTF8, and must end with a newline character. The
>>>>>>>> unicharambigs format is also described in the unicharambigs(5) man
>>>>>>>> page
>>>>>>>> <https://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharambigs.5.html>.
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to [email protected].
>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0d30025d-cc11-4f69-9e98-ec919d3f43df%40googlegroups.com
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
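
A note on checking a file before repacking it: the points raised in this thread (a "v2" header, three space-separated fields, a 0/1 type flag, no empty lines, a final newline, and the 10-character limit Tom noticed in the source) can be checked mechanically. The following is a minimal Python sketch, not part of Tesseract. The function name check_unicharambigs_v2 and the constant MAX_LEN are made up for illustration, and treating the limit as 10 Unicode code points is an assumption, since the thread leaves open how the limit is counted.

```python
# Hypothetical helper, not a Tesseract tool: a format check for a
# v2-style unicharambigs file based only on the rules quoted in this thread.

# Assumption: the "10 character" limit is counted in Unicode code points.
# The thread does not confirm the unit or whether a terminator is included.
MAX_LEN = 10


def check_unicharambigs_v2(text):
    """Return a list of (line_number, message) problems; empty if none found."""
    problems = []

    # The wiki extract says the file must end with a newline character.
    if not text.endswith("\n"):
        problems.append((0, "file does not end with a newline"))

    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final newline

    # The simpler format starts with a "v2" header line.
    if not lines or lines[0].strip() != "v2":
        problems.append((1, "first line is not 'v2'"))

    for number, line in enumerate(lines[1:], start=2):
        if not line.strip():
            problems.append((number, "empty line"))
            continue

        # error, correction and type are each separated by a single space.
        fields = line.split(" ")
        if len(fields) != 3:
            problems.append(
                (number, f"expected 3 space-separated fields, found {len(fields)}")
            )
            continue

        error, correction, kind = fields
        if kind not in ("0", "1"):
            problems.append((number, f"type must be 0 or 1, found {kind!r}"))
        for name, value in (("error", error), ("correction", correction)):
            if len(value) > MAX_LEN:
                problems.append(
                    (number, f"{name} string is longer than {MAX_LEN} characters")
                )

    return problems
```

To use it, read the file as UTF-8 (for example with open(path, encoding="utf-8", newline="")) and pass the text to the function. It only checks the layout of each line; it cannot tell whether a substitution is linguistically right or how Tesseract will encode the strings against the unicharset, so running a real corpus through the repacked traineddata is still needed.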

