Another question is how to test the <lang>.unicharambigs file in tesseract-ocr and add more entries to it. If it can be tested from CMD or a terminal, what command line should be used?

On Tue, Dec 8, 2015 at 2:18 PM, Sriranga(83yrsold) <[email protected]> wrote:

> Hi Tom,
> Attached herewith is a sample of the post-proc.txt used in FreeOCR, which
> its creator, Ralph Richardson, incorporated at my special request more
> than 3 years ago. The attached screenshots speak for themselves. I have
> done the sample in English so that it is easy for you to follow.
> You can test it in any language. FreeOCR is available as a free download.
> You will notice that the post-processor text sample has a feature similar
> to the one available in <lang>.unicharambigs (except that it has no option
> like 0 or 1).
> *Advantage of the built-in* "unicharambigs": all misspellings are
> corrected automatically at the time of the final OCR output. The
> corrections are built in before the <lang>.traineddata is generated, so
> the resulting tessdata can be used to correct the output text for any
> image.
> *Disadvantage of the post-processor*: being an external program, one has
> to update post-proc.txt every time, for each OCR output.
> I am puzzled why unicharambigs, as a built-in mechanism, does not work
> correctly when the post-processor program works fine.
> With regards,
> sriranga(83yrs)
>
>
> On Mon, Dec 7, 2015 at 11:44 PM, Tom Morris <[email protected]> wrote:
>
>> Hi Sriranga. I haven't used the training tools, but since no one else
>> has answered, I'll give it my best attempt. Shree might have better
>> insights.
>>
>> First, a question of clarification. Are you having problems with the
>> file or are you just trying to determine whether it is working properly
>> or not?
>>
>> If you just want to see if it's working correctly, my impression is that
>> most people do this empirically by a) visual inspection of the file to
>> see if the substitutions look correct and b) running a corpus of text
>> through to see how the contents of the file affect accuracy.
>>
>> To my untrained eye, the things I wonder about are:
>> - are all those mandatory substitutions (lines ending in 1) correct? i.e.
>>   is it true that the string in column 1 can *never* be a valid word?
>> - there is an empty line which probably should be removed
>> - there are a few lines which have junk after the third column which
>>   don't match the specified format, e.g.:
>>
>> ಚಟಿಲ್ಕೆ ಚಟ್ನಿ,, 1 "
>> ಹೊರಿದಿವೆ ಹೊಂದಿವೆ.1 .
>>
>> Some of the words with embedded punctuation also look a little
>> suspicious to me. Not knowing the script or language I don't know how
>> common these errors are, but I'd probably start with a very basic list
>> of substitutions and add to it as I found more common errors.
>>
>> Hopefully someone else can give you better advice which is based on more
>> than bystander guesswork!
>>
>> Tom
>>
>>
>> On Friday, December 4, 2015 at 10:36:13 PM UTC-5, sriranga(83yrsold)
>> wrote:
>>>
>>> Solution is requested urgently.
>>>
>>> On Wed, Dec 2, 2015 at 4:25 PM, sriranga(83yrsold) <[email protected]>
>>> wrote:
>>>
>>>> I have created kan.unicharambigs (attached below) based on the output
>>>> text of the Kan.training_text file (which is big). I could not
>>>> understand how to test the attached file and find out whether it works
>>>> or not.
>>>> Kindly point out my mistakes in the attached file, if any, for which I
>>>> shall be thankful to you. I would prefer a command-line test if
>>>> possible.
>>>>
>>>> ======================================================================
>>>> Based on the wiki instructions (extract reproduced below for ready
>>>> reference):
>>>>
>>>> The rules are not bidirectional, so if you want 'rn' to be considered
>>>> when 'm' is detected and vice versa you need a rule for each.
>>>>
>>>> Version 3.03 and on supports a new, simpler format for the
>>>> unicharambigs file:
>>>>
>>>> v2
>>>> '' " 1
>>>> m rn 0
>>>> iii m 0
>>>>
>>>> In this format, the "error" and "correction" are simple utf-8 strings
>>>> separated by *a space*, and, after another space, the same type
>>>> specifier as v1 (0 for optional and 1 for mandatory substitution).
>>>> Note the downside of this simpler format is that Tesseract has to
>>>> encode the utf-8 strings into the components of the unicharset. In
>>>> complex scripts, this encoding may be ambiguous. In this case, the
>>>> encoding is chosen such as to use the least utf-8 characters for each
>>>> component, i.e. the shortest unicharset components will make up the
>>>> encoding.
>>>>
>>>> Like most other files used in training, the 'unicharambigs' file must
>>>> be encoded as UTF8, and must end with a newline character. The
>>>> unicharambigs format is also described in the unicharambigs(5) man page
>>>> <https://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharambigs.5.html>.
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/0d30025d-cc11-4f69-9e98-ec919d3f43df%40googlegroups.com
>>>> For more options, visit https://groups.google.com/d/optout.
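
The format rules quoted from the wiki, together with the problems Tom spotted (an empty line, junk after the third column), can be checked mechanically before the file goes anywhere near training. What follows is only a rough sketch under stated assumptions: the function names check_unicharambigs_v2 and check_file are made up for illustration and are not part of Tesseract, and the check covers only the v2 layout as described in the quoted extract (a "v2" first line, then "error correction type" separated by single spaces, type 0 or 1, UTF-8, trailing newline). It says nothing about whether Tesseract actually applies the substitutions; that still needs a run over real text, as Tom suggests.

```python
def check_unicharambigs_v2(text):
    """Return a list of (line_number, message) problems found in the text
    of a v2-format unicharambigs file. Line number 0 means a file-level
    problem. Hypothetical helper, not part of Tesseract."""
    problems = []
    if not text.endswith("\n"):
        problems.append((0, "file must end with a newline character"))

    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final newline

    if not lines or lines[0].strip() != "v2":
        problems.append((1, "first line should be the version marker 'v2'"))

    for number, line in enumerate(lines[1:], start=2):
        if line.strip() == "":
            problems.append((number, "empty line"))
            continue
        fields = line.split(" ")
        # Exactly three non-empty fields: error, correction, type.
        if len(fields) != 3 or "" in fields:
            problems.append(
                (number, "expected 'error correction type' "
                         "separated by single spaces"))
            continue
        if fields[2] not in ("0", "1"):
            problems.append(
                (number, "type must be 0 (optional) or 1 (mandatory)"))
    return problems


def check_file(path):
    """Read a file as strict UTF-8 and run the layout check on it."""
    with open(path, "rb") as handle:
        raw = handle.read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as error:
        return [(0, f"not valid UTF-8: {error}")]
    return check_unicharambigs_v2(text)
```

Calling check_file("kan.unicharambigs") would return an empty list if nothing looks wrong. Of the two lines Tom quoted, the first is reported because it has four space-separated fields, and the second because its third field is "." rather than 0 or 1.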

