I have only recently started experimenting with Tesseract myself, so I just passed on the information I had found on the documentation pages.
The experts on the forum may suggest the next steps.

On Mon, Apr 22, 2013 at 5:40 PM, Attila Sukosd <[email protected]> wrote:

> Hi again,
>
> I've looked at the unicharambigs file, but I think the problem is
> elsewhere.
>
> <https://lh4.googleusercontent.com/-XrDllWLRSN4/UXUnzmx4JNI/AAAAAAAAAGE/5L4CqAnuXbQ/s1600/boundingbox.png>
>
> In the attached image, you can see that the last word is "omkommet", but
> Tesseract recognises it as "onkonnet". To me it looks like the bounding
> boxes are incorrect, mostly because the "mm" and "mk" have no character
> spacing in between them.
>
> Is there a way to train this scenario to work better?
>
> Cheers,
>
> Attila
>
> On Monday, April 22, 2013 1:54:11 PM UTC+2, Attila Sukosd wrote:
>>
>> Wow, thank you for the detailed reply! I will give it a try! :)
>>
>> Best,
>>
>> Attila
>>
>> On Monday, April 22, 2013 11:04:32 AM UTC+2, sdk wrote:
>>>
>>> Please look at the unicharambigs file for your language. You can add
>>> these substitutions to it and recombine the traineddata without
>>> needing to do any additional training.
>>>
>>> Please see
>>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>> (the section on "The last file (unicharambigs)"):
>>>
>>>> The final data file that Tesseract uses is called unicharambigs. It
>>>> represents the intrinsic ambiguity between characters or sets of
>>>> characters, and is currently entirely manually generated. To
>>>> understand the file format, look at the following example:
>>>>
>>>> v1
>>>> 3 I I 0 2 u o 3
>>>> 3 I - I 1 H 2
>>>> 2 ' ' 1 " 1
>>>> 2 ಕೊ 6 1 ಕೋ 1
>>>> 1 m 2 r n 0
>>>> 3 i i i 1 m 0
>>>>
>>>> The first line is a version identifier. The remaining lines consist
>>>> of 5 tab-separated fields. The first field is the number of strings
>>>> in the second field. The 3rd field is the number of strings in the
>>>> 4th field, and the 5th field is a type indicator. The 2nd and 4th
>>>> fields consist of a number of space-separated strings. As with the
>>>> other files, this is a UTF-8 format file, and therefore each string
>>>> is a UTF-8 string. Each of these strings must match the first field
>>>> of some line in the unicharset file, i.e. it must be a recognizable
>>>> unit.
>>>
>>> If that doesn't work, you can try post-processing the OCR output.
>>> VietOCR allows a user-defined substitution file for this.
>>> See http://vietocr.sourceforge.net/usage.html (the section on
>>> post-processing):
>>>
>>>> In addition to the built-in text postprocessing algorithm, you can
>>>> add your own custom text replacement scheme via a text file named
>>>> x.DangAmbigs.txt, where x is the ISO639-3 language code. The
>>>> UTF-8-encoded file should contain equal sign-delimited
>>>> oldValue=newValue pairs.
>>>
>>> Shree Devi Kumar
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Mon, Apr 22, 2013 at 2:00 PM, Attila Sukosd <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm trying to run some OCR on some old-ish Danish datasets from
>>>> 1970+, and it seems like some of the characters are consistently
>>>> recognized wrong:
>>>>
>>>> å => á
>>>> mm => nn
>>>> : => e
>>>> l => 1
>>>>
>>>> Is there any way to improve the recognition of these individual
>>>> characters without having to retrain the complete font?
>>>> I've found a lot of documents on how to train a completely new font,
>>>> but not a lot on how to improve existing ones.
>>>>
>>>> Best,
>>>>
>>>> Attila
>>>>
>>>> --
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To post to this group, send email to [email protected]
>>>> To unsubscribe from this group, send email to
>>>> tesseract-oc...@googlegroups.com
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>
>>>> ---
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> For more options, visit https://groups.google.com/groups/opt_out.
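
For anyone following along, here is a minimal Python sketch of the unicharambigs line format quoted above (five tab-separated fields: a count, space-separated strings, a second count, space-separated strings, and a type indicator). The helper names ambig_line and parse_ambig_line and the two sample entries are my own illustration and are not part of Tesseract. My understanding is that type 1 marks a mandatory replacement and 0 an optional one, but please check that against the TrainingTesseract3 page before relying on it. Every string used must also exist in the unicharset file for the language.

```python
def ambig_line(source, target, type_indicator):
    """Format one unicharambigs line as five tab-separated fields.

    source and target are lists of strings (recognizable units);
    the counts in fields 1 and 3 are derived from their lengths.
    """
    return "\t".join([
        str(len(source)), " ".join(source),
        str(len(target)), " ".join(target),
        str(type_indicator),
    ])


def parse_ambig_line(line):
    """Split one unicharambigs line and check that the counts match."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError("expected 5 tab-separated fields, got %d" % len(fields))
    n_source, source, n_target, target, kind = fields
    source_units = source.split(" ")
    target_units = target.split(" ")
    if int(n_source) != len(source_units) or int(n_target) != len(target_units):
        raise ValueError("count field does not match number of strings")
    return source_units, target_units, int(kind)


# Hypothetical entries for the confusions reported in this thread
# (what was recognized -> what it should have been). The type
# indicator 0 is an assumption, not a recommendation.
entries = [
    (["n", "n"], ["m", "m"], 0),
    (["á"], ["å"], 0),
]

# The first line of the file is the version identifier.
lines = ["v1"] + [ambig_line(*entry) for entry in entries]
file_text = "\n".join(lines) + "\n"
```

The resulting text would still have to be merged into the language's existing unicharambigs and recombined into the traineddata, as described in the reply quoted above. If the real cause is the bounding boxes, as suspected later in the thread, these entries may not help.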

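
The post-processing idea quoted above (equal sign-delimited oldValue=newValue pairs) can also be tried outside VietOCR. The following is a rough Python sketch and not VietOCR's own code: load_pairs, apply_pairs, and the sample pairs and sample text are hypothetical names and data of mine, and the plain substring replacement is an assumption about how such a file might be applied.

```python
def load_pairs(text):
    """Read oldValue=newValue pairs, one per line, skipping unusable lines."""
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if "=" not in line:
            continue  # blank line or no delimiter
        old, new = line.split("=", 1)
        if not old:
            continue  # an empty oldValue would match everywhere
        pairs.append((old, new))
    return pairs


def apply_pairs(text, pairs):
    """Apply each replacement in file order using plain substring matching."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text


# Made-up substitution file content based on the errors in this thread.
sample_file = "onkonnet=omkommet\ná=å\n"
pairs = load_pairs(sample_file)

# Made-up OCR output containing both errors.
fixed = apply_pairs("onkonnet pá", pairs)
```

Whole-word pairs such as onkonnet=omkommet are safer than a blanket nn=mm rule, since a blanket rule would also rewrite words where "nn" is correct.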
