That is exactly what I needed. Thank you.

On Saturday, 15 November 2014 at 11:17:07 UTC+1, shree wrote:
>
> Take a look at the hOCR output
>
> and the TSV option from https://code.google.com/r/email-hocr-tsv/
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Sat, Nov 15, 2014 at 3:39 PM, Simon Støvring <[email protected]> wrote:
>
>> I have tried with the English traineddata and got similar results.
>> However, I had not tried recognizing the entire 'prepared-image' with
>> psm 6, and I see that gives pretty good results.
>> The thing is, I need to know the location of each character, that is,
>> which row and column it is placed in. If Tesseract fails to recognize a
>> single letter when recognizing the entire image, I have no way of knowing
>> which letter is missing, and therefore I do not know the location of any
>> of the letters.
>>
>> On Friday, 14 November 2014 at 18:24:15 UTC+1, shree wrote:
>>>
>>> Have you tried with the existing English traineddata?
>>>
>>> I get good recognition with your 'prepared-image'.
>>>
>>> If that is the kind of image you need to OCR, you could do that with
>>> psm 6 and then split each letter separately?
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Fri, Nov 14, 2014 at 7:12 PM, Simon Støvring <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am trying to recognize single characters written in the Gotham Bold
>>>> font. I have trained Tesseract by following Michael Jay Lissner's guide
>>>> "Adding New Fonts to Tesseract 3 OCR Engine"
>>>> <http://michaeljaylissner.com/posts/2012/02/11/adding-new-fonts-to-tesseract-3-ocr-engine/>.
>>>>
>>>> I trained it using a newspaper article, removed all characters that I
>>>> am not interested in, and made sure all characters are upper case, as
>>>> I am not going to match lower case characters.
>>>>
>>>> I run Tesseract with my custom language and with page segmentation
>>>> mode set to 10, which treats the image as a single character.
>>>>
>>>> While most of the matches are fine, I am getting a lot of incorrect
>>>> matches. For example, the image below of the letter "B" is matched as
>>>> an "X". I cannot figure out why this is.
>>>>
>>>> <https://lh4.googleusercontent.com/-AOLPnD7nXJY/VGYC58I-roI/AAAAAAAAASQ/kTJq9eSNMy4/s1600/0-4.png>
>>>>
>>>> The "B" below looks the same as the one above, but it is in fact not
>>>> the same image, and it is not matched to anything. Tesseract does not
>>>> know what is in the image.
>>>>
>>>> <https://lh4.googleusercontent.com/-b0kMaAzcN-Y/VGYFI6NOzjI/AAAAAAAAASk/c9EfpR8CjWI/s1600/1-7.png.png>
>>>>
>>>> The "C" below is not matched to anything. Tesseract cannot figure out
>>>> what is in the image.
>>>>
>>>> <https://lh5.googleusercontent.com/-ZKl8jE2Orto/VGYEs2xzGlI/AAAAAAAAASc/2xTXomhIkWI/s1600/0-8.png>
>>>>
>>>> The same goes for the "U" below.
>>>>
>>>> <https://lh5.googleusercontent.com/-fciIyBe9bDw/VGYFRh3YBNI/AAAAAAAAASs/29WZQUHqPmE/s1600/1-8.png>
>>>>
>>>> And it thinks the "E" below is a "K".
>>>>
>>>> <https://lh4.googleusercontent.com/-ZZFkr77drgM/VGYFcDydDXI/AAAAAAAAAS0/RQ1UO8U3rOY/s1600/1-9.png>
>>>>
>>>> The above errors are just examples. There are others, but I think those
>>>> four examples illustrate the quirks I'm currently dealing with.
>>>>
>>>> I manually slice the image below into images of single characters like
>>>> the ones above. Maybe a completely different approach is better?
>>>>
>>>> <https://lh4.googleusercontent.com/-TfwZnXosqB0/VGYFjLppJ9I/AAAAAAAAAS8/Oun76IHLwks/s1600/prepared_image.png>
>>>>
>>>> Does anyone know how I can improve the recognition of single
>>>> characters? I'd like the above examples to match correctly, but in
>>>> general the recognition is just not good enough, and I'd like to know
>>>> if there is any way I can improve it. Should I train differently?
>>>> Should I pass other configurations, or should I process the images
>>>> before trying to recognize the characters?
>>>>
>>>> Best regards,
>>>> Simon B. Støvring
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/e905020c-f0b2-47b6-b09c-e01efa96dcc1%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
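For anyone who finds this thread later: the hOCR route suggested above can be turned into row/column positions roughly as sketched below. This is only an illustration under some assumptions, not a tested recipe for this image. It assumes the hOCR file was produced by recognizing the whole 'prepared-image' with psm 6 and the hocr config (in Tesseract 3.x something like `tesseract prepared_image.png out -psm 6 hocr`), that the letters are spaced widely enough for each one to be reported as its own `ocrx_word` span, and that the cell width and height of the letter grid are known. The names `hocr_to_grid` and `missing_cells`, the sample markup, and the cell sizes are all made up for the example.

```python
import re
from html import unescape

# Matches hOCR word spans such as:
#   <span class='ocrx_word' id='word_1_1' title='bbox 10 10 40 50; x_wconf 91'>B</span>
# Assumption: the class attribute comes before the title attribute.
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"][^'\"]*?bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>"
    r"(.*?)</span>",
    re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")  # strips inline markup such as <strong> or <em>


def hocr_to_grid(hocr, cell_width, cell_height, origin=(0, 0)):
    """Map each recognized glyph to (row, column) via the centre of its box.

    cell_width, cell_height and origin describe the letter grid in pixels;
    they are assumptions the caller has to supply.
    """
    grid = {}
    for match in WORD_RE.finditer(hocr):
        x0, y0, x1, y1 = (int(v) for v in match.groups()[:4])
        text = unescape(TAG_RE.sub("", match.group(5))).strip()
        if not text:
            continue
        cx = (x0 + x1) / 2 - origin[0]
        cy = (y0 + y1) / 2 - origin[1]
        grid[(int(cy // cell_height), int(cx // cell_width))] = text
    return grid


def missing_cells(grid, rows, cols):
    """Return the (row, column) cells for which nothing was recognized."""
    return [(r, c) for r in range(rows) for c in range(cols) if (r, c) not in grid]


# Hypothetical hOCR fragment: one row of three cells, the middle one unrecognized.
SAMPLE_HOCR = """
<span class='ocr_line' id='line_1_1' title="bbox 10 10 140 50">
<span class='ocrx_word' id='word_1_1' title='bbox 10 10 40 50; x_wconf 91'>B</span>
<span class='ocrx_word' id='word_1_2' title='bbox 110 10 140 50; x_wconf 88'>U</span>
</span>
"""

grid = hocr_to_grid(SAMPLE_HOCR, cell_width=50, cell_height=60)
print(grid)
print(missing_cells(grid, rows=1, cols=3))
```

Because every recognized letter carries its own bounding box, a letter that Tesseract drops no longer shifts the positions of the others. The empty cells returned by `missing_cells` could then be cropped and retried individually, for instance with psm 10.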

