Re: [tesseract-ocr] Configure for single character recognition

Simon Støvring Sat, 15 Nov 2014 03:44:15 -0800

That is exactly what I needed. Thank you.
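
For anyone finding this thread later, here is a minimal sketch of how the hOCR output suggested in the quoted reply might be turned into row and column positions. It is only an illustration under stated assumptions: that the hOCR was produced by running Tesseract on the whole image (on Tesseract 3.x, something like `tesseract prepared_image.png out -psm 6 hocr`), that each letter is reported as its own `ocrx_word` span with a `bbox` in its `title` attribute, and that the letters sit on a regular grid. The names `HocrWords`, `grid_position`, `locate`, the cell size of 100 pixels and the sample markup are all hypothetical and not taken from the thread.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def grid_position(bbox, cell_w, cell_h):
    """Map a bounding box to (row, column) using the box centre."""
    x0, y0, x1, y1 = bbox
    return ((y0 + y1) // 2) // cell_h, ((x0 + x1) // 2) // cell_w


def locate(hocr, cell_w, cell_h):
    """Return {(row, column): text} for every recognized word."""
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return {grid_position(box, cell_w, cell_h): text
            for text, box in parser.words}


# Hypothetical hOCR fragment: the letter in column 1 was not recognized.
SAMPLE = """
<span class='ocr_line' title='bbox 10 10 290 90'>
<span class='ocrx_word' id='word_1_1' title='bbox 20 20 80 80; x_wconf 91'>B</span>
<span class='ocrx_word' id='word_1_2' title='bbox 220 20 280 80; x_wconf 88'>C</span>
</span>
"""

if __name__ == "__main__":
    print(locate(SAMPLE, cell_w=100, cell_h=100))
```

Because each position comes from the bounding box rather than from the order of the output text, a letter that Tesseract misses simply leaves its grid cell empty and does not shift the letters after it.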

On Saturday, 15 November 2014 at 11:17:07 UTC+1, shree wrote:
>
> Take a look at the hOCR output and the TSV option from
> https://code.google.com/r/email-hocr-tsv/
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Sat, Nov 15, 2014 at 3:39 PM, Simon Støvring <[email protected]> wrote:
>
>> I have tried the English traineddata and got similar results. 
>> However, I had not tried recognizing the entire 'prepared-image' with psm 6, 
>> and I see that this gives pretty good results.
>> The thing is, I need to know the location of each character, that is, 
>> which row and column it is in. If Tesseract fails to recognize a 
>> single letter when processing the entire image, I have no way of knowing 
>> which letter is missing, and therefore I do not know the location of any of 
>> the letters.
>>
>> On Friday, 14 November 2014 at 18:24:15 UTC+1, shree wrote:
>>>
>>> Have you tried the existing English traineddata?
>>>
>>> I get good recognition with your 'prepared-image'.
>>>
>>> If that is the kind of image you need to OCR, you could run it with psm 
>>> 6 and then split out each letter separately.
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Fri, Nov 14, 2014 at 7:12 PM, Simon Støvring <[email protected]> 
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am trying to recognize single characters written in the Gotham Bold 
>>>> font. I have trained Tesseract by following Michael Jay Lissner's guide 
>>>> "Adding New Fonts to Tesseract 3 OCR Engine" 
>>>> <http://michaeljaylissner.com/posts/2012/02/11/adding-new-fonts-to-tesseract-3-ocr-engine/>.
>>>>  
>>>> I trained it using a newspaper article, removed all characters that I 
>>>> am not interested in, and made sure all characters are upper case, as 
>>>> I am not going to match lower case characters.
>>>>
>>>> I run Tesseract with my custom language and with page segmentation set 
>>>> to 10, which treats the image as a single character.
>>>>
>>>> While most of the matches are fine, I am getting a lot of incorrect 
>>>> matches. For example, the below image of the letter "B" is matched as an 
>>>> "X". I cannot figure out why this is. 
>>>>
>>>>
>>>> <https://lh4.googleusercontent.com/-AOLPnD7nXJY/VGYC58I-roI/AAAAAAAAASQ/kTJq9eSNMy4/s1600/0-4.png>
>>>>
>>>> The "B" below looks the same as the one above, but it is in fact a 
>>>> different image, and it is not matched to anything. Tesseract does not 
>>>> know what is in the image.
>>>>
>>>>
>>>> <https://lh4.googleusercontent.com/-b0kMaAzcN-Y/VGYFI6NOzjI/AAAAAAAAASk/c9EfpR8CjWI/s1600/1-7.png.png>
>>>>
>>>>
>>>> The "C" below is not matched to anything. Tesseract cannot figure out 
>>>> what is in the image.
>>>>
>>>>
>>>> <https://lh5.googleusercontent.com/-ZKl8jE2Orto/VGYEs2xzGlI/AAAAAAAAASc/2xTXomhIkWI/s1600/0-8.png>
>>>> The same goes for the "U" below.
>>>>
>>>>
>>>> <https://lh5.googleusercontent.com/-fciIyBe9bDw/VGYFRh3YBNI/AAAAAAAAASs/29WZQUHqPmE/s1600/1-8.png>
>>>> And it thinks the "E" below is a "K".
>>>>
>>>>
>>>> <https://lh4.googleusercontent.com/-ZZFkr77drgM/VGYFcDydDXI/AAAAAAAAAS0/RQ1UO8U3rOY/s1600/1-9.png>
>>>>
>>>> The above errors are just examples. There are others, but I think these 
>>>> examples illustrate the quirks I'm currently dealing with.
>>>>
>>>> I manually slice the image below into images of single characters like 
>>>> the ones above. Maybe a completely different approach is better?
>>>>
>>>>
>>>> <https://lh4.googleusercontent.com/-TfwZnXosqB0/VGYFjLppJ9I/AAAAAAAAAS8/Oun76IHLwks/s1600/prepared_image.png>
>>>> Does anyone know how I can improve the recognition of single 
>>>> characters? I'd like the above examples to match correctly, but generally 
>>>> it's just not good enough and I'd like to know if there's any way I can 
>>>> improve it. Should I train differently? Should I pass other configurations, 
>>>> or should I process the images before trying to recognize the characters?
>>>>
>>>> Best regards,
>>>> Simon B. Støvring
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/e905020c-f0b2-47b6-b09c-e01efa96dcc1%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>
>


