Re: [tesseract-ocr] Train tesseract for 14-segment display

Pierre-Henri DAUVERGNE Wed, 08 Jul 2015 07:39:56 -0700

The thing is that if I try to scale down, some characters become 
extremely similar (5 and S, C and E, 0 and O, etc.). I believe the same 
would happen if I split characters at the gaps (a "C" would 
become an "r" and something like an "L").
I guess another solution would be to use OpenCV's template matching 
function: compare each character I read against those in my set, and 
the highest score will probably be the character I want.
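
As a rough illustration of that comparison idea, here is a minimal sketch in pure Python. It is not OpenCV code: the tiny 3x5 bitmaps, the `TEMPLATES` set and the pixel-agreement score are all assumptions made for the example, standing in for real glyph images and for whatever score a template matching function would return.

```python
# Pure-Python stand-in for template matching on real images.
# Glyphs are tiny hypothetical bitmaps: '#' = lit pixel, '.' = dark pixel.

TEMPLATES = {
    "C": ["###", "#..", "#..", "#..", "###"],
    "E": ["###", "#..", "##.", "#..", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
}

def similarity(glyph, template):
    """Fraction of pixels that agree between two same-sized bitmaps."""
    total = 0
    same = 0
    for row_g, row_t in zip(glyph, template):
        for px_g, px_t in zip(row_g, row_t):
            total += 1
            same += (px_g == px_t)
    return same / total

def best_match(glyph, templates=TEMPLATES):
    """Return (label, score) for the template most similar to the glyph."""
    scores = {label: similarity(glyph, t) for label, t in templates.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]

noisy_e = ["###", "#..", "##.", "#..", "##."]  # an "E" with one pixel dropped
label, score = best_match(noisy_e)
print(label, round(score, 3))
```

Note that "C" and "E" differ by a single pixel in this toy set, so the margin between the best and second-best score is small. That is the same similarity problem described above, just expressed as scores.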


I wish I could just either train tesseract with the kind of characters I 
gave you or process my image to fill in the gaps.
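
For the "fill in the gaps" option, one common technique is morphological closing (dilate, then erode). The sketch below is a pure-Python version on a small 0/1 grid, only to show the principle; the function names, the 3x3 square structuring element and the test stroke are assumptions for the example, and real code would more likely use an image library.

```python
# Morphological closing on a binary image stored as a list of lists of 0/1.
# Windows are clipped at the image border (pixels outside are ignored).

def _morph(img, size, combine):
    """Slide a size x size window over img and reduce it with combine."""
    h, w = len(img), len(img[0])
    r = size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            ]
            out[y][x] = combine(window)
    return out

def dilate(img, size=3):
    return _morph(img, size, max)   # a pixel is lit if any neighbour is lit

def erode(img, size=3):
    return _morph(img, size, min)   # a pixel stays lit only if all neighbours are lit

def close_gaps(img, size=3):
    return erode(dilate(img, size), size)

# A horizontal stroke made of two segments separated by a one-pixel gap.
stroke = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
closed = close_gaps(stroke)
print(closed[2])  # -> [0, 0, 1, 1, 1, 1, 1, 0, 0]
```

A 3x3 element only bridges gaps of one or two pixels. Bridging wider gaps needs a larger element, which is also what starts to merge the inner holes of letters like "B" and "D".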

On Wednesday, 8 July 2015 15:43:32 UTC+2, Art Rhyno wrote:
>
>  Well the good news is that tesseract tells you in the training process 
> what it can and cannot work with. I'd be tempted to use the gaps in the 
> line segments to break apart the letters, for example, instead of "C", 
> train for the top part to be something like "r" and the bottom to be 
> another unique character, and then put them together in post OCR 
> processing. I'd separate the "X" in the same way. The other option, and the 
> one I would investigate where the segment gap doesn't go across the letter, 
> for example, on the "B", is to scale it down to the point that tesseract 
> would work with the blob as a single character.  This makes for a 
> painstaking process to be sure, but I think it could work. I should note 
> that you can configure settings for more flexibility in blob detection [1] 
> but that's beyond anything I have ever done. I have tried opencv for 
> pattern detection, I wouldn’t call it OCR, and it seems very powerful, but 
> I haven’t used it enough to speak to whether it is the right hammer in this 
> case.
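
The "put them together in post OCR processing" step could look roughly like the following sketch. The pair table is hypothetical: it assumes each broken glyph was trained as two placeholder symbols, and in practice those placeholders would need to be characters that never appear in the real text.

```python
# Sketch of "train on halves, rejoin after OCR".
# HALF_PAIRS is a hypothetical mapping from (top half, bottom half) to the
# intended character; the symbols chosen here are for illustration only.

HALF_PAIRS = {
    ("r", "L"): "C",   # top of "C" followed by bottom of "C"
}

def rejoin(symbols, pairs=HALF_PAIRS):
    """Merge adjacent half-glyph symbols back into the intended character."""
    out = []
    i = 0
    while i < len(symbols):
        pair = tuple(symbols[i:i + 2])
        if pair in pairs:
            out.append(pairs[pair])
            i += 2          # consumed both halves
        else:
            out.append(symbols[i])
            i += 1
    return "".join(out)

print(rejoin(["A", "r", "L", "5"]))  # -> AC5
```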
>
>  
>
> art
>
> ---
>
> 1. https://code.google.com/p/tesseract-ocr/wiki/ControlParams
>
>  
>
> *From:* [email protected] [mailto:[email protected]] *On 
> Behalf Of *Pierre-Henri DAUVERGNE
> *Sent:* Wednesday, July 08, 2015 5:26 AM
> *To:* [email protected]
> *Subject:* Re: [tesseract-ocr] Train tesseract for 14-segment display
>
>  
>  
> I also tried different sizes and I haven't been able to make it work with any.
> Regarding doing OCR with OpenCV, I won't have enough time to do that. 
> Moreover, as I already use Tesseract for other fonts, I'd like to use it 
> for this one too (and the guys who did the tutorial said in the comments 
> that Tesseract is more powerful :/ )
>
> On Tuesday, 7 July 2015 21:11:21 UTC+2, Art Rhyno wrote:
>
>  When tesseract can’t find a matching blob, it gets trickier but at least 
> it is working with something. I am guessing some of the gaps between 
> segments are passing a threshold for belonging to a single character. I 
> tried a few different sizes, but I couldn’t get the “B” recognized and I 
> wonder if opencv might be a better route if the source of the characters is 
> fairly static. There’s an example here of using opencv with handwritten 
> numbers [1].
>
>  
>
> art
>
> ---
>
> 1. http://blog.damiles.com/2008/11/basic-ocr-in-opencv/
>
>  
>
> *From:* [email protected] [mailto:[email protected]] *On 
> Behalf Of *Pierre-Henri DAUVERGNE
> *Sent:* Tuesday, July 07, 2015 8:41 AM
> *To:* [email protected]
> *Subject:* Re: [tesseract-ocr] Train tesseract for 14-segment display
>
>  
>  
> I can't actually show you all the characters, but I can give you a sample. 
> I have the 10 digits and all the letters. I tried decreasing the size of the 
> characters but it still didn't work. Tesseract didn't say "Empty page!!" 
> but "Failure ! Couldn't find a matching blob" for all letters; the digits 
> worked fine.
>
> Here is a small sample: http://i.imgur.com/NeYBKrj.png (the letters are V 
> X B C D).
>
> Thank you for your help :)
>
>
> On Tuesday, 7 July 2015 13:40:24 UTC+2, Art Rhyno wrote:
>
>  Could you attach the “my_font_exp0.png” and “my_font_exp0.box” that are 
> producing the “Empty page!!” message? 
>
>  
>
> art
>
>  
>
> *From:* [email protected] [mailto:[email protected]] *On 
> Behalf Of *Pierre-Henri DAUVERGNE
> *Sent:* Tuesday, July 07, 2015 3:26 AM
> *To:* [email protected]
> *Subject:* Re: [tesseract-ocr] Train tesseract for 14-segment display
>
>  
>  
> Actually I followed this guide 
> <http://blog.ayoungprogrammer.com/2013/01/equation-ocr-part-2-training-characters.html>
>  
> which is essentially the same as the one you gave me. I am doing all of that: 
> I use qt-box-editor to manually set the boxes over the characters, then I 
> run the command "tesseract my_font_exp0.png my_font_exp0 nobatch box.train", 
> but it says "Empty page!!" and nothing else, and it creates an empty .txt file. 
> Whenever I try to train with linked segments, it works. 
> That's why I'm looking for either an image-processing way of linking all the 
> segments as they should be, or a tesseract way of training with unlinked 
> segments.
>
>
>
> On Monday, 6 July 2015 14:41:22 UTC+2, Art Rhyno wrote:
>
>  Hi,
>
>  
>
> I am guessing my attachment didn’t make it to the list but the character I 
> used is about 17x25 pixels.  I resaved the sample as a PNG (instead of a 
> TIFF) and am trying again. Remember that you can (and often have to) edit 
> the box files for training. Tesseract may split your character into more 
> than one blob, but you can override this. By default, the “makebox” 
> produced:
>
>  
>
> l 45 254 53 279 0
>
> ’ 55 267 62 277 0
>
>  
>
> But I modified this to be:
>
> V 45 254 62 279 0
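
The manual edit above amounts to taking the union of the two boxes and relabelling it. A small sketch of that operation, assuming the usual box-file layout of `<char> <left> <bottom> <right> <top> <page>`; the helper names `parse_box` and `merge_boxes` are made up for this example:

```python
# Merge several box-file entries into one box covering all of them.
# Assumed line layout: "<char> <left> <bottom> <right> <top> <page>".

def parse_box(line):
    char, left, bottom, right, top, page = line.split()
    return char, int(left), int(bottom), int(right), int(top), int(page)

def merge_boxes(lines, new_char):
    """Return one box line spanning all given boxes, labelled new_char."""
    boxes = [parse_box(line) for line in lines]
    left = min(b[1] for b in boxes)
    bottom = min(b[2] for b in boxes)
    right = max(b[3] for b in boxes)
    top = max(b[4] for b in boxes)
    page = boxes[0][5]
    return f"{new_char} {left} {bottom} {right} {top} {page}"

# The two blobs produced by "makebox" for the split "V" (\u2019 is the ’ label).
split_v = ["l 45 254 53 279 0", "\u2019 55 267 62 277 0"]
print(merge_boxes(split_v, "V"))  # -> V 45 254 62 279 0
```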
>
>  
>
> I found this blog post really helpful for training [1]. You can contact me 
> off-list if you want the entire training set I used, but I only did the one 
> character.
>
>  
>
> art
>
> ---
>
> 1. 
> http://michaeljaylissner.com/blog/adding-new-fonts-to-tesseract-3-ocr-engine
>
>  
>
> *From:* [email protected] [mailto:[email protected]] *On 
> Behalf Of *Pierre-Henri DAUVERGNE
> *Sent:* Monday, July 06, 2015 4:29 AM
> *To:* [email protected]
> *Subject:* Re: [tesseract-ocr] Train tesseract for 14-segment display
>
>  
>  
> OK, so I just tried again after resizing my image by a factor of 2 and by a 
> factor of 4, and it still doesn't work: tesseract says "Empty page!!".
> However, if I manually link the segments (with the brush tool in Gimp, see 
> here: http://i.imgur.com/akVmAgh.png ), it works, but it doesn't feel 
> like good training data for tesseract.
> Any advice?
>
> Thank you
>
> On Monday, 6 July 2015 09:18:44 UTC+2, Pierre-Henri DAUVERGNE wrote:
>
>  Hi, thank you for your answer :)
>
> Each character is about 100x160 pixels; is that too small? I'll try with 
> bigger ones and I'll post the results here.
>
> On Saturday, 4 July 2015 04:10:18 UTC+2, Art Rhyno wrote:
>
>  Hi,
>
>  
>
> I wonder if it has something to do with the sizing of the characters in 
> the image that you are using for font training. I swapped out the character 
> without the linked segments for a character in a set I am using and it 
> seemed to work ok. The set is too big for the list but I have attached the 
> image I used. 
>
>  
>
> art
>
>  
>
> *From:* [email protected] [mailto:[email protected]] *On 
> Behalf Of *Pierre-Henri DAUVERGNE
> *Sent:* Friday, July 03, 2015 10:20 AM
> *To:* [email protected]
> *Subject:* [tesseract-ocr] Train tesseract for 14-segment display
>
>  
>  
> Hello everyone.
>
> I've posted on stackoverflow already but haven't had an answer yet (
> http://stackoverflow.com/questions/31131796/14-segment-display-and-tesseract-ocr-with-opencv
> ).
>
> I'm looking for a way to accurately OCR a 14-segment display. As you can see 
> in my SO thread, I trained tesseract with dilated characters whose segments 
> are all linked together. My issue is that when I read a character from my 
> webcam, I have to erode it first to remove noise. After that, I dilate 
> it.
> However, I can't dilate it enough to link all the segments together without 
> causing issues with letters like 'B' and 'D', and the letter 'V' is not 
> recognized at all (I believe this is because the space between the 
> diagonals is too large).
>
> ·        What I trained tesseract with (that's the "V" letter) : 
> http://i.imgur.com/NbmVqkb.png (segments are all linked)
>
> ·        What I feed tesseract with : http://i.imgur.com/0E4iXXk.png 
> (some segments are linked, some aren't)
>
> I tried to train tesseract with characters whose segments aren't all 
> linked, but it says "Empty page !!". When I manually link the segments, the 
> training works fine (it feels weird that tesseract can't be trained with 
> blank space inside characters, since some of the existing languages (e.g. 
> Arabic or Chinese) already have characters like that).
>
> To bypass this issue, I've been trying different kinds of image processing 
> algorithms (like thinning, in order to dilate in height but not in 
> width), but none gave more accurate results.
>
> Thank you for your help !
>  
> -- 
> You received this message because you are subscribed to the Google Groups 
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/tesseract-ocr/451dbd65-20b7-437a-8b5b-a0a726bdad06%40googlegroups.com
>  
> <https://groups.google.com/d/msgid/tesseract-ocr/451dbd65-20b7-437a-8b5b-a0a726bdad06%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.

