Well, the good news is that tesseract tells you during the training process what it can and cannot work with. I'd be tempted to use the gaps in the line segments to break the letters apart. For example, instead of training "C" as one character, train the top part as something like "r" and the bottom part as another unique character, then put them back together in post-OCR processing. I'd separate the "X" in the same way.

The other option, and the one I would investigate where the segment gap doesn't go all the way across the letter (for example, on the "B"), is to scale the character down to the point that tesseract treats the blob as a single character. This makes for a painstaking process to be sure, but I think it could work. I should note that you can configure settings for more flexibility in blob detection [1], but that's beyond anything I have ever done.

I have tried opencv for pattern detection (I wouldn’t call it OCR) and it seems very powerful, but I haven’t used it enough to say whether it is the right hammer in this case.
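A minimal sketch of the post-OCR recombination step, assuming the split halves were trained as placeholder characters. The placeholder pairs below are hypothetical (only "r" for the top of "C" comes from the suggestion above), and the sketch assumes the OCR output has already been arranged so that each top-half character is immediately followed by its bottom-half character, which depends on how tesseract orders the blobs:

```python
# Hypothetical mapping from (top-half, bottom-half) placeholder pairs to the
# real letter. Only "r" as the top of "C" comes from the thread; the other
# placeholders are made up for illustration.
HALF_PAIRS = {
    ("r", "L"): "C",
    ("v", "^"): "X",
}


def recombine(ocr_text):
    """Replace adjacent placeholder pairs with the letter they stand for."""
    out = []
    i = 0
    while i < len(ocr_text):
        pair = tuple(ocr_text[i:i + 2])
        if pair in HALF_PAIRS:
            # Both halves found next to each other: emit the real letter.
            out.append(HALF_PAIRS[pair])
            i += 2
        else:
            # Anything else (digits, whole letters) passes through unchanged.
            out.append(ocr_text[i])
            i += 1
    return "".join(out)


print(recombine("rLv^8"))  # prints CX8
```

The placeholders need to be characters that never appear next to each other in real readings, otherwise a genuine sequence would be rewritten by mistake.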
art

---
1. https://code.google.com/p/tesseract-ocr/wiki/ControlParams

From: [email protected] [mailto:[email protected]] On Behalf Of Pierre-Henri DAUVERGNE
Sent: Wednesday, July 08, 2015 5:26 AM
To: [email protected]
Subject: Re: [tesseract-ocr] Train tesseract for 14-segment display

I also tried different sizes and I have not been able to make it work with any of them.

Regarding doing OCR with OpenCV, I won't have enough time to do that. Moreover, as I already use Tesseract for other fonts, I'd like to use it for this one too (and the guys who did the tutorial said in the comments that Tesseract is more powerful :/ )

On Tuesday, July 7, 2015 at 21:11:21 UTC+2, Art Rhyno wrote:

When tesseract can’t find a matching blob, it gets trickier but at least it is working with something. I am guessing some of the gaps between segments are passing a threshold for belonging to a single character. I tried a few different sizes, but I couldn’t get the “B” recognized and I wonder if opencv might be a better route if the source of the characters is fairly static. There’s an example here of using opencv with handwritten numbers [1].

art

---
1. http://blog.damiles.com/2008/11/basic-ocr-in-opencv/

From: [email protected] [mailto:[email protected]] On Behalf Of Pierre-Henri DAUVERGNE
Sent: Tuesday, July 07, 2015 8:41 AM
To: [email protected]
Subject: Re: [tesseract-ocr] Train tesseract for 14-segment display

I actually can't show you all the characters but I can give you a sample. I have the 10 digits and all letters. I tried to decrease the size of the characters but it still didn't work. Tesseract didn't say "Empty page!!" but "Failure ! Couldn't find a matching blob" for all letters; the digits worked fine.

Here is a small sample: http://i.imgur.com/NeYBKrj.png
The letters are V X B C D.
Thank you for your help :)

On Tuesday, July 7, 2015 at 13:40:24 UTC+2, Art Rhyno wrote:

Could you attach the “my_font_exp0.png” and “my_font_exp0.box” that are producing the “Empty page!!” message?

art

From: [email protected] [mailto:[email protected]] On Behalf Of Pierre-Henri DAUVERGNE
Sent: Tuesday, July 07, 2015 3:26 AM
To: [email protected]
Subject: Re: [tesseract-ocr] Train tesseract for 14-segment display

Actually I followed this guide <http://blog.ayoungprogrammer.com/2013/01/equation-ocr-part-2-training-characters.html> which is essentially the same as the one you gave me. I am doing all that. I use qt-box-editor to manually set the boxes over the characters, then I use the command "tesseract my_font_exp0.png my_font_exp0 nobatch box.train" but it says "Empty page!!" and nothing else. It creates an empty .txt file.

Whenever I try to train with linked segments, it works. That's why I'm looking for an image-processing way of linking all the segments as they should be, or a tesseract way of training it with unlinked segments.

On Monday, July 6, 2015 at 14:41:22 UTC+2, Art Rhyno wrote:

Hi,

I am guessing my attachment didn’t make it to the list but the character I used is about 17x25 pixels. I resaved the sample as a PNG (instead of a TIFF) and am trying again.

Remember that you can (and often have to) edit the box files for training. Tesseract may split your character into more than one blob, but you can override this. By default, the “makebox” produced:

l 45 254 53 279 0
’ 55 267 62 277 0

But I modified this to be:

V 45 254 62 279 0

I found this blog post really helpful for training [1]. You can contact me off-list if you want the entire training set I used, but I only did the one character.

art

---
1. http://michaeljaylissner.com/blog/adding-new-fonts-to-tesseract-3-ocr-engine

From: [email protected] [mailto:[email protected]] On Behalf Of Pierre-Henri DAUVERGNE
Sent: Monday, July 06, 2015 4:29 AM
To: [email protected]
Subject: Re: [tesseract-ocr] Train tesseract for 14-segment display

Ok so I just tried after resizing my image by 2 and by 4 and it still doesn't work: tesseract says "Empty page!!". However, if I manually link the segments (with the brush tool in Gimp, see here: http://i.imgur.com/akVmAgh.png), it works but it doesn't feel like it's a good training for tesseract.

Any advice?

Thank you

On Monday, July 6, 2015 at 09:18:44 UTC+2, Pierre-Henri DAUVERGNE wrote:

Hi, thank you for your answer :)

Each character is about 100x160 pixels, is that too low? I'll try with bigger ones and I'll post the results here.

On Saturday, July 4, 2015 at 04:10:18 UTC+2, Art Rhyno wrote:

Hi,

I wonder if it has something to do with the sizing of the characters in the image that you are using for font training. I swapped out the character without the linked segments for a character in a set I am using and it seemed to work ok. The set is too big for the list but I have attached the image I used.

art

From: [email protected] [mailto:[email protected]] On Behalf Of Pierre-Henri DAUVERGNE
Sent: Friday, July 03, 2015 10:20 AM
To: [email protected]
Subject: [tesseract-ocr] Train tesseract for 14-segment display

Hello everyone.

I've posted on stackoverflow already but haven't had an answer yet (http://stackoverflow.com/questions/31131796/14-segment-display-and-tesseract-ocr-with-opencv). I'm looking for a way to accurately OCR a 14-segment display. As you can see in my SO thread, I trained tesseract with dilated characters, which links all of their segments together. My issue is that when I read a character from my webcam, I have to erode it first to remove noise.
After that, I dilate it. However, I can't dilate it enough to link all the segments together without having issues with letters like 'B' and 'D', and the letter 'V' is not recognized at all (I believe it is because the space between the diagonals is too large).

• What I trained tesseract with (that's the letter "V"): http://i.imgur.com/NbmVqkb.png (segments are all linked)
• What I feed tesseract with: http://i.imgur.com/0E4iXXk.png (some segments are linked, some aren't)

I tried to train tesseract with characters where the segments aren't all linked but it says "Empty page !!". When I manually link the segments, the training works fine (it feels weird that tesseract can't be trained with blank space inside characters, since some of the existing languages (e.g. Arabic or Chinese) already have some).

To bypass this issue, I've been trying different kinds of image processing algorithms (like thinning, in order to dilate "in height" but not "in width") but none gave more accurate results.

Thank you for your help!

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/451dbd65-20b7-437a-8b5b-a0a726bdad06%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
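The box-file edit described earlier in this thread (replacing the two boxes that makebox produced for the "V" with a single box) amounts to taking the union of the bounding boxes. A minimal sketch, assuming the Tesseract 3 box-file layout of "char left bottom right top page" on each line; the helper names parse_box and merge_boxes are hypothetical and not part of tesseract:

```python
def parse_box(line):
    """Parse one box-file line: '<char> <left> <bottom> <right> <top> <page>'."""
    parts = line.split()
    char = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:6])
    return char, left, bottom, right, top, page


def merge_boxes(lines, char):
    """Merge several box lines into one line labelled with `char`."""
    boxes = [parse_box(line) for line in lines]
    # The merged box is the smallest rectangle covering every input box.
    left = min(b[1] for b in boxes)
    bottom = min(b[2] for b in boxes)
    right = max(b[3] for b in boxes)
    top = max(b[4] for b in boxes)
    page = boxes[0][5]  # assumes all boxes are on the same page
    return f"{char} {left} {bottom} {right} {top} {page}"


# The two lines from the thread; "\u2019" is the right single quotation mark
# that tesseract reported for the second stroke of the "V".
split_v = ["l 45 254 53 279 0", "\u2019 55 267 62 277 0"]
print(merge_boxes(split_v, "V"))  # prints V 45 254 62 279 0
```

This only automates the manual edit for boxes you already know belong to one character; deciding which boxes to merge still has to be done by eye or with a box editor.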

