RE: [tesseract-ocr] Train tesseract for 14-segment display

Art Rhyno . Wed, 08 Jul 2015 06:44:02 -0700

Well, the good news is that tesseract tells you in the training process what it 
can and cannot work with. I'd be tempted to use the gaps in the line segments 
to break apart the letters, for example, instead of "C", train for the top part 
to be something like "r" and the bottom to be another unique character, and 
then put them together in post OCR processing. I'd separate the "X" in the same 
way. The other option, and the one I would investigate where the segment gap 
doesn't go across the letter, for example, on the "B", is to scale it down to 
the point that tesseract would work with the blob as a single character.  This 
makes for a painstaking process to be sure, but I think it could work. I should 
note that you can configure settings for more flexibility in blob detection [1] 
but that's beyond anything I have ever done. I have tried opencv for pattern 
detection, I wouldn’t call it OCR, and it seems very powerful, but I haven’t 
used it enough to speak to whether it is the right hammer in this case.
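
A minimal sketch of the "put them together in post OCR processing" step, assuming each split letter comes back from tesseract as two consecutive placeholder symbols. The pairs below ("r" + "L" for "C", "v" + "^" for "X") and the function name are hypothetical; the real symbols would be whatever was entered in the box files, and they should be symbols the display never shows on their own.

```python
# Hypothetical mapping from (top-half symbol, bottom-half symbol) to the letter.
# These placeholders are illustrative only; they do not come from the thread.
PAIRS = {
    ("r", "L"): "C",
    ("v", "^"): "X",
}


def recombine(raw: str) -> str:
    """Replace each known two-symbol pair in the OCR output with its letter."""
    out = []
    i = 0
    while i < len(raw):
        # Look at the current symbol together with the next one, if there is one.
        pair = (raw[i], raw[i + 1]) if i + 1 < len(raw) else None
        if pair in PAIRS:
            out.append(PAIRS[pair])
            i += 2  # both halves consumed
        else:
            out.append(raw[i])
            i += 1
    return "".join(out)


print(recombine("rL4v^2"))  # prints "C4X2"
```

Symbols that are not part of a known pair pass through unchanged, so digits that already recognize correctly are not affected.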

art
---
1. https://code.google.com/p/tesseract-ocr/wiki/ControlParams

From: [email protected] On Behalf Of Pierre-Henri DAUVERGNE
Sent: Wednesday, July 08, 2015 5:26 AM
To: [email protected]
Subject: Re: [tesseract-ocr] Train tesseract for 14-segment display

I also tried different sizes and I haven't been able to make it work with any.
Regarding doing OCR with OpenCV, I won't have enough time to do that. Moreover, 
as I already use Tesseract for other fonts, I'd like to use it for this one too 
(and the guys who did the tutorial said in the comments that Tesseract is more 
powerful :/ )

On Tuesday, 7 July 2015 21:11:21 UTC+2, Art Rhyno wrote:
When tesseract can’t find a matching blob, it gets trickier but at least it is 
working with something. I am guessing some of the gaps between segments are 
passing a threshold for belonging to a single character. I tried a few 
different sizes, but I couldn’t get the “B” recognized and I wonder if opencv 
might be a better route if the source of the characters is fairly static. 
There’s an example here of using opencv with handwritten numbers [1].
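
One rough way to check whether the segment gaps are splitting a character into several blobs is to count connected components in the binarized glyph. This is only a diagnostic sketch in plain Python: the function name and the toy 5x5 grids are made up for illustration, and tesseract's own blob detection is more involved than this.

```python
from collections import deque


def count_blobs(grid):
    """Count 8-connected groups of 1-pixels in a binary image (list of rows)."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    blobs = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # New, unvisited foreground pixel: flood-fill everything attached to it.
            blobs += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return blobs


# Toy "V": two upper strokes separated from the lower diagonals by an empty row.
segmented = [
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
]
# Same glyph with the gap row filled in, as after manually linking the segments.
linked = [row[:] for row in segmented]
linked[2] = [1, 0, 0, 0, 1]

print(count_blobs(segmented))  # prints 3
print(count_blobs(linked))     # prints 1
```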

art
---
1. http://blog.damiles.com/2008/11/basic-ocr-in-opencv/

From: [email protected] On Behalf Of Pierre-Henri DAUVERGNE
Sent: Tuesday, July 07, 2015 8:41 AM
To: [email protected]
Subject: Re: [tesseract-ocr] Train tesseract for 14-segment display

I actually can't show you all the characters but I can give you a sample. I 
have the 10 digits and all letters. I tried to decrease the size of the 
characters but it still didn't work. Tesseract didn't say "Empty page!!" but 
"Failure ! Couldn't find a matching blob" for all letters; the digits worked 
fine.

Here is a small sample : http://i.imgur.com/NeYBKrj.png the letters are V X B C 
D.

Thank you for your help :)

On Tuesday, 7 July 2015 13:40:24 UTC+2, Art Rhyno wrote:
Could you attach the “my_font_exp0.png” and “my_font_exp0.box” that are 
producing the “Empty page!!” message?

art

From: [email protected] On Behalf Of Pierre-Henri DAUVERGNE
Sent: Tuesday, July 07, 2015 3:26 AM
To: [email protected]
Subject: Re: [tesseract-ocr] Train tesseract for 14-segment display

Actually, I followed this guide 
(http://blog.ayoungprogrammer.com/2013/01/equation-ocr-part-2-training-characters.html), 
which is essentially the same as the one you gave me. I am doing all that. I 
use qt-box-editor to manually set the boxes over the characters then I use the 
command "tesseract my_font_exp0.png my_font_exp0 nobatch box.train" but it says 
"Empty page!!" and nothing else. It creates an empty .txt file. Whenever I try 
to train with linked segments, it works.
That's why I'm looking for an image-processing way of linking all the segments 
as they should be or a tesseract way of training it with unlinked segments.

On Monday, 6 July 2015 14:41:22 UTC+2, Art Rhyno wrote:
Hi,

I am guessing my attachment didn’t make it to the list but the character I used 
is about 17x25 pixels.  I resaved the sample as a PNG (instead of a TIFF) and 
am trying again. Remember that you can (and often have to) edit the box files 
for training. Tesseract may split your character into more than one blob, but 
you can override this. By default, the “makebox” produced:

l 45 254 53 279 0
’ 55 267 62 277 0

But I modified this to be:
V 45 254 62 279 0
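
The manual edit above amounts to taking the union of the two boxes. A small sketch of that calculation, assuming the usual box-file layout of symbol, left, bottom, right, top, page; the function name is made up for illustration.

```python
def merge_boxes(lines, symbol):
    """Merge box-file lines into one line whose box is the union of the inputs."""
    boxes = []
    for line in lines:
        # Box format: <symbol> <left> <bottom> <right> <top> <page>
        _, left, bottom, right, top, page = line.rsplit(" ", 5)
        boxes.append((int(left), int(bottom), int(right), int(top), int(page)))
    left = min(b[0] for b in boxes)
    bottom = min(b[1] for b in boxes)
    right = max(b[2] for b in boxes)
    top = max(b[3] for b in boxes)
    page = boxes[0][4]  # assumes all boxes are on the same page
    return f"{symbol} {left} {bottom} {right} {top} {page}"


# The two boxes that "makebox" produced for the single "V" character.
split = ["l 45 254 53 279 0", "’ 55 267 62 277 0"]
print(merge_boxes(split, "V"))  # prints "V 45 254 62 279 0"
```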

I found this blog post really helpful for training [1]. You can contact me 
off-list if you want the entire training set I used, but I only did the one 
character.

art
---
1. http://michaeljaylissner.com/blog/adding-new-fonts-to-tesseract-3-ocr-engine

From: [email protected] On Behalf Of Pierre-Henri DAUVERGNE
Sent: Monday, July 06, 2015 4:29 AM
To: [email protected]
Subject: Re: [tesseract-ocr] Train tesseract for 14-segment display

Ok so I just tried after resizing my image by 2 and by 4 and it still doesn't 
work : tesseract says "Empty page!!".
However, if I manually link the segments (with the brush tool in Gimp, see here 
: http://i.imgur.com/akVmAgh.png ), it works but it doesn't feel like it's a 
good training for tesseract.
Any advice ?

Thank you

On Monday, 6 July 2015 09:18:44 UTC+2, Pierre-Henri DAUVERGNE wrote:
Hi, thank you for your answer :)

Each character is about 100x160 pixels, is that too low ? I'll try with bigger 
ones and I'll post the results here

On Saturday, 4 July 2015 04:10:18 UTC+2, Art Rhyno wrote:
Hi,

I wonder if it has something to do with the sizing of the characters in the 
image that you are using for font training. I swapped out the character without 
the linked segments for a character in a set I am using and it seemed to work 
ok. The set is too big for the list but I have attached the image I used.

art

From: [email protected] On Behalf Of Pierre-Henri DAUVERGNE
Sent: Friday, July 03, 2015 10:20 AM
To: [email protected]
Subject: [tesseract-ocr] Train tesseract for 14-segment display

Hello everyone.

I've posted on stackoverflow already but haven't had an answer yet 
(http://stackoverflow.com/questions/31131796/14-segment-display-and-tesseract-ocr-with-opencv).

I'm looking for a way to accurately OCR a 14-segment display. As you can see in 
my SO thread, I trained tesseract with dilated characters, which links all of 
their segments together. My issue is that when I read a character from my 
webcam, I have to erode it first to remove noise. After that, I dilate it.
However, I can't dilate it enough to link all the segments together without 
causing issues with letters like 'B' and 'D', and the letter 'V' is not 
recognized at all (I believe it is because the gap between the diagonals is too 
large).

• What I trained tesseract with (that's the letter "V"): 
http://i.imgur.com/NbmVqkb.png (segments are all linked)

• What I feed tesseract with: http://i.imgur.com/0E4iXXk.png (some segments 
are linked, some aren't)
I tried to train tesseract with characters whose segments aren't all linked, 
but it says "Empty page!!". When I manually link the segments, the training 
works fine (it feels weird that tesseract can't be trained with blank space 
inside characters, since some of the existing languages, e.g. Arabic or 
Chinese, already have some).

To bypass this issue, I've been trying different kinds of image-processing 
algorithms (like thinning, in order to dilate "in height" but not "in width"), 
but none gave more accurate results.
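
For reference, the dilate "in height" but not "in width" idea can be sketched as a morphological closing (dilate, then erode) with a tall, narrow kernel. The version below is plain Python on a toy binary grid; the function names and the grid are illustrative assumptions, not the code I actually ran. In OpenCV the same operation would be cv2.morphologyEx with cv2.MORPH_CLOSE and a kernel from cv2.getStructuringElement.

```python
def _morph(grid, kh, kw, combine, outside):
    """Apply `combine` (any or all) over a kh x kw window centred on each pixel."""
    rows, cols = len(grid), len(grid[0])
    ry, rx = kh // 2, kw // 2  # kernel sizes are assumed to be odd
    out = []
    for y in range(rows):
        row = []
        for x in range(cols):
            window = [
                grid[y + dy][x + dx]
                if 0 <= y + dy < rows and 0 <= x + dx < cols
                else outside  # value used for pixels beyond the image border
                for dy in range(-ry, ry + 1)
                for dx in range(-rx, rx + 1)
            ]
            row.append(1 if combine(v == 1 for v in window) else 0)
        out.append(row)
    return out


def dilate(grid, kh, kw):
    """A pixel becomes 1 if any pixel under the kernel is 1."""
    return _morph(grid, kh, kw, any, outside=0)


def erode(grid, kh, kw):
    """A pixel stays 1 only if every pixel under the kernel is 1."""
    return _morph(grid, kh, kw, all, outside=1)


def close_gaps(grid, kh, kw):
    """Morphological closing: dilation followed by erosion with the same kernel."""
    return erode(dilate(grid, kh, kw), kh, kw)


# Two vertical strokes, each broken by a one-pixel gap, with an empty column between.
glyph = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 0, 0],
    [1, 0, 1],
    [1, 0, 1],
]
closed = close_gaps(glyph, kh=3, kw=1)  # 3 tall, 1 wide: vertical bridging only
for row in closed:
    print(row)  # each row prints [1, 0, 1]
```

The vertical gaps are bridged while the empty middle column is left alone. The kernel height has to exceed the largest gap to bridge, which is likely where long diagonal gaps such as those in the "V" become a problem.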

Thank you for your help !
--
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected]<mailto:[email protected]>.
To post to this group, send email to 
[email protected]<mailto:[email protected]>.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/451dbd65-20b7-437a-8b5b-a0a726bdad06%40googlegroups.com<https://groups.google.com/d/msgid/tesseract-ocr/451dbd65-20b7-437a-8b5b-a0a726bdad06%40googlegroups.com?utm_medium=email&utm_source=footer>.
For more options, visit https://groups.google.com/d/optout.
