[tesseract-ocr] Training strategy to add a few GDT Symbols to eng

Boot Tue, 01 Dec 2020 11:17:08 -0800

I'm training a model to recognize mechanical engineering drawings that may
contain GDT symbols such as depth, counterbore, countersink, diameter, etc.
I saw that eng.traineddata already contains some of these GDT symbols, but
not all. I'm using the legacy engine (OEM).
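
One way to check which symbols are missing is to unpack the model (for
example with `combine_tessdata -u eng.traineddata eng.`) and scan the
resulting unicharset file. The following is a minimal sketch, assuming the
unicharset has its usual layout: an entry count on the first line, then one
entry per line starting with the character. The symbol list and the code
point chosen for "depth" are illustrative assumptions, not something
confirmed in this thread.

```python
# Hypothetical list of GD&T symbols of interest, keyed by Unicode character.
# The choice of U+21A7 for "depth" is an assumption.
REQUIRED_SYMBOLS = {
    "\u2300": "diameter",
    "\u2334": "counterbore",
    "\u2335": "countersink",
    "\u21a7": "depth",
}


def unicharset_chars(text):
    """Return the set of entries listed in the contents of a unicharset file."""
    lines = text.splitlines()
    # Skip the first line (the entry count). Each remaining line starts with
    # the character itself, followed by a space and its properties.
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def missing_symbols(text, required=REQUIRED_SYMBOLS):
    """Map each required symbol that is absent from the unicharset to its name."""
    present = unicharset_chars(text)
    return {sym: name for sym, name in required.items() if sym not in present}
```

Calling `missing_symbols(open("eng.unicharset", encoding="utf-8").read())`
on the unpacked file would then list the symbols that still need training
data.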

I extract two types of images from these mechanical drawings: images that
contain Notes, which are typically English paragraphs or sentences of text,
and images that contain dimensions and GDT symbols.

For the Notes regions of the drawing (in general, recognition of all
letters, numbers, and punctuation), I'm satisfied with the results that
eng.traineddata produces.

For images from the drawing that contain dimension text such as
"⌀1.05 + .05 - .03 TYP", I have trained a model on the letters A-Z
(uppercase only, which is typical on these drawings; dimensions can have
English text before or after them), a limited set of punctuation
characters, and all the GDT symbols I need. It works OK on some fonts, but
it is not as good as eng.traineddata at recognizing letters, numbers, and
punctuation. I assume the main reason is that I haven't trained it with
nearly as many fonts as eng.traineddata was trained with.

So my question is: what's the best way to build the language I need, which
is just the eng model plus a few additional characters? Does it make sense
to try to re-create the eng training data on my own? That seems like a
daunting task that I'd like to avoid. Do I have to re-create the eng
language just to add a few symbols?
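
The two-model setup described above can be sketched as a small routing
helper that picks a model per region type and builds the tesseract command
line. This is only a sketch under stated assumptions: "gdt" is a
hypothetical name for the custom traineddata, the region labels are made up
for illustration, and page segmentation mode 6 is just a default that may
not suit single-line dimensions.

```python
# Map each region type to the model used for it. "gdt" is a hypothetical
# name for the custom traineddata; "eng" is the stock English model.
MODEL_FOR_REGION = {"notes": "eng", "dimension": "gdt"}


def tesseract_command(image_path, region_type, psm=6):
    """Build a tesseract CLI invocation for one cropped region image."""
    lang = MODEL_FOR_REGION[region_type]
    return [
        "tesseract", image_path, "stdout",  # "stdout" prints the text
        "-l", lang,
        "--oem", "0",       # 0 selects the legacy engine
        "--psm", str(psm),  # 6 = assume a single uniform block of text
    ]
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True,
text=True)`, for example with `psm=7` for a dimension cropped to a single
line.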

Thanks for any advice,
Boot
