On Fri, 21 Aug 2015, Rutger Rozendal wrote:
Ok, thanks
We will try this method
So first we get the rectangles out as cropped pictures
and then do the character recognition on each separate
picture.
For rectangle-by-color extraction we plan to use OpenCV,
as it seems that Tesseract is not really meant for that,
is it?
We used OpenCV before to find the rooms (rectangular
boxes), but that was based on the walls and their radius,
and the analysis was done on an inverted black-and-white
picture of the floorplan.
We will now try with the color as the input source, but
since some rooms have the same color and are next to each
other, we are wondering what will happen to those.
And then for the last step, the Tesseract recognition on
the cropped picture of one room: is it advisable to use
a grayscale image there?
And can we feed Tesseract a kind of target list? For
us it is important to find the location of the room (x and
y on the picture); that is the overall goal of the
assignment.
Thanks again for any tips on this challenge.
Rutger

I believe that tesseract operates on black and white
images. All grayscale and colour images are converted
internally to black and white if necessary. In your
case, you could probably do the conversion yourself,
turning every pixel that is not black to white, since
all of the text is black.
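
A minimal sketch of that conversion, assuming the cropped room
picture is available as rows of (r, g, b) tuples with values from
0 to 255. The function name and the threshold of 60 are
hypothetical and would need tuning on real floorplans. With OpenCV
you would more likely convert to grayscale and call cv2.threshold
with cv2.THRESH_BINARY, but the idea is the same:

```python
def to_black_and_white(pixels, black_threshold=60):
    """Turn every pixel that is not (nearly) black into white.

    pixels: list of rows, each row a list of (r, g, b) tuples (0..255).
    Returns rows of 0 (black) or 255 (white).
    black_threshold is an assumed cutoff, not a Tesseract setting.
    """
    result = []
    for row in pixels:
        out_row = []
        for r, g, b in row:
            # Treat a pixel as black text only if all channels are dark.
            # Using max() keeps dark but saturated room colours
            # (e.g. a deep blue) from being mistaken for text.
            is_black = max(r, g, b) <= black_threshold
            out_row.append(0 if is_black else 255)
        result.append(out_row)
    return result


if __name__ == "__main__":
    crop = [
        [(0, 0, 0), (255, 200, 0)],    # black text, yellow room fill
        [(30, 30, 30), (0, 0, 200)],   # dark grey text, blue room fill
    ]
    print(to_black_and_white(crop))
```

Checking the brightest channel rather than the average is a design
choice for coloured floorplans: a strongly coloured room can have a
low average brightness and would otherwise end up black.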
Many people have used tesseract on purely numeric text, and
there are many posts in the archive about that. I think
some used a whitelist of numeric characters, and
others created dictionaries containing valid combinations
of numbers to search against. Tesseract does not
just try to recognize each character; it also tries
to match each "word" against dictionaries, so
it helps to let tesseract know that "8008" is a
better answer than "BOOB".
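
As a rough sketch of both ideas, assuming the tesseract command-line
program is installed and recent enough to accept "stdout" as the
output base and the -c option. The tessedit_char_whitelist variable
is a real Tesseract setting. The helper names, the look-alike table,
and the cutoff are hypothetical, and the target-list matching is
done after OCR in Python rather than inside Tesseract:

```python
import difflib
import subprocess

DIGITS = "0123456789"

# Hypothetical table of letters commonly confused with digits.
LOOKALIKES = str.maketrans(
    {"B": "8", "O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "Z": "2"}
)


def build_tesseract_command(image_path, whitelist=DIGITS):
    """Build the command line that restricts output to the whitelist."""
    return [
        "tesseract", image_path, "stdout",
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def run_tesseract(image_path, whitelist=DIGITS):
    """Run tesseract on one cropped room image and return its text."""
    result = subprocess.run(
        build_tesseract_command(image_path, whitelist),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def snap_to_targets(raw_text, targets, cutoff=0.75):
    """Map OCR output to the closest entry of a known target list.

    Returns None when nothing in targets is similar enough.
    """
    candidate = raw_text.strip().translate(LOOKALIKES)
    matches = difflib.get_close_matches(candidate, targets, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    room_numbers = ["8008", "1014", "2231"]  # example target list
    print(build_tesseract_command("room.png"))
    print(snap_to_targets("BOOB", room_numbers))
```

The post-OCR matching step is independent of Tesseract, so it can be
used whether or not the whitelist affects the dictionary search.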
Cheers,
Rob Komar
P.S. Does anyone know if the whitelist applies to
the dictionary search, as well? If not, I think it
would be a useful addition to make to the code.