I'm trying to find the linkage between words and characters in the core
Tesseract data model.
TessBaseAPI has two methods that look relevant.
TessBaseAPI::GetHOCRText[1] recursively walks the data structure and
prints the results at page, block, line, and word level, but it does not
go down to characters.
TessBaseAPI::GetConnectedComponents[2] gives me the bounding boxes of
each character (RIL_SYMBOL?).
But nothing seems to link the two together. What am I missing? Is there
a mapping?
What I'd like to dump is:

line_1 <1516 2365 2426 2425>
  word_1 <1516 2365 1809 2425> text=Sign
    character_1 <1516 2365 1540 2425> text=S
    character_2 <1545 2365 1580 2425> text=i
    character_3 <1585 2365 1590 2425> text=g
    character_4 <1700 2365 1809 2425> text=n

How can I connect each recognized word back to its underlying characters?
I need the full hierarchy of boxes, with each box nested inside the one
that contains it.
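
To make the goal concrete, here is a minimal Python sketch of the nesting
I'm after. It does not call Tesseract at all: the Node class and the
contains, nest, and dump helpers are names I made up for illustration, and
matching boxes purely by geometric containment is my own assumption, not
something the API documents. The coordinates are the sample values from
the dump I described, read as (left, top, right, bottom).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# (left, top, right, bottom); this ordering is an assumption.
Box = Tuple[int, int, int, int]


@dataclass
class Node:
    """Hypothetical container for one box at one level of the hierarchy."""
    level: str                      # "line", "word" or "character"
    box: Box
    text: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def contains(outer: Box, inner: Box) -> bool:
    """True if inner lies entirely within outer (edges may touch)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def nest(parents: List[Node], children: List[Node]) -> List[Node]:
    """Attach each child to the first parent whose box contains it."""
    for child in children:
        for parent in parents:
            if contains(parent.box, child.box):
                parent.children.append(child)
                break
    return parents


def dump(node: Node, index: int = 1, depth: int = 0) -> List[str]:
    """Render a node and its descendants, indented two spaces per level."""
    label = f"{'  ' * depth}{node.level}_{index} <{' '.join(map(str, node.box))}>"
    if node.text is not None:
        label += f" text={node.text}"
    out = [label]
    for i, child in enumerate(node.children, start=1):
        out.extend(dump(child, i, depth + 1))
    return out


# Sample data taken from the example in this message.
lines = [Node("line", (1516, 2365, 2426, 2425))]
words = [Node("word", (1516, 2365, 1809, 2425), "Sign")]
chars = [
    Node("character", (1516, 2365, 1540, 2425), "S"),
    Node("character", (1545, 2365, 1580, 2425), "i"),
    Node("character", (1585, 2365, 1590, 2425), "g"),
    Node("character", (1700, 2365, 1809, 2425), "n"),
]

nest(words, chars)   # characters into words
nest(lines, words)   # words into lines
report = dump(lines[0])
print("\n".join(report))
```

Matching by containment is only a fallback. It would break if a character
box pokes outside its word box, which is why I'd rather use a real mapping
if one exists.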
Any help would be appreciated.
Regards,
Patrick.
[1] http://zdenop.github.com/tesseract-doc/group___advanced_a_p_i.html#ga655f906bbf64dcd6f33ce633ecce997d
[2] http://zdenop.github.com/tesseract-doc/group___advanced_a_p_i.html#gaf2b4f88c53457fa5153dc80f5a60e152