I'm trying to find the linkage between words and characters in the core 
Tesseract data model.

In the TessBaseAPI there are two interesting methods. 

TessBaseAPI::GetHOCRText[1] recursively walks through the data structure 
and prints out the results at page, block, line, and word level, but it 
does not go down to individual characters.

TessBaseAPI::GetConnectedComponents[2] lets me get the bounding box of 
each character (RIL_SYMBOL?).

But there seems to be nothing that links these together. What am I missing? 
Is there a mapping?

What I'd like to dump is:

line_1 <1516 2365 2426 2425>
     word_1 <1516 2365 1809 2425> text=Sign
           character_1 <1516 2365 1540 2425> text=S
           character_2 <1545 2365 1580 2425> text=i
           character_3 <1585 2365 1590 2425> text=g
           character_4 <1700 2365 1809 2425> text=n

How can I connect the resulting word back to the underlying characters?

I need to be able to get the structure of all the boxes contained within 
each other.
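
To make the nesting concrete, here is a rough sketch of the fallback I
could write myself if no such mapping exists: group the boxes purely by
geometric containment. Everything in it (Box, Contains, ChildrenOf,
DumpLine) is a hypothetical helper of mine, not part of the Tesseract
API, and it assumes that each character box lies fully inside exactly one
word box, which may not hold for overlapping or touching glyphs.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical box type (not a Tesseract class): pixel coordinates in the
// same <left top right bottom> order as the dump above, plus the text.
struct Box {
    int left, top, right, bottom;
    std::string text;
};

// True when `inner` lies entirely within `outer` (edges may touch).
bool Contains(const Box& outer, const Box& inner) {
    return inner.left >= outer.left && inner.top >= outer.top &&
           inner.right <= outer.right && inner.bottom <= outer.bottom;
}

// Returns the candidates that fall inside `parent`, in input order.
std::vector<Box> ChildrenOf(const Box& parent,
                            const std::vector<Box>& candidates) {
    std::vector<Box> children;
    for (const Box& c : candidates) {
        if (Contains(parent, c)) children.push_back(c);
    }
    return children;
}

// Formats a box as "<left top right bottom>".
std::string FormatBox(const Box& b) {
    assert(b.left <= b.right && b.top <= b.bottom);
    std::ostringstream out;
    out << "<" << b.left << " " << b.top << " " << b.right << " "
        << b.bottom << ">";
    return out.str();
}

// Dumps one line with its words and characters nested by containment.
// Words and characters are flat lists; nesting is inferred, not given.
std::string DumpLine(const Box& line, const std::vector<Box>& words,
                     const std::vector<Box>& chars) {
    std::ostringstream out;
    out << "line_1 " << FormatBox(line) << "\n";
    int word_no = 0;
    for (const Box& w : ChildrenOf(line, words)) {
        out << "     word_" << ++word_no << " " << FormatBox(w)
            << " text=" << w.text << "\n";
        int char_no = 0;
        for (const Box& c : ChildrenOf(w, chars)) {
            out << "           character_" << ++char_no << " "
                << FormatBox(c) << " text=" << c.text << "\n";
        }
    }
    return out.str();
}
```

Calling DumpLine with the line box, a flat list of word boxes, and a flat
list of character boxes produces output in the layout shown above. I would
much rather use a real word-to-character link from the data model than
guess it from geometry like this.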

Any help would be appreciated.

Regards,
Patrick.

[1] http://zdenop.github.com/tesseract-doc/group___advanced_a_p_i.html#ga655f906bbf64dcd6f33ce633ecce997d
[2] http://zdenop.github.com/tesseract-doc/group___advanced_a_p_i.html#gaf2b4f88c53457fa5153dc80f5a60e152
