hocr output - ocr_word and ocrx_word

Carlos Thu, 19 Apr 2012 09:28:34 -0700

I am using tesseract 3.01.

I am working with a tool that parses tesseract's hOCR output to generate
PDF files.  This tool is choking because it expects the ocrx_word element
to include the bbox position information.  Before I patch the tool I just
wanted to confirm a few things.


The hOCR output generated by tesseract 3.01 wraps each word in two nested
spans:

<span class='ocr_word' title="bbox x0 y0 x1 y1"><span class='ocrx_word' 
id='xword_1_1' title="x_wconf -2"><strong>Text</strong></span></span>
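
For reference, here is a rough sketch of how a parser could cope with this
nesting.  It is not the code of the tool I mentioned.  The names
parse_title, WordBoxParser and parse_words are made up for illustration,
and the coordinates in the usage example are invented.  It uses Python's
standard html.parser module, treats the outermost ocr_word or ocrx_word
span as one word, and merges the bbox and x_wconf properties from any word
span nested inside it, so it should accept both the nested layout above
and a flat layout with bbox directly on ocrx_word.

```python
from html.parser import HTMLParser

WORD_CLASSES = {"ocr_word", "ocrx_word"}


def parse_title(title):
    """Split an hOCR title such as 'bbox 1 2 3 4; x_wconf 90' into a dict."""
    props = {}
    for part in (title or "").split(";"):
        fields = part.split()
        if fields:
            props[fields[0]] = fields[1:]
    return props


class WordBoxParser(HTMLParser):
    """Illustrative only: collects text, bbox and x_wconf for each word."""

    def __init__(self):
        super().__init__()
        self.words = []        # finished words, in document order
        self._stack = []       # class attribute of every open <span>
        self._current = None   # word currently being assembled
        self._depth = 0        # stack depth at which that word started

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        cls = attrs.get("class") or ""
        self._stack.append(cls)
        if cls not in WORD_CLASSES:
            return
        if self._current is None:
            # Outermost word span: start a new word record.
            self._current = {"text": "", "bbox": None, "conf": None}
            self._depth = len(self._stack)
        # Merge properties from this span, whichever level carries them.
        props = parse_title(attrs.get("title"))
        if len(props.get("bbox", [])) == 4:
            self._current["bbox"] = tuple(int(v) for v in props["bbox"])
        if props.get("x_wconf"):
            self._current["conf"] = float(props["x_wconf"][0])

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        self._stack.pop()
        # The word is complete once its outermost span has been closed.
        if self._current is not None and len(self._stack) < self._depth:
            self.words.append(self._current)
            self._current = None


def parse_words(hocr):
    parser = WordBoxParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Usage with invented coordinates in place of x0 y0 x1 y1:
sample = (
    "<span class='ocr_word' title=\"bbox 10 20 110 45\">"
    "<span class='ocrx_word' id='xword_1_1' title=\"x_wconf -2\">"
    "<strong>Text</strong></span></span>"
)
for word in parse_words(sample):
    print(word["text"], word["bbox"], word["conf"])
```

Keying on the outermost word span rather than on one specific class name
means the parser does not care which of the two spans carries the bbox.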

The hOCR spec
(https://docs.google.com/a/touzon.com/document/preview?id=1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0)
doesn't mention ocr_word at all, and it treats ocrx_word as engine-specific
markup.

Is there a disconnect between the hOCR spec and tesseract or am I reading 
too much into this?  If ocr_word isn't part of the spec, why not drop it 
and place the bbox position information in the ocrx_word element?  This 
would make parsing slightly easier and reduce the size of the generated 
hOCR.

Carlos

