I am using Tesseract 3.01 and working with a tool that parses Tesseract's hOCR output to generate PDF files. The tool is choking because it expects the ocrx_word element to carry the bbox position information. Before I patch the tool, I want to confirm a few things.
The hOCR output generated by Tesseract 3.01 wraps each word in two tags:

    <span class='ocr_word' title="bbox x0 y0 x1 y1"><span class='ocrx_word' id='xword_1_1' title="x_wconf -2"><strong>Text</strong></span></span>

The hOCR spec (https://docs.google.com/a/touzon.com/document/preview?id=1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0) doesn't mention ocr_word, and it treats ocrx_word as engine-specific markup. Is there a disconnect between the hOCR spec and Tesseract, or am I reading too much into this?

If ocr_word isn't part of the spec, why not drop it and place the bbox position information on the ocrx_word element? This would make parsing slightly easier and reduce the size of the generated hOCR.

Carlos
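
P.S. For anyone hitting the same problem, here is a minimal sketch of the kind of workaround I have in mind. It is not the tool's actual code. It uses only Python's standard html.parser, and the names WordBoxParser and extract_words are made up for illustration. The idea is to accept the bbox from whichever word span carries it, so both the nested ocr_word/ocrx_word layout and a flat ocrx_word with its own bbox should parse. The coordinates in the usage note are invented sample values.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside a title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")
# Both class names are treated as word-level spans (an assumption of this sketch).
WORD_CLASSES = {"ocr_word", "ocrx_word"}


class WordBoxParser(HTMLParser):
    """Collects (text, bbox) pairs, one per outermost word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._stack = []   # one flag per open <span>: True if it is a word span
        self._depth = 0    # number of word spans currently open
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        is_word = bool(classes & WORD_CLASSES)
        self._stack.append(is_word)
        if not is_word:
            return
        if self._depth == 0:
            # Entering a new outermost word span: reset the accumulators.
            self._bbox, self._text = None, []
        self._depth += 1
        match = BBOX_RE.search(attrs.get("title") or "")
        if match and self._bbox is None:
            # The first bbox found wins, so the outer span takes precedence.
            self._bbox = tuple(int(g) for g in match.groups())

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        if self._stack.pop():
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._text).strip()
                if text:
                    self.words.append((text, self._bbox))


def extract_words(hocr):
    """Return a list of (text, (x0, y0, x1, y1)) tuples; bbox is None if absent."""
    parser = WordBoxParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Calling extract_words on the nested markup above, with the placeholder bbox replaced by sample numbers such as "bbox 10 20 30 40", should give a single entry pairing "Text" with (10, 20, 30, 40). A flat ocrx_word span that carries its own bbox goes through the same code path.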

