Thank you, Rick. A concise answer was given on GitHub recently:
jimregan <https://github.com/jimregan> commented 2 days ago
<https://github.com/tesseract-ocr/tesseract/issues/42#issuecomment-122577036>
This issue is currently the top search result for 'ocr_float'; it lacks a
simple summary: Tesseract (currently) does not support ocr_float.
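
To check this against your own output, here is a minimal sketch that counts the hOCR classes actually present in a file and reads the declared ocr-capabilities. It assumes the hOCR is regular enough for Python's standard html.parser. The names HocrClassCounter and summarize_hocr, and the sample markup, are made up for illustration and are not part of Tesseract.

```python
from collections import Counter
from html.parser import HTMLParser


class HocrClassCounter(HTMLParser):
    """Count hOCR element classes and record the declared ocr-capabilities."""

    def __init__(self):
        super().__init__()
        self.classes = Counter()   # e.g. {"ocr_line": 12, "ocrx_word": 80}
        self.capabilities = []     # tokens from <meta name='ocr-capabilities'>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "ocr-capabilities":
            self.capabilities = (attrs.get("content") or "").split()
        # An element may carry several classes; keep only the hOCR ones.
        for cls in (attrs.get("class") or "").split():
            if cls.startswith(("ocr_", "ocrx_")):
                self.classes[cls] += 1


def summarize_hocr(text):
    """Return (class counts, declared capabilities) for an hOCR string."""
    parser = HocrClassCounter()
    parser.feed(text)
    parser.close()
    return parser.classes, parser.capabilities


# Hypothetical sample, modelled on the class names quoted in this thread.
SAMPLE = """<html><head>
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
</head><body>
<div class='ocr_page' id='page_1'>
<div class='ocr_carea' id='block_1_1'>
<p class='ocr_par' dir='ltr' id='par_1_1'>
<span class='ocr_line' id='line_1_1'><span class='ocrx_word' id='word_1_1'>Hello</span>
<span class='ocrx_word' id='word_1_2'>world</span></span>
</p></div></div>
</body></html>"""

if __name__ == "__main__":
    classes, capabilities = summarize_hocr(SAMPLE)
    print("declared capabilities:", capabilities)
    for name, count in sorted(classes.items()):
        print(name, count)
    print("ocr_float present:", "ocr_float" in classes)
```

To use it on a real file, read the .hocr output into a string and pass it to summarize_hocr in place of SAMPLE.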
On Monday, 6 July 2015 19:59:15 UTC+1, Rick Leir wrote:
>
> You will see how the hocr file is built with lines like this:
> api/baseapi.cpp: hocr_str.add_str_int("\n <p class='ocr_par'
> dir='ltr' id='par_",
>
> Going out on a limb, I grepped the tree for ocr_float, and got no hits. A
> closer look at the code might turn up something, so have a look.
>
> What I see in api/baseapi.cpp is:
> 'ocr_page'
> 'ocr_carea'
> 'ocr_par'
> 'ocr_line'
> 'ocrx_word'
>
> You can also look in api/renderer.cpp :
>
> bool TessHOcrRenderer::BeginDocumentHandler() {
> ..
> " <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par"
> " ocr_line ocrx_word");
>
>
> On Monday, July 6, 2015 at 6:52:14 AM UTC-4, James Owers wrote:
>>
>> I'm trying to reproduce results achieved at the ICDAR page segmentation
>> competitions [1,2] with Tesseract. I'm struggling to get the tool to output
>> the hOCR tags that I'm expecting for tables, figures, etc. [3]. At the
>> moment I'm calling tesseract with pagesegmode 1. Should I be adding other
>> options via a config file to get the full extent of Tesseract's
>> segmentation and labelling ability? (I'm less interested in the character
>> recognition element.)
>>
>> 1. Antonacopoulos (2013, ICDAR) ICDAR2013 Competition on Historical
>> Book Recognition – HBR2013
>> 2. Antonacopoulos (2013, ICDAR) ICDAR2013 Competition on Historical
>> Newspaper Layout Analysis – HNLA2013
>> 3. Breuel (2010) The hOCR Embedded OCR Workflow and Output Format
>>
>>
>> I've cross-posted this from
>> https://github.com/tesseract-ocr/tesseract/issues/42 and will update
>> both with responses. Which is the default Q&A place?
>>
>