Discussion can be found in the (closed and open) issues ;-). Initial hOCR support [1] comes from issue 263 [2] and was submitted by amkryukov. As you can see, this patch implemented 'ocr_word' and 'xocr_word'. They are not part of the hOCR spec.
'xocr_word' was changed [3] to 'ocrx_word' based on issue 492 [4], which complained about its non-conformity with the hOCR spec. Yesterday David Eger committed a patch that should make the tesseract-ocr hOCR output follow the hOCR spec.

I think we need to split this problem into several parts:

A. Spec conformity. As far as I understand, this is fixed (no reports of non-conformity with the hOCR spec).

B. Usability in other tools. This is a little bit tricky because it needs support from the authors of the other tools (e.g. pdfbeads). Example: if tesseract-ocr produces a valid hOCR document and some tool is not able to process it, then IMO that tool should be fixed... But it depends on the problem. From my point of view pdfbeads 1.0.9 fixed the ocrx_word problem, so issue 711 should be closed.

C. Other problems/enhancements, e.g. "empty words". This needs to be tested (and improved), but I think other tools should be able to process it.

[1] http://code.google.com/p/tesseract-ocr/source/detail?r=333
[2] http://code.google.com/p/tesseract-ocr/issues/detail?id=263&can=1&q=hocr
[3] http://code.google.com/p/tesseract-ocr/source/diff?spec=svn585&r=585&format=side&path=/trunk/api/baseapi.cpp
[4] http://code.google.com/p/tesseract-ocr/issues/detail?id=492

--
Zdenko

On Wed, May 23, 2012 at 11:15 AM, Galt <[email protected]> wrote:
> Thanks, Zdenko!
>
> I found most of those same links too.
>
> FYI here is Tess3.01 output:
>
> <p class='ocr_par'>
> <span class='ocr_line' id='line_1_3' title="bbox 444 293 2633 363">
>
> <span class='ocr_word' id='word_1_5' title="bbox 444 294 577 346">
> <span class='ocrx_word' id='xword_1_5' title="x_wconf -2">Dul</span>
> </span>
> <span class='ocr_word' id='word_1_6' title="bbox 620 298 696 360">
> <span class='ocrx_word' id='xword_1_6' title="x_wconf -2">fé</span>
> </span>
> <span class='ocr_word' id='word_1_7' title="bbox 736 308 816 345">
> <span class='ocrx_word' id='xword_1_7' title="x_wconf -1">na</span>
> </span>
> <span class='ocr_word' id='word_1_8' title="bbox 859 296 1095 363">
> <span class='ocrx_word' id='xword_1_8' title="x_wconf -2">Gréine</span>
> </span>
> <span class='ocr_word' id='word_1_9' title="bbox 1325 332 1337 345">
> <span class='ocrx_word' id='xword_1_9' title="x_wconf -3">.</span>
> </span>
> <span class='ocr_word' id='word_1_10' title="bbox 1605 334 1617 346">
> <span class='ocrx_word' id='xword_1_10' title="x_wconf -1">.</span>
> </span>
> <span class='ocr_word' id='word_1_11' title="bbox 1888 336 1899 346">
> <span class='ocrx_word' id='xword_1_11' title="x_wconf -1">.</span>
> </span>
> <span class='ocr_word' id='word_1_12' title="bbox 2451 335 2462 348">
> <span class='ocrx_word' id='xword_1_12' title="x_wconf -1">.</span>
> </span>
> <span class='ocr_word' id='word_1_13' title="bbox 2599 293 2633 349">
> <span class='ocrx_word' id='xword_1_13' title="x_wconf -7">3</span>
> </span>
>
> </span>
> </p>
>
> In a nutshell, Tess 3.01 outputs this pattern for each word:
>
> <span class='ocr_word' id='word_1_5' title="bbox 444 294 577 346">
> <span class='ocrx_word' id='xword_1_5' title="x_wconf -2">Dul</span>
> </span>
>
> And judging by pdfbeads code, tess 3.00 did something like this for
> each word:
>
> <span class='ocrx_word' id='xword_1_5' title="bbox 444 294 577 346">Dul</span>
>
> pdfbeads 1.0.9 added a hack just to keep it from crashing
> when the ratio was 0 because ocrx_word does not have bbox info.
>
> next if bbox == [0,0,0,0]
>
> This simple change does not actually make it use the bbox info that
> is in ocr_word. In fact, the net result is that only the bbox info from
> the entire line is used, and actual word positions are just guesstimated
> by the pdf viewer -- which is sometimes nearly right, and other times
> horribly wrong.
>
> I assume that the author of pdfbeads (Alexey Kryukov) understands this
> change in the output of Tess3.01. Is he refusing to use ocr_word because
> it is not part of the standard? This was implied by Carlos.
>
> Is there some useful discussion of the hocr output change in 3.01
> somewhere?
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
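To make the difference between the two quoted word patterns concrete, here is a minimal sketch of how a consumer could read both: take the text from the 'ocrx_word' span, use its own bbox if it has one (the Tess 3.00 style), and otherwise fall back to the bbox of the enclosing 'ocr_word' span (the Tess 3.01 style). This is not pdfbeads code (pdfbeads is written in Ruby) and it is not part of tesseract-ocr; the names WordBoxParser and extract_words are made up for this illustration, and it uses only Python's standard html.parser module.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class WordBoxParser(HTMLParser):
    """Hypothetical helper: collects (text, bbox) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []     # list of (text, (x0, y0, x1, y1) or None)
        self._stack = []    # one (class, bbox) entry per open <span>
        self._text = None   # text buffer while inside an ocrx_word span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        cls = attrs.get("class") or ""
        match = BBOX_RE.search(attrs.get("title") or "")
        bbox = tuple(int(g) for g in match.groups()) if match else None
        self._stack.append((cls, bbox))
        if cls == "ocrx_word":
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        cls, bbox = self._stack.pop()
        if cls != "ocrx_word":
            return
        text = "".join(self._text).strip()
        self._text = None
        if bbox is None:
            # Tess 3.01 style: the bbox lives on the enclosing ocr_word span.
            for outer_cls, outer_bbox in reversed(self._stack):
                if outer_cls == "ocr_word":
                    bbox = outer_bbox
                    break
        self.words.append((text, bbox))


def extract_words(hocr):
    """Return a list of (text, bbox) pairs found in an hOCR fragment."""
    parser = WordBoxParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# The two word patterns quoted in this thread.
TESS_301 = """
<span class='ocr_word' id='word_1_5' title="bbox 444 294 577 346">
<span class='ocrx_word' id='xword_1_5' title="x_wconf -2">Dul</span>
</span>
"""
TESS_300 = """
<span class='ocrx_word' id='xword_1_5' title="bbox 444 294 577 346">Dul</span>
"""

if __name__ == "__main__":
    print(extract_words(TESS_301))
    print(extract_words(TESS_300))
```

With this approach both patterns give the same word box, so a tool would not need to fall back to the bbox of the whole line. A word with no bbox anywhere is returned with None rather than skipped, which leaves the decision to the caller.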

