Zdenko,

Thanks for your work on that! I'm excited about using hOCR for some projects, so I'm really glad that we're moving towards standards compliance.

--Sven
On Sat, May 26, 2012 at 2:57 AM, zdenko podobny <[email protected]> wrote:
> Discussion can be found in the (closed and open) Issues (;-) ).
>
> Initial hOCR support[1] comes from issue 263[2] and was submitted
> by amkryukov.
> As you can see, this patch implemented 'ocr_word' and 'xocr_word'. They are
> not part of the hOCR spec.
>
> 'xocr_word' was changed[3] to 'ocrx_word' based on issue 492[4], which
> complained about its non-conformity with the hOCR spec.
>
> Yesterday David Eger committed a patch that should fix tesseract-ocr hOCR
> output to follow the hOCR spec.
>
> I think we need to split this problem into several parts:
>
> A. Spec conformity. As far as I understand, this is fixed (no reports of
> non-conformity to the hOCR spec).
> B. Usability in other tools. This is a little bit tricky because it needs
> support from the authors of other tools (e.g. pdfbeads). Example: if
> tesseract-ocr produces a valid hOCR document and some tool is not able to
> process it, then IMO that tool should be fixed... But it depends on the
> problem. From my point of view pdfbeads 1.0.9 fixed the ocrx_word problem,
> so issue 711 should be closed.
> C. Other problems/enhancements: e.g. "empty words". This needs to be tested
> (improved), but I think other tools should be able to process it.
>
> [1] http://code.google.com/p/tesseract-ocr/source/detail?r=333
> [2] http://code.google.com/p/tesseract-ocr/issues/detail?id=263&can=1&q=hocr
> [3] http://code.google.com/p/tesseract-ocr/source/diff?spec=svn585&r=585&format=side&path=/trunk/api/baseapi.cpp
> [4] http://code.google.com/p/tesseract-ocr/issues/detail?id=492
>
> --
> Zdenko
>
> On Wed, May 23, 2012 at 11:15 AM, Galt <[email protected]> wrote:
>>
>> Thanks, Zdenko!
>>
>> I found most of those same links too.
>>
>> FYI here is Tess3.01 output:
>>
>> <p class='ocr_par'>
>> <span class='ocr_line' id='line_1_3' title="bbox 444 293 2633 363">
>>
>> <span class='ocr_word' id='word_1_5' title="bbox 444 294 577 346">
>> <span class='ocrx_word' id='xword_1_5' title="x_wconf -2">Dul</span>
>> </span>
>> <span class='ocr_word' id='word_1_6' title="bbox 620 298 696 360">
>> <span class='ocrx_word' id='xword_1_6' title="x_wconf -2">fé</span>
>> </span>
>> <span class='ocr_word' id='word_1_7' title="bbox 736 308 816 345">
>> <span class='ocrx_word' id='xword_1_7' title="x_wconf -1">na</span>
>> </span>
>> <span class='ocr_word' id='word_1_8' title="bbox 859 296 1095 363">
>> <span class='ocrx_word' id='xword_1_8' title="x_wconf -2">Gréine</span>
>> </span>
>> <span class='ocr_word' id='word_1_9' title="bbox 1325 332 1337 345">
>> <span class='ocrx_word' id='xword_1_9' title="x_wconf -3">.</span>
>> </span>
>> <span class='ocr_word' id='word_1_10' title="bbox 1605 334 1617 346">
>> <span class='ocrx_word' id='xword_1_10' title="x_wconf -1">.</span>
>> </span>
>> <span class='ocr_word' id='word_1_11' title="bbox 1888 336 1899 346">
>> <span class='ocrx_word' id='xword_1_11' title="x_wconf -1">.</span>
>> </span>
>> <span class='ocr_word' id='word_1_12' title="bbox 2451 335 2462 348">
>> <span class='ocrx_word' id='xword_1_12' title="x_wconf -1">.</span>
>> </span>
>> <span class='ocr_word' id='word_1_13' title="bbox 2599 293 2633 349">
>> <span class='ocrx_word' id='xword_1_13' title="x_wconf -7">3</span>
>> </span>
>>
>> </span>
>> </p>
>>
>> In a nutshell, Tess 3.01 outputs this pattern for each word:
>>
>> <span class='ocr_word' id='word_1_5' title="bbox 444 294 577 346">
>> <span class='ocrx_word' id='xword_1_5' title="x_wconf -2">Dul</span>
>> </span>
>>
>> And judging by the pdfbeads code, Tess 3.00 did something like this for
>> each word:
>> <span class='ocrx_word' id='xword_1_5' title="bbox 444 294 577 346">Dul</span>
>>
>> pdfbeads 1.0.9 added a hack just to keep it from crashing
>> when the ratio was 0 because ocrx_word does not have bbox info.
>> > next if bbox == [0,0,0,0]
>> This simple change does not actually make it use the bbox info that
>> is in ocr_word. In fact, the net result is that only the bbox info
>> from the entire line is used, and actual word positions are just
>> guesstimated by the pdf viewer -- which is sometimes nearly right,
>> and other times horribly wrong.
>>
>> I assume that the author of pdfbeads (Alexey Kryukov) understands this
>> change in the output of Tess3.01. Is he refusing to use ocr_word
>> because it is not part of the standard? This was implied by Carlos.
>>
>> Is there some useful discussion of the hOCR output change in 3.01
>> somewhere?
>>
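For what it's worth, the fallback Galt describes (take the bbox from the enclosing ocr_word when the ocrx_word span carries none) can be sketched in a few lines of Python. This is only an illustration built on the standard-library html.parser, not the pdfbeads code; the names parse_title, WordBoxes and word_boxes are made up for this sketch, and it assumes title properties are separated by ';' and that each bbox has four integer values.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Split an hOCR title such as 'bbox 1 2 3 4; x_wconf -2' into a dict."""
    props = {}
    for part in (title or "").split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


class WordBoxes(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word span.

    Hypothetical helper: if the ocrx_word has no bbox of its own (the
    Tess 3.01 layout), the bbox of the nearest enclosing ocr_word is used.
    """

    def __init__(self):
        super().__init__()
        self.stack = []  # one (class, bbox-or-None) entry per open <span>
        self.words = []  # results: (text, (x0, y0, x1, y1) or None)
        self.text = []   # text pieces of the ocrx_word being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        props = parse_title(attrs.get("title"))
        bbox = tuple(int(v) for v in props["bbox"]) if "bbox" in props else None
        self.stack.append((attrs.get("class"), bbox))
        if attrs.get("class") == "ocrx_word":
            self.text = []

    def handle_data(self, data):
        # Only keep text that sits directly inside an ocrx_word span.
        if self.stack and self.stack[-1][0] == "ocrx_word":
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self.stack:
            return
        cls, bbox = self.stack.pop()
        if cls != "ocrx_word":
            return
        if bbox is None:
            # Tess 3.01 layout: look outward for the wrapping ocr_word box.
            for outer_cls, outer_bbox in reversed(self.stack):
                if outer_cls == "ocr_word" and outer_bbox is not None:
                    bbox = outer_bbox
                    break
        self.words.append(("".join(self.text).strip(), bbox))


def word_boxes(hocr):
    """Return a list of (text, bbox) tuples for an hOCR fragment."""
    parser = WordBoxes()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Fed either the nested 3.01 pattern or the flat 3.00 pattern quoted above, word_boxes() should give the same per-word box for "Dul", so a consumer would not have to fall back to the line-level bbox.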

