Zdenko,
Thanks for your work on that! I'm excited about using hOCR for some
projects, so I'm really glad that we're moving towards standards
compliance.
--Sven

On Sat, May 26, 2012 at 2:57 AM, zdenko podobny <[email protected]> wrote:
> Discussion can be found in the (closed and open) Issues (;-) ).
>
> Initial hOCR support[1] comes from issue 263[2] and was submitted
> by amkryukov.
> As you can see, this patch implemented 'ocr_word' and 'xocr_word'. They are
> not part of the hOCR spec.
>
> 'xocr_word' was changed[3] to 'ocrx_word' based on issue 492[4], which
> complained about its non-conformity with the hOCR spec.
>
> Yesterday David Eger committed a patch that should fix the tesseract-ocr
> hOCR output to follow the hOCR spec.
>
> I think we need to split this problem into several parts:
>
> A. Spec conformity. As far as I understand, this is fixed (no reports of
> non-conformity with the hOCR spec).
> B. Usability in other tools. This is a little bit tricky because it needs
> support from the authors of other tools (e.g. pdfbeads). Example: if
> tesseract-ocr produces a valid hOCR document and some tool is not able to
> process it, then IMO that tool should be fixed... But it depends on the
> problem. From my point of view, pdfbeads 1.0.9 fixed the ocrx_word problem,
> so issue 711 should be closed.
> C. Other problems/enhancements: e.g. "empty words". This needs to be tested
> (improved), but I think other tools should be able to process it.
>
> [1] http://code.google.com/p/tesseract-ocr/source/detail?r=333
> [2] http://code.google.com/p/tesseract-ocr/issues/detail?id=263&can=1&q=hocr
> [3] http://code.google.com/p/tesseract-ocr/source/diff?spec=svn585&r=585&format=side&path=/trunk/api/baseapi.cpp
> [4] http://code.google.com/p/tesseract-ocr/issues/detail?id=492
>
> --
> Zdenko
>
> On Wed, May 23, 2012 at 11:15 AM, Galt <[email protected]> wrote:
>>
>> Thanks, Zdenko!
>>
>> I found most of those same links too.
>>
>> FYI here is Tess3.01 output:
>>
>> <p class='ocr_par'>
>> <span class='ocr_line' id='line_1_3' title="bbox 444 293 2633 363">
>>
>> <span class='ocr_word' id='word_1_5' title="bbox 444 294 577 346">
>>  <span class='ocrx_word' id='xword_1_5' title="x_wconf -2">Dul</span>
>> </span>
>> <span class='ocr_word' id='word_1_6' title="bbox 620 298 696 360">
>>  <span class='ocrx_word' id='xword_1_6' title="x_wconf -2">fé</span>
>> </span>
>> <span class='ocr_word' id='word_1_7' title="bbox 736 308 816 345">
>>  <span class='ocrx_word' id='xword_1_7' title="x_wconf -1">na</span>
>> </span>
>> <span class='ocr_word' id='word_1_8' title="bbox 859 296 1095 363">
>>  <span class='ocrx_word' id='xword_1_8' title="x_wconf -2">Gréine</span>
>> </span>
>> <span class='ocr_word' id='word_1_9' title="bbox 1325 332 1337 345">
>>  <span class='ocrx_word' id='xword_1_9' title="x_wconf -3">.</span>
>> </span>
>> <span class='ocr_word' id='word_1_10' title="bbox 1605 334 1617 346">
>>  <span class='ocrx_word' id='xword_1_10' title="x_wconf -1">.</span>
>> </span>
>> <span class='ocr_word' id='word_1_11' title="bbox 1888 336 1899 346">
>>  <span class='ocrx_word' id='xword_1_11' title="x_wconf -1">.</span>
>> </span>
>> <span class='ocr_word' id='word_1_12' title="bbox 2451 335 2462 348">
>>  <span class='ocrx_word' id='xword_1_12' title="x_wconf -1">.</span>
>> </span>
>> <span class='ocr_word' id='word_1_13' title="bbox 2599 293 2633 349">
>>  <span class='ocrx_word' id='xword_1_13' title="x_wconf -7">3</span>
>> </span>
>>
>> </span>
>> </p>
>>
>> In a nutshell, Tess 3.01 outputs this pattern for each word:
>>
>> <span class='ocr_word' id='word_1_5' title="bbox 444 294 577 346">
>>  <span class='ocrx_word' id='xword_1_5' title="x_wconf -2">Dul</span>
>> </span>
>>
>> And judging by pdfbeads code, tess 3.00 did something like this for
>> each word:
>> <span class='ocrx_word' id='xword_1_5' title="bbox 444 294 577 346">Dul</span>
>>
>> pdfbeads 1.0.9 added a hack just to keep it from crashing
>> when the ratio was 0 because ocrx_word does not have bbox info.
>> >         next if bbox == [0,0,0,0]
>> This simple change does not actually make it use the bbox info that
>> is in ocr_word.  In fact, the net result is that only the bbox info
>> from the entire line is used, and actual word positions are just
>> guesstimated by the pdf viewer, which is sometimes nearly right and
>> other times horribly wrong.
>>
>> I assume that the author of pdfbeads (Alexey Kryukov) understands this
>> change in the output of Tess3.01.  Is he refusing to use ocr_word
>> because it is not part of the standard?  This was implied by Carlos.
>>
>> Is there some useful discussion of the hocr output change in 3.01
>> somewhere?
>>
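
For reference, the fallback Galt describes (taking a word's position from the
enclosing ocr_word when the ocrx_word span carries no bbox) could look roughly
like the sketch below. It is Python using only the standard-library
html.parser. The names parse_title, WordBoxParser and word_boxes are made up
for illustration and are not part of pdfbeads or tesseract, and the sketch
assumes only the two word patterns quoted above.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Split an hOCR title such as 'bbox 1 2 3 4; x_wconf -2' into a dict."""
    props = {}
    for chunk in (title or "").split(";"):
        parts = chunk.split()
        if parts:
            props[parts[0]] = parts[1:]
    return props


class WordBoxParser(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word span (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one (class, bbox) entry per currently open <span>
        self.words = []   # finished (text, bbox) pairs
        self.text = None  # text buffer, active only inside an ocrx_word

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        cls = attrs.get("class", "")
        nums = parse_title(attrs.get("title")).get("bbox")
        bbox = tuple(int(n) for n in nums) if nums and len(nums) == 4 else None
        if cls == "ocrx_word":
            if bbox is None:
                # Tess 3.01 pattern: the bbox lives on the enclosing ocr_word.
                for parent_cls, parent_bbox in reversed(self.stack):
                    if parent_cls == "ocr_word" and parent_bbox:
                        bbox = parent_bbox
                        break
            self.text = []
        self.stack.append((cls, bbox))

    def handle_data(self, data):
        if self.text is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self.stack:
            return
        cls, bbox = self.stack.pop()
        if cls == "ocrx_word":
            self.words.append(("".join(self.text).strip(), bbox))
            self.text = None


def word_boxes(hocr):
    """Return [(word_text, bbox_or_None), ...] for an hOCR fragment."""
    parser = WordBoxParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

With this approach both the 3.00-style output (bbox directly on ocrx_word) and
the 3.01-style output (bbox on the wrapping ocr_word) give the same word box,
so a consumer would not have to fall back to the line bbox. A word with no
bbox anywhere comes back with None instead of a guessed position.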
