Re: Tess3.01 hocr output not working with pdfbeads

Galt Wed, 23 May 2012 02:59:11 -0700

Thanks, Zdenko!

I found most of those same links too.


FYI here is Tess3.01 output:

<p class='ocr_par'>
<span class='ocr_line' id='line_1_3' title="bbox 444 293 2633 363">

<span class='ocr_word' id='word_1_5' title="bbox 444 294 577 346">
 <span class='ocrx_word' id='xword_1_5' title="x_wconf -2">Dul</span>
</span>
<span class='ocr_word' id='word_1_6' title="bbox 620 298 696 360">
 <span class='ocrx_word' id='xword_1_6' title="x_wconf -2">fé</span>
</span>
<span class='ocr_word' id='word_1_7' title="bbox 736 308 816 345">
 <span class='ocrx_word' id='xword_1_7' title="x_wconf -1">na</span>
</span>
<span class='ocr_word' id='word_1_8' title="bbox 859 296 1095 363">
 <span class='ocrx_word' id='xword_1_8' title="x_wconf -2">Gréine</span>
</span>
<span class='ocr_word' id='word_1_9' title="bbox 1325 332 1337 345">
 <span class='ocrx_word' id='xword_1_9' title="x_wconf -3">.</span>
</span>
<span class='ocr_word' id='word_1_10' title="bbox 1605 334 1617 346">
 <span class='ocrx_word' id='xword_1_10' title="x_wconf -1">.</span>
</span>
<span class='ocr_word' id='word_1_11' title="bbox 1888 336 1899 346">
 <span class='ocrx_word' id='xword_1_11' title="x_wconf -1">.</span>
</span>
<span class='ocr_word' id='word_1_12' title="bbox 2451 335 2462 348">
 <span class='ocrx_word' id='xword_1_12' title="x_wconf -1">.</span>
</span>
<span class='ocr_word' id='word_1_13' title="bbox 2599 293 2633 349">
 <span class='ocrx_word' id='xword_1_13' title="x_wconf -7">3</span>
</span>

</span>
</p>

In a nutshell, Tess 3.01 outputs this pattern for each word:

<span class='ocr_word' id='word_1_5' title="bbox 444 294 577 346">
 <span class='ocrx_word' id='xword_1_5' title="x_wconf -2">Dul</span>
</span>

Judging by the pdfbeads code, Tess 3.00 emitted something like this for
each word:

<span class='ocrx_word' id='xword_1_5' title="bbox 444 294 577 346">Dul</span>
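One possible workaround is to rewrite the 3.01 output into the flat form shown above before handing it to pdfbeads. The following is only a rough Python sketch, not anything taken from pdfbeads (which is written in Ruby). The name flatten_words is hypothetical, and the regex assumes the exact attribute order and quoting seen in the sample output. Note that it drops the x_wconf value.

```python
import re

# Matches the Tess 3.01 pattern: an outer ocr_word span carrying the bbox,
# wrapping an inner ocrx_word span carrying the text.
# Assumption: attributes appear in the same order and quoting as in the sample.
NESTED_WORD = re.compile(
    r"<span class='ocr_word'[^>]*title=\"(bbox \d+ \d+ \d+ \d+)\">\s*"
    r"<span class='ocrx_word' id='([^']*)'[^>]*>(.*?)</span>\s*"
    r"</span>",
    re.DOTALL,
)


def flatten_words(hocr: str) -> str:
    """Rewrite nested 3.01 word spans into a single ocrx_word span with a bbox.

    The inner span's x_wconf title is discarded in the process.
    """
    return NESTED_WORD.sub(
        lambda m: "<span class='ocrx_word' id='%s' title=\"%s\">%s</span>"
        % (m.group(2), m.group(1), m.group(3)),
        hocr,
    )


sample = (
    "<span class='ocr_word' id='word_1_5' title=\"bbox 444 294 577 346\">\n"
    " <span class='ocrx_word' id='xword_1_5' title=\"x_wconf -2\">Dul</span>\n"
    "</span>"
)
print(flatten_words(sample))
```

Whether pdfbeads then places the words correctly is something I have not verified; the sketch only reshapes the markup.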

pdfbeads 1.0.9 added a hack just to keep it from crashing when the
ratio was 0 because ocrx_word has no bbox info:
>         next if bbox == [0,0,0,0]
This simple change does not actually make it use the bbox info that is
in ocr_word. The net result is that only the bbox of the entire line is
used, and the actual word positions are just guesstimated by the PDF
viewer, which is sometimes nearly right and other times horribly wrong.
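To illustrate what using the ocr_word bbox could look like on the consumer side, here is a small Python sketch (again, not pdfbeads code; the names WordBoxes and word_boxes are made up for this example). It assumes that an ocrx_word span without its own bbox should inherit the bbox of the nearest enclosing span, which handles both the 3.00 and the 3.01 layouts.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxes(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one (class, bbox) entry per currently open <span>
        self.words = []  # collected (text, (x0, y0, x1, y1)) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        match = BBOX.search(attrs.get("title") or "")
        if match:
            bbox = tuple(int(v) for v in match.groups())
        else:
            # No bbox on this span: inherit from the enclosing span, if any.
            bbox = self.stack[-1][1] if self.stack else None
        self.stack.append((attrs.get("class"), bbox))

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1][0] == "ocrx_word":
            self.words.append((text, self.stack[-1][1]))


def word_boxes(hocr):
    """Return the list of (text, bbox) pairs found in an hOCR fragment."""
    parser = WordBoxes()
    parser.feed(hocr)
    parser.close()
    return parser.words


line = (
    "<span class='ocr_line' id='line_1_3' title=\"bbox 444 293 2633 363\">"
    "<span class='ocr_word' id='word_1_5' title=\"bbox 444 294 577 346\">"
    " <span class='ocrx_word' id='xword_1_5' title=\"x_wconf -2\">Dul</span>"
    "</span></span>"
)
print(word_boxes(line))
```

With this approach the word gets the ocr_word box rather than the box of the whole line, so the viewer would not have to guess.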

I assume that the author of pdfbeads (Alexey Kryukov) understands this
change in the output of Tess 3.01. Is he refusing to use ocr_word
because it is not part of the standard? This was implied by Carlos.

Is there some useful discussion of the hocr output change in 3.01
somewhere?
