Thanks, Zdenko! I found most of those same links too.
FYI, here is the Tess 3.01 output:

<p class='ocr_par'>
 <span class='ocr_line' id='line_1_3' title="bbox 444 293 2633 363">
  <span class='ocr_word' id='word_1_5' title="bbox 444 294 577 346">
   <span class='ocrx_word' id='xword_1_5' title="x_wconf -2">Dul</span>
  </span>
  <span class='ocr_word' id='word_1_6' title="bbox 620 298 696 360">
   <span class='ocrx_word' id='xword_1_6' title="x_wconf -2">fé</span>
  </span>
  <span class='ocr_word' id='word_1_7' title="bbox 736 308 816 345">
   <span class='ocrx_word' id='xword_1_7' title="x_wconf -1">na</span>
  </span>
  <span class='ocr_word' id='word_1_8' title="bbox 859 296 1095 363">
   <span class='ocrx_word' id='xword_1_8' title="x_wconf -2">Gréine</span>
  </span>
  <span class='ocr_word' id='word_1_9' title="bbox 1325 332 1337 345">
   <span class='ocrx_word' id='xword_1_9' title="x_wconf -3">.</span>
  </span>
  <span class='ocr_word' id='word_1_10' title="bbox 1605 334 1617 346">
   <span class='ocrx_word' id='xword_1_10' title="x_wconf -1">.</span>
  </span>
  <span class='ocr_word' id='word_1_11' title="bbox 1888 336 1899 346">
   <span class='ocrx_word' id='xword_1_11' title="x_wconf -1">.</span>
  </span>
  <span class='ocr_word' id='word_1_12' title="bbox 2451 335 2462 348">
   <span class='ocrx_word' id='xword_1_12' title="x_wconf -1">.</span>
  </span>
  <span class='ocr_word' id='word_1_13' title="bbox 2599 293 2633 349">
   <span class='ocrx_word' id='xword_1_13' title="x_wconf -7">3</span>
  </span>
 </span>
</p>

In a nutshell, Tess 3.01 outputs this pattern for each word:

<span class='ocr_word' id='word_1_5' title="bbox 444 294 577 346">
 <span class='ocrx_word' id='xword_1_5' title="x_wconf -2">Dul</span>
</span>

And judging by the pdfbeads code, Tess 3.00 did something like this for each word:

<span class='ocrx_word' id='xword_1_5' title="bbox 444 294 577 346">Dul</span>

pdfbeads 1.0.9 added a hack just to keep it from crashing when the ratio was 0 because ocrx_word does not have bbox info:
> next if bbox == [0,0,0,0]

This simple change does not actually make it use the bbox info that is in ocr_word. In fact, the net result is that only the bbox info from the entire line is used, and actual word positions are just guesstimated by the PDF viewer, which is sometimes nearly right and other times horribly wrong.

I assume that the author of pdfbeads (Alexey Kryukov) understands this change in the output of Tess 3.01. Is he refusing to use ocr_word because it is not part of the standard? This was implied by Carlos.

Is there some useful discussion of the hOCR output change in 3.01 somewhere?
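
For what it's worth, here is a minimal sketch of the fallback described above, assuming a consumer simply wants one box per word from either layout. It takes the bbox from the ocrx_word span when it has one (the 3.00-style layout inferred from the pdfbeads code) and otherwise from the nearest enclosing ocr_word span (the 3.01 layout). It is plain Python with only the standard library, not pdfbeads' actual Ruby code, and the names parse_bbox, WordBoxes and word_boxes are made up for the illustration.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None if it has no bbox."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class WordBoxes(HTMLParser):
    """Collect (text, bbox) for every ocrx_word span in an hOCR fragment."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one (class, bbox) entry per currently open <span>
        self.words = []  # collected (text, bbox) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            a = dict(attrs)
            self.stack.append((a.get("class") or "", parse_bbox(a.get("title") or "")))

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack or self.stack[-1][0] != "ocrx_word":
            return
        # Prefer the word's own bbox (3.00-style layout).
        bbox = self.stack[-1][1]
        if bbox is None:
            # 3.01-style layout: borrow the bbox of the nearest enclosing ocr_word.
            for cls, outer in reversed(self.stack[:-1]):
                if cls == "ocr_word" and outer is not None:
                    bbox = outer
                    break
        self.words.append((text, bbox))


def word_boxes(hocr):
    """Parse an hOCR string and return a list of (text, bbox) pairs."""
    parser = WordBoxes()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    sample = (
        "<span class='ocr_word' id='word_1_5' title=\"bbox 444 294 577 346\">"
        "<span class='ocrx_word' id='xword_1_5' title=\"x_wconf -2\">Dul</span>"
        "</span>"
    )
    print(word_boxes(sample))
```

A word whose ocrx_word has no bbox and no enclosing ocr_word comes back with None as its box, so the caller can still decide to skip it, much as the quoted pdfbeads check skips [0,0,0,0].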

