Re: [tesseract-ocr] Are character bboxes trustworthy?

'robinw...@googlemail.com' via tesseract-ocr Fri, 24 Jul 2020 15:45:07 -0700

> Do you use lstm or legacy engine?  

lstm.


I can find a couple of Noah Metzger patches:

https://github.com/tesseract-ocr/tesseract/commit/c350077b96077fa50fefe97fbaed04014407f0f1
 
and 
https://github.com/tesseract-ocr/tesseract/pull/2576

etc., but they've all been merged into master. As far as I can tell from his 
GitHub, all his patches have been pulled in.

I'm using master.

Crap bounding boxes really knock the effectiveness of Tesseract as a 
library :(
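
To put a number on how often the boxes are off, rather than eyeballing an overlay, one cheap check is to test each character box against the word box it was reported with. The sketch below is only an illustration of that idea: `Box`, `box_inside`, `suspect_chars` and the `slack` tolerance are hypothetical names of mine, not part of the Tesseract API. It assumes the same {left, top, right, bottom} layout as the int[4] arrays I pass to BoundingBox(). It will only catch boxes that are degenerate or that leave the word box; a box shifted onto a neighbouring character within the same word would pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same layout as the int[4] arrays handed to BoundingBox():
// {left, top, right, bottom}, in image pixels.
struct Box { int v[4]; };

// True if `inner` lies within `outer`, allowing `slack` pixels on each side.
// `slack` is a hypothetical tolerance, not a Tesseract parameter.
bool box_inside(const Box &inner, const Box &outer, int slack)
{
    assert(slack >= 0);
    return inner.v[0] >= outer.v[0] - slack &&
           inner.v[1] >= outer.v[1] - slack &&
           inner.v[2] <= outer.v[2] + slack &&
           inner.v[3] <= outer.v[3] + slack;
}

// Returns the indices of character boxes that are empty or inverted, or
// that stick out of the word box by more than `slack` pixels.
std::vector<std::size_t> suspect_chars(const std::vector<Box> &chars,
                                       const Box &word, int slack)
{
    std::vector<std::size_t> bad;
    for (std::size_t i = 0; i < chars.size(); ++i)
    {
        const Box &c = chars[i];
        bool degenerate = c.v[2] <= c.v[0] || c.v[3] <= c.v[1];
        if (degenerate || !box_inside(c, word, slack))
            bad.push_back(i);
    }
    return bad;
}
```

Collecting the character boxes for each word in the extraction loop and passing them through `suspect_chars` would give a per-page count of suspect boxes, which could then be compared between engines or versions.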

Thanks.

On Friday, 24 July 2020 at 19:01:30 UTC+1 zdenop wrote:

> Do you use lstm or legacy engine?
>
> If LSTM: search the issue tracker, PRs (and the forum?) for bounding box 
> problems (and the Noah Metzger patches).
>
> There are rumours that if you need really good bounding boxes you have to 
> use the latest 3.5 version, because changes in 4.x (and later) also 
> affected the legacy engine's bounding box accuracy compared to version 3. 
> But I have never seen a comparison test (especially on a high volume of 
> images).
>
> Zdenko
>
>
> On Fri, 24 Jul 2020 at 19:01, 'Robin Watts' via tesseract-ocr <
> tesser...@googlegroups.com> wrote:
>
>> Hi all,
>>
>> I'm using tesseract as a library, and broadly it seems to be working 
>> well. I am having some very strange problems with the character boxes I get 
>> back from the iterator though.
>>
>> The attached image is a png made from the 8bpp greyscale image that I 
>> feed it, overlaid with boxes to show all the 'b' characters I get back.
>>
>> Only one of the 4 'b' characters I get appears to have the box in the 
>> right place.
>>
>> The code I'm using to extract the data is:
>>
>> tesseract::ResultIterator *res_it = api->GetIterator();
>> while (!res_it->Empty(tesseract::RIL_BLOCK))
>> {
>>     if (res_it->Empty(tesseract::RIL_WORD))
>>     {
>>         res_it->Next(tesseract::RIL_WORD);
>>         continue;
>>     }
>>
>>     res_it->BoundingBox(tesseract::RIL_TEXTLINE,
>>                         line_bbox, line_bbox+1,
>>                         line_bbox+2, line_bbox+3);
>>     res_it->BoundingBox(tesseract::RIL_WORD,
>>                         word_bbox, word_bbox+1,
>>                         word_bbox+2, word_bbox+3);
>>     font_name = res_it->WordFontAttributes(&bold,
>>                                            &italic,
>>                                            &underlined,
>>                                            &monospace,
>>                                            &serif,
>>                                            &smallcaps,
>>                                            &pointsize,
>>                                            &font_id);
>>     do
>>     {
>>         const char *graph = res_it->GetUTF8Text(tesseract::RIL_SYMBOL);
>>         if (graph && graph[0] != 0)
>>         {
>>             int unicode;
>>             res_it->BoundingBox(tesseract::RIL_SYMBOL,
>>                                 char_bbox, char_bbox+1,
>>                                 char_bbox+2, char_bbox+3);
>>             fz_chartorune(&unicode, graph);
>>             callback(ctx, arg, unicode, font_name, line_bbox, word_bbox,
>>                      char_bbox, pointsize);
>>         }
>>         /* GetUTF8Text() returns a new[]-allocated string; free it. */
>>         delete[] graph;
>>         res_it->Next(tesseract::RIL_SYMBOL);
>>     }
>>     while (!res_it->Empty(tesseract::RIL_BLOCK) &&
>>            !res_it->IsAtBeginningOf(tesseract::RIL_WORD));
>> }
>>
>> The characters are coming back correctly, and *most* are in the correct 
>> position. Just a few are shifted.
>>
>> Is this to be expected? Am I doing something stupid?
>>
>> (Even being told "It's reliably correct for me" would be helpful at this 
>> point.)
>>
>> Thanks,
>>
>> Robin
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/6c581b86-33cb-48c7-bf00-d8958b048d9cn%40googlegroups.com.
