Re: [tesseract-ocr] Are character bboxes trustworthy?

Zdenko Podobny Fri, 24 Jul 2020 11:02:21 -0700

Do you use the LSTM or the legacy engine?

If LSTM: search the issue tracker, PRs (and maybe the forum) for the bounding
box problem (and Noah Metzger's patches).


There are rumours that if you need really good bounding boxes you have to
use the latest 3.5 version, because changes in the 4.x version (and later)
also affected the legacy engine's bounding box accuracy (compared to version 3).
But I have never seen a comparison test (especially on a high volume of images).

Zdenko


On Fri, 24 Jul 2020 at 19:01, 'Robin Watts' via tesseract-ocr <
[email protected]> wrote:

> Hi all,
>
> I'm using tesseract as a library, and broadly it seems to be working well.
> I am having some very strange problems with the character boxes I get back
> from the iterator though.
>
> The attached image is a png made from the 8bpp greyscale image that I feed
> it, overlaid with boxes to show all the 'b' characters I get back.
>
> Only one of the 4 'b' characters I get appears to have the box in the
> right place.
>
> The code I'm using to extract the data is:
>
> tesseract::ResultIterator *res_it = api->GetIterator();
> while (!res_it->Empty(tesseract::RIL_BLOCK))
> {
>     if (res_it->Empty(tesseract::RIL_WORD))
>     {
>         res_it->Next(tesseract::RIL_WORD);
>         continue;
>     }
>
>     res_it->BoundingBox(tesseract::RIL_TEXTLINE,
>                         line_bbox, line_bbox+1,
>                         line_bbox+2, line_bbox+3);
>     res_it->BoundingBox(tesseract::RIL_WORD,
>                         word_bbox, word_bbox+1,
>                         word_bbox+2, word_bbox+3);
>     font_name = res_it->WordFontAttributes(&bold,
>                                            &italic,
>                                            &underlined,
>                                            &monospace,
>                                            &serif,
>                                            &smallcaps,
>                                            &pointsize,
>                                            &font_id);
>     do
>     {
>         const char *graph = res_it->GetUTF8Text(tesseract::RIL_SYMBOL);
>         if (graph && graph[0] != 0)
>         {
>             int unicode;
>             res_it->BoundingBox(tesseract::RIL_SYMBOL,
>                                 char_bbox, char_bbox+1,
>                                 char_bbox+2, char_bbox+3);
>             fz_chartorune(&unicode, graph);
>             callback(ctx, arg, unicode, font_name, line_bbox, word_bbox, char_bbox,
>                      pointsize);
>         }
>         res_it->Next(tesseract::RIL_SYMBOL);
>     }
>     while (!res_it->Empty(tesseract::RIL_BLOCK) &&
>            !res_it->IsAtBeginningOf(tesseract::RIL_WORD));
> }
>
> The characters are coming back correctly, and *most* are in the correct
> position. Just a few are shifted.
>
> Is this to be expected? Am I doing something stupid?
>
> (Even being told "It's reliably correct for me" would be helpful at this
> point.)
>
> Thanks,
>
> Robin
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/a1ff6999-1fb8-490d-9da2-0964c3ec3b62o%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/a1ff6999-1fb8-490d-9da2-0964c3ec3b62o%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

