As I mentioned, if you need good bounding boxes you have to use the legacy
engine.
There are several issues and comments explaining why it is hard to get
accurate bounding boxes from the LSTM engine, e.g.
https://github.com/tesseract-ocr/tesseract/issues/2825#issuecomment-579220987
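
Whichever engine is used, a cheap check on the caller's side can at least flag
doubtful boxes. What follows is only a minimal sketch, assuming that a symbol
box which sticks out of its word box is suspect. That is a heuristic, not
something Tesseract guarantees, and a shifted box that stays inside the word
will not be caught. Box, Contains and SuspectSymbols are hypothetical names,
not part of the Tesseract API; the four ints mirror the left, top, right,
bottom values that BoundingBox() fills in.

```cpp
#include <cstddef>
#include <vector>

// Box in image coordinates, in the same order that
// ResultIterator::BoundingBox() writes its four output values.
struct Box {
    int left, top, right, bottom;
};

// Hypothetical helper (not a Tesseract call): true when `inner` lies inside
// `outer`, tolerating `slack` pixels of overshoot on each side.
bool Contains(const Box &outer, const Box &inner, int slack) {
    return inner.left   >= outer.left   - slack &&
           inner.top    >= outer.top    - slack &&
           inner.right  <= outer.right  + slack &&
           inner.bottom <= outer.bottom + slack;
}

// Hypothetical helper: indices of the symbol boxes that stick out of the
// word box they are supposed to belong to.
std::vector<std::size_t> SuspectSymbols(const Box &word,
                                        const std::vector<Box> &symbols,
                                        int slack) {
    std::vector<std::size_t> suspects;
    for (std::size_t i = 0; i < symbols.size(); ++i) {
        if (!Contains(word, symbols[i], slack)) {
            suspects.push_back(i);
        }
    }
    return suspects;
}
```

In a loop like the one quoted below, word_bbox and each char_bbox could be
copied into Box values and the flagged symbols drawn in a different colour, to
see whether the misplaced 'b' boxes are among them.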


Zdenko


On Sat, 25 Jul 2020 at 0:44, 'robinw...@googlemail.com' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> > Do you use lstm or legacy engine?
>
> lstm.
>
> I can find a couple of Noah Metzger patches:
>
>
> https://github.com/tesseract-ocr/tesseract/commit/c350077b96077fa50fefe97fbaed04014407f0f1
>
> and
> https://github.com/tesseract-ocr/tesseract/pull/2576
>
> etc, but they've all been merged into master. As far as I can tell from
> his github, all his patches have been pulled in.
>
> I'm using master.
>
> Crap bounding boxes really knock the effectiveness of Tesseract as a
> library :(
>
> Thanks.
> On Friday, 24 July 2020 at 19:01:30 UTC+1 zdenop wrote:
>
>> Do you use lstm or legacy engine?
>>
>> If lstm: search the issue tracker, PRs (and the forum?) for the bounding
>> box problem (and the Noah Metzger patches).
>>
>> There are rumours that if you need really good bounding boxes you have to
>> use the latest 3.05 version, because changes in the 4.x version (and later)
>> also affected the legacy engine's bounding box accuracy (compared to
>> version 3). But I have never seen a comparison test (especially on a high
>> volume of images).
>>
>> Zdenko
>>
>>
>> On Fri, 24 Jul 2020 at 19:01, 'Robin Watts' via tesseract-ocr <
>> tesser...@googlegroups.com> wrote:
>>
>>> Hi all,
>>>
>>> I'm using tesseract as a library, and broadly it seems to be working
>>> well. I am having some very strange problems with the character boxes I get
>>> back from the iterator though.
>>>
>>> The attached image is a png made from the 8bpp greyscale image that I
>>> feed it, overlaid with boxes to show all the 'b' characters I get back.
>>>
>>> Only one of the 4 'b' characters I get appears to have the box in the
>>> right place.
>>>
>>> The code I'm using to extract the data is:
>>>
>>> tesseract::ResultIterator *res_it = api->GetIterator();
>>> while (!res_it->Empty(tesseract::RIL_BLOCK))
>>> {
>>>     // No word at this position: move on to the next one.
>>>     if (res_it->Empty(tesseract::RIL_WORD))
>>>     {
>>>         res_it->Next(tesseract::RIL_WORD);
>>>         continue;
>>>     }
>>>
>>>     // Line and word boxes plus font attributes, fetched once per word.
>>>     res_it->BoundingBox(tesseract::RIL_TEXTLINE,
>>>                         line_bbox, line_bbox+1,
>>>                         line_bbox+2, line_bbox+3);
>>>     res_it->BoundingBox(tesseract::RIL_WORD,
>>>                         word_bbox, word_bbox+1,
>>>                         word_bbox+2, word_bbox+3);
>>>     font_name = res_it->WordFontAttributes(&bold,
>>>                                            &italic,
>>>                                            &underlined,
>>>                                            &monospace,
>>>                                            &serif,
>>>                                            &smallcaps,
>>>                                            &pointsize,
>>>                                            &font_id);
>>>     // Walk the symbols of the current word.
>>>     do
>>>     {
>>>         // GetUTF8Text() returns a string the caller must delete[].
>>>         const char *graph = res_it->GetUTF8Text(tesseract::RIL_SYMBOL);
>>>         if (graph && graph[0] != 0)
>>>         {
>>>             int unicode;
>>>             res_it->BoundingBox(tesseract::RIL_SYMBOL,
>>>                                 char_bbox, char_bbox+1,
>>>                                 char_bbox+2, char_bbox+3);
>>>             fz_chartorune(&unicode, graph);
>>>             callback(ctx, arg, unicode, font_name, line_bbox, word_bbox,
>>>                      char_bbox, pointsize);
>>>         }
>>>         delete[] graph;
>>>         res_it->Next(tesseract::RIL_SYMBOL);
>>>     }
>>>     while (!res_it->Empty(tesseract::RIL_BLOCK) &&
>>>            !res_it->IsAtBeginningOf(tesseract::RIL_WORD));
>>> }
>>>
>>> The characters are coming back correctly, and *most* are in the correct
>>> position. Just a few are shifted.
>>>
>>> Is this to be expected? Am I doing something stupid?
>>>
>>> (Even being told "It's reliably correct for me" would be helpful at this
>>> point.)
>>>
>>> Thanks,
>>>
>>> Robin
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/a1ff6999-1fb8-490d-9da2-0964c3ec3b62o%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/a1ff6999-1fb8-490d-9da2-0964c3ec3b62o%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
