As I mentioned, if you need good bounding boxes you have to use the legacy engine. Several issues and comments explain why it is hard to get accurate bounding boxes, e.g. https://github.com/tesseract-ocr/tesseract/issues/2825#issuecomment-579220987
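
To put numbers on this, one option is to run the same page through both engines and count the symbol boxes that stray outside their word box. The helper below is only a sketch under assumptions: `Box`, `Overhang` and `SuspectSymbols` are hypothetical names, not part of the Tesseract API, and the containment test is a heuristic I am assuming is useful here, not something Tesseract guarantees. The boxes use the left, top, right, bottom order that `ResultIterator::BoundingBox` fills in.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical box type in image coordinates (left, top, right, bottom).
struct Box {
    int left, top, right, bottom;
};

// Number of pixels by which `inner` sticks out of `outer` on its worst
// side; 0 when `inner` lies completely inside `outer`.
int Overhang(const Box& inner, const Box& outer) {
    const int dx = std::max({0, outer.left - inner.left, inner.right - outer.right});
    const int dy = std::max({0, outer.top - inner.top, inner.bottom - outer.bottom});
    return std::max(dx, dy);
}

// Indices of the symbol boxes that stick out of the word box by more
// than `tolerance` pixels. This is a heuristic: it only flags boxes
// that are implausible relative to their word, nothing more.
std::vector<std::size_t> SuspectSymbols(const Box& word,
                                        const std::vector<Box>& symbols,
                                        int tolerance) {
    assert(tolerance >= 0);
    std::vector<std::size_t> suspects;
    for (std::size_t i = 0; i < symbols.size(); ++i) {
        if (Overhang(symbols[i], word) > tolerance) {
            suspects.push_back(i);
        }
    }
    return suspects;
}
```

You would fill `symbols` from the `RIL_SYMBOL` boxes of one word, once with the engine passed to `TessBaseAPI::Init` set to `tesseract::OEM_LSTM_ONLY` and once with `tesseract::OEM_TESSERACT_ONLY` (the legacy mode needs traineddata that still contains the legacy model), and then compare the counts.
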

Zdenko

On Sat, 25 Jul 2020 at 0:44, 'robinw...@googlemail.com' via tesseract-ocr <tesseract-ocr@googlegroups.com> wrote:

>
> Do you use lstm or legacy engine?
>
> lstm.
>
> I can find a couple of Noah Metzger patches:
>
> https://github.com/tesseract-ocr/tesseract/commit/c350077b96077fa50fefe97fbaed04014407f0f1
>
> and
> https://github.com/tesseract-ocr/tesseract/pull/2576
>
> etc, but they've all been merged into master. As far as I can tell from
> his github, all his patches have been pulled in.
>
> I'm using master.
>
> Crap bounding boxes really knock the effectiveness of Tesseract as a
> library :(
>
> Thanks.
> On Friday, 24 July 2020 at 19:01:30 UTC+1 zdenop wrote:
>
>> Do you use lstm or legacy engine?
>>
>> If lstm: search issue tracker/PR/(forum?) for bounding box problem (and Noah
>> Metzger patches)
>>
>> There are rumours that if you need really good bounding boxes you have to
>> use the latest 3.5 version because changes in the 4.x version (and later)
>> also affected legacy engine bounding box accuracy (compared to version 3).
>> But I never saw comparison test (especially on high volume of images)
>>
>> Zdenko
>>
>>
>> On Fri, 24 Jul 2020 at 19:01, 'Robin Watts' via tesseract-ocr <
>> tesser...@googlegroups.com> wrote:
>>
>>> Hi all,
>>>
>>> I'm using tesseract as a library, and broadly it seems to be working
>>> well. I am having some very strange problems with the character boxes I get
>>> back from the iterator though.
>>>
>>> The attached image is a png made from the 8bpp greyscale image that I
>>> feed it, overlaid with boxes to show all the 'b' characters I get back.
>>>
>>> Only one of the 4 'b' characters I get appears to have the box in the
>>> right place.
>>>
>>> The code I'm using to extract the data is:
>>>
>>> tesseract::ResultIterator *res_it = api->GetIterator();
>>> while (!res_it->Empty(tesseract::RIL_BLOCK))
>>> {
>>>     if (res_it->Empty(tesseract::RIL_WORD))
>>>     {
>>>         res_it->Next(tesseract::RIL_WORD);
>>>         continue;
>>>     }
>>>
>>>     res_it->BoundingBox(tesseract::RIL_TEXTLINE,
>>>                         line_bbox, line_bbox+1,
>>>                         line_bbox+2, line_bbox+3);
>>>     res_it->BoundingBox(tesseract::RIL_WORD,
>>>                         word_bbox, word_bbox+1,
>>>                         word_bbox+2, word_bbox+3);
>>>     font_name = res_it->WordFontAttributes(&bold,
>>>                                            &italic,
>>>                                            &underlined,
>>>                                            &monospace,
>>>                                            &serif,
>>>                                            &smallcaps,
>>>                                            &pointsize,
>>>                                            &font_id);
>>>     do
>>>     {
>>>         const char *graph = res_it->GetUTF8Text(tesseract::RIL_SYMBOL);
>>>         if (graph && graph[0] != 0)
>>>         {
>>>             int unicode;
>>>             res_it->BoundingBox(tesseract::RIL_SYMBOL,
>>>                                 char_bbox, char_bbox+1,
>>>                                 char_bbox+2, char_bbox+3);
>>>             fz_chartorune(&unicode, graph);
>>>             callback(ctx, arg, unicode, font_name, line_bbox, word_bbox, char_bbox,
>>>                      pointsize);
>>>         }
>>>         res_it->Next(tesseract::RIL_SYMBOL);
>>>     }
>>>     while (!res_it->Empty(tesseract::RIL_BLOCK) &&
>>>            !res_it->IsAtBeginningOf(tesseract::RIL_WORD));
>>> }
>>>
>>> The characters are coming back correctly, and *most* are in the correct
>>> position. Just a few are shifted.
>>>
>>> Is this to be expected? Am I doing something stupid?
>>>
>>> (Even being told "It's reliably correct for me" would be helpful at this
>>> point.)
>>>
>>> Thanks,
>>>
>>> Robin
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/a1ff6999-1fb8-490d-9da2-0964c3ec3b62o%40googlegroups.com