> Do you use lstm or legacy engine?

lstm.

I can find a couple of Noah Metzger patches:
https://github.com/tesseract-ocr/tesseract/commit/c350077b96077fa50fefe97fbaed04014407f0f1
and https://github.com/tesseract-ocr/tesseract/pull/2576 etc., but they've all been merged into master. As far as I can tell from his GitHub, all his patches have been pulled in.

I'm using master.

Crap bounding boxes really knock the effectiveness of Tesseract as a library :(

Thanks.

On Friday, 24 July 2020 at 19:01:30 UTC+1 zdenop wrote:

> Do you use lstm or legacy engine?
>
> If lstm: search issue tracker/PR/(forum?) for bounding box problem (and Noah
> Metzger patches)
>
> There are rumours that if you need really good bounding boxes you have to
> use the latest 3.5 version because changes in the 4.x version (and later)
> also affected legacy engine bounding box accuracy (compared to version 3).
> But I never saw comparison test (especially on high volume of images)
>
> Zdenko
>
>
> On Fri, 24 Jul 2020 at 19:01, 'Robin Watts' via tesseract-ocr <
> tesser...@googlegroups.com> wrote:
>
>> Hi all,
>>
>> I'm using tesseract as a library, and broadly it seems to be working
>> well. I am having some very strange problems with the character boxes I get
>> back from the iterator though.
>>
>> The attached image is a png made from the 8bpp greyscale image that I
>> feed it, overlaid with boxes to show all the 'b' characters I get back.
>>
>> Only one of the 4 'b' characters I get appears to have the box in the
>> right place.
>>
>> The code I'm using to extract the data is:
>>
>>     tesseract::ResultIterator *res_it = api->GetIterator();
>>     while (!res_it->Empty(tesseract::RIL_BLOCK))
>>     {
>>         if (res_it->Empty(tesseract::RIL_WORD))
>>         {
>>             res_it->Next(tesseract::RIL_WORD);
>>             continue;
>>         }
>>
>>         res_it->BoundingBox(tesseract::RIL_TEXTLINE,
>>                             line_bbox, line_bbox+1,
>>                             line_bbox+2, line_bbox+3);
>>         res_it->BoundingBox(tesseract::RIL_WORD,
>>                             word_bbox, word_bbox+1,
>>                             word_bbox+2, word_bbox+3);
>>         font_name = res_it->WordFontAttributes(&bold,
>>                                                &italic,
>>                                                &underlined,
>>                                                &monospace,
>>                                                &serif,
>>                                                &smallcaps,
>>                                                &pointsize,
>>                                                &font_id);
>>         do
>>         {
>>             const char *graph = res_it->GetUTF8Text(tesseract::RIL_SYMBOL);
>>             if (graph && graph[0] != 0)
>>             {
>>                 int unicode;
>>                 res_it->BoundingBox(tesseract::RIL_SYMBOL,
>>                                     char_bbox, char_bbox+1,
>>                                     char_bbox+2, char_bbox+3);
>>                 fz_chartorune(&unicode, graph);
>>                 callback(ctx, arg, unicode, font_name, line_bbox, word_bbox, char_bbox,
>>                          pointsize);
>>             }
>>             res_it->Next(tesseract::RIL_SYMBOL);
>>         }
>>         while (!res_it->Empty(tesseract::RIL_BLOCK) &&
>>                !res_it->IsAtBeginningOf(tesseract::RIL_WORD));
>>     }
>>
>> The characters are coming back correctly, and *most* are in the correct
>> position. Just a few are shifted.
>>
>> Is this to be expected? Am I doing something stupid?
>>
>> (Even being told "It's reliably correct for me" would be helpful at this
>> point.)
>>
>> Thanks,
>>
>> Robin
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-oc...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/a1ff6999-1fb8-490d-9da2-0964c3ec3b62o%40googlegroups.com
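
For anyone wanting to put a number on "a few are shifted", here is a minimal sketch of one possible sanity check: flag every symbol box that is not contained in its word box. It assumes boxes are stored in the order BoundingBox() fills them (left, top, right, bottom, in image coordinates). The Box struct and the contained_in / find_shifted helpers are hypothetical names made up for this example and are not part of the Tesseract API. It will only catch boxes that stick out of their word box; a box that is shifted but still inside the word will not be flagged.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical box type, same field order that BoundingBox() writes:
// left, top, right, bottom (image coordinates, origin at top-left).
struct Box {
    int left, top, right, bottom;
};

// True when `inner` lies inside `outer`, allowing `slack` pixels of
// overshoot on each side.
bool contained_in(const Box& inner, const Box& outer, int slack = 0) {
    return inner.left   >= outer.left   - slack &&
           inner.top    >= outer.top    - slack &&
           inner.right  <= outer.right  + slack &&
           inner.bottom <= outer.bottom + slack;
}

// Returns the indices of the symbol boxes that fall outside the word box.
std::vector<std::size_t> find_shifted(const std::vector<Box>& symbols,
                                      const Box& word, int slack = 0) {
    std::vector<std::size_t> shifted;
    for (std::size_t i = 0; i < symbols.size(); ++i) {
        if (!contained_in(symbols[i], word, slack)) {
            shifted.push_back(i);
        }
    }
    return shifted;
}
```

In the loop quoted above, the word_bbox and char_bbox arrays could be copied into Box values once per symbol and checked this way, so that the suspicious boxes can be counted across a batch of images rather than spotted by eye.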