I'm still trying to improve recognition of television subtitles, especially 
traditional Chinese (see here 
<https://groups.google.com/g/tesseract-ocr/c/hwX_YFRUXf4/m/x3qtt3zqAgAJ>)

With either the stock *chi_tra* model or our own trained model, recognition fails on 
certain text. To investigate, I used the API to render box outlines on the 
input image. Something like:

mpTessApi = new tesseract::TessBaseAPI();
mpTessApi->Init(nullptr, mLanguage.c_str()); // chi_tra, eng, etc.
mpTessApi->SetImage(image);

// Get the character and box rect for each detected character.
// GetBoxText() takes a page number; the caller must delete[] the result.
const char *bt = mpTessApi->GetBoxText(0);

Then plot the boxes over the original input image.
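In case it helps anyone reproduce the plotting step, here is a minimal sketch (pure stdlib, no Tesseract dependency) of how I parse the GetBoxText() output. Each line is in Tesseract's box-file format — `symbol left bottom right top page` — with the origin at the *bottom-left* of the image, so the y coordinates have to be flipped before drawing with a top-left-origin API. The `BoxRect` struct and `imageHeight` parameter are my own names for illustration, not part of the Tesseract API:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper type, not part of the Tesseract API.
struct BoxRect {
    std::string symbol; // UTF-8 character reported by Tesseract
    int x, y, w, h;     // top-left origin, ready for most drawing APIs
};

// Parse the text returned by GetBoxText(). Each line is
// "symbol left bottom right top page" with (0,0) at the image's
// bottom-left corner, so flip y using the image height.
std::vector<BoxRect> ParseBoxText(const std::string &boxText, int imageHeight) {
    std::vector<BoxRect> rects;
    std::istringstream lines(boxText);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string symbol;
        int left, bottom, right, top, page;
        if (fields >> symbol >> left >> bottom >> right >> top >> page) {
            BoxRect r;
            r.symbol = symbol;
            r.x = left;
            r.y = imageHeight - top; // flip to a top-left origin
            r.w = right - left;
            r.h = top - bottom;
            rects.push_back(r);
        }
    }
    return rects;
}
```

Each resulting rect can then be drawn over the input image with whatever imaging library you have on hand (e.g. cv::rectangle in OpenCV).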

Setting the language to 'eng' does not recognize the text correctly, but the 
boxes are pretty close:
[image: sub_2.png]

But with *chi_tra* or our own model, the boxes are scattered all over the 
place. Results vary somewhat with the Page Segmentation Mode (PSM), but none 
are even close.

With stock chi_tra:
*PSM 6*

[image: sub_2.png]

*PSM 7*

[image: sub_2.png]

*PSM 13*

[image: sub_2.png]
I'm planning to fix this, but the in-code documentation is almost 
non-existent.

Can anyone tell me where in the code these boxes are computed? That would 
help a lot with debugging. We're running in non-legacy (LSTM) mode.
Thanks.

PS. While the boxes are wildly all over the place, the output text is 
mostly accurate:

*你可以來接我嗎?* ("Can you come pick me up?")

How is that possible? Does that mean GetBoxText() is unreliable?