I'm still trying to improve recognition of television subtitles, especially traditional Chinese (see here <https://groups.google.com/g/tesseract-ocr/c/hwX_YFRUXf4/m/x3qtt3zqAgAJ>)
With either the stock *chi_tra* model or our own trained model, it fails on certain text. To investigate, I used the API to render box outlines on the input image, something like:

    mpTessApi = new tesseract::TessBaseAPI();
    mpTessApi->Init(nullptr, mLanguage.c_str());  // chi_tra, eng, etc.
    mpTessApi->SetImage(image);
    // Get the character and bounding box for each detected character
    const char *bt = mpTessApi->GetBoxText(0);

Then I plot the boxes over the original input image. Setting the language to 'eng' does not recognize the text correctly, but the boxes are pretty close:

[image: sub_2.png]

Selecting *chi_tra* or our own model, however, puts the boxes all over the place. The results vary a bit with the Page Segmentation Mode, but none are even close. With stock chi_tra:

*PSM 6*
*[image: sub_2.png]*
*PSM 7*
*[image: sub_2.png]*
*PSM 13*
*[image: sub_2.png]*

I'm planning to fix this, but the in-code documentation is almost non-existent. Can anyone tell me where in the code this is done? That would help a lot with debugging. We're running non-legacy (LSTM-only) mode. Thanks.

PS. While the boxes are wildly off, the output text is mostly accurate: *你可以來接我嗎?* ("Can you come pick me up?") How is that possible? Does that mean GetBoxText() is unreliable?

