Sorry, I can't help further. As I said before: this looks like a case where you have to run it in a debugger and see what happens.
What DOES jump out are those very odd (HUGE) bbox coordinate numbers: you would expect these to be x/y pixel coordinates in the original image, and /nobody/ has images with over a billion pixels along the horizontal axis! All 4 numbers are suspect, which leads me to suspect that the binary API interface between Go and C++ is possibly broken. No certainty, but this smells pretty bad.

For reference, and to aid your debugging efforts, go and see what the tesseract CLI outputs for x/y coordinates in its hOCR or TSV output modes. The bbox numbers should fall in the same price range, so to speak. ;-)

Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   [email protected]
mobile: +31-6-11 120 978
--------------------------------------------------

On Mon, 3 Nov 2025, 14:47 Harshit Goel, <[email protected]> wrote:

> Hi Ger,
>
> Thanks a lot for the detailed guidance; it was really helpful.
>
> I ran deeper diagnostics and confirmed a few things:
>
> - Running *Tesseract CLI* directly works perfectly and extracts: *NO SMOKING*
> - However, when using *gosseract* from Go, I still get *empty text output*
>   and a single empty bounding box like:
>   Text: [ ], Box: (1476397136,32579)-(1476956064,32579)
>   The image being processed is a valid 8-bit/16-bit PNG (confirmed via the
>   file command).
> - Setting *TESSDATA_PREFIX* or *SetTessdataPrefix("/usr/share/tessdata")*
>   works correctly: no language load errors.
> - Even after forcing engine mode with *tessedit_ocr_engine_mode = 1* (LSTM
>   only) and using *PSM_SPARSE_TEXT*, gosseract still returns empty text.
>
> This makes me think gosseract is initializing Tesseract differently
> (maybe not loading the same configs or missing something in the setup
> phase), because the CLI and Go layer are using the same image and tessdata.
>
> Do you have any suggestions for checking whether gosseract is properly
> initializing TessBaseAPI with the same defaults as the CLI?
>
> Thanks again for your help. Your earlier hint about checking bounding
> boxes and configuration alignment was spot on.
>
> Best regards,
> Harshit
>
> On Sun, Nov 2, 2025 at 4:09 AM Ger Hobbelt <[email protected]> wrote:
>
>> I expect you're in for a debug session.
>>
>> I do not use Go, so here are just a few general tidbits:
>>
>> - You tested with the tesseract CLI. Excellent! That proves things can
>>   go well at the core; one major problem area less to worry about.
>> - Next is the gosseract library/layer itself: how does it talk to
>>   tesseract, what does it pass (and what doesn't it), etc. From a very
>>   swift glance at the code, there's nothing blatantly wrong in their
>>   bindings.cpp, AFAICT. Haven't looked any further than that.
>> - My own usage of tesseract as a library has shown me that getting the
>>   parameters right can be a bit of a hassle sometimes. One of the potential
>>   failure modes is not noticing that tesseract does not receive the same
>>   config baseline setup as when it runs via the CLI: this is where
>>   debugging is mandatory.
>>
>> My first guess would be to make very sure your tesseract config files are
>> loaded the same way. While that can be a bit harsh to do when you're not
>> comfortable with running this stuff in a debugger, here's a preparation
>> step I would definitely look at if I were you:
>>
>> 1. tesseract via your Go code doesn't produce *anything*, while
>> 2. the tesseract CLI does deliver text ("No smoking"),
>>
>> which MAY be due to tesseract not finding any text word bounding boxes
>> when run via the Go-code route.
>>
>> I see they (gosseract) present a GetBoundingBoxes API, so I would first
>> try to run that one to see if I get any boxes at all and, if any, where
>> they are in the image, i.e. do I get (a) no boxes, (b) only gibberish
>> boxes, or (c) at least the ones covering "NO" and "SMOKING", or what?
>> Then try the same for the CLI. (IIRC vanilla tesseract has an option to
>> cough up bboxes only; I haven't used that in a while and I'm running a
>> customized tesseract here, so check code and documentation, don't take me
>> at my word!)
>>
>> To see what I was looking at:
>> https://github.com/otiai10/gosseract/blob/main/tessbridge.cpp#L108
>>
>> If the bounding boxes don't show up in your Go run, then it smells like a
>> config/setup bit not making it into the tesseract engine, so it comes down
>> to debugging the gosseract bindings.cpp interlayer to see what happens,
>> really. Are the CLI and the Go code really, really pointing at the same
>> config search paths, for example?
>> If the bounding boxes show up and match the set from the CLI, we have a
>> serious conundrum.
>>
>> Either way, that's the road I'd travel if walking in your shoes.
>> (If you can debug-step the tesseract CLI the same way, you can perhaps
>> compare both more easily, as the CLI is using the same APIs gosseract
>> is using. There are some differences, but my current bet is that those
>> are not relevant.)
>>
>> Also monitor the gosseract/tesseract run for error and warning messages
>> from tesseract. If it is silent, maybe force it once to barf a
>> hairball, just so you know the error/warning/info outputs are working.
>> Whatever you do, my bet is you have some debugging on the road ahead.
>>
>> Note: I don't do Go, so I haven't used gosseract. This would be my general
>> tactic anyway, though.
>>
>> Met vriendelijke groeten / Best regards,
>>
>> Ger Hobbelt
>>
>> On Fri, Oct 31, 2025 at 6:45 PM Harshit Goel <[email protected]> wrote:
>>
>>> Hi team
>>>
>>> I'm facing an issue where Tesseract OCR works correctly from the CLI,
>>> but returns an empty string when called programmatically using Go (via
>>> gosseract).
>>>
>>> For this particular image:
>>> https://pmi-api.ubconnex.ca/files/icons/2025-03/11c6051eec503f52c43f0de382980d31.png,
>>> the OCR always returns an empty string when running programmatically. Yet
>>> when I run the exact same image manually using Tesseract from the
>>> terminal with the command: *tesseract /tmp/ocr-3678469497.png stdout*
>>>
>>> it correctly detects and returns *NO SMOKING*
>>>
>>> *Environment*
>>>
>>> - OS: Linux (Server)
>>> - Tesseract version: tesseract 5.x (CLI works fine)
>>> - Go binding: github.com/otiai10/gosseract/v2
>>> - Go version: go1.23.x
>>>
>>> I've tried the following approaches, all without effect:
>>>
>>> - Different PSM modes (SPARSE_TEXT, SINGLE_BLOCK, etc.)
>>> - Preprocessing (grayscale, contrast enhancement, flattening
>>>   transparency).
>>> - Verified that the image file is saved correctly and readable by
>>>   Tesseract.
>>> - Tried increasing image size and contrast.
>>>
>>> Is there any known discrepancy between the CLI binary and the gosseract
>>> API in how page segmentation modes or image preprocessing are handled
>>> internally?
>>>
>>> Any insight on why Tesseract detects text in the CLI but the gosseract
>>> binding returns empty output would be very helpful.
>>>
>>> Best Regards,
>>>
>>> Harshit Goel
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/54875e13-9f91-4f45-9eb8-ee8eec4e5846n%40googlegroups.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion visit https://groups.google.com/d/msgid/tesseract-ocr/CAFP60foPnpL2S-N_9bMSmnZjr9qJtKy5%3DJMXZ_--Jwx2XmnqOA%40mail.gmail.com.

