Calling GetUTF8Text() and then GetWords() is overkill. However, you can examine GetWords() and then GetComponentImages() for typical use of the "PageIterator" class, which is the main means of accessing Tesseract's result details, including bounding boxes, at the API level.
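For illustration, here is a minimal sketch of the iteration pattern that PageIterator-style code follows. It assumes the iterator exposes BoundingBox(level, &left, &top, &right, &bottom) and Next(level), as tesseract::PageIterator does; Rect and CollectBoxes are hypothetical names made up for this example, not part of Tesseract. The function is templated so that the sketch compiles without the Tesseract headers.

```cpp
#include <vector>

// Hypothetical helper type for this sketch (not a Tesseract type).
struct Rect { int x, y, w, h; };

// Walks a PageIterator-style object and collects one rectangle per element.
// Assumes the iterator provides:
//   bool BoundingBox(Level, int* left, int* top, int* right, int* bottom)
//   bool Next(Level)
template <typename Iterator, typename Level>
std::vector<Rect> CollectBoxes(Iterator* it, Level level) {
  std::vector<Rect> boxes;
  if (it == nullptr) return boxes;  // no iterator, so nothing to report
  do {
    int left, top, right, bottom;
    // Skip elements for which no box is available.
    if (it->BoundingBox(level, &left, &top, &right, &bottom)) {
      boxes.push_back({left, top, right - left, bottom - top});
    }
  } while (it->Next(level));  // Next() returns false after the last element
  return boxes;
}

// Possible use with the real API (assumes a Tesseract version that
// provides TessBaseAPI::GetIterator(); check your headers):
//   tess->Recognize(NULL);
//   tesseract::ResultIterator* it = tess->GetIterator();
//   std::vector<Rect> boxes = CollectBoxes(it, tesseract::RIL_SYMBOL);
//   delete it;
```

Whether GetIterator() and the RIL_SYMBOL level are available depends on the Tesseract version you build against, so treat the commented usage as an assumption to verify rather than a recipe.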

Warm regards,
Dmitri Silaev
www.CustomOCR.com

On Thu, Sep 8, 2011 at 8:59 AM, haoest <[email protected]> wrote:
> Hi Dmitri,
>
> here's the snippet:
>
> //numberStrip is an opencv IplImage object
> tess->SetImage((uchar*) numberStrip->imageData,
>                numberStrip->width, numberStrip->height,
>                numberStrip->depth / 8,
>                numberStrip->widthStep);
> text = tess->GetUTF8Text(); //text is fine, it contains digits
>                             //from the OpenCV image
> Boxa* bounds = tess->GetWords(NULL);
> l_int32 count = bounds->n; // count > 3 million :(
> for(int i=0; i<count; i++){
>     Box* b = bounds->box[i];
>     /// coords below are all 0's, and sometimes I have bad access
>     int x = b->x;
>     int y = b->y;
>     int w = b->w;
>     int h = b->h;
> }
>
> On Sep 7, 4:42 pm, Dmitri Silaev <[email protected]> wrote:
>> Well, it's hard to tell without having seen your own code. Send it if
>> you can.
>>
>> Warm regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>
>> On Wed, Sep 7, 2011 at 3:31 AM, haoest <[email protected]> wrote:
>> > Hi Dmitri,
>> >
>> > Thanks for the guidance.
>> >
>> > I looked up GetHOCRText() and compared it with GetWords(Pixa pixa).
>> > They do very similar things, as they both get the coordinates through
>> > word->bounding_box(); however, my tests show that GetHOCRText() produces
>> > an html file with correct coordinates for the words, but GetWords
>> > still gives me a Boxa object with 3 million words (gibberish). I don't
>> > quite know what I did wrong.
>> >
>> > On Sep 6, 3:37 am, Dmitri Silaev <[email protected]> wrote:
>> >> Examine control paths for 'tessedit_create_hocr' variable and see how
>> >> rectangle coordinates are being obtained.
>> >>
>> >> Warm regards,
>> >> Dmitri Silaev
>> >> www.CustomOCR.com
>> >>
>> >> On Tue, Sep 6, 2011 at 5:04 AM, haoest <[email protected]> wrote:
>> >> > Hello,
>> >> >
>> >> > I have a very simple OCR app based on Tesseract. After the recognition
>> >> > step, I also provide a user verification step that allows correction
>> >> > in case OCR is wrong. To improve the user interface, I plan to draw a
>> >> > rectangle on top of the OCR-ed character on the original input image,
>> >> > and put it side by side with the OCR output. To get to that, I need
>> >> > the coordinate of the recognized characters.
>> >> >
>> >> > I tried something like this but it seems to give me gibberish:
>> >> >
>> >> > ETEXT_DESC output;
>> >> > tess->Recognize(&output);
>> >> > text = tess->GetUTF8Text();
>> >> >
>> >> > Now if I access output->count, it gives me some value above 10,000,
>> >> > which is obviously wrong because the whole image only has 20 or so.
>> >> >
>> >> > Am I on the right track? Can I have some direction please?
>> >> >
>> >> > --
>> >> > You received this message because you are subscribed to the Google
>> >> > Groups "tesseract-ocr" group.
>> >> > To post to this group, send email to [email protected]
>> >> > To unsubscribe from this group, send email to
>> >> > [email protected]
>> >> > For more options, visit this group at
>> >> > http://groups.google.com/group/tesseract-ocr?hl=en

