I just wanted to see Tesseract give me something reasonable back, so I called GetHOCRText(1) to peek at the returned string as a sanity check. The result was an EXC_BAD_ACCESS error, the same error I get when I call boxaGetBox(boxa, 0, L_COPY).
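For reference, when GetHOCRText() does return a usable string, the word rectangles can be read straight out of it without touching the Boxa structure. The following is only a minimal sketch, assuming the hOCR markup tags each word with class ocr_word or ocrx_word and stores its box as "bbox x0 y0 x1 y1" inside the title attribute. WordBox and extractWordBoxes are hypothetical names made up for this illustration, not part of the Tesseract or Leptonica API.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper type: one word rectangle in image pixel coordinates.
struct WordBox {
    int x0, y0, x1, y1;
    int width() const { return x1 - x0; }
    int height() const { return y1 - y0; }
};

// Hypothetical helper: scan an hOCR string tag by tag and collect the
// "bbox x0 y0 x1 y1" values of every tag that looks like a word element.
// Assumption: word elements carry class ocr_word or ocrx_word.
std::vector<WordBox> extractWordBoxes(const std::string& hocr) {
    std::vector<WordBox> boxes;
    std::string::size_type open = 0;
    while ((open = hocr.find('<', open)) != std::string::npos) {
        const std::string::size_type close = hocr.find('>', open);
        if (close == std::string::npos) break;  // truncated markup, stop
        const std::string tag = hocr.substr(open, close - open + 1);
        open = close + 1;

        const bool isWord = tag.find("ocr_word") != std::string::npos ||
                            tag.find("ocrx_word") != std::string::npos;
        if (!isWord) continue;

        const std::string::size_type b = tag.find("bbox ");
        if (b == std::string::npos) continue;  // word tag without a box

        WordBox box;
        // Keep the box only if all four integers were parsed.
        if (std::sscanf(tag.c_str() + b, "bbox %d %d %d %d",
                        &box.x0, &box.y0, &box.x1, &box.y1) == 4) {
            boxes.push_back(box);
        }
    }
    return boxes;
}
```

Each WordBox could then be drawn as a rectangle on the original image. Because the function only ever reads a std::string, a failed or empty recognition yields an empty vector rather than a bad pointer dereference.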
As a desperate move, I am going to speculate that either the iPhone build is buggy (I am using Tesseract on the iPhone, by the way), or Tesseract doesn't recognize the source image as a page: I use an image processing library to trim out everything except the number strip I want, which leaves a source image that is thin and small. But I don't really know where to go from here. Any more pointers, please?

On Sep 9, 5:03 pm, haoest <[email protected]> wrote:
> I will investigate your suggestions as soon as I can. Thank you for
> the pointer, kind sir.
>
> On Sep 8, 8:28 pm, Dmitri Silaev <[email protected]> wrote:
> > Using GetUTF8Text() and then GetWords() is an overkill. However you
> > can examine GetWords() and then GetComponentImages() for typical use
> > of the "PageIterator" class which is a main means to access
> > Tesseract's result details, including bounding boxes, at the API
> > level.
> >
> > Warm regards,
> > Dmitri Silaev
> > www.CustomOCR.com
> >
> > On Thu, Sep 8, 2011 at 8:59 AM, haoest <[email protected]> wrote:
> > > Hi Dmitri,
> > >
> > > here's the snippet:
> > >
> > > //numberStrip is an opencv IplImage object
> > > tess->SetImage((uchar*) numberStrip->imageData, numberStrip->width, numberStrip->height,
> > >                numberStrip->depth / 8,
> > >                numberStrip->widthStep);
> > > text = tess->GetUTF8Text(); //text is fine, it contains digits from the OpenCV image
> > > Boxa* bounds = tess->GetWords(NULL);
> > > l_int32 count = bounds->n; // count > 3 million :(
> > > for(int i=0; i<count; i++){
> > >     Box* b = bounds->box[i];
> > >     /// coords below are all 0's, and sometimes I have bad access
> > >     int x = b->x;
> > >     int y = b->y;
> > >     int w = b->w;
> > >     int h = b->h;
> > > }
> > >
> > > On Sep 7, 4:42 pm, Dmitri Silaev <[email protected]> wrote:
> > >> Well, it's hard to tell without having seen your own code. Send it if
> > >> you can afford.
> > >>
> > >> Warm regards,
> > >> Dmitri Silaev
> > >> www.CustomOCR.com
> > >>
> > >> On Wed, Sep 7, 2011 at 3:31 AM, haoest <[email protected]> wrote:
> > >> > Hi Dmitri,
> > >> >
> > >> > Thanks for the guidance.
> > >> >
> > >> > I looked up GetHOCRText() and compared it with GetWords(Pixa pixa).
> > >> > They do very similar things, as they both get the coordinates through
> > >> > word->bounding_box(); however my tests show that GetHOCRText() produces
> > >> > an html file with correct coordinates for the words, but GetWords
> > >> > still gives me a Boxa object with 3 million words (gibberish). I don't
> > >> > quite know what I did wrong.
> > >> >
> > >> > On Sep 6, 3:37 am, Dmitri Silaev <[email protected]> wrote:
> > >> >> Examine control paths for 'tessedit_create_hocr' variable and see how
> > >> >> rectangle coordinates are being obtained.
> > >> >>
> > >> >> Warm regards,
> > >> >> Dmitri Silaev
> > >> >> www.CustomOCR.com
> > >> >>
> > >> >> On Tue, Sep 6, 2011 at 5:04 AM, haoest <[email protected]> wrote:
> > >> >> > Hello,
> > >> >> >
> > >> >> > I have a very simple OCR app based on Tesseract. After the
> > >> >> > recognition step, I also provide a user verification step that
> > >> >> > allows correction in case OCR is wrong. To improve the user
> > >> >> > interface, I plan to draw a rectangle on top of the OCR-ed
> > >> >> > character on the original input image, and put it side by side
> > >> >> > with the OCR output. To get to that, I need the coordinate of
> > >> >> > the recognized characters.
> > >> >> >
> > >> >> > I tried something like this but it seems to give me gibberish:
> > >> >> >
> > >> >> > ETEXT_DESC output;
> > >> >> > tess->Recognize(&output);
> > >> >> > text = tess->GetUTF8Text();
> > >> >> >
> > >> >> > Now if I access output->count, it gives me some value above 10,000,
> > >> >> > which is obviously wrong because the whole image only has 20 or so.
> > >> >> >
> > >> >> > Am I on the right track? Can I have some direction please?
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

