Re: How to get OCR-ed character coordinates?

Dmitri Silaev Thu, 08 Sep 2011 19:41:43 -0700

Using GetUTF8Text() and then GetWords() is overkill. However, you
can examine the implementations of GetWords() and GetComponentImages()
for typical use of the "PageIterator" class, which is the main means
of accessing Tesseract's result details, including bounding boxes, at
the API level.
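
As a stopgap while the iterator route is being explored, the
coordinates can also be read back from the hOCR string, which the
quoted message below reports as correct. The following is a minimal
sketch, not Tesseract code: Rect and ParseHocrBoxes are hypothetical
names, and it assumes each box appears in the hOCR text as
"bbox left top right bottom". It uses only the standard library and
converts each box to the x/y/w/h form used in the quoted snippet.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper type: a rectangle in x/y/w/h form, matching the
// fields read from Box in the quoted snippet.
struct Rect {
    int x;
    int y;
    int w;
    int h;
};

// Hypothetical helper: collect every "bbox x0 y0 x1 y1" entry found in
// an hOCR string. Assumes the four values are left, top, right, bottom.
std::vector<Rect> ParseHocrBoxes(const std::string& hocr) {
    std::vector<Rect> boxes;
    const std::string key = "bbox ";
    std::string::size_type pos = 0;
    while ((pos = hocr.find(key, pos)) != std::string::npos) {
        pos += key.size();
        int x0 = 0, y0 = 0, x1 = 0, y1 = 0;
        // Only accept entries where all four integers are present.
        if (std::sscanf(hocr.c_str() + pos, "%d %d %d %d",
                        &x0, &y0, &x1, &y1) == 4) {
            Rect r;
            r.x = x0;
            r.y = y0;
            r.w = x1 - x0;  // width from left/right edges
            r.h = y1 - y0;  // height from top/bottom edges
            boxes.push_back(r);
        }
    }
    return boxes;
}
```

Passing the string returned by GetHOCRText() to ParseHocrBoxes() gives
one rectangle per bbox entry. That includes any page-level or
line-level boxes present in the markup, so filter on the enclosing
element's class if only word boxes are wanted. The thread reports word
coordinates from this route, so per-character boxes would still need
the iterator.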


Warm regards,
Dmitri Silaev
www.CustomOCR.com

On Thu, Sep 8, 2011 at 8:59 AM, haoest <[email protected]> wrote:
> Hi Dmitri,
>
> here's the snippet:
>
>     //numberStrip is an opencv IplImage object
>     tess->SetImage((uchar*) numberStrip->imageData,
>                    numberStrip->width, numberStrip->height,
>                    numberStrip->depth / 8,
>                    numberStrip->widthStep);
>    // text is fine, it contains digits from the OpenCV image
>    text = tess->GetUTF8Text();
>    Boxa* bounds = tess->GetWords(NULL);
>    l_int32 count = bounds->n; // count > 3 million :(
>    for(int i=0; i<count; i++){
>        Box* b = bounds->box[i];
>        /// coords below are all 0's, and sometimes I have bad access
>        int x = b->x;
>        int y = b->y;
>        int w = b->w;
>        int h = b->h;
>    }
>
>
>
> On Sep 7, 4:42 pm, Dmitri Silaev <[email protected]> wrote:
>> Well, it's hard to tell without seeing your code. Send it if
>> you can.
>>
>> Warm regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>
>> On Wed, Sep 7, 2011 at 3:31 AM, haoest <[email protected]> wrote:
>> > Hi Dmitri,
>>
>> > Thanks for the guidance.
>>
>> > I looked up GetHOCRText() and compared it with GetWords(Pixa** pixa).
>> > They do very similar things, as they both get the coordinates through
>> > word->bounding_box(); however, my tests show that GetHOCRText() produces
>> > an HTML file with correct coordinates for the words, but GetWords()
>> > still gives me a Boxa object with 3 million words (gibberish). I don't
>> > quite know what I did wrong.
>>
>> > On Sep 6, 3:37 am, Dmitri Silaev <[email protected]> wrote:
>> >> Examine control paths for 'tessedit_create_hocr' variable and see how
>> >> rectangle coordinates are being obtained.
>>
>> >> Warm regards,
>> >> Dmitri Silaev
>> >> www.CustomOCR.com
>>
>> >> On Tue, Sep 6, 2011 at 5:04 AM, haoest <[email protected]> wrote:
>> >> > Hello,
>>
>> >> > I have a very simple OCR app based on Tesseract. After the recognition
>> >> > step, I also provide a user verification step that allows correction
>> >> > in case OCR is wrong. To improve the user interface, I plan to draw a
>> >> > rectangle on top of the OCR-ed character on the original input image,
>> >> > and put it side by side with the OCR output. To get to that, I need
>> >> > the coordinates of the recognized characters.
>>
>> >> > I tried something like this but it seems to give me gibberish:
>>
>> >> >        ETEXT_DESC output;
>> >> >        tess->Recognize(&output);
>> >> >        text = tess->GetUTF8Text();
>>
>> >> > Now if I access output.count, it gives me some value above 10,000,
>> >> > which is obviously wrong because the whole image only has 20 or so
>> >> > characters.
>>
>> >> > Am I on the right track? Can I have some direction please?
>>
>> >> > --
>> >> > You received this message because you are subscribed to the Google
>> >> > Groups "tesseract-ocr" group.
>> >> > To post to this group, send email to [email protected]
>> >> > To unsubscribe from this group, send email to
>> >> > [email protected]
>> >> > For more options, visit this group at
>> >> > http://groups.google.com/group/tesseract-ocr?hl=en
>>
>

