Re: how to get characters position in tesseract 3.02 r729?

Sven Pedersen Mon, 16 Jul 2012 07:43:35 -0700

hOCR output includes character location -- look into that.
--Sven

On Sat, Jul 14, 2012 at 7:07 AM, moka <[email protected]> wrote:
> I used the source code by "tesseractdotnet", it can get characters position.
>
> "tesseractdotnet" based-on tesseract-ocr v3.01 r590.
>
> When I get the tesseract 3.02 r729 from svn, and use the under code on
> Windows XP(VS2008),
>
> bool succed = api->Recognize(monitor) >= 0;
> succed return true,
> int nChars = head->count;
> the nChars is always zero.
> how to get characters position in tesseract 3.02 r729?
>
> detail code:
>
> int main(int argc, char **argv) {
>
>     char* image1 = "c:\\eurotext.tif";
>     char* lang1 = "eng";
>     int psm = 3;
>     char* text = new char[1024];
>     int rc = 0;
>
>     rc = image2text(image1,lang1,psm,text);
>
>     printf("text: %s\n", text);
> }
>
> int image2text(const char* image,const char* lang,int psm,char* textout){
>     tesseract::PageSegMode pagesegmode = tesseract::PSM_AUTO;
>     pagesegmode = static_cast<tesseract::PageSegMode>(psm);
>
>     tesseract::TessBaseAPI  api;
>
>     // initialize monitor
>     ETEXT_DESC* monitor = NULL;
>     ETEXT_DESC* head = NULL;
>     int fixed_buffer_factor = 100;
>     int MAX_CHAR_RECOGNIZE = 32000;
>     int n = 127;
>     monitor = new ETEXT_DESC[fixed_buffer_factor*n];
>     monitor[1].more_to_come = 127;
>     monitor[1].count = 0;
>
>
>     // initialize api
>     int rc = api.Init(image, lang, tesseract::OEM_DEFAULT);
>     if (rc) {
>         // fprintf(stderr, "Could not initialize tesseract.\n");
>         return 1;
>     }
>
>     // We have 2 possible sources of pagesegmode: a config file and
>     // the command line. For backwards compatability reasons, the
>     // default in tesseract is tesseract::PSM_SINGLE_BLOCK, but the
>     // default for this program is tesseract::PSM_AUTO. We will let
>     // the config file take priority, so the command-line default
>     // can take priority over the tesseract default, so we use the
>     // value from the command line only if the retrieved mode
>     // is still tesseract::PSM_SINGLE_BLOCK, indicating no change
>     // in any config file. Therefore the only way to force
>     // tesseract::PSM_SINGLE_BLOCK is from the command line.
>     // It would be simpler if we could set the value before Init,
>     // but that doesn't work.
>     if (api.GetPageSegMode() == tesseract::PSM_SINGLE_BLOCK)
>         api.SetPageSegMode(pagesegmode);
>
>     FILE* fin = fopen(image, "rb");
>     if (fin == NULL) {
>         // printf("Cannot open input file: %s\n", image);
>         return 2;
>     }
>     fclose(fin);
>
>     PIX   *pixs;
>     if ((pixs = pixRead(image)) == NULL) {
>         // printf("Unsupported image type.\n");
>         return 3;
>     }
>
>     // detect characters
>     api.Clear();
>     api.ClearAdaptiveClassifier();
>     try
>     {
>         api.SetImage(pixs);
>         bool succed = api.Recognize(monitor) >= 0;
>         char* text = api.GetUTF8Text();
>         sprintf(textout, "%s", text);
>         delete text;
>
>         head = &monitor[1];
>
>         int lineIndex=0;
>         int lineIdx = 0;
>         int nChars = head->count;
>         int i = 0;
>         while (i < nChars)
>         {
>             // typedef struct {                  /*single character */
>             // It should be noted that the format for char_code for version
> 2.0 and beyond
>             // is UTF8 which means that ASCII characters will come out as
> one structure but
>             // other characters will be returned in two or more instances of
> this structure
>             // with a single byte of the  UTF8 code in each, but each will
> have the same
>             // bounding box. Programs which want to handle languagues with
> different
>             // characters sets will need to handle extended characters
> appropriately, but
>             // *all* code needs to be prepared to receive UTF8 coded
> characters for
>             // characters such as bullet and fancy quotes.
>             //  uinT16 char_code;              /*character itself */
>             //  inT16 left;                    /*of char (-1) */
>             //  inT16 right;                   /*of char (-1) */
>             //  inT16 top;                     /*of char (-1) */
>             //  inT16 bottom;                  /*of char (-1) */
>             //  inT16 font_index;              /*what font (0) */
>             //  uinT8 confidence;              /*0=perfect, 100=reject
> (0/100) */
>             //  uinT8 point_size;              /*of char, 72=i inch, (10) */
>             //  inT8 blanks;                   /*no of spaces before this
> char (1) */
>             //  uinT8 formatting;              /*char formatting (0) */
>             //} EANYCODE_CHAR;                 /*single character */
>
>
>             EANYCODE_CHAR* ch = &(head + i)->text[0];
>             printf("%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,", ch->char_code,
> ch->left, ch->top, ch->right, ch->bottom, ch->font_index, ch->confidence,
> ch->point_size, ch->formatting);
>             i++; /*go to next char*/
>         } /* end while */
>
>     }
>     catch (...)
>     {
>
>     }
>
>     head = NULL;
>     monitor = NULL;
>
>     pixDestroy(&pixs);
>
>     return 0;
> }
>
>
>
>
>
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en




-- 
``All that is gold does not glitter,
  not all those who wander are lost;
the old that is strong does not wither,
  deep roots are not reached by the frost.
>From the ashes a fire shall be woken,
  a light from the shadows shall spring;
renewed shall be blade that was broken,
  the crownless again shall be king.”

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

Re: how to get characters position in tesseract 3.02 r729?

Reply via email to