Sriranga(78yrs),

Here are some instructions and pictures showing how to use Visual
Studio 2008 to see where dawg2wordlist is crashing on Windows.

Assuming that I have the following folder hierarchy:

   BuildFolder\
      tesseract-3.02\
         tessdata\
            kan.traineddata
      testing\
         kan\

And I ran the following command from the testing folder:

   combine_tessdata -u \BuildFolder\tesseract-3.02\tessdata\kan.traineddata kan\kan.

Then inside \BuildFolder\testing\kan I will get:

   3/09/2012  20:47              82  kan.config
   3/09/2012  20:47           1,650  kan.freq-dawg
   3/09/2012  20:47       9,086,073  kan.inttemp
   3/09/2012  20:47         157,351  kan.normproto
   3/09/2012  20:47             226  kan.number-dawg
   3/09/2012  20:47          17,065  kan.pffmtable
   3/09/2012  20:47             490  kan.punc-dawg
   3/09/2012  20:47         155,812  kan.shapetable
   3/09/2012  20:47           1,047  kan.unicharambigs
   3/09/2012  20:47         109,228  kan.unicharset
   3/09/2012  20:47         184,562  kan.word-dawg

Open tesseract-3.02\vs2008\tesseract.sln in Visual Studio 2008.
Right-click on the dawg2wordlist Project in the Solution Explorer and
choose Properties from the popup menu. The Debugging Property pane
should look like this: http://www.screencast.com/t/4eTQi8lZEa.

Start debugging by right-clicking on the dawg2wordlist Project and
choosing Debug -> Step into new instance
(http://www.screencast.com/t/wNw7ziQoQ56).

You'll see something like this (http://www.screencast.com/t/bwuIsPuVw)
with the debugger stopped at the first line of the dawg2wordlist
program. Press F5 or choose Debug -> Continue
(http://www.screencast.com/t/m0k93YPm2Gp) from the menubar to let the
program run.

You'll quickly get an Unhandled exception message
(http://www.screencast.com/t/tfeo3b4JD). Click the Break button and
you'll now see exactly where the error is occurring
(http://www.screencast.com/t/KtBtmvmJ).

Note in particular the suspicious value for the "edge" variable in the
bottom left "Locals" Debugger pane: 0x0003373737373737.

The "Call Stack" Debugger pane on the bottom right shows you which
functions were called to get to the current point.

It appears that something bad is happening here:

   void Dawg::iterate_words_rec(const WERD_CHOICE &word_so_far,
                                NODE_REF to_explore,
                                TessCallback1<const char *> *cb) const {
     NodeChildVector children;
>>>  this->unichar_ids_of(to_explore, &children);
     for (int i = 0; i < children.size(); i++) {
       WERD_CHOICE next_word(word_so_far);
       next_word.append_unichar_id(children[i].unichar_id, 1, 0.0, 0.0);
       if (this->end_of_word(children[i].edge_ref)) {
         STRING s;
         next_word.string_and_lengths(&s, NULL);
         cb->Run(s.string());
       }
       NODE_REF next = next_node(children[i].edge_ref);
       if (next != 0) {
         iterate_words_rec(next_word, next, cb);
       }
     }
   }

which leads one to think that maybe the dawg loaded by the following
line is corrupt somehow:

   tesseract::Dawg *dict = LoadSquishedDawg(unicharset, dawg_file);

LoadSquishedDawg() in training\dawg2wordlist.cpp is:

   tesseract::Dawg *LoadSquishedDawg(const UNICHARSET &unicharset,
                                     const char *filename) {
     const int kDictDebugLevel = 1;
     FILE *dawg_file = fopen(filename, "r");
     if (dawg_file == NULL) {
       tprintf("Could not open %s for reading.\n", filename);
       return NULL;
     }
     tprintf("Loading word list from %s\n", filename);
     tesseract::Dawg *retval = new tesseract::SquishedDawg(
         dawg_file, tesseract::DAWG_TYPE_WORD, "eng", SYSTEM_DAWG_PERM,
         kDictDebugLevel);
     tprintf("Word list loaded.\n");
     fclose(dawg_file);
     return retval;
   }

Aha! When you have a program that works on unix but crashes on
Windows, then you learn to look *very* closely at any fopen() calls
you see. In almost all cases, lines like:

     FILE *dawg_file = fopen(filename, "r");

should instead be:

     FILE *dawg_file = fopen(filename, "rb");

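meaning we *don't* want the C runtime to translate line endings. In
text mode on Windows, each carriage return + linefeed pair read from
the file is collapsed to a single linefeed, and a Ctrl-Z (0x1A) byte
is treated as end-of-file, either of which corrupts binary data such
as a dawg.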
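
To see the effect in isolation, here is a minimal sketch (not part of
Tesseract) that writes a few bytes and counts how many fread() hands
back for a given fopen() mode. The helper names WriteBytes and
CountBytesRead, and the test file name in the note below, are made up
for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: writes raw bytes to a file in binary mode.
// Returns true if every byte was written and the file closed cleanly.
bool WriteBytes(const char *path, const std::vector<unsigned char> &bytes) {
  assert(path != NULL);
  FILE *fp = fopen(path, "wb");  // "wb": no newline translation on write
  if (fp == NULL) return false;
  size_t written = 0;
  if (!bytes.empty()) {
    written = fwrite(&bytes[0], 1, bytes.size(), fp);
  }
  bool closed = (fclose(fp) == 0);
  return closed && written == bytes.size();
}

// Hypothetical helper: counts the bytes fread() delivers when the file
// is opened with the given mode ("r" or "rb").
// Returns -1 if the file cannot be opened.
long CountBytesRead(const char *path, const char *mode) {
  assert(path != NULL && mode != NULL);
  FILE *fp = fopen(path, mode);
  if (fp == NULL) return -1;
  long total = 0;
  unsigned char buf[256];
  size_t n;
  while ((n = fread(buf, 1, sizeof(buf), fp)) > 0) {
    total += static_cast<long>(n);
  }
  fclose(fp);
  return total;
}
```

If you write the five bytes 0x37 0x0D 0x0A 0x1A 0x37 to mode_test.bin
with WriteBytes(), CountBytesRead("mode_test.bin", "rb") returns 5 on
any platform. With "r", I'd expect the same 5 on unix but fewer on
Windows, because the CR+LF pair is collapsed and the 0x1A byte ends
the read early.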

And in fact, fixing that line causes the program to run successfully.
I've updated the repository [1] so dawg2wordlist should no longer
crash on Windows when doing:

   dawg2wordlist kan\kan.unicharset kan\kan.word-dawg word.wordlist

[1] http://code.google.com/p/tesseract-ocr/source/detail?r=703
