Sriranga(78yrs),
Here are some instructions and screenshots showing how to use Visual
Studio 2008 to see where dawg2wordlist is crashing on Windows.
Assuming that I have the following folder hierarchy:
    BuildFolder\
        tesseract-3.02\
            tessdata\
                kan.traineddata
        testing\
            kan\
And I ran the following command from the testing folder:
    combine_tessdata -u \BuildFolder\tesseract-3.02\tessdata\kan.traineddata kan\kan.
Then inside \BuildFolder\Testing\kan I will get:
3/09/2012 20:47 82 kan.config
3/09/2012 20:47 1,650 kan.freq-dawg
3/09/2012 20:47 9,086,073 kan.inttemp
3/09/2012 20:47 157,351 kan.normproto
3/09/2012 20:47 226 kan.number-dawg
3/09/2012 20:47 17,065 kan.pffmtable
3/09/2012 20:47 490 kan.punc-dawg
3/09/2012 20:47 155,812 kan.shapetable
3/09/2012 20:47 1,047 kan.unicharambigs
3/09/2012 20:47 109,228 kan.unicharset
3/09/2012 20:47 184,562 kan.word-dawg
Open tesseract-3.02\vs2008\tesseract.sln in Visual Studio 2008. Right-
click on the dawg2wordlist Project in the Solution Explorer and choose
Properties from the popup menu. The Debugging Property pane should
look like this: http://www.screencast.com/t/4eTQi8lZEa.
Start debugging by right-clicking on the dawg2wordlist Project and
choosing Debug -> Step into new instance
(http://www.screencast.com/t/wNw7ziQoQ56).
You'll see something like this (http://www.screencast.com/t/bwuIsPuVw)
with the debugger stopped at the first line of the dawg2wordlist
program. Press F5 or choose Debug -> Continue
(http://www.screencast.com/t/m0k93YPm2Gp) from the menubar to resume
execution.
You'll quickly get an Unhandled exception message
(http://www.screencast.com/t/tfeo3b4JD). Click the Break button and
you'll now see exactly where the error is occurring
(http://www.screencast.com/t/KtBtmvmJ).
Note in particular the suspicious value for the "edge" variable in the
bottom left "Locals" Debugger pane: 0x0003373737373737.
The "Call Stack" Debugger pane on the bottom right shows you which
functions were called to get to the current point.
It appears that something bad is happening here:
    void Dawg::iterate_words_rec(const WERD_CHOICE &word_so_far,
                                 NODE_REF to_explore,
                                 TessCallback1<const char *> *cb) const
    {
      NodeChildVector children;
>>>   this->unichar_ids_of(to_explore, &children);
      for (int i = 0; i < children.size(); i++) {
        WERD_CHOICE next_word(word_so_far);
        next_word.append_unichar_id(children[i].unichar_id, 1, 0.0, 0.0);
        if (this->end_of_word(children[i].edge_ref)) {
          STRING s;
          next_word.string_and_lengths(&s, NULL);
          cb->Run(s.string());
        }
        NODE_REF next = next_node(children[i].edge_ref);
        if (next != 0) {
          iterate_words_rec(next_word, next, cb);
        }
      }
    }
which leads one to think that maybe the dawg loaded by the following
line is corrupt somehow:
tesseract::Dawg *dict = LoadSquishedDawg(unicharset, dawg_file);
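To make it clearer why a garbage "edge" value blows up a recursive walk
like iterate_words_rec, here is a toy sketch. Everything in it (ToyEdge,
ToyDawg, kNoNode, CollectWords) is a made-up stand-in and NOT Tesseract's
actual SquishedDawg layout or API. It assumes the graph is acyclic, and it
simply refuses to follow a reference that points outside the node table
instead of dereferencing it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a DAWG (hypothetical, not Tesseract's data structure):
// each node is a list of outgoing edges, and node 0 is the root.
struct ToyEdge {
  char letter;       // label on this edge
  bool end_of_word;  // true if a word ends on this edge
  std::size_t next;  // index of the target node, or kNoNode if there is none
};

const std::size_t kNoNode = static_cast<std::size_t>(-1);
typedef std::vector<std::vector<ToyEdge> > ToyDawg;

// Recursively collects every word reachable from `node`.
// Returns false (rather than crashing) when an edge refers to a node
// that does not exist, which is what a corrupted file would produce.
bool CollectWords(const ToyDawg &dawg, std::size_t node,
                  const std::string &prefix,
                  std::vector<std::string> *words) {
  assert(words != NULL);
  if (node >= dawg.size()) return false;  // corrupt reference detected
  const std::vector<ToyEdge> &edges = dawg[node];
  for (std::size_t i = 0; i < edges.size(); ++i) {
    std::string next_word = prefix + edges[i].letter;
    if (edges[i].end_of_word) words->push_back(next_word);
    if (edges[i].next != kNoNode &&
        !CollectWords(dawg, edges[i].next, next_word, words)) {
      return false;
    }
  }
  return true;
}
```

With intact data the walk visits every word; with one damaged "next" value
it stops at the first bad reference, which is roughly the point where the
real program hits the unhandled exception.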
LoadSquishedDawg() in training\dawg2wordlist.cpp is:
    tesseract::Dawg *LoadSquishedDawg(const UNICHARSET &unicharset,
                                      const char *filename) {
      const int kDictDebugLevel = 1;
      FILE *dawg_file = fopen(filename, "r");
      if (dawg_file == NULL) {
        tprintf("Could not open %s for reading.\n", filename);
        return NULL;
      }
      tprintf("Loading word list from %s\n", filename);
      tesseract::Dawg *retval = new tesseract::SquishedDawg(
          dawg_file, tesseract::DAWG_TYPE_WORD, "eng", SYSTEM_DAWG_PERM,
          kDictDebugLevel);
      tprintf("Word list loaded.\n");
      fclose(dawg_file);
      return retval;
    }
Aha! When you have a program that works on Unix but crashes on
Windows, you learn to look *very* closely at any fopen() calls
you see. In almost all cases, lines like:
    FILE *dawg_file = fopen(filename, "r");
should instead be:
    FILE *dawg_file = fopen(filename, "rb");
meaning we *don't* want the Windows C runtime to translate line
endings. In text mode, a read collapses each carriage return + linefeed
pair into a single linefeed and treats a Ctrl-Z byte (0x1A) as end of
file, and either of those corrupts binary data such as a dawg.
And in fact, fixing that line causes the program to run successfully.
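If you want to convince yourself of the difference, here is a small
round-trip sketch. ReadFileBinary and WriteFileBinary are hypothetical
helpers I made up for illustration (they are not in the Tesseract
sources). They write and read a buffer that deliberately contains the
troublesome bytes 0x0D, 0x0A and 0x1A, using "wb" and "rb" so nothing gets
translated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: write raw bytes to a file in binary mode.
bool WriteFileBinary(const char *filename,
                     const std::vector<unsigned char> &data) {
  std::FILE *fp = std::fopen(filename, "wb");  // "wb": no LF -> CR LF expansion
  if (fp == NULL) return false;
  std::size_t written =
      data.empty() ? 0 : std::fwrite(&data[0], 1, data.size(), fp);
  bool ok = (written == data.size());
  if (std::fclose(fp) != 0) ok = false;
  return ok;
}

// Hypothetical helper: read a whole file in binary mode into *out.
bool ReadFileBinary(const char *filename, std::vector<unsigned char> *out) {
  assert(out != NULL);
  std::FILE *fp = std::fopen(filename, "rb");  // "rb": no CR LF -> LF, no Ctrl-Z EOF
  if (fp == NULL) return false;
  out->clear();
  unsigned char buf[4096];
  std::size_t n;
  while ((n = std::fread(buf, 1, sizeof(buf), fp)) > 0) {
    out->insert(out->end(), buf, buf + n);
  }
  bool ok = (std::ferror(fp) == 0);
  std::fclose(fp);
  return ok;
}
```

On Unix the "b" makes no difference, which is why this kind of bug only
shows up on Windows. Changing "rb" to "r" in ReadFileBinary on Windows
should give you back fewer bytes than you wrote.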
I've updated the repository [1] so dawg2wordlist should no longer
crash on Windows when doing:
dawg2wordlist kan\kan.unicharset kan\kan.word-dawg word.wordlist
[1] http://code.google.com/p/tesseract-ocr/source/detail?r=703