Here is the sequence of calls we use to get complete information
about the text in the image:

myTess->SetImage(grayScaleImageData, grayScaleWidth, grayScaleHeight,
1, grayScaleWidth);
BLOCK_LIST* block_list = myTess->FindLinesCreateBlockList();
PAGE_RES* page_res_pass1 = myTess->RecognitionPass1(block_list);
myTess->SetVariable("tessedit_char_blacklist", "P");
matchedChars = myTess->TesseractExtractResult(&textOCR, &lengths,
&costs, &x0, &y0, &x1, &y1, page_res_pass1);

Unfortunately, TesseractExtractResult completely ignores the blacklist
and whitelist variables.

So we tried instead to call this:

myTess->SetImage(grayScaleImageData, grayScaleWidth, grayScaleHeight,
1, grayScaleWidth);
myTess->SetVariable("tessedit_char_blacklist", "P");
char *tmpS = myTess->GetBoxText();

This works (the blacklist is applied), BUT the characters are returned
with no spaces and no newline markers, which makes the output
essentially useless: working out line ends is easy, but working out
spaces is not.

This other alternative includes the spaces:

myTess->SetImage(grayScaleImageData, grayScaleWidth, grayScaleHeight,
1, grayScaleWidth);
myTess->SetVariable("tessedit_char_blacklist", "P");
char *tmpS = myTess->GetUTF8Text();

BUT now the coordinates are not provided ...

I could call GetBoxText() and then GetUTF8Text() to get text +
coordinates + spaces (Recognize is called only once), then stitch the
two together ... but there MUST be an easier way ...
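
For reference, a rough sketch of what that stitching step might look
like. It does not call Tesseract at all: it only merges the two strings
after the fact. StitchBoxesWithText and PlacedSymbol are hypothetical
names, not part of the Tesseract API. The sketch assumes that each
GetBoxText() line starts with "symbol x0 y0 x1 y1" (any trailing fields
are ignored), that both calls return the symbols in the same order, and
that the only separators in the GetUTF8Text() output are spaces and
newlines.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One output symbol: either a recognized character with its box, or a
// separator (space or newline) that has no box of its own.
struct PlacedSymbol {
    std::string text;    // UTF-8 bytes of the symbol, or " " / "\n"
    bool has_box;        // false for separators
    int x0, y0, x1, y1;  // meaningful only when has_box is true
};

// Hypothetical helper (not a Tesseract function): walks the UTF-8 text
// and attaches the next unused box to every non-separator symbol.
// Assumes each box line begins with "symbol x0 y0 x1 y1".
std::vector<PlacedSymbol> StitchBoxesWithText(const std::string& box_text,
                                              const std::string& utf8_text) {
    struct Box { std::string sym; int x0, y0, x1, y1; };

    // Parse the box output, one symbol per line; malformed lines are skipped.
    std::vector<Box> boxes;
    std::istringstream lines(box_text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        Box b;
        if (fields >> b.sym >> b.x0 >> b.y0 >> b.x1 >> b.y1) {
            boxes.push_back(b);
        }
    }

    std::vector<PlacedSymbol> out;
    std::size_t next_box = 0;
    std::size_t pos = 0;
    while (pos < utf8_text.size()) {
        const char c = utf8_text[pos];
        if (c == ' ' || c == '\n') {
            // Separator: keep it in the output, but without coordinates.
            out.push_back({std::string(1, c), false, 0, 0, 0, 0});
            ++pos;
            continue;
        }
        // Non-separator: it must match the next box symbol byte for byte
        // (this also covers multi-byte UTF-8 symbols).
        if (next_box >= boxes.size() ||
            utf8_text.compare(pos, boxes[next_box].sym.size(),
                              boxes[next_box].sym) != 0) {
            break;  // the two outputs diverged; return what matched so far
        }
        const Box& b = boxes[next_box++];
        out.push_back({b.sym, true, b.x0, b.y0, b.x1, b.y1});
        pos += b.sym.size();
    }
    return out;
}
```

The idea would be to pass the string from GetBoxText() as box_text and
the string from GetUTF8Text() as utf8_text. If the two outputs ever
disagree, the sketch stops early and returns only the part that matched.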

Thanks!
