Here is the sequence of calls we are using to get the complete
information about text in the image:
myTess->SetImage(grayScaleImageData, grayScaleWidth, grayScaleHeight,
1, grayScaleWidth);
BLOCK_LIST* block_list = myTess->FindLinesCreateBlockList();
PAGE_RES* page_res_pass1 = myTess->RecognitionPass1(block_list);
myTess->SetVariable("tessedit_char_blacklist", "P");
matchedChars = myTess->TesseractExtractResult(&textOCR, &lengths,
&costs, &x0, &y0, &x1, &y1, page_res_pass1);
Unfortunately TesseractExtractResult totally ignores the blacklist
and whitelist variables.
So we tried this instead:
myTess->SetImage(grayScaleImageData, grayScaleWidth, grayScaleHeight,
1, grayScaleWidth);
myTess->SetVariable("tessedit_char_blacklist", "P");
char *tmpS = myTess->GetBoxText();
This works (the blacklist is honored) BUT the characters returned come
with no spaces or newline markers at all, which makes the output
essentially useless: working out where lines end is easy, but working
out where the spaces go is not.
This other alternative includes the spaces:
myTess->SetImage(grayScaleImageData, grayScaleWidth, grayScaleHeight,
1, grayScaleWidth);
myTess->SetVariable("tessedit_char_blacklist", "P");
char *tmpS = myTess->GetUTF8Text();
BUT now coordinates are not provided ...
I could call GetBoxText() and then GetUTF8Text() to get text +
coordinates + spaces (Recognize is called only once), then stitch the
two together ... but there MUST be an easier way ...
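For reference, here is a minimal sketch of what that stitching step might look like, kept independent of Tesseract itself. It assumes the box output has one symbol per line in the form "<symbol> x0 y0 x1 y1" (optionally followed by a page number), and that both outputs list the symbols in the same order. Box, Token, ParseBoxText and StitchTextAndBoxes are hypothetical names made up for this illustration, not part of the Tesseract API.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One recognized symbol and its bounding box (hypothetical type).
struct Box {
    std::string symbol;  // UTF-8 bytes of the symbol
    int x0 = 0, y0 = 0, x1 = 0, y1 = 0;
};

// Either a symbol with coordinates, or a separator taken from the plain text.
struct Token {
    enum Kind { kSymbol, kSpace, kNewline };
    Kind kind = kSymbol;
    Box box;  // meaningful only when kind == kSymbol
};

// Parses lines ASSUMED to look like "<symbol> x0 y0 x1 y1 [page]".
// Lines that do not match this shape are skipped.
std::vector<Box> ParseBoxText(const std::string& box_text) {
    std::vector<Box> boxes;
    std::istringstream lines(box_text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        Box b;
        if (fields >> b.symbol >> b.x0 >> b.y0 >> b.x1 >> b.y1) {
            boxes.push_back(b);
        }
    }
    return boxes;
}

// Walks the plain text; spaces and newlines become separator tokens, and
// every other symbol is paired with the next box if the bytes match.
std::vector<Token> StitchTextAndBoxes(const std::string& utf8_text,
                                      const std::vector<Box>& boxes) {
    std::vector<Token> out;
    std::size_t pos = 0;
    std::size_t next_box = 0;
    while (pos < utf8_text.size()) {
        const char c = utf8_text[pos];
        Token t;
        if (c == ' ' || c == '\n') {
            t.kind = (c == ' ') ? Token::kSpace : Token::kNewline;
            out.push_back(t);
            ++pos;
        } else if (next_box < boxes.size() &&
                   utf8_text.compare(pos, boxes[next_box].symbol.size(),
                                     boxes[next_box].symbol) == 0) {
            t.kind = Token::kSymbol;
            t.box = boxes[next_box];
            out.push_back(t);
            pos += t.box.symbol.size();
            ++next_box;
        } else {
            // The two outputs disagree here: skip one byte of text
            // without attaching a box to it.
            ++pos;
        }
    }
    return out;
}
```

With the two strings from the calls above, this would be used roughly as StitchTextAndBoxes(utf8Text, ParseBoxText(boxText)). It only works if both strings come from the same recognition run, and the mismatch handling (silently skipping a byte) is a simplification that real code would probably want to report.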
Thanks!