Actually, there's already an issue filed on this point:
http://code.google.com/p/tesseract-ocr/issues/detail?id=455&sort=-id
I don't see any progress on it, though.
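
For what it's worth, the C escape syntax in the call quoted below looks fine to me. Here is a minimal sketch (plain C++, no Tesseract calls) that checks the literal really is the UTF-8 encoding of U+FB00 and U+FB01. EncodeUtf8, HexDump and kBlacklist are names I made up for illustration, and the literal is the one from the quoted message with the mail line wrap removed. It says nothing about how Tesseract handles the variable internally.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the Tesseract API): encode one
// Unicode code point as UTF-8.
std::string EncodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: render the bytes of a string as lowercase hex,
// separated by single spaces.
std::string HexDump(const std::string& s) {
    static const char kDigits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (i > 0) out += ' ';
        out += kDigits[b >> 4];
        out += kDigits[b & 0x0F];
    }
    return out;
}

// The blacklist literal from the quoted message, rejoined on one line.
// Each \x escape ends at the next backslash, so this is exactly six bytes.
const std::string kBlacklist = "\xef\xac\x80\xef\xac\x81";
```

HexDump(kBlacklist) gives "ef ac 80 ef ac 81", and kBlacklist compares equal to EncodeUtf8(0xFB00) + EncodeUtf8(0xFB01). So the bytes handed to SetVariable are the intended ones, which would point at how the list is processed rather than at the escapes.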

Warm regards,
Dmitri Silaev

On Thu, Mar 31, 2011 at 7:55 AM, patrickq <[email protected]> wrote:
> Upon further experimentation I think I found out that the whole
> whitelist is rendered irrelevant whenever a character in the blacklist
> is NOT in the training set ... this is crazy of course but it appears
> to be the case, as if the code handling this list decides to stop
> processing the list if one of the characters is not in the training
> set in the first place.
>
> On Mar 30, 10:33 pm, patrickq <[email protected]> wrote:
>> I am trying to provide a black list with UTF8 characters specified
>> using their byte codes, as follows:
>>
>> // U+FB00 ff ef ac 80 LATIN SMALL LIGATURE FF
>> // U+FB01 fi ef ac 81 LATIN SMALL LIGATURE FI
>>
>> myTess->SetVariable("tessedit_char_blacklist", "\xef\xac\x80\xef\xac\x81");
>>
>> But this doesn't work. I tried "\x0ef\x0ac\x080" (adding a leading 0)
>> but same result. The call doesn't return an error but the characters
>> in question are not black listed.
>>
>> Is this string variable not in UTF8 format? Is there a problem in the
>> C syntax I used to provide the hex codes?
>>
>> Thanks!
>> Patrick

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.

