After further experimentation, I think I have found that the whole list is rendered irrelevant whenever a character in the blacklist is NOT in the training set. This is crazy, of course, but it appears to be the case: it is as if the code handling this list stops processing it as soon as it meets a character that is not in the training set in the first place.
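To rule out the C escape syntax as the culprit, here is a minimal sketch that decodes the blacklist string from my original post and checks that it really is the UTF-8 encoding of U+FB00 and U+FB01. `DecodeUtf8`, `kBlacklist` and `kPadded` are hypothetical names of my own, not part of the Tesseract API, and the decoder only handles well-formed 1- to 4-byte sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (NOT part of Tesseract): decode a UTF-8 byte string
// into Unicode code points. Returns an empty vector on malformed input.
std::vector<char32_t> DecodeUtf8(const std::string& s) {
  std::vector<char32_t> out;
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (lead < 0x80) {                 // 0xxxxxxx: 1 byte
      len = 1; cp = lead;
    } else if ((lead & 0xE0) == 0xC0) { // 110xxxxx: 2 bytes
      len = 2; cp = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) { // 1110xxxx: 3 bytes
      len = 3; cp = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) { // 11110xxx: 4 bytes
      len = 4; cp = lead & 0x07;
    } else {
      return {};                        // stray continuation or invalid lead byte
    }
    if (i + len > s.size()) return {};  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char c = static_cast<unsigned char>(s[i + k]);
      if ((c & 0xC0) != 0x80) return {};  // continuation bytes must be 10xxxxxx
      cp = (cp << 6) | (c & 0x3F);
    }
    out.push_back(cp);
    i += len;
  }
  return out;
}

// The blacklist string exactly as passed to SetVariable in the original post.
const std::string kBlacklist = "\xef\xac\x80\xef\xac\x81";

// The variant with a leading 0 in each escape (first character only).
const std::string kPadded = "\x0ef\x0ac\x080";
```

Both spellings produce the same bytes, so the escape syntax does not look like the problem here. One thing to watch for in general: a C hex escape consumes every hex digit that follows it, so an escape followed directly by a literal character in 0-9 or a-f would need the string split into two adjacent literals.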
On Mar 30, 10:33 pm, patrickq <[email protected]> wrote:
> I am trying to provide a black list with UTF8 characters specified
> using their byte codes, as follows:
>
> // U+FB00 ff ef ac 80 LATIN SMALL LIGATURE FF
> // U+FB01 fi ef ac 81 LATIN SMALL LIGATURE FI
>
> myTess->SetVariable("tessedit_char_blacklist", "\xef\xac\x80\xef\xac\x81");
>
> But this doesn't work. I tried "\x0ef\x0ac\x080" (adding a leading 0)
> but same result. The call doesn't return an error but the characters
> in question are not black listed.
>
> Is this string variable not in UTF8 format? Is there a problem in the
> C syntax I used to provide the hex codes?
>
> Thanks!
> Patrick

