Hi all, I would like to ask whether there is a limit on the number of characters that can be whitelisted with the *tessedit_char_whitelist* parameter in the config file. In my case, once the whitelist exceeds roughly 1300 characters, Tesseract throws the error "read_params_file: parameter not found", followed by the remaining characters. This suggests that after about 1300 characters (multibyte ones), Tesseract tries to parse the rest of the whitelist as a separate parameter.
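To check whether the cutoff tracks bytes rather than characters, the UTF-8 length of the config line can be measured. Below is a minimal sketch, not a confirmed diagnosis. The helper names `whitelist_line` and `utf8_length` are hypothetical, and the sample set is a placeholder range of CJK code points, not my actual kanji list. The idea that a fixed-size line buffer is involved is only an assumption that I have not verified against the Tesseract source.

```python
def whitelist_line(chars, param="tessedit_char_whitelist"):
    """Build one config-file line: parameter name, a space, then the value."""
    return f"{param} {''.join(chars)}\n"


def utf8_length(line):
    """Return the number of bytes the line occupies when encoded as UTF-8."""
    return len(line.encode("utf-8"))


# Placeholder data (assumption): 1300 consecutive code points starting at
# U+4E00, standing in for the real whitelist. Each encodes to 3 bytes in UTF-8.
sample = [chr(0x4E00 + i) for i in range(1300)]

line = whitelist_line(sample)
print(len(sample), "characters ->", utf8_length(line), "bytes")
```

With 3-byte characters, about 1300 of them already put the line close to 4 KB, so comparing this byte count against the point where the error starts may show whether the limit is on bytes per line.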
It might sound strange that I have so many characters, but I need to restrict Japanese kanji recognition to the 1900+ commonly used kanji instead of the full set (which is many times larger). A blacklist is out of the question, as there are more characters to blacklist than to whitelist. I have also tried specifying the *tessedit_char_whitelist* parameter twice, but only the latter one was used. I also tried passing it with -c on the command line, but that failed as well. While I know it would be possible to train for only the limited set of kanji, we are already at a point where doing so would cost too much time. Does anyone know of any other solution to this issue? Thanks in advance.

