[tesseract-ocr] Limit on number of whitelist characters

Reuben L. Tue, 09 Sep 2014 00:40:11 -0700

Hi all experts,

I would like to ask whether there is a limit on the number of whitelisted
characters when using the *tessedit_char_whitelist* parameter in the config
file. In my case, once the number of whitelisted characters exceeds ~1300,
the error "read_params_file: parameter not found" is thrown, followed by
the remaining characters. This suggests that, past roughly 1300 (multibyte)
characters, tesseract treats the rest of the line as a separate parameter
name.
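
One way to check whether this is a line-length limit in bytes rather than a
limit on the character count is to measure the config line as UTF-8. The
sketch below does only that arithmetic. The 4096-byte buffer size is purely
an assumption for illustration (I have not confirmed it from the tesseract
source), and the names `ASSUMED_LINE_BUFFER`, `config_line_bytes` and `fits`
are hypothetical helpers, not part of tesseract.

```python
# Assumption: the config reader uses a fixed-size line buffer. The size below
# is a guess for illustration, not a value confirmed from the tesseract source.
ASSUMED_LINE_BUFFER = 4096


def config_line_bytes(name: str, value: str) -> int:
    """Return the UTF-8 byte length of a 'name value' config line, newline included."""
    return len(f"{name} {value}\n".encode("utf-8"))


def fits(name: str, value: str, limit: int = ASSUMED_LINE_BUFFER) -> bool:
    """True if the line is strictly shorter than the assumed buffer.

    Strict comparison leaves one byte for a terminating NUL, as a C-style
    line reader would need.
    """
    return config_line_bytes(name, value) < limit


# Stand-in whitelist: 1300 consecutive CJK code points starting at U+4E00.
# Each of these encodes to 3 bytes in UTF-8.
whitelist = "".join(chr(0x4E00 + i) for i in range(1300))

print(config_line_bytes("tessedit_char_whitelist", whitelist))
print(fits("tessedit_char_whitelist", whitelist))
```

If the byte count at which the error first appears sits right at a round
buffer size, that would point to a line-length limit in the config reader
rather than a limit on the whitelist itself.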


It might sound strange to have so many characters, but I need to restrict
Japanese kanji recognition to only the 1900+ commonly used kanji, rather
than the full set (which is many times larger). A blacklist is also out of
the question, as there are MORE characters to blacklist than to whitelist.

I've also tried passing the *tessedit_char_whitelist* parameter twice, but
only the latter one was considered. Apart from that, I have also tried
passing it as a -c parameter on the command line, but that also failed.

While I know it would be possible to train for only the limited set of 
kanji, we are already at a point where doing so would be very wasteful in 
terms of time. 

Does anyone know of any other solution to this issue? Thanks in advance.
