Hi again,

I recently added a wordlist to my training, and was disappointed to find that it didn't seem to substantially improve the results. I suspect this is largely because the unicharset doesn't recognise equivalent upper and lower case letters (and hence doesn't match dictionary words case-insensitively).
Examining the unicharset file provided in ell.traineddata, I see that the 7th column appears to refer to the id of the opposite-case letter. For example, the two lines:

    Α 5 39,70,132,255,39,204,0,44,52,288 Greek 25 0 101 Α>--# Α [391 ]A
    α 3 59,72,188,200,98,175,0,67,102,288 Greek 101 0 25 α>-# α [3b1 ]a

refer to each other as 101 and 25 respectively.

However, my generated unicharset file includes no such references; the 7th column is always 0. For example:

    Α 5 0,255,0,255,0,32767,0,32767,0,32767 NULL 777 0 0 #>-# Α [391 ]A
    α 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 766 0 0 #>-# α [3b1 ]a

Should this case information be handled automatically when the unicharset is created? If so, any clues as to how I might go about tracking down why it isn't working? If not, it may be worth making a note to add that to the wiki when it's updated for 3.02.

Thanks for any advice,

Nick
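P.S. In case it helps anyone reproduce this, here is a minimal sketch of the check I am describing. It simply splits each unicharset line on whitespace and reads the 7th field. The helper names (`column`, `case_refs`) and the constant `CASE_COLUMN` are made up for illustration, and treating the 7th field as the other-case id is only my reading of the file, not something I have confirmed against the format documentation.

```python
# Assumption: the 7th whitespace-separated field holds the id of the
# opposite-case character. This is a guess from inspecting the file.
CASE_COLUMN = 7


def column(line: str, index: int) -> str:
    """Return the whitespace-separated field at a 1-based column index."""
    return line.split()[index - 1]


def case_refs(lines):
    """Collect the presumed other-case id from each unicharset line."""
    return [int(column(line, CASE_COLUMN)) for line in lines]


# The two lines from the provided Greek unicharset, copied as shown above.
official = [
    "Α 5 39,70,132,255,39,204,0,44,52,288 Greek 25 0 101 Α>--# Α [391 ]A",
    "α 3 59,72,188,200,98,175,0,67,102,288 Greek 101 0 25 α>-# α [3b1 ]a",
]

# The corresponding lines from my generated unicharset.
generated = [
    "Α 5 0,255,0,255,0,32767,0,32767,0,32767 NULL 777 0 0 #>-# Α [391 ]A",
    "α 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 766 0 0 #>-# α [3b1 ]a",
]

print(case_refs(official))   # [101, 25]
print(case_refs(generated))  # [0, 0]
```

Running the same function over a whole unicharset file (one line per entry, skipping the first line with the entry count) would show at a glance whether any entry has a non-zero value in that column.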

