Hi again,

I recently added a wordlist to my training, and was disappointed to
find that it didn't seem to substantially improve the results. I
suspect this is in significant part due to the unicharset not
recognising equivalent upper and lower case letters (and hence not
matching dictionary words case insensitively).

Examining the provided unicharset file for ell.traineddata I see
that the 7th column appears to refer to the id of the opposite case
letter. So for example the two lines:

Α 5 39,70,132,255,39,204,0,44,52,288 Greek 25 0 101 Α>--# Α [391 ]A
α 3 59,72,188,200,98,175,0,67,102,288 Greek 101 0 25 α>-# α [3b1 ]a

refer to each other as 101 and 25 respectively.

However, my generated unicharset file includes no such references,
with the 7th column always being 0. For example:

Α 5 0,255,0,255,0,32767,0,32767,0,32767 NULL 777 0 0 #>-# Α [391 ]A
α 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 766 0 0 #>-# α [3b1 ]a
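
For anyone wanting to run the same comparison on their own files,
here is a minimal Python sketch. It only splits each entry on
whitespace, prints columns 5 to 7, and lists the characters whose
chosen column is 0. The helper names (field, zero_entries) are made
up for illustration, and treating column 7 as the case reference is
only my reading of the entries above, not something I have checked
against the format documentation, so the column is left as a
parameter.

```python
# Entries copied from the shipped and the generated unicharset files.
SHIPPED = [
    "Α 5 39,70,132,255,39,204,0,44,52,288 Greek 25 0 101 Α>--# Α [391 ]A",
    "α 3 59,72,188,200,98,175,0,67,102,288 Greek 101 0 25 α>-# α [3b1 ]a",
]
GENERATED = [
    "Α 5 0,255,0,255,0,32767,0,32767,0,32767 NULL 777 0 0 #>-# Α [391 ]A",
    "α 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 766 0 0 #>-# α [3b1 ]a",
]


def field(line, column):
    """Return the 1-based whitespace-separated column of one entry."""
    return line.split()[column - 1]


def zero_entries(lines, column=7):
    """List the characters whose chosen column is 0.

    column=7 is an assumption taken from the observation above;
    pass a different index if the case reference lives elsewhere.
    """
    return [field(line, 1) for line in lines if field(line, column) == "0"]


if __name__ == "__main__":
    for name, lines in (("shipped", SHIPPED), ("generated", GENERATED)):
        for line in lines:
            # Show columns 5, 6 and 7 side by side for comparison.
            print(name, field(line, 1), [field(line, c) for c in (5, 6, 7)])
        print(name, "entries with 0 in column 7:", zero_entries(lines))
```

Run against the lines above, this reports nothing for the shipped
entries and both Α and α for the generated ones.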

Should this case information be handled automatically when the
unicharset is created? If so, any clues as to how I might go about
tracking down why it isn't working? If not, it would be worth adding
a note about this to the wiki when it's updated for 3.02.

Thanks for any advice,

Nick
