Yes Zdenko, I've noticed the same problem: the value of the 'script' column is NULL, and when I try to change the NULL value to a real script name, the unicharset file becomes unreadable and an error message is thrown when I use it.
I wonder, is there a way to set the script name when creating the unicharset file? Or is there a way to edit the file without damaging it?

On 1 June 2012 18:26, zdenko podobny <[email protected]> wrote:
> Description of unicharset is in its manual page[1].
>
> Also, in the past I found that some information is missing from the
> unicharset (generated by unicharset_extractor), e.g. 'script' is NULL and
> 'glyph_metrics' is IMO useless. This is one of the reasons why I am
> looking for a test suite: to see if adding such information helps the
> OCR output or not.
>
> [1]
> http://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharset.5.html
>
> --
> Zdenko
>
> On Fri, Jun 1, 2012 at 10:49 AM, Nick White <[email protected]> wrote:
>
>> Does nobody have any clues here? Any suggestions of where else I
>> could ask, or where to go in the code to work it out for myself?
>>
>> Thanks again.
>>
>> Nick
>>
>> On Wed, May 23, 2012 at 05:33:37PM +0100, Nick White wrote:
>> > Hi again,
>> >
>> > I recently added a wordlist to my training, and was disappointed to
>> > find that it didn't seem to substantially improve the results. I
>> > suspect this is in significant part due to the unicharset not
>> > recognising equivalent upper and lower case letters (and hence not
>> > matching dictionary words case insensitively).
>> >
>> > Examining the provided unicharset file for ell.trainingdata I see
>> > that the 7th column appears to refer to the id of the opposite case
>> > letter. So for example the two lines:
>> >
>> > Α 5 39,70,132,255,39,204,0,44,52,288 Greek 25 0 101 Α>--# Α [391 ]A
>> > α 3 59,72,188,200,98,175,0,67,102,288 Greek 101 0 25 α>-# α [3b1 ]a
>> >
>> > refer to each other as 101 and 25 respectively.
>> >
>> > However my generated unicharset file includes no such references,
>> > with the 7th column being always 0. For example:
>> >
>> > Α 5 0,255,0,255,0,32767,0,32767,0,32767 NULL 777 0 0 #>-# Α [391 ]A
>> > α 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 766 0 0 #>-# α [3b1 ]a
>> >
>> > Should this case information be handled automatically when the
>> > unicharset is created? If so, any clues as to how I may go about
>> > tracking down why it isn't working? If not, make a note to add that
>> > to the wiki when it's updated for 3.02.
>> >
>> > Thanks for any advice,
>> >
>> > Nick

--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
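
The difference Nick describes between the shipped and the generated unicharset can be checked without editing either file. The following is a minimal read-only sketch in Python, assuming that entries are whitespace-separated and that columns are counted from 1 as in his message. The four lines are the ones quoted in the thread; the helper names (`columns_of_interest`, `refer_to_each_other`) are hypothetical, and the meaning of each column is taken from Nick's reading, not from the unicharset manual page.

```python
# Lines quoted in the thread: two from the shipped Greek (ell) unicharset,
# two from the generated one.
SHIPPED = [
    "Α 5 39,70,132,255,39,204,0,44,52,288 Greek 25 0 101 Α>--# Α [391 ]A",
    "α 3 59,72,188,200,98,175,0,67,102,288 Greek 101 0 25 α>-# α [3b1 ]a",
]
GENERATED = [
    "Α 5 0,255,0,255,0,32767,0,32767,0,32767 NULL 777 0 0 #>-# Α [391 ]A",
    "α 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 766 0 0 #>-# α [3b1 ]a",
]


def columns_of_interest(line):
    """Return (character, 4th column, 5th column, 7th column) of one entry.

    Assumption: columns are separated by whitespace. The 4th column is the
    one that shows 'Greek' or 'NULL' in the quoted lines.
    """
    cols = line.split()
    return cols[0], cols[3], int(cols[4]), int(cols[6])


def refer_to_each_other(a, b):
    """True if each entry's 7th column equals the other entry's 5th column."""
    return a[3] == b[2] and b[3] == a[2]


shipped = [columns_of_interest(line) for line in SHIPPED]
generated = [columns_of_interest(line) for line in GENERATED]

for label, pair in (("shipped", shipped), ("generated", generated)):
    print(label, pair, "cross-referenced:", refer_to_each_other(*pair))
```

Because the sketch only reads and splits the lines, it does not show how to write a modified file back safely.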

