Re: unicharset matching upper and lower case letters

Taha Alasli Sat, 02 Jun 2012 20:45:44 -0700

Yes Zdenko, I've noticed the same problem: the value of the 'script' column
is NULL, and when I try to change the NULL value to a real script name, the
unicharset file becomes unreadable and an error message is thrown when I
use it.


I wonder, is there a way to set the script name when creating the unicharset
file? Or is there a way to edit the file without damaging it?
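
To make the question above concrete, here is a rough Python sketch of the kind
of edit I mean: change only the script column and copy every other byte
through untouched. The column layout is an assumption read off the sample
lines in the quoted message below (character, properties, ten comma-separated
metrics, script, then the rest); it is not taken from the Tesseract source.
The names set_script and rewrite are made up for illustration, and I have not
confirmed that Tesseract accepts a file edited this way.

```python
import re

# Assumed layout (from the sample lines in the quoted message, not from the
# Tesseract source): char, properties, glyph metrics, script, then the rest.
_LINE = re.compile(r'^(\S+[ \t]+\S+[ \t]+)(\S+)([ \t]+)(\S+)(.*)$', re.S)


def set_script(line, script):
    """Return line with a NULL script column replaced by `script`.

    Lines that do not match the assumed layout are returned unchanged.
    """
    m = _LINE.match(line)
    if m is None:
        return line
    head, metrics, gap, old_script, rest = m.groups()
    # Only touch lines with ten metric values and a script of NULL.
    if metrics.count(',') != 9 or old_script != 'NULL':
        return line
    return head + metrics + gap + script + rest


def rewrite(src_path, dst_path, script):
    # Binary I/O with explicit UTF-8 keeps line endings and non-ASCII
    # characters exactly as they were in the input file.
    with open(src_path, 'rb') as src:
        lines = src.read().decode('utf-8').split('\n')
    out = [set_script(line, script) for line in lines]
    with open(dst_path, 'wb') as dst:
        dst.write('\n'.join(out).encode('utf-8'))
```

Something like rewrite('unicharset', 'unicharset.new', 'Greek') would then
write a copy in which only the NULL script entries differ from the original,
so a diff of the two files shows exactly what was changed.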


On 1 June 2012 18:26, zdenko podobny <[email protected]> wrote:

> Description of unicharset is in its manual page[1].
>
> Also, in the past I found that some information is missing from the
> unicharset generated by unicharset_extractor (e.g. 'script' is NULL, and
> 'glyph_metrics' is IMO useless). This is one of the reasons why I am looking
> for a test suite: to see whether adding such information helps the OCR
> output or not.
>
> [1]
> http://tesseract-ocr.googlecode.com/svn-history/r683/trunk/doc/unicharset.5.html
>
>
> --
> Zdenko
>
> On Fri, Jun 1, 2012 at 10:49 AM, Nick White <[email protected]> wrote:
>
>> Does nobody have any clues here? Any suggestions of where else I
>> could ask, or where to go in the code to work it out for myself?
>>
>> Thanks again.
>>
>> Nick
>>
>> On Wed, May 23, 2012 at 05:33:37PM +0100, Nick White wrote:
>> > Hi again,
>> >
>> > I recently added a wordlist to my training, and was disappointed to
>> > find that it didn't seem to substantially improve the results. I
>> > suspect this is in significant part due to the unicharset not
>> > recognising equivalent upper and lower case letters (and hence not
>> > matching dictionary words case insensitively).
>> >
>> > Examining the provided unicharset file for ell.trainingdata I see
>> > that the 7th column appears to refer to the id of the opposite case
>> > letter. So for example the two lines:
>> >
>> > Α 5 39,70,132,255,39,204,0,44,52,288 Greek 25 0 101 Α>--# Α [391 ]A
>> > α 3 59,72,188,200,98,175,0,67,102,288 Greek 101 0 25 α>-# α [3b1 ]a
>> >
>> > refer to each other as 101 and 25 respectively.
>> >
>> > However my generated unicharset file includes no such references,
>> > with the 7th column being always 0. For example:
>> >
>> > Α 5 0,255,0,255,0,32767,0,32767,0,32767 NULL 777 0 0 #>-# Α [391 ]A
>> > α 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 766 0 0 #>-# α [3b1 ]a
>> >
>> > Should this case information be handled automatically when the
>> > unicharset is created? If so, any clues as to how I may go about
>> > tracking down why it isn't working? If not, make a note to add that
>> > to the wiki when it's updated for 3.02.
>> >
>> > Thanks for any advice,
>> >
>> > Nick
>> >
>> > --
>> > You received this message because you are subscribed to the Google
>> > Groups "tesseract-ocr" group.
>> > To post to this group, send email to [email protected]
>> > To unsubscribe from this group, send email to
>> > [email protected]
>> > For more options, visit this group at
>> > http://groups.google.com/group/tesseract-ocr?hl=en

