Re: unicharset script and metrics questions

steve8918 Fri, 08 Jun 2012 20:23:08 -0700

Thanks again, Zdenko and Nick for your responses.  I was able to get 
somewhat better results by modifying my training image.

I'm trying to create a training set for basically the English letters A-Z, 
but with subscripts and superscripts with various letters or numbers.  I 
want to map each to a separate character.  So for example, A^1 would be 
"A", A^2 would be "a", etc.  When I run tesseract, I run it against an 
image of a single character, so I use SINGLE_CHAR mode.
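
For illustration, the mapping described above could be sketched roughly as
follows in Python. Only the A^1 -> "A" and A^2 -> "a" pairs come from the
description; the function names, the restriction to two superscripts, and the
choice of uppercase/lowercase stand-ins for the rest of the alphabet are
assumptions made for the sketch.

```python
import string


def build_glyph_map(superscripts=("1", "2")):
    """Map (base letter, superscript) pairs to one stand-in character each.

    Assumption: superscript "1" maps to the uppercase letter and "2" to the
    lowercase letter; further superscripts would need additional symbols.
    """
    stand_ins = {"1": string.ascii_uppercase, "2": string.ascii_lowercase}
    glyph_map = {}
    for sup in superscripts:
        for base, out in zip(string.ascii_uppercase, stand_ins[sup]):
            glyph_map[(base, sup)] = out
    return glyph_map


def decode(char, glyph_map):
    """Turn a recognized stand-in character back into (base, superscript)."""
    inverse = {out: glyph for glyph, out in glyph_map.items()}
    return inverse.get(char)  # None if the character is not in the map


GLYPHS = build_glyph_map()
print(GLYPHS[("A", "1")], GLYPHS[("A", "2")])
print(decode("d", GLYPHS))
```

Keeping the mapping in one table like this makes it easy to check that every
glyph in the training image has exactly one output character.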

I was able to create training data for this on 3.02 revision 729, but when 
I actually used tesseract to identify letters, it wasn't producing any 
good results, even when I ran it against the training image itself.

I tried a second time, this time removing the subscripts and superscripts 
from the training image and using just the English letters from A to Z. 
This worked much better, and I was able to get some results.

However, I'm getting a weird result where every single "D" is being 
recognized as "I".

Are there any switches or options that would produce some sort of output 
that I can use to figure out why this misidentification is occurring?  I 
tried using the "segdemo inter" option, but it looks like the interactive 
mode was removed from 3.02.
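
Independently of any tesseract option, one way to check whether a confusion
such as "D" -> "I" is systematic is to tally expected versus recognized
characters over a set of test images. The Python sketch below is only an
illustration: the function name is hypothetical and the result pairs are
made-up placeholders, not real tesseract output.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count (expected, recognized) pairs where recognition was wrong."""
    return Counter((exp, got) for exp, got in pairs if exp != got)


# Placeholder data: (expected character, character the recognizer returned)
results = [("A", "A"), ("D", "I"), ("D", "I"), ("B", "B"), ("D", "I")]

for (exp, got), n in confusion_counts(results).most_common():
    print(f"{exp} -> {got}: {n}")
```

If a single pair dominates the tally, the problem is more likely in the
training data for that character than in the recognition settings.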

Thanks for your help,

Steve

On Friday, June 8, 2012 2:02:20 AM UTC-7, Nick White wrote:
>
> Hi Steve, 
>
> I'll cut up your email to reply to bits. 
>
> > Zdenko, does your note on unicharset_extractor mean that the current 
> > codeline doesn't work properly? 
> > You mentioned a script to correct the information; is there any place that 
> > documents how I can fix the file so that it works properly? 
> > 
> > Nick, have you been able to train either 3.01 or 3.02/current codeline to 
> > recognize a new language properly? 
>
> Yes, the training I'm doing is with the 3.02 trunk code, and is 
> working very well now. As Zdenko says, we're just keen to make it as 
> good as possible, hence looking into unicharset oddities. My 
> training is already above my expectations. So don't be put off! 
>
> On Thu, Jun 07, 2012 at 01:02:59PM -0700, steve8918 wrote: 
> > Thanks Zdenko and Nick.  Yes, I'm using the latest code of tesseract 
> > (revision 729) because the 3.01 version doesn't appear to work well for me; 
> > I'm getting "Couldn't find matching blob" for only one of my characters for 
> > some reason. 
>
> I get that for several of my training images, for no obvious reason. 
> It doesn't seem to have a major impact on the training for me 
> though, so I wouldn't worry too much about it. 
>
> > After following your instructions, I was able to get 
> > everything working without crashing or errors.  However, the training 
> > didn't seem to work, because it's not recognizing anything properly. 
>
> That is surprising. Can you give more information about what 
> (mis)recognition is happening? 
>   
> Nick 
>
