Re: Does tesseract 3.02 require new training?

Speedy Mon, 01 Oct 2012 03:27:55 -0700

Hi Zdenko,
 
thank you very much for your quick reply. In fact, our first approach was 
simply to use the 3.01-trained fonts with the 3.02 library. However, this 
gives worse results: the H->M confusions have increased sevenfold (from 7 
to 52) and the O->D confusions by a factor of 2.5 (from 15 to 37).
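
For reference, here is a minimal Python sketch of how such confusion counts 
and factors can be computed. The function names (count_confusions, 
increase_factor) are hypothetical helpers made up for illustration and are 
not part of tesseract, and the sketch assumes the ground truth and the OCR 
output are already aligned character by character. The counts are the ones 
from the dataset mentioned in this thread.

```python
from collections import Counter


def count_confusions(truth: str, ocr: str) -> Counter:
    """Tally (expected, recognized) pairs where the characters differ.

    Assumption: truth and ocr are already aligned and have the same
    length; real OCR output would need an alignment step first.
    """
    if len(truth) != len(ocr):
        raise ValueError("inputs must be aligned to the same length")
    return Counter((t, o) for t, o in zip(truth, ocr) if t != o)


def increase_factor(before: int, after: int) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


# Confusion counts reported in this thread: (with 3.01, with 3.02).
counts = {("H", "M"): (7, 52), ("O", "D"): (15, 37)}
for (expected, got), (old, new) in counts.items():
    factor = increase_factor(old, new)
    print(f"{expected}->{got}: {old} -> {new}, factor {factor:.1f}")
```

Running the same tally against the output of both versions on the same 
test set makes it easy to see which character pairs got better and which 
got worse.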
 
What additional information does shapeclustering provide? What is its 
function?
 
Best regards,
Marcus


On Monday, October 1, 2012 11:34:25 AM UTC+2, zdenop wrote:

> You can use a 3.01 language data file in 3.02 (tested ;-) ). 
> 3.02 training requires[1] the use of an additional tool, shapeclustering [2], 
> but I have not tested whether it makes a difference (e.g. 3.01 vs 3.02 language 
> data file). Maybe Nick did some tests (he created grc[2] file for 2.0x, 
> 3.01[3] and 3.02[4])...
>
> [1] http://code.google.com/p/tesseract-ocr/issues/detail?id=629#c8
> [2] http://tesseract-ocr.googlecode.com/svn/trunk/doc/shapeclustering.1.html
> [3] http://code.google.com/p/tesseract-ocr/issues/detail?id=770
> [4] http://code.google.com/p/tesseract-ocr/issues/detail?id=754
>
> -- 
> Zdenko
>
> On Mon, Oct 1, 2012 at 11:10 AM, Speedy <[email protected]> wrote:
>
>> Hi,
>>  
>> I'll give it another shot: when I move from tesseract 3.01 to tesseract 3.02, 
>> should I retrain my fonts with the 3.02 training tools, or does this not 
>> matter?
>>  
>> Best regards,
>> Marcus
>>  
>> On Thursday, September 20, 2012 4:31:50 PM UTC+2, Speedy wrote:
>>
>>> Hi there,
>>>  
>>> we are currently using tesseract 3.01 as OCR engine and have trained a 
>>> number of fonts with it. Things work quite well, but we would like to move 
>>> to version 3.02 for two reasons:
>>>
>>>    - It is possible to combine fonts 
>>>    - The character recognition is supposed to be significantly improved
>>>
>>> In our tests we found that the character recognition has changed, but 
>>> the results are mixed. In particular, quite a few characters that 
>>> previously had a few confusions now have none (which is good), but there 
>>> are also characters that are much worse, making the overall result worse. 
>>> For example, in one dataset the number of confusions from H to M has 
>>> increased from 7 to 52, and the number of confusions from O to D has 
>>> increased from 15 to 37.
>>>  
>>> Is there a difference in the font files between tesseract 3.01 and 3.02? 
>>> Does it matter to tesseract 3.02 whether a font was trained with 3.01 
>>> training? Would it help to retrain the fonts with tesseract 3.02 training 
>>> tools or should this not matter?
>>>  
>>> In what way was character recognition improved in tesseract 3.02?
>>>  
>>> Thanks in advance for any help you can provide!
>>>  
>>> Best regards,
>>> Marcus
>>>  
>>>
>> -- 
>> You received this message because you are subscribed to the Google
>> Groups "tesseract-ocr" group.
>> To post to this group, send email to [email protected]
>> To unsubscribe from this group, send email to
>> [email protected]
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en
>>
>
>

