Hi Zdenko,

thank you very much for your quick reply. In fact, the first approach was to simply use the 3.01-trained font with the 3.02 library. However, this does give worse results. The confusions H->M have increased sevenfold, O->D by a factor of 2.5.

What additional information does shapeclustering provide? What is its function?

Best regards,
Marcus
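The factors mentioned above follow from the counts reported in the original message (H->M from 7 to 52, O->D from 15 to 37). As a rough sketch of how such confusion counts and factors could be tallied: the helper names `count_confusions` and `increase_factor` are made up for illustration, and the sketch assumes the ground truth and the OCR output are already aligned character by character, which real OCR output usually is not.

```python
from collections import Counter


def count_confusions(truth: str, ocr: str) -> Counter:
    """Tally (expected, recognized) pairs where aligned characters differ.

    Assumption: both strings are pre-aligned and of equal length.
    """
    if len(truth) != len(ocr):
        raise ValueError("this sketch assumes pre-aligned, equal-length strings")
    return Counter((t, o) for t, o in zip(truth, ocr) if t != o)


def increase_factor(before: int, after: int) -> float:
    """Ratio of confusion counts between two engine versions."""
    return after / before


# Counts reported in the thread for one dataset (3.01 vs 3.02).
print(round(increase_factor(7, 52), 1))   # H->M
print(round(increase_factor(15, 37), 1))  # O->D
```

Running `count_confusions` once per engine version on the same ground truth gives two counters that can be compared pair by pair.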
On Monday, October 1, 2012 11:34:25 AM UTC+2, zdenop wrote:
> You can use a 3.01 language data file in 3.02 (tested ;-) ).
> 3.02 training requires[1] the use of an additional tool, shapeclustering[2],
> but I did not test whether it makes a difference (e.g. 3.01 vs 3.02 language
> data file). Maybe Nick did some tests (he created grc[2] file for 2.0x,
> 3.01[3] and 3.02[4])...
>
> [1] http://code.google.com/p/tesseract-ocr/issues/detail?id=629#c8
> [2] http://tesseract-ocr.googlecode.com/svn/trunk/doc/shapeclustering.1.html
> [3] http://code.google.com/p/tesseract-ocr/issues/detail?id=770
> [4] http://code.google.com/p/tesseract-ocr/issues/detail?id=754
>
> --
> Zdenko
>
> On Mon, Oct 1, 2012 at 11:10 AM, Speedy <[email protected]> wrote:
>
>> Hi,
>>
>> I'll give it another shot: when I move from tesseract 3.01 to tesseract 3.02,
>> should I retrain my fonts with the 3.02 training tools, or does this not
>> matter?
>>
>> Best regards,
>> Marcus
>>
>> On Thursday, September 20, 2012 4:31:50 PM UTC+2, Speedy wrote:
>>
>>> Hi there,
>>>
>>> we are currently using tesseract 3.01 as our OCR engine and have trained a
>>> number of fonts with it. Things work quite well, but we would like to move
>>> to version 3.02 for two reasons:
>>>
>>> - It is possible to combine fonts
>>> - The character recognition is supposed to be significantly improved
>>>
>>> In our tests we found that the character recognition has changed, but
>>> the results are mixed. In particular, quite a few characters that
>>> previously had few confusions now have none (which is good), but then there
>>> are characters that are much worse, making the overall result worse. For
>>> example, in one dataset the number of confusions from H to M has increased
>>> from 7 to 52 and the number of confusions from O to D has increased from 15
>>> to 37.
>>>
>>> Is there a difference in the font files between tesseract 3.01 and 3.02?
>>> Does it matter to tesseract 3.02 whether a font was trained with the 3.01
>>> training tools? Would it help to retrain the fonts with the tesseract 3.02
>>> training tools, or should this not matter?
>>>
>>> In what way was character recognition improved in tesseract 3.02?
>>>
>>> Thanks in advance for any help you can provide!
>>>
>>> Best regards,
>>> Marcus
>>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "tesseract-ocr" group.
>> To post to this group, send email to [email protected]
>> To unsubscribe from this group, send email to
>> [email protected]
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en

