Re: Thoughts on having the training process take font files directly

shree Tue, 15 Oct 2013 20:38:15 -0700

Hi jozef,

Did your company ever publish the article about
"tesseract including training and noise adding where our experience and 
expertise will be described in more detail"?



 
On Tuesday, October 16, 2012 9:54:48 PM UTC+5:30, jm wrote:
>
>
>
> On Tuesday, October 16, 2012 12:27:43 PM UTC+2, TP wrote:
>>
>> On Mon, Oct 15, 2012 at 2:27 PM, Nick White <[email protected]> 
>> wrote: 
>> >> As an added step, you might consider: rendering to grayscale, 
>> >> slightly blurring (optional), adding a bit of noise, and then 
>> >> re-converting to b&w to simulate what physical scanners do?  Maybe do 
>> >> this at 1200dpi and also downsample to 300 dpi. 
>> > 
>> > I wouldn't have thought adding random noise would be helpful; it 
>> > will just distort the shapes which Tesseract will use to match, and 
>> > as it will always get different noise to the type I generated, it 
>> > would only hinder it further. At least that's what I had assumed. Am 
>> > I wrong about this? Has anybody tested whether adding random noise 
>> > to an otherwise clean training improves things? 
>>
>> I only suggested this because of the following quote from [1]: 
>>
>>     Next print and scan (or use some electronic rendering method) to
>>     create an image of your training page. Up to 32 training files can
>>     be used (of multiple pages). It is best to create a mix of fonts
>>     and styles (but in separate files), including italic and bold.
>>
>>     NOTE: training from real images is actually quite hard, due to the
>>     spacing-out requirements. This will be improved in a future
>>     release. For now it is much easier if you can print/scan your own
>>     training text.
>>
>> The recurrence of the word SCAN led me to believe that they are 
>> suggesting actually physically scanning (which implies adding a bit of 
>> noise). 
>>
>
> It depends on your document set, but if the input documents are *not* 
> crystal clear (e.g., 600dpi generated from PDFs), then it helps. 
>
> For one of our test document sets:
> without noise: 92% character accuracy
> with noise: 97% character accuracy
>
> The interesting question is which algorithm(s) to use...
>
> jozef
>
> ps: my company plans to publish an article about tesseract including 
> training and noise adding where our experience and expertise will be 
> described in more detail
>
>  
>
>>
>> [1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 
>>
>
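
For anyone who wants to try the degradation steps discussed in the quoted
messages (grayscale render, optional slight blur, a bit of noise,
re-conversion to b&w, 1200 dpi downsampled to 300 dpi), here is a minimal
sketch using only NumPy. Everything specific in it is an assumption and not
something from the thread: the function name degrade, the 3x3 box blur, the
Gaussian noise model, the default noise level and threshold, and the choice
to threshold after downsampling rather than before. It does not answer
jozef's question of which algorithm(s) work best.

```python
import numpy as np


def degrade(page, noise_sigma=20.0, threshold=128, factor=4, blur=True, seed=0):
    """Roughly simulate a scan of a cleanly rendered training page.

    page: 2-D uint8 array (0 = ink, 255 = paper), rendered at high
    resolution, e.g. 1200 dpi. All defaults are illustrative guesses.
    """
    rng = np.random.default_rng(seed)
    img = page.astype(np.float64)

    if blur:
        # Slight blur: 3x3 box filter built from shifted copies of an
        # edge-padded image, so no extra imaging library is needed.
        h, w = img.shape
        padded = np.pad(img, 1, mode="edge")
        img = sum(
            padded[dy:dy + h, dx:dx + w]
            for dy in range(3)
            for dx in range(3)
        ) / 9.0

    # A bit of noise: additive Gaussian (an assumed noise model).
    img = img + rng.normal(0.0, noise_sigma, img.shape)

    # Downsample by block averaging, e.g. 1200 dpi -> 300 dpi for factor=4.
    # Rows and columns that do not fill a whole block are cropped.
    h, w = img.shape
    h, w = h - h % factor, w - w % factor
    img = img[:h, :w].reshape(h // factor, factor, w // factor, factor)
    img = img.mean(axis=(1, 3))

    # Re-convert to black and white with a fixed global threshold.
    return np.where(img >= threshold, 255, 0).astype(np.uint8)


if __name__ == "__main__":
    # Stand-in for a rendered glyph: a black rectangle on white paper.
    clean = np.full((40, 80), 255, dtype=np.uint8)
    clean[10:30, 20:60] = 0
    noisy = degrade(clean)
    print(noisy.shape, np.unique(noisy))
```

The fixed seed only makes runs repeatable. For real training data you would
presumably vary the seed and noise level per page, and compare character
accuracy with and without the degradation on your own document set.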

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

