Hi Jozef,

Did your company publish the article regarding
"tesseract including training and noise adding where our experience and 
expertise will be described in more detail"?


 
On Tuesday, October 16, 2012 9:54:48 PM UTC+5:30, jm wrote:
>
>
>
> On Tuesday, October 16, 2012 12:27:43 PM UTC+2, TP wrote:
>>
>> On Mon, Oct 15, 2012 at 2:27 PM, Nick White <[email protected]> 
>> wrote: 
>> >> As an added step, you might consider: rendering to grayscale, 
>> >> slightly blurring (optional), adding a bit of noise, and then 
>> >> re-converting to b&w to simulate what physical scanners do?  Maybe do 
>> >> this at 1200dpi and also downsample to 300 dpi. 
>> > 
>> > I wouldn't have thought adding random noise would be helpful; it 
>> > will just distort the shapes which Tesseract will use to match, and 
>> > as it will always get different noise to the type I generated, it 
>> > would only hinder it further. At least that's what I had assumed. Am 
>> > I wrong about this? Has anybody tested whether adding random noise 
>> > to an otherwise clean training improves things? 
>>
>> I only suggested this because of the following quote from [1]: 
>>
>>     Next print and scan (or use some electronic rendering method) to
>>     create an image of your training page. Up to 32 training files can
>>     be used (of multiple pages). It is best to create a mix of fonts
>>     and styles (but in separate files), including italic and bold.
>>
>>     NOTE: training from real images is actually quite hard, due to the
>>     spacing-out requirements. This will be improved in a future release.
>>     For now it is much easier if you can print/scan your own training
>>     text.
>>
>> The reoccurrence of the word SCAN led me to believe that they are 
>> suggesting actually physically scanning (which implies adding a bit of 
>> noise). 
>>
>
> It depends on your document set, but if the input document set is *not* 
> crystal clear (e.g., 600dpi generated from pdfs) then adding noise helps. 
>
> For one of our test document sets:
> without noise: 92% character accuracy
> with noise: 97% character accuracy
>
> The interesting question is which algorithm(s) to use...
>
> jozef
>
> ps: my company plans to publish an article about tesseract including 
> training and noise adding where our experience and expertise will be 
> described in more detail
>
>  
>
>>
>> [1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 
>>
>

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
