Hi, I am creating a TIFF image from a PDF document. The convert command provides a great many options for creating the image, e.g.:

convert -monochrome -depth 8 -geometry 4000 -density 600 -quality 100 sample.pdf sample.tiff

I want to tune these parameters to get the most suitable image for OCR. I can change depth, geometry, density and quality, opt for a monochrome image, etc. As you said, a 600 DPI image would be good for OCR, but I am not able to relate 600 DPI to these parameters. My guess is that DPI is the same as density. Any suggestions would be highly appreciated.

On Wednesday, 22 August 2012 20:20:06 UTC+5:30, Jani Monoses wrote:
> > OK, I see. One thing you could do would be to experiment with
> > increasing Tesseract's trust in its dictionary. I have done
> > something similar with my training. Create a file with this in:
> >
> > language_model_penalty_non_freq_dict_word 0.2
> > language_model_penalty_non_dict_word 0.3
>
> Thanks, I tried this and the output is certainly different, but as
> with the dpi changes some things got better, other regressed with
> no clear winner.
>
> I tried increasing the values even more but then the regressions seem
> to multiply too.
> What I notice now is that at higher dpi, all lowercase o is recognized
> as e, so I'll probably stick to 600dpi for now.
>
> So there's no way of just adding new words to the existing dictionary
> without redoing the whole training?
>
> Are any other tunables such as the above that you think may help looking
> into?
>
> Jani
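
On the DPI question at the top of this message: as far as I understand ImageMagick, -density given before a PDF input is the resolution (dots per inch by default) at which the page is rasterized, and -geometry 4000 then resizes the result to 4000 pixels wide, which changes the effective DPI. Below is a minimal sketch of that arithmetic only. The US Letter page size (8.5 x 11 in) and the function names are assumptions for illustration, not anything taken from sample.pdf or from ImageMagick itself.

```python
# Rough arithmetic for how -density and -geometry relate to pixel size.
# Assumption: the PDF page is US Letter (8.5 x 11 inches); adjust as needed.

def rasterized_size(width_in, height_in, density_dpi):
    """Pixel dimensions of a page rendered at the given density (DPI)."""
    return round(width_in * density_dpi), round(height_in * density_dpi)

def effective_dpi(target_width_px, width_in):
    """DPI that results once the image is resized to target_width_px wide."""
    return target_width_px / width_in

if __name__ == "__main__":
    # -density 600 on an 8.5 x 11 in page
    print(rasterized_size(8.5, 11.0, 600))      # (5100, 6600)
    # -geometry 4000 applied afterwards shrinks it below 600 DPI
    print(round(effective_dpi(4000, 8.5), 1))   # 470.6
```

If that reading is right, dropping -geometry (or choosing a width that matches the page at 600 DPI) would keep the output at the density that was requested.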

