I have scanned nearly 3,000 pages and fed them into Tesseract. Some were very poor quality: mimeographs from the '60s and other badly faded originals. What really paid off was making sure the TIFFs going into Tesseract were as clean and noise-free as possible, and that the DPI was high, but not too high.
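To illustrate the kind of cleanup meant by "clean and noise free", here is a minimal sketch of one common approach: a 3x3 median filter to remove isolated specks, followed by a fixed threshold. This is an assumption for illustration only, not necessarily what was used on these scans. The function names, the 2D-list image representation, and the threshold of 128 are all hypothetical; a real pipeline would use an imaging library.

```python
from statistics import median


def median_filter(gray):
    """Apply a 3x3 median filter to a 2D list of 0-255 grayscale values.

    Pixels at the border use whatever neighbours exist.
    """
    height, width = len(gray), len(gray[0])
    cleaned = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                gray[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(int(median(window)))
        cleaned.append(row)
    return cleaned


def binarize(gray, threshold=128):
    """Map every pixel to pure black (0) or pure white (255).

    The threshold is an illustrative default, not a recommended value.
    """
    return [[0 if value < threshold else 255 for value in row] for row in gray]


# A white 5x5 "page" with a single black speck in the middle.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
despeckled = binarize(median_filter(page))
```

The median filter removes a lone speck because it is outvoted by its neighbours, while strokes several pixels wide survive largely intact.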
I solved the problem with ligatures by writing a simple editor script that converted them to their component characters.

I really think this wiki page is a good idea.

On Sat, Dec 14, 2013 at 11:41 AM, Tom Morris <[email protected]> wrote:

> On Friday, December 13, 2013 11:25:42 AM UTC-5, Nick White wrote:
>>
>> I've drafted such a page, and I'd be keen to get feedback on it. Is
>> it clear? Is it a good idea? I haven't filled out all of the "Image
>> processing" sections yet, but (presuming people don't hate the idea
>> in general) I will do soon, including image examples.
>
> I think that's a good idea and a good start.
>
> Another thing that would be useful is to give more people wiki privileges.
> One doesn't need full committer status to be able to edit wiki pages on
> Google Code projects.
>
> Tom

--
/greg
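For anyone who wants to try the ligature fix mentioned above, here is a minimal sketch of the idea in Python. It is not the original editor script. The table covers only the common Latin ligatures in the Unicode Alphabetic Presentation Forms block, and the names `LIGATURES` and `expand_ligatures` are made up for this example.

```python
# Map each ligature code point to its component characters.
LIGATURES = {
    "\ufb00": "ff",   # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",   # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL
    "\ufb06": "st",   # LATIN SMALL LIGATURE ST
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Return text with the ligatures above replaced by plain letters."""
    return text.translate(_TABLE)


sample = "\ufb01nal o\ufb03ce \ufb02ow"
cleaned = expand_ligatures(sample)
```

Running this over Tesseract's text output leaves everything except the listed ligatures untouched. An explicit table is used here rather than full Unicode compatibility normalization, which would also rewrite other characters such as superscripts and full-width forms.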

