I have scanned nearly 3,000 pages and fed them into tesseract.  Some were
very poor quality: mimeographs from the '60s and other badly faded
originals.  I found that it really paid off to make sure the TIFFs going
into tesseract were as clean and noise-free as possible, and that the DPI
was high, but not too high.
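
To illustrate what "noise free" can mean in practice, here is a minimal
sketch of one common cleanup step, a 3x3 median filter that removes isolated
specks.  It is an assumption for illustration only, not the cleanup actually
used on these scans: the function name median_filter_3x3 is hypothetical, and
the image is assumed to be a grayscale grid given as a list of rows of
integers rather than a real TIFF.

```python
from statistics import median_low


def median_filter_3x3(pixels):
    """Return a despeckled copy of a grayscale image.

    `pixels` is a list of rows, each row a list of ints (0 = black,
    255 = white). Each output pixel is the median of its 3x3
    neighborhood, which removes isolated specks while keeping edges.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighborhood, clipping it at the image borders.
            window = [
                pixels[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            # median_low always returns a value that occurs in the window.
            row.append(median_low(window))
        result.append(row)
    return result


if __name__ == "__main__":
    # A white page with a single black speck in the middle.
    page = [[255, 255, 255], [255, 0, 255], [255, 255, 255]]
    print(median_filter_3x3(page))
```

In a real workflow the same idea would be applied with an image library or
scanner software before the TIFF is handed to tesseract.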

I solved the problem with ligatures by writing a simple editor script that
converted them to their component characters.
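
A minimal sketch of that kind of conversion, in Python rather than as an
editor script.  It assumes the ligatures appear in the OCR output as the
Unicode presentation forms U+FB00 to U+FB04 and U+FB06; the table and the
function name expand_ligatures are illustrative, not the original script.

```python
# Map each ligature code point to its component characters.
# Assumption: the OCR output uses these Unicode presentation forms.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb06": "st",
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace each known ligature in `text` with its component letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(expand_ligatures("o\ufb03ce \ufb01le"))
```

Running the converted text through a spell checker afterwards is an easy way
to spot any ligatures the table does not cover.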

I really think this wiki page is a good idea.



On Sat, Dec 14, 2013 at 11:41 AM, Tom Morris <[email protected]> wrote:

> On Friday, December 13, 2013 11:25:42 AM UTC-5, Nick White wrote:
>>
>>
>> I've drafted such a page, and I'd be keen to get feedback on it. Is
>> it clear? Is it a good idea? I haven't filled out all of the "Image
>> processing" sections yet, but (presuming people don't hate the idea
>> in general) I will do soon, including image examples.
>>
>
> I think that's a good idea and a good start.
>
> Another thing that would be useful is to give more people wiki privileges.
> One doesn't need full committer status to be able to edit wiki pages on
> Google Code projects.
>
> Tom
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en



-- 
/greg
