*** On behalf of Andy Syme, who could not post in this group, probably due to spam removal artefacts ***
...my problem is that I have some documents written between 1890 and 1920 that I scanned and want to OCR. They are in English, and with the standard English language file I was getting 40-50% recognition.

I then tried to train a new font. I made an image file with at least one, and often three or four, copies of each character and used pyTesseract to make the box file for this new font. I rebuilt the trained data file (after some trial and error), including adding the new font and updating the ambiguous character sets, e.g. g’ = g, \\’ = W etc. When I rerun tesseract, the OCR recognition is no better. I then created a language file which was basically all the English files but with only the ‘new font’ in it. OCR accuracy dropped.

Is there something I’m doing wrong? The new box file had all the letters (upper and lower case), numbers and some punctuation, but no newer symbols (e.g. &s or @s), as they are not present in these docs.

I can send the files I made if it will help you. I will post this again if you prefer, but I am desperately looking for some help with this.

Andy

*** End Andy Syme ***

The provided file can be downloaded here: https://docs.google.com/leaf?id=0B4FRY5H4TwI8ZWUzZDkzNjYtZTFiNC00NTBmLWIyY2ItMDFmNDAxZGI1ZTdk&hl=en

