Hi, I'm a researcher in statistical machine translation, and for my work I use a bunch of translated texts (in multiple languages), some of which were automatically generated via OCR. I recently noticed that some of these texts include substantial numbers of OCR errors, which I would of course like to correct to improve the quality of my data.
I was therefore wondering whether I could use tesseract or some related software tool to correct at least some of these OCR-generated errors (e.g. through statistical language modelling techniques). Note that I unfortunately don't have access to the original scans; I only have the raw, OCR-produced text. Any suggestions?

Thanks!
Pierre

