[tesseract-ocr] Post-correction of OCR-generated text

Pierre Lison Tue, 02 Sep 2014 09:08:48 -0700

Hi,

I'm a researcher in statistical machine translation, and for my work I use a 
bunch of translated texts (in multiple languages), some of which were 
automatically generated via OCR.  I recently noticed that some texts 
include substantial numbers of OCR errors, which I would of course like to 
correct to improve the quality of my data.


I was therefore wondering whether I could use tesseract or some related 
software tool to correct at least some of these OCR-generated errors 
(e.g. through statistical language modelling techniques).  Note that I 
unfortunately don't have access to the original scans; I only have the raw, 
OCR-produced text.
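
To make the idea concrete, the kind of correction I have in mind could be
sketched roughly as below.  This is only an illustration under some
assumptions, not something tesseract provides: the function names
(train_unigram_model, edits1, correct_word, correct_text) are made up for
this example, and it assumes that some clean reference text in the same
language is available to estimate word frequencies.  It uses a plain unigram
model and only considers candidates within one character edit of each
unknown word, which is much cruder than a real language model.

```python
import re
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def train_unigram_model(clean_text):
    """Count word frequencies in text assumed to be free of OCR errors."""
    return Counter(re.findall(r"[a-z]+", clean_text.lower()))


def edits1(word):
    """Return all strings one edit away (delete, swap, replace, insert)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + swaps + replaces + inserts)


def correct_word(word, counts):
    """Replace an unknown lowercase word by its most frequent known neighbour."""
    # Leave known words and anything not purely lowercase (names, acronyms) alone.
    if word in counts or not word.islower():
        return word
    candidates = [c for c in edits1(word) if c in counts]
    if not candidates:
        return word
    return max(candidates, key=lambda c: counts[c])


def correct_text(ocr_text, counts):
    """Correct each alphabetic token; punctuation and spacing are kept as-is."""
    return re.sub(r"[A-Za-z]+",
                  lambda m: correct_word(m.group(0), counts),
                  ocr_text)


if __name__ == "__main__":
    reference = "the quality of the data is substantial and the data is clean"
    model = train_unigram_model(reference)
    print(correct_text("thc qualiiy of tbe data", model))
```

A realistic version would presumably need a much larger reference corpus,
context (n-grams rather than single words), and an error model that knows
which character confusions are typical for OCR.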

Any suggestions?

Thanks!

Pierre

