Hello, I have a large set of already-OCR'ed PDF documents (court judgments) that contain a hidden text layer. I can extract this text using simple pdf-to-text tools like pdftotext with good results.
However, the text contains some errors that seem specific to the OCR process, and I believe they could be fixed fairly easily by an automated procedure. I also have a set of proper digital PDFs from the same source that could be used as a training corpus. I have some limited experience doing this sort of thing using n-grams and Markov chains, but I'm not familiar with OCR-specific correction algorithms or more advanced techniques.

Does tesseract have any facilities for making this sort of correction? If not, what are some places or algorithms I can investigate?

Thanks,
Orestis
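
P.S. To make the question more concrete, here is a minimal sketch in Python of the kind of naive baseline I could build myself: take the word frequencies from the clean digital PDFs, and for any OCR'ed word not in that vocabulary, try a few typical OCR character confusions and keep the most frequent known result. Everything in it is an assumption for illustration only: the confusion pairs, the function names and the sample sentence are made up, and none of it uses or reflects tesseract's own API.

```python
import re
from collections import Counter

# Hypothetical OCR confusion pairs (misread -> intended). A real list would
# have to be derived from the actual errors seen in the documents.
CONFUSIONS = [("rn", "m"), ("1", "l"), ("0", "o"), ("vv", "w"), ("c", "e")]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(clean_text):
    """Count word frequencies in the clean (born-digital) corpus."""
    return Counter(tokenize(clean_text))


def candidates(word):
    """Yield every word obtained by undoing one confusion at one position."""
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            yield word[:start] + right + word[start + len(wrong):]
            start = word.find(wrong, start + 1)


def correct_word(word, vocab):
    """Return the word unchanged if known, else the most frequent known candidate."""
    if word in vocab:
        return word
    known = [c for c in candidates(word) if c in vocab]
    if not known:
        return word  # nothing better found; leave the token alone
    return max(known, key=lambda c: vocab[c])


def correct_text(text, vocab):
    """Correct token by token. Note: this drops case and punctuation."""
    return " ".join(correct_word(w, vocab) for w in tokenize(text))


# Made-up stand-in for text extracted from the clean digital PDFs.
clean_sample = "the court held that the judgment of the lower court was correct"
vocab = build_vocab(clean_sample)
print(correct_text("The c0urt he1d that the judgrnent was correct", vocab))
```

This only handles a single substitution per word and ignores context entirely, which is exactly why I am asking whether there are better-established approaches.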