I have a large set of already-OCR'ed PDF documents (court judgments) that
contain a hidden text layer. I can extract this text using simple
pdf-to-text tools like pdftotext with good results.
However, the text contains some errors that seem pretty specific to the OCR
procedure and I believe could be easily fixed by some automated procedure.
I also have a set of proper digital PDFs from the same source that could be
used as a training corpus.
I have some limited experience doing this sort of thing using n-grams and
Markov chains, but I'm not familiar with OCR-specific correction algorithms
or more advanced techniques.
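To make the idea concrete, here is a minimal sketch of the kind of corpus-based correction I have in mind: build a word-frequency table from the clean text, then replace any OCR token missing from that vocabulary with the most frequent known word one edit away. All function names and sample strings are made up for illustration, and nothing here is a tesseract feature.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def tokenize(text):
    # Lowercase and keep only runs of ASCII letters (an assumption; real
    # judgments would need digits, accents, and punctuation handled too).
    return re.findall(r"[a-z]+", text.lower())

def build_vocab(clean_text):
    # Word frequencies learned from the clean (born-digital) corpus.
    return Counter(tokenize(clean_text))

def edits1(word):
    # All strings one deletion, substitution, or insertion away from `word`.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + replaces + inserts)

def correct_word(word, vocab):
    # Known words are left alone; unknown words are mapped to the most
    # frequent known candidate, or kept unchanged if there is none.
    if word in vocab:
        return word
    candidates = [w for w in edits1(word) if w in vocab]
    if not candidates:
        return word
    # Break frequency ties alphabetically so the result is deterministic.
    return max(candidates, key=lambda w: (vocab[w], w))

def correct_text(ocr_text, vocab):
    return " ".join(correct_word(w, vocab) for w in tokenize(ocr_text))

if __name__ == "__main__":
    vocab = build_vocab("the court finds that the appeal is dismissed")
    print(correct_text("Tbe coart finds that the appeal is dismissed", vocab))
```

This only catches single-character errors and throws away case and punctuation, so it is a starting point rather than a usable corrector. Typical OCR confusions such as "rn" read for "m" need more than one edit, and would need a confusion table or a context model on top.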
Does tesseract have any facilities for making this sort of correction? If
not, what are some places or algorithms I can investigate?