I have a large set of already-OCR'ed PDF documents (court judgments) that 
contain a hidden text layer. I can extract this text using simple 
pdf-to-text tools like pdftotext with good results.

However, the text contains some errors that seem specific to the OCR 
process, and I believe they could be fixed by an automated procedure. 
I also have a set of proper digital PDFs from the same source that could be 
used as a training corpus.

I have some limited experience doing this sort of thing using n-grams and 
Markov chains, but I'm not familiar with OCR-specific correction algorithms 
or more advanced techniques.
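
To make the idea concrete, here is a minimal sketch of the kind of 
corpus-based correction I have in mind. It is a generic frequency-based 
spelling corrector, not an OCR-specific algorithm and not anything Tesseract 
provides. The function names (build_vocab, correct_word, correct_text) are 
hypothetical, and it assumes plain ASCII English words and at most one 
character error per word.

```python
import re
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: ASCII English text


def build_vocab(clean_text):
    """Count word frequencies in text extracted from the clean digital PDFs."""
    return Counter(re.findall(r"[a-z]+", clean_text.lower()))


def edits1(word):
    """All strings one delete, swap, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + swaps + replaces + inserts)


def correct_word(word, vocab):
    """Keep known words; otherwise pick the most frequent known neighbour."""
    lower = word.lower()
    if lower in vocab:
        return word
    candidates = [w for w in edits1(lower) if w in vocab]
    if not candidates:
        return word  # nothing plausible found, leave the OCR output alone
    best = max(candidates, key=lambda w: vocab[w])
    # Restore a leading capital if the OCR'ed word had one.
    return best.capitalize() if word[0].isupper() else best


def correct_text(text, vocab):
    """Apply correct_word to every alphabetic token, leaving the rest as-is."""
    return re.sub(r"[A-Za-z]+", lambda m: correct_word(m.group(0), vocab), text)


if __name__ == "__main__":
    vocab = build_vocab("the court held that the judgment of the court was final")
    print(correct_text("Tbe court beld that", vocab))
```

This only handles single-character errors and ignores context, so it would 
miss typical OCR confusions such as "rn" read for "m", and it would wrongly 
"fix" rare names that are missing from the training corpus.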

Does Tesseract have any facilities for making this sort of correction? If 
not, what are some resources or algorithms I can investigate?

