On 6/14/2016 8:33 AM, Matus UHLAR - fantomas wrote:
that is just what I would like to know: If OCR produces results good
enough for BAYES and other rules.
I don't think there's a difference between bayes and other rules.
It's also possible that BAYES would have better results with misread
characters than other rules.
I've dealt with OCR in the past, and have always had to go back
afterwards and manually proofread the results. I expect the impact on
Bayes would be a massively increased dictionary of rare words that
result from poor "keming" in the image. Some PDFs are written in
extractable text instead of images, but those tend to use
fractional-width spaces for kerning so it's not always easy to figure
out what's a real word there either.
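
To make the dictionary-growth concern concrete, here is a rough Python
sketch. Everything in it is an assumption for illustration: the
confusion table (rn/m, l/1, and so on) is made up rather than taken from
any real OCR engine, noisy_ocr() is a hypothetical stand-in for one, and
the whitespace split is far simpler than the tokenizer Bayes actually
uses. It only shows how a handful of clean words fans out into many
distinct tokens once characters get misread at random.

```python
import random

# Hypothetical character confusions, invented for this sketch only.
CONFUSIONS = {"rn": "m", "m": "rn", "l": "1", "O": "0", "cl": "d"}

def noisy_ocr(word, rng, error_rate=0.3):
    """Simulate OCR misreads: swap each confusable substring with
    probability error_rate, otherwise copy the character unchanged."""
    out = []
    i = 0
    while i < len(word):
        for src, dst in CONFUSIONS.items():
            if word.startswith(src, i) and rng.random() < error_rate:
                out.append(dst)
                i += len(src)
                break
        else:
            # No substitution fired at this position.
            out.append(word[i])
            i += 1
    return "".join(out)

def vocabulary(messages):
    """Distinct whitespace-separated tokens across all messages."""
    return {tok for msg in messages for tok in msg.split()}

rng = random.Random(0)  # fixed seed so the run is repeatable
clean = ["modern kerning milk claim"] * 200
noisy = [" ".join(noisy_ocr(w, rng) for w in msg.split()) for msg in clean]

print("clean vocabulary:", len(vocabulary(clean)))
print("noisy vocabulary:", len(vocabulary(noisy)))
```

The clean corpus only ever has four tokens, while the noisy one picks up
variants like "modem" or "mi1k" that each show up in only a fraction of
the messages.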
That said, Google seems to use OCR on images in their filtering (quoth
Wikipedia), so maybe it works when you have a sufficiently enormous data
set that the OCR glitches are no longer rare and a decent inference can
be made from them.