Re: How SA reactes to a bunch of garbage characters

Joe Quinn Tue, 14 Jun 2016 05:57:30 -0700

On 6/14/2016 8:33 AM, Matus UHLAR - fantomas wrote:

that is just what I would like to know: If OCR produces results good enough for BAYES and other rules.


I don't think there's a difference between bayes and other rules.
It's also possible that BAYES would have better results with misread
characters than other rules.

I've dealt with OCR in the past, and have always had to go back afterwards and manually proofread the results. I expect the impact on Bayes would be a massively increased dictionary of rare words that result from poor "keming" in the image. Some PDFs are written in extractable text instead of images, but those tend to use fractional-width spaces for kerning so it's not always easy to figure out what's a real word there either.
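
As a rough illustration of the dictionary-growth concern, here is a minimal Python sketch. The confusion pairs, the word list, and the function names are all made up for illustration; this is not SpamAssassin's Bayes tokenizer or the behavior of any real OCR engine. It only shows how a handful of plausible misreads can nearly double the number of distinct tokens a learner has to track.

```python
# Hypothetical OCR confusions (assumption: chosen only for illustration).
CONFUSIONS = [("rn", "m"), ("cl", "d"), ("vv", "w")]


def ocr_variants(word):
    """Return the word itself plus each single-confusion misreading of it."""
    variants = {word}
    for good, bad in CONFUSIONS:
        if good in word:
            variants.add(word.replace(good, bad))
    return variants


def vocabulary(words, noisy):
    """Collect distinct tokens, optionally including the misread variants."""
    vocab = set()
    for word in words:
        vocab |= ocr_variants(word) if noisy else {word}
    return vocab


# Assumed sample words; a real corpus would be far larger.
clean_words = ["kerning", "modern", "clear", "internal", "spam"]
clean = vocabulary(clean_words, noisy=False)
noisy = vocabulary(clean_words, noisy=True)

print(len(clean), sorted(clean))
print(len(noisy), sorted(noisy))
```

With these assumed inputs, every word containing a confusable pair contributes one extra token, and some of the misreads happen to be real words themselves, which is part of why the result is hard to clean up automatically.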

That said, Google seems to use OCR on images in their filtering (quoth Wikipedia), so maybe it works when you have a sufficiently enormous data set that the OCR glitches are no longer rare and a decent inference can be made from them.
