Aleks Kleyn wrote:
> A few words about the Google errors mentioned below. As far as I
> understand, they restored the text from scanned images. This is
> artificial intelligence, a field that evolves slowly.

While OCR in general is a hard problem, the 'typical errors' I referred to can very well be tackled with a dictionary approach. In German, a word cannot start with 'ß', so a word starting with that letter has a high probability of being an erroneous match and can automatically be fed into a dictionary-assisted recognition stage. The same is true for words starting with exactly two capital letters 'AV'. Note that I'm only speaking of the simple cases where the rest of the word is already spelled correctly.

The presence of such typical errors indicates that Google (so far) doesn't use a dictionary to decrease the error rate.

Best regards,
Stephan Hennig

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Stephan Hennig
> Sent: Thursday, June 30, 2011 7:25 PM
> To: About TeX hyphenation patterns, old and new.
> Subject: [tex-hyphen] Google Books corpus
>
> Additionally, the German corpus contains lots of
> typical OCR errors like
>
>   incorrect             correct
>
>   ßrot                  Brot
>   AVahrscheinlichkeit   Wahrscheinlichkeit
>
> that I would have expected to be handled better by Google. (Well, there
> are many such typical errors, but each with a low frequency, so that
> in total they shouldn't significantly skew the data.)
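
As a rough illustration of the dictionary approach described in the reply above, here is a minimal Python sketch. It is only a sketch under stated assumptions, not a description of what Google or any OCR engine does: the word list GERMAN_WORDS, the table PREFIX_FIXES and the function names are all hypothetical, and only the two error patterns named in this thread ('ß' for 'B', 'AV' for 'W') are covered.

```python
import re

# Hypothetical mini-dictionary; a real system would load a full German word list.
GERMAN_WORDS = {"Brot", "Wahrscheinlichkeit"}

# Suspicious word start -> assumed intended letter (the two cases from the thread).
PREFIX_FIXES = {"ß": "B", "AV": "W"}


def suspicious_prefix(word):
    """Return the suspicious prefix of `word`, or None if it looks fine."""
    # A German word cannot start with 'ß'.
    if word.startswith("ß"):
        return "ß"
    # Exactly two capitals 'AV' followed by a lowercase letter.
    if re.match(r"AV[a-zäöüß]", word):
        return "AV"
    return None


def correct(word, dictionary=GERMAN_WORDS):
    """Replace a suspicious prefix only if the result is a dictionary word."""
    prefix = suspicious_prefix(word)
    if prefix is None:
        return word
    candidate = PREFIX_FIXES[prefix] + word[len(prefix):]
    # Only the simple case: the rest of the word is already spelled correctly.
    return candidate if candidate in dictionary else word


if __name__ == "__main__":
    for token in ["ßrot", "AVahrscheinlichkeit", "Brot"]:
        print(token, "->", correct(token))
```

The dictionary lookup is what keeps the rule conservative: a flagged word is rewritten only when the repaired form is a known word, and otherwise it is left unchanged for a later recognition stage to handle.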
