I agree with you. However, we do not have the code behind it. We do not know how many rules the Google team uses in its code. Also, German has one set of rules and French has another, and a single book can contain several languages at the same time. But we, as human beings, can fix the text when we look at it.
The problem is deeper than you think. Of course you can trust Watson. However, when you speak with another person, you want to be sure that he or she tells you the truth.

Let me tell you a story from my past. In 1970 there was a PC in Ukraine that was able to perform analytic calculations. I used it to calculate gravitational fields. Everything was fine. But once I got a curvature of 0 as the answer. It was a special field, namely Friedman space, and it was natural to believe the answer. Then I felt that something was wrong and debugged the code. I discovered that at a certain point a large expression appeared in memory and the PC lost half of the expression. The rest was equal to 0.

The following thought came to my mind. If software develops at such a speed, then one day the following may happen. Somebody knows that differential equations exist but does not know how to solve them. For his work he relies on a PC application, but because of his lack of knowledge he does not notice when the PC gives a wrong answer.

I think you search in German books because you know German. This helps you to catch errors. I can also advise contacting the Google team to help them fix these problems. But I do not promise that this task is easy.

Aleks Kleyn
http://alekskleyn.dyndns-home.com:4080/
http://sites.google.com/site/AleksKleyn/
http://arxiv.org/a/kleyn_a_1
http://AleksKleyn.blogspot.com/
http://KleynAleks.blogspot.com/

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Stephan Hennig
Sent: Tuesday, July 12, 2011 12:42 PM
To: [email protected]
Subject: Re: [tex-hyphen] Google Books corpus

Aleks Kleyn wrote:

> A few words about the Google errors mentioned below. As far as I understand,
> they restored the text from scanned images. This is artificial
> intelligence, a field which evolves slowly.

While OCR in general is a hard problem, those 'typical errors' I referred to can very well be tackled by a dictionary approach. In the German language a word cannot start with 'ß'.
So a word starting with that letter has a high probability of being an erroneous match and can automatically be fed into a dictionary-assisted recognition stage. The same is true for words starting with exactly the two capital letters 'AV'. Note, I'm only speaking of the simple cases where the rest of the word is already spelled correctly.

The presence of such typical errors indicates that Google (so far) doesn't use a dictionary to decrease the error rate.

Best regards,
Stephan Hennig

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Stephan Hennig
> Sent: Thursday, June 30, 2011 7:25 PM
> To: About TeX hyphenation patterns, old and new.
> Subject: [tex-hyphen] Google Books corpus
>
> Additionally, the German corpus contains lots of
> typical OCR errors like
>
>   incorrect              correct
>
>   ßrot                   Brot
>   AVahrscheinlichkeit    Wahrscheinlichkeit
>
> that I would have expected to be handled better by Google. (Well, there
> are many such typical errors, but each has a low frequency, so that
> in total they shouldn't generate significant skew in the data.)
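
For illustration, the dictionary approach described in the reply above could be sketched roughly as follows. This is only a minimal sketch, not what Google or anyone on this list actually uses. The word list `KNOWN_WORDS`, the rule table `RULES`, and the function `suggest_correction` are hypothetical names, and the two substitution rules ('ß' for 'B', 'AV' for 'W') are assumptions taken from the two examples in the thread. As in the original remark, it only covers the simple case where the rest of the word is already spelled correctly.

```python
import re

# Hypothetical stand-in for a real German word list.
KNOWN_WORDS = {"Brot", "Wahrscheinlichkeit", "Wasser"}

# Assumed OCR confusions, based only on the two examples in the thread.
# The lookahead requires a lowercase letter after the suspicious prefix,
# so all-caps tokens such as "AVE" are left alone.
RULES = [
    (re.compile(r"^ß(?=[a-zäöüß])"), "B"),   # 'ß' recognized in place of 'B'
    (re.compile(r"^AV(?=[a-zäöüß])"), "W"),  # 'AV' recognized in place of 'W'
]


def suggest_correction(token, known_words=KNOWN_WORDS):
    """Return a dictionary-confirmed correction for token, or None."""
    if token in known_words:
        return None  # already a known word, nothing to fix
    for pattern, replacement in RULES:
        candidate, count = pattern.subn(replacement, token, count=1)
        # Accept the repair only if the dictionary confirms the result.
        if count and candidate in known_words:
            return candidate
    return None


if __name__ == "__main__":
    for word in ["ßrot", "AVahrscheinlichkeit", "Brot", "AVE"]:
        print(word, "->", suggest_correction(word))
```

The dictionary check is what keeps the sketch conservative: a token with a suspicious prefix is only rewritten when the repaired form is a known word, and everything else is left untouched for a later recognition stage.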
