Thanks for the idea and references, Cliff. It is common for a visitor to test a chatterbot by pounding on the keyboard, speaking a foreign language, typing binary code, etc.
Currently in EllaZ systems, if the input exceeds a minimum length (I don't recall the number of characters offhand), we check that it contains at least two words from an English Scrabble dictionary ("Enable" is the name, as I recall). The advantage of a Scrabble dictionary is that it contains plural and tense variations in a simple word-list DB. I imagine that we could also add lists of proper names and place names without bogging down the online program too much.

A similar technique could be used to ID which foreign language is being typed, but the database could start to snowball. The trigram method could be a more elegant and efficient way to do the same thing.

Later . . . Kevin

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Cliff Stabbert
Sent: Monday, December 09, 2002 11:50 AM
To: Gary Miller
Subject: Re: [agi] general patterns induction

. . . .

As a quick and dirty method for checking language, counting trigram frequencies might work. A trigram is a specific sequence of three letters; just as "e" is the most common letter in the English language, certain trigrams occur more often, others less, and the specific distribution varies from language to language. E.g., "cht" is more common in German (than in English), "cce" in Italian and "eau" in French.

Some public domain dictionaries, a few probability formulas and you're on your way. Google around a bit for "trigram frequencies" and the like; the technique is often used in cryptography. http://web.mit.edu/craighea/www/ldetect/ might help.

. . .
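
For anyone who wants to try the trigram idea described above, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the one-sentence training samples, the function names (trigram_counts, cosine, guess_language), the cosine-similarity scoring, and the 0.1 cutoff. It is not what EllaZ does, nor necessarily the method behind the ldetect page; a usable version would build its profiles from large public domain texts for each language.

```python
from collections import Counter
import math

# Tiny illustrative samples (assumption: a real system would build its
# profiles from large per-language corpora, not single sentences).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "german": "ich möchte nicht mehr warten weil die nacht kalt ist und das licht schwach leuchtet",
    "french": "le bateau sur l'eau est très beau et nous allons au château avec beaucoup de cadeaux",
}


def trigram_counts(text):
    """Count overlapping three-character sequences in lower-cased text."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count tables (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# One trigram profile per language, built once at start-up.
PROFILES = {lang: trigram_counts(sample) for lang, sample in SAMPLES.items()}


def guess_language(text, min_score=0.1):
    """Return the best-matching language name, or None if nothing matches well.

    min_score is a hypothetical cutoff; it would need tuning on real data.
    """
    counts = trigram_counts(text)
    scores = {lang: cosine(counts, profile) for lang, profile in PROFILES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else None


if __name__ == "__main__":
    for line in ["the dog sleeps", "asdfghjkl qwertyuiop zxcvbnm"]:
        print(repr(line), "->", guess_language(line))
```

Returning None for input that matches no profile is what would let a chatterbot treat keyboard pounding differently from a real foreign language. With samples this small the scores are only suggestive, so the cutoff and the choice of similarity measure are the first things to revisit.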