Thanks for the idea and references, Cliff.  It is common for a visitor to
test a chatterbot by pounding on the keyboard, speaking a foreign language,
typing binary code, etc.

Currently in EllaZ systems, if the input exceeds a minimum length (I don't
recall the number of characters off-hand) we check that it contains at least
two words from an English Scrabble dictionary ("Enable" is the name, as I
recall).  The advantage of a Scrabble dictionary is that it contains plural
and tense variations in a simple word list DB.  I imagine that we could also
add lists of proper names and place names without bogging down the on-line
program too much.  A similar technique could be used to identify which
foreign language is being typed, but the database could start to snowball.
The trigram method could be a more elegant and efficient way to do the same
thing.
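
For illustration, the dictionary check described above might look roughly
like the sketch below.  This is not the EllaZ code: the function name, the
tiny SAMPLE_WORDS set (a stand-in for a real word list such as Enable), and
the MIN_LENGTH value are all assumptions made for the example.

```python
import re

# Hypothetical stand-in for a full Scrabble-style word list.
SAMPLE_WORDS = {
    "the", "is", "are", "you", "how", "what", "your", "name",
    "hello", "cats", "walked",
}

MIN_LENGTH = 10       # assumed threshold; the real value is not stated
MIN_KNOWN_WORDS = 2   # "at least two words", as described above


def looks_like_english(text, words=SAMPLE_WORDS,
                       min_length=MIN_LENGTH, min_known=MIN_KNOWN_WORDS):
    """Return False only for longer input with too few dictionary words."""
    if len(text) <= min_length:
        return True  # too short to judge, so let it through
    # Split into lowercase alphabetic tokens; digits and punctuation drop out.
    tokens = re.findall(r"[a-z]+", text.lower())
    known = sum(1 for token in tokens if token in words)
    return known >= min_known
```

With a real word list loaded into a set, each lookup is a constant-time
membership test, so adding proper names and place names should mostly cost
memory rather than response time.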

Later . . . Kevin

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of
Cliff Stabbert
Sent: Monday, December 09, 2002 11:50 AM
To: Gary Miller
Subject: Re: [agi] general patterns induction
. . . .
As a quick and dirty method for checking language, counting trigram
frequencies might work.  A trigram is a specific sequence of three
letters; just as "e" is the most common letter in the English
language, certain trigrams occur more often, others less, and the
specific distribution varies from language to language.  E.g.,
"cht" is more common in German (than in English), "cce" in Italian and
"eau" in French.  Some public domain dictionaries, a few probability
formulas and you're on your way.  Google around a bit for "trigram
frequencies" and the like; it's often used in cryptography.
http://web.mit.edu/craighea/www/ldetect/ might help.
. . .
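
A minimal sketch of the trigram idea, under stated assumptions: each
language profile is simply a table of trigram counts built from some sample
text, and an input is assigned to the profile it most resembles.  Cosine
similarity is used here as one simple choice of "probability formula"; the
message above does not specify one.  The function names and the idea of
passing profiles as a dict are hypothetical.

```python
from collections import Counter
import math


def trigram_counts(text):
    """Count three-letter sequences inside each word of the text."""
    # Replace anything that is not a letter with a space, then split.
    letters = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    counts = Counter()
    for word in letters.split():
        for i in range(len(word) - 2):
            counts[word[i:i + 3]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two trigram count tables."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def guess_language(text, profiles):
    """Return the name of the profile most similar to the text."""
    sample = trigram_counts(text)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))
```

Profiles would be built once, e.g. `profiles = {"english":
trigram_counts(english_text), ...}`, from reasonably large samples of each
language.  Short inputs give noisy counts, so results on a few words should
be treated with caution.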

