Peter Kirk wrote: If it's a heuristic we're after, then why split hairs and try to make all the rules ourselves? Get a big ol' mess of training data in as many languages as you can and hand it over to a class full of CS graduate students studying Machine Learning. Throw it at some neural networks, go Bayesian with digraphs, whatever. Analyzing multigraph frequency (say, strings of up to four characters) would probably do a pretty decent job just by itself.
This one also looks dangerous.
What do you mean by "dangerous"? This is a heuristic algorithm, so it is not supposed to work always, but only in some lucky cases.
If the lucky cases average to, say, 20% or less, then it is a bad and useless
algorithm; if they average to, say, 80% or more, then it is good and
useful. But you can't ask that it work in 100% of cases, or it
wouldn't be a heuristic anymore.
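
For what it's worth, the multigraph-frequency idea quoted above fits in a few lines of Python. This is only a rough sketch under my own assumptions: the names (SAMPLES, ngrams, train, guess) are made up for illustration, the two training sentences are toy data, and the add-one smoothed log-frequency score is just one simple way to compare profiles, not a method anyone in this thread proposed. A real attempt would need far more text per language.

```python
from collections import Counter
import math

# Toy training data (hypothetical); a real profile needs far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "es": "el zorro salta sobre el perro perezoso y el gato duerme en la casa",
}

def ngrams(text, max_n=4):
    """Yield every character n-gram of length 1..max_n from the text."""
    # Lowercase, collapse whitespace, and pad so word edges count too.
    text = " " + " ".join(text.lower().split()) + " "
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def train(samples, max_n=4):
    """Map each label to a Counter of its n-gram counts."""
    return {label: Counter(ngrams(text, max_n))
            for label, text in samples.items()}

def guess(text, profiles, max_n=4):
    """Return the label whose profile scores the text highest."""
    # Shared vocabulary size, used for add-one smoothing.
    vocab = len(set().union(*profiles.values()))
    best_label, best_score = None, -math.inf
    for label, counts in profiles.items():
        total = sum(counts.values())
        # Sum of smoothed log relative frequencies: a heuristic score,
        # not a properly normalized probability.
        score = sum(math.log((counts[g] + 1) / (total + vocab))
                    for g in ngrams(text, max_n))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

profiles = train(SAMPLES)
print(guess("the lazy dog and the cat", profiles))
print(guess("el perro perezoso y el gato", profiles))
```

Whether something like this lands nearer the 20% end or the 80% end is exactly what you would have to measure on text it has not seen before.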
~mark

