Re: Detecting encoding in Plain text

Mark E. Shoulson Tue, 13 Jan 2004 22:52:37 -0800

On 01/13/04 05:40, Marco Cimarosti wrote:

Peter Kirk wrote:

This one also looks dangerous.
What do you mean by "dangerous"? This is a heuristic algorithm, so it is
not supposed to work always but only in some lucky cases.
If lucky cases average to, say, 20% or less, then it is a bad and useless algorithm; if they average to, say, 80% or more, then it is good and useful. But you can't ask that it work in 100% of cases, or it wouldn't be heuristic anymore.

If it's a heuristic we're after, then why split hairs and try to make all the rules ourselves? Get a big ol' mess of training data in as many languages as you can and hand it over to a class full of CS graduate students studying Machine Learning. Throw it at some neural networks, go Bayesian with digraphs, whatever. Analyzing multigraph frequency (say, strings of up to four characters) would probably do a pretty decent job just by itself.
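
As a rough illustration of the multigraph idea, here is a minimal sketch in Python. Everything in it is an assumption made for the example rather than a tested detector: the function names (ngrams, train, score, guess_encoding) are hypothetical, the candidate encoding list is arbitrary, and the add-one smoothing is just the simplest thing that works. It counts character strings of up to four characters in known-good text, then decodes the unknown bytes with each candidate encoding and keeps the one whose result looks most like the training data.

```python
import math
from collections import Counter


def ngrams(text, max_n=4):
    """Yield every substring of length 1..max_n (the "multigraphs")."""
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def train(samples, max_n=4):
    """Count multigraphs over a list of known-good Unicode strings."""
    counts = Counter()
    for sample in samples:
        counts.update(ngrams(sample, max_n))
    return counts


def score(text, counts, max_n=4):
    """Mean log-frequency of the text's multigraphs, with add-one smoothing."""
    grams = list(ngrams(text, max_n))
    if not grams:
        return float("-inf")
    total = sum(counts.values())
    return sum(math.log((counts[g] + 1) / (total + 1)) for g in grams) / len(grams)


def guess_encoding(data, counts,
                   candidates=("utf-8", "latin-1", "cp1251", "koi8-r")):
    """Return the candidate whose decoding best matches the training counts.

    Candidates that cannot decode the bytes at all are skipped; ties go to
    the earlier candidate. Returns None if no candidate decodes the data.
    """
    best_name, best_score = None, None
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue
        s = score(text, counts)
        if best_score is None or s > best_score:
            best_name, best_score = name, s
    return best_name


# Toy training set; a real one would be the "big ol' mess" of text
# in many languages.
counts = train([
    "привет мир, это простой пример русского текста",
    "hello world, this is a plain example of english text",
])
guess = guess_encoding("привет мир".encode("koi8-r"), counts)
```

With a toy training set like this one, the sketch only tells apart encodings whose wrong decodings produce strings it has never seen; anything subtler needs far more data, which is the point of handing the job to a learner instead of writing the rules by hand.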

~mark
