Re: Diacritic and similar foldings and spam filtering

Kenneth Whistler Thu, 08 Jul 2004 16:14:34 -0700

Peter Kirk said:

> I made a serious point, not apparently made in the UTR draft, that 
> diacritic folding may be useful for spam filtering and similar 
> applications including finding misleading URIs.


This seems like a reasonable point to make and to add to the discussion
of folding in UTR #30.
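
For concreteness, here is a minimal sketch of what diacritic folding could look like in a filter. It assumes folding means "decompose canonically, then drop combining marks"; that is an illustrative approximation, not the definition in the UTR draft, and the function name fold_diacritics is made up for this example.

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    """Approximate diacritic folding (an assumption, not UTR #30's exact rules)."""
    # NFD splits precomposed letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns 0 for characters that are not combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left so the result is in a stable form.
    return unicodedata.normalize("NFC", stripped)

print(fold_diacritics("António"))  # Antonio
print(fold_diacritics("vìágrà"))   # viagra
```

Letters with no canonical decomposition, such as "ø", pass through unchanged, so a real filter would need extra mappings on top of this.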

> António suggested a 
> serious point that for more comprehensive spam filtering an enhanced 
> folding might be useful, including such foldings as | > I (capital i) 
> and l (small L), 0 (zero) > O, |\/| > M. Would such foldings in fact be 
> feasible and useful? 

Well, someone could try, I suppose, but this stuff tails out pretty
rapidly into mind-boggling complexity, because leetspeak (1337) is
deliberately obscurantist in its own right, let alone as a
spoofing technique to fool spam filters:

http://en.wikipedia.org/wiki/Leet

You can't just fold this stuff into "English" by some kind of
set of transliteration tables -- it really requires an elaborate
system of lexical replacement. It's a *cant* as well as an
obscurantist orthography.

And leetspeak interleaves with another entire set of conventions
for chatroom abbreviations ("cya l8r"), and it also grades off
into Gangsta.

> They would have to be part of a general similar 
> shapes folding. 

I think it goes way beyond that. The first level of similar
shapes folding appropriate to Unicode is simply the normal,
shape-based confusion that the well-meaning user of the
characters may have to deal with.

But 1337 can treat "><" as equivalent to "x" and "xXoRs" as
equivalent to "x". The first is somewhat shape-based, but
the second is just lexical convention at work.
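
A rough sketch of how far a pure table gets, and where it stops. The table below is hypothetical: it holds the foldings mentioned in this thread plus the digits of "1337" itself, and the names LOOKALIKES and fold_lookalikes are invented for the example. Matching is longest-first so that "|\/|" is tried before "|".

```python
# Hypothetical lookalike table; not taken from UTR #30 or any real filter.
LOOKALIKES = {
    "|\\/|": "m",  # the four characters | \ / |
    "><": "x",
    "|": "i",
    "0": "o",
    "1": "l",
    "3": "e",
    "7": "t",
}

def fold_lookalikes(text: str) -> str:
    """Replace lookalike sequences, trying the longest table keys first."""
    text = text.lower()
    keys = sorted(LOOKALIKES, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(LOOKALIKES[key])
                i += len(key)
                break
        else:
            # No table entry starts here; keep the character as it is.
            out.append(text[i])
            i += 1
    return "".join(out)

print(fold_lookalikes("|\\/|0ney"))  # money
print(fold_lookalikes("1337"))       # leet
print(fold_lookalikes("cya l8r"))    # cya l8r (unchanged)
print(fold_lookalikes("xXoRs"))      # xxors (only lowercased)
```

The shape-based cases fold; the lexical ones come out untouched, and no amount of extra table rows turns "l8r" into "later" without also knowing the word. The table also mangles innocent text: "10 items" folds to "lo items".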

> 
> Could something like this be defined within the framework of UTR #30? 

I think it's out of scope.

> Should it be defined within the UTR? I suspect it would be better left 
> to the discretion of individual developers, who could then rapidly 
> tailor their foldings to any new lookalikes exploited by spammers.

This particular war is currently being won by the spammers,
by the way.

--Ken
