Peter Kirk said:

> I made a serious point, not apparently made in the UTR draft, that
> diacritic folding may be useful for spam filtering and similar
> applications including finding misleading URIs.

This seems like a reasonable point to make and to add to the
discussion of folding in UTR #30.

> António suggested a
> serious point that for more comprehensive spam filtering an enhanced
> folding might be useful, including such foldings as | > I (capital i)
> and l (small L), 0 (zero) > O, |\/| > M. Would such foldings in fact be
> feasible and useful?

Well, someone could try, I suppose, but this stuff tails out pretty
rapidly into mind-boggling complexity, because leetspeak (1337) is
deliberately obscurantist in its own right, let alone as a spoofing
technique to fool spam filters:

http://en.wikipedia.org/wiki/Leet

You can't just fold this stuff into "English" by some kind of set
of transliteration tables -- it really requires an elaborate system
of lexical replacement. It's a *cant* as well as an obscurantist
orthography. And the leetspeak interleaves with another entire
set of conventions for chatroom abbreviations ("cya l8r"), and
it also grades off into Gangsta.

> They would have to be part of a general similar
> shapes folding.

I think it goes way beyond that. The first level of similar shapes
folding appropriate to Unicode is simply the normal, shape-based
confusion that the well-meaning user of the characters may have
to deal with. But 1337 can treat "><" as equivalent to "x" and
"xXoRs" as equivalent to "x". The first is somewhat shape-based,
but the latter is just lexical conventions at work.

>
> Could something like this be defined within the framework of UTR #30?

I think it's out of scope.

> Should it be defined within the UTR? I suspect it would be better left
> to the discretion of individual developers, who could then rapidly
> tailor their foldings to any new lookalikes exploited by spammers.

This particular war is currently being won by the spammers, by the way.

--Ken
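
As a rough illustration of the distinction drawn above, here is a minimal
Python sketch, assuming a developer-maintained table of the kind Peter
describes. The table LOOKALIKES and the function names are hypothetical and
are not taken from UTR #30. Diacritic folding is done by decomposing to NFD
and dropping combining marks. The lookalike step is a plain string
substitution, so it can only handle shape-based cases.

```python
import unicodedata

# Hypothetical, deliberately tiny lookalike table (an assumption for this
# sketch, not data from UTR #30). Values are arbitrary representatives
# used only to build a comparison key.
LOOKALIKES = {
    "|\\/|": "m",  # |\/| > M
    "><": "x",     # shape-based: two brackets that look like an x
    "0": "o",      # zero > O
    "1": "l",      # digit one and small L
    "|": "l",      # vertical bar and small L
    "i": "l",      # capital i (after case folding) and small L
}


def fold_diacritics(text: str) -> str:
    """Decompose to NFD and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_lookalikes(text: str) -> str:
    """Build a comparison key: strip diacritics, case-fold, then apply
    the lookalike table, longest source sequence first."""
    folded = fold_diacritics(text).casefold()
    for source in sorted(LOOKALIKES, key=len, reverse=True):
        folded = folded.replace(source, LOOKALIKES[source])
    return folded


if __name__ == "__main__":
    # Shape-based variants collapse to the same key.
    print(fold_lookalikes("V1AGRA") == fold_lookalikes("viagra"))
    print(fold_lookalikes("|\\/|oney"))
    # Lexical conventions are untouched by any table of this kind.
    print(fold_lookalikes("cya l8r"))
```

The result is a matching key, not readable text: "viagra" itself folds to
"vlagra", which is fine for comparison but not for display. The last call
returns "cya l8r" unchanged, which is the limit described above: no table of
character substitutions turns it into "see you later".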

