Diacritic and similar foldings and spam filtering

Peter Kirk Thu, 08 Jul 2004 14:52:54 -0700

As Sarasvati points out, the thread "Looking for transcription or transliteration standards latin->arabic" had gone way off topic; I also understand that some might find António's examples inappropriate. But the discussion of diacritic and similar foldings is an important one, relevant to Unicode and specifically to the UTR #30 draft. The public review period for this has now finished, but in the version that was under review, http://www.unicode.org/reports/tr30/tr30-3.html, the data file for DiacriticRemoval is still "TBD". Is there in fact now a released data file or draft for this folding?

I made a serious point, apparently not made in the UTR draft, that diacritic folding may be useful for spam filtering and similar applications, including finding misleading URIs. António raised the equally serious point that for more comprehensive spam filtering an enhanced folding might be useful, including such foldings as | > I (capital i) and l (small L), 0 (zero) > O, |\/| > M. Would such foldings in fact be feasible and useful? They would have to be part of a general similar-shapes folding. And such a folding would also need to deal with cases such as Cyrillic A and Greek capital alpha > A: with the whole of Unicode available, spammers could very easily write ЅРАМ (Cyrillic) or SΡΑΜ (mostly Greek) instead of SPAM, in an attempt to defeat spam filtering.
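
As a rough illustration of the kind of folding discussed above, here is a minimal Python sketch. It is not taken from UTR #30: the LOOKALIKES and SEQUENCES tables and all function names are hypothetical, the tables cover only the handful of characters mentioned in this message, and folding | to I (rather than to small L) is an arbitrary choice. Diacritic removal is approximated by canonical decomposition (NFD) followed by dropping combining marks, which may well differ from whatever the DiacriticRemoval data file ends up specifying.

```python
import unicodedata

# Hypothetical, deliberately tiny lookalike table (an assumption for
# illustration only; a real filter would need a far larger one).
LOOKALIKES = {
    "\u0405": "S",  # CYRILLIC CAPITAL LETTER DZE
    "\u0420": "P",  # CYRILLIC CAPITAL LETTER ER
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u041C": "M",  # CYRILLIC CAPITAL LETTER EM
    "\u03A1": "P",  # GREEK CAPITAL LETTER RHO
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u039C": "M",  # GREEK CAPITAL LETTER MU
    "0": "O",       # DIGIT ZERO
    "|": "I",       # VERTICAL LINE (could equally be folded to small L)
}

# Multi-character lookalikes, applied before the single-character table
# so that "|\/|" is not broken up by the "|" entry above.
SEQUENCES = [("|\\/|", "M")]


def remove_diacritics(text):
    """Decompose to NFD and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_lookalikes(text):
    """Replace known lookalike sequences, then single characters."""
    for seq, repl in SEQUENCES:
        text = text.replace(seq, repl)
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)


def fold_for_matching(text):
    """Combined folding intended only for comparison, never for display."""
    return fold_lookalikes(remove_diacritics(text)).upper()
```

With these tables, fold_for_matching maps both the all-Cyrillic and the mostly-Greek spellings above to the ASCII string SPAM, so a filter could compare folded text against a folded keyword list. Because the tables are ordinary dictionaries, a developer could extend them as new lookalikes turn up.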

Could something like this be defined within the framework of UTR #30? Should it be defined within the UTR? I suspect it would be better left to the discretion of individual developers, who could then rapidly tailor their foldings to any new lookalikes exploited by spammers.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
