Re: Diacritic and similar foldings and spam filtering

Peter Kirk Thu, 08 Jul 2004 16:39:42 -0700

On 08/07/2004 23:22, Doug Ewell wrote:

Peter Kirk <peterkirk at qaya dot org> wrote:
António suggested a serious point that for more comprehensive spam filtering an enhanced folding might be useful, including such foldings as | > I (capital i) and l (small L), 0 (zero) > O, |\/| > M. Would such foldings in fact be feasible and useful? They would have to be part of a general similar shapes folding.
They might be useful for certain applications, in specific situations,
but Unicode should not ever try to get entangled in this business of
mapping unrelated characters on the basis of glyph similarity alone.
It's just too font-dependent and subjective.
See the sub-heading "Spoofing" in TUS 4.0, Section 5.19 "Unicode
Security," pp. 141-142 for more information.

Thank you for pointing me to this section. This is a useful discussion which shows clearly why spoofing cannot be avoided by identical encoding of confusables. (And I am glad to see some clearer terminology than I had been using.) But it doesn't address my point that UTR #30 folding can be useful in this area, in providing a framework for what might be called "confusable folding".

But I think I agree with you that Unicode should not get into detailed listing of confusables, because it is too font-dependent and subjective. This kind of thing is best left as a user-definable folding.
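
As a rough illustration of what such a user-definable "confusable folding" might look like for spam filtering, here is a minimal Python sketch. The table entries are only the examples mentioned earlier in this thread (| and l to I, 0 to O, |\/| to M); the names CONFUSABLE_FOLDS, strip_diacritics, fold_confusables and fold_key are hypothetical and are not taken from UTR #30 or any other Unicode specification. The diacritic step uses NFD decomposition as a simple stand-in for a proper diacritic folding.

```python
import unicodedata

# Hypothetical, user-supplied table: each key folds to its value.
# The entries are illustrative only; a real table would be chosen per
# application, since glyph similarity is font-dependent and subjective.
CONFUSABLE_FOLDS = {
    "|\\/|": "M",   # multi-character sequence that imitates M
    "|": "I",       # vertical bar
    "l": "I",       # small L
    "0": "O",       # digit zero
}


def strip_diacritics(text):
    """Decompose to NFD and drop combining marks (a simple diacritic folding)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_confusables(text, table=CONFUSABLE_FOLDS):
    """Replace each table key found in text, trying longer keys first."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No key matches at this position: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def fold_key(text):
    """Comparison key: diacritic folding, confusable folding, then case folding."""
    return fold_confusables(strip_diacritics(text)).casefold()
```

With this table, fold_key("M0NEY") and fold_key("|\\/|öney") produce the same key, so a filter comparing keys would treat the two spellings as one word. The folded key is meant only for matching, not for display, and a table this aggressive will also merge legitimately different words.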

Actually I am unclear from UTR #30 whether this is supposed to be a framework for user-definable foldings or should be restricted to the defined list of foldings; the existence of "Foldings based on tailored collation data" suggests that foldings can at least be tailored, but there are no further details of how such foldings are covered by the UTR.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
