On Fri, May 12, 2023 at 09:49:40AM -0500, Dave Funk wrote:
> On Fri, 12 May 2023, Matija Nalis wrote:
> > That is because those domains are not EQUAL? Or did you want a
> > rule that checks only on SIMILAR domain names (e.g. with lowercase
> > letter "L" replaced with number "1" as in your example)?
>
> Now I get it, the OP is looking for some kind of comparison function that
> does an "apparent linguistic distance" evaluation of two strings and returns
> a score that indicates a "visual similarity" value.
> (EG replacing 'l' with '1' or 'O' with '0', etc).

It should be relatively easy to write an SA plugin for that:

- replace those digits and uppercase letters in one of the strings
  with the letters they imitate, convert both to lowercase, and
  compare them
- it should also remove spacer characters (like "paypal" vs "pay-pal")
- it should also not only hit on exact matches, but return similarity
  as a percentage (so trying to fake "spamassassin" with "spamasassin"
  can be detected).

Of course, non-ASCII would complicate those replacement tables
significantly (there are MANY more similar-looking glyphs than in pure
ASCII), but as I treat any IDN domains as suspicious, and they are easy
to detect, it would probably not be such a big deal.

> I've hand coded rules to check for this stuff when frequently abused but I
> don't know of a programmatic algorithm to do it automagically.

I wonder if someone has already done it, or something sufficiently
similar to be used for that purpose?

-- 
Opinions above are GNU-copylefted.
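
P.S. A rough sketch of the three steps listed above, just to show the
logic. It is in Python rather than Perl, so it is not an SA plugin, and
every name in it (LOOKALIKES, normalize, similarity) is made up for
illustration. The lookalike table only covers the '1'/'l' and '0'/'O'
cases mentioned in this thread, the normalization is applied to both
strings instead of just one, and difflib's SequenceMatcher ratio is my
assumption for the "similarity in percentage" part; an edit-distance
measure might well suit better.

```python
from difflib import SequenceMatcher

# Hypothetical table: map lookalike digits to the letters they imitate
# ('1' -> 'l', '0' -> 'o') and delete spacer characters ('-' and '_').
# Dots are kept, since they separate domain labels.
LOOKALIKES = str.maketrans("10", "lo", "-_")


def normalize(name: str) -> str:
    """Fold digits to lookalike letters, drop spacers, then lowercase."""
    return name.translate(LOOKALIKES).lower()


def similarity(a: str, b: str) -> float:
    """Return similarity of the normalized strings as a percentage (0-100)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() * 100


if __name__ == "__main__":
    for fake, real in [("paypa1.com", "paypal.com"),
                       ("pay-pal.com", "paypal.com"),
                       ("spamasassin", "spamassassin")]:
        print(f"{fake} vs {real}: {similarity(fake, real):.1f}%")
```

The first two pairs normalize to identical strings, while the
"spamasassin" pair scores high but below 100, so a rule would need
some threshold (to be tuned) rather than an exact-match test.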