Hi,

I’ve recently received a lot of emails with the following 
subject line:

Subject: H¡gh level of r¡sk. Your account has been hacked. Change yøur passwørd.

and I’ve seen similar emails in the past that use simple mechanical 
substitutions (Greek alpha for ‘a’, Cyrillic a for ‘a’, Cyrillic A for ‘A’, 
Cyrillic VE for ‘B’, Cyrillic IE for ‘E’, Cyrillic EN for ‘H’, etc.).
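
To make the kind of substitution concrete, here is a rough sketch in Python 
(used purely for illustration). The table is hand-picked from the examples 
above plus the two characters in that subject line; it is my own assumption, 
nowhere near complete, and the names LOOKALIKES, fold_to_ascii and 
count_substitutions are made up for this example.

```python
# Hand-picked look-alike table (illustrative assumption, far from complete).
LOOKALIKES = {
    "\u03b1": "a",  # Greek small alpha
    "\u0430": "a",  # Cyrillic small a
    "\u0410": "A",  # Cyrillic capital A
    "\u0412": "B",  # Cyrillic capital VE
    "\u0415": "E",  # Cyrillic capital IE
    "\u041d": "H",  # Cyrillic capital EN
    "\u00a1": "i",  # inverted exclamation mark, as in "H¡gh"
    "\u00f8": "o",  # o with stroke, as in "yøur"
}
FOLD = str.maketrans(LOOKALIKES)


def fold_to_ascii(text):
    """Replace each known look-alike character with its ASCII stand-in."""
    return text.translate(FOLD)


def count_substitutions(text):
    """Count how many characters of text are known look-alikes."""
    return sum(1 for ch in text if ch in LOOKALIKES)
```

A fixed table like this only catches characters someone has already thought 
to list, which is part of why I'm asking about approximate matching.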

The String::Approx module (see https://metacpan.org/pod/String::Approx) allows 
for weighting insertions/deletions/substitutions, and what we’re seeing here is 
a heavy use of substitutions.

I’m thinking about a module where you could supply the plain ASCII string:

High level of risk. Your account has been hacked. Change your password.

and any variant of it produced by character substitution would be matched, as 
long as the proportion of substituted characters stays under some threshold 
(say 10 or 15%, which seems like a reasonable ceiling).
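
A minimal sketch of that threshold idea, again in Python and only for 
illustration: it compares the two strings position by position, so it handles 
substitutions only (no insertions or deletions). The function names and the 
0.15 default are assumptions of mine, not an existing API.

```python
def substitution_ratio(template, candidate):
    """Fraction of positions where the strings differ, ignoring case.

    Returns None when the lengths differ, because a purely
    substitution-based comparison does not apply in that case.
    """
    t, c = template.lower(), candidate.lower()
    if not t or len(t) != len(c):
        return None
    diffs = sum(1 for a, b in zip(t, c) if a != b)
    return diffs / len(t)


def matches_by_substitution(template, candidate, threshold=0.15):
    """True if candidate is template with at most threshold substitutions."""
    ratio = substitution_ratio(template, candidate)
    return ratio is not None and ratio <= threshold


TEMPLATE = ("High level of risk. Your account has been hacked. "
            "Change your password.")
SPAM = ("H\u00a1gh level of r\u00a1sk. Your account has been hacked. "
        "Change y\u00f8ur passw\u00f8rd.")
```

With the subject line above, 4 of the 71 characters are substituted (under 
6%), so matches_by_substitution(TEMPLATE, SPAM) comes out true at either a 
10% or a 15% ceiling.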

I’ve also seen spam where words have been deliberately misspelled to avoid 
exact matches: doubled letters dropped, similar letters swapped (‘n’ for ‘m’, 
‘z’ for ‘s’, ‘k’ for ‘c’, etc.). Simply replacing non-ASCII letters with their 
ASCII “approximates” therefore wouldn’t be sufficient, because there is 
shuffling within the ASCII space as well.
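
For the misspelling case, something like plain Levenshtein edit distance 
(insertions, deletions and substitutions each costing 1) with a length-relative 
threshold is the sort of thing I have in mind. This is a generic textbook 
sketch in Python, not how String::Approx works internally; the names and the 
0.15 default are assumptions, and it doesn't weight substitutions differently 
from insertions or deletions.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b, using two rows of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def approx_match(template, candidate, threshold=0.15):
    """True if the edit distance is at most threshold * len(template)."""
    t, c = template.lower(), candidate.lower()
    if not t:
        return not c
    return edit_distance(t, c) / len(t) <= threshold
```

For example, "Chanje your pasword." is two edits away from "Change your 
password." (one substitution, one dropped letter), which is under 10% of the 
21-character template, so approx_match accepts it.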

Has anyone else considered approximate string matching?

Thanks,

-Philip


