Hi,

>> This is what you want:
>>
>>   uri  URI_PROTO_MC  /^(?!(?-i:[Hh]ttps?:))https?:/i
>>
>> The string inside the parentheses is what you want to _not_ hit, and that
>> part is _not_ case-insensitive, even though the rest of the expression _is_
>> case-insensitive.
>>
>> Also, for the TLD rule: after a bit of thought I realized it would be very
>> unlikely a spammer would be doing this to a .gov URI, so I substituted .biz:
>>
>>   uri  __URI_TLD_MC  /\.(?!(?-i:com|net|org|biz|info))(?:com|net|org|biz|info)\b/i
...
>
> So far it's working well. Caught 4620 spams since Sunday morning with these
> mixed-case rules. I added this as a separate rule.
>
> /^(?!(?-i:[Hh]ttps?:\/\/www))https?:\/\/www/i
>
> Found some cases where the HTTP was lower case but the WWW was mixed.
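
For anyone who wants to see what the three patterns quoted above hit before
deploying them, here is a rough sketch in Python, whose re module (3.6 and
later) accepts the same scoped (?-i:...) syntax. The sample URIs and the
rule name URI_WWW_MC are made up for illustration, and the sketch tests one
string at a time, which only approximates how SpamAssassin applies uri rules.

```python
import re

# Patterns copied from the quoted rules. Python needs no backslash before "/".
# URI_WWW_MC is a hypothetical name; the thread does not name that rule.
RULES = {
    "URI_PROTO_MC": re.compile(r"^(?!(?-i:[Hh]ttps?:))https?:", re.I),
    "URI_WWW_MC": re.compile(r"^(?!(?-i:[Hh]ttps?://www))https?://www", re.I),
    "__URI_TLD_MC": re.compile(
        r"\.(?!(?-i:com|net|org|biz|info))(?:com|net|org|biz|info)\b", re.I
    ),
}


def hits(uri):
    """Return the names of the rules whose pattern matches this URI."""
    return [name for name, pattern in RULES.items() if pattern.search(uri)]


if __name__ == "__main__":
    # Illustrative URIs only, not taken from real spam.
    for uri in (
        "http://www.example.com/",   # ordinary lowercase
        "Http://example.com/",       # capitalised scheme is allowed
        "hTTp://example.com/",       # mixed-case scheme
        "http://WwW.example.com/",   # lowercase scheme, mixed-case www
        "http://example.CoM/",       # mixed-case TLD
        "HTTP://EXAMPLE.COM/",       # all uppercase
    ):
        print(uri, hits(uri))
```

Note that the case-sensitive lookahead only exempts the exact spellings
listed inside it, so an all-uppercase URI hits these rules just as a
mixed-case one does.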

Can you really make scoring decisions based on a mixed-case URI? Do
you have it as part of a meta with the other rules that John provided?

I'm looking at John's sandbox entries and wondering whether there is a
rule to be made from the URIs he's created, or whether you are just
probing to see if they get tagged at this point.

Thanks,
Alex
