Re: Re[2]: [sniffer] Charset

2004-08-20 Thread Scott Fisher
We don't want any violent Mad Scientists!

<<< [EMAIL PROTECTED]  8/20 11:59a >>>
On Friday, August 20, 2004, 11:20:44 AM, Vivek wrote:


VK> On Aug 20, 2004, at 10:36 AM, Jorge Asch wrote:

>> Well, since 100% of my users speak English/Spanish I can safely bet
>> that NONE of my mail should have strange character sets. So I can 
>> assume if they do, they must be spam.

VK> Be careful about that.  I've gotten pure English email from folks in
VK> various parts of the world whose default character set was other than
VK> one I'd expect.  Charset != Language.

Along these lines, I saw spam today that was in English but used one
of the character sets that were recently blocked by request (Only
locally - no such thing will happen in the core system so nobody has
to worry).

I violently agree - blocking on character sets can be dangerous, so if
you request that these rules be added, be sure to watch for unexpected
false positives afterward. ;-)

_M




This E-Mail came from the Message Sniffer mailing list. For information and 
(un)subscription instructions go to 
http://www.sortmonster.com/MessageSniffer/Help/Help.html





Re[6]: [sniffer] Charset

2004-08-20 Thread Pete McNeil
On Friday, August 20, 2004, 12:01:31 PM, Scott wrote:

SF> -Mad,

SF> How well set up is Message Sniffer to determine whether an e-mail in a
SF> foreign language is spam and then code for it?
SF> I dutifully submit my Spanish spam to the spam at sortmonster.com address.
SF> It's a very, very small percentage of my overall spam, but it consistently
SF> lands in my battleground grey-weight ranges.

SF> I only ask, because I have seen the amount of non-English spam trending
SF> upwards. I've noticed spam here in Russian, German, Spanish, Korean,
SF> Portuguese and Chinese.

So far, so good.

Most of the time we are able to recognize and tag appropriate elements
in these messages and create appropriate rules. Sometimes this
requires a bit of interpretation... when we feel we have a problem
with something we reach for Babel Fish or use some of our internal
abilities (Gonzo does pretty well with German, we all can take a stab
at Spanish from time to time...)

Most spam takes a similar form no matter what language - so we can
most frequently get by with architectural features and other research
tools we have. (Our robots add a lot of data to SPHUD and often grab
critical elements of spam on their way in...)

Hope this helps,
_M






Re[2]: [sniffer] Charset

2004-08-20 Thread Pete McNeil
On Friday, August 20, 2004, 11:20:44 AM, Vivek wrote:


VK> On Aug 20, 2004, at 10:36 AM, Jorge Asch wrote:

>> Well, since 100% of my users speak English/Spanish I can safely bet
>> that NONE of my mail should have strange character sets. So I can 
>> assume if they do, they must be spam.

VK> Be careful about that.  I've gotten pure English email from folks in
VK> various parts of the world whose default character set was other than
VK> one I'd expect.  Charset != Language.

Along these lines, I saw spam today that was in English but used one
of the character sets that were recently blocked by request (Only
locally - no such thing will happen in the core system so nobody has
to worry).

I violently agree - blocking on character sets can be dangerous, so if
you request that these rules be added, be sure to watch for unexpected
false positives afterward. ;-)

_M






Re: [sniffer] Charset

2004-08-20 Thread Vivek Khera
On Aug 20, 2004, at 11:53 AM, Scott Fisher wrote:
> Language-based spam filtering is a tough nut.

There are some very good language classifiers out there.  SpamAssassin 
uses one which seems to be incredibly accurate given enough text.
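
As a rough illustration of what a language classifier of this kind does, here is a minimal sketch in Python. It is not SpamAssassin's classifier: the character-trigram comparison is just one common technique chosen for illustration, and the names `SAMPLES`, `trigram_profile`, `similarity` and `guess_language`, as well as the tiny training sentences, are hypothetical. With so little training text it only hints at the idea; accuracy depends on having enough text, as noted above.

```python
from collections import Counter

def trigram_profile(text: str) -> Counter:
    """Count overlapping character trigrams in lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sum(c * c for c in a.values()) ** 0.5
    norm_b = sum(c * c for c in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Tiny made-up training samples (assumption); a real classifier needs far more text.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and this is "
               "the way that we write with these words",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso y esta "
               "es la manera en que escribimos con estas palabras",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}

def guess_language(text: str) -> str:
    """Return the sample language whose trigram profile is closest to the text."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: similarity(profile, PROFILES[lang]))

print(guess_language("this is the invoice that you asked for"))
print(guess_language("esta es la factura que usted pidió"))
```

Note that the guess is based on the message text itself, not on the declared charset, which is the distinction this thread keeps coming back to.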





Re: Re[4]: [sniffer] Charset

2004-08-20 Thread Scott Fisher
-Mad,

How well set up is Message Sniffer to determine whether an e-mail in a
foreign language is spam and then code for it?
I dutifully submit my Spanish spam to the spam at sortmonster.com address.
It's a very, very small percentage of my overall spam, but it consistently
lands in my battleground grey-weight ranges.

I only ask, because I have seen the amount of non-English spam trending
upwards. I've noticed spam here in Russian, German, Spanish, Korean,
Portuguese and Chinese.

- Original Message - 
From: "Pete McNeil" <[EMAIL PROTECTED]>
To: "Michiel Prins" <[EMAIL PROTECTED]>
Sent: Friday, August 20, 2004 7:04 AM
Subject: Re[4]: [sniffer] Charset


> On Friday, August 20, 2004, 2:35:35 AM, Michiel wrote:
>
> MP> Pete, even your message had a charset header:
>
> MP> Content-Type: text/plain; charset=us-ascii
>
> Yes, a tricky gadget indeed.
>
> MP> I think you'll generate more FP's if you do something like that than FN's
> MP> you might have now. Aren't there spamassassin config files that detect this
> MP> spam?
>
> Just to be clear - we're not precisely talking about spam per-se.
> Rather we're talking about stating that all traffic on a particular
> system should be only in one language as a matter of policy...
>
> The distinction is small I suppose, but in my mind important. In
> filtering spam we're usually trying to target only messages that are
> unsolicited commercial email, pornography, or somehow harmful... With
> this other approach instead of trying to defeat what we don't want, we
> are trying to only accept what we do want... Not so much putting up
> blocks, more like putting up a huge block and punching holes.
>
> There are some SA filters that do this kind of thing...
> Ultimately I think it boils down to filtering out anything with a
> charset that is not wanted.
>
> If we achieve this by attrition (rather than attempting to capture all
> of the charsets at once) then we will achieve a strong result quickly
> at a relatively low cost and we might avoid potential false positives
> that are out there.
>
> MHO,
> _M
>
>
>
>
>
>





Re: [sniffer] Charset

2004-08-20 Thread Scott Fisher
A troublesome one for me was Chinese, the GB2312 character set. I started
weighting based on charset=GB2312 and started noticing legitimate e-mail in
English from users/computers in China using the GB2312 character set. The
characters a-z and A-Z are encoded the same in GB2312 as in US-ASCII, so just
because a message uses the character set doesn't mean it is in that language.

I also get some Spanish spam. So I thought I'd add some weight on the ñ
character. Soon I started getting false hits on el niño, piñata, señor. So
that went out the door too.

Language-based spam filtering is a tough nut.
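
Both false-positive cases above are easy to reproduce. The following is a minimal Python sketch, assuming the standard library `gb2312` codec; the rule function `contains_enye` and the sample strings are hypothetical and only stand in for a naive weighting rule.

```python
# Sketch only: shows why a charset label or a single character
# is a weak signal for the language of a message.

english = "Please find the attached invoice."

# ASCII letters keep the same single-byte values in GB2312, so plain
# English text is byte-for-byte identical under either charset label.
same_bytes = english.encode("gb2312") == english.encode("ascii")

def contains_enye(text: str) -> bool:
    """Hypothetical naive rule: flag any text containing the letter ñ."""
    return "ñ" in text.lower()

# Words that turn up in ordinary English mail still trip the rule.
candidates = ["el niño", "piñata", "señor", "hello"]
false_hits = [word for word in candidates if contains_enye(word)]

print(same_bytes)
print(false_hits)
```

In other words, a message labeled charset=GB2312 may contain nothing but ASCII bytes, and a rule keyed on ñ cannot tell Spanish spam from an English message about El Niño.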

- Original Message - 

From: "Jorge Asch" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, August 20, 2004 9:36 AM
Subject: Re: [sniffer] Charset


>
> >Just to be clear - we're not precisely talking about spam per-se.
> >Rather we're talking about stating that all traffic on a particular
> >system should be only in one language as a matter of policy...
> >
> >
> Well, since 100% of my users speak English/Spanish I can safely bet that
> NONE of my mail should have strange character sets. So I can assume if
> they do, they must be spam.
>
> It's just a matter of demographics, and I am sure such a rule would not
> apply to all other customers. But for some of them, it would... (foreign
> spam messages seem to have increased ten-fold over the last couple of
> months).
>
>
> -- 
> Jorge Asch Revilla
> CONEXION DCR
> www.conexion.co.cr
> 800-CONEXION
>
>
>
>
>
>





Re: [sniffer] Charset

2004-08-20 Thread Vivek Khera
On Aug 20, 2004, at 10:36 AM, Jorge Asch wrote:
> Well, since 100% of my users speak English/Spanish I can safely bet 
> that NONE of my mail should have strange character sets. So I can 
> assume if they do, they must be spam.

Be careful about that.  I've gotten pure English email from folks in 
various parts of the world whose default character set was other than 
one I'd expect.  Charset != Language.





Re: [sniffer] Charset

2004-08-20 Thread Jorge Asch

> Just to be clear - we're not precisely talking about spam per-se.
> Rather we're talking about stating that all traffic on a particular
> system should be only in one language as a matter of policy...
 

Well, since 100% of my users speak English/Spanish I can safely bet that 
NONE of my mail should have strange character sets. So I can assume if 
they do, they must be spam.

It's just a matter of demographics, and I am sure such a rule would not 
apply to all other customers. But for some of them, it would... (foreign 
spam messages seem to have increased ten-fold over the last couple of 
months).

--
Jorge Asch Revilla
CONEXION DCR
www.conexion.co.cr
800-CONEXION



Re[4]: [sniffer] Charset

2004-08-20 Thread Pete McNeil
On Friday, August 20, 2004, 2:35:35 AM, Michiel wrote:

MP> Pete, even your message had a charset header:

MP> Content-Type: text/plain; charset=us-ascii

Yes, a tricky gadget indeed.

MP> I think you'll generate more FP's if you do something like that than FN's
MP> you might have now. Aren't there spamassassin config files that detect this
MP> spam?

Just to be clear - we're not precisely talking about spam per-se.
Rather we're talking about stating that all traffic on a particular
system should be only in one language as a matter of policy...

The distinction is small I suppose, but in my mind important. In
filtering spam we're usually trying to target only messages that are
unsolicited commercial email, pornography, or somehow harmful... With
this other approach instead of trying to defeat what we don't want, we
are trying to only accept what we do want... Not so much putting up
blocks, more like putting up a huge block and punching holes.

There are some SA filters that do this kind of thing...
Ultimately I think it boils down to filtering out anything with a
charset that is not wanted.

If we achieve this by attrition (rather than attempting to capture all
of the charsets at once) then we will achieve a strong result quickly
at a relatively low cost and we might avoid potential false positives
that are out there.

MHO,
_M
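
The "huge block with holes punched in it" policy described in this message could be sketched roughly as below. This is only an illustration in Python using the standard `email` package, not how Message Sniffer rules are actually written; the allow-list contents, the function name `unwanted_charsets` and the sample message are assumptions. The attrition approach would be the inverse: a block-list that starts empty and grows one charset at a time.

```python
from email import message_from_string

# Assumed policy for illustration: only these declared charsets are accepted.
ALLOWED_CHARSETS = {"us-ascii", "iso-8859-1", "utf-8"}

def unwanted_charsets(raw_message: str) -> set:
    """Return the declared charsets in a message that are not on the allow-list."""
    msg = message_from_string(raw_message)
    found = set()
    for part in msg.walk():
        # get_content_charset() returns the lowercased charset parameter, or None.
        charset = part.get_content_charset()
        if charset is not None and charset not in ALLOWED_CHARSETS:
            found.add(charset)
    return found

sample = (
    "From: someone@example.com\n"
    "Subject: test\n"
    "Content-Type: text/plain; charset=GB2312\n"
    "\n"
    "Hello, this is plain English text.\n"
)
print(unwanted_charsets(sample))
```

The sample is flagged even though its body is plain English, which is exactly the kind of false positive to watch for after enabling such a rule.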



