Re: [sniffer] Charset

2004-08-20 Thread Vivek Khera
On Aug 20, 2004, at 11:53 AM, Scott Fisher wrote:
> Language-based spam filtering is a tough nut.

There are some very good language classifiers out there.  SpamAssassin 
uses one which seems to be incredibly accurate given enough text.





Re: [sniffer] Charset

2004-08-20 Thread Scott Fisher
A troublesome one for me was Chinese, the GB2312 character set. I started
weighting based on charset=GB2312 and then noticed legitimate e-mail in
English from users/computers in China using the GB2312 character set. The
characters a-z and A-Z are encoded the same way in GB2312 as in ASCII. So
just because a message uses the character set doesn't mean it is in that
language.
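
A small sketch of the point above, using Python's built-in gb2312 codec. The
sample strings and the helper name has_high_bit are made up for illustration;
this is not how Message Sniffer works internally.

```python
# Sketch: ASCII letters are byte-identical under GB2312, so the declared
# charset alone cannot reveal the language of the text.
english = "Please find the invoice attached."  # illustrative sample
chinese = "你好"                                # illustrative sample

# English text encodes to exactly the same bytes under both codecs.
assert english.encode("gb2312") == english.encode("ascii")


def has_high_bit(data: bytes) -> bool:
    """Return True if any byte has the high bit set (value >= 0x80)."""
    return any(b >= 0x80 for b in data)


print(has_high_bit(english.encode("gb2312")))  # False
print(has_high_bit(chinese.encode("gb2312")))  # True
```

Only the second message actually contains Chinese, even though both could
arrive labelled charset=GB2312.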

I also get some Spanish spam. So I thought I'd add some weight on the ñ
character. Soon I started getting false hits on el niño, piñata, señor. So
that went out the door too.

Language-based spam filtering is a tough nut.

- Original Message - 

From: "Jorge Asch" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, August 20, 2004 9:36 AM
Subject: Re: [sniffer] Charset


>
> >Just to be clear - we're not precisely talking about spam per-se.
> >Rather we're talking about stating that all traffic on a particular
> >system should be only in one language as a matter of policy...
> >
> >
> Well, since 100% of my users speak English/Spanish I can safely bet that
> NONE of my mail should have strange character sets. So I can assume if
> they do, they must be spam.
>
> It's just a matter of demographics, and I am sure such a rule would not
> apply to all other customers. But for some of them, it would... (foreign
> spam messages seem to have increased ten-fold over the last couple of
> months).
>
>
> -- 
> Jorge Asch Revilla
> CONEXION DCR
> www.conexion.co.cr
> 800-CONEXION
>
> This E-Mail came from the Message Sniffer mailing list. For information
> and (un)subscription instructions go to
> http://www.sortmonster.com/MessageSniffer/Help/Help.html





Re: [sniffer] Charset

2004-08-20 Thread Vivek Khera
On Aug 20, 2004, at 10:36 AM, Jorge Asch wrote:
> Well, since 100% of my users speak English/Spanish I can safely bet 
> that NONE of my mail should have strange character sets. So I can 
> assume if they do, they must be spam.

Be careful about that.  I've gotten pure English email from folks in 
various parts of the world whose default character set was other than 
one I'd expect.  Charset != Language.
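
One way to act on that caution, sketched with Python's standard email
package: look at the declared charset and at whether the body really
contains any non-ASCII bytes, and only treat the charset as a signal when
both agree. The function name charset_and_content and the sample message are
assumptions for illustration, not part of any mail server or of Message
Sniffer.

```python
from email import message_from_bytes


def charset_and_content(raw: bytes):
    """Return (declared charset, True if the body holds any non-ASCII byte)."""
    msg = message_from_bytes(raw)
    charset = msg.get_content_charset()  # lower-cased, or None if absent
    # get_payload(decode=True) returns None for multipart containers.
    body = msg.get_payload(decode=True) or b""
    return charset, any(b >= 0x80 for b in body)


# Illustrative message: labelled GB2312, but written in plain English.
raw = (
    b"Content-Type: text/plain; charset=GB2312\r\n"
    b"\r\n"
    b"Hello from Shanghai, the shipment left today.\r\n"
)
print(charset_and_content(raw))  # ('gb2312', False)
```

A rule keyed on the first value alone would have flagged this message; the
second value shows there is nothing but ASCII in it.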





Re: [sniffer] Charset

2004-08-20 Thread Jorge Asch

> Just to be clear - we're not precisely talking about spam per-se.
> Rather we're talking about stating that all traffic on a particular
> system should be only in one language as a matter of policy...

Well, since 100% of my users speak English/Spanish I can safely bet that 
NONE of my mail should have strange character sets. So I can assume if 
they do, they must be spam.

It's just a matter of demographics, and I am sure such a rule would not 
apply to all other customers. But for some of them, it would... (foreign 
spam messages seem to have increased ten-fold over the last couple of 
months).

--
Jorge Asch Revilla
CONEXION DCR
www.conexion.co.cr
800-CONEXION



Re: [sniffer] Charset

2004-08-19 Thread Jorge Asch

Well,... If you really wanted to do it then it could be done.
Create a set of rules that look for any of the most common spanish
words - especially any that use high-bit characters. With enough of
these it should be broad enough to catch most... The trick is to
include words that are also not common in normal conversation on the
local system.
 

Could a filter be created that will tag as spam any message 
containing NON-ascii characters? I mean, allow only CHRS 1 through 255.

I believe this will filter out all these foreign character sets, and let 
regular, plain old messages through...

Of course, such a rule will only apply for most of us in the western 
hemisphere...
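
A rough sketch of what such a byte-level check might look like. Note that
ASCII proper only covers values 0 through 127; in single-byte Western
charsets the accented letters (ñ, á, ü) live in 128 through 255, so a strict
ASCII-only rule would also hit Spanish mail. The function name, the sample
sentences and the idea of measuring a ratio instead of any single hit are
assumptions for illustration only.

```python
def non_ascii_ratio(data: bytes) -> float:
    """Fraction of bytes outside the 7-bit ASCII range (0-127)."""
    if not data:
        return 0.0
    return sum(b > 127 for b in data) / len(data)


# Illustrative samples, encoded the way such mail might arrive.
samples = {
    "english": "See you at the meeting tomorrow.".encode("latin-1"),
    "spanish": "El señor trae la piñata mañana.".encode("latin-1"),
    "russian": "Купите сейчас".encode("cp1251"),
}

for name, data in samples.items():
    print(name, round(non_ascii_ratio(data), 2))
```

The English sample scores zero, the Spanish one a small fraction, and the
Cyrillic one close to 1, so a threshold on the ratio would behave differently
from a rule that fires on any high-bit character.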

--
Jorge Asch Revilla
CONEXION DCR
www.conexion.co.cr
800-CONEXION 




Re: [sniffer] Charset

2004-08-19 Thread Jorge Asch

We could then turn on or off the languages we didn't want.
From my foray into dealing with Chinese, it's certainly much easier said than done. Chinese was doable, but I've had no luck stopping my Spanish spam.
Then again, you might be better at it than I.
The problem with Spanish is that we use the same western character set as 
you do... so it makes it harder to detect...

--
Jorge Asch Revilla
CONEXION DCR
www.conexion.co.cr
800-CONEXION



Re: [sniffer] Charset

2004-08-19 Thread Jorge Asch
Michiel Prins wrote:
> Can't you use the content filter of your mail server to detect if the
> charset is used?

I've tried, but it's not 100% effective.
--
Jorge Asch Revilla
CONEXION DCR
www.conexion.co.cr
800-CONEXION



RE: [sniffer] Charset

2004-08-19 Thread Michiel Prins
Can't you use the content filter of your mail server to detect if the
charset is used? 


Kind regards,

ing. Michiel Prins
SOS Small Office Solutions / REJECT
Wannepad 27
1066 HW Amsterdam
tel. 020-4082627
fax. 020-4082628
[EMAIL PROTECTED]


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Jorge Asch
Sent: Thursday, 19 August 2004 15:16
To: [EMAIL PROTECTED]
Subject: [sniffer] Charset

I asked about this about a year ago, with no luck... Is there any way Message
Sniffer could be used to block certain messages, depending on their
charset (in the Content-Type header)?

For example, I would like to block all Windows-1251 (Cyrillic) messages from
my server. I know SpamAssassin has such a feature, but I would rather do it
with Message Sniffer.

Is such a thing possible now? How about in the future? I am getting
bombarded with messages in foreign languages, and Message Sniffer does
*not* detect them (and it seems forwarding them to [EMAIL PROTECTED] is
pointless, since they are still coming in... it seems that there's no easy
way to create a rulebase for them).
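
For reference, the header check being asked for can be sketched outside of
Message Sniffer with Python's standard email package. This says nothing
about Message Sniffer's own rule syntax; the names BLOCKED_CHARSETS,
declared_charsets and violates_policy, and both sample messages, are
hypothetical.

```python
from email import message_from_bytes

# Hypothetical policy list; Windows-1251 is the example from the question.
BLOCKED_CHARSETS = {"windows-1251"}


def declared_charsets(raw: bytes) -> set:
    """Collect the charset parameter of every MIME part (lower-cased)."""
    msg = message_from_bytes(raw)
    found = set()
    for part in msg.walk():  # walk() also yields the top-level message
        charset = part.get_content_charset()
        if charset:
            found.add(charset)
    return found


def violates_policy(raw: bytes) -> bool:
    """True if any part declares a charset on the blocked list."""
    return bool(declared_charsets(raw) & BLOCKED_CHARSETS)


# Illustrative messages: one Cyrillic, one plain ASCII.
cyrillic = (
    b'Content-Type: text/plain; charset="Windows-1251"\r\n'
    b"\r\n"
    b"\xcf\xf0\xe8\xe2\xe5\xf2\r\n"
)
plain = b"Content-Type: text/plain; charset=us-ascii\r\n\r\nHello\r\n"

print(violates_policy(cyrillic))  # True
print(violates_policy(plain))     # False
```

A check like this only sees what the sender declares, so it misses mail with
no charset parameter or with a misleading one.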

-- 
Jorge Asch Revilla
CONEXION DCR
www.conexion.co.cr
800-CONEXION 


