Re: A New Approach: Find the Ham
I agree that this isn't going to be the best approach. Detecting ham is simply more difficult:

1. New types of ham emerge more often than new types of spam. Spammers generally stick to tried-and-true subjects while ham is all over the place.
2. Ham is more personalized than spam. Everyone gets very similar spam, but nobody gets the same mix of ham messages that I get.
3. Ham has a much greater range of potential subjects and patterns than spam. For all the spam, nobody's doing anything creative like trying to sell fountain pens or beverage dispensers or books of poetry with spam - it's all fake Rolexes and cheap pharmaceuticals. Ham, on the other hand, has a million potential subjects and you get one-of-a-kind messages every day.
4. Spammers will have an easier time faking ham characteristics than removing spam characteristics, which may be endemic to their methods (spamming software, botnets, etc.)
5. Network effects are very helpful with spam (DNS blacklists, Razor, etc.) but not very helpful with ham.

Of course, ham rules are helpful - especially personalized ones. I use a bunch. But they're best used with the existing framework of spam detection.
Re: A New Approach: Find the Ham
Duncan, Michael,

Thank you for the careful thought and detailed input. Please read my Prototype Config email of yesterday afternoon. This is not what it appears to be: it is NOT a weighted ham-finding rules approach but rather a non-weighted, ham-tuned, spam-finding rules approach. It's unconventional and takes a little getting used to.

Thanks!
Dan

On Feb 12, 2007, at 0:59, michael moncur wrote:
> I agree that this isn't going to be the best approach. Detecting ham is simply more difficult: [...]
HTML mail (was Re: A New Approach: Find the Ham)
Tom Allison wrote: Personally, I think HTML email should be outright discarded from the start. If you look at this arguement presented by the OP then it reinforces the idea that most ascii is ham and most html is spam. Therefore, reject delivery of all html based email. Or to be more succinct -- reject any MIME type of alternative content or html only content. That would remove probably 90% of the spam in one shot. Speaking from an ISP perspective: I hate to break it to you, but most end users want some sort of formatted mail. The days of all email being ASCII-only are over, just as the days of all websites being text-only are over. Now, if you can come up with another markup language for formatting email... * That satisfies end users' wants without being vulnerable to the filter-evasion that HTML makes possible * And you can get all the major email clients to render it * And you can get all the major email clients to use it for formatted composition instead of HTML (so end users can still make their text blue and embed the latest cute image of kittens) * And you can get commercial email campaign software to use it instead of HTML (so organizations can include a company logo, or pictures of the items that they're promoting in this week's newsletter) ...*then* it'll be viable to discard HTML. Obviously, individuals and businesses handling their own mail can apply stricter rules. But it's not something that can be done (yet) on a large scale without disappointing a lot of people -- and not just the spammers. -- Kelson Vibber SpeedGate Communications www.speed.net
Re: HTML mail (was Re: A New Approach: Find the Ham)
On Monday 12 February 2007 13:27, Kelson wrote:
> Tom Allison wrote:
>> Personally, I think HTML email should be outright discarded from the start. If you look at this argument presented by the OP then it reinforces the idea that most ascii is ham and most html is spam. Therefore, reject delivery of all html based email. Or to be more succinct -- reject any MIME type of alternative content or html only content. That would remove probably 90% of the spam in one shot.
>
> Speaking from an ISP perspective: I hate to break it to you, but most end users want some sort of formatted mail. The days of all email being ASCII-only are over, just as the days of all websites being text-only are over.

With all due respect, that's 100% BS. MIME was invented to handle the non-ascii stuff, and does it very well except for M$, who couldn't follow a std rule with a loaded 44 magnum stuck in Bill's ear.

> Now, if you can come up with another markup language for formatting email...
>
> * That satisfies end users' wants without being vulnerable to the filter-evasion that HTML makes possible
> * And you can get all the major email clients to render it
> * And you can get all the major email clients to use it for formatted composition instead of HTML (so end users can still make their text blue and embed the latest cute image of kittens)
> * And you can get commercial email campaign software to use it instead of HTML (so organizations can include a company logo, or pictures of the items that they're promoting in this week's newsletter)
>
> ...*then* it'll be viable to discard HTML.

There is, it's the proper use of mimetypes.

> Obviously, individuals and businesses handling their own mail can apply stricter rules. But it's not something that can be done (yet) on a large scale without disappointing a lot of people -- and not just the spammers.

--
Cheers, Gene
"There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above message by Gene Heskett are: Copyright 2007 by Maurice Eugene Heskett, all rights reserved.
RE: HTML mail (was Re: A New Approach: Find the Ham)
Gene Heskett wrote: On Monday 12 February 2007 13:27, Kelson wrote: Now, if you can come up with another markup language for formatting email... [...] * And you can get all the major email clients to use it for formatted composition instead of HTML (so end users can still make their text blue and embed the latest cute image of kittens) [...] There is, its the proper use of mimetypes. I'm sorry, but I must have missed the class where they explained how MIME types can be used for text markup. Can you link me to a website explaining how to use MIME types to change font colors and display inline images?
Re: HTML mail (was Re: A New Approach: Find the Ham)
Gene Heskett wrote: With all due respect, that's 100% BS. MIME was invented to handle the non-ascii stuff, and does it very well except for M$, who couldn't follow a std rule with a loaded 44 magnum stuck in Bills ear. 100% BS? So end-users don't like formatting in their messages? Email is still all-ASCII? Websites are still all-text? Or are you responding to something else? There is, its the proper use of mimetypes. I'm not talking about the MIME structure, I'm talking about the formatted version of the message. Last I looked, MIME *by itself* didn't allow you to change fonts or colors, add bold or italics, create bulleted lists that flow properly, allow images to appear within a document instead of as a separate segment, etc. In other words, what can adequately replace text/html in the non-plaintext multipart/alternative section such that HTML becomes irrelevant for legitimate uses? Microsoft Word? PDF? RTF? Any of those would be worse, IMO. text/richtext might do the job, except Eudora is the only client I can think of that composes in it. -- Kelson Vibber SpeedGate Communications www.speed.net
Re: HTML mail (was Re: A New Approach: Find the Ham)
--On Monday, February 12, 2007 12:50 PM -0800 Kelson [EMAIL PROTECTED] wrote: In other words, what can adequately replace text/html in the non-plaintext multipart/alternative section such that HTML becomes irrelevant for legitimate uses? Microsoft Word? PDF? RTF? Any of those would be worse, IMO. text/richtext might do the job, except Eudora is the only client I can think of that composes in it. Mulberry does that and text/enriched. http://www.mulberrymail.com/ The author is currently preparing it for open-sourcing. I think all you need is an inline image markup and that format would then serve most needs. Of course, Word and Outlook would likely still generate messages 10x the size of equivalent text, and add additional non-standard undocumented markup to embrace and extend the basic text/enriched format and lock out competitors from interoperability.
Re: HTML mail (was Re: A New Approach: Find the Ham)
Kelson wrote:
> Tom Allison wrote:
>> Personally, I think HTML email should be outright discarded from the start. If you look at this argument presented by the OP then it reinforces the idea that most ascii is ham and most html is spam. Therefore, reject delivery of all html based email. Or to be more succinct -- reject any MIME type of alternative content or html only content. That would remove probably 90% of the spam in one shot.
>
> Speaking from an ISP perspective: I hate to break it to you, but most end users want some sort of formatted mail. The days of all email being ASCII-only are over, just as the days of all websites being text-only are over.

Speaking with my postmaster hat on, I agree: the days of ASCII-only are over, and I, as a postmaster, must allow the flow of "cooked" mail (HTML, Doc, graphics, etc.), and can neither force nor enforce raw text email.

Speaking with my "sender and recipient of email" user hat on: I couldn't give a rodent's posterior what other people do or don't want to put in their email. By the time I see it, it's plain text, and if that removes some essential content, that's the other person's problem for having made a poor choice in the format of their message. I take no responsibility for what they intended to send me.

Stepping back from both of those perspectives: I can't force other people to do any particular thing, and I don't want to. But they can't force me to do any particular thing either. I'm going to read plain text email, and sometimes look at attached images if I want to. I won't stop you from sending me html-only, but I won't read it, either.

(In fact, earlier today, someone at work sent me a "please answer the part in green" message, and I answered back "none of it was in green" ... probably because I filter out any non-plain-text components of the email ... still waiting for her to reply.)
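The "by the time I see it, it's plain text" setup described above can be sketched roughly as follows. This is only an illustration using Python's standard `email` package, not the poster's actual filter; the function name `plain_text_only` and the sample message are made up for the example.

```python
from email import message_from_string


def plain_text_only(raw_message: str) -> str:
    """Return only the text/plain body parts of a raw message.

    HTML alternatives, images and other attachments are silently dropped,
    which is exactly how formatting such as "the part in green" gets lost.
    """
    msg = message_from_string(raw_message)
    chunks = []
    for part in msg.walk():
        # Multipart containers have a multipart/* type and are skipped here.
        if part.get_content_type() != "text/plain":
            continue
        if part.get_content_disposition() == "attachment":
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        chunks.append(payload.decode(charset, errors="replace"))
    return "\n".join(chunks)


# Hypothetical multipart/alternative message, for illustration only.
SAMPLE = (
    "From: coworker@example.com\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b1"\n'
    "\n"
    "--b1\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Please answer the part in green.\n"
    "--b1\n"
    "Content-Type: text/html; charset=us-ascii\n"
    "\n"
    '<p>Please answer the <font color="green">part in green</font>.</p>\n'
    "--b1--\n"
)

if __name__ == "__main__":
    print(plain_text_only(SAMPLE))
```

The recipient still gets the words, but any meaning carried only by the markup is gone, which is the trade-off the poster accepts.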
Re: A New Approach: Find the Ham
On Sun, Feb 11, 2007 at 11:10:53PM -0500, Duncan Findlay wrote:
> I've read most of the e-mails on this topic and I think the underlying problem is that this method relies on knowing exactly which profiles (i.e. combinations of rules) valid ham can hit.

After re-reading your message with your prototype (there was one thing I missed before), I'd like to revise my criticism. Will you respond?

> I see a number of problems:
>
> - How do we actually generate the profiles that are to be considered ham? Does it just need to happen once in anyone's mass-check logs? Does it have to happen multiple times? How many rules will this generate?

This remains valid.

> - Won't spammers be able to craft their messages (possibly even by breaking more rules) to meet a (known) ham profile? Currently spammers can craft messages but they have to avoid the major rules. By allowing certain profiles (depending on your answer to the previous point) this will give spammers enough room to put some pretty obvious spam through.

This remains valid, though the "possibly even by breaking more rules" bit doesn't really make sense. My point is still that spammers can craft spam messages that will fit your ham profiles, and (depending on your response to my point above) this can probably let through some pretty spammy messages.

> - What happens when you add a new local rule? How would you figure out which combinations are valid ham profiles with the new rule?

Still valid.

> - Suppose we start seeing a new type of ham that hits two low scoring rules under the current system? How will your system deal with that? The way I understand it, these new ham messages would be considered spam.

Still valid.

> Interesting theory, but I don't think it'll work in practice.

Still valid.

Thanks,
--
Duncan Findlay
Re: A New Approach: Find the Ham
On Mon, Feb 12, 2007 at 11:00:06PM -0500, Duncan Findlay wrote:
> On Sun, Feb 11, 2007 at 11:10:53PM -0500, Duncan Findlay wrote:
>> I've read most of the e-mails on this topic and I think the underlying problem is that this method relies on knowing exactly which profiles (i.e. combinations of rules) valid ham can hit.
>
> After re-reading your message with your prototype (there was one thing I missed before), I'd like to revise my criticism. Will you respond?

That paragraph seems to have an unnecessarily adversarial tone to it. That was unintentional. I would like to know if I'm still missing something, or if you have a cool solution to these problems. :-)

--
Duncan Findlay
Re: A New Approach: Find the Ham
Giampaolo Tomassoni wrote: From: Miles Fidelman [mailto:[EMAIL PROTECTED] Dan wrote: I've developed a new approach to scoring that I want to 1) share with everyone and 2) make into a working system thats as accurate as what I've already built, but easier to use. First, the theory: NEW ASSUMPTION All messages are spam unless x,y,z score says they're ham. NEW APPROACH Block everything, then create rules to not catch what you do want. ie, build tests that target the spam (keeping all the tests you've already built), then score the thousands of ways ham triggers on those tests. It strikes me that the hardest part of this approach is filtering out too much ham. At least for me, it's more important to make sure that people reach me, than to filter out all spam. If we take the approach that everything is to be filtered out, except x,y,z - then the risk of filtering out too much seems pretty high. I definitely agree with you. By the way, if Dan really brought a new perspective to us (i.e.: a new way to detect ham), what would stop us in integrating it into SA? Nothing would stop you from integrating it into SA. For one, you could give every message a +5 just for existing. Now you've assumed all messages are spam, and you're going to require that the message characteristics lower the score below 5. The problem I see with this approach is that: spam, by its nature, all has characteristics in common that are already targeted: a) coming from common points of origin, such as spamhauses, open relays, etc. (countered with blacklisting) b) urging you to take certain actions, such as clicking on links, calling phone numbers, replying in order to opt-out, etc. (URIBLs, RE's and bayes) c) similar topics, such as medication, porn, stocks, etc. (RE's and bayes) d) mailers with similar bad behaviors, such as things which are easy to target via greet_pause, greylisting, nolisting, looking for format violations, etc. 
So, in the "finding the spam" approach, you're looking for these features as a means of trying to identify the message as spam. In order to develop a "find the ham" approach, you have to figure out: what are the characteristics of ham?

e) Does it come from common points of origin? No. It can, and in my experience does, come from anywhere.
f) Does it urge you to take certain actions? Not generally.
g) Does it all have similar topics? For my mailing lists, sure... but rarely do my gf and mother talk about the same topic...

Trying to narrow ham down to a range of sources, actions, and topics seems to be MUCH more difficult than trying to do the same for spam. About the only thing you can do that sets ham apart from spam in these lists is (d): you could have a set (h) which says "if it comes from an RFC compliant source, we'll mark it as being slightly more ham-like". At which point, all of the spammers will get more RFC compliant. That still leaves the problem that (e)-(g) are nowhere near as identifiable as targets as (a)-(c) are.

(That said: I'm not saying don't try -- do try ... I would love to be proven wrong, as long as the solution doesn't involve something as bad for the internet as challenge-response type systems are.)
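The "give every message a +5 just for existing" idea from this message can be sketched as a toy scorer. Everything here is an assumption made for illustration: the rule names and scores are hypothetical, and this is plain Python, not SpamAssassin configuration or its real scoring code.

```python
from typing import Dict, Iterable

SPAM_THRESHOLD = 5.0
BASELINE = 5.0  # every message starts at the threshold: spam until proven ham

# Hypothetical rule names and scores, for illustration only.
RULE_SCORES: Dict[str, float] = {
    "SENDER_WHITELISTED": -4.0,  # ham evidence lowers the score
    "BAYES_LOW": -2.0,
    "URIBL_HIT": 3.0,            # spam evidence raises it
    "BAYES_HIGH": 3.5,
}


def score(hits: Iterable[str]) -> float:
    """Sum the baseline plus the score of every rule the message hit."""
    return BASELINE + sum(RULE_SCORES.get(name, 0.0) for name in hits)


def is_spam(hits: Iterable[str]) -> bool:
    """A message is ham only if its characteristics pull it below the threshold."""
    return score(hits) >= SPAM_THRESHOLD


if __name__ == "__main__":
    print(is_spam([]))                          # no evidence at all
    print(is_spam(["BAYES_LOW"]))               # one ham indicator
    print(is_spam(["BAYES_LOW", "URIBL_HIT"]))  # ham indicator outweighed
```

Under this sketch a message that hits no rules at all is treated as spam, which makes the difficulty raised above concrete: ham has to hit some identifiable, hard-to-forge negative rule before it gets through.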
Re: A New Approach: Find the Ham
On Saturday 10 February 2007, Dan wrote:
> On Feb 10, 2007, at 14:38, Mathieu Bouchard wrote:
>> How do you ever find FPs if you have so many TP to sort through that it's not even worth sorting through FP+TP to find the FP ? IMHO, that'd be why we assume that mails are ham rather than assume that they are spam.
>
> I haven't found FP reviewing to be a big deal. In my latest SA based configuration, for example...

Whoa... Which side of the fence are you on? How can you cite your current configuration of SA as any kind of indication of how hard it would be to find FPs in a totally reversed situation?

--
John Andersen
Re: A New Approach: Find the Ham
Long-time SpamAssassin users with a good memory might recall that back in SpamAssassin 2.4x, we included quite a few ham-targeting rules, such as "was this sent using User-Agent: Mozilla?", "is this formatted like a reply to a previous message?", "does it include headers from a mailing list?" and "is it formatted like a PGP-signed message?"

Pretty soon, spammers simply adopted _all_ of those attributes, sending spam containing "User-Agent: mozilla", In-Reply-To headers, formatted like PGP-signed reply messages ;)

If you give spammers a way to get negative points easily, they'll attack it. It's simply unsafe to assume they won't. A published ruleset that does this based on forgeable attributes will be quickly attacked (again).

Having said that, rules that are *unforgeable* are entirely safe to use, and we include those -- namely whitelist_from_rcvd/spf/dk/dkim, and the locally-trained Bayes tests (which spammers have a much harder time guessing). Also, writing your own local ham-spotting rules is generally safe, as long as you don't publish them where spammers can find out about them.

--j.

Nigel Frankcom writes:
> On Sat, 10 Feb 2007 15:14:56 -0500, Miles Fidelman [EMAIL PROTECTED] wrote:
>> Dan wrote:
>>> I've developed a new approach to scoring that I want to 1) share with everyone and 2) make into a working system thats as accurate as what I've already built, but easier to use. First, the theory:
>>>
>>> NEW ASSUMPTION All messages are spam unless x,y,z score says they're ham.
>>> NEW APPROACH Block everything, then create rules to not catch what you do want. ie, build tests that target the spam (keeping all the tests you've already built), then score the thousands of ways ham triggers on those tests.
>>
>> It strikes me that the hardest part of this approach is filtering out too much ham. At least for me, it's more important to make sure that people reach me, than to filter out all spam. If we take the approach that everything is to be filtered out, except x,y,z - then the risk of filtering out too much seems pretty high.
>
> These are my local stats... I'd far rather those numbers were the other way round. Even if Dan is wrong, at least he's thinking.
>
> http://www.blue-canoe.com/stats/index.php?D1=11
>
> What do Theo, Matt & Co have to say? They've been doing this a lot longer than us.
>
> Kind regards
Re: A New Approach: Find the Ham
On Feb 10, 2007, at 3:19 PM, Giampaolo Tomassoni wrote:
> From: Tom Allison [mailto:[EMAIL PROTECTED]
>> Personally, I think HTML email should be outright discarded from the start. If you look at this argument presented by the OP then it reinforces the idea that most ascii is ham and most html is spam. Therefore, reject delivery of all html based email. Or to be more succinct -- reject any MIME type of alternative content or html only content. That would remove probably 90% of the spam in one shot.
>
> Sending text/ascii e-mails may probably fit your habits and the ones from your contacts, but it would result in trashing a lot of ham on larger userbases.
>
> Giampaolo

I am clearly thinking in more revolutionary terms than what email has been doing over the last decade of trying to accommodate every Tom, Dick and Harry that comes along with a wish list.
RE: A New Approach: Find the Ham
From: tom [mailto:[EMAIL PROTECTED]
> On Feb 10, 2007, at 3:19 PM, Giampaolo Tomassoni wrote:
>> From: Tom Allison [mailto:[EMAIL PROTECTED]
>>> Personally, I think HTML email should be outright discarded from the start. If you look at this argument presented by the OP then it reinforces the idea that most ascii is ham and most html is spam. Therefore, reject delivery of all html based email. Or to be more succinct -- reject any MIME type of alternative content or html only content. That would remove probably 90% of the spam in one shot.
>>
>> Sending text/ascii e-mails may probably fit your habits and the ones from your contacts, but it would result in thrashing a lot of ham on larger userbases.
>>
>> Giampaolo
>
> I am clearly thinking in more revolutionary terms then what email has been doing over the last decade of trying to accommodate every Tom Dick and Harry that comes along with a wish list.

Well, I don't know: I don't dislike the fact that e-mail messages may be vectors of HTML content. The problem is not what you bring to a destination, it is the how.

The problem is that the whole set of RFCs governing electronic mail fails to definitively prevent the use of fake addresses and the complete anonymity of the sender to the occasional recipient. This, combined with the very low cost at which one can send spam, results in a lot of spam. If the identity of the sender could be really trusted, I believe that it would be a lot easier to control spam and, eventually, get rid of it.

There are RFCs for message signing and the like, but they are basically optional operations, not mandatory. I may have a pessimistic view of the world, but as long as there is a business around spam, it will be very difficult to impose a new, sender-identity-concerned, mandatory standard on electronic mail: there are too many economic interests around it. From ISPs to computer resellers, everybody gains something from it.

Giampaolo
Re: A New Approach: Find the Ham
On Sat, Feb 10, 2007 at 08:22:41PM, Nigel Frankcom wrote:
> What do Theo, Matt & Co have to say? They've been doing this a lot longer than us.

Unless I'm missing something, this approach is the standard "block everything except for what we explicitly want to receive". Which is great, if you can define "what we want to receive" in a way that can't be forged. By and large, that means whitelist_from_*. Then everything that isn't whitelisted gets blocked. If that's what you want to do, that's fine.

The main issue is that it's very likely that not all of the mails that you would want to receive are whitelisted. If you don't care, then you're done. If you do care ... then you can't block the mails, and need to accept them to figure out if you actually want to receive them. How do you deal with that? Since you've already gotten rid of the mails that you know you want, you need to filter the rest so that you get rid of the stuff you don't want (ie: spam).

In the end, this is the methodology described on this list for years.

--
Randomly Selected Tagline: "Kluge.net belongs to Theo, my ex-roommate from Worcester, who I can say with some measure of admiration, is insane." - Alan Caulkins, http://www.maxint.net/~fatman/
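The two-stage flow described here (whitelist first, then ordinary spam filtering on whatever is left) might look like the sketch below. It is a simplified model only: `classify`, the addresses, the rule names and the scores are all hypothetical, and the sender is assumed to have been verified already by some unforgeable means, which is the hard part that the whitelist_from_* family exists to handle.

```python
from typing import Dict, Iterable, Set

# Hypothetical data, for illustration only.
WHITELIST: Set[str] = {"mom@example.org"}
SCORES: Dict[str, float] = {"RULE_PILLS": 4.0, "RULE_URIBL": 3.0}


def classify(sender: str, hits: Iterable[str],
             whitelist: Set[str] = WHITELIST,
             scores: Dict[str, float] = SCORES,
             threshold: float = 5.0) -> str:
    """Return "ham" or "spam" for one message.

    `sender` is assumed to be an already-verified identity; a bare,
    forgeable From: header would defeat the whole point of stage 1.
    """
    # Stage 1: mail we explicitly know we want is accepted outright.
    if sender in whitelist:
        return "ham"
    # Stage 2: everything else still has to go through spam scoring,
    # because wanted mail from unknown senders cannot simply be blocked.
    total = sum(scores.get(name, 0.0) for name in hits)
    return "spam" if total >= threshold else "ham"


if __name__ == "__main__":
    print(classify("mom@example.org", ["RULE_PILLS", "RULE_URIBL"]))
    print(classify("stranger@example.net", ["RULE_PILLS", "RULE_URIBL"]))
    print(classify("stranger@example.net", ["RULE_URIBL"]))
```

Stage 2 is ordinary spam scoring, which is the point of the message: once you care about unwhitelisted ham, you are back to the existing methodology.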
RE: A New Approach: Find the Ham
Apologies if this has been answered before or anything, but where/how are you generating those stats? I'm not using SA with SQL so I'm not sure if it will work for me, but those I like! Stats in question: http://www.blue-canoe.com/stats/index.php?D1=11 Kind Regards, Philip Seccombe Turnstone Technologies NZ Limited Phone: +64 9 970 5550 Fax: +64 9 970 5559 DDI: +64 9 970 5552 Email: [EMAIL PROTECTED] Web: www.turnstone.co.nz -Original Message- From: Nigel Frankcom [mailto:[EMAIL PROTECTED] Sent: Sunday, 11 February 2007 9:23 a.m. To: Miles Fidelman Cc: SpamAssassin Users Subject: Re: A New Approach: Find the Ham On Sat, 10 Feb 2007 15:14:56 -0500, Miles Fidelman [EMAIL PROTECTED] wrote: Dan wrote: I've developed a new approach to scoring that I want to 1) share with everyone and 2) make into a working system thats as accurate as what I've already built, but easier to use. First, the theory: NEW ASSUMPTION All messages are spam unless x,y,z score says they're ham. NEW APPROACH Block everything, then create rules to not catch what you do want. ie, build tests that target the spam (keeping all the tests you've already built), then score the thousands of ways ham triggers on those tests. It strikes me that the hardest part of this approach is filtering out too much ham. At least for me, it's more important to make sure that people reach me, than to filter out all spam. If we take the approach that everything is to be filtered out, except x,y,z - then the risk of filtering out too much seems pretty high. These are my local stats... I'd far rather those numbers were the other way round. Even if Dan is wrong, at least he's thinking. http://www.blue-canoe.com/stats/index.php?D1=11 What do Theo, Matt Co have to say? They've been doing this a lot longer than us. Kind regards
Re: A New Approach: Find the Ham
On 10 Feb 2007 at 11:43, Dan wrote: I've developed a new approach to scoring that I want to 1) share with everyone and 2) make into a working system thats as accurate as what I've already built, but easier to use. First, the theory: [...] NEW SITUATION Ham is now the tiniest minority of all email. NEW ASSUMPTION All messages are spam unless x,y,z score says they're ham. NEW APPROACH Block everything, then create rules to not catch what you do want. ie, build tests that target the spam (keeping all the tests you've already built), then score the thousands of ways ham triggers on those tests. NEW RESULT Spend less time and energy while catching more of what you do want and less of what you don't. CHALLENGE All filtering software is written to score for results that equal spam - catch the bad SOLUTION Make filtering software score for results that equal ham - uncatch the good. Your thoughts? Dan The science fiction periodical ANALOG had a story based exactly on this premise. I think the story ran in 2005. In the story, just about everything is connected to a public interface so therefore everything is subject to getting spam'ed - and worse.
Re: A New Approach: Find the Ham
Hey Dan,

I've read most of the e-mails on this topic and I think the underlying problem is that this method relies on knowing exactly which profiles (i.e. combinations of rules) valid ham can hit.

I see a number of problems:

- How do we actually generate the profiles that are to be considered ham? Does it just need to happen once in anyone's mass-check logs? Does it have to happen multiple times? How many rules will this generate?
- Won't spammers be able to craft their messages (possibly even by breaking more rules) to meet a (known) ham profile? Currently spammers can craft messages but they have to avoid the major rules. By allowing certain profiles (depending on your answer to the previous point) this will give spammers enough room to put some pretty obvious spam through.
- What happens when you add a new local rule? How would you figure out which combinations are valid ham profiles with the new rule?
- Suppose we start seeing a new type of ham that hits two low scoring rules under the current system? How will your system deal with that? The way I understand it, these new ham messages would be considered spam.

Interesting theory, but I don't think it'll work in practice.

--
Duncan Findlay
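The objections above are easier to see with a toy model of "ham profiles". The sketch assumes a profile is simply the exact set of rules a known ham message hit; that reading, the function names and the rule names (`RULE_A`, `RULE_B`, `NEW_LOCAL_RULE`) are assumptions for illustration, not Dan's actual prototype.

```python
from typing import FrozenSet, Iterable, Set

Profile = FrozenSet[str]


def learn_profiles(ham_logs: Iterable[Iterable[str]]) -> Set[Profile]:
    """Collect every distinct rule combination seen on known ham.

    Here a single sighting is enough, which is one possible answer to
    "does it just need to happen once in anyone's mass-check logs?".
    """
    return {frozenset(hits) for hits in ham_logs}


def is_ham(hits: Iterable[str], profiles: Set[Profile]) -> bool:
    """A message passes only if its exact rule combination is a known profile."""
    return frozenset(hits) in profiles


if __name__ == "__main__":
    # Hypothetical rule hits taken from a ham corpus.
    profiles = learn_profiles([[], ["RULE_A"], ["RULE_A", "RULE_B"]])

    print(is_ham(["RULE_B", "RULE_A"], profiles))          # known combination
    print(is_ham(["RULE_B"], profiles))                    # new kind of ham
    print(is_ham(["RULE_A", "NEW_LOCAL_RULE"], profiles))  # after adding a rule
```

In this model any ham that hits an unseen combination, including one created merely by adding a new local rule, is classified as spam until the profiles are regenerated. Conversely, a spammer who learns a published profile only has to reproduce that combination.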
RE: A New Approach: Find the Ham
From: Dan [mailto:[EMAIL PROTECTED] I've developed a new approach to scoring that I want to 1) share with everyone and 2) make into a working system thats as accurate as what I've already built, but easier to use. First, the theory: SITUATION In the beginning, all email was ham. When spam came along, we left the ham alone and targeted the annoyance (spam). ASSUMPTION All messages are ham unless x,y,z score says they're spam. APPROACH Block nothing, then create rules to catch what you don't want. ie, build tests that target the spam, then score the millions of ways spam can occur. RESULT Huge time spent tuning and retuning weights, catching everything in sight (including much ham). NEW SITUATION Ham is now the tiniest minority of all email. NEW ASSUMPTION All messages are spam unless x,y,z score says they're ham. NEW APPROACH Block everything, then create rules to not catch what you do want. ie, build tests that target the spam (keeping all the tests you've already built), then score the thousands of ways ham triggers on those tests. NEW RESULT Spend less time and energy while catching more of what you do want and less of what you don't. CHALLENGE All filtering software is written to score for results that equal spam - catch the bad SOLUTION Make filtering software score for results that equal ham - uncatch the good. Your thoughts? How can this method spend less time and energy? Aren't you going to build a mirrored method with respect to the actual one? Your rules wouldn't be like the actual ones, but negated? Giampaolo Dan BTW, is there a better forum for this level of question?
Re: A New Approach: Find the Ham
On Sat, 10 Feb 2007 20:52:17 +0100, Giampaolo Tomassoni [EMAIL PROTECTED] wrote: From: Dan [mailto:[EMAIL PROTECTED] I've developed a new approach to scoring that I want to 1) share with everyone and 2) make into a working system that's as accurate as what I've already built, but easier to use. First, the theory: SITUATION In the beginning, all email was ham. When spam came along, we left the ham alone and targeted the annoyance (spam). ASSUMPTION All messages are ham unless x,y,z score says they're spam. APPROACH Block nothing, then create rules to catch what you don't want; i.e., build tests that target the spam, then score the millions of ways spam can occur. RESULT Huge time spent tuning and retuning weights, catching everything in sight (including much ham). NEW SITUATION Ham is now the tiniest minority of all email. NEW ASSUMPTION All messages are spam unless x,y,z score says they're ham. NEW APPROACH Block everything, then create rules to not catch what you do want; i.e., build tests that target the spam (keeping all the tests you've already built), then score the thousands of ways ham triggers on those tests. NEW RESULT Spend less time and energy while catching more of what you do want and less of what you don't. CHALLENGE All filtering software is written to score for results that equal spam - catch the bad. SOLUTION Make filtering software score for results that equal ham - uncatch the good. Your thoughts? How can this method spend less time and energy? Aren't you going to build a mirror image of the current method? Wouldn't your rules be like the current ones, but negated? Giampaolo Dan BTW, is there a better forum for this level of question? Dan has a good point, on the surface at least. Spam now accounts for 80%+ of all mail, so why are we concentrating on that? At least the point is worth debate (IMHO). Can it be done? Even I can see that it can, given the right impetus.
Though perhaps too many companies are making a good $/£/Y off anti-spam systems based on, around or directly using SA. Be interesting to see where this thread goes. Kind regards Nigel
Re: A New Approach: Find the Ham
CHALLENGE All filtering software is written to score for results that equal spam - catch the bad. SOLUTION Make filtering software score for results that equal ham - uncatch the good. Your thoughts? How can this method spend less time and energy? Aren't you going to build a mirror image of the current method? Wouldn't your rules be like the current ones, but negated? Giampaolo Dan BTW, is there a better forum for this level of question? This would be easier to filter. It would also be better suited to a statistical approach than a regex approach. Personally, I think HTML email should be outright discarded from the start. If you look at the argument presented by the OP, it reinforces the idea that most ASCII mail is ham and most HTML mail is spam. Therefore, reject delivery of all HTML-based email. Or, to be more precise, reject any message whose MIME type is multipart/alternative or that is HTML-only. That would probably remove 90% of the spam in one shot.
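The rejection rule proposed above (refuse multipart/alternative and HTML-only mail) could be sketched roughly as follows. This is only an illustration using Python's standard `email` package, not anything SA itself does; the function name `should_reject` and the sample messages are made up for the example.

```python
from email import message_from_string


def should_reject(raw_message: str) -> bool:
    """Return True for multipart/alternative or HTML-only messages."""
    msg = message_from_string(raw_message)
    # walk() yields the message itself and every nested MIME part.
    content_types = {part.get_content_type() for part in msg.walk()}
    if "multipart/alternative" in content_types:
        return True
    # HTML-only: an HTML part exists but no plain-text part accompanies it.
    return "text/html" in content_types and "text/plain" not in content_types


# Hypothetical sample messages, for illustration only.
PLAIN = "Subject: hello\n\nJust text.\n"
HTML_ONLY = "Subject: offer\nContent-Type: text/html\n\n<p>Buy now</p>\n"

print(should_reject(PLAIN))      # False
print(should_reject(HTML_ONLY))  # True
```

Note that many ordinary mail clients send multipart/alternative by default, so a check like this would also discard whatever legitimate mail arrives in that form.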
Re: A New Approach: Find the Ham
Dan wrote: I've developed a new approach to scoring that I want to 1) share with everyone and 2) make into a working system that's as accurate as what I've already built, but easier to use. First, the theory: NEW ASSUMPTION All messages are spam unless x,y,z score says they're ham. NEW APPROACH Block everything, then create rules to not catch what you do want; i.e., build tests that target the spam (keeping all the tests you've already built), then score the thousands of ways ham triggers on those tests. It strikes me that the hardest part of this approach is avoiding filtering out too much ham. At least for me, it's more important to make sure that people reach me than to filter out all spam. If we take the approach that everything is to be filtered out except x,y,z, then the risk of filtering out too much seems pretty high.
RE: A New Approach: Find the Ham
From: Tom Allison [mailto:[EMAIL PROTECTED] CHALLENGE All filtering software is written to score for results that equal spam - catch the bad. SOLUTION Make filtering software score for results that equal ham - uncatch the good. Your thoughts? How can this method spend less time and energy? Aren't you going to build a mirror image of the current method? Wouldn't your rules be like the current ones, but negated? Giampaolo Dan BTW, is there a better forum for this level of question? This would be easier to filter. It would also be better suited to a statistical approach than a regex approach. Personally, I think HTML email should be outright discarded from the start. If you look at the argument presented by the OP, it reinforces the idea that most ASCII mail is ham and most HTML mail is spam. Therefore, reject delivery of all HTML-based email. Or, to be more precise, reject any message whose MIME type is multipart/alternative or that is HTML-only. That would probably remove 90% of the spam in one shot. Sending plain-text (ASCII) e-mail may well fit your habits and those of your contacts, but it would result in trashing a lot of ham on larger user bases. Giampaolo
Re: A New Approach: Find the Ham
One consideration is that spam getting through is never more than an annoyance. Ham getting caught can be a big problem. So any kind of deny-by-default system has to deal with how to respond to people sending you mail that gets trapped and provide a way for the sender to get approval. How does one join the global whitelist and how does one prevent spammers from joining it? I don't think spam will ever go away until sending email costs money, via some kind of global digital stamp system. Which, frankly, I would welcome with open arms, but it will probably never happen. Dan has a good point, on the surface at least. Spam now accounts for 80%+ of all mail, so why are we concentrating on that? At least the point is worth debate (IMHO). Can it be done? Even I can see that it can, given the right impetus. Though perhaps too many companies are making a good $/£/Y off anti-spam systems based on, around or directly using SA. Be interesting to see where this thread goes. Kind regards Nigel
Re: A New Approach: Find the Ham
This would be easier to filter. It would also be better suited to a statistical approach than a regex approach. Personally, I think HTML email should be outright discarded from the start. If you look at the argument presented by the OP, it reinforces the idea that most ASCII mail is ham and most HTML mail is spam. Therefore, reject delivery of all HTML-based email. Or, to be more precise, reject any message whose MIME type is multipart/alternative or that is HTML-only. That would probably remove 90% of the spam in one shot. Yeah, for about a week. Obviously they won't keep sending HTML mail if everyone is blocking it, right?
Re: A New Approach: Find the Ham
On Sat, 10 Feb 2007 15:14:56 -0500, Miles Fidelman [EMAIL PROTECTED] wrote: Dan wrote: I've developed a new approach to scoring that I want to 1) share with everyone and 2) make into a working system that's as accurate as what I've already built, but easier to use. First, the theory: NEW ASSUMPTION All messages are spam unless x,y,z score says they're ham. NEW APPROACH Block everything, then create rules to not catch what you do want; i.e., build tests that target the spam (keeping all the tests you've already built), then score the thousands of ways ham triggers on those tests. It strikes me that the hardest part of this approach is avoiding filtering out too much ham. At least for me, it's more important to make sure that people reach me than to filter out all spam. If we take the approach that everything is to be filtered out except x,y,z, then the risk of filtering out too much seems pretty high. These are my local stats... I'd far rather those numbers were the other way round. Even if Dan is wrong, at least he's thinking. http://www.blue-canoe.com/stats/index.php?D1=11 What do Theo, Matt & Co have to say? They've been doing this a lot longer than us. Kind regards
RE: A New Approach: Find the Ham
From: Miles Fidelman [mailto:[EMAIL PROTECTED] Dan wrote: I've developed a new approach to scoring that I want to 1) share with everyone and 2) make into a working system that's as accurate as what I've already built, but easier to use. First, the theory: NEW ASSUMPTION All messages are spam unless x,y,z score says they're ham. NEW APPROACH Block everything, then create rules to not catch what you do want; i.e., build tests that target the spam (keeping all the tests you've already built), then score the thousands of ways ham triggers on those tests. It strikes me that the hardest part of this approach is avoiding filtering out too much ham. At least for me, it's more important to make sure that people reach me than to filter out all spam. If we take the approach that everything is to be filtered out except x,y,z, then the risk of filtering out too much seems pretty high. I definitely agree with you. By the way, if Dan really brought a new perspective to us (i.e., a new way to detect ham), what would stop us from integrating it into SA? I would like to see this new perspective, however... Giampaolo
Re: A New Approach: Find the Ham
Clarifications: 1) I'm not talking about generating new rules. Rules stay the same. I'm describing a new scoring process only. 2) This would not be a replacement for SA, but an improvement. Just a new way to process results already generated by SA. Ideally, this would be a replacement for weights and metas. Dan How can this method spend less time and energy? Aren't you going to build a mirror image of the current method? Wouldn't your rules be like the current ones, but negated? Giampaolo Dan has a good point, on the surface at least. Spam now accounts for 80%+ of all mail, so why are we concentrating on that? At least the point is worth debate (IMHO). Can it be done? Even I can see that it can, given the right impetus. Though perhaps too many companies are making a good $/£/Y off anti-spam systems based on, around or directly using SA. Be interesting to see where this thread goes. Kind regards Nigel
Re: A New Approach: Find the Ham
Is that the same as whitelisting? Maybe I do not understand, but a very rigorous approach would be a whitelist methodology in which, once a new account is created, the user sends email to everyone they want to communicate with, and the system 'autowhitelists' those addresses, so you can only receive from those you communicate with (or want to); i.e., the user has to authorize a sender into the whitelist (that way the email address owner is solely responsible for what they receive). The main problem (although someone may be able to come up with an appropriate compromise) is that if everyone were using this methodology, how would one ever receive email? Nonetheless, since there is less ham than spam nowadays, it makes more sense to do what you are saying and deal only with the traffic the user wishes to see instead of that which they don't. It seems the actual programming needed to deal with this would be less stressful on machine resources as well; i.e., fewer resources would be consumed dealing with less incoming crap (er, mail, I mean). Stop it at the connection... maybe a ulog plugin. Just a thought. Miles Fidelman wrote: Dan wrote: I've developed a new approach to scoring that I want to 1) share with everyone and 2) make into a working system that's as accurate as what I've already built, but easier to use. First, the theory: NEW ASSUMPTION All messages are spam unless x,y,z score says they're ham. NEW APPROACH Block everything, then create rules to not catch what you do want; i.e., build tests that target the spam (keeping all the tests you've already built), then score the thousands of ways ham triggers on those tests. It strikes me that the hardest part of this approach is avoiding filtering out too much ham. At least for me, it's more important to make sure that people reach me than to filter out all spam. If we take the approach that everything is to be filtered out except x,y,z, then the risk of filtering out too much seems pretty high.
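A minimal sketch of the 'autowhitelist' idea described above, and of the chicken-and-egg problem it raises when both sides use it. The class name, method names and example addresses are all hypothetical; this is not an existing SA feature.

```python
class AutoWhitelist:
    """Per-mailbox whitelist built only from addresses the owner has written to."""

    def __init__(self) -> None:
        self._allowed = set()

    def record_outgoing(self, recipients) -> None:
        # Every address the owner sends mail to becomes an allowed sender.
        for address in recipients:
            self._allowed.add(address.strip().lower())

    def accepts(self, sender: str) -> bool:
        # Deny by default: only previously contacted addresses get through.
        return sender.strip().lower() in self._allowed


# Two users who both rely on the scheme (hypothetical addresses).
alice = AutoWhitelist()
bob = AutoWhitelist()

alice.record_outgoing(["bob@example.org"])   # Alice writes to Bob first
print(bob.accepts("alice@example.org"))      # False
print(alice.accepts("bob@example.org"))      # True
```

Alice's first message never reaches Bob, because Bob has not yet written to Alice, so some out-of-band approval step would be needed to get a conversation started.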
Re: A New Approach: Find the Ham
On Feb 10, 2007, at 12:14, Miles Fidelman wrote: Dan wrote: I've developed a new approach to scoring that I want to 1) share with everyone and 2) make into a working system that's as accurate as what I've already built, but easier to use. First, the theory: NEW ASSUMPTION All messages are spam unless x,y,z score says they're ham. NEW APPROACH Block everything, then create rules to not catch what you do want; i.e., build tests that target the spam (keeping all the tests you've already built), then score the thousands of ways ham triggers on those tests. It strikes me that the hardest part of this approach is avoiding filtering out too much ham. At least for me, it's more important to make sure that people reach me than to filter out all spam. If we take the approach that everything is to be filtered out except x,y,z, then the risk of filtering out too much seems pretty high. Actually, [unparalleled] accuracy is built into this approach. Currently, a ham gets caught and you either take out the rule that caught it or make a whitelist entry. Lots of ongoing work = little cumulative return. With Find the Ham, whitelisting is almost obsolete. When you find an FP, you make an exception for the specific profile (the particular combination of tests/rules that caught the message), so this specific combination doesn't catch any more messages. Each rule stays at full strength for every other combination and no whitelist is needed. This training process is the best part of the whole approach. It begins with huge numbers of FPs, but significant improvements take only a few weeks. After a few months (depending on the diversity of your ham), FPs are very, very rare. Little ongoing work = huge cumulative return. Dan
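One possible reading of the profile-exception idea, sketched in Python. Everything here is an assumption for illustration: the class and method names are invented, the rule names are placeholders, and treating a message that hits no rules as "not caught" is a guess, since the post does not say how such messages are handled.

```python
class ProfileFilter:
    """Unweighted filter: any rule hit catches a message unless the exact
    set of hits (its profile) has been recorded as a known ham profile."""

    def __init__(self) -> None:
        self._ham_profiles = set()

    def mark_ham(self, hits) -> None:
        # Called when a reviewer finds a false positive: exempt this exact
        # combination of rules, and nothing else.
        self._ham_profiles.add(frozenset(hits))

    def is_caught(self, hits) -> bool:
        profile = frozenset(hits)
        if not profile:
            return False  # assumption: no rule hits means nothing caught it
        return profile not in self._ham_profiles


f = ProfileFilter()
print(f.is_caught({"RULE_A", "RULE_B"}))            # True
f.mark_ham({"RULE_A", "RULE_B"})                    # reviewer reports an FP
print(f.is_caught({"RULE_A", "RULE_B"}))            # False
print(f.is_caught({"RULE_A"}))                      # True
print(f.is_caught({"RULE_A", "RULE_B", "RULE_C"}))  # True
```

Using a `frozenset` means the order in which rules fire does not matter, and exempting one combination leaves every individual rule in force for all other combinations.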
Re: A New Approach: Find the Ham
NEW SITUATION Ham is now the tiniest minority of all email. NEW ASSUMPTION All messages are spam unless x,y,z score says they're ham. NEW APPROACH Block everything, then create rules to not catch what you do want; i.e., build tests that target the spam (keeping all the tests you've already built), then score the thousands of ways ham triggers on those tests. NEW RESULT Spend less time and energy while catching more of what you do want and less of what you don't. CHALLENGE All filtering software is written to score for results that equal spam - catch the bad. SOLUTION Make filtering software score for results that equal ham - uncatch the good. Your thoughts? Here is my $0,02. I have a similar approach already. My problem is that 80% of the messages are in pt_BR, which makes a lot of the SA rules that target English ineffective. There is a large grey area containing too much spam (FN) and ham (FP). So, my approach is to quarantine mail for some users at scores as low as 4.0 (or even less). This mail is moved to an IMAP folder and then manually sorted into ham and spam folders. This lets rules be created to catch spam, but also to catch ham (which is harder and more dangerous ground). If necessary, white and black lists are created, but this is the last resort as it is not an affordable/scalable solution. The spam and ham folders are then used to train Bayes with sa-learn, and the ham is given back to the user if necessary. This approach has a drawback: explicit authorization from the user is necessary (in my view). So a user (if they want to help) may choose to let their mail be quarantined and then get it back, or let their mail (scoring above 4.0) be analysed but not quarantined (just a copy is kept, and nothing needs to be given back). A good side of this is that it is not necessary for a lot of users to let their mail be analysed. The rules will improve for everyone based on a few users.
Bayes also plays a more important role than in an English environment, because of the lack of good rules in the native language. Site-wide Bayes is missing (per-user is used), but it would help separate the grey area even more for non-monitored or low-volume users. On the scripting side I use Mail::IMAPClient, and I urge anyone writing their own scripts to stay away from Mail::Box. -Raul Dias
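The consent-based routing described above might look something like this. It is a sketch only: the 4.0 threshold comes from the message, but the enum, the action names and the function are invented for illustration, and the handling of mail above the normal spam threshold is deliberately left out.

```python
from enum import Enum
from typing import List


class Consent(Enum):
    NONE = "none"              # user has not authorized any review
    COPY_ONLY = "copy"         # mail is delivered, a copy is kept for analysis
    QUARANTINE = "quarantine"  # mail is held for review, ham is returned later


GREY_AREA_THRESHOLD = 4.0  # score from which mail is set aside for review


def route(score: float, consent: Consent) -> List[str]:
    """Return the actions to take for one message (hypothetical action names)."""
    if score < GREY_AREA_THRESHOLD or consent is Consent.NONE:
        return ["deliver"]
    if consent is Consent.COPY_ONLY:
        return ["deliver", "copy_to_review_folder"]
    return ["move_to_review_folder"]


print(route(2.1, Consent.QUARANTINE))  # ['deliver']
print(route(4.3, Consent.COPY_ONLY))   # ['deliver', 'copy_to_review_folder']
print(route(4.3, Consent.QUARANTINE))  # ['move_to_review_folder']
```

The manually sorted review folders would then be the input for Bayes training with sa-learn, as the message describes.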
Re: A New Approach: Find the Ham
On Sat, 10 Feb 2007, Dan wrote: With Find the Ham, whitelisting is almost obsolete. When you find an FP, How do you ever find FPs if you have so many TP to sort through that it's not even worth sorting through FP+TP to find the FP ? IMHO, that'd be why we assume that mails are ham rather than assume that they are spam. Mathieu Bouchard - tél:+1.514.383.3801 - http://artengine.ca/matju | Freelance Digital Arts Engineer, Montréal QC Canada
Re: A New Approach: Find the Ham
On Feb 10, 2007, at 14:38, Mathieu Bouchard wrote: How do you ever find FPs if you have so many TP to sort through that it's not even worth sorting through FP+TP to find the FP ? IMHO, that'd be why we assume that mails are ham rather than assume that they are spam. I haven't found FP reviewing to be a big deal. In my latest SA based configuration, for example, I organize captures according to the quantity of tests a given message fails. The more tests are involved, the less a message needs to be double checked. So as with other particulars, ease of use will depend on how well the approach is implemented. Dan
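The review ordering Dan describes (fewer failed tests means a message deserves a closer look) can be sketched as a simple sort. The function name, message IDs and rule names are hypothetical.

```python
def review_order(captured):
    """Sort captured messages so those that hit the fewest tests come first.

    `captured` is a list of (message_id, hits) pairs, where `hits` is the
    collection of rule names the message triggered.
    """
    return sorted(captured, key=lambda item: len(item[1]))


captured = [
    ("msg-1", {"RULE_A", "RULE_B", "RULE_C"}),
    ("msg-2", {"RULE_A"}),
    ("msg-3", {"RULE_A", "RULE_B"}),
]

for message_id, hits in review_order(captured):
    print(message_id, len(hits))
```

A reviewer working from the top of this list sees the least certain captures first and can stop once the hit counts get high enough to trust.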
Re: A New Approach: Find the Ham
Good point, but it will cause trouble UNLESS we find a way to recognize ham 100% of the time. And it must be exactly 100% (99% won't be enough). As other users said, with the current system, if we can filter 70-80% of the spam, the remaining 20-30% will only be an annoyance, but ham will be delivered. But with the new approach, even if 100% of the spam is stopped, just 1% undelivered ham will cause a lot of trouble. Just my 1 Yen :-) Dan wrote: I've developed a new approach to scoring that I want to 1) share with everyone and 2) make into a working system that's as accurate as what I've already built, but easier to use. First, the theory: SITUATION In the beginning, all email was ham. When spam came along, we left the ham alone and targeted the annoyance (spam). ASSUMPTION All messages are ham unless x,y,z score says they're spam. APPROACH Block nothing, then create rules to catch what you don't want; i.e., build tests that target the spam, then score the millions of ways spam can occur. RESULT Huge time spent tuning and retuning weights, catching everything in sight (including much ham). NEW SITUATION Ham is now the tiniest minority of all email. NEW ASSUMPTION All messages are spam unless x,y,z score says they're ham. NEW APPROACH Block everything, then create rules to not catch what you do want; i.e., build tests that target the spam (keeping all the tests you've already built), then score the thousands of ways ham triggers on those tests. NEW RESULT Spend less time and energy while catching more of what you do want and less of what you don't. CHALLENGE All filtering software is written to score for results that equal spam - catch the bad. SOLUTION Make filtering software score for results that equal ham - uncatch the good. Your thoughts? Dan BTW, is there a better forum for this level of question?