Re: URIBL_RHS_DOB high hits

2014-10-12 Thread Reindl Harald



On 12.10.2014 at 03:39, Karsten Bräckelmann wrote:

On Sun, 2014-10-12 at 02:58 +0200, Reindl Harald wrote:

 You claimed DOB listed sourceforge.net, which it didn't

that i admit

 You repeatedly claimed their listing to be random, which
 it isn't. That is what I referred to as false accusations

i saw the first message in a relaxed tail -f on the maillog and saw that 
it was one of my netatalk list messages; the thread is some days old and 
the sender is not a new one - so for me that is enough to call it random, 
because i don't know about other users' mail bodies


i grepped the maillog for URIBL_RHS_DOB and looked at the other tags, and 
what i saw for message-ids, senders and rcpts is by definition a random 
result


 You are free to disable DOB on your server
 You are not free to claim $list responses to be random without proof

well, so the next time i just disable things that make problems on my 
server without writing a mail - i don't see the affected domains in the 
logs but i know our mailflow well, so i can assure you that if i have more 
than 1 URIBL_RHS_DOB combined with BAYES_00 on a day it stinks



if there are no data for whatever reason the answer should be NXDOMAIN
and not 127.0.0.1 in doubt because FP does more harm than FN


False accusation, again. You just claimed $list would return anything
other than NXDOMAIN in case of not-being-listed.


no - you wanted to read it that way. i claim that they have an error in 
aggregating the data and that since yesterday they have had random 
domains listed, which is proven by at least two hits


URIBL_RHS_DOB | grep BAYES_00  | grep Oct 11: 7
URIBL_RHS_DOB | grep BAYES_00  | grep Oct 10: 0
URIBL_RHS_DOB | grep BAYES_00  | grep Oct  9: 3
URIBL_RHS_DOB | grep BAYES_00  | grep Oct  8: 0
URIBL_RHS_DOB | grep BAYES_00  | grep Oct  7: 174
URIBL_RHS_DOB | grep BAYES_00  | grep Oct  6: 255
URIBL_RHS_DOB | grep BAYES_00  | grep Oct  5: 0
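
The per-day numbers above come from chained greps over the maillog. A minimal sketch of the same count in Python follows, assuming syslog-style lines that start with a "Mon DD" timestamp and carry the SpamAssassin tags on the same line; the function name and the log path in the usage part are hypothetical.

```python
import re
from collections import Counter

# Assumed syslog-style prefix such as "Oct  7 12:34:56 host ..."
DAY_RE = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2})\s")


def count_hits_per_day(lines, tags=("URIBL_RHS_DOB", "BAYES_00")):
    """Count log lines per day that contain every tag in `tags`."""
    counts = Counter()
    for line in lines:
        # Same effect as piping one grep into the next: all tags must match.
        if not all(tag in line for tag in tags):
            continue
        match = DAY_RE.match(line)
        if match:
            month, day = match.group(1), int(match.group(2))
            counts[f"{month} {day:2d}"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log location; adjust to the local setup.
    with open("/var/log/maillog", errors="replace") as log:
        for day, hits in sorted(count_hits_per_day(log).items()):
            print(f"{day}: {hits}")
```

A day that jumps from 0 to three-digit counts, as in the list above, then stands out without running one grep per date.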

on the other hand we had exactly that a few days ago: anything not listed 
got answered *randomly* depending on the dns server, so yes, i claim 
that i can no longer trust that list, and frankly i can claim what i 
trust or not trust because it is my opinion, not anybody else's


goo.gl as a link shortener already carries penalties, so a FP there is 
worse than for other domains: URIBL_DBL_ABUSE_REDIR,URIBL_DBL_REDIR,URIBL_RHS_DOB



   $ host not-registered-domain.com.dob.sibl.support-intelligence.net
   Host not-registered-domain.com.dob.sibl.support-intelligence.net not found: 3(NXDOMAIN)
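
The host query above can be repeated from a script. This is a minimal sketch, assuming only the zone name shown in the query; the helper names are hypothetical. Note that socket.gethostbyname does not distinguish NXDOMAIN from other lookup failures such as a timeout or SERVFAIL, so a False result here only means "no A record was obtained".

```python
import socket

# Zone taken from the host query above.
DOB_ZONE = "dob.sibl.support-intelligence.net"


def dob_query_name(domain, zone=DOB_ZONE):
    """Build the DNS name to query for `domain` in an RHS blocklist zone."""
    return f"{domain.strip('.').lower()}.{zone}"


def is_listed(domain, resolve=socket.gethostbyname, zone=DOB_ZONE):
    """Return True if the zone answers with an A record for `domain`.

    `resolve` is injectable so the logic can be checked without network
    access; by default the system resolver is used.
    """
    try:
        resolve(dob_query_name(domain, zone))
    except socket.gaierror:
        # Any failed lookup (NXDOMAIN included) is treated as not listed.
        return False
    return True


if __name__ == "__main__":
    print(is_listed("not-registered-domain.com"))
```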

We're talking false positive listings. Not random responses, neither
positive listing if in doubt.

Again, stop unfounded false accusations on this list.


again - we both don't know how those domains got listed and which others 
are listed the same way, but i can say for sure that nobody maintains 
that list by hand, so they have a major bug in aggregating the data. the 
only thing i claimed was that it looks as if an error response while 
aggregating data got wrongly parsed - that said from a *software 
developer's point of view* (my daily job)


from the point of view of a mailadmin responsible for legit business 
mail i don't care about the cause, but i need to take action in case of 
false positives because they can push the score high enough to reject 
legit mail (and yes, SA with spamass-milter rejects)






Re: Site-wide bayes and individual bayes

2014-10-12 Thread LuKreme
On 10 Oct 2014, at 06:49 , RW rwmailli...@googlemail.com wrote:
 And, if not, is it generally better to do sitewide?
 
 It's hard to say, there are advantages and disadvantages either way.

OK, so specific example then.

Small server with a few dozen email users spread over several domains. Almost 
none of these users does any spam training at all, the rest just delete 
unwanted messages (not even marking them as junk) or even worse, just ignore 
them. One user is very aggressive in marking Spam and in keeping the Inbox 
clear of all spam.

I am of two minds. First, that everyone else would benefit from this user’s 
actions or, alternatively, that the user’s aggressive tagging will actually 
‘poison’ the bayes db for the other users who maybe do not think that endless 
emails from pinterest or some political candidate are actually spam.

-- 
You see, in this world there's two kinds of people, my friend: Those
with loaded guns and those who dig. You dig.



Re: Site-wide bayes and individual bayes

2014-10-12 Thread Reindl Harald


On 12.10.2014 at 18:59, LuKreme wrote:

On 10 Oct 2014, at 06:49 , RW rwmailli...@googlemail.com wrote:

And, if not, is it generally better to do sitewide?


It's hard to say, there are advantages and disadvantages either way.


OK, so specific example then.

Small server with a few dozen email users spread over several domains. Almost 
none of these users does any spam training at all, the rest just delete 
unwanted messages (not even marking them as junk) or even worse, just ignore 
them. One user is very aggressive in marking Spam and in keeping the Inbox 
clear of all spam.

I am of two minds. First, that everyone else would benefit from this user’s 
actions or, alternatively, that the user’s aggressive tagging will actually 
‘poison’ the bayes db for the other users who maybe do not think that endless 
emails from pinterest or some political candidate are actually spam.


if nobody trains his user-specific bayes (like here), site-wide is the 
way to go, simply because until a user has flagged 200 ham messages his 
bayes won't get used, regardless of how many messages he marked as spam
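
The threshold mentioned above can be written down as a small check. This is a sketch, not SpamAssassin's own code: it assumes the stock values of 200 for both bayes_min_ham_num and bayes_min_spam_num, and the function name is hypothetical. The current counts for a database can be read from the nham and nspam lines of "sa-learn --dump magic".

```python
# Stock SpamAssassin settings (bayes_min_ham_num / bayes_min_spam_num).
BAYES_MIN_HAM_NUM = 200
BAYES_MIN_SPAM_NUM = 200


def bayes_is_used(nham, nspam,
                  min_ham=BAYES_MIN_HAM_NUM, min_spam=BAYES_MIN_SPAM_NUM):
    """Return True once both learned counts reach their minimum.

    Any amount of learned spam does not help while the ham count is
    still below its threshold, and vice versa.
    """
    return nham >= min_ham and nspam >= min_spam


if __name__ == "__main__":
    # A user who only ever flags spam never gets a working per-user bayes.
    print(bayes_is_used(nham=0, nspam=5000))
    print(bayes_is_used(nham=200, nspam=200))
```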


merging one user's aggressive training into the site-wide bayes means 
you need to trust that user's actions - meaning: he needs to be careful 
and not just flag anything he doesn't want to see as spam


if it is really one or two users like here i would stay with a normal 
site-wide bayes. i implemented that with IMAP shared folders, where those 
users see a ham/spam folder to move messages into and are advised to be 
careful that ham samples do not leak sensitive content


i review that stuff, save the eml messages to the training folders on 
the mailserver and call the sa-learn script; until now a nearly 100% 
result over 8 weeks of production (99% of spam caught, no false positives)
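
The training step described above could be wrapped roughly as follows. This is only a sketch of one possible wrapper, not the script used on that server: the folder paths and function names are hypothetical, and the only sa-learn options assumed are --ham and --spam with a directory argument.

```python
import subprocess


def sa_learn_command(kind, folder):
    """Build the sa-learn call for one reviewed training folder."""
    if kind not in ("ham", "spam"):
        raise ValueError(f"unknown training kind: {kind!r}")
    return ["sa-learn", f"--{kind}", folder]


def train(ham_dir, spam_dir, run=subprocess.run):
    """Learn the reviewed ham folder first, then the spam folder.

    `run` is injectable so the call sequence can be checked without
    sa-learn being installed.
    """
    for kind, folder in (("ham", ham_dir), ("spam", spam_dir)):
        run(sa_learn_command(kind, folder), check=True)


if __name__ == "__main__":
    # Hypothetical locations of the manually reviewed samples.
    train("/var/lib/sa-training/ham", "/var/lib/sa-training/spam")
```

Keeping the review step manual and only scripting the sa-learn call matches the point about trust: nothing reaches the site-wide database unseen.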






Re: Site-wide bayes and individual bayes

2014-10-12 Thread Ted Mittelstaedt



On 10/12/2014 9:59 AM, LuKreme wrote:

On 10 Oct 2014, at 06:49 , RW rwmailli...@googlemail.com wrote:

And, if not, is it generally better to do sitewide?


It's hard to say, there are advantages and disadvantages either
way.


OK, so specific example then.

Small server with a few dozen email users spread over several
domains. Almost none of these users does any spam training at all,
the rest just delete unwanted messages (not even marking them as
junk) or even worse, just ignore them. One user is very aggressive in
marking Spam and in keeping the Inbox clear of all spam.

I am of two minds. First, that everyone else would benefit from this
user’s actions or, alternatively, that the user’s aggressive tagging
will actually ‘poison’ the bayes db for the other users who maybe do
not think that endless emails from pinterest or some political
candidate are actually spam.



For starters your problem isn't SPAM it's HAM.

You can get all the spam you want.  Just parse the mail log file every
day for a few weeks, looking for delivery attempts to nonexistent
mailboxes.  When you see repeated delivery attempts to a specific
mailbox, create an email address for that nonexistent mailbox and
redirect all the email sent to it into a spam box.
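
The log parsing described above might look like the sketch below. The log format is an assumption: it expects Postfix-style reject lines that contain the text "User unknown" and a "to=<address>" field, so the marker and the pattern need adjusting for other MTAs. The function name and the threshold of three attempts are hypothetical.

```python
import re
from collections import Counter

# Assumed Postfix-style recipient field, e.g. "... to=<someone@example.com> ..."
TO_RE = re.compile(r"\bto=<([^>]+)>")


def honeypot_candidates(lines, marker="User unknown", min_attempts=3):
    """Return (address, attempts) pairs for repeatedly probed mailboxes."""
    attempts = Counter()
    for line in lines:
        if marker not in line:
            continue
        match = TO_RE.search(line)
        if match:
            attempts[match.group(1).lower()] += 1
    # Most probed addresses first; one-off typos stay below the threshold.
    return [(addr, n) for addr, n in attempts.most_common() if n >= min_attempts]


if __name__ == "__main__":
    # Hypothetical log location; adjust to the local setup.
    with open("/var/log/maillog", errors="replace") as log:
        for address, count in honeypot_candidates(log):
            print(f"{count:6d}  {address}")
```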


My experience is that once spammers think they have discovered an
email address they will never leave it alone, they will send increasing
amounts of spam to that address.

If you are lucky enough to never have spammers trying to probe your
server, you can create your own honeypot email addresses, just make them
up, and then post these addresses into the Unsubscribe links on spam.
That is a good way to contaminate spammers' mailing lists with honeypot
addresses.  A legitimate mail sender will ignore these; a spammer will
happily pull addresses out of unsubscribe replies.

That's your centralized spam source.  Do this for a couple dozen
nonexistent email addresses on your server domains and you will have
all the input you want for the Bayes learner.

By definition ANY email to a nonexistent address (not an old address
that was closed down years ago) is unsolicited, AKA SPAM.

As for desired political mail, on my servers I classify all of it as
spam, I can think of maybe only 2 users over the last decade who have
complained about not getting it and for those it's easy to do an
all_spam_to to them and then tell them they will have to do their own
spam filtering.

Since overwhelmingly the political email I have seen coming in is the
offensive conservative anti-women, anti-blacks, anti-latinos, beg for
more money email, I have to say that I'm not particularly concerned
about the wishes of customers who WANT that kind of mail - I'm quite
happy if they go find another provider.

And, naturally, that kind of email is never ever appropriate for a
business and no employee in a business is ever going to dare complain to 
their bosses that they aren't getting it.


If the politicos want to drown people in hate mail, they have paper
mail to do it - might as well make them help reduce my taxes by
subsidizing the US Post Office with their hate mail, that's about the 
only thing that's good about it.


Anyway, as I said HAM is the problem.  If you don't have large
quantities of ham, Bayes won't work.  Of course, nothing is preventing
you from copying people's folders (if they are using IMAP) into one
giant mailbox and using that as a HAM source.  You can probably assume
that if a user has gone to the trouble of saving mail to a folder that
it is ham.
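
Copying saved folders into one giant mailbox could be done along these lines. This is a sketch under assumptions: the IMAP server is taken to store folders as Maildir directories on disk, the paths and the function name are hypothetical, and no locking is done, so it is meant for a one-off run while nothing else writes to the target file.

```python
import mailbox


def collect_ham(maildir_paths, mbox_path):
    """Copy every message from the given Maildir folders into one mbox.

    Returns the number of messages copied. The source folders are only
    read, never modified.
    """
    copied = 0
    corpus = mailbox.mbox(mbox_path)
    try:
        for path in maildir_paths:
            # create=False: fail loudly on a wrong path instead of
            # silently creating an empty folder.
            folder = mailbox.Maildir(path, create=False)
            for message in folder:
                corpus.add(message)
                copied += 1
        corpus.flush()
    finally:
        corpus.close()
    return copied


if __name__ == "__main__":
    # Hypothetical folder locations for two users' saved mail.
    saved = ["/home/alice/Maildir/.Saved", "/home/bob/Maildir/.Projects"]
    print(collect_ham(saved, "/var/lib/sa-training/ham.mbox"))
```

The resulting file can then be fed to the learner as ham, for example with sa-learn's --mbox option.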

Ted