Re: Spamassassin not capturing obvious Spam

Shivram Krishnan Tue, 31 May 2016 06:48:22 -0700

BTW, I am using SA as an oracle for blacklisting. Our research is concerned
with combining multiple blacklist sources, and with taking into account the
listing history of an IP in each blacklist, to create a more effective
master blacklist.


Let me give you an example.
Suppose an IP address 1.2.3.4 appeared in Blacklist A on Jan 1, 2016, and
stayed on Blacklist A for about 12 hours.

We have developed a system which assigns a score to 1.2.3.4. If that score
is high enough, we include 1.2.3.4 in our master blacklist.
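
To make the idea concrete, here is a minimal sketch of such a score. The
actual scoring formula is not described in this mail, so everything below is
an assumption for illustration only: the names Listing, score_ip and
build_master_blacklist, the per-blacklist weights, the 30-day half-life and
the threshold are all hypothetical. The sketch sums the hours an IP spent on
each blacklist and decays older listings exponentially.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Listing:
    """One period during which an IP was present on a blacklist (hypothetical record)."""
    blacklist: str
    start: datetime
    end: datetime


def score_ip(listings, now, half_life_days=30.0, weights=None):
    """Sum listed hours per blacklist, decayed by how long ago the listing ended.

    Assumption: the weighting and the exponential decay are illustrative
    choices, not the scoring used in the research described above.
    """
    weights = weights or {}
    score = 0.0
    for listing in listings:
        duration_hours = (listing.end - listing.start).total_seconds() / 3600.0
        age_days = max((now - listing.end).total_seconds() / 86400.0, 0.0)
        decay = 0.5 ** (age_days / half_life_days)
        score += weights.get(listing.blacklist, 1.0) * duration_hours * decay
    return score


def build_master_blacklist(history, now, threshold):
    """Return the set of IPs whose score reaches the (hypothetical) threshold."""
    return {ip for ip, listings in history.items()
            if score_ip(listings, now) >= threshold}


# The example from above: 1.2.3.4 sat on Blacklist A for 12 hours on Jan 1, 2016.
history = {
    "1.2.3.4": [Listing("A", datetime(2016, 1, 1, 0), datetime(2016, 1, 1, 12))],
}
master = build_master_blacklist(history, now=datetime(2016, 1, 1, 12), threshold=10.0)
```

With these assumed parameters, a listing that has just ended counts at full
weight and one that ended 30 days ago counts at half weight.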

To evaluate the performance of the master blacklist in terms of hit rate and
false positives, we plan to use SA.
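
Roughly, the evaluation could look like the sketch below. It assumes each
mail has already been reduced to a (sender IP, SA verdict) pair, where the
verdict is True when SA classified the mail as spam; how those pairs are
extracted is not shown, and the function name evaluate_blacklist is
hypothetical. Hit rate is taken here as the fraction of SA-spam whose sender
IP is on the master blacklist, and false positive rate as the fraction of
SA-ham whose sender IP is on it.

```python
def evaluate_blacklist(blacklist, messages):
    """Return (hit_rate, false_positive_rate) of a blacklist against SA verdicts.

    blacklist: set of IP address strings.
    messages:  iterable of (sender_ip, is_spam) pairs, is_spam being SA's verdict.
    """
    spam_total = spam_listed = 0
    ham_total = ham_listed = 0
    for sender_ip, is_spam in messages:
        listed = sender_ip in blacklist
        if is_spam:
            spam_total += 1
            spam_listed += listed
        else:
            ham_total += 1
            ham_listed += listed
    # Guard against empty classes so the function never divides by zero.
    hit_rate = spam_listed / spam_total if spam_total else 0.0
    false_positive_rate = ham_listed / ham_total if ham_total else 0.0
    return hit_rate, false_positive_rate


# Made-up sample data, for illustration only.
sample = [
    ("1.2.3.4", True),
    ("5.6.7.8", True),
    ("1.2.3.4", False),
    ("9.9.9.9", False),
    ("9.9.9.9", False),
    ("8.8.8.8", False),
]
hit_rate, fp_rate = evaluate_blacklist({"1.2.3.4"}, sample)
```

Note that the numbers are only as good as SA's verdicts, since SA is the
oracle here.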

On Tue, May 31, 2016 at 6:43 AM, Shivram Krishnan <rorryk...@gmail.com>
wrote:

> The data set which I use for Bayes consists of both ham and spam
> (https://www.cs.cmu.edu/~./enron/).
>
> Let's consider a scenario where I have a domain and I point it to a
> mail server. It might take a while for me to reach 50,000 mails a day
> (Mailinator provides me this). I would need to embed multiple mail IDs
> into several forums for the web scrapers to pick them up.
>
> I have tried to get hold of mails from my university - but it is a long
> and tedious process.
>
> I can try the method which Reindl suggested.
>
>
>
> On Tue, May 31, 2016 at 6:32 AM, Reindl Harald <h.rei...@thelounge.net>
> wrote:
>
>>
>>
>> Am 31.05.2016 um 15:28 schrieb Antony Stone:
>>
>>> 2. You should be aware (*especially* if using this stuff as the basis of
>>> a research project - any competent referee should pick up on something
>>> like this) that SA works best when the emails it is asked to process are
>>> from the same source as it has been trained with.  In other words, you
>>> shovel real emails through a real mail server and train SA using this
>>> spam and ham; you then use that trained SA to assess mail passing through
>>> that same mail server, for the same users.  Anything significantly
>>> varying from this is not going to work well, and is certainly not a good
>>> test of how well SA works.
>>>
>>
>> not true - i heard similar nonsense about "you can't re-use your MX bayes
>> database on a submission server" - i can, do and it works like a charm
>>
>> our current corpus is 90000 mails large, contains samples in many
>> languages for many users (site-wide setup) and that bayes is shared with
>> another company for more than a year now and has the same results there as
>> here (96% hit rate)
>>
>>
>
