The data set I use for Bayes consists of both ham and spam
(https://www.cs.cmu.edu/~./enron/).

Let's consider a scenario where I have a domain and point it to a
mail server. It might take a while for me to get to 50,000 mails a day
(which Mailinator provides me). I would need to embed multiple mail
addresses in several forums for web scrapers to pick them up.

I have tried to get hold of mails from my university, but it is a long and
tedious process.

I can try the method which Reindl suggested.



On Tue, May 31, 2016 at 6:32 AM, Reindl Harald <h.rei...@thelounge.net>
wrote:

>
>
> On 31.05.2016 at 15:28, Antony Stone wrote:
>
>> 2. You should be aware (*especially* if using this stuff as the basis of a
>> research project - any competent referee should pick up on something like
>> this) that SA works best when the emails it is asked to process are from
>> the same source as it has been trained with.  In other words, you shovel
>> real emails through a real mail server and train SA using this spam and
>> ham; you then use that trained SA to assess mail passing through that same
>> mail server, for the same users.  Anything significantly varying from this
>> is not going to work well, and is certainly not a good test of how well SA
>> works.
>>
>
> not true - i heard similar nonsense about "you can't re-use your MX bayes
> database on a submission server" - i can, do and it works like a charm
>
> our current corpus is 90000 mails large, contains samples in many
> languages for many users (site-wide setup) and that bayes has been shared
> with another company for more than a year now and has the same results
> there as here (96% hit rate)
>
>
