Re: [Mailman-Developers] Architecture for extra profile info

2013-04-17 Thread Florian Fuchs
Hi,

some first thoughts on it:

1) It should be self-contained.
Meaning: It should not depend on any
mailman/mailman.client/postorius/hyperkitty packages.

2) Like the core, it should implement a Python-based webserver.
It doesn't need to run on Ports 80/443, so we don't have to care about
one of the popular web servers already listening to those ports.
Also, we definitely don't want admins to have to follow another "how
to setup mailman.users with Apache/mod_wsgi" howto.
It should just run if someone says "start" (Maybe it can even hook
into $ bin/mailman start? I know, that contradicts item 1. ...).

3) It doesn't need Django.
Since it will not deliver any HTML (except an oAuth login form -- see
5.) and it doesn't need to be integrated into any existing web site,
we can choose a more light-weight framework.

4) Adding new content types for user records should be easy, but
clearly defined.
We don't know what information applications need to store. Icons,
essays, avatars, IRC handles, Twitter names, ...
So we might think about using a schema-less database, but: We don't
want to make it possible to just manipulate the result JSON and POST
it back to the resource, possibly deleting things other apps need. So
adding new information types should be a separate process.

5) It should implement an oAuth provider.
This could be used for API authentication and to log into
Postorius/Hyperkitty (even on other servers. Hint: Reputation!)
We won't need this for a first prototype though. For now we're
probably fine with 6)

6) Like the core it should be accessible with BasicAuth from localhost.
Ideally, in the future, it should be accessible both via BasicAuth
from localhost and via oAuth from the outside world...
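
A minimal sketch of how points 2), 4) and 6) might fit together, using only
the Python standard library. Everything named here is an assumption made for
illustration: the credentials, the port, the field registry and the URL
layout are hypothetical and not part of any existing Mailman package. Field
types are registered in code, so a client cannot introduce or overwrite
arbitrary keys by POSTing edited JSON back.

```python
import base64
import hmac
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical credentials and field registry, for illustration only.
API_USER, API_PASSWORD = "restadmin", "restpass"
FIELD_TYPES = {"irc_handle": str, "twitter_name": str, "avatar_url": str}
PROFILES = {}  # user id (taken from the URL path) -> dict of registered fields


def check_basic_auth(header):
    """Return True if an Authorization header carries the expected credentials."""
    if not header or not header.startswith("Basic "):
        return False
    try:
        supplied = base64.b64decode(header[6:])
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    expected = ("%s:%s" % (API_USER, API_PASSWORD)).encode("utf-8")
    return hmac.compare_digest(supplied, expected)


def validate_fields(payload):
    """Reject keys that were never registered, or values of the wrong type."""
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    for key, value in payload.items():
        expected = FIELD_TYPES.get(key)
        if expected is None or not isinstance(value, expected):
            raise ValueError("unregistered or malformed field: %s" % key)
    return payload


class ProfileHandler(BaseHTTPRequestHandler):
    def _reply(self, status, body):
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def _authorized(self):
        if check_basic_auth(self.headers.get("Authorization")):
            return True
        self._reply(401, {"error": "unauthorized"})
        return False

    def do_GET(self):
        if self._authorized():
            self._reply(200, PROFILES.get(self.path.strip("/"), {}))

    def do_PATCH(self):
        if not self._authorized():
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            fields = validate_fields(json.loads(self.rfile.read(length)))
        except ValueError as error:  # also covers malformed JSON
            self._reply(400, {"error": str(error)})
            return
        profile = PROFILES.setdefault(self.path.strip("/"), {})
        profile.update(fields)
        self._reply(200, profile)


def main():
    # Bind to localhost only, on an arbitrary unprivileged port (assumption).
    HTTPServer(("127.0.0.1", 8002), ProfileHandler).serve_forever()
```

Calling main() starts the server. OAuth for outside access (point 5) is left
out of this sketch entirely.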

Please correct me where I'm being stupid!

Florian





2013/4/18 Terri Oda :
> Background for folk new to this discussion:
>
> Currently, all user information is stored in Mailman core, but it's minimal:
> a real name, a set of email addresses, subscription info, and preferences.
> Barry suggests that it should stay minimal: only the things Mailman needs to
> know to correctly deliver mail (which actually doesn't include "real name"
> but let's leave that as a legacy item for the moment)
>
> It's pretty likely that future features of Mailman will want to attach extra
> information to users.  Some of it will be social-y stuff like user icons for
> HyperKitty to display in the archives. Other things include metrics like
> "when did this person last post to the list?" or "how many posts have they
> made over the lifetime of this list?"  One thing I know of is that Systers
> requires a short essay for all new subscribers, explaining why they want to
> join the list.   (And they're considering porting this feature to Postorius,
> which means we potentially want an answer to "where will the extra profile
> data get stored?" before their students start coding.)
>
> So...
>
> I think we've sort of agreed that it would be best if whatever we built just
> had a rest interface and hyperkitty/postorius/whatever would talk to it
> through there, and could share data that way, but we need a simple prototype
> that folk (particularly students) will be able to start using, and there's
> still some internal architecture decisions that need to be made.
>
> Does anyone have time to build such a thing or write up some short
> architectural documents so a student could build such a thing in relatively
> short order?  It doesn't have to be the perfect final design, but we
> probably need a basic starter api for adding, accessing, editing and
> possibly even removing profile data.
>
> Thoughts?
>
>  Terri
___
Mailman-Developers mailing list
Mailman-Developers@python.org
http://mail.python.org/mailman/listinfo/mailman-developers
Mailman FAQ: http://wiki.list.org/x/AgA3
Searchable Archives: 
http://www.mail-archive.com/mailman-developers%40python.org/
Unsubscribe: 
http://mail.python.org/mailman/options/mailman-developers/archive%40jab.org

Security Policy: http://wiki.list.org/x/QIA9


Re: [Mailman-Developers] Architecture for extra profile info

2013-04-17 Thread varun sharma
+1 for this, I am desperately waiting for someone to do this :)


On Thu, Apr 18, 2013 at 5:51 AM, Terri Oda  wrote:

> Background for folk new to this discussion:
>
> Currently, all user information is stored in Mailman core, but it's
> minimal: a real name, a set of email addresses, subscription info, and
> preferences.  Barry suggests that it should stay minimal: only the things
> Mailman needs to know to correctly deliver mail (which actually doesn't
> include "real name" but let's leave that as a legacy item for the moment)
>
> It's pretty likely that future features of Mailman will want to attach
> extra information to users.  Some of it will be social-y stuff like user
> icons for HyperKitty to display in the archives. Other things include
> metrics like "when did this person last post to the list?" or "how many
> posts have they made over the lifetime of this list?"  One thing I know of
> is that Systers requires a short essay for all new subscribers, explaining
> why they want to join the list.   (And they're considering porting this
> feature to Postorius, which means we potentially want an answer to "where
> will the extra profile data get stored?" before their students start
> coding.)
>
> So...
>
> I think we've sort of agreed that it would be best if whatever we built
> just had a rest interface and hyperkitty/postorius/whatever would talk to
> it through there, and could share data that way, but we need a simple
> prototype that folk (particularly students) will be able to start using,
> and there's still some internal architecture decisions that need to be made.
>
> Does anyone have time to build such a thing or write up some short
> architectural documents so a student could build such a thing in relatively
> short order?  It doesn't have to be the perfect final design, but we
> probably need a basic starter api for adding, accessing, editing and
> possibly even removing profile data.
>
> Thoughts?
>
>  Terri


Re: [Mailman-Developers] GSOC 2013 project discussion

2013-04-17 Thread Avik Pal
thanks a lot Stephen for all the suggestions :)

Avik Pal
Bengal Engineering & Science University, Shibpur
github:https://github.com/avikpal
IRC:- irc://freenode/avikp,isnick
twitter:-https://twitter.com/avikpalme





On 17 April 2013 22:36, Stephen J. Turnbull  wrote:

> Avik Pal writes:
>
>  > Meanwhile It would be much appreciated if someone can direct me to
>  > an labeled dataset available on line.
>
> By "labelled" you mean pre-classified into spam vs ham?  I see you
> already found one, but you could also check the SpamBayes and
> SpamAssassin distributions.
>
>  > Here I have a suggestion, after submitting, whenever an email is
>  > classified as Spam, we store it in a separate archive and after the
>  > end of the day send them a mail telling "this is the digest for all
>  > the mails that Mailman thinks to be Spam" the subscriber may go
>  > there and can view them and also can mark them as not Spam,
>
> I suggest that you present this as an option for users who want to
> tune the filters, and as something that can be used pre-release to
> develop the initial parameters for the distributed classifier.
> Although Bayesian classifiers do offer the option to train or tune
> your personal classifier on a local corpus, most users just stick with
> the distribution parameters plus self-training.  It's pretty effective
> (surprisingly so to me).  I guess the logic is that spammers aren't
> terribly creative.
>
>  > Emails which stays as Spam will be dropped after a month
>
> Let's think carefully about that.  Everybody deletes the spam; that's
> why you started by asking for a labelled dataset, because nobody keeps
> one around.  Somebody really ought to do the public service of
> collecting a corpus.  Of course, if you do arrange to keep it around,
> it's going to need to be an option that sites and list owners can
> disable.
>
>


[Mailman-Developers] Architecture for extra profile info

2013-04-17 Thread Terri Oda

Background for folk new to this discussion:

Currently, all user information is stored in Mailman core, but it's 
minimal: a real name, a set of email addresses, subscription info, and 
preferences.  Barry suggests that it should stay minimal: only the 
things Mailman needs to know to correctly deliver mail (which actually 
doesn't include "real name" but let's leave that as a legacy item for 
the moment)


It's pretty likely that future features of Mailman will want to attach 
extra information to users.  Some of it will be social-y stuff like user 
icons for HyperKitty to display in the archives. Other things include 
metrics like "when did this person last post to the list?" or "how many 
posts have they made over the lifetime of this list?"  One thing I know 
of is that Systers requires a short essay for all new subscribers, 
explaining why they want to join the list.   (And they're considering 
porting this feature to Postorius, which means we potentially want an 
answer to "where will the extra profile data get stored?" before their 
students start coding.)


So...

I think we've sort of agreed that it would be best if whatever we built 
just had a rest interface and hyperkitty/postorius/whatever would talk 
to it through there, and could share data that way, but we need a simple 
prototype that folk (particularly students) will be able to start using, 
and there's still some internal architecture decisions that need to be 
made.


Does anyone have time to build such a thing or write up some short 
architectural documents so a student could build such a thing in 
relatively short order?  It doesn't have to be the perfect final design, 
but we probably need a basic starter api for adding, accessing, editing 
and possibly even removing profile data.
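
As a rough illustration of the four operations named above (adding,
accessing, editing and removing profile data), here is a minimal in-memory
sketch. The class name, method names and the idea of keying profiles by
email address are assumptions made for this example; a real prototype would
sit behind a REST interface and use persistent storage.

```python
class ProfileStore:
    """In-memory stand-in for the storage layer of a profile service."""

    def __init__(self):
        self._profiles = {}  # email address -> {field name: value}

    def add(self, email, field, value):
        """Create a new field; refuse to overwrite an existing one."""
        profile = self._profiles.setdefault(email, {})
        if field in profile:
            raise KeyError("%s is already set for %s" % (field, email))
        profile[field] = value

    def get(self, email):
        """Return a copy so callers cannot change stored data by accident."""
        return dict(self._profiles.get(email, {}))

    def edit(self, email, field, value):
        """Change a field that must already exist."""
        if field not in self._profiles.get(email, {}):
            raise KeyError("%s is not set for %s" % (field, email))
        self._profiles[email][field] = value

    def remove(self, email, field):
        """Delete a field; raises KeyError if it is absent."""
        del self._profiles[email][field]
```

One possible mapping onto REST verbs (again an assumption) would be POST for
add, GET for get, PATCH for edit and DELETE for remove.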


Thoughts?

 Terri




[Mailman-Developers] Query regarding GSoC project

2013-04-17 Thread Adwait Sharma
Hello !

By way of introduction, I am *Adwait Sharma*, a final-year computer science
undergraduate from Bangalore, India, who codes a lot, especially in Python.

The idea I found most interesting in GSoC 2013 is Boilerplate stripper
- 
http://wiki.list.org/display/DEV/Google+Summer+of+Code+2013#GoogleSummerofCode2013-Boilerplatestripper

I am planning to use regular expressions for this; the script will, of
course, be in Python :)

Even though the project can be finished before the end of the GSoC period, I
would like to add *some more features* to it.

1. I am interested in adding Natural Language Processing techniques which
will check for the words (attach, attached, attachment(s)) in the mail and
cross-check whether any file is attached to it. If not, it will come
up with a dialogue box saying, "*you forgot to attach a file*"

2. Apart from this, I would like to work on censorship as well which will
work as a *dirty word filter*.

and many more *Artificial Intelligence tools* in mailman.
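
The attachment check in item 1 could be prototyped with the standard
library alone. The sketch below is only a regular-expression match on the
text parts of a message, not natural language processing, and the function
name and word pattern are assumptions made for illustration.

```python
import re
from email.message import EmailMessage

# Matches attach, attached, attachment and attachments as whole words.
ATTACH_WORDS = re.compile(r"\battach(?:ed|ment|ments)?\b", re.IGNORECASE)


def forgot_attachment(msg):
    """Return True if the text mentions an attachment but none is present."""
    has_attachment = False
    mentions = False
    for part in msg.walk():
        if part.get_filename():
            has_attachment = True
        elif part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            text = payload.decode(charset, errors="replace")
            if ATTACH_WORDS.search(text):
                mentions = True
    return mentions and not has_attachment


# Usage: a message that talks about an attachment but carries none.
example = EmailMessage()
example.set_content("Please see the attached report.")
needs_warning = forgot_attachment(example)
```

How the sender would be warned is a separate question that this sketch does
not address.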

About me: I have *4 years* of programming experience in *Python* and use
Mailman extensively. I was a speaker at *PyCon India 2012*
http://in.pycon.org/2012/funnel/pyconindia2012/36-artificial-intelligence-using-python.
I would like to thank the PSF for sponsoring me to attend
*PyCon US 2013*, where I gave a lightning talk.

I'm hoping to work with Mailman in this year's GSoC as I already have
experience in AI & NLP, and this would be a great opportunity to
work with Mailman formally and prove my skills.

I can be reached via email : *sharma(dot)adwait(at)gmail(dot)com*

Looking forward to your reply.

Kind regards,
Adwait

-- 
*Adwait Sharma ~NeO~*
Learner | Geek | Hacker | Blogger | Open Source Software Developer

Email: sharma.adw...@gmail.com
Site: http://www.ThePirado.com
Phone: +91-9164665550


Re: [Mailman-Developers] GSOC 2013 project discussion

2013-04-17 Thread Barry Warsaw
On Apr 17, 2013, at 12:43 AM, Avik Pal wrote:

>also I would like to propose an idea of my own. Many of us set the preference
>in mailman to get all the emails of a day batched together, but sometimes
>this means we miss important mails (we do get them at the end of the day, but
>we miss the moment): mails important to the community or to my own interests,
>or discussion on something I have also discussed in my previous mails.
>Delivering these mails instantly to the subscriber, so that he can join
>at that very moment, may turn out to be a very useful feature.

I'm skeptical that you'll be able to do automated classification of important
messages for individual users.  I'm not even sure I can do that as a human
reading a particular mailing list.  Also, what's important to me today may or
may not be important to me tomorrow, and vice versa.

Mailman has a (probably little known) feature where a list owner can send an
urgent message that bypasses digests.  If it has an Urgent: header with a
matching password, it goes straight through.
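
To illustrate the idea only: the sketch below is not Mailman's actual
implementation. The function name and the way the password is supplied are
assumptions; it simply shows a digest-bypass decision based on an Urgent:
header.

```python
import hmac
from email import message_from_string


def bypasses_digest(msg, list_password):
    """Return True if the Urgent: header matches the expected password."""
    supplied = msg.get("Urgent")
    if supplied is None:
        return False
    # Constant-time comparison, so timing does not leak the password.
    return hmac.compare_digest(supplied.strip().encode("utf-8"),
                               list_password.encode("utf-8"))


# Usage with a hypothetical password.
urgent = message_from_string("Urgent: s3cret\nSubject: Server down\n\nSee below.\n")
deliver_now = bypasses_digest(urgent, "s3cret")
```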

-Barry



Re: [Mailman-Developers] GSOC 2013 project discussion

2013-04-17 Thread Stephen J. Turnbull
Avik Pal writes:

 > Meanwhile It would be much appreciated if someone can direct me to
 > an labeled dataset available on line.

By "labelled" you mean pre-classified into spam vs ham?  I see you
already found one, but you could also check the SpamBayes and
SpamAssassin distributions.

 > Here I have a suggestion, after submitting, whenever an email is
 > classified as Spam, we store it in a separate archive and after the
 > end of the day send them a mail telling "this is the digest for all
 > the mails that Mailman thinks to be Spam" the subscriber may go
 > there and can view them and also can mark them as not Spam,

I suggest that you present this as an option for users who want to
tune the filters, and as something that can be used pre-release to
develop the initial parameters for the distributed classifier.
Although Bayesian classifiers do offer the option to train or tune
your personal classifier on a local corpus, most users just stick with
the distribution parameters plus self-training.  It's pretty effective
(surprisingly so to me).  I guess the logic is that spammers aren't
terribly creative.
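
For readers unfamiliar with the approach, here is a toy naive Bayes
classifier showing what "training on a local corpus" means in practice. It
is a simplified sketch, not the SpamBayes or SpamAssassin algorithm; the
tokenizer, the add-one smoothing and the decision threshold are assumptions
chosen for brevity.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9$']+")


def tokenize(text):
    return TOKEN.findall(text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = Counter()

    def train(self, text, label):
        """Add one labelled message; call again whenever a user relabels mail."""
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def spam_log_odds(self, text):
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        totals = {label: sum(counts.values())
                  for label, counts in self.word_counts.items()}
        # Prior odds, smoothed so an untrained class never produces log(0).
        score = math.log((self.message_counts["spam"] + 1)
                         / (self.message_counts["ham"] + 1))
        for word in tokenize(text):
            # Add-one smoothing; the extra +1 reserves mass for unseen words.
            p_spam = ((self.word_counts["spam"][word] + 1)
                      / (totals["spam"] + len(vocabulary) + 1))
            p_ham = ((self.word_counts["ham"][word] + 1)
                     / (totals["ham"] + len(vocabulary) + 1))
            score += math.log(p_spam / p_ham)
        return score

    def is_spam(self, text):
        return self.spam_log_odds(text) > 0


# Usage: two training messages stand in for a local corpus.
classifier = NaiveBayes()
classifier.train("cheap pills buy now", "spam")
classifier.train("meeting agenda for the sprint", "ham")
```

Shipping the word counts from a pre-trained instance would correspond to the
"distribution parameters"; further train() calls on a user's own mail would
be the local tuning.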

 > Emails which stays as Spam will be dropped after a month

Let's think carefully about that.  Everybody deletes the spam; that's
why you started by asking for a labelled dataset, because nobody keeps
one around.  Somebody really ought to do the public service of
collecting a corpus.  Of course, if you do arrange to keep it around,
it's going to need to be an option that sites and list owners can
disable.



Re: [Mailman-Developers] GSOC 2013 project discussion

2013-04-17 Thread Terri Oda
I'm glad you're somewhat aware of the issues.  I frequently encounter 
folk who aren't aware of the issues in machine learning, so your "don't 
lose hope" email set off all kinds of warning bells in my head.


Going back to GSoC-specific stuff:

- Enron is a very old data set
- If you're going to use it, you need to be prepared to defend that 
choice.  I'm not sure it's a choice that can be defended at all, knowing 
the field.  It's probably not only an old data set, but a completely 
counter-productive one given the space in which Mailman operates.


So here's some things to think about:

(1) I want some justification of how this is going to be relevant to the 
problem you're trying to solve, which is "helping classify spam emails 
sent to a mailing list that the MTA was unable to classify"


(2) Many existing classifiers that run at the MTA level have already 
used the Enron data set, so chances are any features you learn will 
already have been incorporated.  I have severe concerns that any 
new features you learn will result in over-fitting.  How can you believe 
that yet another classifier trained on the same data will be worth the 
processing overhead and resulting delays in mail delivery when it seems 
likely that any improvement will be incremental at best?


(3) Enron is not going to help you make use of any list-specific 
features.  How can you use this data set to produce something that is 
useful to Mailman, going beyond what any MTA-level spam filter can do?  
(Note that we've been telling people to do spam filtering at the MTA 
level for years and years and years; justifying this is not going to be 
an easy task)


(4) If you're going to do cross-validation with other data to make 
claims that the final classifier will be relevant to list data, how is 
that data going to be obtained, processed, and used?


(5) Unless you've got a plan for making extensive use of the fact that 
you're classifying mailing list data and not general email, you're 
pretty much wasting our time since we are only interested in projects 
relevant to Mailman.


To be completely honest, I'm still seeing "student project for data 
mining class" level thinking here, and that's not going to be good 
enough for us.  Especially considering that you didn't even know about 
the most common data sets for this problem, I'm concerned that you 
haven't yet reached the skill and experience necessary for us to 
seriously consider a classifier as even a small part of a GSoC project.  
We have to give priority to students who we are convinced can finish 
their projects, and it seems like there's too many chances of you 
getting stuck on finding data and using it correctly on a problem that 
is actually meaningful to Mailman and not just a general classification 
task.


 Terri


On 13-04-17 10:51 AM, Avik Pal wrote:
> Yes, I get your point, but these issues are part of any machine learning
> project, and feature extraction has to be done with the synthetic
> data set in mind.
>
> On 17 April 2013 22:05, Terri Oda wrote:
>
>> Finding sources of spam (like that one) isn't that hard; it's
>> finding sources of legit email combined with spam and classified
>> and processed in the same way that's challenging.  As I said, you
>> can combine a spam source like this with a publicly available
>> mailing list to make a synthetic set, but scientifically speaking,
>> those aren't really preferred ways to handle data because they
>> come from multiple sources.
>
> Well, in this regard the only thing I can do is keep looking. I am also
> aware that data coming from different sources can be skewed, but
> these things are never perfect and there is always scope for improvement. I
> think that our aim should be to implement a rudimentary classifier with
> fairly good performance to start with.





Re: [Mailman-Developers] GSOC 2013 project discussion

2013-04-17 Thread Avik Pal
Yes, I get your point, but these issues are part of any machine learning
project, and feature extraction has to be done with the synthetic
data set in mind.


On 17 April 2013 22:05, Terri Oda  wrote:

>
>
> Finding sources of spam (like that one) isn't that hard; it's finding
> sources of legit email combined with spam and classified and processed in
> the same way that's challenging.  As I said, you can combine a spam source
> like this with a publicly available mailing list to make a synthetic set,
> but scientifically speaking, those aren't really preferred ways to handle
> data because they come from multiple sources.
>
>
>
Well, in this regard the only thing I can do is keep looking. I am also
aware that data coming from different sources can be skewed, but
these things are never perfect and there is always scope for improvement. I
think that our aim should be to implement a rudimentary classifier with
fairly good performance to start with.


Re: [Mailman-Developers] GSOC 2013 project discussion

2013-04-17 Thread Terri Oda


On 13-04-17 10:10 AM, Avik Pal wrote:
> Don't lose hope, Terri: after digging for a couple of hours I came across
> this, and it's pretty much up to date. http://untroubled.org/spam/


Finding sources of spam (like that one) isn't that hard; it's finding 
sources of legit email combined with spam and classified and processed 
in the same way that's challenging.  As I said, you can combine a spam 
source like this with a publicly available mailing list to make a 
synthetic set, but scientifically speaking, those aren't really 
preferred ways to handle data because they come from multiple sources.


The problem is that when you have multiple sources it sometimes becomes 
too easy for a classifier to classify on less-than-useful features for 
future use.  For example, one might classify on the fact that the list 
address won't appear in any of the To: or Cc: lines in the spam data 
because it comes from a different source, the fact that many of the 
spams will be from different time periods, the fact that the spam data 
is anonymized differently from any list data you might have, etc.  You 
will wind up doing a lot of work to normalize the data sets to avoid 
these classifiers (and we're talking weeks of really boring work here, 
potentially, that you need to start Right Now if you're going to be 
using such a set), and you run the risk of missing out on features that 
would have been useful in a single-source set that have been completely 
obliterated by the synthetic data set.
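
As one small, hedged example of that normalization work: the sketch below
drops headers that mostly reveal which corpus a message came from before any
features are extracted. The header list is an assumption, and this addresses
only header leakage; differences in time period or anonymization in the
message bodies would need separate handling.

```python
from email import message_from_string

# Headers likely to differ between sources rather than between spam and ham
# (an assumed list, to be adjusted for the corpora actually used).
SOURCE_REVEALING = ("To", "Cc", "Date", "Received", "Delivered-To")


def normalize(raw_message):
    """Parse a raw message and strip headers that identify its source."""
    msg = message_from_string(raw_message)
    for name in SOURCE_REVEALING:
        del msg[name]  # removes every occurrence; a missing header is ignored
    return msg


# Usage on a made-up message.
raw = ("To: some-list@example.org\n"
       "Date: Wed, 17 Apr 2013 10:00:00 +0000\n"
       "Subject: hello\n"
       "\n"
       "body text\n")
cleaned = normalize(raw)
```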


 Terri



Re: [Mailman-Developers] GSOC 2013 project discussion

2013-04-17 Thread Avik Pal
> On 17 April 2013 21:02, Terri Oda  wrote:
>
>>
>> On 13-04-17 6:56 AM, Avik Pal wrote:
>>
>>>   Meanwhile It would be much appreciated if someone can direct
>>> me to
>>> an labeled dataset available on line.
>>>
>>>  Leaving aside entirely the question of whether we should (or will)
>> support any project that requires learning on this scale, as a former
>> anti-spam researcher, I can at least answer this question.
>>
>> Unfortunately, the answer is largely "good luck with that" -- good
>> labelled email data is surprisingly hard to come by, and that challenge is
>> one of the reasons I stopped doing research in that area.
>>
>>
>>
>
Don't lose hope, Terri: after digging for a couple of hours I came across
this, and it's pretty much up to date.
http://untroubled.org/spam/


Re: [Mailman-Developers] GSOC 2013 project discussion

2013-04-17 Thread Avik Pal
Thanks a lot, Terri. I think I will go with the Enron email dataset,
cross-validated against publicly available legitimate mailing list
mails and spam, and (hopefully) Python's regular expressions will help
me a lot in building the synthetic set.

Avik Pal
Bengal Engineering & Science University, Shibpur
github:https://github.com/avikpal
IRC:- irc://freenode/avikp,isnick
twitter:-https://twitter.com/avikpalme





On 17 April 2013 21:02, Terri Oda  wrote:

>
> On 13-04-17 6:56 AM, Avik Pal wrote:
>
>> Meanwhile It would be much appreciated if someone can direct me to
>> an labeled dataset available on line.
>
> Leaving aside entirely the question of whether we should (or will)
> support any project that requires learning on this scale, as a former
> anti-spam researcher, I can at least answer this question.
>
> Unfortunately, the answer is largely "good luck with that" -- good
> labelled email data is surprisingly hard to come by, and that challenge is
> one of the reasons I stopped doing research in that area.
>
> When I was doing anti-spam research, the only viable public classified
> ham/spam set was the SpamAssassin one.  I don't believe it's been
> maintained with modern messages and at this point it may be useless.
>
> Shortly after I left the field, people started using the Enron data set,
> which is pretty well classified by now, but again, is pretty long in the
> tooth.
>
> Given that you're going to want to be classifying mailing list data, you
> may have to produce some synthetic data sets using information from
> publicly available mailing lists (e.g. the public archives of
> mailman-developers are available) and combining them with other data
> sources (e.g. publicly available collections of spam).  This won't have a
> whole lot of interesting sub-labels (some lists will have more than others,
> depending on their use of dlists/topics/pre-classification by the
> sender) and a synthetic set is generally regarded as a poor information
> source for reproducible results, but it could be enough in a pinch given
> that you're adding a feature rather than publishing scientific work.
>
> Note that the GSoC timeline doesn't allow time for finding and creating
> such a set, so if you're going to use one, you should determine in advance
> what you'll be using and be able to provide a link to the
> completely-ready-for-gsoc set in your proposal.
>
>  Terri
>
>
>


Re: [Mailman-Developers] anti-spam filter

2013-04-17 Thread Terri Oda


On 13-04-16 10:31 PM, Stephen J. Turnbull wrote:

> Pratik Sarkar writes:
>
>  > I am working on the proposal. And how many slots are there for the filter
>  > project?
>
> There are no slots for the filter project as such.  The whole Mailman
> project has slots, and they are somewhat fluid, since we operate under
> the umbrella of the Python Software Foundation.  For further details,
> like actual numbers, Terri Oda (PSF org admin) is authoritative.
>
> Please be patient on this, as Terri is *very* busy right now.  Also,
> you may not get hard numbers until the selection is actually made,
> since the PSF has to balance many projects.

Stephen's right -- we don't pre-set our slots.  We're basically going to 
rank the applications we get in the order of "most want to mentor" to 
"least want to mentor" with some adjustments for the availability and 
interests of specific mentors so each mentor will get a project and 
student who suits them, then we'll accept the top X students.  Probably 
we will not use more than one slot for any particular project this year, 
but that discussion will happen among mentors during selection.


How many slots will we have?  Luckily for you, despite being busy, I've 
had to answer this question a few times already, so here's cut-and-paste 
from another email to one of the mentors who asked:


---
The short answer is "it depends on what Google will give us."

The PSF, in the past, has taken on around 35 students.  But this year, 
we've grown by at least 50% in terms of projects.  We may, by the time 
applications open, have as many as 19 sub-projects working under our 
umbrella.  I'm hoping to request and receive at least enough slots for 
each project to get two students if they have the mentors to support 
this, so the rule of thumb for now can probably be that you'll all get 
slots for up to 2 students.


For the projects participating for the first time this year, probably 
1-2 slots is the right number.  But some of our projects are veteran 
GSoC participants who have significantly more than the minimum 3 
required mentors.  Those projects, hopefully, will be able to get more 
slots since I know they can and will successfully support more 
students.  In previous years, Google's been great about letting us have 
all the slots we tell them we want and can support, but with the amount 
of growth we've had this year, I'm hesitant to make any promises.

---


Mailman is one of these veteran projects.  I believe we currently have 6 
mentors listed, but given my schedule for this year I'll be at most a 
co-mentor, and given his usually busy schedule Barry may do the same, so 
my guess is that at most, we'll be able to justify asking for 4 slots 
this year.


 Terri



___
Mailman-Developers mailing list
Mailman-Developers@python.org
http://mail.python.org/mailman/listinfo/mailman-developers
Mailman FAQ: http://wiki.list.org/x/AgA3
Searchable Archives: 
http://www.mail-archive.com/mailman-developers%40python.org/
Unsubscribe: 
http://mail.python.org/mailman/options/mailman-developers/archive%40jab.org

Security Policy: http://wiki.list.org/x/QIA9


Re: [Mailman-Developers] GSOC 2013 project discussion

2013-04-17 Thread Terri Oda


On 13-04-17 6:56 AM, Avik Pal wrote:

> Meanwhile, it would be much appreciated if someone could direct me to
> a labeled dataset available online.

Leaving aside entirely the question of whether we should (or will) 
support any project that requires learning on this scale, as a former 
anti-spam researcher, I can at least answer this question.


Unfortunately, the answer is largely "good luck with that" -- good 
labelled email data is surprisingly hard to come by, and that challenge 
is one of the reasons I stopped doing research in that area.


When I was doing anti-spam research, the only viable public classified 
ham/spam set was the SpamAssassin one.  I don't believe it's been 
maintained with modern messages and at this point it may be useless.


Shortly after I left the field, people started using the Enron data set, 
which is pretty well classified by now, but again, is pretty long in the 
tooth.


Given that you're going to want to be classifying mailing list data, you 
may have to produce some synthetic data sets using information from 
publicly available mailing lists (e.g. the public archives of 
mailman-developers are available) and combining them with other data 
sources (e.g. publicly available collections of spam).  This won't have 
a whole lot of interesting sub-labels (some lists will have more than 
others, depending on their use of dlists/topics/pre-classification by 
the sender) and a synthetic set is generally regarded as a poor 
information source for reproducible results, but it could be enough in a 
pinch given that you're adding a feature rather than publishing 
scientific work.
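
A minimal sketch of how such a synthetic set could be assembled, assuming the list archive and the spam collection are both available locally as mbox files. The function names and the file names in the usage note are hypothetical, and the labels come purely from which source a message was read from, not from any inspection of its content.

```python
import mailbox
import random


def load_mbox_texts(path):
    """Return the raw text of every message in an mbox file."""
    return [msg.as_string() for msg in mailbox.mbox(path)]


def build_labelled_set(ham_texts, spam_texts, seed=0):
    """Combine two sources into one shuffled list of (label, text) pairs.

    Everything from the list archive is labelled "ham" and everything
    from the spam collection is labelled "spam" (an assumption: a real
    archive may already contain some spam that slipped through).
    """
    labelled = [("ham", text) for text in ham_texts]
    labelled += [("spam", text) for text in spam_texts]
    # A fixed seed keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(labelled)
    return labelled
```

It could be called as `build_labelled_set(load_mbox_texts("list-archive.mbox"), load_mbox_texts("spam.mbox"))`, where both paths are placeholders for files prepared before the coding period starts.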


Note that the GSoC timeline doesn't allow time for finding and creating 
such a set, so if you're going to use one, you should determine in 
advance what you'll be using and be able to provide a link to the 
completely-ready-for-gsoc set in your proposal.


 Terri




Re: [Mailman-Developers] GSOC 2013 project discussion

2013-04-17 Thread Avik Pal
Thanks a lot for the information. The thing is, I don't think the spam
classifier by itself is going to be big enough, so I came up with this
idea. I also need to know what the community wants regarding e-mail
delivery. As for the classifier, I don't think it is going to be a
problem at all from my end, given my previous experience in machine
learning and NLP; we just need a database for the subscribers where
their classifier data will be stored. But the most important thing is
what you have pointed out: "Is it the best use for our limited resources
(funding, mentor time, etc.)?" I am looking forward to hearing from
Barry and Terri in this regard.

Meanwhile, it would be much appreciated if someone could direct me to
a labeled dataset available online.

Also, somebody mentioned the legal aspects in some countries, and the
point that classification should be done in the MTA only. Here I have a
suggestion: after submission, whenever an email is classified as spam, we
store it in a separate archive, and at the end of the day we send
subscribers a mail saying "this is the digest of all the mails that
Mailman thinks are spam". A subscriber can go there, view them, and mark
them as not spam. That feedback helps the learning algorithm adjust its
decision boundary, and it also lets us measure precision and recall. A
message that is no longer spam after the boundary is adjusted, or that a
majority (in simple words) has marked as not spam, is incorporated back
into the main archive and sent as part of the main digest. Emails that
stay classified as spam are dropped after a month.
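
The hold, vote and expire flow suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, "majority" is taken to mean more than half of all subscribers, and "a month" is approximated as 30 days. Retraining the classifier from the votes is left out.

```python
from datetime import datetime, timedelta


class SpamQuarantine:
    """Hold messages classified as spam until released or expired."""

    def __init__(self, subscribers, retention=timedelta(days=30)):
        self.subscribers = set(subscribers)
        self.retention = retention
        self.held = {}  # message id -> (time held, set of "not spam" voters)

    def hold(self, msg_id, now):
        """Put a message classified as spam into the separate archive."""
        self.held[msg_id] = (now, set())

    def mark_not_spam(self, msg_id, subscriber):
        """Record one vote; return True if the message is released."""
        _, votes = self.held[msg_id]
        votes.add(subscriber)
        # Release once more than half of the subscribers disagree.
        if len(votes) * 2 > len(self.subscribers):
            del self.held[msg_id]
            return True
        return False

    def expire(self, now):
        """Drop and return the ids of messages held past the retention period."""
        expired = [m for m, (held_at, _) in self.held.items()
                   if now - held_at >= self.retention]
        for msg_id in expired:
            del self.held[msg_id]
        return expired
```

A released message would then go back to the main archive and the main digest; that step depends on the delivery code and is not shown here.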


Avik Pal
Bengal Engineering & Science University,Shibpur
github:https://github.com/avikpal
IRC:- irc://freenode/avikp,isnick
twitter:-https://twitter.com/avikpalme





On 17 April 2013 17:37, Richard Wackerbarth  wrote:

> In evaluating a proposal, we need to look at a number of factors:
>
> First, will it work? -- Does the proposed design accomplish the stated
> objective?
> Next: Is it useful?
> And: Can the candidate be expected to accomplish the task within the
> allotted time frame?
> Finally: Is it the best use for our limited resources (funding, mentor
> time, etc.)?
>
> If your presentation makes it easier to answer each of those questions in
> a positive manner, it will increase the likelihood that it will get funded.
>
>
> On Apr 17, 2013, at 6:16 AM, Avik Pal  wrote:
>
> for identifying an important message a classifier will be implemented. and
> thanks for pointing out the issue regarding the delivery of the message, if
> it is delivered twice then the existing implementation of delivery is
> sufficient, but if we want to deliver it only once then for each person we
> need to maintain a database of important mails/threads to him(or
> vice-versa) and while sending check against that database. but this is
> going to raise some normalization issues which are to be taken care of by
> careful designing.
>
> Avik Pal
> Bengal Engineering & Science University,Shibpur
> github:https://github.com/avikpal
> IRC:- irc://freenode/avikp,isnick
> twitter:-https://twitter.com/avikpalme
>
>
>
>
>
> On 17 April 2013 01:02, Richard Wackerbarth  wrote:
>
>> An interesting suggestion -- A couple of things to consider:
>>
>> How do you identify "important" messages?
>>
>> Will you deliver these messages twice -- first as important and then,
>> later, as a part of the digest ?
>>
>>
>> On Apr 16, 2013, at 2:13 PM, Avik Pal  wrote:
>> > also I would like to propose an idea of my own. Many of us set
>> the
>> > preference in mailman to get all the emails of a day batched together,
>> but
>> > sometimes this means we miss important mails(though we get it at the
>> end of
>> > the day but we miss the moment)important to the community, or my own
>> > interest, discussion on something I also have discussed upon in my
>> previous
>> > mails, delivery of these mails instantly to the subscriber so that he
>> can
>> > also join at that very moment may come out to be a very useful feature.
>> > Thus person gets to set two options
>> >1.receive batched mails only.
>> >2.receive batched mails with important mails delivered instantly.
>>
>>
>
>


Re: [Mailman-Developers] GSOC 2013 project discussion

2013-04-17 Thread Richard Wackerbarth
In evaluating a proposal, we need to look at a number of factors:

First, will it work? -- Does the proposed design accomplish the stated 
objective?
Next: Is it useful?
And: Can the candidate be expected to accomplish the task within the allotted 
time frame?
Finally: Is it the best use for our limited resources (funding, mentor time, 
etc.)?

If your presentation makes it easier to answer each of those questions in a 
positive manner, it will increase the likelihood that it will get funded.


On Apr 17, 2013, at 6:16 AM, Avik Pal  wrote:

> for identifying an important message a classifier will be implemented. and 
> thanks for pointing out the issue regarding the delivery of the message, if 
> it is delivered twice then the existing implementation of delivery is 
> sufficient, but if we want to deliver it only once then for each person we 
> need to maintain a database of important mails/threads to him(or vice-versa) 
> and while sending check against that database. but this is going to raise 
> some normalization issues which are to be taken care of by careful designing.
> 
> Avik Pal
> Bengal Engineering & Science University,Shibpur
> github:https://github.com/avikpal
> IRC:- irc://freenode/avikp,isnick
> twitter:-https://twitter.com/avikpalme
> 
>   
> 
> 
> 
> On 17 April 2013 01:02, Richard Wackerbarth  wrote:
> An interesting suggestion -- A couple of things to consider:
> 
> How do you identify "important" messages?
> 
> Will you deliver these messages twice -- first as important and then, later, 
> as a part of the digest ?
> 
> 
> On Apr 16, 2013, at 2:13 PM, Avik Pal  wrote:
> > also I would like to propose an idea of my own. Many of us set the
> > preference in mailman to get all the emails of a day batched together, but
> > sometimes this means we miss important mails(though we get it at the end of
> > the day but we miss the moment)important to the community, or my own
> > interest, discussion on something I also have discussed upon in my previous
> > mails, delivery of these mails instantly to the subscriber so that he can
> > also join at that very moment may come out to be a very useful feature.
> > Thus person gets to set two options
> >1.receive batched mails only.
> >2.receive batched mails with important mails delivered instantly.
> 
> 



Re: [Mailman-Developers] GSOC 2013 project discussion

2013-04-17 Thread Avik Pal
For identifying an important message, a classifier will be implemented.
Thanks for pointing out the issue regarding delivery of the message. If
it is delivered twice, then the existing implementation of delivery is
sufficient. But if we want to deliver it only once, then for each person
we need to maintain a database of the mails/threads important to him (or
vice versa) and check against that database while sending. This is going
to raise some normalization issues, which will have to be taken care of
by careful design.
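
As a rough sketch of the "deliver only once" case: keep a record of which messages were already sent instantly to which subscriber, and filter the digest against it. The class and method names are hypothetical, and an in-memory set stands in for the database table.

```python
class DeliveryLog:
    """Remember which messages each subscriber already received instantly."""

    def __init__(self):
        # One (subscriber, message id) pair per instant delivery.
        self._sent = set()

    def record_instant(self, subscriber, message_id):
        """Note that an important message was delivered right away."""
        self._sent.add((subscriber, message_id))

    def digest_for(self, subscriber, message_ids):
        """Return the day's messages minus those already delivered instantly."""
        return [m for m in message_ids if (subscriber, m) not in self._sent]
```

Storing one row per (subscriber, message) pair, rather than a list of messages per subscriber, is one way to keep such a table normalized.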

Avik Pal
Bengal Engineering & Science University,Shibpur
github:https://github.com/avikpal
IRC:- irc://freenode/avikp,isnick
twitter:-https://twitter.com/avikpalme





On 17 April 2013 01:02, Richard Wackerbarth  wrote:

> An interesting suggestion -- A couple of things to consider:
>
> How do you identify "important" messages?
>
> Will you deliver these messages twice -- first as important and then,
> later, as a part of the digest ?
>
>
> On Apr 16, 2013, at 2:13 PM, Avik Pal  wrote:
> > also I would like to propose an idea of my own. Many of us set
> the
> > preference in mailman to get all the emails of a day batched together,
> but
> > sometimes this means we miss important mails(though we get it at the end
> of
> > the day but we miss the moment)important to the community, or my own
> > interest, discussion on something I also have discussed upon in my
> previous
> > mails, delivery of these mails instantly to the subscriber so that he can
> > also join at that very moment may come out to be a very useful feature.
> > Thus person gets to set two options
> >1.receive batched mails only.
> >2.receive batched mails with important mails delivered instantly.
>
>


[Mailman-Developers] Working on "Better User Settings Management" project idea

2013-04-17 Thread varun sharma
Hi,
I have been working on "better user settings management" for the past
few weeks. As I have worked with Django before, I am enjoying it. I
branched Postorius a few weeks back and sent a merge request, but after a
discussion with Terri and Florian I now know that extending Django's
User class is not a good idea, so I am now concentrating on the client
object and its methods.

*I have a few queries regarding the project:*
1. Should I concentrate on resolving the existing bugs or on adding new
features right now?
2. There are some menu links which do not have any pages linked to them;
should I create them and add the desired functionality?


Thanks
Varun Sharma


Re: [Mailman-Developers] Regarding Authentication of REST API

2013-04-17 Thread Manish Gill
On 04/17/2013 02:43 AM, Florian Fuchs wrote:
> Hi Manish, hi everyone,
>
> 2013/4/10 Manish Gill :
>> For the GSoC REST API project, I've been wondering about how
>> authentication would work.
>>
>> OAuth is a way to go if we want authenticated/signed requests. I have a
>> few questions regarding that.
>>
>> - Will Mailman core become an OAuth provider, with Postorius/API being
>> the consumers?
> Probably not the core itself, but possibly another yet-to-be-written
> application that Postorius, Hyperkitty and other clients could use. We
> had a long discussion on this list whether to build a central
> application to store user data that can be accessed by the different
> Mailman-related applications. While we haven't decided yet whether or
> how to proceed, this would possibly be the right context for that.
That makes sense.
>
>> - If the answer to the above is no, is the plan to support popular OAuth
>> providers like Facebook/Twitter ?
> Like we discussed on IRC earlier, it would be nice if a site running
> Mailman could act as an oAuth provider. Especially since the thought
> of a FLOSS mailing list manager requiring an account with a commercial
> oAuth service provider to use its API might seem a little odd. But
> implementing both the provider as well as the client is probably way
> beyond the scope of this GSoC project. Especially since authentication
> is only one aspect of it.
Indeed! This could be made easy if we don't have to take care of the
provider implementation ourselves, like we discussed.
If a third party library exists that could be used to provide this
functionality, it would make things much easier. :)
>> (If not, can you guys please explain how would the authentication
>> protocol really work?)
>>
>> - Since Postorius is already using Mozilla Persona, can that also be
>> used to provide authentication to API clients?
> Probably not Persona, which is meant to be used in the context of a browser.
>
> But are we sure oAuth is our only option in an API context? Are there
> other opinions?
Hmm. I don't know much about it. I looked at Tastypie, and it provides
HTTP Basic Auth [1].
Much simpler, but probably much less secure as well.

[1] http://django-tastypie.readthedocs.org/en/latest/authentication.html
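
To illustrate why Basic Auth is simpler but weaker: the client only base64-encodes "username:password" into a header, which is an encoding and not encryption, so the credentials are readable by anyone who can see the request unless the connection itself is protected (e.g. by TLS or by being limited to localhost). A small sketch, where the function name is hypothetical:

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header used by HTTP Basic Auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    # base64 is reversible by anyone; it does not hide the password.
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": "Basic " + token}
```

The resulting dict can be passed as extra headers to whatever HTTP client is used for the API calls.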
> BTW, the oauthlib documentation has a nice overview over the different
> oAuth workflows [1].
>
>
> Florian
>
> [1] https://oauthlib.readthedocs.org/en/latest/oauth_1_versus_oauth_2.html
>
>
Cool! :)

-- 
- 
Manish Gill
Naeblis on Freenode
@mgill25 on Twitter/Github 
