Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2016-01-31 Thread Daniel Homola

Dear all,

I migrated my Python implementation of the Boruta algorithm to:
https://github.com/danielhomola/boruta_py

I also implemented 3 mutual information based feature selection methods 
(JMI, JMIM, MRMR) and wrapped them up in a scikit-learn-like interface:

https://github.com/danielhomola/mifs
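
To give a flavour of what these methods do, the MRMR criterion boils down 
to a greedy loop like the sketch below (an illustration only, written 
against the mutual information estimators in newer scikit-learn rather 
than the estimators used in mifs; the package differs in the details):

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

    def mrmr(X, y, n_features):
        # greedy MRMR sketch: at each step pick the feature maximising
        # relevance MI(f; y) minus mean redundancy with the selected set
        # (assumes n_features <= X.shape[1])
        relevance = mutual_info_classif(X, y)
        selected = [int(np.argmax(relevance))]
        while len(selected) < n_features:
            best_j, best_score = None, -np.inf
            for j in range(X.shape[1]):
                if j in selected:
                    continue
                redundancy = np.mean([
                    mutual_info_regression(X[:, [j]], X[:, s])[0]
                    for s in selected])
                if relevance[j] - redundancy > best_score:
                    best_j, best_score = j, relevance[j] - redundancy
            selected.append(best_j)
        return selected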

Could you please have a look at it? I'm writing a blog post 
demonstrating their strengths against existing methods. Would you 
require anything else to possibly include these in the next release?


Thanks a lot,
Daniel

On 05/08/2015 08:22 PM, Andreas Mueller wrote:
It doesn't need to be super technical, and we try to keep the user 
guide "easy to understand". No bonus points for unnecessary latex ;)
The example should be as illustrative and fair as possible, and 
built-in datasets are preferred. It shouldn't be too heavy-weight, though.
If you like, you can show off some plots in the PR, that is always 
very welcome.



On 05/08/2015 03:15 PM, Daniel Homola wrote:

Hi Andy,

Thanks! Will definitely do a github pull request once Miron has confirmed 
he benchmarked my implementation by running it on the datasets the 
method was published with.


I wrote a blog post about it, which explains the differences but in a 
quite casual and non-rigorous way:

http://danielhomola.com/2015/05/08/borutapy-an-all-relevant-feature-selection-method/

I guess a more technical write-up, with one of the built-in datasets, 
would be more useful for the sklearn audience. I'm happy to do it if 
Miron says everything looks good.


Cheers,
Daniel

On 08/05/15 21:02, Andreas Mueller wrote:
Btw, an example that compares this against existing feature 
selection methods that explains differences and advantages would 
help users and convince us to merge ;)



On 05/08/2015 02:34 PM, Daniel Homola wrote:

Hi all,

I wrote a couple of weeks ago about implementing the Boruta 
all-relevant feature selection algorithm in Python.


I think it's ready to go now. I wrote fit, transform and 
fit_transform methods for it to make it sklearn-like.


Here it is:
https://bitbucket.org/danielhomola/boruta_py

Let me know what you think. If anyone thinks this might be worth 
adding to the feature selection module, the original author 
Miron is happy to give his blessing, and I'm happy to work on it further.


Cheers,
Daniel

On 15/04/15 11:03, Daniel Homola wrote:

Hi all,

I needed a multivariate feature selection method for my work. As 
I'm working with biological/medical data, where n < p or even n << p, 
I started to read up on Random Forest based methods, as in my 
limited understanding RF copes pretty well with this suboptimal 
situation.


I came across an R package called Boruta: https://m2.icm.edu.pl/boruta/



After reading the paper and checking some of the pretty impressive 
citations I thought I'd try it, but it was really slow. So I 
thought I'd reimplement it in Python, because I hoped (based on 
this: http://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn) 
that it would be faster. And it is :) I mean a LOT faster.


I was wondering if this would be something that you would consider 
incorporating into the feature selection module of scikit-learn?


If yes, do you have a tutorial or some sort of guidance about how 
should I prepare the code, what conventions should I follow, etc?


Cheers,

Daniel Homola

STRATiGRAD PhD Programme
Imperial College London 





Re: [Scikit-learn-general] Contributing to Scikit-Learn(GSOC)

2016-01-11 Thread Andy

Hi Imaculate.

We have found in recent years that we are quite limited in terms of 
mentoring resources.
Many of the core-devs are very busy, and we already have many 
contributions waiting for review.


If you are interested in working on scikit-learn as part of GSoC, I 
suggest you start contributing to the project now, and see if you find 
a project and a mentor that are suitable.

I'm a bit confused how you could talk about neural networks and deep 
learning without talking about regression or classification...


Best,
Andy



On 01/10/2016 01:14 AM, Imaculate Mosha wrote:

Hi all,

I would like to contribute to scikit-learn, ideally for Google 
Summer of Code.
I'm a third-year undergrad student. I did an introductory course on 
Machine Learning, but after learning Scikit-Learn I realised we only 
scratched the surface: we did neural networks, reinforcement learning 
and deep learning, but we didn't dig into regression or classification.


I have looked at the issues, but some seem beyond me due to my limited 
knowledge.
What is the best way to catch up so that I can contribute effectively? 
Alternatively, is there something specific that could be done in GSoC so 
that I can start digging into it?


Thank you and looking forward to hearing from you!
Kind regards,
Imaculate Mosha.




Re: [Scikit-learn-general] Contributing to scikit-learn

2016-01-10 Thread Raghav R V
Hi Antoine,

Welcome to scikit-learn! Please see if you find this issue interesting to
start with - https://github.com/scikit-learn/scikit-learn/issues/6149


Thanks

On Sat, Jan 9, 2016 at 6:42 PM, WENDLINGER Antoine <
antoinewendlin...@gmail.com> wrote:

> Hi everyone,
>
> Let me introduce myself: my name is Antoine, I'm a 21-year-old French
> student in Computer Science, and would love to contribute to scikit-learn.
> This would be my first contribution to an open-source project so I'm a bit
> lost and do not really know where to start. I read the pages about
> contributing on the website and on github. I started looking for issues
> labeled "easy", but it seems most of them are already taken care of. Is
> there anything a total newbie could do to help?
>
>
> Regards,
>
> Antoine
>


Re: [Scikit-learn-general] Contributing to scikit-learn

2015-09-10 Thread Rohit Shinde
Hi Gael,

Heeding your advice, I was looking over the possible bugs and I have
decided to solve this one:
https://github.com/scikit-learn/scikit-learn/issues/5229.

Any pointers on how to approach this one?

Thanks,
Rohit.

On Thu, Sep 10, 2015 at 10:27 AM, Gael Varoquaux <
gael.varoqu...@normalesup.org> wrote:

> I would strongly recommend starting with something easier, like issues
> labelled 'easy'. Starting with such a big project is most likely going to
> lead to you approaching the project in a way that is not well adapted to
> scikit-learn, and thus to code that does not get merged.
>
> Cheers,
>
> Gaël
>
> On Thu, Sep 10, 2015 at 06:58:20AM +0530, Rohit Shinde wrote:
> > Hello everyone,
>
> > I have built scikit-learn and I am ready to start coding. Can I get some
> > pointers on how I could start contributing to the projects I mentioned
> in the
> > earlier mail?
>
> > Thanks,
> > Rohit.
>
> > On Mon, Sep 7, 2015 at 11:50 AM, Rohit Shinde <
> rohit.shinde12...@gmail.com>
> > wrote:
>
> > Hi Jacob,
>
> > I am interested in Global optimization based hyperparameter
> optimization
> > and Generalised Additive Models. However, I don't know what kind of
> > background would be needed and if mine would be sufficient for it. I
> would
> > like to know the prerequisites for it.
>
> > On Sun, Sep 6, 2015 at 9:58 PM, Jacob Schreiber <
> jmschreibe...@gmail.com>
> > wrote:
>
> > Hi Rohit
>
> > I'm glad you want to contribute to scikit-learn! Which idea were
> you
> > interested in working on? The metric learning and GMM code is
> currently
> > being worked on by GSOC students AFAIK.
>
> > Jacob
>
> > On Sun, Sep 6, 2015 at 8:18 AM, Rohit Shinde <
> > rohit.shinde12...@gmail.com> wrote:
>
> > Hello everyone,
>
> > I am Rohit. I am interested in contributing toward
> scikit-learn. I
> > am quite proficient in Python, Java, C++ and scheme. I have
> taken
> > undergrad courses in Machine Learning and data mining. I was
> also
> > part of this year's GSoC under The Opencog Foundation.
>
> > I was looking at the ideas list for GSoC and I would be
> interested
> > in working on one of those ideas. So, could I get some
> guidance?
>
> > Thank you,
> > Rohit Shinde.
>
> --
> Gael Varoquaux
> Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
> Phone: ++ 33-1-69-08-79-68
> http://gael-varoquaux.info
> http://twitter.com/GaelVaroquaux


Re: [Scikit-learn-general] Contributing to scikit-learn

2015-09-09 Thread Rohit Shinde
Hello everyone,

I have built scikit-learn and I am ready to start coding. Can I get some
pointers on how I could start contributing to the projects I mentioned in
the earlier mail?

Thanks,
Rohit.

On Mon, Sep 7, 2015 at 11:50 AM, Rohit Shinde 
wrote:

> Hi Jacob,
>
> I am interested in Global optimization based hyperparameter optimization
> and Generalised Additive Models. However, I don't know what kind of
> background would be needed and if mine would be sufficient for it. I would
> like to know the prerequisites for it.
>
> On Sun, Sep 6, 2015 at 9:58 PM, Jacob Schreiber 
> wrote:
>
>> Hi Rohit
>>
>> I'm glad you want to contribute to scikit-learn! Which idea were you
>> interested in working on? The metric learning and GMM code is currently
>> being worked on by GSOC students AFAIK.
>>
>> Jacob
>>
>> On Sun, Sep 6, 2015 at 8:18 AM, Rohit Shinde > > wrote:
>>
>>> Hello everyone,
>>>
>>> I am Rohit. I am interested in contributing toward scikit-learn. I am
>>> quite proficient in Python, Java, C++ and scheme. I have taken undergrad
>>> courses in Machine Learning and data mining. I was also part of this year's
>>> GSoC under The Opencog Foundation.
>>>
>>> I was looking at the ideas list for GSoC and I would be interested in
>>> working on one of those ideas. So, could I get some guidance?
>>>
>>> Thank you,
>>> Rohit Shinde.
>>>


Re: [Scikit-learn-general] Contributing to scikit-learn

2015-09-09 Thread Gael Varoquaux
I would strongly recommend starting with something easier, like issues
labelled 'easy'. Starting with such a big project is most likely going to
lead to you approaching the project in a way that is not well adapted to
scikit-learn, and thus to code that does not get merged.

Cheers,

Gaël

On Thu, Sep 10, 2015 at 06:58:20AM +0530, Rohit Shinde wrote:
> Hello everyone,

> I have built scikit-learn and I am ready to start coding. Can I get some
> pointers on how I could start contributing to the projects I mentioned in the
> earlier mail?

> Thanks,
> Rohit.

> On Mon, Sep 7, 2015 at 11:50 AM, Rohit Shinde 
> wrote:

> Hi Jacob,

> I am interested in Global optimization based hyperparameter optimization
> and Generalised Additive Models. However, I don't know what kind of
> background would be needed and if mine would be sufficient for it. I would
> like to know the prerequisites for it.

> On Sun, Sep 6, 2015 at 9:58 PM, Jacob Schreiber 
> wrote:

> Hi Rohit

> I'm glad you want to contribute to scikit-learn! Which idea were you
> interested in working on? The metric learning and GMM code is 
> currently
> being worked on by GSOC students AFAIK.

> Jacob

> On Sun, Sep 6, 2015 at 8:18 AM, Rohit Shinde <
> rohit.shinde12...@gmail.com> wrote:

> Hello everyone,

> I am Rohit. I am interested in contributing toward scikit-learn. I
> am quite proficient in Python, Java, C++ and scheme. I have taken
> undergrad courses in Machine Learning and data mining. I was also
> part of this year's GSoC under The Opencog Foundation.

> I was looking at the ideas list for GSoC and I would be interested
> in working on one of those ideas. So, could I get some guidance?

> Thank you,
> Rohit Shinde.



-- 
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info
http://twitter.com/GaelVaroquaux



Re: [Scikit-learn-general] Contributing to scikit-learn

2015-09-07 Thread Rohit Shinde
Hi Jacob,

I am interested in Global optimization based hyperparameter optimization
and Generalised Additive Models. However, I don't know what kind of
background would be needed and if mine would be sufficient for it. I would
like to know the prerequisites for it.

On Sun, Sep 6, 2015 at 9:58 PM, Jacob Schreiber 
wrote:

> Hi Rohit
>
> I'm glad you want to contribute to scikit-learn! Which idea were you
> interested in working on? The metric learning and GMM code is currently
> being worked on by GSOC students AFAIK.
>
> Jacob
>
> On Sun, Sep 6, 2015 at 8:18 AM, Rohit Shinde 
> wrote:
>
>> Hello everyone,
>>
>> I am Rohit. I am interested in contributing toward scikit-learn. I am
>> quite proficient in Python, Java, C++ and scheme. I have taken undergrad
>> courses in Machine Learning and data mining. I was also part of this year's
>> GSoC under The Opencog Foundation.
>>
>> I was looking at the ideas list for GSoC and I would be interested in
>> working on one of those ideas. So, could I get some guidance?
>>
>> Thank you,
>> Rohit Shinde.


Re: [Scikit-learn-general] Contributing to scikit-learn

2015-09-06 Thread Jacob Schreiber
Hi Rohit

I'm glad you want to contribute to scikit-learn! Which idea were you
interested in working on? The metric learning and GMM code is currently
being worked on by GSOC students AFAIK.

Jacob

On Sun, Sep 6, 2015 at 8:18 AM, Rohit Shinde 
wrote:

> Hello everyone,
>
> I am Rohit. I am interested in contributing toward scikit-learn. I am
> quite proficient in Python, Java, C++ and scheme. I have taken undergrad
> courses in Machine Learning and data mining. I was also part of this year's
> GSoC under The Opencog Foundation.
>
> I was looking at the ideas list for GSoC and I would be interested in
> working on one of those ideas. So, could I get some guidance?
>
> Thank you,
> Rohit Shinde.


Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2015-05-08 Thread Andreas Mueller
Btw, an example that compares this against existing feature selection 
methods that explains differences and advantages would help users and 
convince us to merge ;)



On 05/08/2015 02:34 PM, Daniel Homola wrote:

Hi all,

I wrote a couple of weeks ago about implementing the Boruta 
all-relevant feature selection algorithm in Python.


I think it's ready to go now. I wrote fit, transform and fit_transform 
methods for it to make it sklearn-like.


Here it is:
https://bitbucket.org/danielhomola/boruta_py

Let me know what you think. If anyone thinks this might be worth 
adding to the feature selection module, the original author Miron 
is happy to give his blessing, and I'm happy to work on it further.


Cheers,
Daniel

On 15/04/15 11:03, Daniel Homola wrote:

Hi all,

I needed a multivariate feature selection method for my work. As I'm 
working with biological/medical data, where n < p or even n << p I 
started to read up on Random Forest based methods, as in my limited 
understanding RF copes pretty well with this suboptimal situation.


I came across an R package called Boruta: https://m2.icm.edu.pl/boruta/


After reading the paper and checking some of the pretty impressive 
citations I thought I'd try it, but it was really slow. So I thought 
I'd reimplement it in Python, because I hoped (based on 
this: http://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn) 
that it would be faster. And it is :) I mean a LOT faster.


I was wondering if this would be something that you would consider 
incorporating into the feature selection module of scikit-learn?


If yes, do you have a tutorial or some sort of guidance about how 
should I prepare the code, what conventions should I follow, etc?


Cheers,

Daniel Homola

STRATiGRAD PhD Programme
Imperial College London 








Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2015-05-08 Thread Andreas Mueller

Hi Daniel.
That looks cool.
Can you do a github pull request?
See the contributor docs:
http://scikit-learn.org/dev/developers/index.html

Thanks,
Andy

On 05/08/2015 02:34 PM, Daniel Homola wrote:

Hi all,

I wrote a couple of weeks ago about implementing the Boruta 
all-relevant feature selection algorithm in Python.


I think it's ready to go now. I wrote fit, transform and fit_transform 
methods for it to make it sklearn-like.


Here it is:
https://bitbucket.org/danielhomola/boruta_py

Let me know what you think. If anyone thinks this might be worth 
adding to the feature selection module, the original author Miron 
is happy to give his blessing, and I'm happy to work on it further.


Cheers,
Daniel

On 15/04/15 11:03, Daniel Homola wrote:

Hi all,

I needed a multivariate feature selection method for my work. As I'm 
working with biological/medical data, where n < p or even n << p I 
started to read up on Random Forest based methods, as in my limited 
understanding RF copes pretty well with this suboptimal situation.


I came across an R package called Boruta: https://m2.icm.edu.pl/boruta/


After reading the paper and checking some of the pretty impressive 
citations I thought I'd try it, but it was really slow. So I thought 
I'd reimplement it in Python, because I hoped (based on 
this: http://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn) 
that it would be faster. And it is :) I mean a LOT faster.


I was wondering if this would be something that you would consider 
incorporating into the feature selection module of scikit-learn?


If yes, do you have a tutorial or some sort of guidance about how 
should I prepare the code, what conventions should I follow, etc?


Cheers,

Daniel Homola

STRATiGRAD PhD Programme
Imperial College London 








Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2015-05-08 Thread Daniel Homola

Hi Andy,

Thanks! Will definitely do a github pull request once Miron has confirmed he 
benchmarked my implementation by running it on the datasets the method 
was published with.


I wrote a blog post about it, which explains the differences but in a 
quite casual and non-rigorous way:

http://danielhomola.com/2015/05/08/borutapy-an-all-relevant-feature-selection-method/

I guess a more technical write-up, with one of the built-in datasets, 
would be more useful for the sklearn audience. I'm happy to do it if 
Miron says everything looks good.


Cheers,
Daniel

On 08/05/15 21:02, Andreas Mueller wrote:
Btw, an example that compares this against existing feature selection 
methods that explains differences and advantages would help users and 
convince us to merge ;)



On 05/08/2015 02:34 PM, Daniel Homola wrote:

Hi all,

I wrote a couple of weeks ago about implementing the Boruta 
all-relevant feature selection algorithm in Python.


I think it's ready to go now. I wrote fit, transform and 
fit_transform methods for it to make it sklearn-like.


Here it is:
https://bitbucket.org/danielhomola/boruta_py

Let me know what you think. If anyone thinks this might be worth 
adding to the feature selection module, the original author Miron 
is happy to give his blessing, and I'm happy to work on it further.


Cheers,
Daniel

On 15/04/15 11:03, Daniel Homola wrote:

Hi all,

I needed a multivariate feature selection method for my work. As I'm 
working with biological/medical data, where n < p or even n << p I 
started to read up on Random Forest based methods, as in my limited 
understanding RF copes pretty well with this suboptimal situation.


I came across an R package called Boruta: https://m2.icm.edu.pl/boruta/


After reading the paper and checking some of the pretty impressive 
citations I thought I'd try it, but it was really slow. So I thought 
I'd reimplement it in Python, because I hoped (based on 
this: http://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn) 
that it would be faster. And it is :) I mean a LOT faster.


I was wondering if this would be something that you would consider 
incorporating into the feature selection module of scikit-learn?


If yes, do you have a tutorial or some sort of guidance about how 
should I prepare the code, what conventions should I follow, etc?


Cheers,

Daniel Homola

STRATiGRAD PhD Programme
Imperial College London 












Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2015-05-08 Thread Daniel Homola

Hi all,

I wrote a couple of weeks ago about implementing the Boruta all-relevant 
feature selection algorithm in Python.


I think it's ready to go now. I wrote fit, transform and fit_transform 
methods for it to make it sklearn-like.


Here it is:
https://bitbucket.org/danielhomola/boruta_py
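
For reviewers, usage is meant to look like any other sklearn selector; 
roughly like the sketch below (the import path and constructor arguments 
here are placeholders, see the repo for the real signature):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from boruta_py import BorutaPy   # placeholder import path

    X, y = make_classification(n_samples=200, n_features=25, random_state=0)
    rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
    selector = BorutaPy(rf, random_state=0)   # assumed constructor
    selector.fit(X, y)                   # iterative shadow-feature testing
    X_selected = selector.transform(X)   # keep only the confirmed features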

Let me know what you think. If anyone thinks this might be worth 
adding to the feature selection module, the original author Miron is 
happy to give his blessing, and I'm happy to work on it further.


Cheers,
Daniel

On 15/04/15 11:03, Daniel Homola wrote:

Hi all,

I needed a multivariate feature selection method for my work. As I'm 
working with biological/medical data, where n < p or even n << p I 
started to read up on Random Forest based methods, as in my limited 
understanding RF copes pretty well with this suboptimal situation.


I came across an R package called Boruta: https://m2.icm.edu.pl/boruta/


After reading the paper and checking some of the pretty impressive 
citations I thought I'd try it, but it was really slow. So I thought 
I'd reimplement it in Python, because I hoped (based on 
this: http://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn) 
that it would be faster. And it is :) I mean a LOT faster.


I was wondering if this would be something that you would consider 
incorporating into the feature selection module of scikit-learn?


If yes, do you have a tutorial or some sort of guidance about how 
should I prepare the code, what conventions should I follow, etc?


Cheers,

Daniel Homola

STRATiGRAD PhD Programme
Imperial College London 




Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2015-05-08 Thread Andreas Mueller
It doesn't need to be super technical, and we try to keep the user guide 
"easy to understand". No bonus points for unnecessary latex ;)
The example should be as illustrative and fair as possible, and built-in 
datasets are preferred. It shouldn't be too heavy-weight, though.
If you like, you can show off some plots in the PR, that is always very 
welcome.



On 05/08/2015 03:15 PM, Daniel Homola wrote:

Hi Andy,

Thanks! Will definitely do a github pull request once Miron has confirmed 
he benchmarked my implementation by running it on the datasets the 
method was published with.


I wrote a blog post about it, which explains the differences but in a 
quite casual and non-rigorous way:

http://danielhomola.com/2015/05/08/borutapy-an-all-relevant-feature-selection-method/

I guess a more technical write-up, with one of the built-in datasets, 
would be more useful for the sklearn audience. I'm happy to do it if 
Miron says everything looks good.


Cheers,
Daniel

On 08/05/15 21:02, Andreas Mueller wrote:
Btw, an example that compares this against existing feature selection 
methods that explains differences and advantages would help users and 
convince us to merge ;)



On 05/08/2015 02:34 PM, Daniel Homola wrote:

Hi all,

I wrote a couple of weeks ago about implementing the Boruta 
all-relevant feature selection algorithm in Python.


I think it's ready to go now. I wrote fit, transform and 
fit_transform methods for it to make it sklearn-like.


Here it is:
https://bitbucket.org/danielhomola/boruta_py

Let me know what you think. If anyone thinks this might be worth 
adding to the feature selection module, the original author Miron 
is happy to give his blessing, and I'm happy to work on it further.


Cheers,
Daniel

On 15/04/15 11:03, Daniel Homola wrote:

Hi all,

I needed a multivariate feature selection method for my work. As 
I'm working with biological/medical data, where n < p or even n << p 
I started to read up on Random Forest based methods, as in my 
limited understanding RF copes pretty well with this suboptimal 
situation.


I came across an R package called 
Boruta: https://m2.icm.edu.pl/boruta/


After reading the paper and checking some of the pretty impressive 
citations I thought I'd try it, but it was really slow. So I 
thought I'd reimplement it in Python, because I hoped (based on 
this: http://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn) 
that it would be faster. And it is :) I mean a LOT faster.


I was wondering if this would be something that you would consider 
incorporating into the feature selection module of scikit-learn?


If yes, do you have a tutorial or some sort of guidance about how 
should I prepare the code, what conventions should I follow, etc?


Cheers,

Daniel Homola

STRATiGRAD PhD Programme
Imperial College London 













Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2015-04-17 Thread Gilles Louppe
Hi,

In general, I agree that we should at least add a way to compute feature
importances using permutations. This is an alternative, yet standard, way
to do it in comparison to what we do (mean decrease of impurity, which is
also standard).

Assuming we provide permutation importances as a building block, it wouldn't
be difficult for users to add contrast features and rank their features
against those, thereby implementing the algorithms you describe. What do
you think?
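
For concreteness, the building block itself is tiny; a minimal sketch of
permutation importances (just the idea, not a proposed API):

    import numpy as np

    def permutation_importances(est, X_val, y_val, random_state=0):
        # importance of feature j = drop in validation score when column j
        # alone is shuffled, which breaks its link with the target
        rng = np.random.RandomState(random_state)
        baseline = est.score(X_val, y_val)
        importances = np.zeros(X_val.shape[1])
        for j in range(X_val.shape[1]):
            X_perm = X_val.copy()
            rng.shuffle(X_perm[:, j])
            importances[j] = baseline - est.score(X_perm, y_val)
        return importances

Here est is any fitted estimator with a score method, and X_val, y_val are
held-out data.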

 this means it tries to find all features carrying information usable for
prediction, rather than finding a possibly compact subset of features on
which some classifier has a minimal error. Here is a paper with the details.

Yes! This is a very good point. And in fact, this can be achieved in
scikit-learn by using totally randomized trees instead
(ExtraTreesClassifier(max_features=1)).
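
i.e. something along these lines:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import ExtraTreesClassifier

    X, y = make_classification(n_samples=200, n_features=20, random_state=0)
    # with max_features=1 each split considers a single randomly drawn
    # feature (and a random cut-point): totally randomized trees
    trees = ExtraTreesClassifier(n_estimators=500, max_features=1, n_jobs=-1)
    trees.fit(X, y)
    print(trees.feature_importances_)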

Best,
Gilles



On 15 April 2015 at 18:16, Satrajit Ghosh sa...@mit.edu wrote:

 hi andy and dan,

 i've been using a similar heuristic with extra trees quite effectively. i
 have to look at the details of this R package and the paper, but in my case
 i add a feature that has very low correlation with my target class/value
 (depending on the problem) and choose features that have a higher feature
 importance than this feature. quite simple to implement with a few lines of
 code using extra trees. but stochastic in nature given how my control
 feature is generated (at present simply randn).

 since there are potential variations one can add to this idea, i never
 thought of it as a standalone feature transformer, but it could easily be
 implemented as one. i thought the variations might be good as a contrib
 package rather than a new feature selection module.

 cheers,

 satra

 On Wed, Apr 15, 2015 at 11:56 AM, Andreas Mueller t3k...@gmail.com
 wrote:

  Hi Dan.
 I saw that paper, but it is not well-cited.
 My question is more how different this is from what we already have.
 So it looks like some (5) random control features are added and the
 feature importances are compared against the control.

 The question is whether the feature importance that is used is different
 from ours. Gilles?

 If not, this could be hard to add. If it is the same,  I think a
 meta-estimator would be a nice addition to the feature selection module.

 Cheers,
 Andy



 On 04/15/2015 11:32 AM, Daniel Homola wrote:

 Hi Andy,

 This is the paper: http://www.jstatsoft.org/v36/i11/ which was cited 79
 times according to Google Scholar.

 Regarding your second point, the first 3 questions of the FAQ on the
 Boruta website answer it, I guess: https://m2.icm.edu.pl/boruta/

1. *So, what's so special about Boruta?* It is an all relevant
feature selection method, while most others are minimal optimal; this means
it tries to find all features carrying information usable for prediction,
rather than finding a possibly compact subset of features on which some
classifier has a minimal error. Here is a paper with the details.
2. *Why should I care?* For a start, when you try to understand the
phenomenon that made your data, you should care about all factors that
contribute to it, not just the bluntest signs of it in context of your
methodology (yes, minimal optimal set of features by definition depends on
your classifier choice).
3. *But I only care about good classification accuracy!* So you also
care about having a robust model; in p≫n problems, one can usually
cherry-pick a nonsense subset of features which yields good or even perfect
classification – minimal optimal methods can easily get deceived by that,
leaving you with an overfitted model and no sign that something is wrong.
See this or that for an example.

 I'm not an ML expert by any means but it seemed reasonable to me. Any
 thoughts?

 Cheers,
 Dan



 On 15/04/15 16:23, Andreas Mueller wrote:

 Hi Daniel.
 That sounds potentially interesting.
 Is there a widely cited paper for this?
 I didn't read the paper, but it looks very similar to
 RFE(RandomForestClassifier()).
 Is it qualitatively different from that? Does it use a different feature
 importance?

 btw: your mail is flagged as spam as your link is broken and links to
 some imperial college internal page.

 Cheers,
 Andy

 On 04/15/2015 05:03 AM, Daniel Homola wrote:

 Hi all,

 I needed a multivariate feature selection method for my work. As I'm
 working with biological/medical data, where n < p or even n << p I started
 to read up on Random Forest based methods, as in my limited understanding
 RF copes pretty well with this suboptimal situation.

 I came across an R package called Boruta: https://m2.icm.edu.pl/boruta/

 After reading the paper and checking some of the pretty impressive
 citations I thought I'd try it, but it was really slow.

Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2015-04-15 Thread Andreas Mueller

Hi Daniel.
That sounds potentially interesting.
Is there a widely cited paper for this?
I didn't read the paper, but it looks very similar to 
RFE(RandomForestClassifier()).
Is it qualitatively different from that? Does it use a different feature 
importance?
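
For reference, the existing route I have in mind is roughly:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE

    X, y = make_classification(n_samples=200, n_features=25, random_state=0)
    # recursive elimination driven by the forest's impurity-based
    # feature_importances_
    rfe = RFE(RandomForestClassifier(n_estimators=100),
              n_features_to_select=10)
    rfe.fit(X, y)
    print(rfe.support_)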


btw: your mail is flagged as spam as your link is broken and links to 
some imperial college internal page.


Cheers,
Andy

On 04/15/2015 05:03 AM, Daniel Homola wrote:

Hi all,

I needed a multivariate feature selection method for my work. As I'm 
working with biological/medical data, where n < p or even n << p I 
started to read up on Random Forest based methods, as in my limited 
understanding RF copes pretty well with this suboptimal situation.


I came across an R package called Boruta: https://m2.icm.edu.pl/boruta/


After reading the paper and checking some of the pretty impressive 
citations I thought I'd try it, but it was really slow. So I thought 
I'd reimplement it in Python, because I hoped (based on 
this: http://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn) 
that it would be faster. And it is :) I mean a LOT faster.


I was wondering if this would be something that you would consider 
incorporating into the feature selection module of scikit-learn?


If yes, do you have a tutorial or some sort of guidance about how 
should I prepare the code, what conventions should I follow, etc?


Cheers,

Daniel Homola

STRATiGRAD PhD Programme
Imperial College London






Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2015-04-15 Thread Daniel Homola

Hi Andy,

This is the paper: http://www.jstatsoft.org/v36/i11/ which was cited 79 
times according to Google Scholar.


Regarding your second point, the first 3 questions of the FAQ on the 
Boruta website answer it, I guess: https://m2.icm.edu.pl/boruta/


1. *So, what's so special about Boruta?* It is an all relevant feature
   selection method, while most others are minimal optimal; this means
   it tries to find all features carrying information usable for
   prediction, rather than finding a possibly compact subset of
   features on which some classifier has a minimal error. Here is a
   paper with the details.
2. *Why should I care?* For a start, when you try to understand the
   phenomenon that made your data, you should care about all factors
   that contribute to it, not just the bluntest signs of it in context
   of your methodology (yes, minimal optimal set of features by
   definition depends on your classifier choice).
3. *But I only care about good classification accuracy!* So you also
   care about having a robust model; in p≫n problems, one can usually
   cherry-pick a nonsense subset of features which yields good or even
   perfect classification – minimal optimal methods can easily get
   deceived by that, leaving you with an overfitted model and no sign
   that something is wrong. See this or that for an example.

I'm not an ML expert by any means but it seemed reasonable to me. Any 
thoughts?


Cheers,
Dan




On 15/04/15 16:23, Andreas Mueller wrote:

Hi Daniel.
That sounds potentially interesting.
Is there a widely cited paper for this?
I didn't read the paper, but it looks very similar to 
RFE(RandomForestClassifier()).
Is it qualitatively different from that? Does it use a different 
feature importance?


btw: your mail is flagged as spam as your link is broken and links to 
some imperial college internal page.


Cheers,
Andy

On 04/15/2015 05:03 AM, Daniel Homola wrote:

Hi all,

I needed a multivariate feature selection method for my work. As I'm 
working with biological/medical data, where n < p or even n << p I 
started to read up on Random Forest based methods, as in my limited 
understanding RF copes pretty well with this suboptimal situation.


I came across an R package called 
Boruta: https://m2.icm.edu.pl/boruta/


After reading the paper and checking some of the pretty impressive 
citations I thought I'd try it, but it was really slow. So I thought 
I'd reimplement it in Python, because I hoped (based on 
this: http://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn) 
that it would be faster. And it is :) I mean a LOT faster.


I was wondering if this would be something that you would consider 
incorporating into the feature selection module of scikit-learn?


If yes, do you have a tutorial or some sort of guidance about how 
should I prepare the code, what conventions should I follow, etc?


Cheers,

Daniel Homola

STRATiGRAD PhD Programme
Imperial College London









Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2015-04-15 Thread Satrajit Ghosh
hi andy and dan,

i've been using a similar heuristic with extra trees quite effectively. i
have to look at the details of this R package and the paper, but in my case
i add a feature that has very low correlation with my target class/value
(depending on the problem) and choose features that have a higher feature
importance than this feature. quite simple to implement with a few lines of
code using extra trees. but stochastic in nature given how my control
feature is generated (at present simply randn).

since there are potential variations one can add to this idea, i never
thought of it as a standalone feature transformer, but it could easily be
implemented as one. i thought the variations might be good as a contrib
package rather than a new feature selection module.
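
for concreteness, a rough sketch of that heuristic (one control feature and
a single fit; the control feature here is simply randn, as described above):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import ExtraTreesClassifier

    X, y = make_classification(n_samples=300, n_features=30, random_state=0)
    rng = np.random.RandomState(0)
    # append a control feature with ~zero correlation to the target
    X_aug = np.hstack([X, rng.randn(X.shape[0], 1)])
    trees = ExtraTreesClassifier(n_estimators=500, random_state=0)
    trees.fit(X_aug, y)
    imp = trees.feature_importances_
    # keep the real features whose importance beats the control's
    selected = np.where(imp[:-1] > imp[-1])[0]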

cheers,

satra

On Wed, Apr 15, 2015 at 11:56 AM, Andreas Mueller t3k...@gmail.com wrote:

  Hi Dan.
 I saw that paper, but it is not well-cited.
 My question is more how different this is from what we already have.
 So it looks like some (5) random control features are added and the
 feature importances are compared against the control.

 The question is whether the feature importance that is used is different
 from ours. Gilles?

 If not, this could be hard to add. If it is the same,  I think a
 meta-estimator would be a nice addition to the feature selection module.

 Cheers,
 Andy



 On 04/15/2015 11:32 AM, Daniel Homola wrote:

 Hi Andy,

 This is the paper: http://www.jstatsoft.org/v36/i11/ which was cited 79
 times according to Google Scholar.

 Regarding your second point, the first 3 questions of the FAQ on the
 Boruta website answer it, I guess: https://m2.icm.edu.pl/boruta/

1. *So, what's so special about Boruta?* It is an all relevant feature
selection method, while most others are minimal optimal; this means it tries
to find all features carrying information usable for prediction, rather
than finding a possibly compact subset of features on which some classifier
has a minimal error. Here is a paper with the details.
2. *Why should I care?* For a start, when you try to understand the
phenomenon that made your data, you should care about all factors that
contribute to it, not just the bluntest signs of it in context of your
methodology (yes, minimal optimal set of features by definition depends on
your classifier choice).
3. *But I only care about good classification accuracy!* So you also
care about having a robust model; in p≫n problems, one can usually
cherry-pick a nonsense subset of features which yields good or even perfect
classification – minimal optimal methods can easily get deceived by that,
leaving you with an overfitted model and no sign that something is wrong.
See this or that for an example.

 I'm not an ML expert by any means but it seemed reasonable to me. Any
 thoughts?

 Cheers,
 Dan



 On 15/04/15 16:23, Andreas Mueller wrote:

 Hi Daniel.
 That sounds potentially interesting.
 Is there a widely cited paper for this?
 I didn't read the paper, but it looks very similar to
 RFE(RandomForestClassifier()).
 Is it qualitatively different from that? Does it use a different feature
 importance?

 btw: your mail is flagged as spam as your link is broken and links to some
 imperial college internal page.

 Cheers,
 Andy

 On 04/15/2015 05:03 AM, Daniel Homola wrote:

 Hi all,

 I needed a multivariate feature selection method for my work. As I'm
 working with biological/medical data, where n < p or even n << p I started
 to read up on Random Forest based methods, as in my limited understanding
 RF copes pretty well with this suboptimal situation.

 I came across an R package called Boruta: https://m2.icm.edu.pl/boruta/

 After reading the paper and checking some of the pretty impressive
 citations I thought I'd try it, but it was really slow. So I thought I'd
 reimplement it in Python, because I hoped (based on this:
 http://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn)
 that it would be faster. And it is :) I mean a LOT faster.

 I was wondering if this would be something that you would consider
 incorporating into the feature selection module of scikit-learn?

 If yes, do you have a tutorial or some sort of guidance about how I should
 prepare the code, what conventions I should follow, etc.?

 Cheers,

 Daniel Homola

 STRATiGRAD PhD Programme
 Imperial College London


Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2015-04-15 Thread Andreas Mueller

Hi Dan.
I saw that paper, but it is not well-cited.
My question is more how different this is from what we already have.
So it looks like some (5) random control features are added and the 
feature importances are compared against the control.


The question is whether the feature importance that is used is different 
from ours. Gilles?


If not, this could be hard to add. If it is the same,  I think a 
meta-estimator would be a nice addition to the feature selection module.


Cheers,
Andy


On 04/15/2015 11:32 AM, Daniel Homola wrote:

Hi Andy,

This is the paper: http://www.jstatsoft.org/v36/i11/ which was cited 
79 times according to Google Scholar.


Regarding your second point, the first 3 questions of the FAQ on the 
Boruta website answer it, I guess: https://m2.icm.edu.pl/boruta/


 1. *So, what's so special about Boruta?* It is an all relevant
feature selection method, while most others are minimal optimal;
this means it tries to find all features carrying information
usable for prediction, rather than finding a possibly compact
subset of features on which some classifier has a minimal error.
Here is a paper with the details.
 2. *Why should I care?* For a start, when you try to understand the
phenomenon that made your data, you should care about all factors
that contribute to it, not just the bluntest signs of it in
context of your methodology (yes, minimal optimal set of features
by definition depends on your classifier choice).
 3. *But I only care about good classification accuracy!* So you also
care about having a robust model; in p≫n problems, one can usually
cherry-pick a nonsense subset of features which yields good or
even perfect classification – minimal optimal methods can easily
get deceived by that, leaving you with an overfitted model and no
sign that something is wrong. See this or that for an example.

I'm not an ML expert by any means but it seemed reasonable to me. Any 
thoughts?


Cheers,
Dan




On 15/04/15 16:23, Andreas Mueller wrote:

Hi Daniel.
That sounds potentially interesting.
Is there a widely cited paper for this?
I didn't read the paper, but it looks very similar to 
RFE(RandomForestClassifier()).
Is it qualitatively different from that? Does it use a different 
feature importance?


btw: your mail is flagged as spam as your link is broken and links to 
some imperial college internal page.


Cheers,
Andy

On 04/15/2015 05:03 AM, Daniel Homola wrote:

Hi all,

I needed a multivariate feature selection method for my work. As I'm 
working with biological/medical data, where n < p or even n << p, I 
started to read up on Random Forest based methods, as in my limited 
understanding RF copes pretty well with this suboptimal situation.


I came across an R package called 
Boruta:https://m2.icm.edu.pl/boruta/ 


After reading the paper and checking some of the pretty impressive 
citations I thought I'd try it, but it was really slow. So I thought 
I'd reimplement it in Python, because I hoped (based on this: 
http://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn) 
that it will be faster. And it is :) I mean a LOT faster..


I was wondering if this would be something that you would consider 
incorporating into the feature selection module of scikit-learn?


If yes, do you have a tutorial or some sort of guidance about how 
I should prepare the code, what conventions I should follow, etc.?


Cheers,

Daniel Homola

STRATiGRAD PhD Programme
Imperial College London









Re: [Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

2015-04-15 Thread Daniel Homola

Hi Andy,

So at each iteration the x predictor matrix (n by m) is practically 
copied and each column is shuffled in the copied version. This shuffled 
matrix is then copied next to the original (n by 2m) and fed into the 
RF, to get the feature importances.
Also at the start of the method, a vector with length m is initialized 
with zeros, called hitReg.
After the RF training, each feature's importance in x is checked against 
the maximum of the shuffled ones. Those that are higher are recorded by 
incrementing their entry in the hitReg vector.
At each iteration the method checks which feature is doing better than 
expected by random chance. So if we are in the 10th iteration, and 
feature F was better than the max of the shuffled ones 8 times, we get 
p = .01 with sp.stats.binom.sf(8, 10, .5). We correct for multiple 
testing, and if the feature is still significant, we record it as a 
confirmed or important one. Conversely, if feature F was only better 
once (sp.stats.binom.cdf(1, 10, .5)), we reject it and delete it from 
the x matrix. The method ends when all features are either rejected or 
confirmed, or when the number of iterations reaches the user-set max.
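
In code, one round of this looks roughly like the sketch below (a minimal
illustration of the logic just described, not the actual boruta_py
implementation; the function name, hit_reg spelling, n_estimators value and
alpha threshold are made up, and the multiple-testing correction is left out):

import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier

def boruta_iteration(X, y, hit_reg, iteration, alpha=0.05):
    n, m = X.shape
    # copy x and shuffle each column of the copy: the "shadow" features
    X_shadow = X.copy()
    for j in range(m):
        np.random.shuffle(X_shadow[:, j])
    # train the RF on the n by 2m matrix [x, shadow], split the importances
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(np.hstack([X, X_shadow]), y)
    real_imp = rf.feature_importances_[:m]
    shadow_imp = rf.feature_importances_[m:]
    # a feature scores a "hit" whenever it beats the best shadow feature
    hit_reg = hit_reg + (real_imp > shadow_imp.max())
    # binomial tests against chance, as in the p = .01 example above
    confirmed = stats.binom.sf(hit_reg, iteration, .5) < alpha
    rejected = stats.binom.cdf(hit_reg, iteration, .5) < alpha
    return hit_reg, confirmed, rejected

Looping this, deleting rejected columns from x as you go, and stopping once
every feature is confirmed or rejected (or the iteration cap is hit) gives
the whole procedure.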


Cheers,
Dan



On 15/04/15 16:56, Andreas Mueller wrote:

Hi Dan.
I saw that paper, but it is not well-cited.
My question is more how different this is from what we already have.
So it looks like some (5) random control features are added and the 
feature importances are compared against the control.


The question is whether the feature importance that is used is 
different from ours. Gilles?


If not, this could be hard to add. If it is the same,  I think a 
meta-estimator would be a nice addition to the feature selection module.


Cheers,
Andy


On 04/15/2015 11:32 AM, Daniel Homola wrote:

Hi Andy,

This is the paper: http://www.jstatsoft.org/v36/i11/ which was cited 
79 times according to Google Scholar.


Regarding your second point, the first 3 questions of the FAQ on the 
Boruta website answer it, I guess: https://m2.icm.edu.pl/boruta/


 1. *So, what's so special about Boruta?* It is an all relevant
feature selection method, while most others are minimal optimal;
this means it tries to find all features carrying information
usable for prediction, rather than finding a possibly compact
subset of features on which some classifier has a minimal error.
Here is a paper with the details.
 2. *Why should I care?* For a start, when you try to understand the
phenomenon that made your data, you should care about all factors
that contribute to it, not just the bluntest signs of it in
context of your methodology (yes, minimal optimal set of features
by definition depends on your classifier choice).
 3. *But I only care about good classification accuracy!* So you also
care about having a robust model; in p≫n problems, one can
usually cherry-pick a nonsense subset of features which yields
good or even perfect classification – minimal optimal methods can
easily get deceived by that, leaving you with an overfitted model
and no sign that something is wrong. See this or that for an example.

I'm not an ML expert by any means but it seemed reasonable to me. Any 
thoughts?


Cheers,
Dan




On 15/04/15 16:23, Andreas Mueller wrote:

Hi Daniel.
That sounds potentially interesting.
Is there a widely cited paper for this?
I didn't read the paper, but it looks very similar to 
RFE(RandomForestClassifier()).
Is it qualitatively different from that? Does it use a different 
feature importance?


btw: your mail is flagged as spam as your link is broken and links 
to some imperial college internal page.


Cheers,
Andy

On 04/15/2015 05:03 AM, Daniel Homola wrote:

Hi all,

I needed a multivariate feature selection method for my work. As 
I'm working with biological/medical data, where n < p or even n << 
p, I started to read up on Random Forest based methods, as in my 
limited understanding RF copes pretty well with this suboptimal 
situation.


I came across an R package called 
Boruta:https://m2.icm.edu.pl/boruta/ 


After reading the paper and checking some of the pretty impressive 
citations I thought I'd try it, but it was really slow. So I 
thought I'd reimplement it in Python, because I hoped (based on this: 
http://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn) 
that it will be faster. And it is :) I mean a LOT faster..


I was wondering if this would be something that you would consider 
incorporating into the feature selection module of scikit-learn?


If yes, do you have a tutorial or some sort of guidance about how I should 
prepare the code, what conventions I should follow, etc.?

Re: [Scikit-learn-general] Contributing to Scikit

2014-02-02 Thread Olivier Grisel
2014/2/2 Jitesh Khandelwal jk231...@gmail.com:
 Hi,

 I have used scikit-learn for academic purposes and I like it very much.

 I want to contribute to it. I have gone through the developers documentation
 and setup my local working directory.

 As suggested in the developers documentation, I did look for some EASY
 tagged issues in the issue tracker, but it seems that most of them are being
 worked on by others.

 How do I find a bug to start with ? Can somebody help ?

Feel free to fetch, build, test and benchmark others' development
branches in pull requests [1] and report test failures, missing tests,
and missing or unclear documentation / docstrings / naming conventions.

Please also feel free to read the reference papers of the methods
impacted by the PR (they should be referenced in the docstrings), try to
see if the implementations diverge from the original method from a
mathematical standpoint, and check whether we could add more tests to
check theoretical invariants of the method (e.g. a cost function that
should be monotonic with the training set size, invariance to input
sample ordering, and things like this).
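
As an illustration, a sample-ordering invariance check in that spirit could
look like the following sketch (illustrative only, not an actual
scikit-learn test; the estimator and tolerance are arbitrary choices):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def test_invariance_to_sample_ordering():
    # fitting on a permuted copy of the training set should give
    # (numerically) the same model, hence the same probabilities
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    perm = np.random.RandomState(42).permutation(len(X))
    proba_orig = LogisticRegression().fit(X, y).predict_proba(X)
    proba_perm = LogisticRegression().fit(X[perm], y[perm]).predict_proba(X)
    # loose tolerance: iterative solvers only converge approximately
    assert np.allclose(proba_orig, proba_perm, atol=1e-6)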

It is therefore recommended that you start by reviewing pull requests
that are somehow related to your domain of expertise.

Running benchmarks of new methods on your own research data when it
makes sense is also very much appreciated.

[1] https://github.com/scikit-learn/scikit-learn/pulls

Note: the name of the project is scikit-learn, not scikit or SciKit
nor sci-kit learn.

Cheers,

-- 
Olivier



Re: [Scikit-learn-general] Contributing to Scikit

2014-02-02 Thread Andy
On 02/02/2014 12:06 PM, Olivier Grisel wrote:
 Note: the name of the project is scikit-learn, not scikit or SciKit 
 nor sci-kit learn. Cheers, 
I should make this my signature from now on. Also including 
pronunciation (sy-kit learn)



Re: [Scikit-learn-general] contributing to scikit

2014-02-02 Thread Andy
On 02/01/2014 10:42 PM, Robert Layton wrote:

 Finally, when choosing classifiers, it's our preference to focus on 
 heavily used classifiers, rather than state of the art. Many of the 
 core devs (and myself) have coded classifiers that are scikit-learn 
 compatible, but not in the library because they aren't commonly used. 
 These are usually in private repositories.
Where "private" usually means "public, but not in the project".

I think some of the core contributors agreed with me that we want to 
slow the addition of new estimators.
There are many open issues and pull requests. Working on these would 
help us way more than new estimators.




Re: [Scikit-learn-general] Contributing to Scikit

2014-02-02 Thread Vlad Niculae
I've heard stchee-kit once, along with stchee-pee and num-pee.

Vlad

On Sun Feb  2 18:39:58 2014, Hadayat Seddiqi wrote:
 i always said skikit


 On Sun, Feb 2, 2014 at 12:20 PM, Andy t3k...@gmail.com wrote:

 On 02/02/2014 12:06 PM, Olivier Grisel wrote:
  Note: the name of the project is scikit-learn, not scikit or SciKit
  nor sci-kit learn. Cheers,
 I should make this my signature from now on. Also including
 pronunciation (sy-kit learn)

 






Re: [Scikit-learn-general] Contributing to Scikit

2014-02-02 Thread Andy
On 02/02/2014 06:39 PM, Hadayat Seddiqi wrote:
 i always said skikit

Many people do ;)
sci as in science =)



Re: [Scikit-learn-general] Contributing to Scikit

2014-02-02 Thread Andy
On 02/02/2014 07:41 PM, Vlad Niculae wrote:
 I've heard stchee-kit once, along with stchee-pee and num-pee.

We should have an FAQ.
It should include

What is the project name? scikit-learn, not scikit or SciKit nor sci-kit 
learn.

How do you pronounce the project name? sy-kit learn. sci stands for science!

Do you want to add this awesome new algorithm? No.

Do you want to add this awesome, widely-used and established algorithm 
that is within the current scope of scikit-learn? PR welcome.



Re: [Scikit-learn-general] Contributing to Scikit

2014-02-02 Thread Juan Nunez-Iglesias
On Mon, Feb 3, 2014 at 5:49 AM, Andy t3k...@gmail.com wrote:

 We should have an FAQ.
 It should include

 What is the project name? scikit-learn, not scikit or SciKit nor sci-kit
 learn.

 How do you pronounce the project name? sy-kit learn. sci stands for
 science!

 Do you want to add this awesome new algorithm? No.

 Do you want to add this awesome, widely-used and established algorithm
 that is within the current scope of scikit-learn? PR welcome.


+1! That is an awesome FAQ. =)


Re: [Scikit-learn-general] contributing to scikit

2014-02-01 Thread Joseph Perla
Is this the right place to ask? I'm just going to send in a pull
request if nobody has any suggestions.
j

On Fri, Jan 31, 2014 at 7:10 PM, Joseph Perla jos...@jperla.com wrote:
 I love SciKit and I'm going to contribute an SGD classifier for
 semi-supervised problems.

 I already read through all the contributor documentation and I've read
 many of the docs.

 I'm asking the list if I should model my code off of the style/quality
 of the SGDClassifier class or if there is a better example to model my
 code style from in sklearn.

 Thank you,
 Joseph Perla



Re: [Scikit-learn-general] contributing to scikit

2014-02-01 Thread Robert Layton
Hi Joseph,
In theory, you should be able to take any classifier in sklearn and base
your implementation on that. That said, there are a few caveats. Some
classifiers are older, from before coding was more formalised. Others have a
lot of cython code hooks, and can be difficult to read. That all said, I had
a look at SGDClassifier, and the code seems up to date, making it a good
candidate to base your code on.


Also, as an FYI, adding a new classifier to scikit-learn involves a number of
components, including:
- unit testing
- doc strings
- narrative docs

Finally, when choosing classifiers, it's our preference to focus on heavily
used classifiers, rather than state of the art. Many of the core devs (and
myself) have coded classifiers that are scikit-learn compatible, but not in
the library because they aren't commonly used. These are usually in private
repositories.

Thanks,

Robert




On 2 February 2014 07:16, Joseph Perla jos...@jperla.com wrote:

 Is this the right place to ask? I'm just going to send in a pull
 request if nobody has any suggestions.
 j

 On Fri, Jan 31, 2014 at 7:10 PM, Joseph Perla jos...@jperla.com wrote:
  I love SciKit and I'm going to contribute an SGD classifier for
  semi-supervised problems.
 
  I already read through all the contributor documentation and I've read
  many of the docs.
 
  I'm asking the list if I should model my code off of the style/quality
  of the SGDClassifier class or if there is a better example to model my
  code style from in sklearn.
 
  Thank you,
  Joseph Perla




Re: [Scikit-learn-general] Contributing to scikit-learn

2013-10-14 Thread Olivier Grisel
Please have a look at the contributors guide:

http://scikit-learn.org/stable/developers/#contributing-code

In particular this doc mentions [Easy] tagged issues:

https://github.com/scikit-learn/scikit-learn/issues?labels=Easy

But in general the best way to contribute is to actually use the
library for your own projects, identify bugs or painpoints, report
them and start working on a PR if there is no other related PR under
way.

-- 
Olivier



Re: [Scikit-learn-general] Contributing to Scikit-Learn

2013-10-02 Thread Olivier Grisel
2013/10/2 Manoj Kumar manojkumarsivaraj...@gmail.com:
 Hi,

 I am Manoj Kumar, a junior undergrad from Birla Institute of Technology and
 Science.

 I've just completed my Google Summer of Code under SymPy. So I have a good
 programming background in Python.

 Regarding my Machine Learning background, I've taken an informal Coursera
 course, under Andrew Ng. I thought that the best way, to improve my
 knowledge and skills would be to contribute (or at-least try) to an existing
 Machine Learning library. And Scikit-learn was my first choice.

 Can someone point me to a list of existing bugs / docs that need to be written?
 And is there anything I need to learn as a prerequisite before trying to fix
 any of them? Because I am relatively new to Machine Learning, though I can
 grasp things quickly.

Hi,

You should find all the answers to your questions in the contributors guide:

http://scikit-learn.org/stable/developers/index.html

If not please feel free to ask again on the mailing list.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel



Re: [Scikit-learn-general] contributing to scikit-learn

2013-08-01 Thread Gael Varoquaux
On Thu, Aug 01, 2013 at 03:40:05PM +0200, Eustache DIEMERT wrote:
 Here is a little post that tells the story : 
 http://stochastics.komodo.re/posts/contributing-to-sklearn.html

Cool! Glad you enjoyed it. I tweeted you :)
https://twitter.com/GaelVaroquaux/status/362934648302616576

Thanks a lot for your excellent-quality work.

Gaël



Re: [Scikit-learn-general] contributing to scikit-learn

2013-08-01 Thread Andreas Mueller

Hey Eustache.
Nice write-up.
So who are the tinkerers and who are the prophets? ;)

Cheers,
Andy

On 08/01/2013 03:40 PM, Eustache DIEMERT wrote:

Hi list,

Not so long ago I had my first PR merged into sklearn.

Overall it was a very cool experience, thanks to many of you :)

Here is a little post that tells the story : 
http://stochastics.komodo.re/posts/contributing-to-sklearn.html


Cheers,

Eustache




Re: [Scikit-learn-general] Contributing to scikit-learn

2012-06-07 Thread Andreas Mueller

Hi everybody!
David, it's your project, I'm just trying to help along ;)
About 2): Afaik there is nothing in sklearn at the moment
that can deal with missing variables and I feel the MLP
is one of the estimators where dealing with missing values
is hardest.
@David: I wouldn't keep you from trying but it seems a bit
out of the scope of the MLP. I think the idea for missing data
was to provide an additional mask as input that says
which values are missing. Dealing with this is much more natural
in naive Bayes or tree based methods than in the MLP I think.

@Vandana: For dealing with missing data, one easy way is to
set the missing variables to their mean over the dataset.
Usually for MLPs the input should be zero mean, unit variance.
So the missing variable would be just set to 0.
Do you know of any better way of dealing with missing values
in MLPs?
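
In numpy terms the equivalence looks like this (a sketch, assuming missing
entries are coded as NaN; the helper name is made up):

import numpy as np

def standardize_and_zero_fill(X):
    # column-wise zero mean / unit variance over the observed entries
    mu = np.nanmean(X, axis=0)
    sigma = np.nanstd(X, axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant columns
    X_std = (X - mu) / sigma
    # after standardization, filling a missing entry with 0 is exactly
    # mean-imputation expressed in the original units
    return np.where(np.isnan(X_std), 0.0, X_std)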

Cheers,
Andy


On 06/05/2012 07:51 PM, David Marek wrote:
I think you sent this mail only to me, please send all mails to the 
mailing list. Btw. Andreas is my mentor, so he is the one in charge 
here :-)


Ad 1) Afaik all you need is one hidden layer, it's certainly possible 
to add the possibility, but I think we decided that it's not a priority.


Ad 2) Good idea

David

-- Forwarded message --
From: *Vandana Bachani* vandana@gmail.com

Date: Tue, Jun 5, 2012 at 6:59 PM
Subject: Re: [Scikit-learn-general] Contributing to scikit-learn
To: h4wk...@gmail.com


Hi David,
I think we can add the following also to the to do list:
1. Any number of hidden layers and hidden units should be supported.
2. Missing data should be handled (several UCI datasets have missing 
data).


I will look at the code and then send you a mail about my thoughts on 
the same.


If you would like to have a look at my project report, I am attaching 
the same.


Thanks,
Vandana





Re: [Scikit-learn-general] Contributing to scikit-learn

2012-06-07 Thread LI Wei
Intuitively maybe we can set the missing values using the average over the
nearest neighbors calculated using these existing features? Not sure
whether it is the correct way to do it :-)

Cheers,
LI, Wei

On Thu, Jun 7, 2012 at 12:25 PM, Andreas Mueller
amuel...@ais.uni-bonn.de wrote:

  Hi everybody!
 David, it's your project, I'm just trying to help along ;)
 About 2): Afaik there is nothing in sklearn at the moment
 that can deal with missing variables and I feel the MLP
 is one of the estimators where dealing with missing values
 is hardest.
 @David: I wouldn't keep you from trying but it seems a bit
 out of the scope of the MLP. I think the idea for missing data
 was to provide an additional mask as input that says
 which values are missing. Dealing with this is much more natural
 in naive Bayes or tree based methods than in the MLP I think.

 @Vandana: For dealing with missing data, one easy way is to
 set the missing variables to their mean over the dataset.
 Usually for MLPs the input should be zero mean, unit variance.
 So the missing variable would be just set to 0.
 Do you know of any better way of dealing with missing values
 in MLPs?

 Cheers,
 Andy



 On 06/05/2012 07:51 PM, David Marek wrote:

 I think you sent this mail only to me, please send all mails to the mailing
 list. Btw. Andreas is my mentor, so he is the one in charge here :-)

 Ad 1) Afaik all you need is one hidden layer, it's certainly possible to
 add the possibility, but I think we decided that it's not a priority.

 Ad 2) Good idea

 David

 -- Forwarded message --
 From: Vandana Bachani vandana@gmail.com
 Date: Tue, Jun 5, 2012 at 6:59 PM
 Subject: Re: [Scikit-learn-general] Contributing to scikit-learn
 To: h4wk...@gmail.com


 Hi David,
 I think we can add the following also to the to do list:
 1. Any number of hidden layers and hidden units should be supported.
 2. Missing data should be handled (several UCI datasets have missing data).

  I will look at the code and then send you a mail about my thoughts on
 the same.

  If you would like to have a look at my project report, I am attaching
 the same.

  Thanks,
 Vandana









Re: [Scikit-learn-general] Contributing to scikit-learn

2012-06-07 Thread David Warde-Farley
On Thu, Jun 07, 2012 at 03:09:11PM +, LI Wei wrote:
 Intuitively maybe we can set the missing values using the average over the
 nearest neighbors calculated using these existing features? Not sure
 whether it is the correct way to do it :-)

That's known as imputation (or in a particular variant, k-NN impute).
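
Recent scikit-learn releases ship exactly this variant as
sklearn.impute.KNNImputer; a minimal usage sketch, with made-up data:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
# each NaN becomes the mean of that feature over the nearest neighbors,
# with distances computed on the observed features
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)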

In general how you treat missing values will depend a lot on your statistical
assumptions, and thus it would be very unwise to have a one size fits all
approach to handling missing data, at least without qualifying it as based
on one assumption or another.

Like the independent-and-identically-distributed assumption, the relevant
assumptions are missing at random (where the assumption is that the
probability of observing a feature is independent of that feature's value)
and missing completely at random (where the assumption is that the
probability of observing a given feature is independent of ALL the features
observed for that training case).

In the case of neural networks, for MAR or MCAR data, simply setting the
feature to zero is not completely crazy, especially when doing stochastic
gradient descent, as the weights update will get multiplied by that zero for
that specific training case. In fact, artificially introducing zeros
(masking noise) is a neat way to encourage robustness for some problems
even when you don't have missing data.  For not-missing-at-random data you'd
need to modify the cost function to incorporate your model of how frequently
and when things drop out, and probably estimate the parameters of that model
simultaneously with the MLP parameters -- not something you can really
prepackage.
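
A minimal masking-noise sketch of that trick (plain numpy; the fraction p
and the seed are arbitrary):

import numpy as np

def add_masking_noise(X, p=0.1, seed=0):
    # zero out a random fraction p of the (standardized) inputs; for
    # MAR/MCAR data this mimics the set-missing-features-to-zero treatment
    rng = np.random.RandomState(seed)
    X_noisy = X.copy()
    X_noisy[rng.uniform(size=X.shape) < p] = 0.0
    return X_noisy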

David




Re: [Scikit-learn-general] Contributing to scikit-learn

2012-06-07 Thread Vandana Bachani
Hi Andreas,

I agree missing data is not specific to MLP.
We dealt with it pretty simply, as you mentioned, by taking the mean over the
dataset for continuous-valued attributes.
Another thing that I feel is not adequately explored in the scikit
implementations is discrete attributes.
Classification problems with discrete input features or a mix of discrete
and continuous features cannot be handled well. Many UCI datasets have a
mix of discrete and continuous attributes.
For discrete attributes we consider the missing values as another kind of
discrete value namely 'UNKNOWN'.

And I mentioned allowing for multiple hidden layers because it's just
a flexibility we would like to give to more advanced users of MLP who might
like to experiment with different numbers of hidden units in case of
difficult problems.

Thanks,
Vandana

On Thu, Jun 7, 2012 at 10:16 AM, eat e.antero.ta...@gmail.com wrote:

 Hi,

 On Thu, Jun 7, 2012 at 6:09 PM, LI Wei li...@ee.cuhk.edu.hk wrote:

 Intuitively maybe we can set the missing values using the average over
 the nearest neighbors calculated using these existing features? Not sure
 whether it is the correct way to do it :-)

 I think the key question is: how reliably one can estimate the mean
 (and variance) here.

 With data sets containing both missing values and outliers, I doubt that
 there exists any simple, generally accepted way to both detect outliers
 (so that their impact on mean and variance is accounted for) and at the
 same time impute missing values.

 However it might be possible to incorporate some domain-specific
 knowledge in order to move on. So, in summary, what kinds of schemes
 exist to add (ad hoc) domain-specific knowledge in a systematic manner
 into the modeling process?


 My 2 cents,
 -eat


 Cheers,
 LI, Wei


 On Thu, Jun 7, 2012 at 12:25 PM, Andreas Mueller 
 amuel...@ais.uni-bonn.de wrote:

  Hi everybody!
 David, it's your project, I'm just trying to help along ;)
 About 2): Afaik there is nothing in sklearn at the moment
 that can deal with missing variables and I feel the MLP
 is one of the estimators where dealing with missing values
 is hardest.
 @David: I wouldn't keep you from trying but it seems a bit
 out of the scope of the MLP. I think the idea for missing data
 was to provide an additional mask as input that says
 which values are missing. Dealing with this is much more natural
 in naive Bayes or tree based methods than in the MLP I think.

 @Vandana: For dealing with missing data, one easy way is to
 set the missing variables to their mean over the dataset.
 Usually for MLPs the input should be zero mean, unit variance.
 So the missing variable would be just set to 0.
 Do you know of any better way of dealing with missing values
 in MLPs?

 Cheers,
 Andy



 On 06/05/2012 07:51 PM, David Marek wrote:

 I think you sent this mail only to me, please send all mails to the mailing
 list. Btw. Andreas is my mentor, so he is the one in charge here :-)

 Ad 1) Afaik all you need is one hidden layer, it's certainly possible to
 add the possibility, but I think we decided that it's not a priority.

 Ad 2) Good idea

 David

 -- Forwarded message --
 From: Vandana Bachani vandana@gmail.com
 Date: Tue, Jun 5, 2012 at 6:59 PM
 Subject: Re: [Scikit-learn-general] Contributing to scikit-learn
 To: h4wk...@gmail.com


 Hi David,
 I think we can add the following also to the to do list:
 1. Any number of hidden layers and hidden units should be supported.
 2. Missing data should be handled (several UCI datasets have missing
 data).

  I will look at the code and then send you a mail about my thoughts on
 the same.

  If you would like to have a look at my project report, I am attaching
 the same.

  Thanks,
 Vandana




Re: [Scikit-learn-general] Contributing to scikit-learn

2012-06-07 Thread David Warde-Farley
On Thu, Jun 07, 2012 at 10:40:32AM -0700, Vandana Bachani wrote:
 Hi Andreas,
 
 I agree missing data is not specific to MLP.
 We dealt with it pretty simply, as you mentioned, by taking the mean over the
 dataset for continuous-valued attributes.
 Another thing that I feel is not adequately explored in the scikit
 implementations is discrete attributes.
 Classification problems with discrete input features or a mix of discrete
 and continuous features cannot be handled well. Many UCI datasets have a
 mix of discrete and continuous attributes.
 For discrete attributes we consider the missing values as another kind of
 discrete value namely 'UNKNOWN'.

How are you encoding the discrete features? As one-hot vectors?

In that case, a natural encoding for unknown is a zero-vector, as the
stochastic gradient step will represent a no-op with respect to all of the
weights for every possible value of that feature. Whether it's sensible
to do *only* this depends, again, on whether the data is assumed
missing-at-random or not.
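
A sketch of that encoding (illustrative only; the helper name and the
explicit category list are assumptions, not an existing API):

import numpy as np

def one_hot_unknown_as_zeros(values, categories):
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        if v in index:
            out[row, index[v]] = 1.0  # known category: one bit set
        # unknown/missing: the row stays all zeros, so the SGD weight
        # update is a no-op for this attribute on this sample
    return out

# e.g. one_hot_unknown_as_zeros(['red', 'UNKNOWN'], ['red', 'green', 'blue'])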

David



Re: [Scikit-learn-general] Contributing to scikit-learn

2012-06-07 Thread Vandana Bachani
Hi David,
Yes, I use one-hot encoding, but my understanding of one-hot encoding says
that each discrete attribute can be represented as a bit pattern. So the
node corresponding to that input attribute is actually a set of nodes
representing that bit pattern. An unknown just means that the bit for the
unknown value is set to one and the rest are set to 0. At any instance the
nodes corresponding to an input attribute will have at least one node with a
value of 1. The downside of using one-hot encoding is that it bloats up the
weight space and the number of input units, but I guess that's OK as this is
one of the best ways of doing discrete attribute classification if we are
to use MLPs.

Thanks,
Vandana

On Thu, Jun 7, 2012 at 11:12 AM, David Warde-Farley 
warde...@iro.umontreal.ca wrote:

 On Thu, Jun 07, 2012 at 10:40:32AM -0700, Vandana Bachani wrote:
  Hi Andreas,
 
  I agree missing data is not specific to MLP.
  We dealt with it pretty simply, as you mentioned, by taking the mean over the
  dataset for continuous-valued attributes.
  Another thing that I feel is not adequately explored in the scikit
  implementations is discrete attributes.
  Classification problems with discrete input features or a mix of discrete
  and continuous features cannot be handled well. Many UCI datasets have a
  mix of discrete and continuous attributes.
  For discrete attributes we consider the missing values as another kind of
  discrete value namely 'UNKNOWN'.

 How are you encoding the discrete features? As one-hot vectors?

 In that case, a natural encoding for unknown is a zero-vector, as the
 stochastic gradient step will represent a no-op with respect to all of the
 weights for every possible value of that feature. Whether it's sensible
 to do *only* this depends, again, on whether the data is assumed
 missing-at-random or not.

 David






-- 
Vandana Bachani
Graduate Student, MSCE
Computer Science & Engineering Department
Texas A&M University, College Station


Re: [Scikit-learn-general] Contributing to scikit-learn

2012-06-06 Thread David Warde-Farley
On 2012-06-05, at 1:51 PM, David Marek h4wk...@gmail.com wrote:

 1) Afaik all you need is one hidden layer,

The universal approximation theorem says that any continuous function can be 
approximated arbitrarily well if you have one hidden layer with enough hidden 
units, but it says nothing about the ease of finding that solution, nor about 
the efficiency of the solution (you can prove that certain functions that can 
be compactly represented by a deep network require exponentially many more 
hidden units if you're restricted to one layer).
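
(For reference, the standard one-hidden-layer statement, e.g. Cybenko's 1989
version for a sigmoidal activation $\sigma$: for any continuous $f$ on
$[0,1]^d$ and any $\varepsilon > 0$ there exist $N$, $v_i$, $w_i$, $b_i$
such that

  $$\sup_{x \in [0,1]^d} \left| f(x) - \sum_{i=1}^{N} v_i \, \sigma(w_i^\top x + b_i) \right| < \varepsilon.$$
)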

However, with purely supervised training deeper networks are harder to fit (you 
can get to about 2 hidden layers if you're careful but beyond that it gets 
quite hard), so I wouldn't worry about it. In a black box implementation for 
scikit-learn, where the user isn't expected to be an expert in training neural 
nets, a single hidden layer is probably plenty.

David


Re: [Scikit-learn-general] Contributing to scikit-learn

2012-06-06 Thread xinfan meng
The deep learning literature says that the more layers you have, the fewer
hidden nodes per layer you need. But I agree one hidden layer would be
sufficient for now.

On Thu, Jun 7, 2012 at 11:12 AM, David Warde-Farley 
warde...@iro.umontreal.ca wrote:

 On 2012-06-05, at 1:51 PM, David Marek h4wk...@gmail.com wrote:

  1) Afaik all you need is one hidden layer,

 The universal approximation theorem says that any continuous function can
 be approximated arbitrarily well if you have one hidden layer with enough
 hidden units, but it says nothing about the ease of finding that solution,
 nor about the efficiency of the solution (you can prove that certain
 functions that can be compactly represented by a deep network require
 exponentially many more hidden units if you're restricted to one layer).

 However, with purely supervised training deeper networks are harder to fit
 (you can get to about 2 hidden layers if you're careful but beyond that it
 gets quite hard), so I wouldn't worry about it. In a black box
 implementation for scikit-learn, where the user isn't expected to be an
 expert in training neural nets, a single hidden layer is probably plenty.

 David





-- 
Best Wishes

Meng Xinfan(蒙新泛)
Institute of Computational Linguistics
Department of Computer Science & Technology
School of Electronic Engineering & Computer Science
Peking University
Beijing, 100871
China


Re: [Scikit-learn-general] Contributing to scikit-learn

2012-06-05 Thread Shreyas Karkhedkar
Hi Gael,

Thanks for the response. Vandana and I are really excited about
contributing to scikits.

I will go through the GMM code and will put in suggestions for refactoring
- and if possible implement some new features.

Once again, on behalf of Vandana and myself, thanks for the reply.

Looking forward to work with you.

Cheers,
Shreyas

On Mon, Jun 4, 2012 at 10:27 PM, Gael Varoquaux 
gael.varoqu...@normalesup.org wrote:

 Hi Vandana and Shreyas,

 Welcome and thanks for the interest,

 With regards to MLP (multi-layer perceptrons), David Marek is right now
 working on such a feature:
 https://github.com/davidmarek/scikit-learn/tree/gsoc_mlp
 you can probably pitch in with him: 4 eyes are always better than only 2.

 With regard to EM for GMM, scikit-learn has an implementation of this
 class of algorithms in sklearn/mixture/gmm.py. This code is a little bit
 outdated and can probably be improved in terms of readability, speed and
 feature set.

 Cheers,

 Gaël

 On Mon, Jun 04, 2012 at 04:31:26PM -0700, Vandana Bachani wrote:
 Hi,
My friend Shreyas and I want to contribute to the scikit-learn code.
 I want to add code for neural networks (Multi-layer Perceptrons) and
Shreyas has some ideas for the Expectation-Maximization algorithm and
 Gaussian Mixture Models. Please let us know how we can contribute to
 the
 code and if we can discuss our ideas with someone on the scikit team
 so
 that we are not reinventing something that is already there.
About me: I am a Computer Science Masters student at Texas A&M
 University
 currently interning at Google. I have a basic version of neural
 networks
 for classification implemented in python as part of my machine
 learning
 class project (works well for UCI Datasets). I am planning to extend
 it
 for regression and optimize it to make it public.
 Shreyas is a PhD student at University of Texas at El Paso and is
 currently interning at Google.
 We are planning to pair program on these ideas to make them scikit
 worthy.
 Thanks,




-- 
Shreyas Ashok Karkhedkar
PhD Candidate
Computer Science
University of Texas at El Paso

email:
sakarkhed...@miners.utep.edu
karkhedkar.shre...@gmail.com

Phone:
+1-240-494-6362


Re: [Scikit-learn-general] Contributing to scikit-learn

2012-06-05 Thread Andreas Mueller

Hi Shreyas.
In particular, the VBGMM and DPGMM might need some attention.
Once you are a bit familiar with the GMM code, you could have a look
at issue 393 https://github.com/scikit-learn/scikit-learn/issues/393.
Any help would be much appreciated :)

Cheers,
Andy


On 05.06.2012 08:07, Shreyas Karkhedkar wrote:

Hi Gael,

Thanks for the response. Vandana and I are really excited about 
contributing to scikits.


I will go through the GMM code and will put in suggestions for 
refactoring - and if possible implement some new features.


Once again, on behalf of Vandana and myself, thanks for the reply.

Looking forward to work with you.

Cheers,
Shreyas

On Mon, Jun 4, 2012 at 10:27 PM, Gael Varoquaux 
gael.varoqu...@normalesup.org wrote:


Hi Vandana and Shreyas,

Welcome and thanks for the interest,

With regards to MLP (multi-layer perceptrons), David Marek is
right now
working on such a feature:
https://github.com/davidmarek/scikit-learn/tree/gsoc_mlp
you can probably pitch in with him: 4 eyes are always better than
only 2.

With regard to EM for GMM, scikit-learn has an implementation
of this
class of algorithms in sklearn/mixture/gmm.py. This code is a
little bit
outdated and can probably be improved in terms of readability,
speed and
feature set.

Cheers,

Gaël

On Mon, Jun 04, 2012 at 04:31:26PM -0700, Vandana Bachani wrote:
Hi,
My friend Shreyas and I want to contribute to the
scikit-learn code.
I want to add code for neural networks (Multi-layer
Perceptrons) and
Shreyas has some ideas for the Expectation-Maximization
algorithm and
Gaussian Mixture Models. Please let us know how we can
contribute to the
code and if we can discuss our ideas with someone on the
scikit team so
that we are not reinventing something that is already there.
About me: I am a Computer Science Masters student at Texas
A&M University
currently interning at Google. I have a basic version of
neural networks
for classification implemented in python as part of my
machine learning
class project (works well for UCI Datasets). I am planning to
extend it
for regression and optimize it to make it public.
Shreyas is a PhD student at University of Texas at El Paso and is
currently interning at Google.
We are planning to pair program on these ideas to make them
scikit worthy.
Thanks,




--
Shreyas Ashok Karkhedkar
PhD Candidate
Computer Science
University of Texas at El Paso

email:
sakarkhed...@miners.utep.edu
karkhedkar.shre...@gmail.com

Phone:
+1-240-494-6362




Re: [Scikit-learn-general] Contributing to scikit-learn

2012-06-04 Thread Gael Varoquaux
Hi Vandana and Shreyas,

Welcome and thanks for the interest,

With regards to MLP (multi-layer perceptrons), David Marek is right now
working on such a feature:
https://github.com/davidmarek/scikit-learn/tree/gsoc_mlp
you can probably pitch in with him: 4 eyes are always better than only 2.

With regard to EM for GMM, scikit-learn has an implementation of this
class of algorithms in sklearn/mixture/gmm.py. This code is a little bit
outdated and can probably be improved in terms of readability, speed and
feature set.

Cheers,

Gaël

On Mon, Jun 04, 2012 at 04:31:26PM -0700, Vandana Bachani wrote:
Hi,
My friend Shreyas and I want to contribute to the scikit-learn code.
I want to add code for neural networks (Multi-layer Perceptrons) and
Shreyas has some ideas for the Expectation-Maximization algorithm and
Gaussian Mixture Models. Please let us know how we can contribute to the
code and if we can discuss our ideas with someone on the scikit team so
that we are not reinventing something that is already there.
About me: I am a Computer Science Masters student at Texas A&M University
currently interning at Google. I have a basic version of neural networks
for classification implemented in python as part of my machine learning
class project (works well for UCI Datasets). I am planning to extend it
for regression and optimize it to make it public.
Shreyas is a PhD student at University of Texas at El Paso and is
currently interning at Google.
We are planning to pair program on these ideas to make them scikit worthy.
Thanks,
