[Scikit-learn-general] My personal suggestion regarding topics for GSoC (and my official application :-) )

2015-03-06 Thread Luca Puggini
Thanks a lot for the material provided on randomized PCA and random forests;
it will surely help me in my research.

I talked with my supervisor and he said that I am free to apply for this
summer project.

I have used GAMs quite a lot, and I did some work related to high-dimensional
fault detection systems (and hence to metrics), but apparently these topics
are already taken.

My understanding from the previous emails is that NIPALS PCA may be of
interest. On the same topic, I have a couple of algorithms that I think
could be useful.

1- Sparse principal component analysis via regularized low rank matrix
approximation.
http://www.sciencedirect.com/science/article/pii/S0047259X07000887
This is basically the equivalent of the NIPALS algorithm for SPCA. It is
more efficient for high-dimensional problems, and it is particularly useful
because it makes it possible to avoid the initial SVD (a rough numpy sketch
of the rank-one step follows after item 2).

2- Feature Subset Selection and Ranking for Data Dimensionality Reduction
http://eprints.whiterose.ac.uk/1947/1/weihl3.pdf .

This is a method for unsupervised feature selection. It is similar to SPCA,
but it is optimized to maximize the percentage of explained variance with
respect to the number of selected variables.
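
To give an idea, here is a loose numpy sketch of the rank-one step behind
suggestion 1, alternating a power step with soft-thresholding in the spirit
of the sPCA-rSVD method of the paper (the penalty value, starting vector,
and iteration count are illustrative choices of mine, not values from the
paper):

import numpy as np

def soft_threshold(z, lam):
    # elementwise soft-thresholding operator
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sparse_pc1(X, lam=0.1, n_iter=200):
    # first sparse loading vector via alternating rank-one updates
    X = X - X.mean(axis=0)                   # center the data
    u = X[:, 0] / np.linalg.norm(X[:, 0])    # arbitrary starting direction
    v = np.zeros(X.shape[1])
    for _ in range(n_iter):
        v = soft_threshold(np.dot(X.T, u), lam)  # sparse loading update
        if not np.any(v):
            return v                         # lam too large: all loadings zero
        u = np.dot(X, v)
        u /= np.linalg.norm(u)               # unit-norm score direction
    return v / np.linalg.norm(v)

Note that no SVD of the full matrix is needed, which is the point for
high-dimensional data.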


If these topics are not of interest, I will be happy to work on:
- Improve GMM, or
- Global optimization-based hyperparameter optimization

I am not familiar with these two topics, but they are close to my research
area, so I would be happy to study them.


My understanding is that the mentors will now contact me to discuss the
various topics further. Please feel free to contact me at my private email;
I am happy to share my CV and my Python code (research-quality code).


Thanks a lot,
Luca


Re: [Scikit-learn-general] GSoC2015 topics

2015-03-06 Thread Andreas Mueller

Thanks for trying to make some time :)


On 03/06/2015 03:42 AM, Arnaud Joly wrote:

Hi,

Sadly this year, I won’t have time for mentoring.
However, I will try to find some spare time for reviewing!

Best regards,
Arnaud



On 05 Mar 2015, at 22:43, Andreas Mueller wrote:


Hi Wei Xue.
Thanks for your interest.
For the GMM project being familiar with DPGMM and VB should be enough.
We don't want to use Gibbs sampling in the DP. If you feel comfortable
implementing a given derivation and have some understanding, that should
be fine.

For hyper-parameter optimization, the idea would be to implement our own
version based on our tree implementation (which is actually also done in
spearmint) or using the new GP.
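
To make this concrete, here is a very rough, self-contained sketch of what
a tree-based surrogate search could look like (the 1-D toy objective, the
search range, and the mean-plus-std acquisition rule are purely
illustrative, not the proposed design):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)

def objective(log_c):
    # stand-in for a cross-validated score of a hyperparameter log10(C)
    return -(log_c - 1.0) ** 2

params = list(rng.uniform(-3, 3, size=5))        # initial random design
scores = [objective(p) for p in params]

for _ in range(20):
    surrogate = RandomForestRegressor(n_estimators=50, random_state=0)
    surrogate.fit(np.array(params).reshape(-1, 1), scores)
    cand = rng.uniform(-3, 3, size=100)          # random candidate points
    # per-tree predictions give a crude mean/std for an optimistic pick
    preds = np.array([t.predict(cand.reshape(-1, 1))
                      for t in surrogate.estimators_])
    nxt = cand[np.argmax(preds.mean(axis=0) + preds.std(axis=0))]
    params.append(nxt)
    scores.append(objective(nxt))

print('best log10(C): %.3f' % params[int(np.argmax(scores))])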


HTH,
Andreas

On 03/05/2015 04:32 PM, Wei Xue wrote:

Hi, all

I am a graduate student studying machine learning, and I will probably
apply for a GSoC project this year. I just took a look at the wiki and
found two topics that interest me:


  * Improve GMM
  * Global optimization based Hyper-parameter optimization

For the GMM topic, I studied DP years ago and implemented a toy
DPGMM using Gibbs sampling in Matlab. I am also familiar with VB. My
question is: do GSoC projects require students to fully understand
the theory of DP?


For the hyper-parameter optimization topic, since there are already
two Python packages, spearmint and Hyperopt, is the goal of this
topic to implement our own modules or to build interfaces for
other packages?



Thanks,
Wei Xue


2015-03-05 14:25 GMT-05:00 Andreas Mueller:


Thanks for volunteering to assist, I updated the wiki
accordingly :)



On 03/05/2015 01:21 PM, Michael Eickenberg wrote:

I unfortunately cannot lead any GSoC project this year, but I can
help out with code review and mentoring if somebody else takes the
lead. The two projects I can be of use for are CCA/PLS
rethinking and additive models.

Michael

On Thursday, March 5, 2015, Andreas Mueller wrote:

Can all would-be mentors please register on Melange?
The list of possible mentors lists Arnaud, probably a C&P from last year.
Arnaud, are you up for mentoring again? Otherwise I'll remove you from
the list.

Then we'd currently have

Gaël Varoquaux (not sure if you have time?), Vlad Niculae, Olivier
Grisel, Alexandre Gramfort, Michael Eickenberg
and me.
Any other volunteers?




On 02/24/2015 09:48 AM, Andy wrote:
> Hey Everybody.
>
> Here is my somewhat consolidated list of ideas with minor comments.
> If anything is missing, please let me know. Also, I don't think people
> who want to mentor spoke up yet.
> I'll remove all people listed on the wiki as they were copy and pasted
> from last year, and I'd rather have actual confirmation.
>
> Topics:
> DPGMM / VBGMM: need to be reimplemented using more standard
> variational updates. The GMM is actually fine atm (after a couple of
> pending PRs)
>
> spearmint : Using random forest (they actually use ours) for
> hyperparameter optimization. I need to mull this over but I think this
> should be easy enough and pretty helpful.
>
> Online low-rank matrix completion : this is from last year and I'm not
> sure if it is still desirable / don't know the state of the PR
>
> Multiple metric support : This is somewhat API heavy but I think
>
> PLS/CCA : They need love so very much, but I'm not sure we have a
> mentor (if there is one, please speak up!)
>
> Ensemble Clusters : Proposed by a possible student (Milton) but I
> think it is interesting.
>
> Semi-Supervised Learning : Meta-estimator for self-taught learning.
> Not sure if there is actually much demand for it, but would be nice.
>
> Additive models: Proposed by ragv, but I'm actually not that sold. We
> could include pyearth, but I'm not sure how valuable the other methods
> are. Including a significant amount of algorithms just for
> completeness is not something I feel great about.
>
>
> That being said, ragv has put in a tremendous amount of great work and
> I feel we should definitely find a project for him (as he seems
> interested).
>
>
> Things that I think shouldn't be GSOC projects:
>
> GPs : Jan Hendrik is doing an awesome job there.
> MLP : Will be finished soon, either by me or possibly by ragv
> data-independent cross-validation : already a bunch of people working
> on that, I don't think we should make it GSOC.
>
> Feedback welcome.
>
> Andy

Re: [Scikit-learn-general] feature names after OneHotEncoder

2015-03-06 Thread Andreas Mueller
I thought you just wanted to mask some features, but I guess that was
not your intent.
You could make your code robust to future changes by using the
feature_indices_ attribute, while assuming that the result first has all
categorical and then all numerical values.
Btw, you might have an easier time using pandas dummy variables
(pd.get_dummies) instead of the one hot encoder.
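
For example (toy data, not your actual columns; note that with the default
n_values='auto' these ranges refer to the encoding before inactive columns
are dropped via active_features_):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[0., 1., 3.5],
              [1., 0., 2.0],
              [2., 1., 0.7]])
enc = OneHotEncoder(categorical_features=[0, 1])   # columns 0 and 1 are ids
enc.fit(X)

# feature_indices_[k]:feature_indices_[k + 1] is the output range of the
# k-th categorical feature (before inactive values are dropped when
# n_values='auto'); the numerical columns are appended after all of them.
for k in range(len(enc.feature_indices_) - 1):
    lo, hi = enc.feature_indices_[k], enc.feature_indices_[k + 1]
    print('categorical feature %d -> encoded columns [%d, %d)' % (k, lo, hi))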



On 03/06/2015 03:01 AM, Eustache DIEMERT wrote:


2015-03-05 16:57 GMT+01:00 Andy:


Well, the columns after the OneHotEncoder correspond to feature
values, not feature names, right?


Well, for the categorical ones this is right, except that not all my 
features are categorical (hence the categorical_features=...) and they 
are intertwined.


So my problem is more to keep track of which categorical features got 
projected into which columns (1->N) and which numerical ones have been 
just copied and where (1->1).


Re-reading your answer, I'm wondering if you're suggesting to just
separate the input columns by feature type and apply the encoder to the
categorical ones only?



There is ``feature_indices_`` which maps each feature to a range
of features in the encoded matrix.
The features in the input matrix don't really have names in
scikit-learn, as they are represented only as numpy matrices.
So you need to keep track of the indices of each feature. That
shouldn't be too hard, though.

Why don't you select the features before the encoding? Or do you
want to exclude some values?



On 03/05/2015 05:55 AM, Eustache DIEMERT wrote:

Hi list,

I have an X (np.array) with some columns containing ids. I also
have a list of column names. Then I want to transform the
relevant columns to be used by a logistic regression model using
OneHotEncoder:

>>> X = np.loadtxt(...) # from a CSV
>>> col_names = ... # from CSV header
>>> e = OneHotEncoder(categorical_features=id_columns)
>>> Xprime = e.fit_transform(X)

But then I don't know how to deduce the names of the columns in
the new matrix :(

Ideally I would want the same as DictVectorizer which has a
feature_names_ member.

Anyone already had this problem?

Eustache





Re: [Scikit-learn-general] My personal suggestion regarding topics for GSoC

2015-03-06 Thread Gilles Louppe
Yes, in fact I did something similar in my thesis. See section 7.2 for
a discussion about this. Figure 7.5 is similar to what you describe in
your sample code. By varying the depth, you can basically control the
bias.
http://orbi.ulg.ac.be/bitstream/2268/170309/1/thesis.pdf

On 6 March 2015 at 13:50, Luca Puggini wrote:
> After a small simulation study, I agree with the previous comment:
> with the extra-trees classifier it is possible to reduce the bias.
>
> Despite that, the result is still biased.
>
> Here is the sample code:
> http://jpst.it/x9Mv
>
> Here is a possible reference:
> http://www.biomedcentral.com/1471-2105/8/25
>
> Please let me know if you are aware of other, better papers on the topic
> :-)
>
> I hope this can help.
>
> Best,
> Luca


[Scikit-learn-general] My personal suggestion regarding topics for GSoC

2015-03-06 Thread Luca Puggini
After a small simulation study, I agree with the previous comment:
with the extra-trees classifier it is possible to reduce the bias.

Despite that, the result is still biased.

Here is the sample code:
http://jpst.it/x9Mv

Here is a possible reference:
http://www.biomedcentral.com/1471-2105/8/25

Please let me know if you are aware of other, better papers on the topic
:-)

I hope this can help.

Best,
Luca


[Scikit-learn-general] My personal suggestion regarding topics for GSoC

2015-03-06 Thread Luca Puggini
Hi,
thanks a lot, I was not aware of the randomized PCA.

Regarding random forests, is there any paper or resource that you can
suggest?
I tried to use the forest with max_features=1, but it was still biased.

I did not try with a limited depth.

Thanks a lot,
Luca


Re: [Scikit-learn-general] My personal suggestion regarding topics for GSoC

2015-03-06 Thread Gilles Louppe
Hi Luca,

On 6 March 2015 at 11:09, Luca Puggini wrote:
> Hi,
> It seems to me that you are discussing topics that can be introduced in
> sklearn with GSoC.
>
> I use sklearn quite a lot, and there are a couple of things that I really
> miss in this library:
>
> 1- NIPALS PCA.
> The current version of PCA is too slow for high-dimensional datasets.
> Suppose you have a very large number of variables and are interested in
> only the first 10 principal components. In a situation like this, NIPALS
> PCA is much more efficient. Other algorithms like PLS can also improve
> their computational performance with NIPALS PCA.
>
> 2- Something to rank the variables
> At the moment it seems to me that the only way to rank the variables is the
> Random Forest importance. This method is known to be severely biased. I
> suggest something like the method implemented in the R library party.

Just commenting on this: the bias only depends on how you
construct the forest. If you build a forest of totally randomized
trees and limit their depth (e.g.,
ExtraTreesClassifier(max_features=1, max_depth=5)), then you will
correct for most of the biases in the resulting importances.
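
A minimal sketch (synthetic data; the number of trees and the toy signal
are only illustrative):

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # only features 0, 1 matter

forest = ExtraTreesClassifier(n_estimators=500, max_features=1,
                              max_depth=5, random_state=0)
forest.fit(X, y)
print(forest.feature_importances_)   # the informative ones should dominate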

Gilles

>
>
> I hope that these comments can help.
> I may decide to apply for GSoC as well :-)
>
> Best,
> Luca
>
>


Re: [Scikit-learn-general] My personal suggestion regarding topics for GSoC

2015-03-06 Thread Michael Eickenberg
On Fri, Mar 6, 2015 at 11:09 AM, Luca Puggini wrote:

> Hi,
> It seems to me that you are discussing topics that can be introduced in
> sklearn with GSoC.
>
> I use sklearn quite a lot, and there are a couple of things that I really
> miss in this library:
>
> 1- NIPALS PCA.
> The current version of PCA is too slow for high-dimensional datasets.
> Suppose you have a very large number of variables and are interested in
> only the first 10 principal components. In a situation like this, NIPALS
> PCA is much more efficient. Other algorithms like PLS can also improve
> their computational performance with NIPALS PCA.
>
>
PCA does an SVD, whose complexity depends on the shorter side of the
matrix. If you have n=100 samples and a much larger p, the complexity is
O(n^2 * p). However, if both dimensions are high, it is true that a
decomposition that only calculates the required number of components
becomes necessary. RandomizedPCA
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/pca.py#L468
does this using random projections; NIPALS would be an alternative.
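
For instance (toy shapes, just to illustrate the call):

import numpy as np
from sklearn.decomposition import RandomizedPCA

rng = np.random.RandomState(0)
X = rng.randn(100, 10000)            # few samples, many variables
pca = RandomizedPCA(n_components=10, random_state=0)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)               # (100, 10)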

PLS already uses nipals
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_decomposition/pls_.py#L22

In the context of a refactoring of PLS/CCA, there could also be an
evaluation of the existing NIPALS code for use in PCA.


> 2- Something to rank the variables
> At the moment it seems to me that the only way to rank the variables is
> the Random Forest importance. This method is known to be severely biased.
> I suggest something like the method implemented in the R library party.
>
>
Could you elaborate?


>
> I hope that these comments can help.
> I may decide to apply for GSoC as well :-)
>
> Best,
> Luca
>
>
>


[Scikit-learn-general] My personal suggestion regarding topics for GSoC

2015-03-06 Thread Luca Puggini
Hi,
It seems to me that you are discussing topics that can be introduced in
sklearn with GSoC.

I use sklearn quite a lot, and there are a couple of things that I really
miss in this library:

1- NIPALS PCA.
The current version of PCA is too slow for high-dimensional datasets.
Suppose you have a very large number of variables and are interested in
only the first 10 principal components. In a situation like this, NIPALS
PCA is much more efficient. Other algorithms like PLS can also improve
their computational performance with NIPALS PCA. (A bare-bones sketch of
the algorithm follows after item 2.)

2- Something to rank the variables
At the moment it seems to me that the only way to rank the variables is the
Random Forest importance. This method is known to be severely biased. I
suggest something like the method implemented in the R library party.
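
To illustrate what I mean in point 1, here is a bare-bones numpy sketch of
NIPALS PCA with deflation (the tolerance, starting vector, and iteration
cap are arbitrary choices on my part, not a proposed implementation):

import numpy as np

def nipals_pca(X, n_components=10, tol=1e-8, max_iter=500):
    # leading principal components by NIPALS with deflation
    X = X - X.mean(axis=0)                       # center once
    T = np.empty((X.shape[0], n_components))     # scores
    P = np.empty((X.shape[1], n_components))     # loadings
    for k in range(n_components):
        t = X[:, 0].copy()                       # arbitrary starting score
        for _ in range(max_iter):
            p = np.dot(X.T, t) / np.dot(t, t)    # loading for current score
            p /= np.linalg.norm(p)
            t_new = np.dot(X, p)                 # updated score
            if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
                t = t_new
                break
            t = t_new
        X = X - np.outer(t, p)                   # deflate before next component
        T[:, k], P[:, k] = t, p
    return T, P

Only the first n_components are ever computed, which is where the savings
come from when p is very large.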


I hope that these comments can help.
I may decide to apply for GSoC as well :-)

Best,
Luca


Re: [Scikit-learn-general] feature names after OneHotEncoder

2015-03-06 Thread Eustache DIEMERT
Well, after a bit of tinkering, it seems that OneHotEncoder follows simple
rules to assign columns to the output:
1) first the categorical ones, in the order given by the argument, creating
as many columns as needed by the values
2) then the numerical ones

So a piece of code like this seems to work:

>>> fn, fc = [], []
>>> for c in df.columns.values:
...     if is_categorical(c):
...         fn += sorted('%s=%s' % (c, v) for v in df[c].unique())
...     else:
...         fc.append(c)
>>> fn += fc  # categorical names first, then the numerical ones

assuming the original data is in a pandas DataFrame (df) and you have a
predicate is_categorical(c) telling whether a column is categorical.

Of course this is pretty much reverse-engineering the OHE and may break in
the future.

E/


2015-03-06 9:01 GMT+01:00 Eustache DIEMERT:

>
> 2015-03-05 16:57 GMT+01:00 Andy:
>
>>  Well, the columns after the OneHotEncoder correspond to feature values,
>> not feature names, right?
>>
>
> Well, for the categorical ones this is right, except that not all my
> features are categorical (hence the categorical_features=...) and they
> are intertwined.
>
> So my problem is more to keep track of which categorical features got
> projected into which columns (1->N) and which numerical ones have been just
> copied and where (1->1).
>
> Re-reading your answer, I'm wondering if you're suggesting to just separate
> the input columns by feature type and apply the encoder to the categorical
> ones only?
>
>
>
>> There is ``feature_indices_`` which maps each feature to a range of
>> features in the encoded matrix.
>> The features in the input matrix don't really have names in scikit-learn,
>> as they are represented only as numpy matrices.
>> So you need to keep track of the indices of each feature. That shouldn't
>> be too hard, though.
>>
>> Why don't you select the features before the encoding? Or do you want to
>> exclude some values?
>>
>>
>>
>> On 03/05/2015 05:55 AM, Eustache DIEMERT wrote:
>>
>> Hi list,
>>
>> I have an X (np.array) with some columns containing ids. I also have a
>> list of column names. Then I want to transform the relevant columns to be
>> used by a logistic regression model using OneHotEncoder:
>>
>>  >>> X = np.loadtxt(...) # from a CSV
>> >>> col_names = ... # from CSV header
>>  >>> e = OneHotEncoder(categorical_features=id_columns)
>> >>> Xprime = e.fit_transform(X)
>>
>>  But then I don't know how to deduce the names of the columns in the new
>> matrix :(
>>
>>  Ideally I would want the same as DictVectorizer which has a
>> feature_names_ member.
>>
>>  Anyone already had this problem?
>>
>>  Eustache
>>
>>


Re: [Scikit-learn-general] GSoC2015 topics

2015-03-06 Thread Arnaud Joly
Hi,

Sadly this year, I won’t have time for mentoring.
However, I will try to find some spare time for reviewing!

Best regards,
Arnaud



> On 05 Mar 2015, at 22:43, Andreas Mueller wrote:
> 
> Hi Wei Xue.
> Thanks for your interest.
> For the GMM project being familiar with DPGMM and VB should be enough.
> We don't want to use Gibbs sampling in the DP. If you feel comfortable 
> implementing
> a given derivation and have some understanding, that should be fine.
> 
> For hyper-parameter optimization, the idea would be to implement our own 
> version based on
> our tree implementation (which is actually also done in spearmint) or using 
> the new GP.
> 
> HTH,
> Andreas
> 
> On 03/05/2015 04:32 PM, Wei Xue wrote:
>> Hi, all
>> 
>> I am a graduate student studying machine learning, and I will probably
>> apply for a GSoC project this year. I just took a look at the wiki and
>> found two topics that interest me:
>> Improve GMM
>> Global optimization based Hyper-parameter optimization
>> For the GMM topic, I studied DP years ago and implemented a toy DPGMM
>> using Gibbs sampling in Matlab. I am also familiar with VB. My question
>> is: do GSoC projects require students to fully understand the theory of DP?
>> 
>> For the hyper-parameter optimization topic, since there are already two
>> Python packages, spearmint and Hyperopt, is the goal of this topic to
>> implement our own modules or to build interfaces for other packages?
>> 
>> 
>> Thanks,
>> Wei Xue
>> 
>> 
>> 2015-03-05 14:25 GMT-05:00 Andreas Mueller:
>> Thanks for volunteering to assist, I updated the wiki accordingly :)
>> 
>> 
>> 
>> On 03/05/2015 01:21 PM, Michael Eickenberg wrote:
>>> I unfortunately cannot lead any GSoC project this year, but I can help out
>>> with code review and mentoring if somebody else takes the lead. The two
>>> projects I can be of use for are CCA/PLS rethinking and additive models.
>>> 
>>> Michael
>>> 
>>> On Thursday, March 5, 2015, Andreas Mueller wrote:
>>> Can all would-be mentors please register on Melange?
>>> The list of possible mentors lists Arnaud, probably a C&P from last year.
>>> Arnaud, are you up for mentoring again? Otherwise I'll remove you from
>>> the list.
>>> 
>>> Then we'd currently have
>>> 
>>> Gaël Varoquaux (not sure if you have time?), Vlad Niculae, Olivier
>>> Grisel, Alexandre Gramfort, Michael Eickenberg
>>> and me.
>>> Any other volunteers?
>>> 
>>> 
>>> 
>>> 
>>> On 02/24/2015 09:48 AM, Andy wrote:
>>> > Hey Everybody.
>>> >
>>> > Here is my somewhat consolidated list of ideas with minor comments.
>>> > If anything is missing, please let me know. Also, I don't think people
>>> > who want to mentor spoke up yet.
>>> > I'll remove all people listed on the wiki as they were copy and pasted
>>> > from last year, and I'd rather have actual confirmation.
>>> >
>>> > Topics:
>>> > DPGMM / VBGMM:  need to be reimplemented using more standard
>>> > variational updates. The GMM is actually fine atm (after a couple of
>>> > pending PRs)
>>> >
>>> > spearmint : Using random forest (they actually use ours) for
>>> > hyperparameter optimization. I need to mull this over but I think this
>>> > should be easy enough and pretty helpful.
>>> >
>>> > Online low-rank matrix completion : this is from last year and I'm not
>>> > sure if it is still desirable / don't know the state of the PR
>>> >
>>> > Multiple metric support : This is somewhat API heavy but I think
>>> >
>>> > PLS/CCA : They need love so very much, but I'm not sure we have a
>>> > mentor (if there is one, please speak up!)
>>> >
>>> > Ensemble Clusters : Proposed by a possible student (Milton) but I
>>> > think it is interesting.
>>> >
>>> > Semi-Supervised Learning : Meta-estimator for self-taught learning.
>>> > Not sure if there is actually much demand for it, but would be nice.
>>> >
>>> > Additive models: Proposed by ragv, but I'm actually not that sold. We
>>> > could include pyearth, but I'm not sure how valuable the other methods
>>> > are. Including a significant amount of algorithms just for
>>> > completeness is not something I feel great about.
>>> >
>>> >
>>> > That being said, ragv has put in a tremendous amount of great work and
>>> > I feel we should definitely find a project for him (as he seems
>>> > interested).
>>> >
>>> >
>>> > Things that I think shouldn't be GSOC projects:
>>> >
>>> > GPs : Jan Hendrik is doing an awesome job there.
>>> > MLP : Will be finished soon, either by me or possibly by ragv
>>> > data-independent cross-validation : already a bunch of people working
>>> > on that, I don't think we should make it GSOC.
>>> >
>>> > Feedback welcome.
>>> >
>>> > Andy

Re: [Scikit-learn-general] feature names after OneHotEncoder

2015-03-06 Thread Eustache DIEMERT
2015-03-05 16:57 GMT+01:00 Andy:

>  Well, the columns after the OneHotEncoder correspond to feature values,
> not feature names, right?
>

Well, for the categorical ones this is right, except that not all my
features are categorical (hence the categorical_features=...) and they are
intertwined.

So my problem is more to keep track of which categorical features got
projected into which columns (1->N) and which numerical ones have been just
copied and where (1->1).

Re-reading your answer, I'm wondering if you're suggesting to just separate
the input columns by feature type and apply the encoder to the categorical
ones only?



> There is ``feature_indices_`` which maps each feature to a range of
> features in the encoded matrix.
> The features in the input matrix don't really have names in scikit-learn,
> as they are represented only as numpy matrices.
> So you need to keep track of the indices of each feature. That shouldn't
> be too hard, though.
>
> Why don't you select the features before the encoding? Or do you want to
> exclude some values?
>
>
>
> On 03/05/2015 05:55 AM, Eustache DIEMERT wrote:
>
> Hi list,
>
>  I have an X (np.array) with some columns containing ids. I also have a
> list of column names. Then I want to transform the relevant columns to be
> used by a logistic regression model using OneHotEncoder:
>
>  >>> X = np.loadtxt(...) # from a CSV
> >>> col_names = ... # from CSV header
>  >>> e = OneHotEncoder(categorical_features=id_columns)
> >>> Xprime = e.fit_transform(X)
>
>  But then I don't know how to deduce the names of the columns in the new
> matrix :(
>
>  Ideally I would want the same as DictVectorizer which has a
> feature_names_ member.
>
>  Anyone already had this problem?
>
>  Eustache
>