Re: [Scikit-learn-general] normalize makes no difference to Lasso

2013-06-01 Thread Alexandre Gramfort
hi, try this:
---
from sklearn import datasets, linear_model
d = datasets.load_diabetes()
print linear_model.Lasso(normalize=True).fit(d['data'], d['target']).coef_
print linear_model.Lasso(normalize=False).fit(2. * d['data'], d['target']).coef_
returns: [ 0. -0.

[Scikit-learn-general] normalize makes no difference to Lasso

2013-06-01 Thread o m
Alexandre, my bad completely, and I apologize for taking up your time. I was mixing up normalize with standardize, which is why none of it made sense. Thanks. Best Regards.
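
For readers who run into the same mix-up, here is a minimal NumPy sketch of the difference. It assumes the behaviour documented for the `normalize` option at the time (centre each column, then divide by its l2 norm) and takes "standardize" to mean rescaling each column to zero mean and unit variance. The toy matrix is invented for illustration.

```python
import numpy as np

# Toy design matrix, invented for illustration.
X = np.array([[1.0, 10.0],
              [2.0, 40.0],
              [4.0, 20.0],
              [5.0, 60.0]])
n_samples = X.shape[0]

centered = X - X.mean(axis=0)

# "Normalize" as documented for Lasso(normalize=True) at the time:
# centre each column, then divide by its l2 norm.
normalized = centered / np.linalg.norm(centered, axis=0)

# "Standardize": zero mean and unit variance per column.
standardized = centered / X.std(axis=0)

# The two scalings differ only by a constant factor of sqrt(n_samples).
same_up_to_factor = np.allclose(standardized, normalized * np.sqrt(n_samples))
print(same_up_to_factor)
```

Because the two scalings differ only by a constant factor, they lead to the same fitted model only if the penalty strength is rescaled accordingly.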

Re: [Scikit-learn-general] GridSearch with sample_weights

2013-06-01 Thread Joel Nothman
I haven't seen any patch for this precisely, though it's a known issue (even if it doesn't seem to be explicitly ticketed; it's closest to https://github.com/scikit-learn/scikit-learn/issues/1179). There are various tricky cases not currently supported for which it's easiest to roll your own
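
A rough sketch of what "rolling your own" could look like: a hand-written search loop that passes the weights both to `fit` and to the scoring metric. It assumes a recent scikit-learn (the `sklearn.model_selection` module did not exist in 2013), and the data, the weights, and the helper name `weighted_grid_search` are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, ParameterGrid

# Synthetic data and weights, invented for illustration.
rng = np.random.RandomState(0)
X = rng.randn(60, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(60)
sample_weight = rng.uniform(0.5, 2.0, size=60)


def weighted_grid_search(estimator_cls, grid, X, y, w, n_splits=3):
    """Hypothetical helper: return (best_params, best_score).

    The weights are sliced per fold and used for fitting and for scoring.
    The score is a weighted mean squared error, so lower is better.
    """
    best_params, best_score = None, np.inf
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for params in ParameterGrid(grid):
        fold_scores = []
        for train, test in cv.split(X):
            model = estimator_cls(**params)
            model.fit(X[train], y[train], sample_weight=w[train])
            pred = model.predict(X[test])
            fold_scores.append(
                mean_squared_error(y[test], pred, sample_weight=w[test])
            )
        score = float(np.mean(fold_scores))
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = weighted_grid_search(
    Ridge, {"alpha": [0.01, 1.0, 100.0]}, X, y, sample_weight
)
print(best_params, best_score)
```

Keeping the loop explicit makes it easy to see that each fold gets its own slice of the weights, which is the part a generic search utility has to get right.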

[Scikit-learn-general] To standardize is the question ...

2013-06-01 Thread o m
I've been playing around with Lasso and Lars, but there's something that bothers me about standardization. If I don't standardize to N(0, 1), these procedures indicate that a certain set of variables are the most important. Yet, if I standardize, I get a completely different set of variables. As

Re: [Scikit-learn-general] To standardize is the question ...

2013-06-01 Thread Gilles Louppe
Hi, The main question is, what is your definition of an important variable? Gilles On 1 June 2013 14:22, o m oda...@gmail.com wrote: I've been playing around with Lasso and Lars, but there's something that bothers me about standardization. If I don't standardize to N(0, 1), these procedures

Re: [Scikit-learn-general] GridSearch with sample_weights

2013-06-01 Thread Andreas Mueller
On 06/01/2013 01:03 PM, Joel Nothman wrote: I haven't seen any patch for this precisely, though it's a known issue (even if it doesn't seem to be explicitly ticketed; it's closest to https://github.com/scikit-learn/scikit-learn/issues/1179). There are various tricky cases not currently

Re: [Scikit-learn-general] To standardize is the question ...

2013-06-01 Thread Gael Varoquaux
Hi, Unfortunately, statistics is not magic, and there are many situations in which l1 recovery is not guaranteed to work. I cannot give magic answers, and I suggest that you think a lot about how you can validate any findings using external sources. That said, I would suggest, in general, to

[Scikit-learn-general] To standardize is the question ...

2013-06-01 Thread o m
The main question is, what is your definition of an important variable? Gilles That's a good question;-) Seriously. I would define it - with many closely related variables - as a member of a set that gives you the best predictability. LARS and LASSO with cross validation provide a good story

Re: [Scikit-learn-general] To standardize is the question ...

2013-06-01 Thread Andreas Mueller
On 06/01/2013 07:51 PM, o m wrote: The main question is, what is your definition of an important variable? Gilles That's a good question;-) Seriously. I would define it - with many closely related variables - as a member of a set that gives you the best predictability. LARS and LASSO

[Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-01 Thread Christian Jauvin
Hi, I asked a (perhaps too vague?) question about the use of Random Forests with a mix of categorical and lexical features on two ML forums (stats.SE and MetaOp), but since it has received no attention, I figured that it might work better on this list (I'm using sklearn's RF of course): I'm

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-01 Thread Philipp Singer
Hi Christian, Some time ago I had similar problems. I.e., I wanted to use additional features alongside my lexical features, and simple concatenation didn't work that well for me, even though both feature sets on their own performed pretty well. You can follow the discussion about my problem here [1]
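
For reference, the "simple concatenation" baseline discussed in this thread can be sketched as follows. This is only an illustration of stacking the two feature blocks side by side, assuming a recent scikit-learn and SciPy; the texts, the `source` field, and the labels are invented, and it is not the setup either poster actually used.

```python
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy data: one free-text field and one categorical field per sample.
texts = [
    "cheap flights to paris",
    "football match tonight",
    "paris hotel deals",
    "tennis final results",
]
categorical = [
    {"source": "ad"},
    {"source": "news"},
    {"source": "ad"},
    {"source": "news"},
]
labels = [0, 1, 0, 1]

# Lexical block: sparse TF-IDF matrix.
X_text = TfidfVectorizer().fit_transform(texts)

# Categorical block: DictVectorizer one-hot encodes string-valued fields.
X_cat = DictVectorizer().fit_transform(categorical)

# Simple concatenation: stack the two sparse blocks column-wise.
X = hstack([X_text, X_cat]).tocsr()

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
predictions = clf.predict(X)
print(X.shape, predictions)
```

With many lexical columns and few categorical ones, each tree split samples mostly from the lexical block, which is one plausible reason concatenation can underperform.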

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-01 Thread Andreas Mueller
On 06/01/2013 08:30 PM, Christian Jauvin wrote: Hi, I asked a (perhaps too vague?) question about the use of Random Forests with a mix of categorical and lexical features on two ML forums (stats.SE and MetaOp), but since it has received no attention, I figured that it might work better on

[Scikit-learn-general] Clustering of Text Documents

2013-06-01 Thread Harold Nguyen
Hi all, I was wondering if anyone can point me to a tutorial on clustering text documents, but then also displaying the results in a graph? I see some examples on clustering text documents, but I'd like to be able to visualize the clusters. Any help would be appreciated! Thank you, Harold

Re: [Scikit-learn-general] To standardize is the question ...

2013-06-01 Thread o m
Andy, on reading your tip, and reflecting on what I do, I'm tempted to claim that standardization is very important, regardless ... Assume x0 is very important but has a tiny range (-1/100, 1/100) - all other variables being significantly larger in range. Lars/Lasso will drop x0 until the end,
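
The scenario described above can be reproduced with a small synthetic sketch. It assumes a recent scikit-learn; the coefficients, ranges, noise level, and `alpha` are all invented for illustration, so it shows one constructed case rather than a general result.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n = 200

# x0 drives most of the signal but has a tiny range; x1 has a large range.
x0 = rng.uniform(-0.01, 0.01, size=n)
x1 = rng.uniform(-10.0, 10.0, size=n)
X = np.column_stack([x0, x1])
y = 500.0 * x0 + 0.2 * x1 + 0.1 * rng.randn(n)

# Same penalty strength on raw and on standardized features.
raw = Lasso(alpha=0.5).fit(X, y)
scaled = Lasso(alpha=0.5).fit(StandardScaler().fit_transform(X), y)

print("raw coefficients:   ", raw.coef_)
print("scaled coefficients:", scaled.coef_)
```

On the raw data, x0 would need a coefficient in the hundreds to matter, and the l1 penalty makes that too expensive, so its coefficient is set to zero. After standardization both columns compete on the same scale and x0 receives the larger coefficient.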

Re: [Scikit-learn-general] GridSearch with sample_weights

2013-06-01 Thread Joel Nothman
Ahh and I'd forgotten that 1574 included support in grid search. I should perhaps take a look at that. On Sun, Jun 2, 2013 at 1:10 AM, Andreas Mueller amuel...@ais.uni-bonn.de wrote: On 06/01/2013 01:03 PM, Joel Nothman wrote: I haven't seen any patch for this precisely, though it's a known

[Scikit-learn-general] How to present parameter search results

2013-06-01 Thread Joel Nothman
TL;DR: a list of `namedtuple`s is a poor solution for parameter search results; here I suggest better alternatives. I would like to draw some attention to #1787 which proposes that structured arrays be used to return parameter search (e.g. GridSearchCV) results. A few proposals have sought
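
To make the contrast concrete, here is a small NumPy sketch of the same results held as a list of `namedtuple`s and as a structured array. The field names and scores are invented, and this does not reproduce the interface proposed in #1787. It only shows the kind of column-wise access a structured array allows.

```python
from collections import namedtuple

import numpy as np

# Hypothetical search results; field names and numbers are invented.
Result = namedtuple("Result", ["C", "kernel", "mean_score", "std_score"])
as_tuples = [
    Result(0.1, "linear", 0.81, 0.03),
    Result(1.0, "linear", 0.86, 0.02),
    Result(1.0, "rbf", 0.90, 0.04),
    Result(10.0, "rbf", 0.88, 0.05),
]

# The same records as a structured array with one named field per column.
dtype = [("C", "f8"), ("kernel", "U10"), ("mean_score", "f8"), ("std_score", "f8")]
results = np.array([tuple(r) for r in as_tuples], dtype=dtype)

# Column-wise queries without looping over records.
best = results[np.argmax(results["mean_score"])]
rbf_scores = results[results["kernel"] == "rbf"]["mean_score"]
ranked = np.sort(results, order="mean_score")[::-1]

print(best)
print(rbf_scores)
print(ranked["C"])
```

With the list of tuples, each of these queries needs an explicit loop or comprehension; with the structured array they are single indexing expressions.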

Re: [Scikit-learn-general] Multilabel sequences of sequences considered harmful

2013-06-01 Thread Joel Nothman
On Sun, Jun 2, 2013 at 1:35 PM, Mathieu Blondel math...@mblondel.org wrote: Sorry for the late answer. It's hard for me to keep track of all the design-related discussions lately. No worries. Thanks for the reply! For me, the advantages of the sequences of sequences format are: - they are
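
For context, the two multilabel formats under discussion can be compared with a short sketch. It uses `MultiLabelBinarizer`, which was added to scikit-learn after this thread, so it illustrates the formats rather than anything available to the posters at the time. The label names are invented.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# "Sequence of sequences" format: one tuple of labels per sample.
y_seqs = [("news", "sports"), ("politics",), (), ("news",)]

# Label indicator matrix format: one row per sample, one column per label.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_seqs)

print(mlb.classes_)
print(Y)

# The conversion is reversible, including for the sample with no labels.
round_trip = mlb.inverse_transform(Y)
print(round_trip)
```

The indicator matrix has a fixed shape and makes the set of known labels explicit, while the sequence format is more compact to write by hand.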