Re: [Scikit-learn-general] Multilabel sequences of sequences considered harmful

2013-06-01 Thread Mathieu Blondel
On Sun, Jun 2, 2013 at 1:44 PM, Joel Nothman wrote: > From the sounds of things, it would be easier and probably more efficient > to just always convert to dense binarized matrices, unless we have a good > case for requiring sparse handling of labels. In particular, scipy.sparse > does not currently …
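A minimal sketch of the dense binarized form being discussed, using plain numpy rather than any scikit-learn converter (the toy targets are invented for illustration):

---
import numpy as np

# multilabel targets in the sequences-of-sequences format
y_seq = [[0, 2], [1], [], [0, 1, 2]]

# collect the label set and build a dense binary indicator matrix
classes = sorted({label for labels in y_seq for label in labels})
index = {label: j for j, label in enumerate(classes)}

Y = np.zeros((len(y_seq), len(classes)), dtype=int)
for i, labels in enumerate(y_seq):
    for label in labels:
        Y[i, index[label]] = 1

print(Y)
# [[1 0 1]
#  [0 1 0]
#  [0 0 0]
#  [1 1 1]]
---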

Re: [Scikit-learn-general] Multilabel sequences of sequences considered harmful

2013-06-01 Thread Joel Nothman
On Sun, Jun 2, 2013 at 1:35 PM, Mathieu Blondel wrote: > Sorry for the late answer. It's hard for me to keep track of all the > design-related discussions lately. No worries. Thanks for the reply! > For me, the advantages of the sequences of sequences format are: > - they are quite natural from a user point of view …

Re: [Scikit-learn-general] Multilabel sequences of sequences considered harmful

2013-06-01 Thread Mathieu Blondel
Hi Joel, Sorry for the late answer. It's hard for me to keep track of all the design-related discussions lately. For me, the advantages of the sequences of sequences format are: - they are quite natural from a user point of view (although, as you said, an array of sets would be technically better) …
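To make the two representations concrete (toy labels, purely illustrative):

---
# sequences-of-sequences format: one sequence of labels per sample,
# with an empty sequence for samples that have no labels
y_seq = [["sports", "news"], ["sports"], []]

# the "array of sets" alternative mentioned above
y_sets = [set(labels) for labels in y_seq]

# sets make membership tests and de-duplication explicit
assert "news" in y_sets[0]
---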

[Scikit-learn-general] How to present parameter search results

2013-06-01 Thread Joel Nothman
TL;DR: a list of `namedtuple`s is a poor solution for parameter search results; here I suggest better alternatives. I would like to draw some attention to #1787 which proposes that structured arrays be used to return parameter search (e.g. GridSearchCV) results. A few proposals have sought additional …
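As a rough illustration of what a structured-array result could look like (the field names and scores here are invented, not the layout proposed in #1787):

---
import numpy as np

# hypothetical search results: one record per parameter setting
results = np.array(
    [(0.1, 'rbf', 0.81, 0.03),
     (1.0, 'rbf', 0.89, 0.02),
     (1.0, 'linear', 0.85, 0.04)],
    dtype=[('C', float), ('kernel', 'U10'),
           ('mean_score', float), ('std_score', float)],
)

# structured arrays allow column access and sorting by any field
best = results[np.argmax(results['mean_score'])]
print(best['C'], best['kernel'], best['mean_score'])
---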

Re: [Scikit-learn-general] GridSearch with sample_weights

2013-06-01 Thread Joel Nothman
Ahh, and I'd forgotten that #1574 included support in grid search. I should perhaps take a look at that. On Sun, Jun 2, 2013 at 1:10 AM, Andreas Mueller wrote: > On 06/01/2013 01:03 PM, Joel Nothman wrote: > > I haven't seen any patch for this precisely, though it's a known issue > > (even if it doesn't seem to be explicitly ticketed; …

Re: [Scikit-learn-general] To standardize is the question ...

2013-06-01 Thread o m
Andy, on reading your tip, and reflecting on what I do, I'm tempted to claim that standardization is very important, regardless ... Assume x0 is very important but has a tiny range (-1/100, 1/100) - all other variables being significantly larger in range. Lars/Lasso will drop x0 until the end, …
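A quick synthetic check of that scenario (the data generation is invented for illustration; only the with/without scaling contrast matters):

---
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n = 200
x0 = rng.uniform(-0.01, 0.01, n)    # important feature with a tiny range
x1 = rng.uniform(-10, 10, n)        # large-range, less relevant feature
X = np.column_stack([x0, x1])
y = 100 * x0 + 0.1 * x1 + rng.normal(scale=0.1, size=n)

# without standardization, x0 needs a huge coefficient, so the L1 penalty zeroes it
print(Lasso(alpha=0.1).fit(X, y).coef_)

# after scaling to unit variance, x0's contribution shows up again
X_std = StandardScaler().fit_transform(X)
print(Lasso(alpha=0.1).fit(X_std, y).coef_)
---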

[Scikit-learn-general] Clustering of Text Documents

2013-06-01 Thread Harold Nguyen
Hi all, I was wondering if anyone can point me to a tutorial on clustering text documents, but then also displaying the results in a graph? I see some examples on clustering text documents, but I'd like to be able to visualize the clusters. Any help would be appreciated! Thank you, Harold
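A minimal sketch of one way to do it, assuming a tf-idf + k-means + 2D-projection pipeline (the corpus and parameters are placeholders, not a recommendation from the thread):

---
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "dogs and cats",
        "stock markets fell today", "shares and bonds rallied"]

X = TfidfVectorizer().fit_transform(docs)           # sparse tf-idf matrix
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)

# project the high-dimensional tf-idf space down to 2D for display
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.show()
---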

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-01 Thread Andreas Mueller
On 06/01/2013 08:30 PM, Christian Jauvin wrote: > Hi, > > I asked a (perhaps too vague?) question about the use of Random > Forests with a mix of categorical and lexical features on two ML > forums (stats.SE and MetaOp), but since it has received no attention, > I figured that it might work better on this list …

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-01 Thread Philipp Singer
Hi Christian, Some time ago I had a similar problem: I wanted to add extra features to my lexical features, and simple concatenation didn't work that well for me, even though both feature sets performed pretty well on their own. You can follow the discussion about my problem here [1] …
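Independently of how that discussion ended, a generic sketch of concatenating a sparse lexical block with a few extra dense features looks like this (feature values invented):

---
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["great camera, poor battery", "battery lasts forever"]
extra = np.array([[3.5, 120.0],     # e.g. rating, review length
                  [4.8, 35.0]])

X_text = TfidfVectorizer().fit_transform(docs)      # sparse lexical features
# stack the two blocks column-wise into one sparse matrix;
# for linear models the blocks may need consistent scaling first
X_all = sparse.hstack([X_text, sparse.csr_matrix(extra)]).tocsr()

print(X_all.shape)
---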

[Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-01 Thread Christian Jauvin
Hi, I asked a (perhaps too vague?) question about the use of Random Forests with a mix of categorical and lexical features on two ML forums (stats.SE and MetaOp), but since it has received no attention, I figured that it might work better on this list (I'm using sklearn's RF of course): "I'm working …
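For the categorical part, one common workaround is to one-hot encode before training the forest; a small sketch with invented column names:

---
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier

rows = [{"color": "red",   "shape": "round",  "weight": 150.0},
        {"color": "green", "shape": "oblong", "weight": 120.0}]
y = [1, 0]

# DictVectorizer one-hot encodes the string-valued fields
# and passes numeric fields through unchanged
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)
print(vec.feature_names_)

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
---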

Re: [Scikit-learn-general] To standardize is the question ...

2013-06-01 Thread Andreas Mueller
On 06/01/2013 07:51 PM, o m wrote: > > The main question is, what is your definition of an "important" variable? > > > > Gilles > That's a good question ;-) Seriously. > > I would define it - with many closely related variables - as a member of a > set that gives you the best predictability. > LARS and LASSO with cross validation …

[Scikit-learn-general] To standardize is the question ...

2013-06-01 Thread o m
> The main question is, what is your definition of an "important" variable? > > Gilles That's a good question ;-) Seriously. I would define it - with many closely related variables - as a member of a set that gives you the best predictability. LARS and LASSO with cross validation provide a good …
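In scikit-learn terms that roughly means letting cross-validation pick the penalty and reading off the surviving coefficients; a sketch on the diabetes data, not code from this thread:

---
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV

d = load_diabetes()
model = LassoCV(cv=5).fit(d['data'], d['target'])

# variables kept by the cross-validated Lasso, and the chosen penalty
kept = np.flatnonzero(model.coef_)
print(kept, model.alpha_)
---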

Re: [Scikit-learn-general] To standardize is the question ...

2013-06-01 Thread Gael Varoquaux
Hi, Unfortunately, statistics is not magic, and there are many situations in which l1 recovery is not guaranteed to work. I cannot give magic answers, and I suggest that you think a lot about how you can validate any findings using external sources. That said, I would suggest, in general, to standardize …
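In practice that usually means putting the scaler inside the estimation pipeline, so it is learned from the training data only; a generic sketch (not Gael's code, parameter values arbitrary):

---
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.datasets import load_diabetes

d = load_diabetes()

# the scaler is fit as part of the model, not as a separate preprocessing step
model = Pipeline([('scale', StandardScaler()),
                  ('lasso', Lasso(alpha=0.1))])
model.fit(d['data'], d['target'])
print(model.named_steps['lasso'].coef_)
---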

Re: [Scikit-learn-general] GridSearch with sample_weights

2013-06-01 Thread Andreas Mueller
On 06/01/2013 01:03 PM, Joel Nothman wrote: > I haven't seen any patch for this precisely, though it's a known issue > (even if it doesn't seem to be explicitly ticketed; it's closest to > https://github.com/scikit-learn/scikit-learn/issues/1179). There are > various tricky cases not currently supported …

Re: [Scikit-learn-general] To standardize is the question ...

2013-06-01 Thread Gilles Louppe
Hi, The main question is, what is your definition of an "important" variable? Gilles On 1 June 2013 14:22, o m wrote: > I've been playing around with Lasso and Lars, but there's something that > bothers me about standardization. > > If I don't standardize to N(0, 1), these procedures indicate that a certain set of variables …

[Scikit-learn-general] To standardize is the question ...

2013-06-01 Thread o m
I've been playing around with Lasso and Lars, but there's something that bothers me about standardization. If I don't standardize to N(0, 1), these procedures indicate that a certain set of variables is the most important. Yet, if I standardize, I get a completely different set of variables. …

Re: [Scikit-learn-general] My talk has been accepted at PyCon AU!

2013-06-01 Thread Robert Layton
Updated, new link at: https://docs.google.com/file/d/0B8FUzd86yYa1SWJXTlkyUF9idlU/edit?usp=sharing Only the updates here have been changed. On 27 May 2013 01:03, Lars Buitinck wrote: > 2013/5/26 Robert Layton > >> I've updated the slides for my talk at PyCon AU and put them on my Google >> Drive …

Re: [Scikit-learn-general] GridSearch with sample_weights

2013-06-01 Thread Joel Nothman
I haven't seen any patch for this precisely, though it's a known issue (even if it doesn't seem to be explicitly ticketed; it's closest to https://github.com/scikit-learn/scikit-learn/issues/1179). There are various tricky cases not currently supported for which it's easiest to roll your own search
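Rolling your own in the meantime can be as simple as looping over the grid and forwarding the weights to fit yourself; a rough sketch (the grid, weights, and split are placeholders):

---
import numpy as np
from itertools import product
from sklearn.datasets import load_iris
from sklearn.svm import SVC

d = load_iris()
X, y = d['data'], d['target']
sample_weight = np.ones(len(y))          # placeholder weights

rng = np.random.RandomState(0)
perm = rng.permutation(len(y))
train, test = perm[:100], perm[100:]     # single split; real code would cross-validate

grid = {'C': [0.1, 1.0, 10.0], 'gamma': [0.01, 0.1]}
best_score, best_params = -np.inf, None
for C, gamma in product(grid['C'], grid['gamma']):
    clf = SVC(C=C, gamma=gamma)
    # pass the training-fold weights straight to fit
    clf.fit(X[train], y[train], sample_weight=sample_weight[train])
    score = clf.score(X[test], y[test])
    if score > best_score:
        best_score, best_params = score, dict(C=C, gamma=gamma)

print(best_params, best_score)
---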

[Scikit-learn-general] normalize makes no difference to Lasso

2013-06-01 Thread o m
Alexandre, my bad completely, and I apologize for taking up your time. I was mixing up normalize with standardize, which is why none of it made sense. Thanks. Best Regards.

Re: [Scikit-learn-general] normalize makes no difference to Lasso

2013-06-01 Thread Alexandre Gramfort
hi, try this:

---
from sklearn import datasets, linear_model

d = datasets.load_diabetes()
print linear_model.Lasso(normalize=True).fit(d['data'], d['target']).coef_
print linear_model.Lasso(normalize=False).fit(2. * d['data'], d['target']).coef_
---

returns: [ 0. -0. 36…