Re: [Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15

2014-07-21 Thread Christian Jauvin
What I had in mind (for the LB) was an option to reserve an extra column at the LB's creation, which could then be used to map all the unknown values later encountered by transform. This column would obviously be all zeros in the matrix returned by fit_transform (i.e. could only contain 1s in
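A minimal sketch of that idea (the wrapper class below is hypothetical, not the PR's code, and assumes three or more classes so that LabelBinarizer returns one column per class):

    import numpy as np
    from sklearn.preprocessing import LabelBinarizer

    class LabelBinarizerWithUnknown:
        """Hypothetical wrapper: reserve one extra column for unknown labels."""

        def fit(self, y):
            self.lb = LabelBinarizer().fit(y)
            self.known = set(self.lb.classes_)
            return self

        def transform(self, y):
            known = np.array([v in self.known for v in y])
            # Substitute a known placeholder so the inner transform never fails
            safe = [v if ok else self.lb.classes_[0] for v, ok in zip(y, known)]
            out = self.lb.transform(safe)
            out[~known] = 0                              # zero out unknown rows
            extra = (~known).astype(int).reshape(-1, 1)  # the reserved column
            return np.hstack([out, extra])

    lbu = LabelBinarizerWithUnknown().fit(['a', 'b', 'c'])
    lbu.transform(['a', 'd'])   # -> [[1, 0, 0, 0], [0, 0, 0, 1]]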

Re: [Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15

2014-07-17 Thread Christian Jauvin
I think the encoders should all be able to deal with unknown labels. The thing about the extra single value is that you don't have a column to map it to. How would you use the extra value in LabelBinarizer or OneHotEncoder? You're right, and this points to a difference between what PR #3243

[Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15

2014-07-16 Thread Christian Jauvin
Hi, I have noticed a change with the LabelBinarizer between version 0.15 and those before. Prior to 0.15, this worked:

    lb = LabelBinarizer()
    lb.fit_transform(['a', 'b', 'c'])
    array([[1, 0, 0],
           [0, 1, 0],
           [0, 0, 1]])
    lb.transform(['a', 'd', 'e'])
    array([[1, 0, 0],
           [0, 0, 0],
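The preview is cut off above; for reference, a reconstruction of the full pre-0.15 behavior described in this thread (unseen labels silently became all-zero rows):

    from sklearn.preprocessing import LabelBinarizer

    lb = LabelBinarizer()
    lb.fit_transform(['a', 'b', 'c'])
    # array([[1, 0, 0],
    #        [0, 1, 0],
    #        [0, 0, 1]])

    lb.transform(['a', 'd', 'e'])   # under 0.14: 'd' and 'e' -> zero rows
    # array([[1, 0, 0],
    #        [0, 0, 0],
    #        [0, 0, 0]])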

Re: [Scikit-learn-general] LabelBinarizer change between 0.14 and 0.15

2014-07-16 Thread Christian Jauvin
could easily add one with some numpy operations: np.hstack([y, y.sum(axis=1, keepdims=True) == 0]) Best regards, Arnaud On 16 Jul 2014, at 19:24, Christian Jauvin cjau...@gmail.com wrote: Hi, I have noticed a change with the LabelBinarizer between version 0.15 and those before. Prior to 0.15
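A runnable version of Arnaud's one-liner (y is constructed by hand here to match the pre-0.15 output, since the thread reports that newer releases changed that behavior):

    import numpy as np

    # LabelBinarizer output in which unseen labels ('d', 'e') became zero rows
    y = np.array([[1, 0, 0],
                  [0, 0, 0],
                  [0, 0, 0]])

    # Append a column that is 1 exactly where a row is all zeros:
    np.hstack([y, y.sum(axis=1, keepdims=True) == 0])
    # array([[1, 0, 0, 0],
    #        [0, 0, 0, 1],
    #        [0, 0, 0, 1]])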

Re: [Scikit-learn-general] Difference between sklearn.feature_selection.chi2 and scipy.stats.chi2_contingency

2014-06-30 Thread Christian Jauvin
If I understand you correctly, one way to reconcile the difference between the two interpretations (multinomial vs binomial) would be to first binarize my boolean input variable. Just for the sake of clarity: I meant adding the complement of my input variable (i.e. as a second feature), rather

[Scikit-learn-general] Difference between sklearn.feature_selection.chi2 and scipy.stats.chi2_contingency

2014-06-29 Thread Christian Jauvin
Hi, Suppose I wanted to test the independence of two boolean variables using Chi-Square:

    X = numpy.vstack(([[0,0]] * 18, [[0,1]] * 7, [[1,0]] * 42, [[1,1]] * 33))
    X.shape
    (100, 2)

I'd like to understand the difference between doing:

    sklearn.feature_selection.chi2(X[:,[0]], X[:,1])
    (array([
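The output above is truncated; here is my worked reconstruction of the comparison. sklearn's chi2 treats each column as event counts, so it only sees the 1s of a boolean feature; adding the complement as a second column recovers the other half of the contingency table, and scipy's Yates correction must be disabled for the numbers to line up:

    import numpy as np
    from scipy.stats import chi2_contingency
    from sklearn.feature_selection import chi2

    X = np.vstack(([[0, 0]] * 18, [[0, 1]] * 7, [[1, 0]] * 42, [[1, 1]] * 33))

    chi2(X[:, [0]], X[:, 1])                  # statistic 0.5 (counts the 1s only)

    X0 = X[:, [0]]
    chi2(np.hstack([X0, 1 - X0]), X[:, 1])    # statistics (0.5, 1.5)

    table = np.array([[18, 7], [42, 33]])     # full 2x2 contingency table
    chi2_contingency(table, correction=False)[0]   # 0.5 + 1.5 = 2.0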

Re: [Scikit-learn-general] Similarity in a vector space model with skewed feature distribution

2014-04-23 Thread Christian Jauvin
similarity measure, or dealing with large quantities of sparse data in a memory efficient way? If it is the latter, you can look into feature hashing: http://en.wikipedia.org/wiki/Feature_hashing regards shankar. On Wed, Apr 23, 2014 at 9:59 AM, Christian Jauvin cjau...@gmail.comwrote: Hi
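For reference, a minimal FeatureHasher sketch (the token lists and hash width are made up):

    from sklearn.feature_extraction import FeatureHasher

    # Hash variable-length token lists into a fixed-width sparse matrix
    hasher = FeatureHasher(n_features=2**18, input_type='string')
    X = hasher.transform([['red', 'round'], ['blue'], ['red', 'tall']])
    X.shape   # (3, 262144), scipy sparse CSR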

[Scikit-learn-general] Similarity in a vector space model with skewed feature distribution

2014-04-22 Thread Christian Jauvin
Hi, I want to compute the pairwise cosine similarity of items in a vector space of very high dimensionality. My input matrix is very sparse, but the number of nonzero elements per item follows a very skewed distribution (i.e. power-law-ish, with very few items having lots of features, and
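A sketch of one memory-conscious route (toy data; dense_output=False keeps the result sparse, and processing row blocks bounds peak memory):

    import scipy.sparse as sp
    from sklearn.metrics.pairwise import cosine_similarity

    X = sp.random(10000, 500000, density=1e-5, format='csr')  # toy sparse input

    for start in range(0, X.shape[0], 1000):
        block = cosine_similarity(X[start:start + 1000], X, dense_output=False)
        # ... threshold/store `block` here before moving to the next slice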

Re: [Scikit-learn-general] LabelEncoder with never seen before values

2014-01-11 Thread Christian Jauvin
is what I assume, because it can, I guess, be considered a form of data leakage), what is the standard way to handle test values (for a categorical variable) that were never encountered in the training set? On 9 January 2014 15:21, Christian Jauvin cjau...@gmail.com wrote: Hi
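One common workaround (my sketch, not a library feature): reserve an explicit placeholder class at fit time, then map unseen test values onto it:

    from sklearn.preprocessing import LabelEncoder

    UNKNOWN = '__unknown__'   # hypothetical sentinel value

    train = ['red', 'green', 'blue']
    test = ['red', 'purple']          # 'purple' never seen during training

    le = LabelEncoder().fit(train + [UNKNOWN])
    known = set(le.classes_)
    le.transform([v if v in known else UNKNOWN for v in test])
    # 'purple' is encoded as the reserved __unknown__ class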

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-06 Thread Christian Jauvin
I believe more in my results than in my expertise - and so should you :-) +1! There are very, very few examples of theory trumping data in history... And a bajillion of the converse. I guess I didn't express myself clearly: I didn't mean to say that I mistrust my results per se... I'm not that

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Christian Jauvin
Many thanks to all for your help and detailed answers, I really appreciate it. So I wanted to test the discussion's takeaway, namely, what Peter suggested: one-hot encode the categorical features with small cardinality, and leave the others in their ordinal form. So from the same dataset I
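A sketch of that split encoding (column names and the cardinality threshold are my assumptions), using pandas:

    import pandas as pd

    df = pd.DataFrame({'color': ['red', 'blue', 'red'],    # low cardinality
                       'city': ['NYC', 'LA', 'NYC']})      # imagine thousands

    # One-hot encode the small-cardinality column...
    X = pd.get_dummies(df[['color']])
    # ...and leave the high-cardinality one in ordinal (integer) form
    X['city'] = df['city'].astype('category').cat.codes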

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Christian Jauvin
Hi Andreas, Btw, you do encode the categorical variables using one-hot, right? The sklearn trees don't really support categorical variables. I'm rather perplexed by this... I assumed that sklearn's RF only required its input to be numerical, so I have only used a LabelEncoder up to now. My

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Christian Jauvin
Sklearn does not implement any special treatment for categorical variables. You can feed it any float. The question is whether it would work / what it does. I think I'm confused about a couple of aspects (that's what happens, I guess, when you play with algorithms for which you don't have a complete and

[Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-01 Thread Christian Jauvin
Hi, I asked a (perhaps too vague?) question about the use of Random Forests with a mix of categorical and lexical features on two ML forums (stats.SE and MetaOp), but since it has received no attention, I figured that it might work better on this list (I'm using sklearn's RF of course): I'm

Re: [Scikit-learn-general] reverse feature engineering (or something vague like that)

2012-10-02 Thread Christian Jauvin
on the math side). What do you think? [0] http://jmlr.csail.mit.edu/papers/volume11/baehrens10a/baehrens10a.pdf On 2 October 2012 14:34, Christian Jauvin cjau...@gmail.com wrote: * Advice for applying Machine Learning [1] gives general recommendations on how to diagnose trained models Thanks

Re: [Scikit-learn-general] threading error when training a RFC on a big dataset

2012-09-24 Thread Christian Jauvin
anymore. But I'd be curious to know whether there is any mechanism I could use to allow a Random Forest classifier to work with bigger datasets (than what simply fits in memory)? Thanks! On 22 September 2012 16:18, Olivier Grisel olivier.gri...@ensta.org wrote: 2012/9/22 Christian Jauvin cjau
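There is no built-in out-of-core mode for the forest; one commonly suggested workaround (my sketch, not from this thread, and it relies on the warm_start option added in later releases) grows a batch of trees per data chunk, so each tree only ever sees one chunk:

    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier(n_estimators=0, warm_start=True)
    for X_chunk, y_chunk in load_chunks():   # hypothetical chunk loader
        clf.n_estimators += 25               # grow 25 new trees per chunk
        clf.fit(X_chunk, y_chunk)            # new trees fit on this chunk only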

[Scikit-learn-general] threading error when training a RFC on a big dataset

2012-09-22 Thread Christian Jauvin
Hi, I have been doing multiple experiments using a RandomForestClassifier (trained with the parallel code option) recently, without encountering any particular problem. However, as soon as I began using a much bigger dataset (with the exact same code), I got this threading error: Exception in

[Scikit-learn-general] link between a classifier's score and fit methods

2012-09-21 Thread Christian Jauvin
I have a classifier which derives from RandomForestClassifier, in order to implement a custom score method. This obviously affects scoring results obtained with cross-validation, but I observed that it seems to also affect the actual predictions. In other words, the same RF classifier with two

Re: [Scikit-learn-general] link between a classifier's score and fit methods

2012-09-21 Thread Christian Jauvin
Hi Gilles, Are you sure the RF classifier is the same in both case? (have you set the random state to the same value?) You're right, I forgot about that! I just tested it, and both classifiers indeed produce identical predictions with the same random_state value. Thanks, Christian
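i.e., with a pinned seed the two otherwise-identical forests draw the same bootstrap samples and candidate splits. A minimal illustration (the subclass is hypothetical, standing in for the custom-score classifier from the thread):

    from sklearn.ensemble import RandomForestClassifier

    class CustomScoreRF(RandomForestClassifier):
        """Hypothetical subclass: overrides scoring only, never prediction."""
        def score(self, X, y):
            return super().score(X, y)   # a custom metric would go here

    clf_a = RandomForestClassifier(random_state=42)
    clf_b = CustomScoreRF(random_state=42)
    # Fit both on the same data: predict() output is now identical.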

Re: [Scikit-learn-general] link between a classifier's score and fit methods

2012-09-21 Thread Christian Jauvin
Hi Andreas, You mean that I could use cross_val_score's score_func argument? I tried it once and it didn't work for some reason, so I stuck with the inheritance solution, which is really a three-line modification anyway. Is there another way? Best, Christian On 21 September 2012 15:36,

Re: [Scikit-learn-general] link between a classifier's score and fit methods

2012-09-21 Thread Christian Jauvin
Hi Andreas, Yes, the score_func option is intended exactly for this purpose. The problem I have with it is that my score function is defined in terms of the classifier's probabilistic output (i.e. predict_proba), whereas score_func's caller passes it the predicted class (i.e. the outcome
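In later scikit-learn versions this exact need is covered by make_scorer with needs_proba=True, which hands the scorer predict_proba output instead of predicted classes (my pointer, not part of the 2012 API; the newest releases spell it response_method='predict_proba'):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import make_scorer
    from sklearn.model_selection import cross_val_score

    def mean_confidence(y_true, y_proba):
        # hypothetical "metric": average top-class probability
        return float(y_proba.max(axis=1).mean())

    scorer = make_scorer(mean_confidence, needs_proba=True)
    X, y = make_classification(n_classes=3, n_informative=4, random_state=0)
    cross_val_score(RandomForestClassifier(random_state=0), X, y, scoring=scorer)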

Re: [Scikit-learn-general] computing the sample weights

2012-09-12 Thread Christian Jauvin
May I ask why you think you need this? It was my naive assumption of how to tackle class imbalance with an SGD classifier, but as Olivier already suggested, using class_weight makes more sense for this. Is there another mechanism or strategy you think I should be aware of?
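For reference, the class_weight route looks like this ('balanced' is the modern spelling; releases of that era used 'auto'):

    from sklearn.linear_model import SGDClassifier

    # Each class is reweighted inversely to its frequency, so no per-sample
    # weights need to be computed by hand
    clf = SGDClassifier(class_weight='balanced')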

Re: [Scikit-learn-general] computing the sample weights

2012-09-12 Thread Christian Jauvin
Thanks, that's very helpful! On 12 September 2012 11:47, Peter Prettenhofer peter.prettenho...@gmail.com wrote: 2012/9/12 Peter Prettenhofer peter.prettenho...@gmail.com: [..] AFAIK Fabian has some scikit-learn code for that as well. here is the code https://gist.github.com/2071994 --
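Present-day scikit-learn also ships a helper for exactly this (my pointer; it postdates the linked gist):

    import numpy as np
    from sklearn.utils.class_weight import compute_sample_weight

    y = np.array([0, 0, 0, 0, 1])        # imbalanced labels
    compute_sample_weight('balanced', y)
    # array([0.625, 0.625, 0.625, 0.625, 2.5  ])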

[Scikit-learn-general] Two problems with SGDClassifier.predict_proba()

2012-09-12 Thread Christian Jauvin
(1) When I try to use it with a sparse matrix I get (for a binary problem):

    --> 585 proba = np.ones((len(X), 2), dtype=np.float64)
    ...
    --> 175 raise TypeError('sparse matrix length is ambiguous; use getnnz()'
        176                 ' or shape[0]')

(2) When I try to use it for a

[Scikit-learn-general] Memory explosion with GridSearchCV

2012-09-10 Thread Christian Jauvin
Hi, I'm working on a text classification problem, and the strategy I'm currently studying is based on this example: http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html When I replace the data component by my own, I have found that the memory requirement explodes
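One knob worth checking when n_jobs > 1 is pre_dispatch, which caps how many fit jobs (and hence data copies) are queued at once; a sketch with placeholder estimator and grid, using the modern import path:

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import LinearSVC

    grid = GridSearchCV(
        LinearSVC(),                      # placeholder estimator
        param_grid={'C': [0.1, 1, 10]},   # placeholder grid
        n_jobs=4,
        pre_dispatch='2*n_jobs',          # limit queued jobs -> peak memory
    )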