What I had in mind (for the LB) was an option to reserve an extra
column at LB creation, which could then be used to map all the
unknown values subsequently encountered by transform. This column would
obviously be all zeros in the matrix returned by fit_transform (i.e.
could only contain 1s in
I think the encoders should all be able to deal with unknown labels.
The thing about the extra single value is that you don't have a column
to map it to.
How would you use the extra value in LabelBinarizer or OneHotEncoder?
You're right, and this points to a difference between what PR #3243
Hi,
I have noticed a change in the LabelBinarizer between version 0.15
and earlier versions.
Prior to 0.15, this worked:
lb = LabelBinarizer()
lb.fit_transform(['a', 'b', 'c'])
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])
lb.transform(['a', 'd', 'e'])
array([[1, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])
could easily add one with
some numpy operations:
np.hstack([y, y.sum(axis=1, keepdims=True) == 0])
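A runnable version of that trick (a sketch, assuming a LabelBinarizer
version that maps unknown labels to all-zero rows, as in the example
above):

import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit(['a', 'b', 'c'])
y = lb.transform(['a', 'd', 'e'])  # unknown labels 'd' and 'e' become all-zero rows

# Append an "unknown" indicator column: 1 exactly where a row is all zeros
y_ext = np.hstack([y, (y.sum(axis=1, keepdims=True) == 0).astype(y.dtype)])
print(y_ext)
# [[1 0 0 0]
#  [0 0 0 1]
#  [0 0 0 1]]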
Best regards,
Arnaud
On 16 Jul 2014, at 19:24, Christian Jauvin cjau...@gmail.com wrote:
Hi,
I have noticed a change in the LabelBinarizer between version 0.15
and earlier versions.
Prior to 0.15
If I understand you correctly, one way to reconcile the difference
between the two interpretations (multinomial vs. binomial) would be to
first binarize my boolean input variable:
Just for the sake of clarity: I meant to add the complement to my
input variable (i.e. as a second feature), rather
Hi,
Suppose I wanted to test the independence of two boolean variables using
a chi-square test:
X = numpy.vstack(([[0,0]] * 18, [[0,1]] * 7, [[1,0]] * 42, [[1,1]] * 33))
X.shape
(100, 2)
I'd like to understand the difference between doing:
sklearn.feature_selection.chi2(X[:,[0]], X[:,1])
(array([
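The output above is truncated, but for comparison here is a runnable
sketch of both routes on the same counts; treating scipy's
contingency-table test as the second option is my assumption about what
is being compared:

import numpy as np
from scipy.stats import chi2_contingency
from sklearn.feature_selection import chi2

X = np.vstack(([[0, 0]] * 18, [[0, 1]] * 7, [[1, 0]] * 42, [[1, 1]] * 33))

# sklearn's chi2 treats column 0 as a nonnegative feature and column 1 as the class label
stat_skl, p_skl = chi2(X[:, [0]], X[:, 1])

# The classical test of independence works on the 2x2 contingency table instead
table = np.array([[18, 7], [42, 33]])
stat_sp, p_sp, dof, expected = chi2_contingency(table)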
similarity measure, or dealing with
large quantities of sparse data in a memory-efficient way? If it is the
latter, you can look into feature hashing:
http://en.wikipedia.org/wiki/Feature_hashing
regards
shankar.
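A minimal sketch of that suggestion with scikit-learn's FeatureHasher
(the feature dicts below are made up):

from sklearn.feature_extraction import FeatureHasher

# Hash arbitrary feature names into a fixed-width sparse matrix, no vocabulary kept
hasher = FeatureHasher(n_features=2**16)
X = hasher.transform([{'word_a': 2, 'word_b': 1}, {'word_c': 1}])
print(X.shape)  # (2, 65536), stored sparsely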
On Wed, Apr 23, 2014 at 9:59 AM, Christian Jauvin cjau...@gmail.comwrote:
Hi
Hi,
I want to compute the pairwise cosine similarity of items in a vector
space of very high dimensionality.
My input matrix is very sparse, but the number of nonzero elements per
item follows a very skewed distribution (i.e. power law-ish, with very
few items having lots of features, and
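For reference, pairwise cosine similarity on a sparse matrix is a
one-liner in scikit-learn; a toy sketch (the real matrix would be far
larger):

from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Toy sparse item-by-feature matrix; cosine_similarity accepts sparse input directly
X = sparse.random(3, 10000, density=0.001, format='csr', random_state=0)
S = cosine_similarity(X)  # dense (3, 3) matrix of pairwise similarities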
is what I assume, because I guess it can be considered a form of data
leakage), what is the standard way to handle test values (for a
categorical variable) that have never been encountered in the training
set?
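In later scikit-learn versions, one standard answer is OneHotEncoder's
handle_unknown='ignore', which encodes unseen categories as all zeros;
a minimal sketch:

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['a'], ['b'], ['c']])
print(enc.transform([['a'], ['d']]).toarray())
# [[1. 0. 0.]
#  [0. 0. 0.]]  <- the unseen category 'd' maps to an all-zero row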
On 9 January 2014 15:21, Christian Jauvin cjau...@gmail.com wrote:
Hi
I believe more in my results than in my expertise - and so should you :-)
+1! There are very, very few examples of theory trumping data in history... And
a bajillion of the converse.
I guess I didn't express myself clearly: I didn't mean to say that I
mistrust my results per se... I'm not that
Many thanks to all for your help and detailed answers, I really appreciate it.
So I wanted to test the discussion's takeaway, namely, what Peter
suggested: one-hot encode the categorical features with small
cardinality, and leave the others in their ordinal form.
So from the same dataset I
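A sketch of that split with current scikit-learn; the column names and
the choice of which columns count as small-cardinality are made up:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'red'],     # small cardinality -> one-hot
                   'zipcode': ['H2X', 'H3Z', 'H2X']})   # large cardinality -> ordinal

pre = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), ['color']),
    ('ordinal', OrdinalEncoder(), ['zipcode']),
])
X = pre.fit_transform(df)  # ready to feed to a RandomForestClassifier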
Hi Andreas,
Btw, you do encode the categorical variables using one-hot, right?
The sklearn trees don't really support categorical variables.
I'm rather perplexed by this... I assumed that sklearn's RF only
required its input to be numerical, so I have only used a LabelEncoder
up to now.
My
Sklearn does not implement any special treatment for categorical variables.
You can feed it any float. The question is whether it would work / what it does.
I think I'm confused about a couple of aspects (that's what happens I
guess when you play with algorithms for which you don't have a
complete and
Hi,
I asked a (perhaps too vague?) question about the use of Random
Forests with a mix of categorical and lexical features on two ML
forums (stats.SE and MetaOp), but since it has received no attention,
I figured that it might work better on this list (I'm using sklearn's
RF of course):
I'm
on the
math side). What do you think?
[0] http://jmlr.csail.mit.edu/papers/volume11/baehrens10a/baehrens10a.pdf
On 2 October 2012 14:34, Christian Jauvin cjau...@gmail.com wrote:
* Advice for applying Machine Learning [1] gives general recommendations on how to diagnose trained models
Thanks
anymore.
But I'd be curious to know if there is any mechanism I could use to
allow a Random Forest classifier to work with datasets bigger than
what simply fits in memory?
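There is no built-in out-of-core mode for forests, but one possible
workaround (a sketch, not an officially supported pattern) is to grow
trees incrementally with warm_start, a chunk at a time; iter_chunks is
a hypothetical loader, and every chunk must contain all classes:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=0, warm_start=True)
for X_chunk, y_chunk in iter_chunks():  # hypothetical: yields (X, y) pieces that fit in RAM
    clf.n_estimators += 10              # the 10 new trees are trained on this chunk only
    clf.fit(X_chunk, y_chunk)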
Thanks!
On 22 September 2012 16:18, Olivier Grisel olivier.gri...@ensta.org wrote:
2012/9/22 Christian Jauvin cjau
Hi,
I have been doing multiple experiments using a RandomForestClassifier
(trained with the parallel code option) recently, without encountering
any particular problem. However, as soon as I began using a much bigger
dataset (with the exact same code), I got this threading error:
Exception in
I have a classifier that derives from RandomForestClassifier, in
order to implement a custom score method. This obviously affects
scoring results obtained with cross-validation, but I observed that it
seems to also affect the actual predictions. In other words, the same
RF classifier with two
Hi Gilles,
Are you sure the RF classifier is the same in both case? (have you set
the random state to the same value?)
You're right, I forgot about that!
I just tested it, and both classifiers indeed produce identical
predictions with the same random_state value.
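A minimal illustration of that fix on toy data; with the same
random_state, two identically configured forests make identical
predictions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
a = RandomForestClassifier(random_state=42).fit(X, y)
b = RandomForestClassifier(random_state=42).fit(X, y)
assert np.array_equal(a.predict(X), b.predict(X))  # same seed -> same predictions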
Thanks,
Christian
Hi Andreas,
You mean that I could use cross_val_score's score_func argument? I
tried it once, and it didn't work for some reason, so I stuck
with the inheritance solution, which is really a three-line modification
anyway. Is there another way?
Best,
Christian
On 21 September 2012 15:36,
Hi Andreas,
Yes, the score_func option is intended exactly for this purpose.
The problem I have with it is that my score function is defined in
terms of the probabilistic output of the classifier (i.e.
predict_proba), whereas the score_func's caller passes it the predicted
class (i.e. the outcome
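In later scikit-learn versions this gap is covered by make_scorer with
needs_proba=True, which hands the metric predict_proba's output instead
of hard labels; a sketch (the newest releases spell this
response_method='predict_proba'):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, make_scorer
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
# The metric receives class probabilities, not predicted labels
proba_scorer = make_scorer(log_loss, greater_is_better=False, needs_proba=True)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, scoring=proba_scorer)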
May I ask why you think you need this?
It was my naive assumption about how to tackle class imbalance with an
SGD classifier, but as Olivier already suggested, using class_weight
makes more sense for this. Is there another mechanism or strategy that
you think I should be aware of?
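For reference, the class_weight route is a one-line change; 'balanced'
reweights the loss inversely to class frequencies:

from sklearn.linear_model import SGDClassifier

# Each sample's loss is scaled by n_samples / (n_classes * count(its class))
clf = SGDClassifier(class_weight="balanced")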
Thanks, that's very helpful!
On 12 September 2012 11:47, Peter Prettenhofer
peter.prettenho...@gmail.com wrote:
2012/9/12 Peter Prettenhofer peter.prettenho...@gmail.com:
[..]
AFAIK Fabian has some scikit-learn code for that as well.
here is the code https://gist.github.com/2071994
--
(1) When I try to use it with a sparse matrix I get (for a binary problem):
--> 585 proba = np.ones((len(X), 2), dtype=np.float64)
--> 175 raise TypeError("sparse matrix length is ambiguous; use getnnz()"
    176                 " or shape[0]")
(2) When I try to use it for a
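The error in (1) comes from calling len() on a scipy.sparse matrix,
which is ambiguous by design; a minimal sketch of the fix the message
itself suggests:

import numpy as np
from scipy import sparse

X = sparse.csr_matrix(np.eye(3))
# len(X) raises TypeError for sparse matrices; X.shape[0] is the row count
proba = np.ones((X.shape[0], 2), dtype=np.float64)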
Hi,
I'm working on a text classification problem, and the strategy I'm
currently studying is based on this example:
http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html
When I replace the data component with my own, I have found that the
memory requirement explodes
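If the blow-up comes from the vectorizer's growing vocabulary, one
hedged workaround is to swap in HashingVectorizer, which keeps no
vocabulary and has a fixed output width:

from sklearn.feature_extraction.text import HashingVectorizer

docs = ["some text", "more text"]  # stand-in for the real corpus
vec = HashingVectorizer(n_features=2**18)  # stateless: no fit needed, no vocabulary stored
X = vec.transform(docs)            # sparse (n_docs, 2**18)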