Dear Gilles,
sorry to jump into that discussion, but it raised my interest.
In the R randomForest package, MeanDecreaseGini can be calculated.
Does scikit-learn somehow scale MeanDecreaseGini to a percentage scale?
Please find attached the variable importances as computed by scikit-learn's
RF.
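The attachment itself is not reproduced here, but a minimal sketch (synthetic data via make_classification, hypothetical forest settings) illustrates the point: scikit-learn's feature_importances_ are already normalized to sum to 1, i.e. effectively a fractional/percentage scale, unlike R's raw MeanDecreaseGini values.
"
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the poster's data set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Importances are normalized, so they sum to (approximately) 1.0;
# multiply by 100 to read them as percentages.
print(clf.feature_importances_)
print(clf.feature_importances_.sum())   # ~1.0
"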
> > You should use predict_proba with caution if what you want is a level
> > of confidence with respect to the true values. If your trees are fully
> > developed, then predict_proba is rather a level of agreement between
> > the trees, no matter whether they are right or wrong with respect to the true values. It
>
I need the random matrix to evaluate the predictiveness of a particular
model - this time, not in terms of a "domain of applicability" :)
Just to clarify the question and put us on the same page: please see the
attached PDF for the question I have in mind:
Cheers & Thanks,
Paul
>
> What do y
The term "domain of applicability" is frequently used in the field of
cheminformatics to judge the reliability of predictions for new (unseen)
compounds.
=> the model should not perform very well for very dissimilar compounds
In my simple understanding, the predict_proba function gives me at le
Dear SciKitLearners,
does anyone have experience in using RandomForest's predict_proba function
as an estimate of the domain of applicability?
The situation is the following:
- data set contains 694 samples, each of which is defined by 94 features
- data has 2 classes: class0 and class1
- split i
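The message is cut off, but a minimal sketch (synthetic stand-in for the 694-sample / 94-feature set, current sklearn.model_selection module path instead of the thread-era sklearn.cross_validation) shows how predict_proba is usually read for a two-class RF: with fully grown trees it is essentially the fraction of trees voting for each class, i.e. a level of agreement rather than a calibrated confidence.
"
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for the 694-sample / 94-feature data set.
X, y = make_classification(n_samples=694, n_features=94, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Each row gives P(class0), P(class1); for fully developed trees this is
# (close to) the fraction of trees voting for each class.
proba = clf.predict_proba(X_test)
print(proba[:5])
"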
> Having both margin fixed is an unlikely situation, especially for
> confusion matrices. Your case looks like the one margin fixed. Could
> you elaborate more on the final goal of your attempt?
I would like to investigate Cohen's kappa further (
http://en.wikipedia.org/wiki/Cohens_kappa):
It's no
Dear ScikitLearners,
I hope that I'm not too much off topic...
Given a confusion matrix (from a model trained in scikit-learn):
[[186 187]
[119 997]]
I calculate these variables:
exp_class0 = conf_matrix[0].sum()
exp_class1 = conf_matrix[1].sum()
pred_class0 = conf_matrix[:,0].sum()
pred_class1 = conf_matrix[:,1].sum()
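A minimal sketch, using the matrix from the message and assuming the scikit-learn convention of rows = true class and columns = predicted class, of computing Cohen's kappa from exactly the margins defined above:
"
import numpy as np

conf_matrix = np.array([[186, 187],
                        [119, 997]])

n = conf_matrix.sum()
exp_class0 = conf_matrix[0].sum()      # true class 0 (row margin)
exp_class1 = conf_matrix[1].sum()      # true class 1
pred_class0 = conf_matrix[:, 0].sum()  # predicted class 0 (column margin)
pred_class1 = conf_matrix[:, 1].sum()  # predicted class 1

# Observed agreement: fraction of samples on the diagonal.
p_o = np.trace(conf_matrix) / float(n)
# Expected agreement by chance, computed from the margins.
p_e = (exp_class0 * pred_class0 + exp_class1 * pred_class1) / float(n) ** 2
kappa = (p_o - p_e) / (1.0 - p_e)
print(kappa)
"
Newer scikit-learn releases also ship sklearn.metrics.cohen_kappa_score(y_true, y_pred), which computes the same statistic directly from the predictions.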
Dear SciKitters,
how does a sklearn RF actually compute the feature importances? Can I
assume that the feature contribution (the final leaf of a single model) is
averaged over all trained single models and then outputted?
How does it compare to the "R way"?
Here, I see that the "MeanDecreaseGin
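Not an authoritative description of the implementation, but a minimal sketch (synthetic data) of the commonly stated behaviour: the forest's feature_importances_ are the per-tree (Gini-based) importances averaged over clf.estimators_, up to a final normalization.
"
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Mean of the per-tree Gini importances over all fitted trees ...
manual = np.mean([tree.feature_importances_ for tree in clf.estimators_],
                 axis=0)

# ... should match (up to normalization) what the forest reports.
print(np.allclose(manual, clf.feature_importances_))
"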
> >
> > BTW: When doing a RandomizedPCA, the explained variance of the first
> > component increases to 78%.
> > * Turning whiten on or off has more or less no influence on the
> > explained variance.
> > * However, plotting with class labels on => again no clear
> > differentiation between the
Dear Andy,
> When now performing the PCA, the explained variances of the first two
> components are 0.197 and 0.057
> => My interpretation of this result: for my binary classification
> problem ("active" and "inactive") of my samples set, the features
> make no clear distinction between
Dear Andy & the rest,
by "StandardScaler" => are you talking about the "Scaler" class of the
"preprocessing" module?
In my case, I used the "preprocessing.scale" routine:
"
X = preprocessing.scale(dataDescrs_array)
"
This should call the same routine, at least that's the way I understood
the d
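A small sketch (synthetic array with wildly different value ranges, as described below) of the relationship as I understand it: preprocessing.scale(X) and the StandardScaler class (called Scaler in old releases) perform the same standardisation; the class additionally stores the mean and scale so they can be re-applied to a test set.
"
import numpy as np
from sklearn import preprocessing

X = np.array([[1200.0, 0.3, 25],
              [  60.0, 0.7,  3],
              [ 300.0, 0.1, 12]])

# Function form: standardise in one shot.
X_scaled = preprocessing.scale(X)

# Class form: fit() stores mean_/scale_, transform() applies them
# (and can be reused on held-out data).
scaler = preprocessing.StandardScaler().fit(X)
X_scaled2 = scaler.transform(X)

print(np.allclose(X_scaled, X_scaled2))   # True
"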
Sorry for the confusion, guys.
But I did not scale my features - they contain a wild mixture of values:
- floats ranging from 0 to 1200
- floats ranging from 0 to 60
- integers between 0 and 25
and so on...
My fault!
BTW, I tried to re-run the IRIS example (
http://scikit-learn.org/stable/auto
> > I fear that I mixed up my syntax...
>
> Syntax looks good.
>
> If there is one largely predominant component in the data, you should be
> able to see it with your naked eye: all the features should have series
> that look similar to a scaling.
>
> G
Do you mean that if you compare the compo
Dear SciKitters,
when running a PCA on a rather small dataset, I end up in the situation
that the first principal component is predominant.
My dataset contains 694 samples with 177 features each.
Here comes my code
"
X = dataDescrs_array
y = dataActs_array
target_names = ['inactive','active']
p
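The posted snippet breaks off, so here is only a minimal sketch (synthetic stand-in for the 694 x 177 array) of the scale-then-PCA pattern suggested elsewhere in the thread, and of reading explained_variance_ratio_ for the first components:
"
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Features on wildly different scales, as in the thread.
X = rng.rand(694, 177) * rng.rand(177) * 1000

# Without scaling, the large-range features dominate the first component.
pca_raw = PCA(n_components=2).fit(X)
print(pca_raw.explained_variance_ratio_)

# After standardisation each feature contributes on a comparable footing.
X_std = StandardScaler().fit_transform(X)
pca_std = PCA(n_components=2).fit(X_std)
print(pca_std.explained_variance_ratio_)
"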
Dear Andreas,
thanks for your reply.
Strangely enough, I'm getting different results after loading in the
model.
Below are my code snippets. The data source is of course identical.
I will investigate that issue further, but maybe someone already
sees a bug in my code.
Cheers & Thanks,
Dear SciKitters,
there is one thing I don't understand when cPickling a model. Here is my
code to pickle a model:
"
clf_kNN = KNeighborsClassifier()
clf_kNN = clf_kNN.fit(dataDescrs_array,dataActs_array)
cPickle.dump(clf_kNN,file("clf_kNN.descr.pk","wb+"))
"
And this is the way to ex
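The load half of the snippet is cut off; here is a minimal round-trip sketch (plain pickle and open() instead of the old cPickle/file() pair, synthetic data) for checking that a reloaded model reproduces the original predictions on the same data:
"
import pickle
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

clf_kNN = KNeighborsClassifier().fit(X, y)
with open("clf_kNN.descr.pk", "wb") as fh:
    pickle.dump(clf_kNN, fh)

# Load the model back and verify the predictions are identical.
with open("clf_kNN.descr.pk", "rb") as fh:
    clf_loaded = pickle.load(fh)

print((clf_kNN.predict(X) == clf_loaded.predict(X)).all())   # expect True
"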
Thanks a lot, Andy, it did the job!
Cheers & Thanks,
Paul
> Hi Paul.
> You didn't set verbosity.
> The script you linked to set verbosity=1 but I think there is some more
> output on higher verbosity levels.
> Hth,
> Andy
Dear SciKitters,
inspired by this script:
http://scikit-learn.org/stable/auto_examples/grid_search_text_feature_extraction.html
=> Very appealing that the remaining computing time is
printed out!
However, when adapting the script to my purposes, there is no such output
in my case:
"
sc
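As resolved earlier in the digest, the progress output comes from the verbosity setting; a minimal sketch (current GridSearchCV API, synthetic data) of turning it on:
"
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

param_grid = {'C': [1, 10, 100]}
# verbose > 0 makes GridSearchCV print progress messages for each fit;
# higher values give more detail.
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, verbose=2)
grid.fit(X, y)
"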
Dear SciKitters,
I want to do BernoulliNaiveBayes prediction.
y_test is an array of integer values; the same holds true for y_predict.
Here comes my code snippet:
"
clf_NB = BernoulliNB()
clf_NB = clf_NB.fit(X_train,y_train)
y_predict= clf_NB.predict(X_test)
accuracy = clf_NB
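The accuracy line is cut off; a minimal sketch (synthetic data standing in for X_train/X_test) of two equivalent ways to finish it, via the estimator's score method or sklearn.metrics.accuracy_score:
"
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

X, y = make_classification(n_samples=300, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf_NB = BernoulliNB().fit(X_train, y_train)
y_predict = clf_NB.predict(X_test)

# Two equivalent ways to get the test-set accuracy.
print(clf_NB.score(X_test, y_test))
print(accuracy_score(y_test, y_predict))
"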
> 2012/11/12 :
> > Actually I would like to name the features. Sorry for the confusion!
>
> What do you want to use it for? If you have a sample x (a row from X)
> and a list of n_features names for your features, then
>
> zip(names_of_features, x)
>
> will give you a list of (name, value) pa
Dear Andy,
> Hi Paul.
> I am a bit confused. Do you want to name targets or features?
Actually I would like to name the features. Sorry for the confusion!
But since you mentioned it: Naming the targets would also be very helpful!
>
> Also, what does it mean to dump out the leaves of an RF? What
Dear SciKitters,
given an array of (n_samples, n_features) -> how do I assign target_names in
a concluding step?
The target_names are stored in a list and, of course, have the same order
as the n_features vector.
In a next step, I would like to dump out the importance of the most
relevant featur
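A minimal sketch (hypothetical feature_names list and synthetic data) of pairing the names with an RF's importances and dumping the most relevant ones, along the lines of the zip() suggestion quoted above:
"
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
# Hypothetical descriptor names, in the same order as the feature columns.
feature_names = ['MolWt', 'RingCount', 'LogP', 'TPSA', 'HBD']

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each name with its importance and sort, most relevant first.
ranked = sorted(zip(feature_names, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print("%-10s %.3f" % (name, importance))
"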
> > However, after splitting into test and train:
> > "
> > from sklearn.cross_validation import train_test_split
> > X_train, X_test, y_train, y_test = train_test_split(
> >     dataDescrs_array, dataActs_array, test_size=.4)
> > "
>
> What does np.unique(y_train) look like?
array(['0', '1'],
dtype='ht
Dear SciKitters,
given a dataset (2200 samples, 90 features), I want to train a RF but run
into an interesting issue.
My array containing the labels (dataActs_array) only shows 2 classes:
"
from collections import defaultdict
d = defaultdict(int)
for elt in dataActs_array:
    d[elt] += 1
print d
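The message is cut off, but the later exchange suggests the labels end up as strings ('0'/'1'); a minimal sketch (hypothetical small dataActs_array) of counting the classes and casting them to integers before training:
"
import numpy as np

# Hypothetical label array loaded as strings, as in the thread.
dataActs_array = np.array(['0', '1', '0', '0', '1'])

# Class counts without a manual defaultdict loop.
labels, counts = np.unique(dataActs_array, return_counts=True)
print(dict(zip(labels, counts)))

# Cast to integers so downstream code treats them as numeric class labels.
dataActs_int = dataActs_array.astype(int)
print(np.unique(dataActs_int))
"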
Dear Andreas,
Dear Gilles,
Dear SciKitters,
> Hi Paul
> Tuning min_samples_split is a good idea but not related to imbalanced
> classes.
> First, you should specify what you want to optimize. Accuracy is usually
> not a good measure for imbalanced classes. Maybe F-score?
How would one do that?
I j
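In answer to "How would one do that?": a minimal sketch (synthetic imbalanced data, roughly in the spirit of the 454/168 split mentioned later in the digest) of scoring an RF with the F-score instead of plain accuracy:
"
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split

# Roughly 3:1 class imbalance.
X, y = make_classification(n_samples=622, n_features=20,
                           weights=[0.73, 0.27], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(f1_score(y_test, y_pred))                # F-score for the positive class
print(classification_report(y_test, y_pred))   # precision/recall/F1 per class
"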
Dear SciKitters,
given a dataset of 622 samples with 177 features each, I want to classify
them according to an experimental classification stating "0" or "1".
After splitting up into training and test set, I trained a RandomForest the
following way:
"
from sklearn.ensemble import RandomForestClassifie
> b) You shouldn't set max_depth=5. Instead, build fully developed trees
> (max_depth=None) or rather tune min_samples_split using
> cross-validation.
Dear Gilles,
I have set up a grid search:
"
tuned_parameters = [{'min_samples_split': [1,2,3,4,5,6,7,8,9]}]
scores = [('precision', precision_sc
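The grid-search snippet breaks off; a minimal sketch (current GridSearchCV API, synthetic data, and the grid starting at 2 because current releases require min_samples_split >= 2) of tuning min_samples_split for a RandomForest as suggested:
"
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=622, n_features=177, random_state=0)

tuned_parameters = {'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9]}
grid = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=0),
                    tuned_parameters, scoring='f1', cv=5)
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)
"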
Dear Gilles,
> Hi Paul,
>
> a) Scaling has no effect on decision trees.
Thanks!
>
> b) You shouldn't set max_depth=5. Instead, build fully developed trees
> (max_depth=None) or rather tune min_samples_split using
> cross-validation.
Do fully developed trees make sense for rather small datasets?
Dear SciKitters,
given a rather unbalanced data set (454 samples with classification "0" and
168 samples with classification "1"), I would like to train a RandomForest.
For my data set, I have calculated 177 features per sample.
In a first step, I have preprocessed my data set:
"
dataDescrs_array
Dear RDKitters,
> > However, I found it strange that "X_train.shape" gives (373, 177) -
> > shouldn't the second bit be the number of classes, i.e. 2?
>
> [snip]
>
> > 177 corresponds, BTW, to the number of features..
>
> And that's exactly what this is supposed to represent. The number of
> > given a list of features - e.g. dataDescrs[0] = (140.0, 2, 0.5) - and a
> > list of experimental observations - e.g. data_activities[0] = 0 - how do I
> > transform these lists to the scikit-learn nomenclature?
>
> Depends on what these things represent, but if all tuples in
> dataDescrs h
Dear Scikitters,
given a list of features - e.g. dataDescrs[0] = (140.0, 2, 0.5) - and a
list of experimental observations - e.g. data_activities[0] = 0 - how do I
transform these lists to the scikit-learn nomenclature?
Cheers & Thanks,
Paul
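A minimal sketch of the transformation asked about: the list of descriptor tuples becomes the (n_samples, n_features) array X and the list of observations becomes the target vector y (the extra rows are hypothetical values standing in for real descriptors).
"
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Descriptor tuples and observed classes, as in the question.
dataDescrs = [(140.0, 2, 0.5),
              (212.3, 1, 0.9),
              (98.1,  4, 0.1),
              (305.7, 3, 0.7)]
data_activities = [0, 1, 0, 1]

# scikit-learn nomenclature: X is (n_samples, n_features), y is (n_samples,).
X = np.asarray(dataDescrs, dtype=float)
y = np.asarray(data_activities)

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(clf.predict(X[:2]))
"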
> When you grid search C with non-linear kernels (such as RBF)
> you should always also grid search for the optimal value of the kernel
> parameters (e.g. gamma for RBF kernels). The zero scores you get
> probably stem from a default value of gamma that does not work at all
> for your data
I think that I have to re-phrase my post, since I discovered an awkward
behavior using SVC and the linear kernel - exactly THIS kernel takes ages
on my dataset.
E.g. the "RBF" kernel runs perfect, and so does the GridSearch! :)
Exclusion of the "linear" kernel from GridSearch gives now the followi
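Not a fix for the reported behaviour, but a commonly used workaround sketch (synthetic data of the same shape as the thread's set): for a linear decision function, LinearSVC (liblinear-based) is usually much faster than SVC(kernel='linear') (libsvm-based) on data of this size.
"
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=694, n_features=94, random_state=0)

# libsvm-based linear kernel: the variant reported as slow in the thread.
clf_libsvm = SVC(kernel='linear', C=1).fit(X, y)

# liblinear-based equivalent, typically much faster for linear problems.
clf_liblinear = LinearSVC(C=1).fit(X, y)

print(clf_libsvm.score(X, y))
print(clf_liblinear.score(X, y))
"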
Dear Gael,
> > My problem is that the job without parallelisation takes ages, whereas on a
> > single CPU, it runs in 0.66 seconds.
>
> You mean 'with parallelisation takes ages', right?
Yep.
>
> When you say 'takes ages', do you see it finish?
>
> What OS are you on? I suspect that you are on window
Dear SciKitters,
> >
> > I was wondering if I properly defined the grid search in the case of a SVM:
> >
> > "
> > # code snippet
> > tuned_parameters = [{'kernel': ['linear'],'C': [1,10,100,1000]}]
> > scores = [ ('precision', precision_score), ('recall', recall_score),]
> > for score_name, score
Dear SciKitters,
I'm rather new to this wonderful toolkit and starting to use it in the
cheminformatics environment.
I was wondering if I properly defined the grid search in the case of a SVM:
"
# code snippet
tuned_parameters = [{'kernel': ['linear'],'C': [1,10,100,1000]}]
scores = [ ('precisi
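Following the advice quoted elsewhere in the thread about also searching the kernel parameter, a minimal sketch (current GridSearchCV API, synthetic data) of a joint C/gamma grid for the RBF kernel:
"
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=30, random_state=0)

# Search C and gamma jointly: a default gamma can score near zero on some data.
tuned_parameters = [{'kernel': ['rbf'],
                     'C': [1, 10, 100, 1000],
                     'gamma': [1e-4, 1e-3, 1e-2, 1e-1]}]

grid = GridSearchCV(SVC(), tuned_parameters, scoring='precision', cv=5)
grid.fit(X, y)
print(grid.best_params_)
"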