Re: [Scikit-learn-general] SO question for the tree growers

2013-04-17 Thread Paul . Czodrowski
Dear Gilles, sorry to jump into that discussion, but it raised my interest. In the R randomForest package, MeanDecreaseGini can be calculated. Does scikit-learn somehow scale MeanDecreaseGini to a percentage scale? Please find attached the variable importance as computed by scikit-learn's RF
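To the scaling question: scikit-learn already normalizes its Gini-based importances so they sum to 1, i.e. they can be read directly as fractions of the total impurity decrease. A minimal sketch on synthetic data (dataset and parameters are illustrative, not from the thread):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative two-class data standing in for the thread's dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Mean decrease in impurity, averaged over trees and normalized to sum
# to 1; multiply by 100 to read it on a percentage scale like R's output
importances = clf.feature_importances_
print(importances.sum())  # ~1.0
```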

Re: [Scikit-learn-general] domain of applicability - RandomForest, predict_proba function

2013-03-20 Thread Paul . Czodrowski
> > You should use predict_proba with caution if what you want is a level > > of confidence with respect to the true values. If your trees are fully > > developed, then predict_proba is rather a level of agreement between > > the trees, no matter whether they are right or wrong with respect to the true values. It >
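The point about agreement can be made concrete: with fully grown trees, predict_proba is (up to pure leaves) just the fraction of trees voting for each class. A hedged sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=1)
clf = RandomForestClassifier(n_estimators=25, max_depth=None,
                             random_state=1).fit(X, y)

# Forest probabilities average the per-tree probabilities; with fully
# developed (pure-leaf) trees this equals the fraction of trees voting 1
proba = clf.predict_proba(X[:5])[:, 1]
votes = np.mean([tree.predict(X[:5]) for tree in clf.estimators_], axis=0)
print(proba, votes)
```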

Re: [Scikit-learn-general] generation of a "random" confusion matrix

2013-03-20 Thread Paul . Czodrowski
I need the random matrix to evaluate the predictiveness of a particular model - this time, not in terms of a "domain of applicability" :) Just to clarify the question and put us on the same page: please see the attached PDF for the question I have in mind: Cheers & Thanks, Paul > > What do y

Re: [Scikit-learn-general] domain of applicability - RandomForest, predict_proba function

2013-03-20 Thread Paul . Czodrowski
The term "domain of applicability" is frequently used in the field of cheminformatics to judge the reliability of predictions for new (unseen) compounds. => the model should not perform well for very dissimilar compounds. In my simple understanding, the predict_proba function gives me at le

[Scikit-learn-general] domain of applicability - RandomForest, predict_proba function

2013-03-19 Thread Paul . Czodrowski
Dear SciKitLearners, does anyone have experience in using RandomForest's predict_proba function as an estimate for the domain of applicability? The situation is the following: - data set contains 694 samples, each of which is defined by 94 features - data has 2 classes: class0 and class1 - split i

Re: [Scikit-learn-general] generation of a "random" confusion matrix

2013-03-15 Thread Paul . Czodrowski
> Having both margins fixed is an unlikely situation, especially for > confusion matrices. Your case looks like the one with one margin fixed. Could > you elaborate more on the final goal of your attempt? I would like to investigate Cohen's kappa (http://en.wikipedia.org/wiki/Cohens_kappa) further: It's no

[Scikit-learn-general] generation of a "random" confusion matrix

2013-03-15 Thread Paul . Czodrowski
Dear ScikitLearners, I hope that I'm not too much off topic... Given a confusion matrix (trained in scikit-learn): [[186 187] [119 997]] I calculate these variables: exp_class0 = conf_matrix[0].sum() exp_class1 = conf_matrix[1].sum() pred_class0 = conf_matrix[:,0].sum() pred_class1 = conf_matri
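For reference, the chance-agreement ("random") matrix behind Cohen's kappa follows from the marginals above via an outer product; a numpy sketch with the matrix quoted in the thread:

```python
import numpy as np

conf_matrix = np.array([[186, 187],
                        [119, 997]])

n = conf_matrix.sum()
exp_class = conf_matrix.sum(axis=1)    # true-class (row) totals
pred_class = conf_matrix.sum(axis=0)   # predicted-class (column) totals

# Expected counts if true and predicted labels were independent
random_matrix = np.outer(exp_class, pred_class) / n

observed = np.trace(conf_matrix) / n
expected = np.trace(random_matrix) / n
kappa = (observed - expected) / (1 - expected)
print(random_matrix)
print(kappa)  # ~0.42 for this matrix
```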

[Scikit-learn-general] feature_importances in RF models

2013-01-15 Thread Paul . Czodrowski
Dear SciKitters, how does a sklearn RF actually compute the feature importances? Can I assume that the feature contribution (the final leaf of a single model) is averaged over all trained single models and then outputted? How does it compare to the "R way"? Here, I see that the "MeanDecreaseGin

Re: [Scikit-learn-general] PCA: first component too dominant?

2013-01-11 Thread Paul . Czodrowski
> > > > BTW: When doing a RandomizedPCA, the explained variance of the first > > component increases to 78% > > * Turning whiten on or off has more or less no influence on the explained > > variance. > > > > * However, plotting with class labels on => again no clear differentiation > > between the

Re: [Scikit-learn-general] PCA: first component too dominant?

2013-01-11 Thread Paul . Czodrowski
Dear Andy, > When now performing the PCA, the explained variance of first two > components of the PCA are 0.197 and 0.057 > => My interpretation of this result: for my binary classification > problem ("active" and "inactive") of my samples set, the features > make no clear distinction between

Re: [Scikit-learn-general] PCA: first component too dominant?

2013-01-11 Thread Paul . Czodrowski
Dear Andy & the rest, by "StandardScaler" => are you talking about the "Scaler" class of the "preprocessing" module? In my case, I used the "preprocessing.scale" routine: " X = preprocessing.scale(dataDescrs_array) " This should call the same routine, at least that's the way I understood the d
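The two entry points do apply the same standardization; the class form additionally stores the fitted mean/std so the identical transform can be reused on a test set. A quick check (StandardScaler is the current name of the old Scaler class; the data is illustrative):

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.RandomState(0)
X = rng.uniform(0, 100, size=(50, 5))  # illustrative data

# Function form, as used in the thread
X_func = preprocessing.scale(X)

# Class form: fit once, then transform train and (later) test identically
scaler = preprocessing.StandardScaler().fit(X)
X_class = scaler.transform(X)

print(np.allclose(X_func, X_class))  # True
```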

Re: [Scikit-learn-general] PCA: first component too dominant?

2013-01-10 Thread Paul . Czodrowski
Sorry for the confusion, guys. But I did not scale my features - they contain a wild mixture of values: - floats ranging from 0 to 1200 - floats ranging from 0 to 60 - integers between 0 and 25 and so on... My fault! BTW, I tried to re-run the IRIS example ( http://scikit-learn.org/stable/auto

Re: [Scikit-learn-general] PCA: first component too dominant?

2013-01-10 Thread Paul . Czodrowski
> > I fear that I mixed up my syntax... > > Syntax looks good. > > If there is one largely predominant component in the data, you should be > able to see it with your naked eye: all the features should have series > that look similar to a scaling. > > G Do you mean that if you compare the compo

[Scikit-learn-general] PCA: first component too dominant?

2013-01-10 Thread Paul . Czodrowski
Dear SciKitters, when running a PCA on a rather small dataset, I end up in the situation that the first principal component is predominant. My dataset contains 694 samples with 177 features each. Here comes my code " X = dataDescrs_array y = dataActs_array target_names = ['inactive','active'] p
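A dominant first component is the expected outcome when one feature's range dwarfs the others; standardizing first spreads the explained variance. A sketch with ranges like those mentioned later in the thread (the data itself is synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Independent features with wildly different ranges (floats up to 1200,
# floats up to 60, integers up to 25), mimicking unscaled descriptors
X = np.column_stack([
    rng.uniform(0, 1200, 200),
    rng.uniform(0, 60, 200),
    rng.randint(0, 25, 200).astype(float),
])

pca_raw = PCA(n_components=2).fit(X)
pca_std = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Unscaled: the large-range feature dominates the first component;
# scaled: the variance is shared roughly evenly
print(pca_raw.explained_variance_ratio_)
print(pca_std.explained_variance_ratio_)
```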

Re: [Scikit-learn-general] cpickle a model

2013-01-04 Thread Paul . Czodrowski
Dear Andreas, thanks for your reply. Strangely enough, I'm getting different results after loading in the model. Below are my code snippets. The data source is of course identical. I will investigate it on that issue further, but maybe someone already sees a bug in my code. Cheers & Thanks,

[Scikit-learn-general] cpickle a model

2013-01-04 Thread Paul . Czodrowski
Dear SciKitters, there is one thing I don't understand when cPickling a model. Here is my code to pickle a model: " clf_kNN = KNeighborsClassifier() clf_kNN = clf_kNN.fit(dataDescrs_array,dataActs_array) cPickle.dump(clf_kNN,file("clf_kNN.descr.pk","wb+")) " And this is the way to ex
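A round-trip sketch of the dump/load cycle (pickle in Python 3 replaces cPickle; an in-memory buffer stands in for the file). With identical input features, the reloaded model must return identical predictions, so differing results usually point at the feature pipeline rather than the pickle step:

```python
import io
import pickle
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

clf_kNN = KNeighborsClassifier().fit(X, y)

# Dump the fitted model and load it back (BytesIO instead of a file)
buf = io.BytesIO()
pickle.dump(clf_kNN, buf)
buf.seek(0)
clf_loaded = pickle.load(buf)

# Same features in -> same predictions out
print(np.array_equal(clf_kNN.predict(X), clf_loaded.predict(X)))  # True
```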

Re: [Scikit-learn-general] logging during gridsearch

2012-12-29 Thread Paul . Czodrowski
Thanks a lot, Andy, it did the job! Cheers & Thanks, Paul > Hi Paul. > You didn't set verbosity. > The script you linked to set verbosity=1 but I think there is some more > output on higher verbosity levels. > Hth, > Andy This message and any attachment are confidential and may be privileged

[Scikit-learn-general] logging during gridsearch

2012-12-28 Thread Paul . Czodrowski
Dear SciKitters, inspired by this script: http://scikit-learn.org/stable/auto_examples/grid_search_text_feature_extraction.html => Very appealing that the remaining computing time is output! However, when adapting the script to my purposes, there is no such output in my case: " sc
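The progress output in that example comes from the verbose parameter; a minimal sketch with the modern import path (sklearn.model_selection; the class lived in sklearn.grid_search at the time):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# verbose=1 prints fit progress; higher values add per-fit details
grid = GridSearchCV(SVC(), {"C": [1, 10, 100]}, cv=3, verbose=1)
grid.fit(X, y)
print(grid.best_params_)
```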

[Scikit-learn-general] Computing the accuracy value

2012-11-21 Thread Paul . Czodrowski
Dear SciKitters, I want to do BernoulliNaiveBayes prediction. y_test is an array of integer values, the same holds true for y_predict, here comes my code snippet: " clf_NB = BernoulliNB() clf_NB = clf_NB.fit(X_train,y_train) y_predict= clf_NB.predict(X_test) accuracy = clf_NB
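The accuracy can be taken either from the estimator's score method or from accuracy_score on the two integer arrays; a sketch with synthetic binary features:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 10))  # binary features suit BernoulliNB
y = X[:, 0] | X[:, 1]                  # illustrative integer labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf_NB = BernoulliNB().fit(X_train, y_train)
y_predict = clf_NB.predict(X_test)

# Two equivalent routes to the accuracy value
acc_score_method = clf_NB.score(X_test, y_test)
acc_metric = accuracy_score(y_test, y_predict)
print(acc_score_method, acc_metric)
```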

Re: [Scikit-learn-general] set target_names / importance of features in a trained model

2012-11-12 Thread Paul . Czodrowski
> 2012/11/12 : > > Actually I would like to name the features. Sorry for the confusion! > > What do you want to use it for? If you have a sample x (a row from X) > and a list of n_features names for your features, then > > zip(names_of_features, x) > > will give you a list of (name, value) pa
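The zip suggested in the reply also pairs names with a trained model's importances; a sketch with hypothetical feature names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
names_of_features = ["MW", "logP", "HBD", "HBA"]  # hypothetical names

# Name the values of one sample x (a row from X)
x = X[0]
print(list(zip(names_of_features, x)))

# The same pattern ranks features by importance, most relevant first
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
ranked = sorted(zip(names_of_features, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
print(ranked)
```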

Re: [Scikit-learn-general] set target_names / importance of features in a trained model

2012-11-12 Thread Paul . Czodrowski
Dear Andy, > Hi Paul. > I am a bit confused. Do you want to name targets or features? Actually I would like to name the features. Sorry for the confusion! But since you mentioned it: Naming the targets would also be very helpful! > > Also, what does it mean to dump out the leaves of an RF? What

[Scikit-learn-general] set target_names / importance of features in a trained model

2012-11-12 Thread Paul . Czodrowski
Dear SciKitters, given an array of (n_samples,n_features) -> How do I assign target_names in a concluding step? The target_names are stored in a list and, of course, have the same order as the n_features vector. In a next step, I would like to dump out the importance of the most relevant featur

Re: [Scikit-learn-general] RandomForest outputs 4 classes, but only 2 classes are inside

2012-11-09 Thread Paul . Czodrowski
> > However, after splitting into test and train: > > " > > sklearn.cross_validation import train_test_split > > X_train,X_test,y_train,y_test = train_test_split > > (dataDescrs_array,dataActs_array,test_size=.4) > > " > > What does np.unique(y_train) look like? array(['0', '1'], dtype='ht

[Scikit-learn-general] RandomForest outputs 4 classes, but only 2 classes are inside

2012-11-09 Thread Paul . Czodrowski
Dear SciKitters, given a dataset (2200 samples, 90 features), I want to train a RF but run into an interesting issue. My array containing the labels (dataActs_array) only shows 2 classes: " from collections import defaultdict d = defaultdict(int) for elt in dataActs_array: d[elt] += 1 print d
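One plausible source of phantom classes (not confirmed in the thread) is labels arriving as strings, where stray whitespace makes '0' and '0 ' distinct; normalizing to integers before splitting removes them. A sketch:

```python
import numpy as np

# Labels read from a text file often arrive as strings; whitespace
# variants then count as separate classes
dataActs_array = np.array(['0', '1', '0 ', ' 1', '0', '1'])
print(np.unique(dataActs_array))  # 4 distinct string "classes"

# Casting to int strips the whitespace and restores the 2 real classes
y = np.array([int(s) for s in dataActs_array])
print(np.unique(y))  # [0 1]
```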

Re: [Scikit-learn-general] RandomForest - optimisation of min_samples_split

2012-11-07 Thread Paul . Czodrowski
Dear Andreas, Dear Gilles, Dear SciKitters, > Hi Paul > Tuning min_samples_split is a good idea but not related to imbalanced > classes. > First, you should specify what you want to optimize. Accuracy is usually > not a good measure for imbalanced classes. Maybe F-score? How would one do that? I j
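Optimizing min_samples_split against an F-score instead of accuracy is a one-argument change in GridSearchCV (scoring="f1"); a modern-API sketch on imbalanced synthetic data (note current scikit-learn requires min_samples_split >= 2):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Imbalanced toy data standing in for the thread's class ratio
X, y = make_classification(n_samples=600, n_features=20,
                           weights=[0.73], random_state=0)

tuned_parameters = [{"min_samples_split": [2, 3, 4, 5, 6, 7, 8, 9]}]
grid = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=0),
                    tuned_parameters, scoring="f1", cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```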

[Scikit-learn-general] RandomForest - optimisation of min_samples_split

2012-11-07 Thread Paul . Czodrowski
Dear SciKitters, given a dataset of 622 samples and 177 features each, I want to classify those given an experimental classification stating "0" or "1". After splitting up into training and test set, I trained a RandomForest the following way: " from sklearn.ensemble import RandomForestClassifie

Re: [Scikit-learn-general] RF optimisation - class weights etc.

2012-11-06 Thread Paul . Czodrowski
> b) You shouldn't set max_depth=5. Instead, build fully developed trees > (max_depth=None) or rather tune min_samples_split using > cross-validation. Dear Gilles, I have set up a grid search: " tuned_parameters = [{'min_samples_split': [1,2,3,4,5,6,7,8,9]}] scores = [('precision', precision_sc

Re: [Scikit-learn-general] RF optimisation - class weights etc.

2012-11-06 Thread Paul . Czodrowski
Dear Gilles, > Hi Paul, > > a) Scaling has no effect on decision trees. Thanks! > > b) You shouldn't set max_depth=5. Instead, build fully developed trees > (max_depth=None) or rather tune min_samples_split using > cross-validation. Do fully developed trees make sense for rather small datasets?

[Scikit-learn-general] RF optimisation - class weights etc.

2012-11-06 Thread Paul . Czodrowski
Dear SciKitters, given a rather unbalanced data set (454 samples with classification "0" and 168 samples with classification "1"), I would like to train a RandomForest. For my data set, I have calculated 177 features per sample. In a first step, I have preprocessed my data set: " dataDescrs_array
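For an imbalance like 454/168, today's RandomForestClassifier also offers class_weight='balanced' (not yet available in the 2012 release), which reweights samples inversely to class frequency; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Roughly the thread's 454:168 class ratio, on synthetic data
X, y = make_classification(n_samples=622, n_features=20,
                           weights=[0.73], random_state=0)

clf = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                             random_state=0).fit(X, y)
print(clf.score(X, y))
```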

Re: [Scikit-learn-general] data preprocessing

2012-11-01 Thread Paul . Czodrowski
Dear SciKitters, > > However, I found it strange that "X_train.shape" gives (373, 177) - > > shouldn't be the second bit be the number of classes, i.e. 2? > > [snip] > > > 177 corresponds, BTW, to the number of features.. > > And that's exactly what this is supposed to represent. The number of

Re: [Scikit-learn-general] data preprocessing

2012-11-01 Thread Paul . Czodrowski
> > given a list of features - e.g. dataDescrs[0] = (140.0, 2, 0.5) - and a > > list of experimental observations - e.g. data_activities[0] = 0 - how do I > > transform these lists to the scikit-learn nomenclature? > > Depends on what these things represent, but if all tuples in > dataDescrs h

[Scikit-learn-general] data preprocessing

2012-11-01 Thread Paul . Czodrowski
Dear Scikitters, given a list of features - e.g. dataDescrs[0] = (140.0, 2, 0.5) - and a list of experimental observations - e.g. data_activities[0] = 0 - how do I transform these lists to the scikit-learn nomenclature? Cheers & Thanks, Paul
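Answering the question as asked: the scikit-learn convention is a numeric X of shape (n_samples, n_features) plus a y of shape (n_samples,), which np.asarray produces directly as long as every feature tuple has the same length. A sketch with the thread's first tuple plus illustrative extra rows:

```python
import numpy as np

dataDescrs = [(140.0, 2, 0.5),   # the thread's example tuple
              (98.1, 0, 0.9),    # illustrative additional samples
              (210.4, 5, 0.1)]
data_activities = [0, 1, 0]

X = np.asarray(dataDescrs, dtype=float)  # shape (n_samples, n_features)
y = np.asarray(data_activities)          # shape (n_samples,)
print(X.shape, y.shape)  # (3, 3) (3,)
```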

Re: [Scikit-learn-general] n_jobs in GridSearch

2012-10-26 Thread Paul . Czodrowski
> When you grid search C for rbf with non linear kernels (such as RBF) > you should always also grid search for the optimal value of the kernel > parameters (e.g. gamma for RBF kernels). The zero scores you get > probably stem from a default value of gamma that does not work at all > for your data
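The joint C/gamma search recommended above looks like this in GridSearchCV (the values are the usual coarse logarithmic grid, not taken from the thread):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Search C and gamma together for the RBF kernel
tuned_parameters = [{"kernel": ["rbf"],
                     "C": [1, 10, 100, 1000],
                     "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}]
grid = GridSearchCV(SVC(), tuned_parameters, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```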

Re: [Scikit-learn-general] n_jobs in GridSearch

2012-10-26 Thread Paul . Czodrowski
I think that I have to re-phrase my post, since I discovered an awkward behavior using SVC and the linear kernel - exactly THIS kernel takes ages on my dataset. E.g. the "RBF" kernel runs perfectly, and so does the GridSearch! :) Excluding the "linear" kernel from GridSearch now gives the followi

Re: [Scikit-learn-general] n_jobs in GridSearch

2012-10-26 Thread Paul . Czodrowski
Dear Gael, > > My problem is that the job without parallelisation takes ages, whereas on a > > single CPU, it runs in 0.66 seconds. > > You mean 'with parallelisation takes ages', right? Yep. > > When you say 'takes ages', do you see it finish? > > What OS are you on? I suspect that you are on window
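On Windows, the usual culprit for a hanging n_jobs > 1 is a script without the __main__ guard, since process-based parallelism re-imports the module in each worker; a sketch of the safe layout (assuming this hypothesis applies to the thread's case):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def run_search():
    X, y = make_classification(n_samples=100, n_features=10, random_state=0)
    grid = GridSearchCV(SVC(), {"C": [1, 10, 100, 1000]}, cv=3, n_jobs=2)
    grid.fit(X, y)
    return grid

# Without this guard, workers spawned on Windows re-execute the module
# top level and the search can hang instead of finishing
if __name__ == "__main__":
    print(run_search().best_params_)
```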

Re: [Scikit-learn-general] n_jobs in GridSearch

2012-10-26 Thread Paul . Czodrowski
Dear SciKitters, > > > > I was wondering if I properly defined the grid search in the case of a SVM: > > > > " > > # code snippet > > tuned_parameters = [{'kernel': ['linear'],'C': [1,10,100,1000]}] > > scores = [ ('precision', precision_score), ('recall', recall_score),] > > for score_name, score

[Scikit-learn-general] n_jobs in GridSearch

2012-10-26 Thread Paul . Czodrowski
Dear SciKitters, I'm rather new to this wonderful toolkit and starting to use it in the cheminformatics environment. I was wondering if I properly defined the grid search in the case of a SVM: " # code snippet tuned_parameters = [{'kernel': ['linear'],'C': [1,10,100,1000]}] scores = [ ('precisi