Re: [scikit-learn] why the modification in the df-idf formula?

2024-05-28 Thread Sebastian Raschka
implementation). As far as I remember, the sklearn version addressed some instability issues for certain edge cases. I am not sure if that helps, but I have briefly compared the textbook vs the sklearn tf-idf here:  https://github.com/rasbt/machine-learning-book/blob/main/ch08/ch08.ipynb Best, Sebastian
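A minimal sketch of how the two idf variants can differ, assuming the default `smooth_idf=True` setting, for which scikit-learn computes idf = ln((1 + n) / (1 + df)) + 1. The "textbook" form ln(n / df) is only one common variant, and the count matrix is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy term-count matrix (3 documents x 2 terms); values are made up.
counts = np.array([[3, 0],
                   [2, 1],
                   [1, 0]])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)  # number of documents containing each term

# One common textbook variant: a term present in every document gets idf 0.
textbook_idf = np.log(n_docs / df)

# Smoothed variant (assumed to match smooth_idf=True): never 0, never divides by 0.
smoothed_idf = np.log((1 + n_docs) / (1 + df)) + 1

tfidf = TfidfTransformer(smooth_idf=True, norm=None).fit(counts)
print(textbook_idf, smoothed_idf, tfidf.idf_)
```

The added 1 keeps terms that occur in every document from being zeroed out, and the smoothing avoids a division by zero for terms with df = 0.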

Re: [scikit-learn] New core developer: Tim Head

2023-03-08 Thread Sebastian Raschka
Awesome news! Congrats Tim! Cheers, Sebastian On Mar 8, 2023, 8:35 AM -0600, Ruchika Nayyar , wrote: > Congratulations Tim! Good to see you virtually :) > > Thanks, > Ruchika > > > Dr. Ruchika Nayyar > Data Scientist, Greene Tweed & Co. >

Re: [scikit-learn] [ANNOUNCEMENT] scikit-learn 1.0 release

2021-09-24 Thread Sebastian Raschka
A 1.0 release is huge, and this is really awesome news! Very exciting! Congrats to the scikit-learn team and everyone who helped make this possible! Cheers, Sebastian On Sep 24, 2021, 11:40 AM -0500, Adrin , wrote: > Hi everyone, > > We're happy to announce the 1.0 release

Re: [scikit-learn] Regarding negative value of sklearn.metrics.r2_score and sklearn.metrics.explained_variance_score

2021-08-12 Thread Sebastian Raschka
book probably didn't cover applying a model to an independent data or test set, hence the [0, 1] suggestion. Cheers, Sebastian On Aug 12, 2021, 2:20 PM -0500, Samir K Mahajan , wrote: > > Dear Christophe Pallier,  Reshama Saikh and Tromek Drabas, > > Thank you for your kind respo

Re: [scikit-learn] Can I install Python ML library such as XGBoost without pip?

2021-04-06 Thread Sebastian Gurovich
Could a Virtual Machine be an option for you? Good luck On Tue, 6 Apr 2021, 7:00 pm C W, wrote: > Thanks David. Those discussion boards are indeed very helpful. > > Thanks for providing the lead. > > Best, > > Mike > > On Mon, Apr 5, 2021 at 12:06 PM David Nicholson > wrote: > >> You might fin

Re: [scikit-learn] Presented scikit-learn to the French President

2020-12-05 Thread Sebastian Raschka
Best, Sebastian > On Dec 5, 2020, at 9:28 AM, Jitesh Khandelwal wrote: > > Amazing, inspiring! Kudos to the sklearn team. > > On Sat, Dec 5, 2020, 4:30 AM Gael Varoquaux > wrote: > Hi scikit-learn community, > > Today, I presented some efforts in digital health to the Fr

Re: [scikit-learn] make_classification question

2020-08-12 Thread Sebastian Raschka
will be the informative ones. Best, Sebastian > On Aug 12, 2020, at 8:35 AM, Anna Jenul wrote: > > Hi! > I am generating own datasets with sklearn.datasets.make_classification. > Unfortunately, I cannot figure out which of the generated features are the > informative ones. I

Re: [scikit-learn] The exact formula used to compute the tf-idf

2020-02-01 Thread Sebastian Raschka
cikit-learn.ipynb (I remember that we used it to write portions of the documentation in sklearn later) Best, Sebastian > On Feb 1, 2020, at 12:53 PM, Peng Yu wrote: > > Hi, > > I am trying to understand the exact formula for tf-idf. > > vectorizer = TfidfVectorizer(ngram_r

Re: [scikit-learn] What are the stopwords used by CountVectorizer?

2020-01-27 Thread Sebastian Raschka
Hi Peng, check out https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/_stop_words.py Best, Sebastian > On Jan 27, 2020, at 2:30 PM, Peng Yu wrote: > > Hi, > > I don't see what stopwords are used by CountVectorizer with > stop

Re: [scikit-learn] scikit-learn twitter account

2019-11-04 Thread Sebastian Raschka
what they are doing with @PyTorch. That would be super nice. Best. Sebastian > On Nov 4, 2019, at 8:04 AM, Guillaume Lemaître wrote: > > +1 for outreach / -1 for support > > FWIW we have several persons asking us how they could know about future > sprints at the Man AHL s

Re: [scikit-learn] Can we say stochastic gradient descent as an ML model?

2019-10-28 Thread Sebastian Raschka
Hi Bulbul, I would rather say SGD is a method for optimizing the objective function of certain ML models, or optimize the loss function of certain ML models / learn the parameters of certain ML models. Best, Sebastian > On Oct 28, 2019, at 4:00 PM, Bulbul Ahmmed via scikit-learn >

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-06 Thread Sebastian Raschka
igure?). > > > On 10/6/19 10:40 AM, Sebastian Raschka wrote: >> Sure, I just ran an example I made with graphviz via plot_tree, and it looks >> like there's an issue with overlapping boxes if you use class (and/or >> feature) names. I made a reproducible example here so

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-06 Thread Sebastian Raschka
_tree/tree-demo-1.ipynb Happy to add this to the sklearn issue list if there's no issue filed for that yet. Best, Sebastian > On Oct 6, 2019, at 9:10 AM, Andreas Mueller wrote: > > > > On 10/4/19 11:28 PM, Sebastian Raschka wrote: >> The docs show a way such that yo

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-04 Thread Sebastian Raschka
Tue: https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb Best, Sebastian > On Oct 4, 2019, at 10:09 PM, C W wrote: > > On a separate note, what do you use for plotting? > > I found graphviz, but you have to first save it as a pn

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-04 Thread Sebastian Raschka
Yeah, think of it more as a computational workaround for achieving the same thing more efficiently (although it looks inelegant/weird)-- something like that wouldn't be mentioned in textbooks. Best, Sebastian > On Oct 4, 2019, at 6:33 PM, C W wrote: > > Thanks Sebastian, I

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-04 Thread Sebastian Raschka
right child node else left child node Instead, what it does is if x >= 0.5 then right child node else left child node These are basically equivalent as you can see when you just plug in values 0 and 1 for x. Best, Sebastian > On Oct 4, 2019, at 5:34 PM, C W wrote: > > I don&
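The equivalence can be checked with a short sketch. The start of the message is cut off, so the equality-based "textbook" rule below is an assumption about what it described; both function names are hypothetical.

```python
def textbook_split(x):
    # Assumed textbook rule for a binary (one-hot) feature: test equality with 1.
    return "right" if x == 1 else "left"


def threshold_split(x, threshold=0.5):
    # Threshold rule as quoted in the message: x >= 0.5 goes right.
    return "right" if x >= threshold else "left"


# A one-hot encoded feature only takes the values 0 and 1.
for x in (0, 1):
    print(x, textbook_split(x), threshold_split(x))
```

Because the feature never takes a value strictly between 0 and 1, any threshold in that open interval sends the samples to the same children.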

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-04 Thread Sebastian Raschka
as car_Audi=0 if car_Audi < 0.5 or, it may be treat as car_Audi=1 if car_Audi > 0.5 treat as car_Audi=0 if car_Audi <= 0.5 (Forgot which one sklearn is using, but either way. it will be fine.) Best, Sebastian > On Oct 4, 2019, at 1:44 PM, Nicolas Hug wrote: > > >>

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-04 Thread Sebastian Raschka
nal variable, so you have to do the onehot encoding before you give the data to the decision tree. Best, Sebastian > On Oct 4, 2019, at 11:48 AM, C W wrote: > > I'm getting some funny results. I am doing a regression decision tree, the > response variables are assigned to le

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-09-13 Thread Sebastian Raschka
ation does not support categorical variables for > > now". we discussed via the previous email was referring to feature variables. Whether you choose the DT regressor or classifier depends on the format of your target variable. Best, Sebastian > On Sep 13, 2019, at 11:41 PM, C W

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-09-13 Thread Sebastian Raschka
, you will end up with a large number of binary variables, and they may dominate in the resulting tree over other feature variables). In any case, I guess this is what > "scikit-learn implementation does not support categorical variables for now". means ;). Best, Sebastian >

Re: [scikit-learn] No convergence warning in logistic regression

2019-08-30 Thread Sebastian Raschka
'lbfgs')? Best, Sebastian > On Aug 30, 2019, at 9:52 AM, Benoît Presles > wrote: > > Dear all, > > I compared the logistic regression of statsmodels (Logit) with the logistic > regression of sklearn (LogisticRegression). As I do not do regularization, I > use the

Re: [scikit-learn] Predict Method of OneVsRestClassifier Integration with Google Cloud ML

2019-04-10 Thread Sebastian Raschka
Hm, weird that their platform seems to be so picky about it. Have you tried to just make the output of the pipeline dense? I.e., (model.predict(X)).toarray() Best, Sebastian > On Apr 10, 2019, at 1:10 PM, Liam Geron wrote: > > Hi Sebastian, > > Thanks for the advice! The

Re: [scikit-learn] Predict Method of OneVsRestClassifier Integration with Google Cloud ML

2019-04-10 Thread Sebastian Raschka
'tfidf', TfidfVectorizer()), ('to_dense', DenseTransformer()), ('clf', OneVsRestClassifier(XGBClassifier()))]) Best, Sebastian > On Apr 10, 2019, at 12:25 PM, Liam Geron wrote: > > Hi all, > > I was hoping to get some guidance re: changing the result of th

Re: [scikit-learn] GridsearchCV returns worse scoring the broader parameter space gets

2019-03-31 Thread Sebastian Raschka
, it looks like you are computing the performance manually: > simple_tree.fit(x_tr,y_tr).score(x_tr,y_tr) on the whole training set. Instead, I would take a look at the simple_tree.best_score_ attribute after fitting. If you do Best, Sebastian > On Mar 31, 2019, at 5:15 AM, Andreas Tos
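A small sketch of the difference between the two numbers, using the iris data and a decision tree as stand-ins (the original poster's data and estimator are not shown in this snippet).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

gs = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, None]},
    cv=5,
)
gs.fit(X, y)

# Score of the refitted best model on the same data it was trained on.
train_score = gs.score(X, y)

# Mean score over the held-out folds for the best parameter setting.
cv_score = gs.best_score_

print(gs.best_params_, train_score, cv_score)
```

The training-set score tends to look better as the search space grows, because more flexible models fit the training data more closely; `best_score_` is the number that reflects held-out performance.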

Re: [scikit-learn] What theory cause SGDRegressor can partial_fit but RandomForestRegressor can't?

2019-03-13 Thread Sebastian Raschka
's less natural and not a common thing to do, which is why it's probably not implemented in scikit-learn. Best, Sebastian > On Mar 13, 2019, at 10:45 PM, lampahome wrote: > > As title, I'm confused why some algo can partial_fit and some algo can't. > >

Re: [scikit-learn] AUCROC/MAP confidence intervals in scikit

2019-02-07 Thread Sebastian Raschka
Still haven't had a chance to read it, but ROC for binary classification anyway? Also, i.i.d. assumptions are typical for the learning algorithms as well. Best, Sebastian > On Feb 7, 2019, at 10:15 AM, josef.p...@gmail.com wrote: > > Just a skeptical comment from a bystande

Re: [scikit-learn] AUCROC/MAP confidence intervals in scikit

2019-02-06 Thread Sebastian Raschka
u have. In large datasets, binomial approximation intervals may be sufficient and bootstrapping too expensive etc. Thanks for sharing that paper btw, will have a look. Best, Sebastian > On Feb 6, 2019, at 11:28 AM, Stuart Reynolds > wrote: > > https://papers.nips.cc/paper/2645-co

Re: [scikit-learn] Does model consider about previous training results after reloading model and then training with new data?

2019-01-31 Thread Sebastian Raschka
ier's decision rule is fixed. I think the following could work if the estimators_ support partial_fit: voter = VotingClassifier(...) voter.fit(...) For further training: for i in range(len(voter.estimators_)): voter.estimators_[i].partial_fit(...) Best, Sebastian > On Feb 1, 2019, at

Re: [scikit-learn] Does model consider about previous training results after reloading model and then training with new data?

2019-01-31 Thread Sebastian Raschka
Hi there, if you call the "fit" method, the learning will essentially start from scratch. So no, it doesn't consider previous training results. However, certain algorithms are implemented with an additional partial_fit method that would consider previous training rounds. Best,
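A minimal sketch of the difference, using `SGDClassifier` as one example of an estimator that provides `partial_fit`; the random data is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X1, y1 = rng.randn(50, 3), rng.randint(0, 2, 50)
X2, y2 = rng.randn(50, 3), rng.randint(0, 2, 50)

# partial_fit continues from the current weights.
incremental = SGDClassifier(random_state=0)
incremental.partial_fit(X1, y1, classes=np.array([0, 1]))
coef_after_first_batch = incremental.coef_.copy()
incremental.partial_fit(X2, y2)

# fit discards what was learned before: fitting on batch 1 and then on
# batch 2 gives the same model as fitting on batch 2 alone.
refit = SGDClassifier(random_state=0).fit(X1, y1).fit(X2, y2)
fresh = SGDClassifier(random_state=0).fit(X2, y2)

print(np.allclose(refit.coef_, fresh.coef_))
```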

Re: [scikit-learn] LogisticRegression coef_ greater than n_features?

2019-01-08 Thread Sebastian Raschka
t = ohe.fit_transform(x) xt.todense() matrix([[1., 0., 1., 0., 0.], [0., 1., 0., 1., 0.], [1., 0., 0., 0., 1.]]) Best, Sebastian > On Jan 8, 2019, at 9:33 AM, pisymbol wrote: > > Also Sebastian, I have binary classes but they are strings: > > clf.classes_: &g

Re: [scikit-learn] LogisticRegression coef_ greater than n_features?

2019-01-07 Thread Sebastian Raschka
E.g, if you have a feature with values 'a' , 'b', 'c', then applying the one hot encoder will transform this into 3 features. Best, Sebastian > On Jan 7, 2019, at 11:02 PM, pisymbol wrote: > > > > On Mon, Jan 7, 2019 at 11:50 PM pisymbol wro
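The expansion can be reproduced with a short sketch; the single-column array below is a made-up stand-in for the categorical feature.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature with three distinct values.
x = np.array([["a"], ["b"], ["c"]])

ohe = OneHotEncoder()
xt = ohe.fit_transform(x)  # sparse matrix by default
dense = xt.toarray()

print(ohe.categories_)
print(dense.shape)  # one output column per distinct category
```

A linear model fitted on `dense` would then have one coefficient per encoded column rather than one per original feature.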

Re: [scikit-learn] LogisticRegression coef_ greater than n_features?

2019-01-07 Thread Sebastian Raschka
Maybe check a) if the actual labels of the training examples don't start at 0 b) if you have gaps, e.g,. if your unique training labels are 0, 1, 4, ..., 23 Best, Sebastian > On Jan 7, 2019, at 10:50 PM, pisymbol wrote: > > According to the doc (0.20.2) the coef_ variables are

Re: [scikit-learn] How GridSearchCV to get best_params?

2019-01-03 Thread Sebastian Raschka
I think it refers to the test folds via the k-fold cross-validation that is internally used via the `cv` parameter of GridSearchCV (or the test folds of an alternative cross validation scheme that you may pass as an iterator to cv) Best, Sebastian > On Jan 3, 2019, at 9:44 PM, lampahome wr
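As a rough illustration, the per-fold test scores show up in `cv_results_` whether `cv` is an integer or a splitter object; the dataset, estimator, and grid here are arbitrary choices for the sketch.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An explicit splitter passed via cv instead of an integer.
splitter = KFold(n_splits=3, shuffle=True, random_state=0)

gs = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3]},
    cv=splitter,
)
gs.fit(X, y)

# One "splitK_test_score" entry per fold, plus the mean across folds.
fold_keys = sorted(k for k in gs.cv_results_ if k.startswith("split"))
print(fold_keys)
print(gs.cv_results_["mean_test_score"])
```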

Re: [scikit-learn] Any way to tune the parameters better than GridSearchCV?

2018-12-24 Thread Sebastian Raschka
more trees and see if you notice any significant difference in the cross-validation performance. Next, I would use the model and fit it to the whole training set with those best hyperparameters and evaluate the performance on the independent test set. Best, Sebastian > On Dec 24, 2018, at

Re: [scikit-learn] time complexity of tree-based model?

2018-12-20 Thread Sebastian Raschka
tiply the number of decision trees in the forest Best, Sebastian > On Dec 20, 2018, at 1:09 AM, lampahome wrote: > > I do some benchmark in my experiments and I almost use ensemble-based > regressor. > > What is the time complexity if I use random forest regressor? Assume

Re: [scikit-learn] plan to add the association rule classification algorithm in scikit learn

2018-12-16 Thread Sebastian Raschka
alternative algorithm for frequent itemset generation in mind (I am not sure if others exist, to be honest). I would also be happy about that one, too. Best, Sebastian > On Dec 17, 2018, at 12:26 AM, Joel Nothman wrote: > > Hi Rui, > > This has been discussed several times on t

Re: [scikit-learn] make all new parameters keyword-only?

2018-11-15 Thread Sebastian Raschka
cross different package versions) despite (or maybe because) being more verbose. Best, Sebastian > On Nov 15, 2018, at 10:18 PM, Brown J.B. via scikit-learn > wrote: > > As an end-user, I would strongly support the idea of future enforcement of > keyword arguments for new param

Re: [scikit-learn] How does the random state influence the decision tree splits?

2018-10-28 Thread Sebastian Raschka
That's nice to know, thanks a lot for the reference! Best, Sebastian > On Oct 28, 2018, at 3:34 AM, Guillaume Lemaître > wrote: > > FYI: https://github.com/scikit-learn/scikit-learn/pull/12364 > > On Sun, 28 Oct 2018 at 09:32, Guillaume Lemaître > wrote: > The

Re: [scikit-learn] How does the random state influence the decision tree splits?

2018-10-27 Thread Sebastian Raschka
ge the implementation such that the shuffling only occurs if max_features < n_features, because this way we could have deterministic behavior for the trees by default, which I'd find more intuitive for plain decision trees tbh. Let me know what you all think. Best, Sebastian > On

Re: [scikit-learn] How does the random state influence the decision tree splits?

2018-10-27 Thread Sebastian Raschka
about the random feature selection if max_features is not n_features, that there is generally some sorting of the features going on, and the different trees are then due to tie-breaking if two features have the same information gain? Best, Sebastian > On Oct 27, 2018, at 6:16 PM, Javier Lópe

[scikit-learn] How does the random state influence the decision tree splits?

2018-10-27 Thread Sebastian Raschka
e random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by `np.random`. Best, Sebastian

Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-10-03 Thread Sebastian Raschka
sth like that. Best, Sebastian > On Oct 3, 2018, at 5:49 AM, Javier López wrote: > > > On Tue, Oct 2, 2018 at 5:07 PM Gael Varoquaux > wrote: > The reason that pickles are brittle and that sharing pickles is a bad > practice is that pickle use an implicitly defined data model

Re: [scikit-learn] Splitting Method on RandomForestClassifier

2018-10-02 Thread Sebastian Raschka
llowable depth is reached" So but this is basically not the whole definition, right? There should be condition that if the weighted average of the child node impurities for any given feature is not smaller than the parent node impurity, the tree growing algorithm would terminate, right?

Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-28 Thread Sebastian Raschka
mltools Didn't know about that. This is really nice! What do you think about referring to it under http://scikit-learn.org/stable/modules/model_persistence.html to make people aware that this option exists? Would be happy to add a PR. Best, Sebastian > On Sep 28, 2018, at 9:30 AM, Olivi

Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-27 Thread Sebastian Raschka
Congrats everyone, this is awesome!!! I just started teaching an ML course this semester and introduced scikit-learn this week -- it was a great timing to demonstrate how well maintained the library is and praise all the efforts that go into it :). > I think model serialization should be a pri

Re: [scikit-learn] Contribute to Scikit-learn

2018-09-03 Thread Sebastian Raschka
ppreciate feedback regarding the current implementation. Best, Sebastian > On Sep 3, 2018, at 7:50 AM, Guillaume Lemaître wrote: > > I would add that Sequential Forward Selection is on the way to be > ported by Sebastian (@rasbt) > to scikit-learn: > > https://github.co

Re: [scikit-learn] ANN Scikit-learn 0.20rc1 release candidate available

2018-08-31 Thread Sebastian Raschka
you prioritized the maintenance and improvement of scikit-learn as a fundamental ML library, rather than adding useful yet "niche" features. Cheers, Sebastian > On Aug 31, 2018, at 8:26 PM, Andreas Mueller wrote: > > Hey Folks! > > I'm happy to announce that the scikit-

Re: [scikit-learn] Unable to connect HDInsight hive to python

2018-08-12 Thread Sebastian Raschka
Hi Debu, since Azure HDInsights is a commercial service, their customer support should handle questions like this > On Aug 12, 2018, at 7:16 AM, Debabrata Ghosh wrote: > > Hi All, >Greetings ! Wish you are doing good ! I am just > reaching out to you in case if you hav

Re: [scikit-learn] Using GPU in scikit learn

2018-08-08 Thread Sebastian Raschka
's a good thing or a bad thing -- whether it's stable enough that it didn't need any updates). Anyway, maybe worth a try: https://github.com/EasonLiao/CudaTree Otherwise, I can imagine there are probably alternative implementations out there? Best, Sebastian > On Aug 8, 2

Re: [scikit-learn] Help with Pull Request( Checks failing)

2018-07-24 Thread Sebastian Raschka
I am not a core dev, but I think I can see what's wrong there (mostly Flake8 issues). Let me comment about that over there. > On Jul 24, 2018, at 7:34 PM, Prathusha Jonnagaddla Subramanyam Naidu > wrote: > > This is the link to the PR - > https://github.com/scikit-learn/scikit-learn/pull/1167

Re: [scikit-learn] RFE with logistic regression

2018-07-24 Thread Sebastian Raschka
In addition to checking n_iter_ and fixing the random seed as I suggested, maybe also try normalizing the features (e.g., z-scores via the StandardScaler) to see if that stabilizes the training. > On Jul 24, 2018, at 1:07 PM, Benoît Presles > wrote: > > I did the same tests

Re: [scikit-learn] RFE with logistic regression

2018-07-24 Thread Sebastian Raschka
sure that .n_iter_ < .max_iter to see if that results in more consistency. Best, Sebastian > On Jul 24, 2018, at 11:16 AM, Stuart Reynolds > wrote: > > liblinear regularizes the intercept (which is a questionable thing to > do and a poor choice of default in sklearn). >

Re: [scikit-learn] New core dev: Joris Van den Bossche

2018-06-23 Thread Sebastian Raschka
gards, Sebastian > On Jun 23, 2018, at 6:42 AM, Olivier Grisel wrote: > > Hi everyone! > > Let's welcome Joris Van den Bossche (@jorisvdbossche) officially as a > scikit-learn core developer! > > Joris is one of the maintainers of the pandas project and recently &

Re: [scikit-learn] Jeff Levesque: association rules

2018-06-11 Thread Sebastian Raschka
es/ Best, Sebastian > On Jun 11, 2018, at 7:23 AM, Jeffrey Levesque via scikit-learn > wrote: > > Hi guys, > What are some good approaches for association rules. Is there something built > in, or do people sometimes use alternate packages, maybe apache spark? > > Than

Re: [scikit-learn] Supervised prediction of multiple scores for a document

2018-06-03 Thread Sebastian Raschka
r a specified number of topics (e.g., 10, but depends on your dataset, I would experiment a bit here), look at the top words in each topic and then assign a topic label to each topic. Then, for a given article, you can assign e.g., the top X labeled topics. Best, Sebastian > On Jun

Re: [scikit-learn] Supervised prediction of multiple scores for a document

2018-06-03 Thread Sebastian Raschka
sorry, I had a copy & paste error, I meant "LogisticRegression(..., multi_class='multinomial')" and not "LogisticRegression(..., multi_class='ovr')" > On Jun 3, 2018, at 5:19 PM, Sebastian Raschka > wrote: > > Hi, > >> I

Re: [scikit-learn] DBScan freezes my computer !!!

2018-05-13 Thread Sebastian Raschka
pre-compute the distances and give that to the .fit() method after initializing the DBSCAN object with metric='precomputed') Best, Sebastian > On May 13, 2018, at 7:23 PM, Mauricio Reis wrote: > > I think the problem is due to the size of my database, which has 44,000 > re
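A minimal sketch of the precomputed-distance route on a tiny made-up dataset; `eps` and `min_samples` are arbitrary values chosen for the toy points.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# Two well-separated groups of three points each (made-up data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Square matrix of pairwise distances, shape (n_samples, n_samples).
D = pairwise_distances(X, metric="euclidean")

db = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit(D)
labels = db.labels_
print(labels)
```

Note that the distance matrix grows quadratically with the number of samples, so this route trades memory for the ability to reuse the distances.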

Re: [scikit-learn] Breiman vs. scikit-learn definition of Feature Importance

2018-05-04 Thread Sebastian Raschka
an independent validation set though, because it's a general function that should not be restricted to random forests. If you have such an independent dataset, it should give more accurate results than using OOB samples. Best, Sebastian > On May 4, 2018, at 7:10 PM, Niyaghi, Faraz wro

Re: [scikit-learn] Retracting model from the 'blackbox' SVM

2018-05-04 Thread Sebastian Raschka
b/master/sklearn/svm/base.py And more info on the LIBLINEAR library it is using can be found here: https://www.csie.ntu.edu.tw/~cjlin/liblinear/ (they have links to technical reports and implementation details there) Best, Sebastian > On May 4, 2018, at 5:12 AM, Wouter Verduin wrote: >

Re: [scikit-learn] MLPClassifier - Softmax activation function

2018-04-18 Thread Sebastian Raschka
ax is, regardless of "activation," automatically used in the output layer. Best, Sebastian > On Apr 18, 2018, at 3:15 PM, Daniel Baláček wrote: > > Hello everyone > > I have a question regarding MLPClassifier in sklearn. In the documentation in > section 1
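One way to check this on a fitted model is the `out_activation_` attribute; the sketch below uses the three-class iris data and an arbitrary small network, and may emit a convergence warning because of the low iteration cap.

```python
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

# "activation" only sets the hidden-layer activation function.
mlp = MLPClassifier(hidden_layer_sizes=(5,), activation="tanh",
                    max_iter=50, random_state=0)
mlp.fit(X, y)

print(mlp.out_activation_)  # output-layer activation chosen by the estimator
proba = mlp.predict_proba(X)
```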

Re: [scikit-learn] Using KMeans cluster labels in KNN

2018-03-12 Thread Sebastian Raschka
Hi, If you want to predict the KMeans cluster membership, you can use KMeans' predict method instead of training a KNN model on the cluster assignments. This will be computationally more efficient and give you the correct assignment at the borders between clusters. Best, Sebastian > O
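A short sketch of this suggestion on made-up points: new samples are assigned to the nearest learned centroid by `predict`, with no second model involved.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups (made-up data).
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

# New points are assigned to the closest cluster center.
X_new = np.array([[0.1, 0.0], [4.8, 5.2]])
new_labels = km.predict(X_new)
print(km.labels_, new_labels)
```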

Re: [scikit-learn] Need help in dealing with large dataset

2018-03-05 Thread Sebastian Raschka
N implementation you use. I have some examples here if that helps: - https://github.com/rasbt/deep-learning-book/blob/master/code/model_zoo/pytorch_ipynb/custom-data-loader-celeba.ipynb - https://github.com/rasbt/deep-learning-book/blob/master/code/model_zoo/pytorch_ipynb/custom-data-loader-csv.ip

Re: [scikit-learn] custom loss function in RandomForestRegressor

2018-03-01 Thread Sebastian Raschka
Unfortunately (or maybe fortunately :)) no, maximizing variance reduction & minimizing MSE are just special cases :) Best, Sebastian > On Mar 1, 2018, at 9:59 AM, Thomas Evangelidis wrote: > > Does this generalize to any loss function? For example I also want to > impleme

Re: [scikit-learn] custom loss function in RandomForestRegressor

2018-03-01 Thread Sebastian Raschka
Hi, Thomas, as far as I know, it's all the same and doesn't matter, and you would get the same splits, since R^2 is just a rescaled MSE. Best, Sebastian > On Mar 1, 2018, at 9:39 AM, Thomas Evangelidis wrote: > > Hi Sebastian, > > Going back to Pearson's R
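The rescaling can be verified numerically: for a fixed set of targets, R^2 = 1 - MSE / Var(y), so ranking candidate splits by either quantity gives the same order. The target and prediction values below are made up.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 4.0, 7.0])
y_pred = np.array([1.5, 1.5, 5.0, 6.0])

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# R^2 expressed through the MSE and the (population) variance of the targets.
r2_from_mse = 1.0 - mse / np.var(y_true)
print(r2, r2_from_mse)
```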

Re: [scikit-learn] custom loss function in RandomForestRegressor

2018-03-01 Thread Sebastian Raschka
Hi, Thomas, in regression trees, minimizing the variance among the target values is equivalent to minimizing the MSE between targets and predicted values. This is also called variance reduction: https://en.wikipedia.org/wiki/Decision_tree_learning#Variance_reduction Best, Sebastian > On
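A quick numeric check of this statement, assuming the node predicts the mean of its targets (the usual choice for squared-error regression trees); the target values are made up.

```python
import numpy as np

# Target values of the samples that fall into one node.
y_node = np.array([1.0, 2.0, 4.0, 7.0])

# The node's prediction is the mean of its targets.
prediction = y_node.mean()

mse = np.mean((y_node - prediction) ** 2)
variance = np.var(y_node)
print(mse, variance)
```

Since the two quantities coincide within every node, choosing the split with the largest variance reduction is the same as choosing the split with the largest MSE reduction.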

Re: [scikit-learn] KMeans cluster

2018-02-20 Thread Sebastian Raschka
lpful (https://bl.ocks.org/rpgove/raw/0060ff3b656618e9136b/9aee23cc799d154520572b30443284525dbfcac5/) Maybe also take a look at the silhouette metric for choosing K: http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html Best, Sebastian > On Feb 20, 2018, at

Re: [scikit-learn] Applying clustering to cosine distance matrix

2018-02-12 Thread Sebastian Raschka
X is your "[num_examples, num_features]" array. Best, Sebastian > On Feb 12, 2018, at 1:10 PM, prince gosavi wrote: > > I have generated a cosine distance matrix and would like to apply clustering > algorithm to the given matrix. > np.shape(distance_matrix)==(14000,14000

Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-28 Thread Sebastian Raschka
Good point Joel, and I actually forgot that you can set the norm param in the TfidfVectorizer, so one could basically do vect = TfidfVectorizer(use_idf=False, norm='l1') to have the CountVectorizer behavior but normalizing by the document length. Best, Sebastian > On Jan 28, 201
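A small sketch comparing the two routes on a made-up two-document corpus: raw counts divided by document length versus `TfidfVectorizer` with idf switched off and L1 normalization.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["apple banana apple", "banana cherry"]

# Raw counts divided by the number of tokens in each document.
counts = CountVectorizer().fit_transform(docs).toarray()
by_length = counts / counts.sum(axis=1, keepdims=True)

# Same result from TfidfVectorizer without idf weighting, L1-normalized rows.
vect = TfidfVectorizer(use_idf=False, norm="l1")
freqs = vect.fit_transform(docs).toarray()
print(freqs)
```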

Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-28 Thread Sebastian Raschka
top_words='english') > vect.fit(dataset) > transf = vect.transform(dataset) > transf / counts Best, Sebastian > On Jan 27, 2018, at 11:31 PM, Yacine MAZARI wrote: > > Hi Jake, > > Thanks for the quick reply. > > What I meant is different from the TfIdfVe

Re: [scikit-learn] a dataset suitable for logistic regression

2017-12-03 Thread Sebastian Raschka
As far as I know, no. But you could simply truncate the iris dataset for binary classification, e.g., from sklearn import datasets iris = datasets.load_iris() X = iris.data[:100] y = iris.target[:100] Best, Sebastian > On Dec 3, 2017, at 3:54 PM, Peng Yu wrote: > > Hi, iris i

Re: [scikit-learn] How to get centroids from SciPy's hierarchical agglomerative clustering?

2017-10-20 Thread Sebastian Raschka
mples from a cluster (for each feature). Best. Sebastian > On Oct 20, 2017, at 9:13 AM, Sema Atasever wrote: > > Dear scikit-learn members, > > I am using SciPy's hierarchical agglomerative clustering methods to cluster a > 1000 x 22 matri

Re: [scikit-learn] 1. Re: unclear help file for sklearn.decomposition.pca

2017-10-16 Thread Sebastian Raschka
Oh, never mind my previous email, because while the components should be the same, the projection of the data points onto those components would still be affected by centering vs non-centering I guess. Best, Sebastian > On Oct 16, 2017, at 3:25 PM, Sebastian Raschka wrote: > > Hi

Re: [scikit-learn] 1. Re: unclear help file for sklearn.decomposition.pca

2017-10-16 Thread Sebastian Raschka
ector of feature means So, if you center the data prior to computing the covariance matrix, \bar{x} is simply 0. Best, Sebastian > On Oct 16, 2017, at 2:27 PM, Ismael Lemhadri wrote: > > @Andreas Muller: > My references do not assume centering, e.g. > http://ufldl.stanford.ed
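The statement can be checked with NumPy, assuming the usual sample covariance with an n - 1 denominator (the formula itself is cut off in this snippet); the data is random and only for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(20, 3) + 5.0  # features with a non-zero mean

# General definition: subtracts the feature means internally.
cov_general = np.cov(X, rowvar=False)

# After centering, the mean vector is 0 and the formula reduces to X^T X / (n - 1).
Xc = X - X.mean(axis=0)
cov_centered = Xc.T @ Xc / (X.shape[0] - 1)

print(np.allclose(cov_general, cov_centered))
```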

Re: [scikit-learn] Combine already fitted models

2017-10-07 Thread Sebastian Raschka
I agree. I had added sth like that to the original version in mlxtend (not sure if it was before or after we ported it to sklearn). In any case though, I'd be happy to open a PR about that later today :) Best, Sebastian > On Oct 7, 2017, at 10:53 AM, Andreas Mueller wrote: > > For so

Re: [scikit-learn] Combine already fitted models

2017-10-01 Thread Sebastian Raschka
VotingClassifier was fit, so your proposed method could/should work as a workaround ;) Best, Sebastian > On Oct 1, 2017, at 7:22 PM, Rares Vernica wrote: > > > > I am looking at VotingClassifier but it seems that it is expected that > > > the estimators are fitted when Vo

Re: [scikit-learn] Combine already fitted models

2017-10-01 Thread Sebastian Raschka
m happy to add an issue or submit a PR to discuss/work on this further :) Best, Sebastian > On Oct 1, 2017, at 6:53 PM, Rares Vernica wrote: > > Hello, > > I have a distributed setup where subsets of the data is available at > different hosts. I plan to have each host fit a

Re: [scikit-learn] Commercial use of ML algorithms and scikit-learn

2017-09-30 Thread Sebastian Raschka
ribute any parts of sklearn. However, I'd still suggest to consult someone in your legal department regarding the license to make sure that you don't run into any troubles later on. Best, Sebastian > On Oct 1, 2017, at 12:58 AM, Paul Smith wrote: > > Dear Scikit-learn users

Re: [scikit-learn] anti-correlated predictions by SVR

2017-09-26 Thread Sebastian Raschka
r testing) Best, Sebastian > On Sep 26, 2017, at 12:48 PM, Thomas Evangelidis wrote: > > I have very small training sets (10-50 observations). Currently, I am working > with 16 observations for training and 25 for validation (external test set). > And I am doing Regression, not Clas

Re: [scikit-learn] batch_size for small training sets

2017-09-24 Thread Sebastian Raschka
gradient descent (i.e., batch size = n training samples). Best, Sebastian > On Sep 24, 2017, at 4:35 PM, Thomas Evangelidis wrote: > > Greetings, > > I train MLPRegressors using small datasets, usually with 10-50 observations. > The default batch_size=min(2

Re: [scikit-learn] Help needed

2017-09-14 Thread Sebastian Raschka
Honestly not sure what the core dev's preference is, but maybe just submit it as a PR and take the discussion (for a potential removal, inclusion, or move of these features to the documentation) of the additional plotting features from there. Best, Sebastian > On Sep 14, 2017, at 9:

Re: [scikit-learn] Help needed

2017-09-14 Thread Sebastian Raschka
ly removing matplotlib imports will prob. solve the issue; otherwise, I guess discussing the PR via an issue with the main devs might be the way to go. Best, Sebastian > On Sep 14, 2017, at 9:24 PM, L Ali wrote: > > Hi guys, > > I am totally new to the scikit-learn,

Re: [scikit-learn] custom loss function

2017-09-13 Thread Sebastian Raschka
do in NumPy, the mean_squared_error above can be manually defined as e.g., cost = tf.reduce_sum(tf.pow(pred - y, 2)) / (2 * n_samples) Best, Sebastian > On Sep 13, 2017, at 1:18 PM, Thomas Evangelidis wrote: > > Thanks again for the clarifications Sebastian! > > Kera

Re: [scikit-learn] custom loss function

2017-09-13 Thread Sebastian Raschka
ures? Both x and x' should be denoting training examples. The kernel matrix is symmetric (N x N). Best, Sebastian > On Sep 13, 2017, at 5:25 AM, Thomas Evangelidis wrote: > > Thanks Sebastian. Exploring Tensorflow capabilities was in my TODO list, but > now it's in

Re: [scikit-learn] custom loss function

2017-09-11 Thread Sebastian Raschka
usly, you can pick up any of the two in about an hour and have your MLPRegressor up and running so that you can then experiment with your cost function). Best, Sebastian > On Sep 11, 2017, at 6:13 PM, Thomas Evangelidis wrote: > > Greetings, > > I know this is a recurrent ques

Re: [scikit-learn] control value range of MLPRegressor predictions

2017-09-10 Thread Sebastian Raschka
, but I am not sure the MLPRegressor allows that. In that case, you probably want to implement the MLP regressor yourself (e.g., via TensorFlow or PyTorch) to have some room for experimentation with your output units. Best, Sebastian > On Sep 10, 2017, at 4:43 PM, Thomas Evangelidis wr

Re: [scikit-learn] control value range of MLPRegressor predictions

2017-09-10 Thread Sebastian Raschka
820 and -800 sounds a bit extreme if your training data is in a -5 to -9 range. Is your training data from a different population than the one you use for testing/making predictions? Or maybe it's just an extreme case of overfitting. Best, Sebastian > On Sep 10, 2017, at 3:13 PM, Thomas

Re: [scikit-learn] combining datasets from different sources

2017-09-05 Thread Sebastian Raschka
in of salt anyway) Best, Sebastian > On Sep 5, 2017, at 1:02 PM, Jason Rudy wrote: > > Thomas, > > This is sort of related to the problem I did my M.S. thesis on years ago: > cross-platform normalization of gene expression data. If you google that > term you'll

Re: [scikit-learn] Problem found when testing DecisionTreeClassifier within the source folder

2017-09-04 Thread Sebastian Raschka
recommend/prefer. Anyway, to use venv that should be available in Python already, you could do e.g., python -m venv my-sklearn-dev source my-sklearn-dev/bin/activate Best, Sebastian > On Sep 4, 2017, at 11:21 PM, Joel Nothman wrote: > > I suspect this is due to an intricacy of Cy

Re: [scikit-learn] Random Forest Regressor criterion

2017-08-30 Thread Sebastian Raschka
hesis should be accessible from https://arxiv.org/abs/1407.7502 though, and I would recommend taking a look at "3.6.3 Finding the best binary split" and page 108+ on how it's implemented (if this is still up to date with the current implementation!?). This would probably address all your

Re: [scikit-learn] imbalanced-learn 0.3.0 is chasing scikit-learn 0.19.0

2017-08-24 Thread Sebastian Raschka
Just read through the summary of the new features and browsed through the user guide. The guide is really well structured and easy to navigate, thanks for putting all the work into it. Overall, thanks for this great contribution and new version :) Best, Sebastian > On Aug 24, 2017, at 8:14

Re: [scikit-learn] scikit-learn 0.19.0 is out!

2017-08-11 Thread Sebastian Raschka
Yay, as an avid user, thanks to all the developers! This is a great release indeed -- no breaking changes (at least for my code base) and so many improvements and additions (that I need to check out in detail) :) > On Aug 12, 2017, at 1:14 AM, Gael Varoquaux > wrote: > > Hurray, thank you ev

Re: [scikit-learn] transform categorical data to numerical representation

2017-08-06 Thread Sebastian Raschka
lues that could occur, do the transformation, and then only pass the 1 transformed sample to the classifier. I guess that could be even slow though ... Best, Sebastian > On Aug 6, 2017, at 6:30 AM, Georg Heiler wrote: > > @sebastian: thanks. Indeed, I am aware of this problem. &

Re: [scikit-learn] transform categorical data to numerical representation

2017-08-05 Thread Sebastian Raschka
le) and it would just assign arbitrary integers in increasing order. Thus, if you are dealing with ordinal variables, there's no way around doing this manually; for example you could create mapping dictionaries for that (most conveniently done in pandas). Best, Sebastian > On Aug 5, 2017, at
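
A minimal sketch of the mapping-dictionary idea for an ordinal variable, in plain Python. The category names and their integer ranks are hypothetical; with pandas, the same dictionary could be passed to `Series.map`.

```python
# Hypothetical ordinal feature: the order M < L < XL is supplied by hand,
# because it cannot be inferred from the labels themselves.
size_mapping = {"M": 1, "L": 2, "XL": 3}

sizes = ["M", "XL", "L", "M"]
encoded = [size_mapping[s] for s in sizes]
print(encoded)  # [1, 3, 2, 1]
```

An automatic label encoder would instead assign integers in sorted label order (here L, M, XL), which would not respect the intended ranking.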

Re: [scikit-learn] Classifiers for dataset with categorical features

2017-07-21 Thread Sebastian Raschka
't gotten traction. > Overshadowed by GBM & random forests? > > > On Fri, Jul 21, 2017 at 11:52 AM, Sebastian Raschka > wrote: >> Just to throw some additional ideas in here. Based on a conversation with a >> colleague some time ago, I think learning c

Re: [scikit-learn] Classifiers for dataset with categorical features

2017-07-21 Thread Sebastian Raschka
ifference imho. I.e., treating ordinal variables like continuous variables probably makes more sense than one-hot encoding them. Looking forward to the PR :) > On Jul 21, 2017, at 2:52 PM, Sebastian Raschka wrote: > > Just to throw some additional ideas in here. Based on a conversation w

Re: [scikit-learn] Classifiers for dataset with categorical features

2017-07-21 Thread Sebastian Raschka
ainst SVMs, random forests and the like for categorical (genomics data). Looked promising. Best, Sebastian > On Jul 21, 2017, at 2:37 PM, Raga Markely wrote: > > Thank you, Jacob. Appreciate it. > > Regarding 'perform better', I was referring to better accuracy, preci

Re: [scikit-learn] Max f1 score for soft classifier?

2017-07-17 Thread Sebastian Raschka
publication though, where the authors modified the F1 score so that it's differentiable and can be used as a cost function for optimization/training: Maximum F1-Score Discriminative Training Criterion for Automatic Mispronunciation Detection: http://ieeexplore.ieee.org/stamp/stamp.jsp?a

[scikit-learn] Inquiry third-party package affiliation

2017-07-14 Thread Sebastian
four thousand times a month after launch. All the best, Sebastian Flennerhag

Re: [scikit-learn] Replacing the Boston Housing Prices dataset

2017-07-06 Thread Sebastian Raschka
I am sure that the scikit-learn maintainers wouldn't have anything against it if someone would update the examples/tutorials with the use of different datasets. Best, Sebastian > On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias wrote: > > For what it's worth: I'm sympath

Re: [scikit-learn] [Feature] drop_one in one hot encoder

2017-06-25 Thread Sebastian Raschka
from dropping a column, though (e.g., linear regression as a simple example). For instance, pandas' get_dummies has a "drop_first" parameter ... I think it would make sense to have such a parameter in the onehotencoder as well, e.g., for working with pipelines. Best, Sebastian
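
A rough sketch of what dropping one column from a one-hot encoding looks like, written in plain Python. The helper `one_hot` and its `drop_first` argument are hypothetical and only mimic the behavior of the `drop_first` parameter of pandas' `get_dummies` mentioned above; this is not scikit-learn's encoder.

```python
def one_hot(values, drop_first=False):
    """One-hot encode a list of labels; optionally drop the first category."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category becomes the all-zeros baseline row.
        categories = categories[1:]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical categorical feature.
colors = ["red", "green", "blue", "green"]
cols, rows = one_hot(colors, drop_first=True)
print(cols)
print(rows)
```

With all columns kept, every row sums to 1, so the columns are perfectly collinear with an intercept term; dropping one column removes that redundancy, which is the concern for models such as unregularized linear regression.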
