Re: [scikit-learn] Retracting model from the 'blackbox' SVM (Sebastian Raschka)
Hi Sebastian,

If you are looking to reduce the feature space for your model, I suggest you look at the scikit-learn page on doing just that: http://scikit-learn.org/stable/modules/feature_selection.html

David

On 2018-05-04 12:00 PM, scikit-learn-requ...@python.org wrote:

Message: 1
Date: Fri, 4 May 2018 05:51:26 -0400
From: Sebastian Raschka
To: Scikit-learn mailing list
Subject: Re: [scikit-learn] Retracting model from the 'blackbox' SVM

Dear Wouter,

for the SVM, scikit-learn wraps LIBSVM and LIBLINEAR. I think the scikit-learn class SVC uses LIBSVM for every kernel. Since you are using the linear kernel, you could use the more efficient LinearSVC scikit-learn class to get similar results. I suspect this, in turn, is easier to handle in terms of your question, "Is there a way to get the underlying formula for the model out of scikit-learn instead of having it as a 'blackbox' in my svm function?" More specifically, LinearSVC uses the _fit_liblinear code available here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/base.py More info on the LIBLINEAR library it uses can be found here: https://www.csie.ntu.edu.tw/~cjlin/liblinear/ (they have links to technical reports and implementation details there).

Best,
Sebastian

On May 4, 2018, at 5:12 AM, Wouter Verduin wrote:

Dear developers of Scikit,

I am working on a scientific paper on a prediction model for complications after major abdominal resections. I have been using scikit-learn to create that model and got good results (score of 0.94). This makes us want to see what the model built by scikit-learn actually looks like. As of now we have 100 input variables, but logically these aren't all equally useful, and we want to reduce this number to about 20 and see what the effect on the score is.

My question: is there a way to get the underlying formula for the model out of scikit-learn instead of having it as a 'blackbox' in my svm function?

At this moment I am predicting a dichotomous variable from 100 variables (continuous, ordinal, and binary). My code:

import numpy as np
from numpy import *
import pandas as pd
from sklearn import tree, svm, linear_model, metrics, preprocessing
import datetime
from sklearn.model_selection import KFold, cross_val_score, ShuffleSplit, GridSearchCV
from time import gmtime, strftime

# open and prepare the database
file = "/home/wouter/scikit/DB_SCIKIT.csv"
DB = pd.read_csv(file, sep=";", header=0, decimal=',').as_matrix()
DBT = DB
print "Vorm van de DB: ", DB.shape
target = []
for i in range(len(DB[:,-1])):
    target.append(DB[i,-1])
DB = delete(DB, s_[-1], 1)  # remove the last (outcome) column

AantalOutcome = target.count(1)
print "Aantal outcome:", AantalOutcome
print "Aantal patienten:", len(target)

A = DB
b = target
print len(DBT)

svc = svm.SVC(kernel='linear', cache_size=500, probability=True)
indices = np.random.permutation(len(DBT))
rs = ShuffleSplit(n_splits=5, test_size=.15, random_state=None)
scores = cross_val_score(svc, A, b, cv=rs)
A = ("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print A

X_train = DBT[indices[:-302]]
y_train = []
for i in range(len(X_train[:,-1])):
    y_train.append(X_train[i,-1])
X_train = delete(X_train, s_[-1], 1)  # remove the last (outcome) column

X_test = DBT[indices[-302:]]
y_test = []
for i in range(len(X_test[:,-1])):
    y_test.append(X_test[i,-1])
X_test = delete(X_test, s_[-1], 1)  # remove the last (outcome) column

model = svc.fit(X_train, y_train)
print model
uitkomst = model.score(X_test, y_test)
print uitkomst
voorspel = model.predict(X_test)
print voorspel

And output:

Vorm van de DB: (2011, 101)
Aantal outcome: 128
Aantal patienten: 2011
2011
Accuracy: 0.94 (+/- 0.01)
SVC(C=1.0, cache_size=500, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001, verbose=False)
0.927152317881
[0. 0. 0. 0. 0. 0. 0. 0. ...
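Sebastian's pointer can be made concrete. For a linear SVM, the fitted model is just a weight vector plus an intercept, both exposed on the estimator, so the "underlying formula" is f(x) = w . x + b. Below is a minimal sketch (not from the thread; synthetic data stands in for the CSV above) that reads the weights out of a fitted LinearSVC and then, in the spirit of the feature-selection page David links, uses recursive feature elimination to keep the 20 strongest features:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Synthetic stand-in for the 2011 x 100 data set in the original post.
X, y = make_classification(n_samples=2011, n_features=100, random_state=0)

svc = LinearSVC(C=1.0).fit(X, y)

# The "formula": one weight per input feature, plus an intercept.
w = svc.coef_.ravel()
b = svc.intercept_[0]
print(X[0] @ w + b)  # decision value; positive means class 1, the same rule svc.predict applies

# Recursive feature elimination refits the model while dropping the
# weakest features until only 20 remain.
rfe = RFE(LinearSVC(C=1.0), n_features_to_select=20).fit(X, y)
print(np.flatnonzero(rfe.support_))  # indices of the 20 retained columns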
[scikit-learn] pipeline for modifying target and number of samples
Hi,

I posted a while back about this, and am reposting now that I have made progress on the topic. As you are probably aware, the sklearn Pipeline only supports transformers that act on X, and the number of samples must stay the same. I work with time series, where the learning pipeline relies on transformations like resampling and segmentation that change both the target and the number of samples in the data set. To address this, I created an sklearn-compatible pipeline that handles transformers altering X, y, and sample_weight together. It can undergo model selection using the sklearn tools, and it integrates with all the sklearn transformers and estimators. It also has some new options for setting hyperparameters with callables and in reference to other parameters. The implementation is in my time series package seglearn: https://github.com/dmbee/seglearn

Best,
David Burns
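To make the problem concrete, here is a hypothetical sketch (illustrative names only, not seglearn's actual API) of the transformer contract such a pipeline has to support: transform returns a modified (X, y, sample_weight) triple, and the number of output rows need not match the number of input rows, which is exactly what the stock sklearn Pipeline cannot express.

import numpy as np

class WindowTransformer:
    """Splits each series in X into fixed-width windows, replicating y per window."""
    def __init__(self, width=10):
        self.width = width

    def fit(self, X, y=None):
        return self  # stateless

    def transform(self, X, y, sample_weight=None):
        # sample_weight, if given, would need the same per-window replication as y
        Xt, yt = [], []
        for series, target in zip(X, y):
            n = len(series) // self.width  # whole windows only
            if n == 0:
                continue
            Xt.extend(np.split(series[:n * self.width], n))
            yt.extend([target] * n)  # one label per window
        return np.asarray(Xt), np.asarray(yt), sample_weight

Because the row count changes inside transform, downstream cross-validation has to be aware of the new sample alignment, which is what the pipeline in seglearn handles.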
[scikit-learn] seglearn: package for time series and sequence learning
I implemented a meta-estimator and transformers for time series / sequence learning with sliding-window segmentation. It can be used for classification, regression, or forecasting; it supports multivariate time series / sequences as well as contextual (time-independent) data, and it can learn either time series or contextual targets. It is (mostly) compatible with the sklearn model evaluation and selection tools, despite changing the number of samples and the target vector mid-pipeline (during segmentation). I've created a pull request on related_projects.rst, but thought I would share it here for those of you interested in this area. https://github.com/dmbee/seglearn

Cheers,
David Burns
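For readers new to the technique, here is a small numpy illustration (mine, not seglearn's code) of overlapping sliding-window segmentation. With window width w and step s = w - overlap, one series of length n yields (n - w) // s + 1 segments, which is why the number of samples changes mid-pipeline:

import numpy as np

def segment(series, width=100, overlap=50):
    """Segment a 1-D series into overlapping windows of the given width."""
    step = width - overlap
    n_segments = (len(series) - width) // step + 1
    return np.stack([series[i * step : i * step + width]
                     for i in range(n_segments)])

series = np.arange(1000)
windows = segment(series)
print(windows.shape)  # (19, 100): one series becomes 19 overlapping samples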
Re: [scikit-learn] New Transformer (Guillaume Lemaître)
Thanks everyone for your suggestions. I will have a look at PipeGraph, which might be a suitable option for us, as Guillaume suggested. If it works out, I will share it.

Thanks,
David

On 02/28/2018 08:29 AM, scikit-learn-requ...@python.org wrote:

Message: 1
Date: Tue, 27 Feb 2018 12:02:27 -0500
From: David Burns <david.mo.bu...@gmail.com>
To: scikit-learn@python.org
Subject: [scikit-learn] New Transformer

First post on this mailing list.

I have been working with time series data for a project, and thought I could contribute a new transformer to segment time series data using a sliding window with variable overlap. I have attached a demonstration of how this would fit in the existing framework. The only challenge for me here is that the transformer needs to transform both the X and y variables in order to perform the segmentation, and I am not sure from the documentation how to implement this in the framework. Overlapping segments are a great way to boost performance for time series classifiers, so this may be a worthwhile contribution for some in this area of ML. Ultimately, model_selection.TimeSeriesSplit would need to be modified to support overlapping segments, or a new class created to enable validation for this. Please let me know if this would be a worthwhile contribution, and if so, how to go about transforming the target vector y in the framework / pipeline? Thanks!

David Burns

(Attachment: TimeSeriesSegment.py, http://mail.python.org/pipermail/scikit-learn/attachments/20180227/143ced86/attachment-0001.py)

Message: 2
Date: Tue, 27 Feb 2018 19:42:52 +0100
From: Guillaume Lemaître <g.lemaitr...@gmail.com>
To: Scikit-learn mailing list <scikit-learn@python.org>
Subject: Re: [scikit-learn] New Transformer

Transforming y is a big deal :) You can refer to https://github.com/scikit-learn/enhancement_proposals/pull/2 and the associated issues/PRs to see what is going on. This is probably an additional use case to think about when designing estimators that modify y.

Regarding the pipeline, I assume that your strategy would be to resample at fit and do nothing at predict, isn't it?

NB: you could actually implement this sampling in a FunctionSampler from imbalanced-learn: http://contrib.scikit-learn.org/imbalanced-learn/dev/generated/imblearn.FunctionSampler.html#imblearn.FunctionSampler and then use the imblearn pipeline, which applies the transform at fit time but not at predict time.
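For context, here is a hedged sketch of the FunctionSampler route Guillaume describes, based on the imbalanced-learn dev documentation linked above; the resampling function is a toy placeholder where a real segmentation step would go:

import numpy as np
from imblearn import FunctionSampler
from imblearn.pipeline import Pipeline
from sklearn.svm import SVC

def resample(X, y):
    # Placeholder: any function returning a modified (X, y) pair,
    # e.g. sliding-window segmentation of a time series.
    keep = np.arange(0, len(X), 2)  # toy: keep every other sample
    return X[keep], y[keep]

pipe = Pipeline([
    ('sampler', FunctionSampler(func=resample)),
    ('clf', SVC()),
])
# pipe.fit(X, y) applies the resampling; pipe.predict(X_new) does not.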
Re: [scikit-learn] Inclusion of an LSTM Classifier
There is an sklearn wrapper for Keras models in the Keras library; that is an easy way to use an LSTM in sklearn. Also, the sklearn estimator API is pretty easy to figure out if you want to roll your own wrapper for any model.
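A minimal sketch of that route (Keras 2.x wrapper API as it stood at the time; the shapes and hyperparameters here are illustrative, not a recommendation):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score

def build_lstm():
    # Binary classifier over sequences of 50 timesteps with 4 features each.
    model = Sequential()
    model.add(LSTM(32, input_shape=(50, 4)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam')
    return model

clf = KerasClassifier(build_fn=build_lstm, epochs=5, batch_size=32, verbose=0)
X = np.random.rand(200, 50, 4)    # (samples, timesteps, features)
y = np.random.randint(0, 2, 200)
print(cross_val_score(clf, X, y, cv=3))  # the wrapper plugs into sklearn tools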