Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Guillaume Lemaître
I think that you could use imbalanced-learn for the issue that you have with y. You should be able to wrap your clustering inside the FunctionSampler (https://github.com/scikit-learn-contrib/imbalanced-learn/pull/342 - we are on the way to merging it). On 19 December 2017 at 13:44,

Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Christos Aridas
Hey Manuel, In imbalanced-learn we have an extra type of estimator, called Samplers, which can modify X and y at the same time through the new API methods sample and fit_sample. Also, we have adopted a modified version of scikit-learn's Pipeline class where we allow subsequent
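
The idea described above, a pipeline whose intermediate steps may change y as well as X, can be sketched in plain Python. This is only an illustration of the concept, not imbalanced-learn's implementation: `SamplerPipeline`, `DropLabelSampler`, `DivideByMax`, and `MajorityClassifier` are hypothetical names, and only the method name `fit_sample` is taken from the message. The sketch also assumes that samplers run during fitting only and are skipped at prediction time.

```python
class DivideByMax:
    """Ordinary transformer (hypothetical): changes X only."""

    def fit(self, X, y=None):
        # Remember the largest absolute value seen during fit.
        self.max_ = max(abs(v) for row in X for v in row) or 1.0
        return self

    def transform(self, X):
        return [[v / self.max_ for v in row] for row in X]


class DropLabelSampler:
    """Sampler (hypothetical): removes rows, so X and y must change together."""

    def __init__(self, label):
        self.label = label

    def fit_sample(self, X, y):
        kept = [(row, target) for row, target in zip(X, y) if target != self.label]
        return [row for row, _ in kept], [target for _, target in kept]


class MajorityClassifier:
    """Toy final estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class SamplerPipeline:
    """Pipeline sketch that accepts both transformers and samplers."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for step in self.steps[:-1]:
            if hasattr(step, "fit_sample"):
                X, y = step.fit_sample(X, y)     # sampler: X and y both change
            else:
                X = step.fit(X, y).transform(X)  # transformer: only X changes
        self.steps[-1].fit(X, y)
        return self

    def predict(self, X):
        for step in self.steps[:-1]:
            if not hasattr(step, "fit_sample"):  # samplers are skipped here
                X = step.transform(X)
        return self.steps[-1].predict(X)


X = [[1.0], [2.0], [3.0], [100.0]]
y = ["a", "a", "b", "noise"]
scaler = DivideByMax()
pipe = SamplerPipeline([DropLabelSampler("noise"), scaler, MajorityClassifier()])
pipe.fit(X, y)
print(pipe.predict([[6.0]]))
```

Because the sampler runs first, the scaler never sees the row labelled "noise", which a step restricted to transforming X alone could not arrange.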

Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Manuel Castejón Limas
Eager to learn! Diving into the code right now! Thanks for the tip! Manuel 2017-12-19 14:18 GMT+01:00 Guillaume Lemaître : > I think that you could use imbalanced-learn regarding the issue that > you have with the y. > You should be able to wrap your clustering inside

Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Manuel Castejón Limas
Wow, that seems promising. I'll read the imbalanced-learn code with interest. Thanks for the info! Manuel 2017-12-19 14:15 GMT+01:00 Christos Aridas : > Hey Manuel, > > In imbalanced-learn we have an extra type of estimators, named Samplers, > which are able to modify X and y,

[scikit-learn] Text classification of large dataset

2017-12-19 Thread Ranjana Girish
Hi all, I am doing text classification. I have around 10 million records to classify into around 7k categories. Below is the code I am using:

    # Importing the libraries
    import pandas as pd
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from

Re: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

2017-12-19 Thread Jeffrey Levesque via scikit-learn
Hi guys, I'm currently developing a web interface and a programmatic REST API for sklearn. I currently have SVM and SVR available, with some parameters such as C and gamma exposed (https://github.com/jeff1evesque/machine-learning). I'm working a bit on improving the web interface at the moment.

Re: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

2017-12-19 Thread Gael Varoquaux
With so few data points, there is a huge uncertainty in the estimation of the prediction accuracy with cross-validation. This isn't a problem of the method, it is a basic limitation of the small amount of data. I've written a paper on this problem in the specific context of neuroimaging:
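
A rough back-of-the-envelope sketch of this uncertainty uses the binomial standard error of an accuracy estimate. The numbers are assumptions for illustration: 48 test predictions corresponds to the two classes of 24 exemplars in the original question, the true accuracy is taken to be at chance (0.5), and the predictions are treated as independent, which cross-validation folds only approximate.

```python
import math


def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy estimated from n_test predictions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


# Compare the sample size from the question with larger hypothetical ones.
for n_test in (48, 480, 4800):
    se = accuracy_standard_error(0.5, n_test)
    print(f"n_test={n_test:5d}  standard error ~ {se:.3f}  "
          f"(about +/- {1.96 * se:.3f} at 95%)")
```

Under these assumptions, the interval around a measured accuracy shrinks only with the square root of the number of test predictions, so a few dozen samples leave it wide.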

[scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

2017-12-19 Thread Taylor, Johnmark
Hello, I am a researcher in fMRI and am using SVMs to analyze brain data. I am doing decoding between two classes, each of which has 24 exemplars. I am comparing two different methods of cross-validation for my data: in one, I am training on 23 exemplars from each class, and testing on

Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Joel Nothman
At a glance, and perhaps not knowing imbalanced-learn well enough, I have some doubts that it will provide an immediate solution for all your needs. At the end of the day, the Pipeline keeps its scope relatively tight, but it should not be so hard to implement something for your own needs if your

Re: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

2017-12-19 Thread Jacob Vanderplas
Hi JohnMark, SVMs, by design, are quite sensitive to the addition of single data points – but only if those data points happen to lie near the margin. I wrote about some of those types of details here: https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html Hope
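
The point about the margin can be illustrated with a deliberately simplified case: separable one-dimensional data and a hard margin, where the maximum-margin boundary is the midpoint between the closest points of the two classes. This is a sketch under those assumptions, not scikit-learn's SVC (which uses a soft margin controlled by C), and `hard_margin_boundary_1d` is a hypothetical helper.

```python
def hard_margin_boundary_1d(negatives, positives):
    """Maximum-margin threshold for separable 1-D data (negatives below positives)."""
    closest_neg = max(negatives)   # support vector of the negative class
    closest_pos = min(positives)   # support vector of the positive class
    if closest_neg >= closest_pos:
        raise ValueError("classes are not separable in this order")
    return (closest_neg + closest_pos) / 2.0


negatives = [0.0, 1.0, 2.0]
positives = [6.0, 7.0, 8.0]

baseline = hard_margin_boundary_1d(negatives, positives)
# A new point far from the margin leaves the boundary where it was.
far_point = hard_margin_boundary_1d(negatives, positives + [20.0])
# A new point near the margin becomes a support vector and moves the boundary.
near_point = hard_margin_boundary_1d(negatives, positives + [3.0])

print(baseline, far_point, near_point)
```

Only the points closest to the other class determine the boundary here, which is why a single added point matters a great deal in one position and not at all in another.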

[scikit-learn] Feature selection with words.

2017-12-19 Thread Luigi Lomasto
Hi all. I’m working on text classification to classify Wikipedia documents. I am using a word count approach to extract features from my text, so I obtain a big vocabulary that contains all the words in the documents (train dataset) after lemmatization and stop-word removal. Now I have 7 features. I
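
For readers unfamiliar with the word count approach mentioned above, a minimal standard-library sketch follows. It is not the poster's code: the stop-word list and documents are made up, lemmatization is omitted, and `tokenize`, `build_vocabulary`, and `count_vector` are hypothetical helper names.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a real resource).
STOP_WORDS = {"the", "a", "of", "is", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def build_vocabulary(documents):
    """Map every remaining word in the training documents to a column index."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def count_vector(text, vocabulary):
    """One feature per vocabulary word: how often it occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in vocabulary]


train_docs = ["the cat and the dog", "a dog is a dog"]
vocabulary = build_vocabulary(train_docs)
print(vocabulary)
print([count_vector(doc, vocabulary) for doc in train_docs])
```

Each distinct word becomes one feature, so the number of features grows with the size of the training vocabulary.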

Re: [scikit-learn] Feature selection with words.

2017-12-19 Thread Joel Nothman
It depends on what the set of classes is. The best way to find out is to try it... On 19 December 2017 at 19:36, Luigi Lomasto < l.loma...@innovationengineering.eu> wrote: > Hi all. > > I’m working on text classification to classify Wikipedia documents. I > am using a word count approach to extract