[scikit-learn] Feature selection with words.

2017-12-19 Thread Luigi Lomasto
Hi all. I’m working on text classification to classify Wikipedia documents. I am using a word count approach to extract features from my text, so I obtain a big vocabulary that contains all document words (train dataset) after lemmatization and stop-word removal. Now I have 7 features. I think
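The word-count approach described above can be sketched in plain Python (the toy documents and stop-word list here are made up for illustration; scikit-learn's CountVectorizer does this at scale):

```python
from collections import Counter

# Toy corpus standing in for the Wikipedia documents (illustrative only).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]
stop_words = {"the", "on"}

# Tokenize each document and drop stop words.
tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]

# One shared vocabulary over the whole (training) corpus.
vocab = sorted({w for toks in tokenized for w in toks})

# One count vector per document, aligned with the vocabulary.
vectors = [[Counter(toks)[w] for w in vocab] for toks in tokenized]

print(vocab)    # ['cat', 'chased', 'dog', 'mat', 'sat']
print(vectors)  # [[1, 0, 0, 1, 1], [1, 1, 1, 0, 0]]
```

With a real corpus the vocabulary (and hence each vector) grows very large and sparse, which is what motivates the feature-selection question.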

Re: [scikit-learn] Feature selection with words.

2017-12-19 Thread Joel Nothman
It depends what the set of classes is. Best way to find out is to try it... On 19 December 2017 at 19:36, Luigi Lomasto < l.loma...@innovationengineering.eu> wrote: > Hi all. > > I’m working for text classification to classify Wikipedia documents. I > using a word count approach to extract featur

[scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Manuel Castejón Limas
Dear all, Kudos to scikit-learn! Having said that, Pipeline is killing me by not being able to transform anything other than X. My current use case would need: - transformers able to handle both X and y, e.g. clustering on X and y concatenated; - Pipeline being able to change other params, e.g.

Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Christos Aridas
Hey Manuel, In imbalanced-learn we have an extra type of estimators, named Samplers, which are able to modify X and y, at the same time, with the use of new API methods, sample and fit_sample. Also, we have adopted a modified version of scikit-learn's Pipeline class where we allow subsequent trans
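The Sampler idea can be illustrated with a rough sketch (this is illustrative code for the concept only, not imbalanced-learn's implementation): unlike a transformer's transform, which returns only a new X, fit_sample returns a modified X and y together.

```python
class RandomUnderSamplerSketch:
    """Sketch of a sampler: drop majority-class samples until classes balance."""

    def fit_sample(self, X, y):
        # Group sample indices by class label.
        by_class = {}
        for i, label in enumerate(y):
            by_class.setdefault(label, []).append(i)
        n_min = min(len(idx) for idx in by_class.values())
        # Keep the first n_min samples of each class (deterministic here;
        # a real sampler would pick at random).
        keep = sorted(i for idx in by_class.values() for i in idx[:n_min])
        # Both X and y come back resampled -- the key difference from transform.
        return [X[i] for i in keep], [y[i] for i in keep]

X = [[0], [1], [2], [3], [4]]
y = [0, 0, 0, 1, 1]
X_res, y_res = RandomUnderSamplerSketch().fit_sample(X, y)
print(y_res)  # [0, 0, 1, 1]
```

A sampler-aware Pipeline then calls fit_sample during fit and skips the sampler at predict time, which is exactly what a plain scikit-learn Pipeline cannot express.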

Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Guillaume Lemaître
I think that you could use imbalanced-learn regarding the issue that you have with y. You should be able to wrap your clustering inside the FunctionSampler ( https://github.com/scikit-learn-contrib/imbalanced-learn/pull/342 - we are on the way to merging it) On 19 December 2017 at 13:44, Man

Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Manuel Castejón Limas
Wow, that seems promising. I'll read the imbalanced-learn code with interest. Thanks for the info! Manuel 2017-12-19 14:15 GMT+01:00 Christos Aridas : > Hey Manuel, > > In imbalanced-learn we have an extra type of estimators, named Samplers, > which are able to modify X and y, at the same time, w

Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Manuel Castejón Limas
Eager to learn! Diving on the code right now! Thanks for the tip! Manuel 2017-12-19 14:18 GMT+01:00 Guillaume Lemaître : > I think that you could you use imbalanced-learn regarding the issue that > you have with the y. > You should be able to wrap your clustering inside the FunctionSampler ( > h

[scikit-learn] Text classification of large dataset

2017-12-19 Thread Ranjana Girish
Hi all, I am doing text classification. I have around 10 million documents to be classified into around 7k categories. Below is the code I am using: # Importing the libraries import pandas as pd import nltk from nltk.corpus import stopwords from nltk.tokenize import word_tokenize from nltk.ste
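At 10 million documents, holding an explicit vocabulary in memory becomes the bottleneck. One standard workaround is the hashing trick: map each token to a fixed-size vector via a stable hash, so no vocabulary needs to be stored (this is the idea behind scikit-learn's HashingVectorizer; the sketch below uses only the standard library, with a deliberately tiny feature space):

```python
import zlib

N_FEATURES = 16  # tiny for illustration; in practice 2**18 or more

def hashed_counts(text, n_features=N_FEATURES):
    """Map each token of `text` to a fixed-size count vector via a stable hash."""
    vec = [0] * n_features
    for token in text.lower().split():
        # zlib.crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(token.encode()) % n_features] += 1
    return vec

v = hashed_counts("spark spark hadoop")
print(sum(v))  # 3 tokens counted, whatever the corpus vocabulary size
```

The price is that distinct tokens can collide in the same bucket, which is usually acceptable with a large enough n_features, and it pairs naturally with out-of-core learners trained batch by batch.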

[scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

2017-12-19 Thread Taylor, Johnmark
Hello, I am a researcher in fMRI and am using SVMs to analyze brain data. I am doing decoding between two classes, with 24 exemplars per class. I am comparing two different methods of cross-validation for my data: in one, I am training on 23 exemplars from each class, and testing on t

Re: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

2017-12-19 Thread Jacob Vanderplas
Hi JohnMark, SVMs, by design, are quite sensitive to the addition of single data points – but only if those data points happen to lie near the margin. I wrote about some of those types of details here: https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html Hope tha

Re: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

2017-12-19 Thread Luigi Lomasto
Hi, you can try using CV with a k-fold partition, so you can evaluate every training/test combination (generally 90/10 or 80/20 splits). If you get very different results across folds, you are probably overfitting. Sent from iPhone > On 19 Dec 2017, at 22:37, Jacob Vanderplas > wrote
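The k-fold partition suggested above can be sketched with the standard library alone (scikit-learn's KFold does this, with shuffling and stratified variants; the numbers below are illustrative):

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs covering k roughly equal folds."""
    # Distribute any remainder across the first few folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        test_set = set(test)
        train = [i for i in range(n_samples) if i not in test_set]
        yield train, test
        start += size

# 10 samples, 5 folds -> each test fold holds 2 samples (an 80/20 split),
# and every sample appears in exactly one test fold.
for train, test in kfold_indices(10, 5):
    print(test)
```

Averaging the score over all k folds, and looking at its spread, is what reveals whether a single lucky or unlucky split is driving the result.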

Re: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

2017-12-19 Thread Jeffrey Levesque via scikit-learn
Hi guys, I'm currently developing a web interface and a programmatic REST API for sklearn. I currently have SVM and SVR available, with some parameters like C and gamma exposed: - https://github.com/jeff1evesque/machine-learning I'm working a bit to improve the web interface at the moment. Sinc

Re: [scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

2017-12-19 Thread Gael Varoquaux
With so few data points, there is huge uncertainty in the estimation of prediction accuracy with cross-validation. This isn't a problem of the method; it is a basic limitation of the small amount of data. I've written a paper on this problem in the specific context of neuroimaging: https://ww
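The size of that uncertainty can be made concrete with a back-of-the-envelope normal approximation to the binomial standard error of an accuracy estimate (a rough sketch only, not the analysis from the paper; the 75% figure is an assumed observed accuracy for illustration):

```python
import math

def accuracy_std_error(acc, n):
    """Normal-approximation standard error of an accuracy estimate on n test samples."""
    return math.sqrt(acc * (1 - acc) / n)

# Leave-one-out over 48 samples (24 per class) with an observed 75% accuracy:
se = accuracy_std_error(0.75, 48)
print(round(2 * se, 3))  # 0.125 -> a ~95% interval of roughly +/-12.5 accuracy points
```

With intervals that wide, accuracy differences of a few points between two cross-validation schemes are well within noise.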

Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Joel Nothman
At a glance, and perhaps not knowing imbalanced-learn well enough, I have some doubts that it will provide an immediate solution for all your needs. At the end of the day, the Pipeline keeps its scope relatively tight, but it should not be so hard to implement something for your own needs if your