Re: [Scikit-learn-general] using string features for classification

2012-01-03 Thread Vlad Niculae
On Jan 3, 2012, at 17:02, Olivier Grisel wrote: > 2012/1/3 Lars Buitinck: >> >>> We probably need to extend the sklearn.feature_extraction.text package >>> to make it more user friendly to work with pure categorical >>> feature occurrences: >> >> I'm not sure this belongs in feature_ext

Re: [Scikit-learn-general] using string features for classification

2012-01-03 Thread Olivier Grisel
2012/1/3 Lars Buitinck: > >> We probably need to extend the sklearn.feature_extraction.text package >> to make it more user friendly to work with pure categorical >> feature occurrences: > > I'm not sure this belongs in feature_extraction.text; it's much more > broadly applicable. > > If you

Re: [Scikit-learn-general] using string features for classification

2012-01-03 Thread Lars Buitinck
2011/12/30 Olivier Grisel: > Alright, then the name of this kind of feature is "categorical > features" in machine learning jargon: the string is used as an > identifier and the ordered sequence of letters is not exploited by the > model. By contrast, "string features" means something very spe

Re: [Scikit-learn-general] using string features for classification

2012-01-03 Thread Lars Buitinck
2011/12/30 Bronco Zaurus : > One more way would be computing classification probability for each value > and plugging the resulting number back into data. For example, let's say > there are 10 samples with BMW in the training set, and 3 of them are 1 > (true), 7 are 0 (false). So the maximum likeli
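The per-value probability idea quoted above can be sketched in plain Python. This is only an illustration, not code from the thread: the function name `target_rate_by_category` is hypothetical, and the 'audi' rows are made-up data added so that there is more than one category. The 'bmw' rows follow the numbers in the message (10 samples, 3 positive).

```python
from collections import defaultdict


def target_rate_by_category(categories, labels):
    """Return the fraction of positive labels observed for each category."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for category, label in zip(categories, labels):
        counts[category] += 1
        positives[category] += label
    return {category: positives[category] / counts[category] for category in counts}


# 10 'bmw' samples with 3 positives, as in the message; the 'audi' rows are
# hypothetical and only there to give the mapping a second category.
makes = ["bmw"] * 10 + ["audi"] * 4
labels = [1] * 3 + [0] * 7 + [1, 1, 0, 0]

rates = target_rate_by_category(makes, labels)

# Replace each string by the estimated probability for that value.
encoded = [rates[make] for make in makes]
```

Because the estimate is computed from the training labels themselves, it is probably safer to fit the mapping on one part of the data and apply it to another; rare values in particular get very noisy estimates.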

Re: [Scikit-learn-general] using string features for classification

2011-12-30 Thread Olivier Grisel
In the previous mail variable `X` should be replaced by `data`. -- Olivier

Re: [Scikit-learn-general] using string features for classification

2011-12-30 Thread Olivier Grisel
2011/12/30 Bronco Zaurus: > Thank you for all the answers. Yes, I'm not dealing with arbitrary strings, > just a set of possible values, so the binary representation seems OK. Alright, then the name of this kind of feature is "categorical features" in machine learning jargon: the string is used
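For readers who want to see what the binary representation of a categorical feature looks like, here is a minimal sketch in plain Python. The helper `one_hot_encode` is a hypothetical name used for illustration, not a scikit-learn function, and the sample values are the car makes from the original question.

```python
def one_hot_encode(values):
    """Map each categorical string to a binary indicator vector.

    Returns the sorted list of categories and one row per input value,
    with a single 1 in the column of that value's category.
    """
    categories = sorted(set(values))
    column = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1
        rows.append(row)
    return categories, rows


cars = ["audi", "bmw", "chevrolet", "bmw"]
categories, encoded = one_hot_encode(cars)

for car, row in zip(cars, encoded):
    print(car, row)
```

The string is used purely as an identifier here: each distinct value gets its own column and the letters themselves are never inspected. Current scikit-learn releases ship `sklearn.preprocessing.OneHotEncoder` and `sklearn.feature_extraction.DictVectorizer` for this purpose, which are likely a better choice than a hand-rolled helper.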

Re: [Scikit-learn-general] using string features for classification

2011-12-30 Thread Bronco Zaurus
Thank you for all the answers. Yes, I'm not dealing with arbitrary strings, just a set of possible values, so the binary representation seems OK. One more way would be computing classification probability for each value and plugging the resulting number back into data. For example, let's say there

Re: [Scikit-learn-general] using string features for classification

2011-12-29 Thread David Warde-Farley
On 2011-12-29, at 3:18 PM, Bronco Zaurus wrote: > Hello, > > I have a beginner's question: how do you classify using non-numerical > features, concretely strings (for example: 'audi', 'bmw', > 'chevrolet')? > > One way that comes to mind is to give each value a number. Is there a > more s

Re: [Scikit-learn-general] using string features for classification

2011-12-29 Thread xinfan meng
There is actually work on embedding word sense into vector space, "Word representations: A simple and general method for semi-supervised learning" for example. On Fri, Dec 30, 2011 at 6:26 AM, Robert Layton wrote: > On 30 December 2011 08:57, Gael Varoquaux > wrote: > >> On Thu, Dec 29, 2011 a

Re: [Scikit-learn-general] using string features for classification

2011-12-29 Thread Robert Layton
On 30 December 2011 08:57, Gael Varoquaux wrote: > On Thu, Dec 29, 2011 at 09:18:38PM +0100, Bronco Zaurus wrote: > >I have a beginner's question: how do you classify using non-numerical > >features, concretely strings (for example: 'audi', 'bmw', > >'chevrolet')? > > You are in troubl

Re: [Scikit-learn-general] using string features for classification

2011-12-29 Thread Gael Varoquaux
On Thu, Dec 29, 2011 at 09:18:38PM +0100, Bronco Zaurus wrote: >I have a beginner's question: how do you classify using non-numerical >features, concretely strings (for example: 'audi', 'bmw', >'chevrolet')? You are in trouble as your input space is not metric: what's .5*('audi' + 'che
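The point about the input space not being metric can be illustrated with a small plain-Python sketch. The integer codes below are an arbitrary assumption made for the example; nothing in the thread prescribes them.

```python
from itertools import combinations

makes = ["audi", "bmw", "chevrolet"]

# Arbitrary integer codes (an assumption for illustration): audi=0, bmw=1, chevrolet=2.
codes = {make: i for i, make in enumerate(makes)}

# Averaging the codes of 'audi' and 'chevrolet' lands exactly on the code for
# 'bmw', which is an artifact of the numbering, not a property of the cars.
midpoint = 0.5 * (codes["audi"] + codes["chevrolet"])
decoded = makes[int(midpoint)]

# Indicator (one-hot) vectors: one column per make.
one_hot = {
    make: [1 if i == j else 0 for j in range(len(makes))]
    for i, make in enumerate(makes)
}


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# With indicator vectors every pair of distinct makes is equally far apart,
# so no ordering is imposed on the values.
distances = {
    (a, b): squared_distance(one_hot[a], one_hot[b])
    for a, b in combinations(makes, 2)
}
```

Under the integer coding a model that does arithmetic on the feature would treat 'bmw' as lying between the other two makes; the indicator coding avoids that at the cost of one column per value.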