Re: [Scikit-learn-general] Strings as features

2014-06-30 Thread Joel Nothman
They are defined in the beta release of version 0.15.
On 30 June 2014 02:53, Abijith Kp wrote:
> In which version of sklearn, is the above mention 'make_pipeline' and
> 'make_union' defined??
>
> When I read through some example, the idea of using FeatureUnion and
> Pipelined are easy, I guess.

Re: [Scikit-learn-general] Strings as features

2014-06-29 Thread Abijith Kp
In which version of sklearn are the above-mentioned 'make_pipeline' and 'make_union' defined? When I read through some examples, the idea of using FeatureUnion and Pipeline seems easy, I guess. The former chains the features obtained from each individual estimator given as input, whereas the latter u

Re: [Scikit-learn-general] Strings as features

2014-06-22 Thread Joel Nothman
Actually, it is a little easier with `make_pipeline` and `make_union`, which weren't around at the time. I think it's a little more abstract than most people who come across this problem would be comfortable implementing. Still, it needs an example. On 22 June 2014 15:31, Andy wrote: >
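As a rough illustration of the `make_pipeline` / `make_union` approach mentioned here, the sketch below vectorizes two string fields separately and concatenates the results. The `FieldSelector` class and the sample records are hypothetical, invented for this example (the records are modeled on the dict posted earlier in the thread); only `make_pipeline`, `make_union`, and `CountVectorizer` are scikit-learn names.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union


class FieldSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: pull one key out of every record."""

    def __init__(self, field):
        self.field = field

    def fit(self, X, y=None):
        # Stateless, so there is nothing to learn.
        return self

    def transform(self, X):
        # Assumption: X is a list of dicts whose values are strings.
        return [record[self.field] for record in X]


# Made-up sample data in the shape described in the thread.
records = [
    {"b": "random1", "d": "random2 text"},
    {"b": "random3", "d": "more text"},
]

# One pipeline per field; make_union stacks their outputs side by side.
union = make_union(
    make_pipeline(FieldSelector("b"), CountVectorizer()),
    make_pipeline(FieldSelector("d"), CountVectorizer()),
)
X = union.fit_transform(records)
```

Each field gets its own vocabulary here, so the resulting matrix has one block of columns per field.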

Re: [Scikit-learn-general] Strings as features

2014-06-22 Thread Andy
Yeah that is exactly what I was thinking about. Though I would disagree that it is not simple to write and lengthy ;)
    class GetItemTransformer(TransformerMixin):
        def __init__(self, field):
            self.field = field
        # assume default fit()
        def transform(X):
            return X[fiel
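A runnable variant of this transformer might look like the following. It is a sketch rather than Andy's exact code: it assumes the input is a list of dicts (the snippet above indexes `X` directly), spells out `fit` instead of relying on a default, gives `transform` its `self` argument, and reads the field name from `self.field`. The `.get` call with an empty-string default is also a choice made here for illustration.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class GetItemTransformer(TransformerMixin, BaseEstimator):
    """Select one field from each record."""

    def __init__(self, field):
        self.field = field

    def fit(self, X, y=None):
        # Stateless transformer: fit only returns self.
        return self

    def transform(self, X):
        # Assumption: X is an iterable of dicts; missing keys become "".
        return [record.get(self.field, "") for record in X]


# Made-up records shaped like the example given in the thread.
records = [
    {"a": "3", "b": "random1", "c": "", "d": "random2 text"},
    {"a": "7", "b": "random3", "d": "more text"},
]
selected_d = GetItemTransformer("d").fit_transform(records)
selected_c = GetItemTransformer("c").fit_transform(records)
```

Because the class defines `fit` and `transform`, `TransformerMixin` supplies `fit_transform`, which is what lets it sit at the front of a pipeline.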

Re: [Scikit-learn-general] Strings as features

2014-06-21 Thread Joel Nothman
It is possible to do what you want, but it is not simple to write. Scikit-learn could definitely benefit from an example showing this sort of thing, or from a better API to help the user do it, as suggested at https://github.com/scikit-learn/scikit-learn/issues/2034. There you will find a lengthy c

Re: [Scikit-learn-general] Strings as features

2014-06-21 Thread Abijith Kp
What would be the advantage of using a shared vocabulary for CountVectorizer? When I read about FeatureUnion, I understood that each transformer in the given list processes the entire data set. Could we use it to selectively process different features? Or is my understandin

Re: [Scikit-learn-general] Strings as features

2014-06-21 Thread Andy
Yes, you can use CountVectorizer. Do you want the different features to share the same vocabulary? To use the CountVectorizer, you probably have to either get all the values (for a shared vocabulary) or learn one CountVectorizer per key (you could use FeatureUnion for that). So there is a litt
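The shared-vocabulary option could be sketched as follows: fit a single `CountVectorizer` on the values of every key, then encode each key against that same vocabulary. The records, the `keys` list, and the decision to stack the per-key blocks with `scipy.sparse.hstack` are assumptions made for this example, not something stated in the thread.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample data; every value is assumed to be a string.
records = [
    {"b": "random1", "d": "random2 text"},
    {"b": "random3", "d": "more text"},
]
keys = ["b", "d"]

# Shared vocabulary: fit once on all values of all keys.
shared = CountVectorizer()
shared.fit([record[key] for record in records for key in keys])

# Encode each key separately, but against the same set of columns.
blocks = [shared.transform([record[key] for record in records]) for key in keys]
X_shared = hstack(blocks).tocsr()
```

With a shared vocabulary, a token such as "text" maps to the same column index within every key's block, which the one-vectorizer-per-key approach does not guarantee.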

Re: [Scikit-learn-general] Strings as features

2014-06-21 Thread Abijith Kp
Hi, Initially, one of my feature lists looks like: {"a":"3", "b":"random1", "c":"", "d":"random2 text"}. The random text contains names of people, email ids, some description, numbers and so on. When I used DictVectorizer, I could not get an accurate clustering. I wanted to know if I could get a

Re: [Scikit-learn-general] Strings as features

2014-06-21 Thread Andy
Hi Abijith. It depends on how you want to interpret the strings. If they are texts and you want to interpret them based on their content, Brian's suggestion is the right one. If you want to consider each possible string as a distinct feature, the OneHotEncoder would be the right choice. Could
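The two interpretations can be contrasted with a small sketch. It assumes a scikit-learn release in which `OneHotEncoder` accepts string categories directly (0.20 or later, which postdates this thread); the sample values are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Interpretation 1: every distinct string is its own category.
# Assumption: OneHotEncoder accepts strings (scikit-learn >= 0.20).
labels = [["random1"], ["random3"], ["random1"]]
encoder = OneHotEncoder()
onehot = encoder.fit_transform(labels)

# Interpretation 2: strings are texts, represented by the words they contain.
texts = ["random2 text", "more text", "random2 text"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)
```

In the first case two strings are either identical or unrelated; in the second, strings that share words end up with overlapping features.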

Re: [Scikit-learn-general] Strings as features

2014-06-20 Thread Brian Wingenroth
Hi Abijith,
This should get you started: http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html
Brian
On 6/20/14, 12:05 PM, Abijith Kp wrote:
> Can anyone help me with the problem of dealing with feature which are
> both strings of varying length(say from 0 to 100-150 c

[Scikit-learn-general] Strings as features

2014-06-20 Thread Abijith Kp
Can anyone help me with the problem of dealing with features that are both strings of varying length (say from 0 to 100-150 characters) and numbers? What are the most widely used techniques in this kind of situation? And can it be solved using only scikit-learn? PS: Initially I have to conver