On 3 June 2013 08:43, Andreas Mueller wrote:
> On 06/03/2013 05:19 AM, Joel Nothman wrote:
>>
>> However, in these last two cases, the number of possible splits at a
>> single node is linear in the number of categories. Selecting an
>> arbitrary partition allows exponentially many splits with resp
Our decision tree implementation only supports numerical splits, i.e.
tests of the form val < threshold.
Categorical features need to be encoded properly. I recommend one-hot
encoding for features with small cardinality (e.g. < 50) and ordinal
encoding (simply assign each category an integer value) for fe
On 06/03/2013 09:15 AM, Peter Prettenhofer wrote:
> Our decision tree implementation only supports numerical splits, i.e.
> tests of the form val < threshold.
>
> Categorical features need to be encoded properly. I recommend one-hot
> encoding for features with small cardinality (e.g. < 50) and ordinal
2013/6/2 Harold Nguyen :
> http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
> Does TfidfVectorizer take a sequence of filenames, where each file is just a
> plain text file?
That depends on the input parameter (the first in the list). In the
example, I
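
For reference, a minimal sketch of the filename case asked about above. It
assumes a scikit-learn version in which TfidfVectorizer accepts
input="filename"; the helper name tfidf_from_files and the two sample
documents are made up for illustration and are not from this thread.

```python
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_from_files(paths):
    """Fit a TfidfVectorizer on plain-text files given by path (hypothetical helper)."""
    # input="filename" tells the vectorizer that each item is a path to read,
    # rather than the raw document text itself (the default, input="content").
    vectorizer = TfidfVectorizer(input="filename")
    matrix = vectorizer.fit_transform(paths)
    return vectorizer, matrix


# Write two tiny illustrative documents to a temporary directory.
tmpdir = tempfile.mkdtemp()
texts = ["the cat sat on the mat", "the dog chased the cat"]
paths = []
for i, text in enumerate(texts):
    path = os.path.join(tmpdir, "doc%d.txt" % i)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    paths.append(path)

vectorizer, matrix = tfidf_from_files(paths)
print(matrix.shape)  # one row per file, one column per distinct token
```

With the default input="content", the same list of paths would be
tokenized as if the path strings were the documents, which is rarely what
is wanted.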
2013/6/3 Andreas Mueller :
> I named the variable, I think, and it is a bad name :-(
> Should we rename it?
>
> I think giving a count makes more sense than giving a frequency: you want to
> exclude outliers that appear only once or twice for example.
I actually hadn't seen this reply. It's not a
On 06/03/2013 04:09 PM, Lars Buitinck wrote:
> 2013/6/3 Andreas Mueller :
>> I named the variable, I think, and it is a bad name :-(
>> Should we rename it?
>>
>> I think giving a count makes more sense than giving a frequency: you want to
>> exclude outliers that appear only once or twice for exam
On Tue, Jun 4, 2013 at 12:14 AM, Andreas Mueller
wrote:
> On 06/03/2013 04:09 PM, Lars Buitinck wrote:
> > 2013/6/3 Andreas Mueller :
> >> I named the variable, I think, and it is a bad name :-(
> >> Should we rename it?
> >>
> >> I think giving a count makes more sense than giving a frequency: yo
Many thanks to all for your help and detailed answers; I really appreciate it.
So I wanted to test the discussion's takeaway, namely what Peter
suggested: one-hot encode the categorical features with small
cardinality, and leave the others in their ordinal form.
So from the same dataset I mentio
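
The mixed encoding described above can be sketched roughly as follows. This
is only an illustration under stated assumptions: it relies on OneHotEncoder
and OrdinalEncoder from sklearn.preprocessing (available in recent
scikit-learn releases), and the function name encode_categoricals, the
max_onehot_cardinality argument and the toy data are hypothetical, not taken
from the thread. The cutoff of 50 mirrors the "e.g. < 50" suggestion.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder


def encode_categoricals(X, max_onehot_cardinality=50):
    """Encode each categorical column of X (hypothetical helper).

    Columns with fewer than max_onehot_cardinality distinct values are
    one-hot encoded; all other columns get one integer code per category.
    """
    X = np.asarray(X, dtype=object)
    blocks = []
    for j in range(X.shape[1]):
        column = X[:, [j]]  # keep 2-D shape, as the encoders expect
        n_categories = len(set(column.ravel()))
        if n_categories < max_onehot_cardinality:
            # One indicator column per category.
            blocks.append(OneHotEncoder().fit_transform(column).toarray())
        else:
            # A single column of integer codes (categories in sorted order).
            blocks.append(OrdinalEncoder().fit_transform(column))
    return np.hstack(blocks)


# Toy data: column 0 has 3 categories, column 1 has 4.
X = [["red", "a"], ["green", "b"], ["blue", "c"], ["red", "d"]]
# A cutoff of 4 is used here only so that both branches are exercised.
encoded = encode_categoricals(X, max_onehot_cardinality=4)
print(encoded.shape)
```

The resulting array is purely numerical, so it can be passed to a decision
tree that only evaluates threshold tests on each column.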