On 06/03/2013 06:41 AM, Vlad Niculae wrote:
With the right settings, SGDClassifier is a home-cooked implementation
of an SVM, so there's that too.
That is true. Thinking about it, it is a bit weird that SGDClassifier is
in linear_model and LinearSVC is in svm, as they both solve the same
On 06/03/2013 05:19 AM, Joel Nothman wrote:
However, in these last two cases, the number of possible splits at a
single node is linear in the number of categories. Selecting an
arbitrary partition allows exponentially many splits with respect to
the number of categories (though there may
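The counting argument above can be made concrete: for a feature with k categories, an ordered (ordinal) encoding admits only k - 1 threshold splits at a node, while allowing an arbitrary binary partition of the category set gives 2^(k-1) - 1 distinct splits (each nonempty proper subset, counted up to swapping the two sides). A small sketch:

```python
# Number of candidate splits at one node for a k-category feature.
def n_ordinal_splits(k):
    # Thresholds between consecutive ordered values.
    return k - 1

def n_partition_splits(k):
    # Binary partitions of k categories, up to left/right symmetry.
    return 2 ** (k - 1) - 1

for k in (2, 4, 8, 16):
    print(k, n_ordinal_splits(k), n_partition_splits(k))
```

So at k = 16 the gap is already 15 vs. 32767 candidate splits, which is why arbitrary partitions are both more expressive and more prone to overfitting.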
On 06/03/2013 04:41 AM, Christian Jauvin wrote:
Sklearn does not implement any special treatment for categorical variables.
You can feed it any float. The question is whether it would work / what it does.
I think I'm confused about a couple of aspects (that's what happens I
guess when you play with
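The point about "no special treatment" can be illustrated: if you encode string categories as integers and feed them to a tree, the tree simply treats the codes as ordered numbers and splits on numeric thresholds, which may or may not be what you want. The data and encoding here are invented for illustration.

```python
# Encode string categories as integer codes and fit a tree on them.
# The tree sees ordinary floats and splits on thresholds over the codes.
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

colors = np.array(["red", "green", "blue", "green", "red", "blue"])
codes = LabelEncoder().fit_transform(colors)  # alphabetical: blue=0, green=1, red=2

X = codes.reshape(-1, 1).astype(float)
y = np.array([1, 0, 0, 0, 1, 0])  # label 1 iff the color is "red"

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
pred = clf.predict(X)
```

Here the split the tree learns (roughly "code > 1.5") happens to isolate "red", but the ordering blue < green < red is an artifact of the encoding, not of the data.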
On 06/02/2013 08:48 PM, Harold Nguyen wrote:
Hi Lars,
Thank you very much for this response. Please excuse my questions
since I'm new.
From the documentation on TfidfVectorizer here:
http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Does
On 3 June 2013 08:43, Andreas Mueller amuel...@ais.uni-bonn.de wrote:
On 06/03/2013 09:15 AM, Peter Prettenhofer wrote:
Our decision tree implementation only supports numerical splits, i.e.
tests of the form val <= threshold.
Categorical features need to be encoded properly. I recommend one-hot
encoding for features with small cardinality (e.g. < 50) and ordinal
2013/6/2 Harold Nguyen har...@nexgate.com:
http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Does TfidfVectorizer take a sequence of filenames, where each file is just a
plain text file?
That depends on the `input` parameter (the first in the signature).
2013/6/3 Andreas Mueller amuel...@ais.uni-bonn.de:
I named the variable, I think, and it is a bad name :-(
Should we rename it?
I think giving a count makes more sense than giving a frequency: you want to
exclude outliers that appear only once or twice, for example.
I actually hadn't seen
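The excerpt does not name the variable under discussion, but presumably it is the `min_df` cutoff of the vectorizers (an assumption from context). The count-vs-frequency distinction is that an integer value means an absolute document count while a float means a proportion of documents:

```python
# Presumably min_df is the parameter in question (assumption from context):
# an int is an absolute document count, a float a proportion of documents.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana", "apple cherry", "apple banana date"]

# min_df=2: keep only terms appearing in at least 2 documents.
vec = CountVectorizer(min_df=2)
vec.fit(docs)
kept = sorted(vec.vocabulary_)
```

Here "apple" (3 documents) and "banana" (2) survive the cutoff, while "cherry" and "date" (1 each) are dropped — exactly the once-or-twice outliers mentioned above.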
Many thanks to all for your help and detailed answers, I really appreciate it.
So I wanted to test the discussion's takeaway, namely, what Peter
suggested: one-hot encode the categorical features with small
cardinality, and leave the others in their ordinal form.
So from the same dataset I