Importance should be based on some statistic of the feature, for example the standard deviation of the feature column combined with the magnitude of its weight. If the columns are scaled to unit standard deviation (using StandardScaler), you can read the importance directly from the absolute value of the weight. There are other statistics for feature importance as well. It would be great if you are interested in working on this.

-Xiangrui
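A minimal sketch of the ranking described above, in plain Scala. The object and method names (FeatureRanking, importance, topK, above) are hypothetical and not part of MLlib. It assumes the weights have already been pulled out as an array (for example via model.weights.toArray) and that the per-column standard deviations were computed separately. It scores each feature as |weight| * stddev, which reduces to |weight| when the columns have unit standard deviation, and keeps the feature index next to each score so the top entries can be mapped back to features.

```scala
// Hypothetical helper, not an MLlib API. With a few thousand weights,
// local Scala collections are enough and no RDD is needed.
object FeatureRanking {
  // Score feature j as |w_j| * sigma_j. For columns already scaled to
  // unit standard deviation, pass 1.0 for every sigma_j.
  def importance(weights: Array[Double], stdDevs: Array[Double]): Array[Double] = {
    require(weights.length == stdDevs.length, "one standard deviation per weight")
    weights.zip(stdDevs).map { case (w, s) => math.abs(w) * s }
  }

  // (featureIndex, score) pairs for the k largest scores, highest first.
  def topK(scores: Array[Double], k: Int): Array[(Int, Double)] =
    scores.zipWithIndex.map(_.swap).sortBy { case (_, s) => -s }.take(k)

  // (featureIndex, score) pairs for every score strictly above the threshold.
  def above(scores: Array[Double], threshold: Double): Array[(Int, Double)] =
    scores.zipWithIndex.map(_.swap).filter { case (_, s) => s > threshold }
}
```

For example, with weights Array(0.5, -2.0, 0.1, 1.5) and unit standard deviations, topK(scores, 2) returns features 1 and 3, so the large negative weight is ranked by its magnitude and not dropped.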
On Thu, Sep 18, 2014 at 12:17 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> sc.parallelize(model.weights.toArray, blocks).top(k) will get that right ?
>
> For logistic you might want both positive and negative feature...so just
> pass it through a filter on abs and then pick top(k)
>
>
> On Thu, Sep 18, 2014 at 10:30 AM, Sameer Tilak <ssti...@live.com> wrote:
>>
>> Hi All,
>>
>> I am able to run LinearRegressionWithSGD on a small sample dataset (~60MB
>> Libsvm file of sparse data) with 6700 features.
>>
>> val model = LinearRegressionWithSGD.train(examples, numIterations)
>>
>> At the end I get a model that
>>
>> model.weights.size
>> res6: Int = 6699
>>
>> I am assuming each entry in the model is weight for the corresponding
>> feature/index. However, if I want to get the top10 most important features
>> or all features with weights higher than certain threshold, is that
>> functionality available out-of-box? I can implement that on my own, but
>> seems like a common feature that most of the people will need when they are
>> working on high-dimensional dataset.