Thanks for asking! We should improve the documentation. The sample dataset mimics the MNIST digits dataset, where the values are gray levels (0-255). Dividing by 16 is meant to map them to 16 coarse bins of gray levels. However, there is a bug in the doc: we should convert the values to integers first before dividing by 16. As written, the division is done on doubles, so it yields fractional values rather than bin indices. I created https://issues.apache.org/jira/browse/SPARK-7739 for this issue. You are welcome to submit a patch :) Thanks!
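For illustration, here is a minimal sketch of the intended binning in plain Scala, without Spark. The names `DiscretizeSketch`, `BinWidth` and `toBin` are hypothetical, and it assumes the features are gray levels in [0, 255]. It floors after dividing, which for non-negative values gives the same bins as integer division; the actual patch for SPARK-7739 may look different.

```scala
// Hypothetical helper names; not part of the Spark API or the documentation example.
object DiscretizeSketch {
  // 256 gray levels split into 16 bins gives 16 levels per bin.
  val BinWidth = 16

  // Map a gray level in [0, 255] to a bin index in [0, 15].
  // The result stays a Double because MLlib feature vectors hold Double values.
  def toBin(x: Double): Double = math.floor(x / BinWidth)

  def main(args: Array[String]): Unit = {
    val grayLevels = Array(0.0, 15.0, 16.0, 100.0, 255.0)
    // Plain division keeps the fraction (100.0 / 16 == 6.25); toBin drops it.
    val bins = grayLevels.map(toBin)
    println(bins.mkString(", ")) // prints: 0.0, 0.0, 1.0, 6.0, 15.0
  }
}
```

In the documentation example, the same idea would replace `x / 16` inside the `map` over `lp.features.toArray`.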
Best,
Xiangrui

On Thu, May 7, 2015 at 9:20 PM, spark_user_2015 <[email protected]> wrote:
> The Spark documentation shows the following example code:
>
> // Discretize data in 16 equal bins since ChiSqSelector requires categorical features
> val discretizedData = data.map { lp =>
>   LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => x / 16 } ) )
> }
>
> I'm sort of missing why "x / 16" is considered a discretization approach
> here.
>
> [https://spark.apache.org/docs/latest/mllib-feature-extraction.html#feature-selection]
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Discretization-tp22811.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
