Hi folks, I have a set of categorical columns (strings) that I'm parsing and converting into Vectors of features to pass to an MLlib classifier (random forest).
In my input data, some columns have null values. Say that one of those columns has p distinct values plus null. How should I build my feature Vectors and the categoricalFeaturesInfo map of the classifier?

* Option 1: I declare p values in categoricalFeaturesInfo and use Double.NaN in my input Vectors. (How are NaNs handled by the classifiers?)
* Option 2: I treat null as a value of its own, so I declare (p+1) values in categoricalFeaturesInfo and map nulls to some int.

Thanks for your help.

Mathieu

(PS: I know about the new DataFrame + Pipeline + VectorIndexer API, but for various reasons it doesn't fit my needs well, so I need to do this myself.)
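
To make option 2 concrete, here is a minimal sketch in plain Python (no Spark required) of the encoding I have in mind. The helper names build_index and encode_rows and the sample rows are made up for illustration. The only MLlib facts it relies on are that categoricalFeaturesInfo maps a feature index to its number of categories, and that a categorical feature with k categories must take values in 0..k-1. Here null gets the extra index p, so each column is declared with p+1 categories.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def build_index(values: Sequence[Optional[str]]) -> Dict[Optional[str], int]:
    """Map the p distinct non-null values to 0..p-1 and null (None) to p."""
    distinct = sorted({v for v in values if v is not None})
    index: Dict[Optional[str], int] = {v: i for i, v in enumerate(distinct)}
    index[None] = len(distinct)  # one extra category reserved for nulls
    return index


def encode_rows(
    rows: Sequence[Sequence[Optional[str]]],
) -> Tuple[List[List[float]], Dict[int, int]]:
    """Encode string rows as numeric feature lists plus a feature-arity map.

    Assumes rows is non-empty and every row has the same number of columns.
    """
    n_cols = len(rows[0])
    indexes = [build_index([row[j] for row in rows]) for j in range(n_cols)]
    features = [
        [float(indexes[j][row[j]]) for j in range(n_cols)] for row in rows
    ]
    # Feature index -> number of categories (p + 1, counting the null slot).
    categorical_features_info = {j: len(indexes[j]) for j in range(n_cols)}
    return features, categorical_features_info


# Hypothetical sample data: two categorical columns, each containing a null.
rows = [("red", "small"), (None, "large"), ("blue", None), ("red", "small")]
features, categorical_features_info = encode_rows(rows)

print(features)                   # numeric rows, ready to wrap in Vectors
print(categorical_features_info)  # arity per feature index
```

Each inner list would then be wrapped in a dense Vector inside a LabeledPoint, and the map passed as categoricalFeaturesInfo when training the random forest. One limitation of this sketch: a value that never appeared when the index was built raises a KeyError at encoding time, so unseen categories would need their own policy (for example, sending them to the null slot).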