Hi, I was looking through the MLlib examples in the official Spark documentation and came across the following: http://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html#tab_python_2
Specifically, these lines:

    label = data.map(lambda x: x.label)
    features = data.map(lambda x: x.features)
    ...
    ...
    data1 = label.zip(scaler1.transform(features))

My question: is it possible that some labels in the pairs returned by label.zip(..) are not paired with their original features? In other words, is the original ordering of `label` and `features` preserved through the scaler1.transform(..) and label.zip(..) operations?

This issue was also mentioned in http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p19433.html

I would greatly appreciate some clarification on this, as I've run into this issue while experimenting with feature extraction for text classification, where (correct me if I'm wrong) there is no built-in mechanism for keeping track of document IDs through the HashingTF and IDF fitting and transformations.

Thanks.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/order-preservation-with-RDDs-tp22052.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
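
P.S. To make the concern concrete, here is a rough sketch of the behaviour I am after, written in plain Python with no Spark involved. Every record gets an explicit id up front, and labels are re-paired with the transformed features by that id rather than by position. The toy data and the `center` function are made up for illustration; `center` is only a hypothetical stand-in for scaler1.transform(..), not the MLlib implementation.

```python
# Plain-Python sketch (no Spark): pair labels with transformed features
# by an explicit id instead of relying on positional order.

# Toy (label, features) records, invented for illustration.
records = [
    (1.0, [2.0, 10.0]),
    (0.0, [4.0, 20.0]),
    (1.0, [6.0, 30.0]),
]

# Attach an explicit id to every record before splitting it up.
keyed = list(enumerate(records))  # [(0, (label, features)), ...]
labels_by_id = {i: label for i, (label, _) in keyed}
features_by_id = {i: feats for i, (_, feats) in keyed}


def center(vectors_by_id):
    """Hypothetical stand-in for scaler1.transform: subtract column means."""
    n = len(vectors_by_id)
    dims = len(next(iter(vectors_by_id.values())))
    means = [sum(v[d] for v in vectors_by_id.values()) / n for d in range(dims)]
    return {i: [x - m for x, m in zip(v, means)] for i, v in vectors_by_id.items()}


scaled_by_id = center(features_by_id)

# Pretend the transformed side comes back in a different order.
arrival_order = sorted(scaled_by_id, reverse=True)

# Re-pair by id, so the arrival order no longer matters.
data1 = [(labels_by_id[i], scaled_by_id[i]) for i in sorted(arrival_order)]
```

My guess is that the Spark analogue would be something like zipWithIndex() on the original RDD followed by a join on the index, but since transform(..) takes the bare feature vectors I don't see how to carry the id through it, hence the question about whether zip(..) is safe here.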