order preservation with RDDs

kian.ho Sat, 14 Mar 2015 20:52:00 -0700

Hi, I was looking through the MLlib examples in the official Spark
documentation and came across the following:
http://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html#tab_python_2


Specifically, these lines:

label = data.map(lambda x: x.label)
features = data.map(lambda x: x.features)
...
...
data1 = label.zip(scaler1.transform(features))

My question:
Is it possible that some labels in the pairs returned by the
label.zip(..) operation end up paired with features other than their own?
In other words, is the original ordering of `label` and `features`
preserved after the scaler1.transform(..) and label.zip(..) operations?
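
To make the concern concrete, here is a minimal sketch using plain Python lists rather than Spark RDDs. The `LabeledRecord` tuple, the `scale` function and the sample values are hypothetical stand-ins, not MLlib APIs. It only illustrates that `zip` pairs elements by position, so the pairing is right exactly when both sides keep the order of the parent collection; it does not settle what Spark itself guarantees.

```python
from collections import namedtuple

# Hypothetical stand-in for a labeled point; this is not the MLlib class.
LabeledRecord = namedtuple("LabeledRecord", ["label", "features"])

data = [
    LabeledRecord(0.0, [1.0, 2.0]),
    LabeledRecord(1.0, [3.0, 4.0]),
    LabeledRecord(0.0, [5.0, 6.0]),
]


def scale(features):
    # Hypothetical element-wise transform standing in for scaler1.transform.
    return [2.0 * v for v in features]


label = [x.label for x in data]
features = [x.features for x in data]

# Positional pairing: correct only because both lists keep the order of `data`.
paired = list(zip(label, [scale(f) for f in features]))

# If one side is reordered, zip still succeeds but silently pairs the wrong elements.
mispaired = list(zip(label, [scale(f) for f in reversed(features)]))
```

Nothing in `zip` itself detects the second case, which is why the ordering question matters.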

This issue was also mentioned in
http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p19433.html

I would greatly appreciate some clarification on this, as I've run into this
issue whilst experimenting with feature extraction for text classification,
where (correct me if I'm wrong) there is no built-in mechanism to keep track
of document-ids through the HashingTF and IDF fitting and transformations.
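
For illustration, one way to avoid relying on positional pairing would be to carry the document id alongside each record through every step. The sketch below does this in plain Python with a hand-rolled hashing trick and a smoothed IDF. The names `hashing_tf` and `NUM_FEATURES`, the sample documents, and the use of `zlib.crc32` as the hash are all assumptions made for this example; it does not reproduce MLlib's HashingTF or IDF.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # hypothetical size, kept tiny for illustration


def hashing_tf(tokens):
    # Map each token to a bucket index and count occurrences (hashing trick).
    return Counter(zlib.crc32(t.encode("utf-8")) % NUM_FEATURES for t in tokens)


# Hypothetical corpus: each document travels with its id.
docs = [
    ("doc-a", ["spark", "rdd", "zip"]),
    ("doc-b", ["spark", "mllib"]),
    ("doc-c", ["rdd", "rdd", "order"]),
]

# Term frequencies stay keyed by document id.
tf = [(doc_id, hashing_tf(tokens)) for doc_id, tokens in docs]

# Document frequency per bucket, then a smoothed IDF weight per bucket.
df = Counter(bucket for _, counts in tf for bucket in counts)
n_docs = len(tf)
idf = {b: math.log((n_docs + 1.0) / (d + 1.0)) for b, d in df.items()}

# TF-IDF weights, still keyed by document id.
tfidf = {
    doc_id: {b: c * idf[b] for b, c in counts.items()}
    for doc_id, counts in tf
}
```

Because the id never leaves its record, no later `zip` is needed to reattach it.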

Thanks.



