As a source, I have a text file with n rows, each containing m
comma-separated integers.
Each row is then converted into a feature vector with m features.
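
For reference, my pipeline is roughly the following sketch (spark-shell
with MLlib; the input path, k, and maxIterations below are placeholders,
not my actual values):

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  // parse each line of comma-separated integers into a dense vector of m features
  val rows = sc.textFile("hdfs:///path/to/input.csv")
  val vectors = rows.map(line => Vectors.dense(line.split(',').map(_.toDouble))).cache()

  // train KMeans on the n resulting vectors (k and maxIterations are placeholders)
  val model = KMeans.train(vectors, 10, 20)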

I've noticed that, given the same total file size and the same total number
of values, a larger number of columns is much more expensive for training a
KMeans model than a larger number of rows.

To give an example:
10k rows x 1k columns took 21 seconds on my cluster, whereas 1k rows x 10k
columns took 1 min 47 s. Both files had a size of 238M.

Can someone explain what in the implementation of KMeans makes a few large
vectors so much more expensive than many smaller ones?
A pointer to the exact part of the source would be fantastic, but even a
general explanation would help me.


Best regards,
Simon 


