I have an RDD of SparseVectors and I'd like to compute their element-wise
mean, returned as a dense vector. I've tried the following (PySpark,
Spark v1.2.0):

def aggregate_partition_values(vec1, vec2):
    vec1[vec2.indices] += vec2.values
    return vec1

def aggregate_combined_vectors(vec1, vec2):
    if all(vec1 == vec2):
        # then the vector came from only one partition
        return vec1
    else:
        return vec1 + vec2

means = vals.aggregate(np.zeros(vec_len), aggregate_partition_values,
                       aggregate_combined_vectors)
means = means / nvals

This turns out to be really slow, and the run time doesn't seem to depend on
how many vectors there are, so there seems to be some overhead somewhere that
I don't understand. Is there a better way of doing this?
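
For reference, the intended computation can be checked locally without Spark.
This is only a minimal sketch under some assumptions: each sparse vector is
stood in for by an (indices, values) pair of NumPy arrays with unique indices,
the names seq_op, comb_op and local_mean and the sample partitions are
hypothetical, and the merge step uses plain element-wise addition rather than
the equality check in aggregate_combined_vectors.

```python
from functools import reduce

import numpy as np


def seq_op(acc, sparse):
    """Add one sparse vector, given as (indices, values), into the dense accumulator."""
    indices, values = sparse
    # In-place fancy-index addition; correct because indices within
    # one sparse vector are assumed to be unique.
    acc[indices] += values
    return acc


def comb_op(acc1, acc2):
    """Merge two per-partition accumulators by element-wise addition."""
    return acc1 + acc2


def local_mean(partitions, vec_len):
    """Emulate aggregate() locally: fold each partition, then merge the partial sums."""
    partials = [reduce(seq_op, part, np.zeros(vec_len)) for part in partitions]
    total = reduce(comb_op, partials, np.zeros(vec_len))
    count = sum(len(part) for part in partitions)
    return total / count


# Two hypothetical partitions of length-4 sparse vectors.
partitions = [
    [(np.array([0, 2]), np.array([1.0, 3.0])), (np.array([1]), np.array([2.0]))],
    [(np.array([0, 2]), np.array([1.0, 3.0])), (np.array([3]), np.array([4.0]))],
]
means = local_mean(partitions, 4)
```

This only checks the arithmetic on a small in-memory example; it says nothing
about where the time goes in the actual Spark job.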