Dear all,
I am using Spark 2 to cluster data with the K-means algorithm.
My input data is flat, but K-means requires sparse vectors with ordered
keys. Here is an example of the input and the expected output:
Input: [id, key, value]
[1, 10, 100]
[1, 30, 300]
[2, 40, 400]
[1, 20, 200]

Expected output: [id, list(key), list(value)]
[1, [10, 20, 30], [100, 200, 300]]
[2, [40], [400]]
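(For context, the ordered lists are meant to become ML sparse vectors,
roughly like this; a sketch only, where VECTOR_SIZE is just a placeholder
for the real dimensionality:)

import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;

// Sketch only: the ordered key list becomes the indices of a sparse
// vector, the value list becomes its values. VECTOR_SIZE is a placeholder.
int VECTOR_SIZE = 50;
int[] indices = {10, 20, 30};             // list(key) for id 1
double[] values = {100.0, 200.0, 300.0};  // list(value) for id 1
Vector features = Vectors.sparse(VECTOR_SIZE, indices, values);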
The naive algorithm (order by key, then group by id, then aggregate into
a list) does not work: the resulting list is unordered!
Depending on how the data is partitioned, Spark appends values to the
list as soon as it finds a row of the group, so the order depends on how
Spark plans the aggregation over the executors.
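For reference, this is the kind of naive version I mean (a sketch over
the same ds):

import static org.apache.spark.sql.functions.collect_list;

// Naive attempt: sort globally by key, then group; the ordering is NOT
// preserved by the groupBy aggregation.
ds.orderBy("key")
  .groupBy("id")
  .agg(
      collect_list("key").alias("list(key)"),
      collect_list("value").alias("list(value)"));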
Thanks to the web, I found a more complex algorithm:
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// Running window per id, ordered by key; then reduce to one row per id.
WindowSpec w = Window.partitionBy("id").orderBy("key");
ds.withColumn("collect_list(key)", collect_list("key").over(w))
  .withColumn("collect_list(value)", collect_list("value").over(w))
  .groupBy("id")
  .agg(
      max("collect_list(key)").alias("collect_list(key)"),
      max("collect_list(value)").alias("collect_list(value)"));
According to my tests, it returns the expected result, but it raises two
questions:
1. Is there an easier way to do it?
2. Why does it work? Can I assume that all the data with the same id
is in the same partition? Since I use over(w) twice, why are the
(key, value) pairs never mixed up?
Thanks for your help.
Denis