Dear all,

I am using Spark 2 to cluster data with the K-means algorithm. My input data is flat, and K-means requires sparse vectors with ordered keys. Here is an example of an input and the expected output:

[id, key, value]
[1, 10, 100]
[1, 30, 300]
[2, 40, 400]
[1, 20, 200]

[id, list(key), list(value)]
[1, [10, 20, 30], [100, 200, 300]]
[2, [40], [400]]
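
For context, the reason the keys must be ordered: the lists feed a sparse vector constructor, and Vectors.sparse (org.apache.spark.ml.linalg) assumes strictly increasing indices. A minimal illustration, with an arbitrary vector size of 50:

import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;

// Indices are assumed to be strictly increasing, hence the ordered keys.
Vector v = Vectors.sparse(50, new int[]{10, 20, 30},
                          new double[]{100.0, 200.0, 300.0});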

The naive algorithm, which orders by key, then groups by id, then aggregates into a list, does not work: the lists are unordered! Depending on how the data is partitioned, Spark appends values to the list as soon as it finds a row in the group, so the order depends on how Spark plans the aggregation over the executors.
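
For reference, here is a minimal sketch of that naive version (the aliases keys and values are mine):

import static org.apache.spark.sql.functions.*;

// Sorting first does NOT guarantee ordered lists: the groupBy may merge
// partial aggregates from several partitions in any order.
ds.orderBy("key")
  .groupBy("id")
  .agg(
      collect_list("key").alias("keys"),
      collect_list("value").alias("values"));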

Thanks to the web, I have a more complex algorithm:

import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.*;

// collect_list over a window ordered by key builds a running ordered prefix
// per row; max then keeps the longest (i.e. complete) prefix for each id.
WindowSpec w = Window.partitionBy("id").orderBy("key");
ds.withColumn("collect_list(key)", collect_list("key").over(w))
  .withColumn("collect_list(value)", collect_list("value").over(w))
  .groupBy("id")
  .agg(
      max("collect_list(key)").alias("collect_list(key)"),
      max("collect_list(value)").alias("collect_list(value)"));

According to my tests, it returns the expected result, but it opens two questions:

1. Is there an easier way to do it?
2. Why does it work? Can I assume that all the data with the same id is in the same partition? Since I use over(w) twice, why are the pairs (key, value) never mixed up?

Thanks for your help.
Denis


