Dear all,

I am using Spark 2 to cluster data with the K-means algorithm. My input data is flat, and K-means requires sparse vectors with ordered keys. Here is an example of an input and the expected output:

[id, key, value]
[1, 10, 100]
[1, 30, 300]
[2, 40, 400]
[1, 20, 200]

[id, list(key), list(value)]
[1, [10, 20, 30], [100, 200, 300]]
[2, [40], [400]]
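
For context, the reason the keys must be ordered: the lists feed a sparse vector constructor, and Vectors.sparse (org.apache.spark.ml.linalg) assumes strictly increasing indices. A minimal illustration, with an arbitrary vector size of 50:

import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;

// Indices are assumed to be strictly increasing, hence the ordered keys.
Vector v = Vectors.sparse(50, new int[]{10, 20, 30},
                          new double[]{100.0, 200.0, 300.0});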

The naive algorithm, which orders by key, then groups by id, then aggregates into a list, does not work: the lists are unordered! Depending on how the data is partitioned, Spark appends values to the list as soon as it finds a row in the group, so the order depends on how Spark plans the aggregation over the executors.
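
For reference, here is a minimal sketch of that naive version (the aliases keys and values are mine):

import static org.apache.spark.sql.functions.*;

// Sorting first does NOT guarantee ordered lists: the groupBy may merge
// partial aggregates from several partitions in any order.
ds.orderBy("key")
  .groupBy("id")
  .agg(
      collect_list("key").alias("keys"),
      collect_list("value").alias("values"));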

Thanks to the web, I have a more complex algorithm:

import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.*;

// collect_list over a window ordered by key builds a running ordered prefix
// per row; max then keeps the longest (i.e. complete) prefix for each id.
WindowSpec w = Window.partitionBy("id").orderBy("key");
ds.withColumn("collect_list(key)", collect_list("key").over(w))
  .withColumn("collect_list(value)", collect_list("value").over(w))
  .groupBy("id")
  .agg(
      max("collect_list(key)").alias("collect_list(key)"),
      max("collect_list(value)").alias("collect_list(value)"));

According to my tests, it returns the expected result, but it opens two questions:

1. Is there an easier way to do it?
2. Why does it work? Can I assume that all the data with the same id is in the same partition? Since I use over(w) twice, why are the pairs (key, value) never mixed up?

Thanks for your help.
Denis


