I'm working on implementing LSH on Spark, starting from an implementation provided by SoundCloud: https://github.com/soundcloud/cosine-lsh-join-spark/blob/master/src/main/scala/com/soundcloud/lsh/Lsh.scala

When I check the WebUI, I see that after calling sortBy, the number of partitions of the RDD decreases from 30 to 2. I also verified this by checking rdd.partitions.size. As far as I can tell from the code of the RDD class (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala), the default number of output partitions should equal the number of partitions in the parent RDD, which in this case should be 30. Even when I set the number explicitly, the problem still occurs.

However, when I try the following simple code, it works as I expect:

    val d = Seq(1, 2, 5, 6, 3, 4, 2)
    val data = sc.parallelize(d, 5)
    val sortedData = data.sortBy(x => x)
    println(sortedData.partitions.size) // prints "5"
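(A guess at the cause, not something confirmed in the post above: sortBy sorts through a RangePartitioner, which samples the keys to compute range bounds. When the sampled keys contain fewer distinct values than the requested number of partitions, the resulting RDD ends up with fewer partitions than asked for. LSH hash signatures typically contain many duplicate keys, which would explain a drop from 30 to 2 even with numPartitions set explicitly. A minimal sketch, assuming an existing SparkContext named sc and illustrative skewed data:)

    // Data with only two distinct keys across 30 partitions (illustrative).
    val skewed = sc.parallelize(Seq.fill(1000)(1) ++ Seq.fill(1000)(2), 30)

    // Even with numPartitions = 30, the RangePartitioner can only produce
    // as many partitions as it finds distinct range bounds in its key sample.
    val sorted = skewed.sortBy(x => x, ascending = true, numPartitions = 30)
    println(sorted.partitions.size) // may be far below 30 with few distinct keys

    // Possible workaround if downstream stages need the parallelism back:
    // repartition after the sort (note this shuffles and breaks the sort order
    // across partitions).
    val restored = sorted.repartition(30)
    println(restored.partitions.size) // 30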
I'm using Spark 1.6.1. Thank you for your help. (Screenshot: http://apache-spark-user-list.1001560.n3.nabble.com/file/n26819/54.png)