Re: Secondary Sort using Apache Spark 1.6

2017-03-29 Thread Pariksheet Barapatre
> res0: org.apache.spark.rdd.RDD[(DeviceKey, Int)] = ShuffledRDD[1] at repartitionAndSortWithinPartitions at <console>:30
>
> Yong
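
The ShuffledRDD quoted above is the output of repartitionAndSortWithinPartitions. As a minimal sketch (not from the thread; DeviceKey's fields are truncated in the original question, so serialNum and eventTime are assumed), this is how such a partition-sorted RDD can be consumed in a single streaming pass:

    import org.apache.spark.rdd.RDD

    case class DeviceKey(serialNum: String, eventTime: Long) // assumed fields

    // After repartitionAndSortWithinPartitions, all records for a serialNum
    // sit in one partition, adjacent and ordered by eventTime, so one pass
    // can pick e.g. the earliest event per device without grouping in memory.
    def earliestPerDevice(sorted: RDD[(DeviceKey, Int)]): RDD[(String, (Long, Int))] =
      sorted.mapPartitions { iter =>
        var lastSerial: Option[String] = None
        iter.flatMap { case (k, v) =>
          if (lastSerial.contains(k.serialNum)) Iterator.empty
          else {
            lastSerial = Some(k.serialNum)
            Iterator((k.serialNum, (k.eventTime, v)))
          }
        }
      }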

Re: Secondary Sort using Apache Spark 1.6

2017-03-29 Thread Yong Zhang
scala> t.repartitionAndSortWithinPartitions(new DeviceKeyPartitioner(2))
res0: org.apache.spark.rdd.RDD[(DeviceKey, Int)] = ShuffledRDD[1] at repartitionAndSortWithinPartitions at <console>:30

Yong
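
DeviceKeyPartitioner itself never appears in the thread; a plausible sketch, assuming the intent is to partition on serialNum alone so that each device's events land in a single partition:

    import org.apache.spark.Partitioner

    case class DeviceKey(serialNum: String, eventTime: Long) // assumed fields

    class DeviceKeyPartitioner(partitions: Int) extends Partitioner {
      require(partitions > 0, s"Number of partitions ($partitions) must be positive")

      override def numPartitions: Int = partitions

      // Hash only the device serial: all records of one device co-locate,
      // while the key's Ordering sorts them by eventTime within the partition.
      override def getPartition(key: Any): Int = {
        val serial = key.asInstanceOf[DeviceKey].serialNum
        (serial.hashCode % partitions + partitions) % partitions
      }
    }

Note that repartitionAndSortWithinPartitions also needs an implicit Ordering[DeviceKey] in scope; tuple keys get one for free, case classes do not (a sketch follows under the original question below).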

Secondary Sort using Apache Spark 1.6

2017-03-29 Thread Pariksheet Barapatre
Hi, I am referring to the web link http://codingjunkie.net/spark-secondary-sort/ to implement secondary sort in my Spark job; the same question is at http://stackoverflow.com/questions/43038682/secondary-sort-using-apache-spark-1-6. I have defined my key case class as: case class DeviceKey(serialNum: String, eve…
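
The case class is cut off mid-definition; following the pattern in the linked article, a plausible reconstruction (every field beyond serialNum is an assumption):

    case class DeviceKey(serialNum: String, eventTime: Long)

    object DeviceKey {
      // Secondary sort: serialNum decides the group, eventTime orders the
      // records inside it. Placing the implicit in the companion object lets
      // repartitionAndSortWithinPartitions pick it up without an import.
      implicit val ordering: Ordering[DeviceKey] =
        Ordering.by(k => (k.serialNum, k.eventTime))
    }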

GroupedDataset flatMapGroups with sorting (aka secondary sort redux)

2016-02-12 Thread Koert Kuipers
Is there a way to leverage the shuffle in Dataset/GroupedDataset so that the Iterator[V] in flatMapGroups has a well-defined ordering? It is hard for me to see many good use cases for flatMapGroups and mapGroups if you do not have sorting. Since Spark has a sort-based shuffle, not exposing this would be…
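
A minimal sketch of the situation described here, written against the later Spark 2.x Dataset API (in 1.6 the same methods live on GroupedDataset); the Event type and the gap computation are illustrative. Because the Iterator[V] handed to flatMapGroups has no guaranteed order, the values must be sorted in memory per group, which is exactly the buffering a shuffle-time sort would avoid:

    import org.apache.spark.sql.SparkSession

    case class Event(deviceId: String, ts: Long, value: Int)

    object FlatMapGroupsSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
        import spark.implicits._

        val events = Seq(Event("a", 2L, 10), Event("a", 1L, 5), Event("b", 1L, 7)).toDS()

        // iter arrives in no particular order, so we sort per group in memory;
        // with large groups this is the cost the post wants the shuffle to absorb.
        val gaps = events
          .groupByKey(_.deviceId)
          .flatMapGroups { (id, iter) =>
            val ordered = iter.toSeq.sortBy(_.ts)
            ordered.sliding(2).collect { case Seq(a, b) => (id, b.ts - a.ts) }
          }

        gaps.show()
        spark.stop()
      }
    }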

Re: What is the optimal approach to do Secondary Sort in Spark?

2015-08-11 Thread Kevin Jung
You should create the key as a tuple type. In your case, RDD[((id, timeStamp), value)] is the proper way to do it. Kevin
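
A sketch of this tuple-key suggestion (the partitioner and the value type are assumed details, since the reply is truncated): move the timestamp into the key, partition on the id alone, and let the shuffle order each partition by (id, timeStamp):

    import org.apache.spark.Partitioner
    import org.apache.spark.rdd.RDD

    // Partition on the id component only, so every record for an id lands
    // in the same partition regardless of its timestamp.
    class IdPartitioner(partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = {
        val id = key.asInstanceOf[(String, Long)]._1
        (id.hashCode % partitions + partitions) % partitions
      }
    }

    def secondarySort(events: RDD[(String, (Long, Double))]): RDD[((String, Long), Double)] =
      events
        .map { case (id, (ts, value)) => ((id, ts), value) }
        // tuples already carry an Ordering, so this sorts by id, then timeStamp
        .repartitionAndSortWithinPartitions(new IdPartitioner(4))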

What is the optimal approach to do Secondary Sort in Spark?

2015-08-11 Thread swetha
Hi, What is the optimal approach to do a secondary sort in Spark? I have to first sort by an id in the key and then further sort by a timeStamp which is present in the value. Thanks, Swetha

Re: spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Koert Kuipers
…so well: it allowed sorting of values (using secondary sort), and it processed all values per key in a streaming fashion. The library spark-sorted aims to bring this kind of operation back to Spark by providing a way to process values with a user-provided Ordering[V] and a user-provided…
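
This is not spark-sorted's actual API (the project's README documents that); a plain-RDD sketch of the idea being described, folding over each key's values in sorted order without ever buffering a key's values in memory:

    import org.apache.spark.Partitioner
    import org.apache.spark.rdd.RDD

    // Route each (key, value) composite by the key alone.
    class KeyPartitioner[K](partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(composite: Any): Int = {
        val k = composite.asInstanceOf[(K, _)]._1
        (k.hashCode % partitions + partitions) % partitions
      }
    }

    // Sort (K, V) pairs in the shuffle, then fold each key's consecutive run
    // of values in a single pass over the partition.
    def foldLeftByKey[K: Ordering, V: Ordering, B](rdd: RDD[(K, V)], partitions: Int)
                                                  (zero: B)(f: (B, V) => B): RDD[(K, B)] = {
      val sorted = rdd
        .map { case (k, v) => ((k, v), ()) }
        .repartitionAndSortWithinPartitions(new KeyPartitioner[K](partitions))
      sorted.mapPartitions { iter =>
        val in = iter.buffered
        new Iterator[(K, B)] {
          def hasNext: Boolean = in.hasNext
          def next(): (K, B) = {
            val k = in.head._1._1
            var acc = zero
            while (in.hasNext && in.head._1._1 == k)
              acc = f(acc, in.next()._1._2)
            (k, acc)
          }
        }
      }
    }

For example, foldLeftByKey(clicks, 8)(List.empty[Long])((acc, ts) => ts :: acc) sees each key's timestamps in ascending order while holding only the accumulator.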

spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Koert Kuipers
…Basically this is what the original Hadoop reduce operation did so well: it allowed sorting of values (using secondary sort), and it processed all values per key in a streaming fashion. The library spark-sorted aims to bring this kind of operation back to Spark by providing a way to process values…

Re: spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Burak Yavuz
…when they do not fit in memory. Examples are algorithms that need to process the values in order, or algorithms that need to emit all values again. Basically this is what the original Hadoop reduce operation did so well: it allowed sorting of values (using secondary sort), and it processed all…

secondary sort

2014-09-20 Thread Koert Kuipers
Now that Spark has a sort-based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful.
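
For contrast, a sketch of the fallback available without shuffle support (names illustrative): groupByKey must materialize every value for a key before the in-memory sort can run, which is precisely what a shuffle-time secondary sort avoids:

    import org.apache.spark.rdd.RDD

    // Buffers each key's values in memory just to sort them: fine for small
    // groups, a liability for the large ones this thread cares about.
    def sortedValues(events: RDD[(String, Long)]): RDD[(String, Seq[Long])] =
      events.groupByKey().mapValues(_.toSeq.sorted)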