scala> t.repartitionAndSortWithinPartitions(new DeviceKeyPartitioner(2))
res0: org.apache.spark.rdd.RDD[(DeviceKey, Int)] = ShuffledRDD[1] at repartitionAndSortWithinPartitions at <console>:30

Yong

From: Pariksheet Barapatre <pbarapa...@gmail.com>
Sent: Wednesday, March 29, 2017 9:02 AM
To: user
Subject: Secondary sort using Apache Spark 1.6
Hi,

I am referring to the web link http://codingjunkie.net/spark-secondary-sort/ to
implement secondary sort in my Spark job (see also
http://stackoverflow.com/questions/43038682/secondary-sort-using-apache-spark-1-6).

I have defined my key case class as

case class DeviceKey(serialNum: String, eventTime: Long)
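For reference, a minimal sketch of the pattern being discussed, assuming the key's second field is an eventTime: Long and that the DeviceKeyPartitioner from the shell session above partitions on serialNum alone (both are assumptions based on the thread, not the poster's exact code):

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Composite key: group by serialNum, order by eventTime within a device
case class DeviceKey(serialNum: String, eventTime: Long)

object DeviceKey {
  // Picked up implicitly by repartitionAndSortWithinPartitions
  implicit val ordering: Ordering[DeviceKey] =
    Ordering.by(k => (k.serialNum, k.eventTime))
}

// Route records by serialNum only, so all events for a device share a
// partition while the shuffle sorts them by the full (serialNum, eventTime) key
class DeviceKeyPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[DeviceKey]
    (k.serialNum.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}

// t: RDD[(DeviceKey, Int)], as in the shell session above
def secondarySorted(t: RDD[(DeviceKey, Int)]): RDD[(DeviceKey, Int)] =
  t.repartitionAndSortWithinPartitions(new DeviceKeyPartitioner(2))

Partitioning on serialNum while ordering on the full key is what makes this a secondary sort: one shuffle groups by device and sorts by time at the same time.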
is there a way to leverage the shuffle in Dataset/GroupedDataset so that the
Iterator[V] in flatMapGroups has a well-defined ordering? it is hard for me to
see many good use cases for flatMapGroups and mapGroups if you do not have
sorting. since spark has a sort-based shuffle, not exposing this would be a
missed opportunity.
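As far as I know, the Iterator[V] handed to flatMapGroups carries no ordering guarantee. A sketch of the usual workaround on the Spark 2.x Dataset API, with a hypothetical Event type (names are illustrative, not from the thread): repartition on the grouping column, sort within partitions, and walk the partition iterator directly instead of using flatMapGroups:

import org.apache.spark.sql.SparkSession

// Hypothetical record type for illustration
case class Event(serialNum: String, eventTime: Long, value: Int)

val spark = SparkSession.builder().master("local[*]") // local master for a quick test
  .appName("sorted-groups").getOrCreate()
import spark.implicits._

val events = Seq(
  Event("a", 2L, 10), Event("a", 1L, 20), Event("b", 5L, 30)
).toDS()

// All rows for a serialNum land in one partition, sorted by eventTime, so
// mapPartitions sees each device's rows contiguously and in time order
val gaps = events
  .repartition($"serialNum")
  .sortWithinPartitions($"serialNum", $"eventTime")
  .mapPartitions { rows =>
    rows.sliding(2).collect {
      case Seq(a, b) if a.serialNum == b.serialNum =>
        (a.serialNum, b.eventTime - a.eventTime) // time gap between consecutive events
    }
  }

This relies on the physical layout after the explicit sort rather than on any contract of the grouped API, which is exactly the gap the question points at.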
You should create the key as a tuple type. In your case, RDD[((id, timeStamp),
value)] is the proper way to do it.

Kevin
--- Original Message ---
Sender : swetha <swethakasire...@gmail.com>
Date : 2015-08-12 09:37 (GMT+09:00)
Title : What is the optimal approach to do Secondary Sort in Spark?
Hi,
What is the optimal approach to do secondary sort in Spark? I have to first
sort by an Id in the key, and further sort by a timeStamp which is present in
the value.
Thanks,
Swetha
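A minimal sketch of the tuple-key approach Kevin describes, with illustrative types and a hypothetical IdPartitioner (not code from the thread): promote the timestamp into the key, partition on the id alone, and let the shuffle order the composite key:

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Partition on the id alone so every record for an id shares a partition
class IdPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val (id, _) = key.asInstanceOf[(String, Long)]
    (id.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}

// events: keyed by id, value = (timeStamp, payload)
def secondarySort(events: RDD[(String, (Long, String))]): RDD[((String, Long), String)] =
  events
    .map { case (id, (ts, payload)) => ((id, ts), payload) } // lift ts into the key
    // tuples already have a lexicographic Ordering, so within each partition
    // records arrive grouped by id and sorted by timeStamp
    .repartitionAndSortWithinPartitions(new IdPartitioner(8))

No custom Ordering is needed here, which is the appeal of the tuple key: the built-in Ordering for (String, Long) already sorts by id first and timeStamp second.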
[...] when they do not fit in memory. examples are algorithms that need to
process the values ordered, or algorithms that need to emit all values again.
basically this is what the original hadoop reduce operation did so well: it
allowed sorting of values (using secondary sort), and it processed all values
per key in a streaming fashion.

the library spark-sorted aims to bring these kinds of operations back to
spark, by providing a way to process values with a user-provided Ordering[V]
and a user-provided [...]
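In plain Spark, the streaming per-key processing described above can be approximated with the same secondary-sort trick: lift each value into the key, let the shuffle sort by a user-provided Ordering[V], then fold each key's run of records lazily. A rough sketch of the idea (my own helper, not the spark-sorted API):

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Streaming fold over each key's values, visited in valueOrd order, without
// ever materializing one key's values in memory (hadoop-reduce style)
def foldLeftByKey[K, V, R](rdd: RDD[(K, V)], partitions: Int, zero: R)(op: (R, V) => R)(
    implicit keyOrd: Ordering[K], valueOrd: Ordering[V]): RDD[(K, R)] = {

  // sort by (key, value) but partition on the key alone
  implicit val kvOrd: Ordering[(K, V)] = Ordering.Tuple2(keyOrd, valueOrd)
  val byKey = new Partitioner {
    def numPartitions: Int = partitions
    def getPartition(key: Any): Int = {
      val k = key.asInstanceOf[(K, V)]._1
      (k.hashCode & Integer.MAX_VALUE) % partitions
    }
  }

  rdd
    .map { case (k, v) => ((k, v), ()) }          // lift the value into the key
    .repartitionAndSortWithinPartitions(byKey)    // grouped by k, sorted by v
    .mapPartitions { it =>
      val buf = it.buffered
      new Iterator[(K, R)] {
        def hasNext: Boolean = buf.hasNext
        def next(): (K, R) = {                    // consume one key's run of records
          val k = buf.head._1._1
          var acc = zero
          while (buf.hasNext && keyOrd.equiv(buf.head._1._1, k))
            acc = op(acc, buf.next()._1._2)
          (k, acc)
        }
      }
    }
}

For example, foldLeftByKey(events, 8, 0L)((acc, ts) => acc max ts) visits each key's timestamps in ascending order while holding only the accumulator in memory.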
now that spark has a sort-based shuffle, can we expect a secondary sort soon?
there are some use cases where getting a sorted iterator of values per key is
helpful.