Re: Efficient way to get top K values per key in (key, value) RDD?

2015-06-17 Thread Xiangrui Meng
This is implemented in MLlib:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L41.
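For reference, usage looks roughly like this (a small sketch; it assumes an
existing SparkContext named sc and a Spark version recent enough to include
topByKey):

import org.apache.spark.mllib.rdd.MLPairRDDFunctions._

// (category, score) pairs; keep only the top 3 scores per category
val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 5.0), ("a", 3.0), ("b", 2.0)))
val top3 = pairs.topByKey(3) // RDD[(String, Array[Double])], up to 3 largest values per key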
-Xiangrui

On Wed, Jun 10, 2015 at 1:53 PM, erisa  wrote:
> Hi,
>
> I am a Spark newbie trying to solve the same problem, and I have
> implemented exactly the solution that sowen is suggesting: I am using
> priority queues to keep track of the top 25 sub_categories for each category,
> and using the combineByKey function to do that.
> However, I run into the following exception when I submit the Spark job:
>
> ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 17)
> java.lang.UnsupportedOperationException: unsuitable as hash key
> at scala.collection.mutable.PriorityQueue.hashCode(PriorityQueue.scala:226)
>
>
> From the error it looks like Spark is trying to use the mutable priority
> queue as a hash key, so the error makes sense, but I don't get why it is
> doing that, since the value of the RDD record is the priority queue, not the key.
>
> Maybe there is a more straightforward solution to what I want to achieve, so
> any suggestion is appreciated :)
>
> Thanks,
> Erisa
>




Re: Efficient way to get top K values per key in (key, value) RDD?

2015-06-10 Thread erisa
Hi,

I am a Spark newbie trying to solve the same problem, and I have
implemented exactly the solution that sowen is suggesting: I am using
priority queues to keep track of the top 25 sub_categories for each category,
and using the combineByKey function to do that.
However, I run into the following exception when I submit the Spark job:

ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 17)
java.lang.UnsupportedOperationException: unsuitable as hash key
at scala.collection.mutable.PriorityQueue.hashCode(PriorityQueue.scala:226)


From the error it looks like Spark is trying to use the mutable priority
queue as a hash key, so the error makes sense, but I don't get why it is
doing that, since the value of the RDD record is the priority queue, not the key.
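
As a side note, the hashCode part can be reproduced outside Spark; it is
scala.collection.mutable.PriorityQueue itself that refuses to be hashed:

import scala.collection.mutable.PriorityQueue

val pq = PriorityQueue(1, 2, 3)
pq.hashCode() // throws java.lang.UnsupportedOperationException: unsuitable as hash key

So any code path that ends up hashing the queue will hit this.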

Maybe there is a more straightforward solution to what I want to achieve, so
any suggestion is appreciated :)

Thanks,
Erisa







Re: Efficient way to get top K values per key in (key, value) RDD?

2014-12-04 Thread Sean Owen
You probably want to use combineByKey and create an empty min-queue
for each key. Merge a value into the queue if the queue's size is < K. If it
is already >= K, only merge the value if it exceeds the smallest element; if
so, add it and remove the smallest element.

This gives you an RDD of keys mapped to collections of up to K values
each, and should be about as efficient as it gets in general.
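
A rough sketch of that in Scala, in case it is useful (untested; the helper
name topKByKey and the exact types are only for illustration):

import scala.reflect.ClassTag
import scala.collection.mutable
import org.apache.spark.rdd.RDD

// Keep at most k values per key. The queue uses the reversed ordering
// (a min-queue), so the smallest retained value sits at the head and is
// cheap to inspect and evict.
def topKByKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)], k: Int)
                                       (implicit ord: Ordering[V]): RDD[(K, Seq[V])] = {
  def insert(q: mutable.PriorityQueue[V], v: V) = {
    if (q.size < k) q += v
    else if (ord.gt(v, q.head)) { q.dequeue(); q += v }
    q
  }
  rdd.combineByKey(
    (v: V) => insert(mutable.PriorityQueue.empty[V](ord.reverse), v),
    (q: mutable.PriorityQueue[V], v: V) => insert(q, v),
    (q1: mutable.PriorityQueue[V], q2: mutable.PriorityQueue[V]) => {
      q2.foreach(v => insert(q1, v)); q1
    }
  ).mapValues(_.toSeq.sorted(ord.reverse)) // hand back an immutable, hash-safe collection
}

(MLlib's MLPairRDDFunctions.topByKey, linked elsewhere in this thread,
implements essentially the same idea.)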

On Thu, Dec 4, 2014 at 8:53 AM, Theodore Vasiloudis wrote:
> Hello everyone,
>
> I was wondering what is the most efficient way for retrieving the top K
> values per key in a (key, value) RDD.
>
> The simplest way I can think of is to do a groupByKey, sort the iterables
> and then take the top K
> elements for every key.
>
> But groupByKey is an operation that can be very costly.
>
> This thread seems related, where it is recommended to change the key to
> include the value we want to sort on, and then perform an aggregate
> operation.
>
> My use case would be to filter an RDD representing the edges of a graph (
> (srcID, dstID), edgeWeight),
> so that we only retain at most top K edges according to weight for each
> (srcID, dstID) key.
> The graph can have multiple edges between the same two vertices.
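
(For comparison, the straightforward version from the question, applied to the
edge use case, would look something like the sketch below; it works, but
groupByKey materializes every value for a key before anything is dropped. The
SparkContext sc, the example data, and K are of course placeholders.)

// Hypothetical edge RDD: ((srcId, dstId), weight), possibly with parallel edges
val edges = sc.parallelize(Seq(((1L, 2L), 0.5), ((1L, 2L), 0.9), ((2L, 3L), 0.1)))
val K = 1
val topEdges = edges.groupByKey().mapValues(_.toSeq.sortBy(w => -w).take(K))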
