This works:

val top10 = logs.filter(log => log.responseCode != 200).map(log =>
(log.endpoint, 1)).reduceByKey(_ + _).top(10)(Ordering.by(_._2))


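For what it's worth, here is a minimal plain-Scala sketch (no Spark required) of the idea Sean describes below: keep at most n candidates per partition, then merge those candidates, so the full data set is never sorted. TopNSketch, topOfPartition, topN and the sample data are made-up names for illustration, and the bounded heap is only my assumption about one reasonable way to do this, not a claim about how Spark's top is implemented.

```scala
import scala.collection.mutable

// Illustrative sketch only: each inner Seq stands in for one RDD partition.
object TopNSketch {
  // Keep at most n of the largest elements seen, using a bounded heap.
  def topOfPartition[T](part: Iterator[T], n: Int)(implicit ord: Ordering[T]): Seq[T] = {
    require(n > 0, "n must be positive")
    // With the reversed ordering, heap.head is the smallest element kept so far.
    val heap = mutable.PriorityQueue.empty[T](ord.reverse)
    part.foreach { x =>
      if (heap.size < n) heap.enqueue(x)
      else if (ord.gt(x, heap.head)) { heap.dequeue(); heap.enqueue(x) }
    }
    heap.toSeq.sorted(ord.reverse) // largest first
  }

  // Take the top n of each partition, then the top n of the merged candidates.
  def topN[T](partitions: Seq[Seq[T]], n: Int)(implicit ord: Ordering[T]): Seq[T] =
    topOfPartition(partitions.iterator.flatMap(p => topOfPartition(p.iterator, n)(ord)), n)(ord)

  // Hypothetical (endpoint, count) pairs, as if produced by reduceByKey(_ + _).
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq(("/a", 5), ("/b", 2)),
    Seq(("/c", 9), ("/d", 1)),
    Seq(("/e", 7))
  )

  // Order pairs by count; equivalent to Ordering.by(_._2) where inference allows it.
  val byCount: Ordering[(String, Int)] = Ordering.by[(String, Int), Int](_._2)

  def main(args: Array[String]): Unit = {
    // The two entries with the highest counts: ("/c", 9) then ("/e", 7).
    println(topN(partitions, 2)(byCount))
  }
}
```

Passing byCount explicitly plays the same role as the Ordering argument to top(10) above, so no swap of the pairs is needed.
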
On Tue, Oct 20, 2015 at 11:07 AM, Sean Owen <so...@cloudera.com> wrote:

> I believe it will be most efficient to let top(n) do the work, rather than
> sort the whole RDD and then take the first n. The reason is that top and
> takeOrdered know they need at most n elements from each partition, and then
> just need to merge those. It's never required to sort the whole thing.
>
> I also believe it will be marginally faster to provide an Ordering rather
> than swap pairs just to use the natural Ordering, but, I don't know if it's
> significant.
>
> Note that I think you can write "Ordering.by(_._2)" to be more concise
> (not 100% sure about the syntax off the top of my head).
>
>
>
> On Tue, Oct 20, 2015 at 3:56 PM, Carol McDonald <cmcdon...@maprtech.com>
> wrote:
>
>> To find the top 10 counts, which is better: using top(10) with an Ordering
>> on the value, or swapping the key and value and ordering on the key? For
>> example, which of the options below is better? Or does it matter?
>>
>>  val top10 = logs.filter(log => log.responseCode != 200).map(log =>
>> (log.endpoint, 1)).reduceByKey(_ + _).top(10)(Ordering[Long].on(x=>x._2))
>>
>>
>>  val top10 = logs.filter(log => log.responseCode != 200).map(log =>
>> (log.endpoint, 1)).reduceByKey((x,y)=>x+y).map(x=>(x._2,x._1)).sortByKey(false).take(10)
>>
>>
>>  val top10 = logs.filter(log => log.responseCode != 200).map(log =>
>> (log.endpoint, 1)).reduceByKey((x,y)=>x+y).map(pair => pair.swap).top(10)
>>
>>
>
