Re: Re: Selecting the top 100 records per group by?

2016-09-29 Thread Mariano Semelman
It's not Spark specific, but it answers your question:
https://blog.jooq.org/2014/08/12/the-difference-between-row_number-rank-and-dense_rank/
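In short, over the same ordering ROW_NUMBER gives every row a distinct position even on ties, RANK gives tied rows the same number and then skips ahead, and DENSE_RANK gives tied rows the same number without gaps. A minimal Spark SQL sketch of the difference, assuming a Spark 2.x shell where spark is the session and reusing the test.prices table that appears later in this thread:

spark.sql("""
  SELECT security, price,
         ROW_NUMBER() OVER (ORDER BY price DESC) AS rn,
         RANK()       OVER (ORDER BY price DESC) AS rnk,
         DENSE_RANK() OVER (ORDER BY price DESC) AS drnk
  FROM test.prices
""").show()
// Tied prices get distinct rn values but the same rnk and drnk;
// rnk skips numbers after a tie, drnk does not.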

On 12 September 2016 at 12:42, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> I don't understand why you need to add a column row_number when you can
> use rank or dense_rank?
>
> Why can't one use rank or dense_rank here?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 12 September 2016 at 15:37, <luohui20...@sina.com> wrote:
>
>> Hi Kevin,
>> A window function is what you need, like below:
>> val hivetable = hc.sql("select * from house_sale_pv_location")
>> val overLocation = Window.partitionBy(hivetable.col("lp_location_id"))
>> val sortedDF = hivetable.withColumn("rowNumber",
>> row_number().over(overLocation)).filter("rowNumber<=50")
>>
>> Here I add a column named rowNumber, partition the data by location, and
>> keep the top 50 rows per partition.
>>
>>
>>
>> 
>>
>> Thanks
>> Best regards!
>> San.Luo
>>
>> - Original Message -
>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>> To: "user @spark" <user@spark.apache.org>
>> Subject: Re: Selecting the top 100 records per group by?
>> Date: 11 September 2016, 22:20
>>
>> You can of course do this using FP.
>>
>> val wSpec = Window.partitionBy('price).orderBy(desc("price"))
>> df2.filter('security > " ").select(dense_rank().over(wSpec).as("rank"),
>> 'TIMECREATED, 'SECURITY, substring('PRICE,1,7)).filter('rank<=10).show
>>
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 11 September 2016 at 07:15, Mich Talebzadeh <mich.talebza...@gmail.com
>> > wrote:
>>
>> DENSE_RANK will give you ordering and sequence within a particular
>> column. This is Hive
>>
>>  var sqltext = """
>>  | SELECT RANK, timecreated,security, price
>>  |  FROM (
>>  |SELECT timecreated,security, price,
>>  |   DENSE_RANK() OVER (ORDER BY price DESC ) AS RANK
>>  |  FROM test.prices
>>  |   ) tmp
>>  |  WHERE rank <= 10
>>  | """
>> sql(sqltext).collect.foreach(println)
>>
>> [1,2016-09-09 16:55:44,Esso,99.995]
>> [1,2016-09-09 21:22:52,AVIVA,99.995]
>> [1,2016-09-09 21:22:52,Barclays,99.995]
>> [1,2016-09-09 21:24:28,JPM,99.995]
>> [1,2016-09-09 21:30:38,Microsoft,99.995]
>> [1,2016-09-09 21:31:12,UNILEVER,99.995]
>> [2,2016-09-09 16:54:14,BP,99.99]
>> [2,2016-09-09 16:54:36,Tate & Lyle,99.99]
>> [2,2016-09-09 16:56:28,EASYJET,99.99]
>> [2,2016-09-09 16:59:28,IBM,99.99]
>> [2,2016-09-09 20:16:10,EXPERIAN,99.99]
>> [2,2016-09-09 22:25:20,Microsoft,99.99]
>> [2,2016-09-09 22:53:49,Tate & Lyle,99.99]
>> [3,2016-09-09 15:31:06,UNILEVER,99.985]
>>
>> HTH
>>
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http

Re: Re: Selecting the top 100 records per group by?

2016-09-12 Thread Mich Talebzadeh
Hi,

I don't understand why you need to add a column row_number when you can use
rank or dense_rank?

Why can't one use rank or dense_rank here?

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 12 September 2016 at 15:37, <luohui20...@sina.com> wrote:

> Hi Kevin,
> A window function is what you need, like below:
> val hivetable = hc.sql("select * from house_sale_pv_location")
> val overLocation = Window.partitionBy(hivetable.col("lp_location_id"))
> val sortedDF = hivetable.withColumn("rowNumber",
> row_number().over(overLocation)).filter("rowNumber<=50")
>
> Here I add a column named rowNumber, partition the data by location, and
> keep the top 50 rows per partition.
>
>
>
> 
>
> Thanks
> Best regards!
> San.Luo
>
> - Original Message -
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> To: "user @spark" <user@spark.apache.org>
> Subject: Re: Selecting the top 100 records per group by?
> Date: 11 September 2016, 22:20
>
> You can of course do this using FP.
>
> val wSpec = Window.partitionBy('price).orderBy(desc("price"))
> df2.filter('security > " ").select(dense_rank().over(wSpec).as("rank"),
> 'TIMECREATED, 'SECURITY, substring('PRICE,1,7)).filter('rank<=10).show
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 11 September 2016 at 07:15, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> DENSE_RANK will give you ordering and sequence within a particular column.
> This is Hive
>
>  var sqltext = """
>  | SELECT RANK, timecreated,security, price
>  |  FROM (
>  |SELECT timecreated,security, price,
>  |   DENSE_RANK() OVER (ORDER BY price DESC ) AS RANK
>  |  FROM test.prices
>  |   ) tmp
>  |  WHERE rank <= 10
>  | """
> sql(sqltext).collect.foreach(println)
>
> [1,2016-09-09 16:55:44,Esso,99.995]
> [1,2016-09-09 21:22:52,AVIVA,99.995]
> [1,2016-09-09 21:22:52,Barclays,99.995]
> [1,2016-09-09 21:24:28,JPM,99.995]
> [1,2016-09-09 21:30:38,Microsoft,99.995]
> [1,2016-09-09 21:31:12,UNILEVER,99.995]
> [2,2016-09-09 16:54:14,BP,99.99]
> [2,2016-09-09 16:54:36,Tate & Lyle,99.99]
> [2,2016-09-09 16:56:28,EASYJET,99.99]
> [2,2016-09-09 16:59:28,IBM,99.99]
> [2,2016-09-09 20:16:10,EXPERIAN,99.99]
> [2,2016-09-09 22:25:20,Microsoft,99.99]
> [2,2016-09-09 22:53:49,Tate & Lyle,99.99]
> [3,2016-09-09 15:31:06,UNILEVER,99.985]
>
> HTH
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 11 September 2016 at 04:32, Kevin Burton <bur...@spinn3r.com> wrote:
>
> Looks like you can do it with dense_rank functions.
>
> https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
>
> I set up some basic records and it seems like it did the right thing.
>
> Now time to throw 50TB and 100 spark nodes at this problem and see what happens :)

Re: Re: Selecting the top 100 records per group by?

2016-09-12 Thread luohui20001
Hi Kevin,

A window function is what you need, like below:

val hivetable = hc.sql("select * from house_sale_pv_location")
val overLocation = Window.partitionBy(hivetable.col("lp_location_id"))
val sortedDF = hivetable.withColumn("rowNumber", row_number().over(overLocation)).filter("rowNumber<=50")

Here I add a column named rowNumber, partition the data by location, and keep the top 50 rows per partition.
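Note that without an orderBy on the window, row_number assigns an arbitrary order within each partition, so "top 50" is not well defined. A minimal sketch of the same idea with an ordering added (pv is a hypothetical measure to rank by, not a column from the original message):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, row_number}

// Rank rows within each location by the measure, then keep the first 50 per location
val overLocationOrdered = Window.partitionBy(hivetable.col("lp_location_id")).orderBy(desc("pv"))
val top50 = hivetable.withColumn("rowNumber", row_number().over(overLocationOrdered)).filter("rowNumber <= 50")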




 

Thanks
Best regards!
San.Luo

- Original Message -
From: Mich Talebzadeh <mich.talebza...@gmail.com>
To: "user @spark" <user@spark.apache.org>
Subject: Re: Selecting the top 100 records per group by?
Date: 11 September 2016, 22:20

You can of course do this using FP.
val wSpec = Window.partitionBy('price).orderBy(desc("price"))
df2.filter('security > " ").select(dense_rank().over(wSpec).as("rank"), 'TIMECREATED, 'SECURITY, substring('PRICE,1,7)).filter('rank<=10).show


HTH



Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction
of data or any other property which may arise from relying on this email's 
technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.  



On 11 September 2016 at 07:15, Mich Talebzadeh <mich.talebza...@gmail.com> 
wrote:
DENSE_RANK will give you ordering and sequence within a particular column. This 
is Hive 
 var sqltext = """
 | SELECT RANK, timecreated,security, price
 |  FROM (
 |SELECT timecreated,security, price,
 |   DENSE_RANK() OVER (ORDER BY price DESC ) AS RANK
 |  FROM test.prices
 |   ) tmp
 |  WHERE rank <= 10
 | """
sql(sqltext).collect.foreach(println)

[1,2016-09-09 16:55:44,Esso,99.995]
[1,2016-09-09 21:22:52,AVIVA,99.995]
[1,2016-09-09 21:22:52,Barclays,99.995]
[1,2016-09-09 21:24:28,JPM,99.995]
[1,2016-09-09 21:30:38,Microsoft,99.995]
[1,2016-09-09 21:31:12,UNILEVER,99.995]
[2,2016-09-09 16:54:14,BP,99.99]
[2,2016-09-09 16:54:36,Tate & Lyle,99.99]
[2,2016-09-09 16:56:28,EASYJET,99.99]
[2,2016-09-09 16:59:28,IBM,99.99]
[2,2016-09-09 20:16:10,EXPERIAN,99.99]
[2,2016-09-09 22:25:20,Microsoft,99.99]
[2,2016-09-09 22:53:49,Tate & Lyle,99.99]
[3,2016-09-09 15:31:06,UNILEVER,99.985]

HTH







Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction
of data or any other property which may arise from relying on this email's 
technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.  



On 11 September 2016 at 04:32, Kevin Burton <bur...@spinn3r.com> wrote:
Looks like you can do it with dense_rank functions.
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

I set up some basic records and it seems like it did the right thing.
Now time to throw 50TB and 100 spark nodes at this problem and see what happens 
:)
On Sat, Sep 10, 2016 at 7:42 PM, Kevin Burton <bur...@spinn3r.com> wrote:
Ah.. might actually. I'll have to mess around with that.
On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley <kmhig...@gmail.com> wrote:
Would `topByKey` help?

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42

Best,
Karl
On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton <bur...@spinn3r.com> wrote:
I'm trying to figure out a way to group by and return the top 100 records in 
that group.
Something like:
SELECT TOP(100, user_id) FROM posts GROUP BY user_id;
But I can't really figure out the best way to do this... 
There is a FIRST and LAST aggregate function but this only returns one column.
I could do something like:
SELECT * FROM posts WHERE user_id IN ( /* select top users here */ ) LIMIT 100;
But that limit is applied for ALL the records. Not each individual user.  
The only other thing I can think of is to do a manual map reduce and then have 
the reducer only return the top 100 each time... 
Would LOVE some advice here... 
-- 
We’re hiring if you know of any awesome Java Devops or Linux Operations 
Engineers!

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com… or check out my Google+ profile




-- 
We’re hiring if you know of any awesome Java Devops or Linux Operations 
Engineers!

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com… or check out my Google+ profile




-- 
We’re hiring if you know of any awesome Java Devops or Linux Operations Engineers!

Re: Selecting the top 100 records per group by?

2016-09-11 Thread Mich Talebzadeh
You can of course do this using FP.

val wSpec = Window.partitionBy('price).orderBy(desc("price"))
df2.filter('security > " ").select(dense_rank().over(wSpec).as("rank"), 'TIMECREATED, 'SECURITY, substring('PRICE,1,7)).filter('rank<=10).show
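
For the per-group form of the original question, the same pattern applies with the window partitioned by the grouping column rather than the measure. A sketch under assumed names (a posts DataFrame with user_id and created columns, which are not from this thread):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

// Top 100 most recent posts per user_id
val byUser = Window.partitionBy(col("user_id")).orderBy(desc("created"))
posts.withColumn("rank", row_number().over(byUser))
  .filter(col("rank") <= 100)
  .show()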


HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 September 2016 at 07:15, Mich Talebzadeh 
wrote:

> DENSE_RANK will give you ordering and sequence within a particular column.
> This is Hive
>
>  var sqltext = """
>  | SELECT RANK, timecreated,security, price
>  |  FROM (
>  |SELECT timecreated,security, price,
>  |   DENSE_RANK() OVER (ORDER BY price DESC ) AS RANK
>  |  FROM test.prices
>  |   ) tmp
>  |  WHERE rank <= 10
>  | """
> sql(sqltext).collect.foreach(println)
>
> [1,2016-09-09 16:55:44,Esso,99.995]
> [1,2016-09-09 21:22:52,AVIVA,99.995]
> [1,2016-09-09 21:22:52,Barclays,99.995]
> [1,2016-09-09 21:24:28,JPM,99.995]
> [1,2016-09-09 21:30:38,Microsoft,99.995]
> [1,2016-09-09 21:31:12,UNILEVER,99.995]
> [2,2016-09-09 16:54:14,BP,99.99]
> [2,2016-09-09 16:54:36,Tate & Lyle,99.99]
> [2,2016-09-09 16:56:28,EASYJET,99.99]
> [2,2016-09-09 16:59:28,IBM,99.99]
> [2,2016-09-09 20:16:10,EXPERIAN,99.99]
> [2,2016-09-09 22:25:20,Microsoft,99.99]
> [2,2016-09-09 22:53:49,Tate & Lyle,99.99]
> [3,2016-09-09 15:31:06,UNILEVER,99.985]
>
> HTH
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 11 September 2016 at 04:32, Kevin Burton  wrote:
>
>> Looks like you can do it with dense_rank functions.
>>
>> https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
>>
>> I set up some basic records and it seems like it did the right thing.
>>
>> Now time to throw 50TB and 100 spark nodes at this problem and see what
>> happens :)
>>
>> On Sat, Sep 10, 2016 at 7:42 PM, Kevin Burton  wrote:
>>
>>> Ah.. might actually. I'll have to mess around with that.
>>>
>>> On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley  wrote:
>>>
 Would `topByKey` help?

 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42

 Best,
 Karl

 On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton 
 wrote:

> I'm trying to figure out a way to group by and return the top 100
> records in that group.
>
> Something like:
>
> SELECT TOP(100, user_id) FROM posts GROUP BY user_id;
>
> But I can't really figure out the best way to do this...
>
> There is a FIRST and LAST aggregate function but this only returns one
> column.
>
> I could do something like:
>
> SELECT * FROM posts WHERE user_id IN ( /* select top users here */ )
> LIMIT 100;
>
> But that limit is applied for ALL the records. Not each individual
> user.
>
> The only other thing I can think of is to do a manual map reduce and
> then have the reducer only return the top 100 each time...
>
> Would LOVE some advice here...
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux
> Operations Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>
>>>
>>>
>>> --
>>>
>>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>>> Engineers!
>>>
>>> Founder/CEO Spinn3r.com
>>> Location: *San Francisco, CA*
>>> blog: http://burtonator.wordpress.com
>>> … or check out my Google+ profile
>>> 
>>>
>>>
>>
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>

RE: Selecting the top 100 records per group by?

2016-09-11 Thread Mendelson, Assaf

You can also create a custom aggregation function. It might provide better 
performance than dense_rank.

Consider the following example to collect everything as a list (with the imports this UDAF needs):

import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class CollectListFunction[T](val colType: DataType) extends UserDefinedAggregateFunction {

  def inputSchema: StructType =
new StructType().add("inputCol", colType)

  def bufferSchema: StructType =
new StructType().add("outputCol", ArrayType(colType))

  def dataType: DataType = ArrayType(colType)

  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer.update(0, new mutable.ArrayBuffer[T])
  }

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val list = buffer.getSeq[T](0)
if (!input.isNullAt(0)) {
  val sales = input.getAs[T](0)
  buffer.update(0, list:+sales)
}
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1.update(0, buffer1.getSeq[T](0) ++ buffer2.getSeq[T](0))
  }

  def evaluate(buffer: Row): Any = {
buffer.getSeq[T](0)
  }
}

All you would need to do is modify it to contain only the top 100…
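
A sketch of that modification, keeping the buffer truncated to the largest n values on every update and merge (the class name, the n parameter and the implicit ordering are assumptions added here for illustration, not part of the original mail):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class CollectTopNFunction[T](val colType: DataType, n: Int)(implicit ord: Ordering[T])
    extends UserDefinedAggregateFunction {

  def inputSchema: StructType = new StructType().add("inputCol", colType)
  def bufferSchema: StructType = new StructType().add("outputCol", ArrayType(colType))
  def dataType: DataType = ArrayType(colType)
  def deterministic: Boolean = true

  // Keep only the n largest values seen so far
  private def topN(values: Seq[T]): Seq[T] = values.sorted(ord.reverse).take(n)

  def initialize(buffer: MutableAggregationBuffer): Unit = buffer.update(0, Seq.empty[T])

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer.update(0, topN(buffer.getSeq[T](0) :+ input.getAs[T](0)))

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1.update(0, topN(buffer1.getSeq[T](0) ++ buffer2.getSeq[T](0)))

  def evaluate(buffer: Row): Any = buffer.getSeq[T](0)
}

// Hypothetical usage: the 100 highest prices per security, as an array column
// val top100 = new CollectTopNFunction[Double](DoubleType, 100)
// df.groupBy("security").agg(top100(df("price")))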

From: burtonator2...@gmail.com [mailto:burtonator2...@gmail.com] On Behalf Of Kevin Burton
Sent: Sunday, September 11, 2016 6:33 AM
To: Karl Higley
Cc: user@spark.apache.org
Subject: Re: Selecting the top 100 records per group by?

Looks like you can do it with dense_rank functions.

https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

I set up some basic records and it seems like it did the right thing.

Now time to throw 50TB and 100 spark nodes at this problem and see what happens 
:)

On Sat, Sep 10, 2016 at 7:42 PM, Kevin Burton <bur...@spinn3r.com> wrote:
Ah.. might actually. I'll have to mess around with that.

On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley <kmhig...@gmail.com> wrote:
Would `topByKey` help?

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42

Best,
Karl

On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton <bur...@spinn3r.com> wrote:
I'm trying to figure out a way to group by and return the top 100 records in 
that group.

Something like:

SELECT TOP(100, user_id) FROM posts GROUP BY user_id;

But I can't really figure out the best way to do this...

There is a FIRST and LAST aggregate function but this only returns one column.

I could do something like:

SELECT * FROM posts WHERE user_id IN ( /* select top users here */ ) LIMIT 100;

But that limit is applied for ALL the records. Not each individual user.

The only other thing I can think of is to do a manual map reduce and then have 
the reducer only return the top 100 each time...

Would LOVE some advice here...

--
We’re hiring if you know of any awesome Java Devops or Linux Operations 
Engineers!

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ profile <https://plus.google.com/102718274791889610666/posts>




--
We’re hiring if you know of any awesome Java Devops or Linux Operations 
Engineers!

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ profile <https://plus.google.com/102718274791889610666/posts>




--
We’re hiring if you know of any awesome Java Devops or Linux Operations 
Engineers!

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ profile <https://plus.google.com/102718274791889610666/posts>



Re: Selecting the top 100 records per group by?

2016-09-11 Thread Mich Talebzadeh
DENSE_RANK will give you ordering and sequence within a particular column.
This example is in Hive SQL:

 var sqltext = """
 | SELECT RANK, timecreated,security, price
 |  FROM (
 |SELECT timecreated,security, price,
 |   DENSE_RANK() OVER (ORDER BY price DESC ) AS RANK
 |  FROM test.prices
 |   ) tmp
 |  WHERE rank <= 10
 | """
sql(sqltext).collect.foreach(println)

[1,2016-09-09 16:55:44,Esso,99.995]
[1,2016-09-09 21:22:52,AVIVA,99.995]
[1,2016-09-09 21:22:52,Barclays,99.995]
[1,2016-09-09 21:24:28,JPM,99.995]
[1,2016-09-09 21:30:38,Microsoft,99.995]
[1,2016-09-09 21:31:12,UNILEVER,99.995]
[2,2016-09-09 16:54:14,BP,99.99]
[2,2016-09-09 16:54:36,Tate & Lyle,99.99]
[2,2016-09-09 16:56:28,EASYJET,99.99]
[2,2016-09-09 16:59:28,IBM,99.99]
[2,2016-09-09 20:16:10,EXPERIAN,99.99]
[2,2016-09-09 22:25:20,Microsoft,99.99]
[2,2016-09-09 22:53:49,Tate & Lyle,99.99]
[3,2016-09-09 15:31:06,UNILEVER,99.985]
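
Since the original question asked for the top N per group, the same query needs a PARTITION BY on the grouping column. A sketch against the same test.prices table, treating security as the group purely for illustration:

val sqltext2 = """
SELECT timecreated, security, price
  FROM (
        SELECT timecreated, security, price,
               DENSE_RANK() OVER (PARTITION BY security ORDER BY price DESC) AS rank
          FROM test.prices
       ) tmp
 WHERE rank <= 10
"""
sql(sqltext2).collect.foreach(println)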

HTH







Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 September 2016 at 04:32, Kevin Burton  wrote:

> Looks like you can do it with dense_rank functions.
>
> https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
>
> I set up some basic records and it seems like it did the right thing.
>
> Now time to throw 50TB and 100 spark nodes at this problem and see what
> happens :)
>
> On Sat, Sep 10, 2016 at 7:42 PM, Kevin Burton  wrote:
>
>> Ah.. might actually. I'll have to mess around with that.
>>
>> On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley  wrote:
>>
>>> Would `topByKey` help?
>>>
>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42
>>>
>>> Best,
>>> Karl
>>>
>>> On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton  wrote:
>>>
 I'm trying to figure out a way to group by and return the top 100
 records in that group.

 Something like:

 SELECT TOP(100, user_id) FROM posts GROUP BY user_id;

 But I can't really figure out the best way to do this...

 There is a FIRST and LAST aggregate function but this only returns one
 column.

 I could do something like:

 SELECT * FROM posts WHERE user_id IN ( /* select top users here */ )
 LIMIT 100;

 But that limit is applied for ALL the records. Not each individual
 user.

 The only other thing I can think of is to do a manual map reduce and
 then have the reducer only return the top 100 each time...

 Would LOVE some advice here...

 --

 We’re hiring if you know of any awesome Java Devops or Linux Operations
 Engineers!

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 


>>
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> 
>>
>>
>
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>


Re: Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
Looks like you can do it with dense_rank functions.

https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

I set up some basic records and it seems like it did the right thing.

Now time to throw 50TB and 100 spark nodes at this problem and see what
happens :)

On Sat, Sep 10, 2016 at 7:42 PM, Kevin Burton  wrote:

> Ah.. might actually. I'll have to mess around with that.
>
> On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley  wrote:
>
>> Would `topByKey` help?
>>
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42
>>
>> Best,
>> Karl
>>
>> On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton  wrote:
>>
>>> I'm trying to figure out a way to group by and return the top 100
>>> records in that group.
>>>
>>> Something like:
>>>
>>> SELECT TOP(100, user_id) FROM posts GROUP BY user_id;
>>>
>>> But I can't really figure out the best way to do this...
>>>
>>> There is a FIRST and LAST aggregate function but this only returns one
>>> column.
>>>
>>> I could do something like:
>>>
>>> SELECT * FROM posts WHERE user_id IN ( /* select top users here */ )
>>> LIMIT 100;
>>>
>>> But that limit is applied for ALL the records. Not each individual user.
>>>
>>>
>>> The only other thing I can think of is to do a manual map reduce and
>>> then have the reducer only return the top 100 each time...
>>>
>>> Would LOVE some advice here...
>>>
>>> --
>>>
>>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>>> Engineers!
>>>
>>> Founder/CEO Spinn3r.com
>>> Location: *San Francisco, CA*
>>> blog: http://burtonator.wordpress.com
>>> … or check out my Google+ profile
>>> 
>>>
>>>
>
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>


-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile



Re: Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
Ah.. might actually. I'll have to mess around with that.

On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley  wrote:

> Would `topByKey` help?
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42
>
> Best,
> Karl
>
> On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton  wrote:
>
>> I'm trying to figure out a way to group by and return the top 100 records
>> in that group.
>>
>> Something like:
>>
>> SELECT TOP(100, user_id) FROM posts GROUP BY user_id;
>>
>> But I can't really figure out the best way to do this...
>>
>> There is a FIRST and LAST aggregate function but this only returns one
>> column.
>>
>> I could do something like:
>>
>> SELECT * FROM posts WHERE user_id IN ( /* select top users here */ )
>> LIMIT 100;
>>
>> But that limit is applied for ALL the records. Not each individual user.
>>
>> The only other thing I can think of is to do a manual map reduce and then
>> have the reducer only return the top 100 each time...
>>
>> Would LOVE some advice here...
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> 
>>
>>


-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile



Re: Selecting the top 100 records per group by?

2016-09-10 Thread Karl Higley
Would `topByKey` help?

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42
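
A minimal sketch of its use on the RDD API (the pair RDD and the cut-off of 100 are assumptions; topByKey takes the count and an implicit Ordering on the value type):

import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
import org.apache.spark.rdd.RDD

// Hypothetical (user_id, score) pairs; keep the 100 largest scores per user
val pairs: RDD[(Long, Double)] = sc.parallelize(Seq((1L, 0.3), (1L, 0.9), (2L, 0.5)))
val top100: RDD[(Long, Array[Double])] = pairs.topByKey(100)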

Best,
Karl

On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton  wrote:

> I'm trying to figure out a way to group by and return the top 100 records
> in that group.
>
> Something like:
>
> SELECT TOP(100, user_id) FROM posts GROUP BY user_id;
>
> But I can't really figure out the best way to do this...
>
> There is a FIRST and LAST aggregate function but this only returns one
> column.
>
> I could do something like:
>
> SELECT * FROM posts WHERE user_id IN ( /* select top users here */ ) LIMIT
> 100;
>
> But that limit is applied for ALL the records. Not each individual user.
>
> The only other thing I can think of is to do a manual map reduce and then
> have the reducer only return the top 100 each time...
>
> Would LOVE some advice here...
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>