Re: DataFrame First method is resulting different results in each iteration

2016-02-04 Thread Ali Tajeldin EDU
Hi Satish,
  Take a look at the smvTopNRecs() function in the SMV package.  It does 
exactly what you are looking for.  It might be overkill to bring in all of SMV 
for just one function but you will also get a lot more than just DF helper 
functions (modular views, higher level graphs, dynamic loading of modules 
(coming soon), data/code sync). Ok, end of SMV plug :-)

http://tresamigossd.github.io/SMV/scaladocs/index.html#org.tresamigos.smv.SmvGroupedDataFunc
 (See SmvTopNRecs function at the end).
https://github.com/TresAmigosSD/SMV : SMV github page

For your specific example,
emp_df.smvGroupBy("DeptNo").smvTopNRecs(1, $"Sal".desc)

Two things to note:
1. Use "emp_df" and not the sorted "ordrd_emp_df" as the sort will be performed 
by smvTopNRecs internally.
2. Must use "smvGroupBy" instead of normal "groupBy" method on DataFrame as the 
result of standard "groupBy" hides the original DF and grouping column :-(

--
Ali 


Re: DataFrame First method is resulting different results in each iteration

2016-02-03 Thread Hemant Bhanawat
Ahh.. missed that.

I see that you have used the "first" function. 'first' returns the first row
it finds for each group. On a single executor it may return the right
result. But on multiple executors it will return the first row from any one
of the executors, which may not be the first row once the results are
combined.
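
To make the point above concrete, here is a small model in plain Scala (no Spark involved). It is only a sketch under stated assumptions: the split of the sorted rows into three partitions is invented for illustration, and the names Emp, FirstPerGroupModel, firstPerDept and topPaidPerDept are hypothetical. It shows that "keep the first row seen per DeptNo" depends on the order in which the partitions are visited, while picking the maximum Sal does not.

```scala
// Simplified model, not Spark code. Each inner Vector stands for one
// partition of the sorted data; this partition layout is an assumption.
final case class Emp(deptNo: Int, sal: Int, empId: String)

object FirstPerGroupModel {
  val partitions: Vector[Vector[Emp]] = Vector(
    Vector(Emp(10, 150, "005")),
    Vector(Emp(10, 130, "003"), Emp(10, 100, "001"), Emp(20, 140, "004")),
    Vector(Emp(20, 120, "002"))
  )

  // "first" per group: keep whichever row of a DeptNo is seen first while
  // visiting the partitions in the given arrival order.
  def firstPerDept(arrivalOrder: Seq[Int]): Map[Int, String] =
    arrivalOrder
      .flatMap(i => partitions(i))
      .foldLeft(Map.empty[Int, String]) { (acc, e) =>
        if (acc.contains(e.deptNo)) acc else acc + (e.deptNo -> e.empId)
      }

  // Order-independent alternative: take the row with the highest Sal.
  def topPaidPerDept: Map[Int, String] =
    partitions.flatten
      .groupBy(_.deptNo)
      .map { case (dept, emps) => dept -> emps.maxBy(_.sal).empId }
}
```

In this model, firstPerDept(Seq(0, 1, 2)), firstPerDept(Seq(1, 0, 2)) and firstPerDept(Seq(2, 1, 0)) give three different answers, whereas topPaidPerDept gives the same answer however the rows are laid out.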

I believe, if you change your query like this, you will get the right
results:

ordrd_emp_df.groupBy("DeptNo").
agg($"DeptNo", max("Sal").as("HighestSal"))

But as you can see, this gives you the highest Sal per DeptNo and not the
EmpId with the highest Sal. To get the EmpId with the highest Sal, you will
have to change your query to add filters or subqueries. See the following
thread:

http://stackoverflow.com/questions/6841605/get-top-1-row-of-each-group
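
One common way to express "top 1 row of each group" on a DataFrame is a window function. The following is a minimal sketch and not something tested in this thread: it assumes Spark 2.x or later (SparkSession; on older releases window functions may need a HiveContext), and the object and method names (TopPaidPerDept, sampleData, topPaidPerDept) and the helper column "rn" are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object TopPaidPerDept {
  // Sample rows copied from the question in this thread.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("001", 100, 10),
      ("002", 120, 20),
      ("003", 130, 10),
      ("004", 140, 20),
      ("005", 150, 10)
    ).toDF("EmpId", "Sal", "DeptNo")
  }

  // Number the rows of each DeptNo by descending Sal and keep row 1.
  // The ordering is part of the window itself, so no prior orderBy is needed.
  def topPaidPerDept(empDf: DataFrame): DataFrame = {
    val bySalDesc = Window.partitionBy(col("DeptNo")).orderBy(col("Sal").desc)
    empDf
      .withColumn("rn", row_number().over(bySalDesc))
      .where(col("rn") === 1)
      .select("DeptNo", "EmpId", "Sal")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("top-paid-per-dept")
      .getOrCreate()
    topPaidPerDept(sampleData(spark)).show()
    spark.stop()
  }
}
```

If two employees in a department share the highest Sal, row_number keeps only one of them and which one is not defined, so add a tie-breaker column to the window ordering if that matters.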

Hemant Bhanawat
SnappyData (http://snappydata.io/)


Re: DataFrame First method is resulting different results in each iteration

2016-02-03 Thread satish chandra j
Hi Hemant,
My DataFrame "ordrd_emp_df" holds the data in sorted order, as I applied
orderBy in the first step.
I also tried calling the "orderBy" method right before "groupBy", but I am
still getting different results in each iteration.

Regards,
Satish Chandra


Re: DataFrame First method is resulting different results in each iteration

2016-02-03 Thread Hemant Bhanawat
Missing order by?

Hemant Bhanawat
SnappyData (http://snappydata.io/)


DataFrame First method is resulting different results in each iteration

2016-02-03 Thread satish chandra j
Hi All,
I have data in a DataFrame, emp_df, as shown below:

EmpId   Sal   DeptNo
001   100   10
002   120   20
003   130   10
004   140   20
005   150   10

ordrd_emp_df = emp_df.orderBy($"DeptNo",$"Sal".desc)

which results in the following:

DeptNo  Sal   EmpId
10      150   005
10      130   003
10      100   001
20      140   004
20      120   002

Now I want to pick the highest-paid EmpId of each DeptNo, so I applied the
"first" aggregate function as below:

ordrd_emp_df.groupBy("DeptNo").agg($"DeptNo",first("EmpId").as("TopSal")).select($"DeptNo",$"TopSal")

Expected output is:

DeptNo  TopSal
10      005
20      004

But my output varies on each iteration, for example:

First iteration results:

Dept  TopSal
10    003
20    004

Second iteration results:

Dept  TopSal
10    005
20    004

Third iteration results:

Dept  TopSal
10    003
20    002

I am not sure why the output varies on each iteration, as there is no change
in the code or in the values in the DataFrame.

Please let me know if you have any inputs on this.

Regards,
Satish Chandra J