Re: DataFrame First method is resulting different results in each iteration
Hi Satish,

Take a look at the smvTopNRecs() function in the SMV package. It does exactly what you are looking for. It might be overkill to bring in all of SMV for just one function, but you would also get a lot more than DataFrame helper functions: modular views, higher-level graphs, dynamic loading of modules (coming soon), and data/code sync. OK, end of SMV plug :-)

Scaladoc (see the SmvTopNRecs function at the end):
http://tresamigossd.github.io/SMV/scaladocs/index.html#org.tresamigos.smv.SmvGroupedDataFunc

SMV GitHub page:
https://github.com/TresAmigosSD/SMV

For your specific example:

    emp_df.smvGroupBy("DeptNo").smvTopNRecs(1, $"Sal".desc)

Two things to note:

1. Use "emp_df" and not the sorted "ordrd_emp_df", as the sort will be performed by smvTopNRecs internally.
2. You must use "smvGroupBy" instead of the normal "groupBy" method on DataFrame, as the result of the standard "groupBy" hides the original DF and the grouping column :-(

-- Ali

On Feb 3, 2016, at 9:08 PM, Hemant Bhanawat wrote:
> But as you can see, you get the highest Sal and not the EmpId with highest
> Sal. For getting EmpId with highest Sal, you will have to change your query
> to add filters or add subqueries.
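For anyone who would rather not pull in SMV for this, a similar top-N-per-group result can probably be had with plain Spark window functions. The following is only a sketch, assuming Spark 2.x or later (SparkSession); the names TopNPerDept, topN and the helper column "rank" are made up for this illustration and are not part of SMV or of anything posted in this thread.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical helper, not from SMV: keeps the n highest-paid rows per DeptNo.
// Assumes Spark 2.x+ and the column names used in this thread.
object TopNPerDept {
  def topN(emp: DataFrame, n: Int): DataFrame = {
    // Rank rows inside each department by salary, highest first.
    // EmpId is added as a tie-breaker so equal salaries rank the same way on every run.
    val bySalary = Window
      .partitionBy(col("DeptNo"))
      .orderBy(col("Sal").desc, col("EmpId"))

    emp
      .withColumn("rank", row_number().over(bySalary))
      .where(col("rank") <= n)
      .drop("rank")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("top-n-per-dept")
      .getOrCreate()
    import spark.implicits._

    // The sample data from the original question.
    val empDf = Seq(
      ("001", 100, 10),
      ("002", 120, 20),
      ("003", 130, 10),
      ("004", 140, 20),
      ("005", 150, 10)
    ).toDF("EmpId", "Sal", "DeptNo")

    topN(empDf, 1).orderBy($"DeptNo").show()
    spark.stop()
  }
}
```

As with smvTopNRecs, the input here is the unsorted emp_df; the ordering that matters is the one declared on the window, so a preceding orderBy is not needed.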
Re: DataFrame First method is resulting different results in each iteration
Ahh.. missed that.

I see that you have used the "first" function. "first" returns the first row it finds. On a single executor it may return the right result. But on multiple executors it will return the first row from any one of the executors, which may not be the first row once the results are combined.

I believe that if you change your query like this, you will get the right results:

    ordrd_emp_df.groupBy("DeptNo").
      agg($"DeptNo", max("Sal").as("HighestSal"))

But as you can see, this gives you the highest Sal and not the EmpId with the highest Sal. To get the EmpId with the highest Sal, you will have to change your query to add filters or subqueries. See the following thread:

http://stackoverflow.com/questions/6841605/get-top-1-row-of-each-group

Hemant Bhanawat
SnappyData (http://snappydata.io/)

On Wed, Feb 3, 2016 at 4:33 PM, satish chandra j wrote:
> Hi Hemant,
> My dataframe "ordrd_emp_df" consists of data in order, as I have applied
> orderBy in the first step.
> I also tried having the "orderBy" method before "groupBy", but I am still
> getting different results in each iteration.
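One way to keep the max aggregate above and still recover the EmpId is to take the max over a (Sal, EmpId) struct, since Spark compares structs field by field. This is a different technique from the filter/subquery route in the linked thread, and it is only a sketch: it assumes Spark 2.x or later, and HighestPaidPerDept, highestPaid and the intermediate column "top" are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, struct}

// Hypothetical helper: one aggregation pass, no sort and no join.
// Assumes Spark 2.x+ and the column names used in this thread.
object HighestPaidPerDept {
  def highestPaid(emp: DataFrame): DataFrame =
    emp
      .groupBy(col("DeptNo"))
      // Structs are ordered by their first field, then the next, and so on,
      // so the max struct carries the EmpId of the highest Sal.
      // If two salaries tie, the larger EmpId wins.
      .agg(max(struct(col("Sal"), col("EmpId"))).as("top"))
      .select(
        col("DeptNo"),
        col("top.Sal").as("HighestSal"),
        col("top.EmpId").as("TopSal")
      )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("highest-paid-per-dept")
      .getOrCreate()
    import spark.implicits._

    // The sample data from the original question.
    val empDf = Seq(
      ("001", 100, 10),
      ("002", 120, 20),
      ("003", 130, 10),
      ("004", 140, 20),
      ("005", 150, 10)
    ).toDF("EmpId", "Sal", "DeptNo")

    highestPaid(empDf).orderBy($"DeptNo").show()
    spark.stop()
  }
}
```

Because max is insensitive to the order in which rows or partial results arrive, the result should not change from run to run, which is the property "first" lacks here.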
Re: DataFrame First method is resulting different results in each iteration
Hi Hemant,

My DataFrame "ordrd_emp_df" holds the data in order, as I applied orderBy in the first step. I also tried calling "orderBy" before "groupBy", and I still get different results in each iteration.

Regards,
Satish Chandra

On Wed, Feb 3, 2016 at 4:28 PM, Hemant Bhanawat wrote:
> Missing order by?
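A likely reason the earlier orderBy does not help: the sorted rows are still spread over several partitions, each partition reports its own "first" EmpId per department, and those partial results are then merged in whatever order they happen to arrive. The toy model below is plain Scala with no Spark involved; the partition layout and the arrival orders are invented purely to illustrate the effect and are not a description of Spark's internals.

```scala
// Toy model only: FirstAfterShuffle, the partition layout and the arrival
// orders are hypothetical and exist just to illustrate the effect.
object FirstAfterShuffle {
  final case class Emp(deptNo: Int, sal: Int, empId: String)

  // The sorted rows from the question, split across three imaginary partitions.
  val partitions: Vector[Vector[Emp]] = Vector(
    Vector(Emp(10, 150, "005")),
    Vector(Emp(10, 130, "003"), Emp(10, 100, "001"), Emp(20, 140, "004")),
    Vector(Emp(20, 120, "002"))
  )

  // Each partition reports the first EmpId it saw for every department.
  def partialFirst(rows: Vector[Emp]): Map[Int, String] =
    rows.foldLeft(Map.empty[Int, String]) { (acc, e) =>
      if (acc.contains(e.deptNo)) acc else acc + (e.deptNo -> e.empId)
    }

  // Partial results are merged in arrival order; the earliest arrival wins.
  def mergedFirst(arrivalOrder: Seq[Int]): Map[Int, String] =
    arrivalOrder
      .map(i => partialFirst(partitions(i)))
      .foldLeft(Map.empty[Int, String]) { (acc, partial) => partial ++ acc }

  def main(args: Array[String]): Unit = {
    println(mergedFirst(Seq(0, 1, 2))) // dept 10 -> 005, dept 20 -> 004
    println(mergedFirst(Seq(1, 0, 2))) // dept 10 -> 003, dept 20 -> 004
    println(mergedFirst(Seq(2, 1, 0))) // dept 10 -> 003, dept 20 -> 002
  }
}
```

With this made-up layout, the three arrival orders reproduce the three differing results reported in the original question, even though the rows inside each partition are sorted.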
Re: DataFrame First method is resulting different results in each iteration
Missing order by?

Hemant Bhanawat
SnappyData (http://snappydata.io/)

On Wed, Feb 3, 2016 at 3:45 PM, satish chandra j wrote:
> Now I want to pick highest paid EmpId of each DeptNo.,hence applied agg
> First method as below
>
> ordrd_emp_df.groupBy("DeptNo").agg($"DeptNo",first("EmpId").as("TopSal")).select($"DeptNo",$"TopSal")
DataFrame First method is resulting different results in each iteration
Hi All,

I have data in emp_df (a DataFrame) as shown below:

    EmpId  Sal  DeptNo
    001    100  10
    002    120  20
    003    130  10
    004    140  20
    005    150  10

    ordrd_emp_df = emp_df.orderBy($"DeptNo",$"Sal".desc)

which results in:

    DeptNo  Sal  EmpId
    10      150  005
    10      130  003
    10      100  001
    20      140  004
    20      120  002

Now I want to pick the highest-paid EmpId of each DeptNo, hence I applied the "first" aggregate as below:

    ordrd_emp_df.groupBy("DeptNo").agg($"DeptNo",first("EmpId").as("TopSal")).select($"DeptNo",$"TopSal")

Expected output:

    DeptNo  TopSal
    10      005
    20      004

But my output varies from one iteration to the next:

First iteration:

    DeptNo  TopSal
    10      003
    20      004

Second iteration:

    DeptNo  TopSal
    10      005
    20      004

Third iteration:

    DeptNo  TopSal
    10      003
    20      002

I am not sure why the output varies on each iteration, as there is no change in the code or in the values in the DataFrame.

Please let me know if you have any inputs on this.

Regards,
Satish Chandra J