Hi Zhan,

How would this be achieved? Should the data be partitioned by name in this case?
Thank you!
Best,
Wenlei

On Thu, Apr 30, 2015 at 7:55 PM, Zhan Zhang <zzh...@hortonworks.com> wrote:

> One optimization is to reduce the shuffle by first aggregating locally
> (keeping only the max for each name), and then calling reduceByKey.
>
> Thanks.
>
> Zhan Zhang
>
> On Apr 24, 2015, at 10:03 PM, ayan guha <guha.a...@gmail.com> wrote:
>
> Here you go....
>
> t = [["A", 10, "A10"], ["A", 20, "A20"], ["A", 30, "A30"],
>      ["B", 15, "B15"], ["C", 10, "C10"], ["C", 20, "C200"]]
> TRDD = sc.parallelize(t).map(
>     lambda t: Row(name=str(t[0]), age=int(t[1]), other=str(t[2])))
> TDF = ssc.createDataFrame(TRDD)
> TDF.printSchema()
> TDF.registerTempTable("tab")
> JN = ssc.sql("select t.name, t.age, t.other from tab t inner join "
>              "(select name, max(age) age from tab group by name) t1 "
>              "on t.name = t1.name and t.age = t1.age")
> for i in JN.collect():
>     print i
>
> Result:
> Row(name=u'A', age=30, other=u'A30')
> Row(name=u'B', age=15, other=u'B15')
> Row(name=u'C', age=20, other=u'C200')
>
> On Sat, Apr 25, 2015 at 2:48 PM, Wenlei Xie <wenlei....@gmail.com> wrote:
>
>> Sure. A simple example of the data would be (there might be many other
>> columns):
>>
>> Name  Age  Other
>> A     10   A10
>> A     20   A20
>> A     30   A30
>> B     15   B15
>> C     10   C10
>> C     20   C20
>>
>> The desired output would be:
>>
>> Name  Age  Other
>> A     30   A30
>> B     15   B15
>> C     20   C20
>>
>> Thank you so much for the help!
>>
>> On Sat, Apr 25, 2015 at 12:41 AM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Can you give an example set of data and the desired output?
>>>
>>> On Sat, Apr 25, 2015 at 2:32 PM, Wenlei Xie <wenlei....@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I would like to answer the following customized aggregation query in
>>>> Spark SQL:
>>>> 1. Group the table by the value of Name.
>>>> 2. For each group, choose the tuple with the max value of Age (the
>>>>    ages are distinct for every name).
>>>>
>>>> I am wondering what's the best way to do it in Spark SQL. Should I
>>>> use a UDAF?
>>>> Previously I was doing something like the following in Spark:
>>>>
>>>> personRDD.map(t => (t.name, t))
>>>>   .reduceByKey((a, b) => if (a.age > b.age) a else b)
>>>>
>>>> Thank you!
>>>>
>>>> Best,
>>>> Wenlei
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>
>> --
>> Wenlei Xie (谢文磊)
>>
>> Ph.D. Candidate
>> Department of Computer Science
>> 456 Gates Hall, Cornell University
>> Ithaca, NY 14853, USA
>> Email: wenlei....@gmail.com
>
> --
> Best Regards,
> Ayan Guha

--
Wenlei Xie (谢文磊)

Ph.D. Candidate
Department of Computer Science
456 Gates Hall, Cornell University
Ithaca, NY 14853, USA
Email: wenlei....@gmail.com
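
[Editor's note] The two-phase scheme Zhan describes (keep only the per-name max locally, then merge across partitions) is exactly what Spark's reduceByKey already does via its map-side combine, so no explicit partitioning by name is needed. As a minimal sketch of the semantics, here is that logic in plain Python, without a Spark cluster, using the sample data from the thread; the partition split into two lists is an arbitrary assumption for illustration:

```python
from functools import reduce

# Sample data from the thread: (name, age, other).
rows = [("A", 10, "A10"), ("A", 20, "A20"), ("A", 30, "A30"),
        ("B", 15, "B15"), ("C", 10, "C10"), ("C", 20, "C200")]

# Pretend the data lives in two partitions (arbitrary split for illustration).
partitions = [rows[:3], rows[3:]]

def local_max(partition):
    """Phase 1 (map-side combine): keep only the max-age row per name
    within one partition, before anything is shuffled."""
    best = {}
    for name, age, other in partition:
        if name not in best or age > best[name][0]:
            best[name] = (age, other)
    return best

def merge(a, b):
    """Phase 2 (shuffle-side reduce): merge the per-partition maxima."""
    out = dict(a)
    for name, v in b.items():
        if name not in out or v[0] > out[name][0]:
            out[name] = v
    return out

result = reduce(merge, (local_max(p) for p in partitions), {})
print(sorted(result.items()))
# [('A', (30, 'A30')), ('B', (15, 'B15')), ('C', (20, 'C200'))]
```

Because only one (age, other) pair per name per partition crosses the simulated "shuffle", the network cost scales with the number of distinct names, not the number of rows, which is the optimization being suggested.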