Hi Zhan,

How would this be achieved? Should the data be partitioned by name in this
case?

Thank you!

Best,
Wenlei

On Thu, Apr 30, 2015 at 7:55 PM, Zhan Zhang <zzh...@hortonworks.com> wrote:

>  One optimization is to reduce the shuffle by first aggregating locally
> (keeping only the max for each name), and then using reduceByKey.
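>
>  A minimal sketch of that local-aggregation idea in PySpark (the names
> here are illustrative, not from the thread): mapPartitions keeps at most
> one row per name in each partition before the shuffle, and reduceByKey
> resolves the rest:
>
>      from pyspark import SparkContext
>
>      sc = SparkContext(appName="max-age-per-name")  # assumes a standalone script
>      rows = sc.parallelize([("A", 10, "A10"), ("A", 30, "A30"),
>                             ("B", 15, "B15"), ("C", 20, "C20")])
>
>      def local_max(partition):
>          # Keep only the highest-age row per name within this partition,
>          # so each partition ships at most one row per name into the shuffle.
>          best = {}
>          for row in partition:
>              name, age = row[0], row[1]
>              if name not in best or age > best[name][1]:
>                  best[name] = row
>          return iter(best.items())
>
>      result = (rows.mapPartitions(local_max)
>                    .reduceByKey(lambda a, b: a if a[1] > b[1] else b))
>      print(result.collect())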
>
>  Thanks.
>
>  Zhan Zhang
>
>  On Apr 24, 2015, at 10:03 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>  Here you go....
>
>  from pyspark.sql import Row
>
>  t = [["A",10,"A10"], ["A",20,"A20"], ["A",30,"A30"],
>       ["B",15,"B15"], ["C",10,"C10"], ["C",20,"C200"]]
>  TRDD = sc.parallelize(t).map(
>      lambda r: Row(name=str(r[0]), age=int(r[1]), other=str(r[2])))
>  TDF = ssc.createDataFrame(TRDD)  # ssc is a SQLContext
>  TDF.printSchema()  # printSchema() prints directly; no need to wrap it in print
>  TDF.registerTempTable("tab")
>  JN = ssc.sql("""
>      select t.name, t.age, t.other
>      from tab t
>      inner join (select name, max(age) age from tab group by name) t1
>        on t.name = t1.name and t.age = t1.age""")
>  for i in JN.collect():
>      print i
>
>  Result:
>  Row(name=u'A', age=30, other=u'A30')
>  Row(name=u'B', age=15, other=u'B15')
>  Row(name=u'C', age=20, other=u'C200')
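>
>  The same result can also be sketched with the DataFrame API instead of
> the SQL string (same aggregate-then-join-back shape; the renames are only
> to keep the join condition unambiguous, and this assumes the TDF and ssc
> defined above):
>
>      from pyspark.sql import functions as F
>
>      # Max age per name, then join back to recover the other columns.
>      agg = (TDF.groupBy("name").agg(F.max("age").alias("max_age"))
>                .withColumnRenamed("name", "max_name"))
>      JN2 = (TDF.join(agg, (TDF.name == agg.max_name) & (TDF.age == agg.max_age))
>                .select("name", "age", "other"))
>      for i in JN2.collect():
>          print i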
>
> On Sat, Apr 25, 2015 at 2:48 PM, Wenlei Xie <wenlei....@gmail.com> wrote:
>
>> Sure. A simple example of the data would be (there might be many other
>> columns):
>>
>>  Name    Age    Other
>>  A        10    A10
>>  A        20    A20
>>  A        30    A30
>>  B        15    B15
>>  C        10    C10
>>  C        20    C20
>>
>>  The desired output would be:
>>
>>  Name    Age    Other
>>  A        30    A30
>>  B        15    B15
>>  C        20    C20
>>
>>  Thank you so much for the help!
>>
>> On Sat, Apr 25, 2015 at 12:41 AM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Can you give an example set of data and the desired output?
>>>
>>> On Sat, Apr 25, 2015 at 2:32 PM, Wenlei Xie <wenlei....@gmail.com>
>>> wrote:
>>>
>>>>  Hi,
>>>>
>>>>  I would like to run the following customized aggregation query on
>>>> Spark SQL:
>>>> 1. Group the table by the value of Name.
>>>> 2. For each group, choose the tuple with the max value of Age (the ages
>>>> are distinct for every name).
>>>>
>>>>  I am wondering what the best way to do this on Spark SQL is. Should I
>>>> use a UDAF? Previously I was doing something like the following on Spark:
>>>>
>>>>  personRDD.map(t => (t.name, t))
>>>>     .reduceByKey((a, b) => if (a.age > b.age) a else b)
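>>>>
>>>>  (For anyone following along, a self-contained PySpark sketch of the
>>>> same approach, with made-up rows since the Person class isn't shown;
>>>> this reuses the shell's SparkContext sc:)
>>>>
>>>>      from pyspark.sql import Row
>>>>
>>>>      people = sc.parallelize([
>>>>          Row(name="A", age=10, other="A10"),
>>>>          Row(name="A", age=30, other="A30"),
>>>>          Row(name="B", age=15, other="B15"),
>>>>      ])
>>>>      # Key by name, then keep the row with the larger age per key.
>>>>      top = (people.map(lambda p: (p.name, p))
>>>>                   .reduceByKey(lambda a, b: a if a.age > b.age else b))
>>>>      print(top.values().collect())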
>>>>
>>>>  Thank you!
>>>>
>>>>  Best,
>>>> Wenlei
>>>>
>>>
>>>
>>>
>>>  --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>>
>>   --
>>  Wenlei Xie (谢文磊)
>>
>> Ph.D. Candidate
>> Department of Computer Science
>> 456 Gates Hall, Cornell University
>> Ithaca, NY 14853, USA
>> Email: wenlei....@gmail.com
>>
>
>
>
>  --
> Best Regards,
> Ayan Guha
>
>
>


-- 
Wenlei Xie (谢文磊)

Ph.D. Candidate
Department of Computer Science
456 Gates Hall, Cornell University
Ithaca, NY 14853, USA
Email: wenlei....@gmail.com
