Re: How to merge multiple rows

2018-08-22 Thread Patrick McCarthy
You didn't specify which API, but in pyspark you could do:

import pyspark.sql.functions as F

df.groupBy('ID').agg(F.sort_array(F.collect_set('DETAILS')).alias('DETAILS')).show()

+---+------------+
| ID|     DETAILS|
+---+------------+
|  1|[A1, A2, A3]|
|  3|        [B2]|
|  2|        [B1]|
+---+------------+


If you want to sort by PART, I think you'll need a UDF.
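To make the ordering requirement concrete, here is the intended logic as a plain-Python sketch (not Spark): sort each group by PART before joining DETAILS. The (ID, PART, DETAILS) rows and the "+" separator are taken from the question below.

```python
from itertools import groupby
from operator import itemgetter

# (ID, PART, DETAILS) rows from the question
rows = [(1, 1, "A1"), (1, 2, "A2"), (1, 3, "A3"),
        (2, 1, "B1"), (3, 1, "C1")]

# Sort by ID then PART so each group's DETAILS come out in PART order,
# then join them with "+".
rows.sort(key=itemgetter(0, 1))
merged = {
    group_id: "+".join(details for _, _, details in group)
    for group_id, group in groupby(rows, key=itemgetter(0))
}
# merged == {1: "A1+A2+A3", 2: "B1", 3: "C1"}
```

In Spark itself, one possible UDF-free route is to collect (PART, DETAILS) structs and apply sort_array over them, since structs compare field by field; but that depends on your Spark version, and the UDF mentioned above is the straightforward fallback.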

On Wed, Aug 22, 2018 at 4:12 PM, Jean Georges Perrin  wrote:

> How do you do it now?
>
> You could use a withColumn(“newDetails”, <concatenation of detail_1, detail_2, ...>)
>
> jg
>
>
> > On Aug 22, 2018, at 16:04, msbreuer  wrote:
> >
> > A dataframe with following contents is given:
> >
> > ID PART DETAILS
> >  1    1 A1
> >  1    2 A2
> >  1    3 A3
> >  2    1 B1
> >  3    1 C1
> >
> > Target format should be as following:
> >
> > ID DETAILS
> > 1 A1+A2+A3
> > 2 B1
> > 3 C1
> >
> > Note, the order of A1-3 is important.
> >
> > Currently I am using this alternative:
> >
> > ID DETAIL_1 DETAIL_2 DETAIL_3
> >  1 A1       A2       A3
> >  2 B1
> >  3 C1
> >
> > What would be the best method to do such a transformation on a large dataset?
> >
> >
> >
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >


Re: How to merge multiple rows

2018-08-22 Thread Jean Georges Perrin
How do you do it now? 

You could use a withColumn(“newDetails”, <concatenation of detail_1, detail_2, ...>)
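Sketching that idea in plain Python on the wide DETAIL_1..DETAIL_3 layout from the question (column names are from the question; the "+" separator and None for empty cells are assumptions):

```python
rows = [
    {"ID": 1, "DETAIL_1": "A1", "DETAIL_2": "A2", "DETAIL_3": "A3"},
    {"ID": 2, "DETAIL_1": "B1", "DETAIL_2": None, "DETAIL_3": None},
    {"ID": 3, "DETAIL_1": "C1", "DETAIL_2": None, "DETAIL_3": None},
]

# Build newDetails by joining the non-empty DETAIL_* cells with "+",
# mirroring what a withColumn("newDetails", ...) would compute per row.
for row in rows:
    parts = [row[k] for k in ("DETAIL_1", "DETAIL_2", "DETAIL_3") if row[k]]
    row["newDetails"] = "+".join(parts)
# rows[0]["newDetails"] == "A1+A2+A3"
```

In pyspark, concat_ws("+", ...) over the detail columns should give the same result, since concat_ws skips null columns.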

jg


> On Aug 22, 2018, at 16:04, msbreuer  wrote:
> 
> A dataframe with following contents is given:
> 
> ID PART DETAILS
>  1    1 A1
>  1    2 A2
>  1    3 A3
>  2    1 B1
>  3    1 C1
> 
> Target format should be as following:
> 
> ID DETAILS
> 1 A1+A2+A3
> 2 B1
> 3 C1
> 
> Note, the order of A1-3 is important.
> 
> Currently I am using this alternative:
> 
> ID DETAIL_1 DETAIL_2 DETAIL_3
>  1 A1       A2       A3
>  2 B1
>  3 C1
> 
> What would be the best method to do such a transformation on a large dataset?
> 

