You didn't specify which API, but in PySpark you could do:
import pyspark.sql.functions as F
df.groupBy('ID').agg(F.sort_array(F.collect_set('DETAILS')).alias('DETAILS')).show()
+---+------------+
| ID|     DETAILS|
+---+------------+
|  1|[A1, A2, A3]|
|  3|        [B2]|
|  2|        [B1]|
+---+------------+
If you want to sort by PART I think you'll need a UDF.
On Wed, Aug 22, 2018 at 4:12 PM, Jean Georges Perrin wrote:
> How do you do it now?
>
> You could use a withColumn(“newDetails”, details_2...>)
>
> jg
>
>
> > On Aug 22, 2018, at 16:04, msbreuer wrote:
> >
> > A dataframe with following contents is given:
> >
> > ID PART DETAILS
> >  1    1 A1
> >  1    2 A2
> >  1    3 A3
> >  2    1 B1
> >  3    1 C1
> >
> > Target format should be as following:
> >
> > ID DETAILS
> > 1 A1+A2+A3
> > 2 B1
> > 3 C1
> >
> > Note, the order of A1-3 is important.
> >
> > Currently I am using this alternative:
> >
> > ID DETAIL_1 DETAIL_2 DETAIL_3
> > 1 A1 A2 A3
> > 2 B1
> > 3 C1
> >
> > What would be the best method to do such a transformation on a large
> > dataset?
> >
> >
> >
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>