Or you could also try collect_list from the Spark SQL functions:
// Collect every feature value for a user into a single array column.
val compacter = Data1.groupBy("UserID")
  .agg(org.apache.spark.sql.functions.collect_list("feature").as("ListOfFeatures"))
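Since Data1, Data2 and Data3 all share the same two-column layout, you could also union them first and aggregate once, which replaces the repeated joins with a single shuffle. A rough sketch, assuming all three are DataFrames with columns "UserID" and "feature":

import org.apache.spark.sql.functions.collect_list

// Stack the three datasets; union is a narrow operation, so no shuffle happens here.
val combined = Data1.union(Data2).union(Data3)

// One shuffle by UserID gathers all of a user's features into one array.
val perUser = combined.groupBy("UserID")
  .agg(collect_list("feature").as("ListOfFeatures"))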


> On 15 May 2017, at 15:15, Jone Zhang <joyoungzh...@gmail.com> wrote:
> 
> For example
> Data1(has 1 billion records)
> user_id1  feature1
> user_id1  feature2
> 
> Data2(has 1 billion records)
> user_id1  feature3
> 
> Data3(has 1 billion records)
> user_id1  feature4
> user_id1  feature5
> ...
> user_id1  feature100
> 
> I want to get the result as follows:
> user_id1  feature1 feature2 feature3 feature4 feature5...feature100
> 
> Is there a more efficient way than a join?
> 
> Thanks!

Didac Gil de la Iglesia
PhD in Computer Science
didacg...@gmail.com
Spain:     +34 696 285 544
Sweden: +46 (0)730229737
Skype: didac.gil.de.la.iglesia
