I guess that if your user_id field is the key, you could use the 
updateStateByKey function.

I did not test it, but it could be something along these lines:

def yourCombineFunction(input: Seq[(String)],accumulatedInput: Option[(String)] 
= {
        val state = accumulatedInput.getOrElse((“”)) //In case the current Key 
was not found before, the features list is empty
        val feature = input._1 //We get the feature value of this new entry

        val newFeature = state._1 +” “+feature
        Some((newFeature)) //The new accumulated value for the features is 
returned
}

val updatedData = Data1.updateStateByKey(yourCombineFunction) //This would 
“iterate” among all the entries in your Dataset and, for each row, will update 
the “accumulatedFeatures”

Good luck

> On 15 May 2017, at 15:15, Jone Zhang <joyoungzh...@gmail.com> wrote:
> 
> For example
> Data1(has 1 billion records)
> user_id1  feature1
> user_id1  feature2
> 
> Data2(has 1 billion records)
> user_id1  feature3
> 
> Data3(has 1 billion records)
> user_id1  feature4
> user_id1  feature5
> ...
> user_id1  feature100
> 
> I want to get the result as follow
> user_id1  feature1 feature2 feature3 feature4 feature5...feature100
> 
> Is there a more efficient way except join?
> 
> Thanks!

Didac Gil de la Iglesia
PhD in Computer Science
didacg...@gmail.com
Spain:     +34 696 285 544
Sweden: +46 (0)730229737
Skype: didac.gil.de.la.iglesia

Attachment: signature.asc
Description: Message signed with OpenPGP

Reply via email to