Hi, Jone Zhang

1. Hive UDF
You might want collect_set (which eliminates duplicates) or collect_list, but make sure to reduce the cardinality per key before applying them, since collecting all values for a user can cause memory problems when handling 1 billion records. Union datasets 1, 2, 3 -> group by user_id1 -> collect_set(feature column) would work.
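A rough HiveQL sketch of that union + group by + collect_set approach (the table names data1/data2/data3 and the column names user_id/feature are placeholders for your actual schema):

```sql
-- Stack the three feature tables, then gather each user's features
-- into a single array column. collect_set also drops duplicate features;
-- use collect_list instead if duplicates should be kept.
SELECT user_id,
       collect_set(feature) AS features
FROM (
  SELECT user_id, feature FROM data1
  UNION ALL
  SELECT user_id, feature FROM data2
  UNION ALL
  SELECT user_id, feature FROM data3
) unioned
GROUP BY user_id;
```

This yields one row per user with an array of features rather than 100 separate columns, which is usually what you want for downstream feature handling.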
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

2. Spark DataFrame Pivot
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html

- Goun

2017-05-15 22:15 GMT+09:00 Jone Zhang <joyoungzh...@gmail.com>:
> For example
> Data1 (has 1 billion records)
> user_id1 feature1
> user_id1 feature2
>
> Data2 (has 1 billion records)
> user_id1 feature3
>
> Data3 (has 1 billion records)
> user_id1 feature4
> user_id1 feature5
> ...
> user_id1 feature100
>
> I want to get the result as follows
> user_id1 feature1 feature2 feature3 feature4 feature5 ... feature100
>
> Is there a more efficient way except join?
>
> Thanks!
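If you do want the wide one-column-per-feature layout from the original question, the Spark DataFrame pivot from the blog post above could be sketched like this in Scala (assuming a DataFrame `df` with columns user_id, feature_name, feature_value built by unioning the three datasets; these names are illustrative, not from the original mail):

```scala
import org.apache.spark.sql.functions.first

// Pivot the long (user_id, feature_name, feature_value) layout into
// one row per user with one column per distinct feature_name.
// Listing the feature names explicitly in pivot(...) avoids an extra
// pass over 1B rows to discover the distinct values.
val wide = df.groupBy("user_id")
  .pivot("feature_name")
  .agg(first("feature_value"))
```

Note the pivoted column count equals the number of distinct feature names, so with ~100 features this stays manageable, but it would not scale to very high-cardinality feature spaces.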