Hi Jone Zhang,

1. Hive UDF
You might want collect_set (which also removes duplicates) or collect_list,
but make sure to reduce the cardinality before applying them, as a very
large set per key can cause problems when handling 1 billion records.
Union datasets 1, 2, and 3 -> group by user_id -> collect_set(feature
column) should work; see the sketch below the link.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
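
In case it helps, here is a minimal sketch of that flow as a Hive query run
through Spark's Hive support. The table and column names (data1, data2,
data3, user_id, feature) are placeholders of mine, not anything from your
actual setup:

from pyspark.sql import SparkSession

# Spark session with Hive support so the query can read existing Hive tables.
spark = (SparkSession.builder
         .appName("collect-features")
         .enableHiveSupport()
         .getOrCreate())

# UNION ALL the three long-format tables, then collapse each user's
# features into a single row; collect_set also drops duplicate features.
wide = spark.sql("""
    SELECT user_id, collect_set(feature) AS features
    FROM (
        SELECT user_id, feature FROM data1
        UNION ALL
        SELECT user_id, feature FROM data2
        UNION ALL
        SELECT user_id, feature FROM data3
    ) unioned
    GROUP BY user_id
""")

wide.show(truncate=False)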

2. Spark DataFrame Pivot
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
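
And a minimal PySpark pivot sketch along the lines of that post; the toy
column names (user_id, feature_name, value) are assumptions for
illustration only:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pivot-demo").getOrCreate()

# Toy long-format data: one (user, feature, value) triple per row.
long_df = spark.createDataFrame(
    [("user_id1", "feature1", 1.0),
     ("user_id1", "feature2", 0.5),
     ("user_id2", "feature1", 2.0)],
    ["user_id", "feature_name", "value"],
)

# One row per user, one column per feature. Listing the pivot values
# explicitly saves Spark an extra pass over the data to discover them.
wide_df = (long_df
           .groupBy("user_id")
           .pivot("feature_name", ["feature1", "feature2"])
           .agg(F.first("value")))

wide_df.show()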

- Goun

2017-05-15 22:15 GMT+09:00 Jone Zhang <joyoungzh...@gmail.com>:

> For example
> Data1 (has 1 billion records)
> user_id1  feature1
> user_id1  feature2
>
> Data2 (has 1 billion records)
> user_id1  feature3
>
> Data3 (has 1 billion records)
> user_id1  feature4
> user_id1  feature5
> ...
> user_id1  feature100
>
> I want to get the result as follows:
> user_id1  feature1 feature2 feature3 feature4 feature5...feature100
>
> Is there a more efficient way than a join?
>
> Thanks!
>
