A dataframe with the following contents is given:
ID  PART  DETAILS
1   1     A1
1   2     A2
1   3     A3
2   1     B1
3   1     C1
The target format should be as follows:
ID DETAILS
1 A1+A2+A3
2 B1
3 C1
Note that the order of A1 to A3 is important.
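To make the intended transformation concrete, here is a minimal plain-Python sketch of the logic, assuming each input row is an (ID, PART, DETAILS) triple and that PART defines the order within an ID. The names `concat_details` and `rows` are hypothetical and only serve the illustration. In Spark the same shape is often expressed by collecting (PART, DETAILS) structs per ID with `collect_list`, ordering them with `sort_array`, and joining with `concat_ws`, but that is not what the snippet below runs.

```python
from collections import defaultdict


def concat_details(rows, sep="+"):
    """Group (id, part, details) rows by id and join details in PART order."""
    grouped = defaultdict(list)
    for row_id, part, details in rows:
        grouped[row_id].append((part, details))
    # Sort each group by PART so the result does not depend on input order.
    return {
        row_id: sep.join(details for _, details in sorted(parts))
        for row_id, parts in sorted(grouped.items())
    }


# Rows deliberately shuffled to show that the order comes from PART.
rows = [(1, 2, "A2"), (1, 1, "A1"), (2, 1, "B1"), (1, 3, "A3"), (3, 1, "C1")]
result = concat_details(rows)
print(result)
```

Sorting explicitly inside each group matters because a distributed group-by gives no guarantee about the order in which rows arrive.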
Currently I am using this alternative:
ID DETAIL_1 DETAIL_2 DETAI
Many threads talk about memory requirements, and the most common answer is
to add more memory to Spark. My understanding is that Spark is a scalable
analytics engine that can use whatever resources it is assigned and still
calculate the correct answer. So assigning more cores and memory may speed up
a task.
I am usi
While working with larger datasets I run into out-of-memory issues.
Basically, a Hadoop sequence file is read, its contents are sorted, and a
Hadoop map file is written back. The code works fine for workloads greater
than 20 GB. Then I changed one column in my dataset to store a large
object and size of r