Hey, I have an Event table for each day and want to merge them into a single Event table. But there are many duplicates across the days' data.
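To make the goal concrete, here is a minimal plain-Python sketch of the merge being described. It is only a model of the intended result, not Spark code: the row layout (event id, user, date), the sample values, and the helper names `concat_days` and `drop_exact_duplicates` are all made up for illustration. It assumes that "duplicate" means a row that is identical in every column.

```python
# Plain-Python model of "merge the daily tables and drop duplicates".
# Rows are tuples; the (event_id, user, date) layout is hypothetical.

def concat_days(day1, day2):
    """Concatenate two days of rows, keeping any duplicates."""
    return list(day1) + list(day2)

def drop_exact_duplicates(rows):
    """Drop rows identical in every column, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

# Hypothetical sample data: the "bob" row appears on both days.
event_day1 = [(1, "alice", "2016-01-07"), (2, "bob", "2016-01-07")]
event_day2 = [(2, "bob", "2016-01-07"), (3, "carol", "2016-01-08")]

merged = drop_exact_duplicates(concat_days(event_day1, event_day2))
print(merged)
```

The model keeps rows in first-seen order, which a distributed engine generally does not guarantee. It also holds every row in one `seen` set, which hints at why de-duplicating a large dataset is expensive: every row has to be compared against all the others, so on a cluster the full data typically has to be shuffled between executors.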
I use Parquet as the data source, and each day's Event table is stored in its own Parquet file. What I am doing now is EventDay1.unionAll(EventDay2).distinct().write.parquet("a new parquet file"). But it fails at stage 2, which keeps losing its connection to one executor. I guess this is due to a memory issue. Any suggestions on how to do this efficiently? Thanks, Gavin