Is your Parquet data source partitioned by date? Can you dedup within partitions?
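If it is, you could dedup each day on its own before merging anything. A minimal sketch, assuming a layout like /data/events/date=YYYY-MM-DD and a key column such as eventId (both hypothetical, substitute your real paths and dedup key):

  val days = Seq("2016-01-05", "2016-01-06", "2016-01-07")
  days.foreach { day =>
    // read only one date partition, so the dedup shuffle stays within that day
    sqlContext.read.parquet(s"/data/events/date=$day")        // hypothetical input path
      .dropDuplicates(Seq("eventId"))                         // dedup on the key, not the full row
      .write.parquet(s"/data/events_dedup/date=$day")         // hypothetical output path
  }

That keeps each shuffle to roughly one day's worth of data instead of the whole multi-day union.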
Cheers

On Fri, Jan 8, 2016 at 2:10 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:

> I tried on three days' data. The total input is only 980GB, but the
> shuffle write data is about 6.2TB, and then the job failed during the
> shuffle read step, which should be another 6.2TB of shuffle read.
>
> I think that to dedup, the shuffling cannot be avoided. Is there anything
> I could do to stabilize this process?
>
> Thanks.
>
> On Fri, Jan 8, 2016 at 2:04 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:
>
>> Hey,
>>
>> I have each day's Event table and want to merge them into a single
>> Event table, but there are many duplicates within each day's data.
>>
>> I use Parquet as the data source. What I am doing now is
>>
>> EventDay1.unionAll(EventDay2).distinct().write.parquet("a new parquet file")
>>
>> Each day's events are stored in their own Parquet file.
>>
>> But it fails at stage 2, which keeps losing the connection to one
>> executor. I guess this is due to a memory issue.
>>
>> Any suggestion on how to do this efficiently?
>>
>> Thanks,
>> Gavin
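P.S. For the merge step in the quoted message: once each day has been deduplicated, the cross-day dedup has far less to shuffle. A sketch of that follow-up step, again assuming the hypothetical paths and eventId key from above:

  // union all the already-deduplicated days (partition discovery picks up the
  // date=... subdirectories) and dedup across them; the big shuffle now only
  // moves rows that already survived the per-day pass
  sqlContext.read.parquet("/data/events_dedup")      // hypothetical path from the per-day step
    .dropDuplicates(Seq("eventId"))                  // hypothetical key column
    .write.parquet("/data/events_merged")            // hypothetical output path

Grouping on the key column rather than calling distinct() on full rows also keeps the aggregation keys small, but the main win is that the cross-day shuffle only sees already-reduced data.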