Is your Parquet data source partitioned by date?

Can you dedup within partitions?
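
If duplicates only ever occur within the same day, something along these
lines keeps each shuffle down to a single day of data (rough sketch for
spark-shell; the paths, the date list, and the "eventId" key column are just
placeholders for whatever you actually have):

// Rough sketch -- assumes one Parquet path per day and that duplicates
// never cross days; /events/raw, /events/merged and "eventId" are
// hypothetical names.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: the SparkContext from spark-shell

val days = Seq("2016-01-06", "2016-01-07", "2016-01-08")

days.foreach { day =>
  val events = sqlContext.read.parquet(s"/events/raw/dt=$day")
  // Dedup one day at a time, so each shuffle only sees that day's data
  // instead of the whole multi-day union.
  events.dropDuplicates(Seq("eventId"))
        .write.parquet(s"/events/merged/dt=$day")
}

If duplicates can cross days, deduping each day first at least shrinks the
data that a final cross-day distinct() has to shuffle.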

Cheers

On Fri, Jan 8, 2016 at 2:10 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:

> I tried on three days' data.  The total input is only 980GB, but the
> shuffle write data is about 6.2TB, and then the job failed during the
> shuffle read step, which should be another 6.2TB of shuffle read.
>
> I think the shuffle cannot be avoided for dedup. Is there anything I
> could do to stabilize this process?
>
> Thanks.
>
>
> On Fri, Jan 8, 2016 at 2:04 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:
>
>> Hey,
>>
>> I have each day's Event table and want to merge them into a single Event
>> table. But there are so many duplicates among each day's data.
>>
>> I use Parquet as the data source. What I am doing now is:
>>
>> EventDay1.unionAll(EventDay2).distinct().write.parquet("a new parquet file")
>>
>> Each day's Event table is stored in its own Parquet file.
>>
>> But it failed at stage 2, which kept losing the connection to one
>> executor. I guess this is due to a memory issue.
>>
>> Any suggestions on how to do this efficiently?
>>
>> Thanks,
>> Gavin
>>
>
>
