And the most frequent operation I am going to do is find the UserIDs that have certain events, then retrieve all the events associated with those UserIDs.
In this case, how should I partition to speed up the process? Thanks.

On Fri, Jan 8, 2016 at 2:29 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:

> Hey Ted,
>
> The Event table looks like this: UserID, EventType, EventKey, TimeStamp,
> MetaData. I just parse it from JSON and save it as Parquet; I did not
> change the partitioning.
>
> Annoyingly, each day's incoming event data contains duplicates. The same
> event can show up on Day 1 and Day 2, and probably Day 3.
>
> I only want to keep a single Event table, but each day brings in so many
> duplicates.
>
> Is there a way I could just insert into Parquet and, if a duplicate is
> found, simply ignore it?
>
> Thanks,
> Gavin
>
> On Fri, Jan 8, 2016 at 2:18 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Is your Parquet data source partitioned by date?
>>
>> Can you dedup within partitions?
>>
>> Cheers
>>
>> On Fri, Jan 8, 2016 at 2:10 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:
>>
>>> I tried this on three days' data. The total input is only 980GB, but
>>> the shuffle write is about 6.2TB, and the job failed during the shuffle
>>> read step, which would have been another 6.2TB.
>>>
>>> I think that to dedup, the shuffle cannot be avoided. Is there anything
>>> I could do to stabilize this process?
>>>
>>> Thanks.
>>>
>>> On Fri, Jan 8, 2016 at 2:04 PM, Gavin Yue <yue.yuany...@gmail.com>
>>> wrote:
>>>
>>>> Hey,
>>>>
>>>> I have each day's Event table and want to merge them into a single
>>>> Event table, but there are so many duplicates among each day's data.
>>>>
>>>> I use Parquet as the data source. What I am doing now is
>>>>
>>>> EventDay1.unionAll(EventDay2).distinct().write.parquet("a new parquet
>>>> file")
>>>>
>>>> Each day's events are stored in their own Parquet file.
>>>>
>>>> But it failed at stage 2, which kept losing the connection to one
>>>> executor. I guess this is due to a memory issue.
>>>>
>>>> Any suggestions on how to do this efficiently?
>>>>
>>>> Thanks,
>>>> Gavin
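
A minimal sketch of one way the merge/dedup/partition pipeline above could
look in Spark 1.x Scala. The column names (UserID, EventType, EventKey,
TimeStamp, MetaData) come from the schema in the thread; the file paths,
the "bucket" column, and the bucket count of 256 are hypothetical. Two
ideas: dedup on EventKey alone (assuming EventKey uniquely identifies an
event), so the shuffle compares keys rather than whole rows including
MetaData; and write with partitionBy on a hashed-UserID bucket, so the
UserID lookup described at the top only scans one directory.

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.{col, crc32, lit, pmod}

    val sqlContext = new SQLContext(sc) // sc: an existing SparkContext

    // Hypothetical paths for each day's Parquet dump.
    val day1 = sqlContext.read.parquet("/events/day1")
    val day2 = sqlContext.read.parquet("/events/day2")

    // Dedup on the event key instead of the whole row (assumes EventKey
    // uniquely identifies an event); this shuffles far less data than
    // distinct(), which compares every column including MetaData.
    val merged = day1.unionAll(day2).dropDuplicates(Seq("EventKey"))

    // Derive a bucket from a hash of UserID and partition the output on
    // it, so all of a user's events land in the same directory.
    // 256 is an arbitrary bucket count.
    val bucketed = merged.withColumn(
      "bucket", pmod(crc32(col("UserID").cast("string")), lit(256)))

    bucketed.write.partitionBy("bucket").parquet("/events/merged")

    // Looking up one user's events then prunes to a single bucket
    // directory instead of scanning the whole table.
    def eventsFor(userId: String) = {
      val all = sqlContext.read.parquet("/events/merged")
      all.where(col("bucket") === pmod(crc32(lit(userId)), lit(256)) &&
                col("UserID") === userId)
    }

On the insert-and-ignore-duplicates question: Parquet files are immutable
and have no notion of insert-if-absent, so some form of read, dedup, and
rewrite is unavoidable; deduplicating on a narrow key as sketched above at
least keeps the shuffle volume closer to the size of the keys than to the
size of the full 980GB input.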