Re: Equally split a RDD partition into two partition at the same node

Fei Hu Sun, 15 Jan 2017 10:38:47 -0800

Hi Anastasios,

Thanks for your information. I will look into the CoalescedRDD code.


Thanks,
Fei

On Sun, Jan 15, 2017 at 12:21 PM, Anastasios Zouzias <zouz...@gmail.com>
wrote:

> Hi Fei,
>
> I looked at the code of CoalescedRDD and probably what I suggested will
> not work.
>
> Speaking of which, CoalescedRDD is private[spark]. If this was not the
> case, you could set balanceSlack to 1, and get what you requested, see
>
> https://github.com/apache/spark/blob/branch-1.6/core/
> src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L75
>
> Maybe you could try to use the CoalescedRDD code to implement your
> requirement.
>
> Good luck!
> Cheers,
> Anastasios
>
>
> On Sun, Jan 15, 2017 at 5:39 PM, Fei Hu <hufe...@gmail.com> wrote:
>
>> Hi Anastasios,
>>
>> Thanks for your reply. If I just increase the numPartitions to be twice
>> larger, how coalesce(numPartitions: Int, shuffle: Boolean = false) keeps
>> the data locality? Do I need to define my own Partitioner?
>>
>> Thanks,
>> Fei
>>
>> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias <zouz...@gmail.com>
>> wrote:
>>
>>> Hi Fei,
>>>
>>> How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
>>>
>>> https://github.com/apache/spark/blob/branch-1.6/core/src/mai
>>> n/scala/org/apache/spark/rdd/RDD.scala#L395
>>>
>>> coalesce is mostly used for reducing the number of partitions before
>>> writing to HDFS, but it might still be a narrow dependency (satisfying your
>>> requirements) if you increase the # of partitions.
>>>
>>> Best,
>>> Anastasios
>>>
>>> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hufe...@gmail.com> wrote:
>>>
>>>> Dear all,
>>>>
>>>> I want to equally divide a RDD partition into two partitions. That
>>>> means, the first half of elements in the partition will create a new
>>>> partition, and the second half of elements in the partition will generate
>>>> another new partition. But the two new partitions are required to be at the
>>>> same node with their parent partition, which can help get high data
>>>> locality.
>>>>
>>>> Is there anyone who knows how to implement it or any hints for it?
>>>>
>>>> Thanks in advance,
>>>> Fei
>>>>
>>>>
>>>
>>>
>>> --
>>> -- Anastasios Zouzias
>>> <a...@zurich.ibm.com>
>>>
>>
>>
>
>
> --
> -- Anastasios Zouzias
> <a...@zurich.ibm.com>
>

Re: Equally split a RDD partition into two partition at the same node

Reply via email to