Thanks, but is there a non-broadcast solution?

On 5 May 2015 01:34, "ÐΞ€ρ@Ҝ (๏̯͡๏)" <deepuj...@gmail.com> wrote:
> I have implemented map-side join with broadcast variables and the code is
> on the mailing list (scala).
>
> On Mon, May 4, 2015 at 8:38 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Hi
>>
>> Can someone share some working code for a custom partitioner in Python?
>>
>> I am trying to understand it better.
>>
>> Here is the documentation:
>>
>> partitionBy(numPartitions, partitionFunc=<function portable_hash at 0x2c45140>)
>> <https://spark.apache.org/docs/1.3.1/api/python/pyspark.html#pyspark.RDD.partitionBy>
>>
>> Return a copy of the RDD partitioned using the specified partitioner.
>>
>> What I am trying to do:
>>
>> 1. Create a dataframe
>> 2. Partition it using one specific column
>> 3. Create another dataframe
>> 4. Partition it on the same column
>> 5. Join (to enforce a map-side join)
>>
>> My questions:
>>
>> a) Am I on the right path?
>>
>> b) How can I do the partitionBy? Specifically, when I call
>> DF.rdd.partitionBy, what gets passed to the custom function? A tuple?
>> A row? How do I access, say, the 3rd column of a tuple inside the
>> partitioner function?
>>
>> --
>> Best Regards,
>> Ayan Guha
>
> --
> Deepak
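For question (b), a minimal sketch of the `partitionFunc` contract may help: in PySpark, `RDD.partitionBy` passes only the *key* of each (key, value) pair to `partitionFunc`, and then takes `partitionFunc(key) % numPartitions` to choose the bucket. So to partition by a specific column, first `keyBy` that column. The pure-Python part below runs without Spark; the Spark calls are shown as comments and are an untested assumption, including the hypothetical dataframes `df1`/`df2`.

```python
# Sketch of a custom partitioner for PySpark's RDD.partitionBy.
# partitionFunc receives ONLY the key of each (key, value) pair;
# PySpark then computes partitionFunc(key) % numPartitions.

def join_key_partitioner(key):
    # key is whatever the RDD was keyed on -- here, the join column's value.
    return hash(key)

# Simulate what partitionBy does with the function (pure Python):
num_partitions = 4
pairs = [("apple", 1), ("banana", 2), ("apple", 3)]
buckets = {k: join_key_partitioner(k) % num_partitions for k, _ in pairs}

# Identical keys always land in the same partition:
assert buckets["apple"] == join_key_partitioner("apple") % num_partitions

# With Spark (assumed usage, not verified here), steps 1-5 above would
# look roughly like:
#   kv1 = df1.rdd.keyBy(lambda row: row[2])   # key by the 3rd column
#   kv2 = df2.rdd.keyBy(lambda row: row[2])
#   p1 = kv1.partitionBy(num_partitions, join_key_partitioner)
#   p2 = kv2.partitionBy(num_partitions, join_key_partitioner)
#   joined = p1.join(p2)  # co-partitioned, so the join avoids a re-shuffle
```

Note that co-partitioning both RDDs with the same function and partition count is what lets the join match keys partition-by-partition; the broadcast approach earlier in the thread avoids the shuffle differently, by shipping the small side to every executor.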