Thanks, but is there a non-broadcast solution?

On 5 May 2015 01:34, "ÐΞ€ρ@Ҝ (๏̯͡๏)" <deepuj...@gmail.com> wrote:
> I have implemented map-side join with broadcast variables and the code is
> on the mailing list (scala).
>
> On Mon, May 4, 2015 at 8:38 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Hi
>>
>> Can someone share some working code for a custom partitioner in Python?
>>
>> I am trying to understand it better.
>>
>> Here is the documentation:
>>
>> partitionBy(numPartitions, partitionFunc=<function portable_hash at 0x2c45140>)
>> <https://spark.apache.org/docs/1.3.1/api/python/pyspark.html#pyspark.RDD.partitionBy>
>>
>> Return a copy of the RDD partitioned using the specified partitioner.
>>
>> What I am trying to do:
>>
>> 1. Create a dataframe
>> 2. Partition it using one specific column
>> 3. Create another dataframe
>> 4. Partition it on the same column
>> 5. Join (to enforce a map-side join)
>>
>> My questions:
>>
>> a) Am I on the right path?
>>
>> b) How can I do the partitionBy? Specifically, when I call
>> DF.rdd.partitionBy, what gets passed to the custom function? A tuple?
>> A row? How do I access, say, the 3rd column of a tuple inside the
>> partitioner function?
>>
>> --
>> Best Regards,
>> Ayan Guha
>
> --
> Deepak
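For question (b), a minimal sketch of the `partitionFunc` contract may help: in PySpark, `RDD.partitionBy` passes only the *key* of each (key, value) pair to `partitionFunc`, and then takes `partitionFunc(key) % numPartitions` to choose the bucket. So to partition by a specific column, first `keyBy` that column. The pure-Python part below runs without Spark; the Spark calls are shown as comments and are an untested assumption, including the hypothetical dataframes `df1`/`df2`.

```python
# Sketch of a custom partitioner for PySpark's RDD.partitionBy.
# partitionFunc receives ONLY the key of each (key, value) pair;
# PySpark then computes partitionFunc(key) % numPartitions.

def join_key_partitioner(key):
    # key is whatever the RDD was keyed on -- here, the join column's value.
    return hash(key)

# Simulate what partitionBy does with the function (pure Python):
num_partitions = 4
pairs = [("apple", 1), ("banana", 2), ("apple", 3)]
buckets = {k: join_key_partitioner(k) % num_partitions for k, _ in pairs}

# Identical keys always land in the same partition:
assert buckets["apple"] == join_key_partitioner("apple") % num_partitions

# With Spark (assumed usage, not verified here), steps 1-5 above would
# look roughly like:
#   kv1 = df1.rdd.keyBy(lambda row: row[2])   # key by the 3rd column
#   kv2 = df2.rdd.keyBy(lambda row: row[2])
#   p1 = kv1.partitionBy(num_partitions, join_key_partitioner)
#   p2 = kv2.partitionBy(num_partitions, join_key_partitioner)
#   joined = p1.join(p2)  # co-partitioned, so the join avoids a re-shuffle
```

Note that co-partitioning both RDDs with the same function and partition count is what lets the join match keys partition-by-partition; the broadcast approach earlier in the thread avoids the shuffle differently, by shipping the small side to every executor.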