Hi, sorry, I was typing from mobile earlier, so I could not elaborate.
I presume you want to do a map-side join, i.e. join two RDDs without a shuffle? Please have a quick look at
http://apache-spark-user-list.1001560.n3.nabble.com/Text-file-and-shuffle-td5973.html#none

1) Co-partition your data for the cogroup/join:

    val par = new HashPartitioner(128)
    val x = sc.textFile(..).map(...).partitionBy(par)
    val y = sc.textFile(...).map(...).partitionBy(par)
    ...

This should enable the join with much less shuffle. Another option provided in the same thread is to broadcast one of the tables, in case it is small(ish). A rough sketch of both approaches is appended below the quoted thread.

Hope this helps.

Best
Ayan

On Tue, Apr 21, 2015 at 3:56 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:

> These are pair RDDs: (itemId, item) & (itemId, listing).
>
> What do you mean by re-partitioning these RDDs?
> And what do you mean by "your partitioner"?
>
> Can you elaborate?
>
> On Tue, Apr 21, 2015 at 11:18 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> If you are using a pair RDD, then you can use the partitionBy method to
>> provide your partitioner.
>> On 21 Apr 2015 15:04, "ÐΞ€ρ@Ҝ (๏̯͡๏)" <deepuj...@gmail.com> wrote:
>>
>>> What is re-partitioning?
>>>
>>> On Tue, Apr 21, 2015 at 10:23 AM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> In my understanding, you need to create a key out of the data and
>>>> repartition both datasets to achieve a map-side join.
>>>> On 21 Apr 2015 14:10, "ÐΞ€ρ@Ҝ (๏̯͡๏)" <deepuj...@gmail.com> wrote:
>>>>
>>>>> Can someone share their working code for a map-side join in Spark +
>>>>> Scala? (No Spark SQL.)
>>>>>
>>>>> The only resource I could find was this (open in Chrome with the
>>>>> Chinese-to-English translator):
>>>>>
>>>>> http://dongxicheng.org/framework-on-yarn/apache-spark-join-two-tables/
>>>>>
>>>>> --
>>>>> Deepak
>>>
>>> --
>>> Deepak
>
> --
> Deepak

--
Best Regards,
Ayan Guha
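
For the co-partitioning approach, a minimal, untested sketch. It assumes a spark-shell session where sc is already defined, and hypothetical tab-separated inputs /data/items and /data/listings with itemId in the first column; adjust the paths and parsing to your data.

    import org.apache.spark.HashPartitioner

    // Use the same partitioner on both sides so records with the same
    // itemId land in the same numbered partition of each RDD.
    val par = new HashPartitioner(128)

    // Hypothetical inputs: tab-separated lines, itemId in the first column.
    val items = sc.textFile("/data/items")
      .map { line => val f = line.split("\t"); (f(0), f) }
      .partitionBy(par)
      .persist()

    val listings = sc.textFile("/data/listings")
      .map { line => val f = line.split("\t"); (f(0), f) }
      .partitionBy(par)
      .persist()

    // Because items and listings share the partitioner, the join is a
    // narrow dependency: each output partition is built from the matching
    // input partitions, with no extra shuffle beyond the initial partitionBy.
    val joined = items.join(listings)
    joined.take(10).foreach(println)

Note that partitionBy itself still shuffles each input once; the gain is that the join (and any later joins or cogroups on the same key) reuse that layout, which is why persisting the partitioned RDDs matters.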
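
For the broadcast approach (when one table fits comfortably in memory), a sketch along the same lines and with the same assumed inputs; collecting the small side and looking it up locally replaces the shuffle entirely.

    // Assumption: the items table is small enough to collect to the driver.
    val items = sc.textFile("/data/items")
      .map { line => val f = line.split("\t"); (f(0), f) }

    // Ship the small side to every executor once, as a read-only map.
    val itemsById = sc.broadcast(items.collectAsMap())

    val listings = sc.textFile("/data/listings")
      .map { line => val f = line.split("\t"); (f(0), f) }

    // Each listings partition joins locally against the broadcast map,
    // so there is no shuffle at all (a true map-side join).
    val joined = listings.flatMap { case (itemId, listing) =>
      itemsById.value.get(itemId).map(item => (itemId, (item, listing)))
    }

    joined.take(10).foreach(println)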