Re: Best practice for join

2014-11-04 Thread Akhil Das
Oh, in that case, if you want to reduce the GC time, you can specify the level of parallelism along with your join and reduceByKey operations.

Thanks
Best Regards

On Wed, Nov 5, 2014 at 1:11 PM, Benyi Wang wrote:
> I'm using spark-1.0.0 in CDH 5.1.0. The big problem is SparkSQL doesn't
> support Hash join in this version.
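
For reference, both join and reduceByKey on a pair RDD take an optional numPartitions argument, and spark.default.parallelism sets the default when none is given. A minimal sketch of Akhil's suggestion (the sample data and the partition count of 200 are illustrative, not from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    object ParallelismSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("parallelism-sketch"))

        // Hypothetical pair RDDs; in the thread these come from RDD[A] and RDD[B].
        val ja = sc.parallelize(Seq((1, "a1"), (2, "a2")))
        val jb = sc.parallelize(Seq((1, "b1"), (2, "b2")))

        // An explicit partition count spreads the shuffle across more, smaller
        // tasks, which can lower per-task memory pressure and GC time.
        val jab = ja.join(jb, 200)

        // The same overload exists for reduceByKey.
        val counts = jab.mapValues(_ => 1).reduceByKey(_ + _, 200)

        counts.collect().foreach(println)
        sc.stop()
      }
    }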

Re: Best practice for join

2014-11-04 Thread Benyi Wang
I'm using spark-1.0.0 in CDH 5.1.0. The big problem is SparkSQL doesn't support Hash join in this version.

On Tue, Nov 4, 2014 at 10:54 PM, Akhil Das wrote:
> How about using SparkSQL?
>
> Thanks
> Best Regards
>
> On Wed, Nov 5, 2014 at 1:53 AM, Benyi Wang wrote:

Re: Best practice for join

2014-11-04 Thread Akhil Das
How about using SparkSQL?

Thanks
Best Regards

On Wed, Nov 5, 2014 at 1:53 AM, Benyi Wang wrote:
> I need to join RDD[A], RDD[B], and RDD[C]. Here is what I did,
>
> // build (K,V) from A and B to prepare the join
>
> val ja = A.map(r => (K1, Va))
> val jb = B.map(r => (K1, Vb))
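
For context, a minimal sketch of the SparkSQL route on the Spark 1.0-era SchemaRDD API (the case classes, table names, and query are assumptions standing in for A and B):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical record types standing in for A and B.
    case class RecA(k: Int, va: String)
    case class RecB(k: Int, vb: String)

    object SqlJoinSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sql-join-sketch"))
        val sqlContext = new SQLContext(sc)
        // Implicitly converts an RDD of case classes to a SchemaRDD (Spark 1.0 API).
        import sqlContext.createSchemaRDD

        val a = sc.parallelize(Seq(RecA(1, "a1"), RecA(2, "a2")))
        val b = sc.parallelize(Seq(RecB(1, "b1"), RecB(2, "b2")))

        a.registerAsTable("a")
        b.registerAsTable("b")

        val joined = sqlContext.sql(
          "SELECT a.k, a.va, b.vb FROM a JOIN b ON a.k = b.k")
        joined.collect().foreach(println)
        sc.stop()
      }
    }

As Benyi points out above, though, Spark SQL in 1.0.0 does not plan this as a hash join, which is the sticking point in this thread.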

Best practice for join

2014-11-04 Thread Benyi Wang
I need to join RDD[A], RDD[B], and RDD[C]. Here is what I did,

// build (K,V) from A and B to prepare the join

val ja = A.map(r => (K1, Va))
val jb = B.map(r => (K1, Vb))

// join A, B

val jab = ja.join(jb)

// build (K,V) from the joined result of A and B to prepare joining with C

val jc = C.ma
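
The archived message is cut off above, but the pattern is clear enough to sketch end to end. A runnable version (the record types, key fields, and the way the A-B result is re-keyed for C are all assumptions, since the original is truncated):

    import org.apache.spark.{SparkConf, SparkContext}

    object ThreeWayJoinSketch {
      // Hypothetical record types for A, B, and C.
      case class A(k1: Int, va: String)
      case class B(k1: Int, vb: String)
      case class C(k2: Int, vc: String)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("three-way-join-sketch"))

        val as = sc.parallelize(Seq(A(1, "a1"), A(2, "a2")))
        val bs = sc.parallelize(Seq(B(1, "b1"), B(2, "b2")))
        val cs = sc.parallelize(Seq(C(10, "c1"), C(20, "c2")))

        // Build (K,V) pairs from A and B to prepare the join.
        val ja = as.map(r => (r.k1, r.va))
        val jb = bs.map(r => (r.k1, r.vb))

        // Join A and B on K1.
        val jab = ja.join(jb) // RDD[(Int, (String, String))]

        // Re-key the A-B result on K2 and key C on K2, then join with C.
        // How K2 is derived is illustrative; the original message does not say.
        val jabByK2 = jab.map { case (k1, (va, vb)) => (k1 * 10, (va, vb)) }
        val jc = cs.map(r => (r.k2, r.vc))
        val jabc = jabByK2.join(jc) // RDD[(Int, ((String, String), String))]

        jabc.collect().foreach(println)
        sc.stop()
      }
    }

Passing an explicit partition count to each join, as suggested in the replies above, slots straight into this shape: ja.join(jb, n) and jabByK2.join(jc, n).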