Re: Joining DataFrames - Causing Cartesian Product

2015-12-18 Thread Michael Armbrust
", "USER_DIM_USER_ID") > .withColumnRenamed("USER_CNTRY_ID","USER_DIM_COUNTRY_ID") > .as("userdim") > , userAndRetailDates("USER_ID") <=> $"userdim.USER_DIM_USER_ID" > && userAndRetailDates("US

Re: Joining DataFrames - Causing Cartesian Product

2015-12-18 Thread Prasad Ravilla
R_ID") .withColumnRenamed("USER_CNTRY_ID","USER_DIM_COUNTRY_ID") .as("userdim") , userAndRetailDates("USER_ID") <=> $"userdim.USER_DIM_USER_ID" && userAndRetailDates("USER_CNTRY_ID") <=> $"us

Re: Joining DataFrames - Causing Cartesian Product

2015-12-18 Thread Ted Yu
Can you try the latest 1.6.0 RC which includes SPARK-1 ? Cheers On Fri, Dec 18, 2015 at 7:38 AM, Prasad Ravilla wrote: > Hi, > > I am running into a performance issue when joining data frames created from > avro files using the spark-avro library. > > The data frames are created from 120K avro f

Joining DataFrames - Causing Cartesian Product

2015-12-18 Thread Prasad Ravilla
Hi, I am running into a performance issue when joining data frames created from avro files using the spark-avro library. The data frames are created from 120K avro files and the total size is around 1.5 TB. The two data frames are huge, with billions of records. The join for these two DataFrames