You can change the parallelism like the following:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf()
    conf.set('spark.sql.shuffle.partitions', '10')
    sc = SparkContext(conf=conf)
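As a rough end-to-end sketch (assuming PySpark 1.5.x with a SQLContext; the table names t1 and t2 are placeholders, the join columns are the ones from your mail):

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    # spark.sql.shuffle.partitions controls how many partitions Spark SQL
    # produces for shuffles (joins, group by), which is where the 100
    # partitions on the joined dataframe come from.
    conf = SparkConf().set('spark.sql.shuffle.partitions', '20')
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    # It can also be changed at runtime on an existing SQLContext:
    sqlContext.setConf('spark.sql.shuffle.partitions', '20')

    df1 = sqlContext.table('t1')   # placeholder tables
    df2 = sqlContext.table('t2')
    joined = df1.join(df2, ['col1', 'col2', 'col3', 'col4'])
    print(joined.rdd.getNumPartitions())   # should now report 20

Note this only sets the number of post-shuffle partitions; it does not by itself avoid the shuffle for the 4-column join you describe.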
On Fri, Jun 24, 2016 at 6:46 AM, Darshan Singh <darshan.m...@gmail.com> wrote:
> Hi,
>
> My default parallelism is 100. Now I join 2 dataframes with 20 partitions
> each, and the joined dataframe has 100 partitions. I want to know how to
> keep it at 20 (other than repartition and coalesce).
>
> Also, when I join these 2 dataframes I am using 4 columns as the join
> columns. The dataframes are partitioned on the first 2 of those columns,
> so in effect each partition should only need to be joined with its
> corresponding partition and not with the rest. So why is Spark shuffling
> all the data?
>
> Similarly, when my dataframe is partitioned by col1,col2 and I group by
> col1,col2,col3,col4, why does it shuffle everything, when it could sort
> each partition and do the grouping there?
>
> A bit confusing; I am using 1.5.1.
>
> Is this fixed in future versions?
>
> Thanks

--
Best Regards,
Ayan Guha