RE: Question about relationship between number of files and initial tasks(partitions)

2019-04-13 Thread email
Extending Arthur's question, I am facing the same problem

Re: Question about relationship between number of files and initial tasks(partitions)

2019-04-11 Thread Sagar Grover
Extending Arthur's question, I am facing the same problem (the number of partitions was huge: cores - 960, partitions - 16,000). I tried to decrease the number of partitions with coalesce, but the problem is unbalanced data. After using coalesce, it gives me a Java out-of-heap-space error. There was no out of
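The heap error above is consistent with how coalesce works: it merges existing partitions without a shuffle, so it cannot rebalance skewed data, and one merged partition can end up holding most of the rows. A minimal pure-Python sketch of that merging behavior (a simplification for illustration, not Spark's actual partition coalescer, which bins parent partitions by locality):

```python
def coalesce_sim(partitions, n):
    """Merge parent partitions into n groups without moving rows
    between groups (mimics coalesce's no-shuffle behavior)."""
    groups = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        groups[i % n].extend(part)  # whole parent partitions stay together
    return groups

# 4 skewed parent partitions: one holds almost all the rows
parents = [list(range(1000)), [1], [2], [3]]
merged = coalesce_sim(parents, 2)
sizes = [len(g) for g in merged]
# The group that absorbed the big parent stays huge: sizes == [1001, 2]
```

Because rows never cross group boundaries, the skew survives the merge; a repartition (which does a full shuffle) would balance it at the cost of moving data.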

Re: Question about relationship between number of files and initial tasks(partitions)

2019-04-10 Thread yeikel valdes
If you need to reduce the number of partitions you could also try df.coalesce

On Thu, 04 Apr 2019 06:52:26 -0700 jasonnerot...@gmail.com wrote:
> Have you tried something like this? spark.conf.set("spark.sql.shuffle.partitions", "5")
> On Wed, Apr 3, 2019 at 8:37 PM Arthur Li wrote:
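One caveat worth noting about the two suggestions in this message: spark.sql.shuffle.partitions only sets the partition count for stages produced by a shuffle (joins, aggregations), so it does not change the number of initial tasks, which is driven by the input files. A hedged PySpark fragment showing both options (assumes an existing SparkSession named spark and a DataFrame df):

```python
# Affects shuffle stages (joins, groupBy) only, not the initial file-scan stage
spark.conf.set("spark.sql.shuffle.partitions", "5")

# Shrinks an already-read DataFrame without a full shuffle,
# but cannot rebalance skewed partitions
df_small = df.coalesce(5)
```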

Re: Question about relationship between number of files and initial tasks(partitions)

2019-04-04 Thread Jason Nerothin
Have you tried something like this? spark.conf.set("spark.sql.shuffle.partitions", "5")

On Wed, Apr 3, 2019 at 8:37 PM Arthur Li wrote:
> Hi Sparkers,
>
> I noticed that in my spark application, the number of tasks in the first
> stage is equal to the number of files read by the
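On Arthur's original question: for file-based sources, Spark SQL packs input files into partitions using spark.sql.files.maxPartitionBytes (default 128 MB) and spark.sql.files.openCostInBytes (default 4 MB), which is why many small files produce many initial tasks. A simplified pure-Python sketch of that bin-packing (an approximation for illustration; the real FilePartition logic also splits large files and factors in parallelism, which this sketch ignores):

```python
def pack_files(file_sizes, max_partition_bytes=128 * 1024 * 1024,
               open_cost=4 * 1024 * 1024):
    """Greedy bin-packing: each file costs its size plus a fixed open
    cost; start a new partition when the current one would overflow."""
    partitions, current, current_bytes = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        cost = size + open_cost
        if current and current_bytes + cost > max_partition_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += cost
    if current:
        partitions.append(current)
    return partitions

# 16,000 tiny 1 MB files: each costs 5 MB with open cost, so 25 fit
# per 128 MB partition, yielding 640 initial tasks
tiny = [1 * 1024 * 1024] * 16000
print(len(pack_files(tiny)))
```

Under this model, lowering spark.sql.files.maxPartitionBytes raises the initial task count and raising it lowers the count, which is usually a gentler lever than coalescing after the read.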