Yes, these datasets are in HDFS. Earlier that task completed in 25 mins; now it's 15 + 20.
On Wed, Jun 24, 2015 at 3:16 PM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:

> Yes, it will introduce a shuffle stage in order to perform the
> repartitioning. So it’s more useful if you’re planning to do many
> downstream transformations for which you need the increased parallelism.
>
> Is this a dataset from HDFS?
>
> From: "ÐΞ€ρ@Ҝ (๏̯͡๏)"
> Date: Wednesday, June 24, 2015 at 6:11 PM
> To: Silvio Fiorito
> Cc: user
> Subject: Re: how to increase parallelism ?
>
> What that did was run a repartition stage with 174 tasks AND the
> actual .filter.map stage with 500 tasks.
>
> It actually doubled the number of stages.
>
> On Wed, Jun 24, 2015 at 12:01 PM, Silvio Fiorito
> <silvio.fior...@granturing.com> wrote:
>
>> Hi Deepak,
>>
>> Parallelism is controlled by the number of partitions. In this case,
>> how many partitions are there for the details RDD (likely 170)?
>>
>> You can check by running “details.partitions.length”. If you want to
>> increase parallelism you can do so by repartitioning, increasing the
>> number of partitions: “details.repartition(xxxx)”.
>>
>> Thanks,
>> Silvio
>>
>> From: "ÐΞ€ρ@Ҝ (๏̯͡๏)"
>> Date: Wednesday, June 24, 2015 at 1:57 PM
>> To: user
>> Subject: how to increase parallelism ?
>>
>> I have a filter.map that triggers 170 tasks. How can I increase it?
>>
>> Code:
>>
>> val viEvents = details.filter(_.get(14).asInstanceOf[Long] != NULL_VALUE)
>>   .map { vi => (vi.get(14).asInstanceOf[Long], vi) }
>>
>> Deepak
>
> --
> Deepak

--
Deepak
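For anyone following along: the reason more partitions gives more parallelism is that each key is hashed into one of `numPartitions` buckets, and each bucket becomes one task. Below is a minimal, self-contained Scala sketch of that rule (a simplified version of what Spark's `HashPartitioner` does; the real one also handles null keys, and this runs without Spark):

```scala
// Simplified sketch of Spark's HashPartitioner assignment rule:
// a key goes to partition nonNegativeMod(key.hashCode, numPartitions).
// (Illustration only -- not the actual Spark implementation.)
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw // Java's % can be negative
}

def partitionFor(key: Long, numPartitions: Int): Int =
  nonNegativeMod(key.hashCode, numPartitions)

// With 170 partitions a stage can run at most 170 tasks in parallel;
// repartitioning to 500 spreads the same keys over 500 buckets --
// at the cost of the extra shuffle stage discussed above.
val keys = 1L to 10L
println(keys.map(k => partitionFor(k, 170)))
println(keys.map(k => partitionFor(k, 500)))
```

Note that if the dataset comes from HDFS, the initial partition count is usually set by the number of input splits, which is why `repartition` (and its shuffle) is needed to change it afterwards.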