Thank you Jakob,
It works for me.

On Sat, Sep 10, 2016 at 12:54 AM, Jakob Odersky <ja...@odersky.com> wrote:

> > Hi Jakob, I have a DataFrame with about 10 partitions. Based on the exact
> > content of each partition I want to batch load some other data from a DB. I
> > cannot operate in parallel due to resource constraints I have, hence I want
> > to sequentially iterate over each partition and perform operations.
>
>
> Ah I see. I think in that case your best option is to run several
> jobs, selecting a different subset of your dataframe for each job and
> running them one after the other. One way to do that is to get the
> underlying RDD, tag each element with its partition's index, and then
> filter on that index and iterate over the matching elements. E.g.:
>
> val withPartitionIndex = df.rdd.mapPartitionsWithIndex((idx, it) =>
>   it.map(elem => (idx, elem)))
>
> // n is the number of partitions of the DataFrame
> val n = withPartitionIndex.getNumPartitions
> for (i <- 0 until n) {
>   withPartitionIndex.filter { case (idx, _) => idx == i }.foreach {
>     case (idx, elem) =>
>       // do something with elem
>   }
> }
>
> It's not the best use case for Spark, though, and will probably be a
> performance bottleneck.
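>
> If bringing the rows back to the driver is acceptable, a rough sketch
> along these lines might also work (RDD.toLocalIterator fetches one
> partition at a time, running a separate job per partition):
>
> df.rdd
>   .mapPartitionsWithIndex((idx, it) => it.map(elem => (idx, elem)))
>   .toLocalIterator
>   .foreach { case (idx, elem) =>
>     // runs on the driver, one partition after another
>   }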
>
> On Fri, Sep 9, 2016 at 11:45 AM, Jakob Odersky <ja...@odersky.com> wrote:
> > Hi Sujeet,
> >
> > going sequentially over all parallel, distributed data seems like a
> > counter-productive thing to do. What are you trying to accomplish?
> >
> > regards,
> > --Jakob
> >
> > On Fri, Sep 9, 2016 at 3:29 AM, sujeet jog <sujeet....@gmail.com> wrote:
> >> Hi,
> >> Is there a way to iterate over a DataFrame with n partitions sequentially?
> >>
> >>
> >> Thanks,
> >> Sujeet
> >>
>
