I don't think that would change the partitioning; try .repartition(). It isn't
necessary to write it out, let alone in Avro.

On Tue, Mar 23, 2021 at 8:45 PM "Yuri Oleynikov (יורי אולייניקוב)" <
yur...@gmail.com> wrote:

> Hi, Mohammed
> I think the reason only one executor is running, with a single
> partition, is that you have a single file, which may be read/loaded into
> memory as one unit.
>
> To achieve better parallelism, I'd suggest splitting the CSV file.
>
> Another question: why are you using an RDD at all?
> Just spark.read.option("header",
> true).csv(...).select(...).write.format("avro").save(...)
>
>
> > On 24 Mar 2021, at 03:19, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
> wrote:
> >
> > Hi,
> >
> > I have a 10 GB file that should be loaded into a Spark dataframe. The
> > file is CSV with a header, and we were using rdd.zipWithIndex to get the
> > column names and convert to Avro accordingly.
> >
> > I am assuming this is why it takes a long time: only one executor runs
> > and parallelism is never achieved. Is there an easy way to achieve
> > parallelism after filtering out the header?
> >
> > I am also interested in a solution that removes the header from the
> > file so I can supply my own schema. That way I can split the files.
> >
> > rdd.partitions is always 1 for this, even after repartitioning the
> > dataframe after zipWithIndex. Any help on this topic, please.
> >
> > Thanks,
> > Asmath
> >
> > Sent from my iPhone
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
>
>
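For the "remove the header so the file can be split" idea raised above, here is a plain-Python sketch using only the standard library (the helper name, file names, and chunk size are my own illustrative choices, not from the thread):

```python
# Sketch: strip the header line from a CSV and split the remaining rows into
# fixed-size chunk files, so they can be read in parallel with an explicit
# schema. All names and sizes here are illustrative.
import os
import tempfile

def split_csv_without_header(src, out_dir, lines_per_chunk):
    """Drop the header line of src and write the data rows into numbered
    chunk files of at most lines_per_chunk lines each. Returns the parsed
    header columns and the list of chunk file paths."""
    chunk_paths = []
    chunk = None
    with open(src) as f:
        header = f.readline()  # discarded; the schema is supplied separately
        for i, line in enumerate(f):
            if i % lines_per_chunk == 0:
                if chunk:
                    chunk.close()
                name = os.path.join(out_dir, f"part-{len(chunk_paths):05d}.csv")
                chunk_paths.append(name)
                chunk = open(name, "w")
            chunk.write(line)
    if chunk:
        chunk.close()
    return header.rstrip("\n").split(","), chunk_paths

# Tiny demonstration input: 1 header line + 10 data rows.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "big.csv")
with open(src, "w") as f:
    f.write("id,value\n")
    f.writelines(f"{i},{i * 10}\n" for i in range(10))

cols, parts = split_csv_without_header(src, tmp, lines_per_chunk=4)
assert cols == ["id", "value"]
assert len(parts) == 3  # 10 rows in chunks of 4 -> 3 files
```

The resulting header-less chunks can then be read in parallel, e.g. with spark.read.schema(...).csv(out_dir), supplying the schema explicitly instead of inferring it from a header row.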
