I don't think that would change the partitioning; try .repartition(). It isn't necessary to write it out, let alone in Avro.
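For reference, a minimal sketch of the DataFrame-only approach discussed below (the paths, partition count, and spark-avro package coordinates are placeholders; adjust for your environment):

```scala
// Minimal sketch: read a headered CSV and write Avro without any RDD step.
// Assumes the spark-avro package is on the classpath, e.g.
//   --packages org.apache.spark:spark-avro_2.12:3.1.1
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-avro").getOrCreate()

val df = spark.read
  .option("header", "true")       // consumes the header row; no zipWithIndex needed
  .option("inferSchema", "true")  // or pass .schema(mySchema) to skip inference
  .csv("/path/to/input.csv")      // hypothetical path

// Uncompressed CSV is splittable, so a 10 GB file is normally read as many
// partitions; repartition explicitly if you still see only one.
df.repartition(64)                // partition count is illustrative
  .write
  .format("avro")
  .save("/path/to/output")        // hypothetical path
```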
On Tue, Mar 23, 2021 at 8:45 PM "Yuri Oleynikov (יורי אולייניקוב)" <yur...@gmail.com> wrote:

> Hi, Mohammed
> I think the reason that only one executor is running, with a single
> partition, is that you have a single file that might be read/loaded into
> memory.
>
> In order to achieve better parallelism I'd suggest splitting the csv file.
>
> Another question: why are you using an RDD at all?
> Just spark.read.option("header",
> true).load().select(....).write.format("avro").save(...)
>
> > On 24 Mar 2021, at 03:19, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
> >
> > Hi,
> >
> > I have a 10gb file that should be loaded into a Spark dataframe. This file
> > is csv with a header, and we were using rdd.zipWithIndex to get the column
> > names and convert to avro accordingly.
> >
> > I am assuming this is why it takes a long time: only one executor runs and
> > it never achieves parallelism. Is there an easy way to achieve parallelism
> > after filtering out the header?
> >
> > I am also interested in a solution that can remove the header from the file
> > so I can give my own schema. That way I can split the files.
> >
> > Rdd.partitions is always 1 for this, even after repartitioning the
> > dataframe after zipWithIndex. Any help on this topic please.
> >
> > Thanks,
> > Asmath
> >
> > Sent from my iPhone
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org