Hi. The number of partitions is determined by the RDD used in the plan Spark creates. It uses NewHadoopRDD, which derives its partitions from getSplits of the input format it is using; here that is FilteringParquetRowInputFormat, a subclass of ParquetInputFormat. To change the number of partitions, you would have to write a new input format and make NewHadoopRDD use it in your plan, or, if you are willing to shuffle, you can use the repartition API without any code change.
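For example, a rough sketch of the shuffle-based approach from the spark-shell (the input path and the target of 200 partitions below are only placeholders):

    // Read the Parquet data; Spark creates one partition per input split
    // reported by the underlying input format.
    val df = sqlContext.parquetFile("/path/to/parquet")
    println(df.rdd.partitions.length)            // current partition count

    // repartition() always shuffles and produces evenly sized partitions.
    val evened = df.rdd.repartition(200)

    // coalesce() avoids a shuffle when only reducing partitions, which can
    // leave them skewed; shuffle = true makes it behave like repartition().
    val merged = df.rdd.coalesce(200, shuffle = true)

    println(evened.partitions.length)            // 200
    println(merged.partitions.length)            // 200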
Thanks & Regards. On Tue, May 5, 2015 at 7:56 PM, Masf <masfwo...@gmail.com> wrote: > Hi Eric. > > Q1: > When I read parquet files, I've tested that Spark generates so many > partitions as parquet files exist in the path. > > Q2: > To reduce the number of partitions you can use rdd.repartition(x), x=> > number of partitions. Depend on your case, repartition could be a heavy task > > > Regards. > Miguel. > > On Tue, May 5, 2015 at 3:56 PM, Eric Eijkelenboom < > eric.eijkelenb...@gmail.com> wrote: > >> Hello guys >> >> Q1: How does Spark determine the number of partitions when reading a >> Parquet file? >> >> val df = sqlContext.parquetFile(path) >> >> Is it some way related to the number of Parquet row groups in my input? >> >> Q2: How can I reduce this number of partitions? Doing this: >> >> df.rdd.coalesce(200).count >> >> from the spark-shell causes job execution to hang… >> >> Any ideas? Thank you in advance. >> >> Eric >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> > > > -- > > > Saludos. > Miguel Ángel >