I think we'd need to see the code that loads the df.

Parallelism and partition count are related, but they're not the same. I've
found the documentation fuzzy on this, but it looks like
spark.default.parallelism is what Spark uses for partitioning when it has no
other guidance. I'm also under the impression (and I could be wrong here)
that the data loading step has some impact on partitioning.
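For what it's worth, you can ask Spark directly how many partitions it
actually created. A quick check, assuming df is the DataFrame in question:

    // Print the partition count Spark actually chose for this DataFrame
    println(s"partitions: ${df.rdd.getNumPartitions}")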

In any case, I think it would be easier to help with the df loading code in hand.
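If the count turns out higher than you want, you can force it down after
loading. This is just a sketch, assuming the same df; coalesce avoids a
shuffle, while repartition does a full one:

    // Shrink to 4 partitions without shuffling (can only reduce the count)
    val df4 = df.coalesce(4)
    // Or redistribute across exactly 4 partitions with a full shuffle
    val df4Shuffled = df.repartition(4)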

Good luck!

Gary Lucas

On 26 October 2017 at 09:35, Noorul Islam Kamal Malmiyoda
<noo...@noorul.com> wrote:

> Hi all,
>
> I have the following spark configuration
>
> spark.app.name=Test
> spark.cassandra.connection.host=127.0.0.1
> spark.cassandra.connection.keep_alive_ms=5000
> spark.cassandra.connection.port=10000
> spark.cassandra.connection.timeout_ms=30000
> spark.cleaner.ttl=3600
> spark.default.parallelism=4
> spark.master=local[2]
> spark.ui.enabled=false
> spark.ui.showConsoleProgress=false
>
> Because I am setting spark.default.parallelism to 4, I was expecting
> only 4 Spark partitions, but it looks like that is not the case.
>
> When I do the following
>
>     df.foreachPartition { partition =>
>       val groupedPartition = partition.toList.grouped(3).toList
>       println("Grouped partition " + groupedPartition)
>     }
>
> There are many print statements with empty lists at the top; only the
> relevant partitions appear at the bottom. Is there a way to control the
> number of partitions?
>
> Regards,
> Noorul
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
