Those settings apply when a shuffle happens, but they don't affect how the
data is partitioned when it is initially read, for example
spark.read.parquet("path/to/input"). For HDFS / S3 I think it depends on
how the data is split into chunks, but if there are lots of small chunks
Spark will automatically combine several of them into a single partition.
There are various settings depending on what you're reading from.
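For file-based sources, the settings I'm aware of are
spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes (the
values below are the defaults as I remember them, so double-check the docs
for your Spark version). A rough, untested sketch:

spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728L) // ~128 MB max per read partition (default, I believe)
spark.conf.set("spark.sql.files.openCostInBytes", 4194304L)     // estimated per-file cost; encourages packing small files together
// then read as usual and check the result with df.rdd.getNumPartitions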

val df = spark.read.parquet("path/to/input") // partitioning will depend on the data
val df2 = df.groupBy("thing").count()        // a shuffle happened, so the shuffle partitioning configuration applies
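You can check what you actually got with rdd.getNumPartitions, and override
the shuffle setting before running the aggregation. Another untested sketch
(pick a number that suits your data):

println(df.rdd.getNumPartitions)  // however many partitions the read produced
println(df2.rdd.getNumPartitions) // 200 by default, i.e. spark.sql.shuffle.partitions
spark.conf.set("spark.sql.shuffle.partitions", 50)
val df3 = df.groupBy("thing").count()
println(df3.rdd.getNumPartitions) // should now be 50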


Tip: gzip files can't be split, so if you read a gzip file everything will
be in one partition. That's a good reason to avoid large gzip files. :-)

If you don't have a shuffle but you want to change how many partitions
there are, you will need to coalesce or repartition.
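For example (untested, and the numbers are arbitrary):

import org.apache.spark.sql.functions.col

val fewer = df.coalesce(10)                  // no shuffle; can only reduce the partition count
val more  = df.repartition(200)              // full shuffle; can increase or decrease the count
val byKey = df.repartition(50, col("thing")) // shuffle, hash-partitioned by the given column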


--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001


On Thu, Oct 26, 2017 at 11:31 AM, lucas.g...@gmail.com <lucas.g...@gmail.com
> wrote:

> Thanks Daniel!
>
> I've been wondering that for ages!
>
> I.e. where my JDBC-sourced datasets are coming up with 200 partitions on
> write to S3.
>
> What do you mean by "(except for the initial read)"?
>
> Can you explain that a bit further?
>
> Gary Lucas
>
> On 26 October 2017 at 11:28, Daniel Siegmann <dsiegmann@securityscorecard.
> io> wrote:
>
>> When working with datasets, Spark uses spark.sql.shuffle.partitions. It
>> defaults to 200. Between that and the default parallelism you can control
>> the number of partitions (except for the initial read).
>>
>> More info here: http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
>>
>> I have no idea why it defaults to a fixed 200 (while default parallelism
>> defaults to a number scaled to your number of cores), or why there are two
>> separate configuration properties.
>>
>>
>> --
>> Daniel Siegmann
>> Senior Software Engineer
>> *SecurityScorecard Inc.*
>> 214 W 29th Street, 5th Floor
>> New York, NY 10001
>>
>>
>> On Thu, Oct 26, 2017 at 9:53 AM, Deepak Sharma <deepakmc...@gmail.com>
>> wrote:
>>
>>> I guess the issue is that spark.default.parallelism is ignored when you are
>>> working with DataFrames. It is supposed to apply only to raw RDDs.
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Thu, Oct 26, 2017 at 10:05 PM, Noorul Islam Kamal Malmiyoda <
>>> noo...@noorul.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have the following spark configuration
>>>>
>>>> spark.app.name=Test
>>>> spark.cassandra.connection.host=127.0.0.1
>>>> spark.cassandra.connection.keep_alive_ms=5000
>>>> spark.cassandra.connection.port=10000
>>>> spark.cassandra.connection.timeout_ms=30000
>>>> spark.cleaner.ttl=3600
>>>> spark.default.parallelism=4
>>>> spark.master=local[2]
>>>> spark.ui.enabled=false
>>>> spark.ui.showConsoleProgress=false
>>>>
>>>> Because I am setting spark.default.parallelism to 4, I was expecting
>>>> only 4 Spark partitions, but it looks like that is not the case.
>>>>
>>>> When I do the following
>>>>
>>>>     df.foreachPartition { partition =>
>>>>       val groupedPartition = partition.toList.grouped(3).toList
>>>>       println("Grouped partition " + groupedPartition)
>>>>     }
>>>>
>>>> There are too many print statements with an empty list at the top. Only
>>>> the relevant partitions are at the bottom. Is there a way to control the
>>>> number of partitions?
>>>>
>>>> Regards,
>>>> Noorul
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks
>>> Deepak
>>> www.bigdatabig.com
>>> www.keosha.net
>>>
>>
>>
>
