OK, so for JDBC I presume it defaults to a single partition if you don't provide partitioning metadata?
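[A minimal sketch of both cases, for reference. The connection URL, table name, and column bounds below are hypothetical, and this is untested against a real database:]

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("jdbc-partitions")
  .getOrCreate()

// With no partitioning options, the JDBC source issues one query and
// yields a single partition:
val single = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host/mydb") // hypothetical
  .option("dbtable", "events")                     // hypothetical
  .load()

// Supplying partitionColumn / lowerBound / upperBound / numPartitions
// makes Spark issue numPartitions parallel range queries instead:
val partitioned = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host/mydb")
  .option("dbtable", "events")
  .option("partitionColumn", "id") // numeric, date, or timestamp column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()

println(single.rdd.getNumPartitions)      // 1
println(partitioned.rdd.getNumPartitions) // 8
```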
Thanks!
Gary

On 26 October 2017 at 13:43, Daniel Siegmann <dsiegm...@securityscorecard.io> wrote:

> Those settings apply when a shuffle happens. But they don't affect the way
> the data will be partitioned when it is initially read, for example
> spark.read.parquet("path/to/input"). So for HDFS / S3 I think it depends
> on how the data is split into chunks, but if there are lots of small chunks
> Spark will automatically merge them into small partitions. There are going
> to be various settings depending on what you're reading from.
>
> val df = spark.read.parquet("path/to/input") // partitioning will depend on the data
> val df2 = df.groupBy("thing").count() // a shuffle happened, so shuffle partitioning configuration applies
>
> Tip: gzip files can't be split, so if you read a gzip file everything will
> be in one partition. That's a good reason to avoid large gzip files. :-)
>
> If you don't have a shuffle but you want to change how many partitions
> there are, you will need to coalesce or repartition.
>
> --
> Daniel Siegmann
> Senior Software Engineer
> SecurityScorecard Inc.
> 214 W 29th Street, 5th Floor
> New York, NY 10001
>
> On Thu, Oct 26, 2017 at 11:31 AM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:
>
>> Thanks Daniel!
>>
>> I've been wondering that for ages!
>>
>> I.e. where my JDBC-sourced datasets are coming up with 200 partitions on
>> write to S3.
>>
>> What do you mean by "(except for the initial read)"?
>>
>> Can you explain that a bit further?
>>
>> Gary Lucas
>>
>> On 26 October 2017 at 11:28, Daniel Siegmann <dsiegm...@securityscorecard.io> wrote:
>>
>>> When working with Datasets, Spark uses spark.sql.shuffle.partitions. It
>>> defaults to 200.
>>> Between that and the default parallelism you can control
>>> the number of partitions (except for the initial read).
>>>
>>> More info here:
>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
>>>
>>> I have no idea why it defaults to a fixed 200 (while default parallelism
>>> defaults to a number scaled to your number of cores), or why there are two
>>> separate configuration properties.
>>>
>>> --
>>> Daniel Siegmann
>>> Senior Software Engineer
>>> SecurityScorecard Inc.
>>>
>>> On Thu, Oct 26, 2017 at 9:53 AM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>
>>>> I guess the issue is that spark.default.parallelism is ignored when you
>>>> are working with DataFrames. It is supposed to work only with raw RDDs.
>>>>
>>>> Thanks
>>>> Deepak
>>>>
>>>> On Thu, Oct 26, 2017 at 10:05 PM, Noorul Islam Kamal Malmiyoda <noo...@noorul.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have the following Spark configuration:
>>>>>
>>>>> spark.app.name=Test
>>>>> spark.cassandra.connection.host=127.0.0.1
>>>>> spark.cassandra.connection.keep_alive_ms=5000
>>>>> spark.cassandra.connection.port=10000
>>>>> spark.cassandra.connection.timeout_ms=30000
>>>>> spark.cleaner.ttl=3600
>>>>> spark.default.parallelism=4
>>>>> spark.master=local[2]
>>>>> spark.ui.enabled=false
>>>>> spark.ui.showConsoleProgress=false
>>>>>
>>>>> Because I am setting spark.default.parallelism to 4, I was expecting
>>>>> only 4 Spark partitions.
>>>>> But it looks like that is not the case.
>>>>>
>>>>> When I do the following:
>>>>>
>>>>> df.foreachPartition { partition =>
>>>>>   val groupedPartition = partition.toList.grouped(3).toList
>>>>>   println("Grouped partition " + groupedPartition)
>>>>> }
>>>>>
>>>>> there are too many print statements with an empty list at the top. Only
>>>>> the relevant partitions are at the bottom. Is there a way to control
>>>>> the number of partitions?
>>>>>
>>>>> Regards,
>>>>> Noorul
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>
>>>> --
>>>> Thanks
>>>> Deepak
>>>> www.bigdatabig.com
>>>> www.keosha.net
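
[Connecting the answers above: after a shuffle, a Dataset ends up with spark.sql.shuffle.partitions partitions (200 by default), so with only a handful of rows most partitions are empty, and the foreachPartition loop prints an empty grouped list for each one. A plain-Scala sketch of what each task sees — no Spark required, and the row values are made up:]

```scala
// Each foreachPartition task receives an Iterator over just its own
// partition's rows. With 200 shuffle partitions and only a few rows,
// most iterators are empty, so grouping them yields List().
def describePartition(partition: Iterator[Int]): String =
  "Grouped partition " + partition.toList.grouped(3).toList

println(describePartition(Iterator.empty))       // Grouped partition List()
println(describePartition(Iterator(1, 2, 3, 4)))
// Grouped partition List(List(1, 2, 3), List(4))

// On the Spark side, the fix is either to lower the shuffle partition
// count, e.g. spark.conf.set("spark.sql.shuffle.partitions", "4"),
// or to call df.repartition(4) / df.coalesce(4) before writing.
```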