The problem with 'spark.sql.shuffle.partitions' is that it seems to need to be
set before the Spark session is created (I guess?). But ideally, I want to
partition by column during a join / group-by (something roughly like
repartitionBy(partitionExpression: Column*) on Dataset, per
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset).
That way I could vary the partition count based on the data.
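For what it's worth, a sketch of what I'd like to express in code. This assumes the repartition(partitionExprs: Column*) overloads behave as the Scala docs describe, and that spark.conf.set can change the shuffle-partition setting on a live session — the latter I have not verified:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("partition-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

// Hash-partition by column; the resulting partition count falls back to
// whatever spark.sql.shuffle.partitions is set to.
val byColumn = df.repartition(col("key"))

// Or pin the partition count per Dataset, independent of the global setting.
val byColumnAndCount = df.repartition(8, col("key"))
println(byColumnAndCount.rdd.getNumPartitions) // 8

// If this property is in fact adjustable at runtime (unverified), it could
// be tuned per query on a shared long-running session:
spark.conf.set("spark.sql.shuffle.partitions", "64")
```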

Thanks,
Muthu

On Wed, Jul 19, 2017 at 8:23 AM, ayan guha <guha.a...@gmail.com> wrote:

> You can use spark.sql.shuffle.partitions to adjust the amount of parallelism.
>
> On Wed, Jul 19, 2017 at 11:41 PM, muthu <bablo...@gmail.com> wrote:
>
>> Hello there,
>>
>> Thank you for looking into the question.
>>
>> >Does the partition count of the df depend on the groupBy fields?
>> Either an absolute partition number or a column value to determine the
>> partition count would be fine for me (which is similar to repartition(),
>> I suppose).
>>
>> >Also, is the performance of groupBy-agg comparable to
>> reduceByKey/aggregateByKey?
>> In theory the DataFrame/Dataset APIs are supposed to do better, as they
>> optimize the execution order and so on by building an effective query plan.
>>
>> Currently I am working around this by spinning up a new spark-submit per
>> query request and setting 'spark.sql.shuffle.partitions' each time. Ideally,
>> we would have a long-running application that reuses the same Spark session
>> and runs one or more queries in FAIR scheduling mode.
>>
>> Thanks,
>> Muthu
>>
>>
>>
>> On Wed, Jul 19, 2017 at 6:03 AM, qihuagao [via Apache Spark User List]
>> wrote:
>>
>>> Also interested in this.
>>> Does the partition count of the df depend on the groupBy fields?
>>> Also, is the performance of groupBy-agg comparable to
>>> reduceByKey/aggregateByKey?
>>>
>>> ------------------------------
>>> If you reply to this email, your message will be added to the discussion
>>> below:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/DataFram
>>> e-join-groupBy-agg-question-tp28849p28879.html
>>> To start a new topic under Apache Spark User List, email [hidden email]
>>> <http:///user/SendEmail.jtp?type=node&node=28880&i=1>
>>> To unsubscribe from DataFrame --- join / groupBy-agg question..., click
>>> here.
>>> NAML
>>> <http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>>>
>>
>>
>> ------------------------------
>> View this message in context: Re: DataFrame --- join / groupBy-agg
>> question...
>> <http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-join-groupBy-agg-question-tp28849p28880.html>
>>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
