Check this link, which explains the difference between spark.sql.shuffle.partitions and spark.default.parallelism:

https://sparkbyexamples.com/spark/difference-between-spark-sql-shuffle-partitions-and-spark-default-parallelism/

You can set it, e.g.

        spark.conf.set("spark.default.parallelism", value)

though note that spark.default.parallelism is read when the SparkContext is
created, so it is more reliable to set it at session build time or via
spark-submit --conf than at runtime.


Have a look at Streaming statistics in the Spark UI, especially *Processing
Time*, defined there as the time taken to process all jobs of a batch.
*Scheduling Delay* and *Total Delay* are additional indicators of health.
Then decide how to set the value.
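If you are on Structured Streaming, as in your job below, the same numbers can
also be captured programmatically with a StreamingQueryListener; a rough Scala
sketch (the println formatting is illustrative, and spark is assumed to be
your SparkSession):

        import org.apache.spark.sql.streaming.StreamingQueryListener
        import org.apache.spark.sql.streaming.StreamingQueryListener._

        // Logs per-batch processing time and throughput after each micro-batch.
        spark.streams.addListener(new StreamingQueryListener {
          override def onQueryStarted(e: QueryStartedEvent): Unit = ()
          override def onQueryTerminated(e: QueryTerminatedEvent): Unit = ()
          override def onQueryProgress(e: QueryProgressEvent): Unit = {
            val p = e.progress
            println(s"batch=${p.batchId} " +
              s"processingMs=${p.durationMs.get("triggerExecution")} " +
              s"rows/sec=${p.processedRowsPerSecond}")
          }
        })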


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 14 Mar 2023 at 16:04, Emmanouil Kritharakis <
kritharakismano...@gmail.com> wrote:

> Yes, I need to check the performance of my streaming job in terms of
> latency and throughput. Is there any working example of how to increase the
> parallelism with Spark Structured Streaming using Dataset data structures?
> Thanks in advance.
>
> Kind regards,
>
> ------------------------------------------------------------------
>
> Emmanouil (Manos) Kritharakis
>
> Ph.D. candidate in the Department of Computer Science
> <https://sites.bu.edu/casp/people/ekritharakis/>
>
> Boston University
>
>
> On Tue, Mar 14, 2023 at 12:01 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> What benefit are you expecting from increasing parallelism? Better
>> throughput?
>>
>> On Tue, 14 Mar 2023 at 15:58, Emmanouil Kritharakis <
>> kritharakismano...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I hope this email finds you well!
>>>
>>> I have a simple dataflow in which I read from a Kafka topic, perform a
>>> map transformation, and then write the result to another topic. Based on
>>> your documentation here
>>> <https://spark.apache.org/docs/3.3.2/structured-streaming-kafka-integration.html#content>,
>>> I need to work with Dataset data structures. Even though my solution works,
>>> I need to increase the parallelism. The spark documentation includes a lot
>>> of parameters that I can change based on specific data structures like
>>> *spark.default.parallelism* or *spark.sql.shuffle.partitions*. The
>>> former is the default number of partitions in RDDs returned by
>>> transformations like join and reduceByKey, while the latter is not
>>> recommended for structured streaming, as described in the documentation:
>>> "Note: For structured streaming, this configuration cannot be changed
>>> between query restarts from the same checkpoint location".
>>>
>>> So my question is how can I increase the parallelism for a simple
>>> dataflow based on datasets with a map transformation only?
>>>
>>> I am looking forward to hearing from you as soon as possible. Thanks in
>>> advance!
>>>
>>> Kind regards,
>>>
>>> ------------------------------------------------------------------
>>>
>>> Emmanouil (Manos) Kritharakis
>>>
>>> Ph.D. candidate in the Department of Computer Science
>>> <https://sites.bu.edu/casp/people/ekritharakis/>
>>>
>>> Boston University
>>>
>>
