Re: Change parallelism number in Spark Streaming

Jacek Laskowski Thu, 27 Jun 2019 03:33:08 -0700

Hi,

I've got a talk "The internals of stateful stream processing in Spark
Structured Streaming" at https://dataxday.fr/ today and am going to include
the tool on the slides to thank you for the work. Thanks.


Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
The Internals of Spark SQL https://bit.ly/spark-sql-internals
The Internals of Spark Structured Streaming
https://bit.ly/spark-structured-streaming
The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
Follow me at https://twitter.com/jaceklaskowski



On Thu, Jun 27, 2019 at 3:32 AM Jungtaek Lim <kabh...@gmail.com> wrote:

> Glad to help, Jacek.
>
> I'm happy you're doing similar thing, which means it could be pretty
> useful for others as well. Looks like it might be good enough to contribute
> state source and sink. I'll sort out my code and submit a PR.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
> On Thu, Jun 27, 2019 at 7:54 AM Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi Jungtaek,
>>
>> That's very helpful to have the state source. As a matter of fact I've
>> just this week been working on a similar tool (!) and have been wondering
>> how to recreate the schema of the state key and value. You've helped me a
>> lot. Thanks.
>>
>> Jacek
>>
>> On Wed, 26 Jun 2019, 23:58 Jungtaek Lim, <kabh...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> you could consider state operator's partition numbers as "max
>>> parallelism", as parallelism can be reduced via applying coalesce. It would
>>> be effectively working similar as key groups.
>>>
>>> If you're also considering offline query, there's a tool to manipulate
>>> state which enables reading and writing state in structured streaming,
>>> achieving rescaling and schema evolution.
>>>
>>> https://github.com/HeartSaVioR/spark-state-tools
>>> (DISCLAIMER: I'm an author of this tool.)
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Thu, Jun 27, 2019 at 4:48 AM Rong, Jialei <jia...@amazon.com.invalid>
>>> wrote:
>>>
>>>> Thank you for your quick reply!
>>>>
>>>> Is there any plan to improve this?
>>>>
>>>> I asked this question due to some investigation on comparing those
>>>> state of art streaming systems, among which Flink and DataFlow allow
>>>> changing parallelism number, and by my knowledge of Spark Streaming, it
>>>> seems it is also able to do that: if some “key interval” concept is used,
>>>> then state can somehow decoupled from partition number by consistent
>>>> hashing.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Regards
>>>>
>>>> Jialei
>>>>
>>>>
>>>>
>>>> *From: *Jacek Laskowski <ja...@japila.pl>
>>>> *Date: *Wednesday, June 26, 2019 at 11:00 AM
>>>> *To: *"Rong, Jialei" <jia...@amazon.com.invalid>
>>>> *Cc: *"user @spark" <user@spark.apache.org>
>>>> *Subject: *Re: Change parallelism number in Spark Streaming
>>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> It's not allowed to change the numer of partitions after your streaming
>>>> query is started.
>>>>
>>>>
>>>>
>>>> The reason is exactly the number of state stores which is exactly the
>>>> number of partitions (perhaps multiplied by the number of stateful
>>>> operators).
>>>>
>>>>
>>>>
>>>> I think you'll even get a warning or an exception when you change it
>>>> after restarting the query.
>>>>
>>>>
>>>>
>>>> The number of partitions is stored in a checkpoint location.
>>>>
>>>>
>>>>
>>>> Jacek
>>>>
>>>>
>>>>
>>>> On Wed, 26 Jun 2019, 19:30 Rong, Jialei, <jia...@amazon.com.invalid>
>>>> wrote:
>>>>
>>>> Hi Dear Spark Expert
>>>>
>>>>
>>>>
>>>> I’m curious about a question regarding Spark Streaming/Structured
>>>> Streaming: whether it allows to change parallelism number(the default one
>>>> or the one specified in particular operator) in a stream having stateful
>>>> transform/operator? Whether this will cause my checkpointed state get
>>>> messed up?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Regards
>>>>
>>>> Jialei
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> Name : Jungtaek Lim
>>> Blog : http://medium.com/@heartsavior
>>> Twitter : http://twitter.com/heartsavior
>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>
>>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>

Re: Change parallelism number in Spark Streaming

Reply via email to