Hi, I've got a talk "The internals of stateful stream processing in Spark Structured Streaming" at https://dataxday.fr/ today and am going to include the tool on the slides to thank you for the work. Thanks.
Pozdrawiam, Jacek Laskowski ---- https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski On Thu, Jun 27, 2019 at 3:32 AM Jungtaek Lim <kabh...@gmail.com> wrote: > Glad to help, Jacek. > > I'm happy you're doing similar thing, which means it could be pretty > useful for others as well. Looks like it might be good enough to contribute > state source and sink. I'll sort out my code and submit a PR. > > Thanks, > Jungtaek Lim (HeartSaVioR) > > > On Thu, Jun 27, 2019 at 7:54 AM Jacek Laskowski <ja...@japila.pl> wrote: > >> Hi Jungtaek, >> >> That's very helpful to have the state source. As a matter of fact I've >> just this week been working on a similar tool (!) and have been wondering >> how to recreate the schema of the state key and value. You've helped me a >> lot. Thanks. >> >> Jacek >> >> On Wed, 26 Jun 2019, 23:58 Jungtaek Lim, <kabh...@gmail.com> wrote: >> >>> Hi, >>> >>> you could consider state operator's partition numbers as "max >>> parallelism", as parallelism can be reduced via applying coalesce. It would >>> be effectively working similar as key groups. >>> >>> If you're also considering offline query, there's a tool to manipulate >>> state which enables reading and writing state in structured streaming, >>> achieving rescaling and schema evolution. >>> >>> https://github.com/HeartSaVioR/spark-state-tools >>> (DISCLAIMER: I'm an author of this tool.) >>> >>> Thanks, >>> Jungtaek Lim (HeartSaVioR) >>> >>> On Thu, Jun 27, 2019 at 4:48 AM Rong, Jialei <jia...@amazon.com.invalid> >>> wrote: >>> >>>> Thank you for your quick reply! >>>> >>>> Is there any plan to improve this? >>>> >>>> I asked this question due to some investigation on comparing those >>>> state of art streaming systems, among which Flink and DataFlow allow >>>> changing parallelism number, and by my knowledge of Spark Streaming, it >>>> seems it is also able to do that: if some “key interval” concept is used, >>>> then state can somehow decoupled from partition number by consistent >>>> hashing. >>>> >>>> >>>> >>>> >>>> >>>> Regards >>>> >>>> Jialei >>>> >>>> >>>> >>>> *From: *Jacek Laskowski <ja...@japila.pl> >>>> *Date: *Wednesday, June 26, 2019 at 11:00 AM >>>> *To: *"Rong, Jialei" <jia...@amazon.com.invalid> >>>> *Cc: *"user @spark" <user@spark.apache.org> >>>> *Subject: *Re: Change parallelism number in Spark Streaming >>>> >>>> >>>> >>>> Hi, >>>> >>>> >>>> >>>> It's not allowed to change the numer of partitions after your streaming >>>> query is started. >>>> >>>> >>>> >>>> The reason is exactly the number of state stores which is exactly the >>>> number of partitions (perhaps multiplied by the number of stateful >>>> operators). >>>> >>>> >>>> >>>> I think you'll even get a warning or an exception when you change it >>>> after restarting the query. >>>> >>>> >>>> >>>> The number of partitions is stored in a checkpoint location. >>>> >>>> >>>> >>>> Jacek >>>> >>>> >>>> >>>> On Wed, 26 Jun 2019, 19:30 Rong, Jialei, <jia...@amazon.com.invalid> >>>> wrote: >>>> >>>> Hi Dear Spark Expert >>>> >>>> >>>> >>>> I’m curious about a question regarding Spark Streaming/Structured >>>> Streaming: whether it allows to change parallelism number(the default one >>>> or the one specified in particular operator) in a stream having stateful >>>> transform/operator? Whether this will cause my checkpointed state get >>>> messed up? >>>> >>>> >>>> >>>> >>>> >>>> Regards >>>> >>>> Jialei >>>> >>>> >>>> >>>> >>> >>> -- >>> Name : Jungtaek Lim >>> Blog : http://medium.com/@heartsavior >>> Twitter : http://twitter.com/heartsavior >>> LinkedIn : http://www.linkedin.com/in/heartsavior >>> >> > > -- > Name : Jungtaek Lim > Blog : http://medium.com/@heartsavior > Twitter : http://twitter.com/heartsavior > LinkedIn : http://www.linkedin.com/in/heartsavior >