Oh, that looks neat! Thx, will read up on that. On Mon, May 16, 2016, 14:10 Ofir Manor <ofir.ma...@equalum.io> wrote:
> Yuval,
> Not sure what is in scope to land in 2.0, but there is another new infra bit to manage state more efficiently, called State Store, whose initial version is already committed:
> SPARK-13809 - State Store: A new framework for state management for computing Streaming Aggregates
> https://issues.apache.org/jira/browse/SPARK-13809
> The pull request eventually links to the design doc, which discusses the limits of updateStateByKey and mapWithState and how those will be handled...
>
> At a quick glance at the code, it already seems to be used in streaming aggregations.
>
> Just my two cents,
>
> Ofir Manor
> Co-Founder & CTO | Equalum
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
> On Mon, May 16, 2016 at 11:33 AM, Yuval Itzchakov <yuva...@gmail.com> wrote:
>
>> Also, re-reading the relevant part from the Structured Streaming documentation
>> (https://docs.google.com/document/d/1NHKdRSNCbCmJbinLmZuqNA1Pt6CGpFnLVRbzuDUcZVM/edit#heading=h.335my4b18x6x):
>>
>> Discretized streams (aka DStream)
>>
>> Unlike Storm, DStream exposes a higher-level API similar to RDDs. There are two main challenges with DStream:
>>
>> 1. Similar to Storm, it exposes a monotonic system (processing) time metric, which makes support for event time difficult.
>> 2. Its APIs are tied to the underlying microbatch execution model; as a result, inflexibilities arise, such as changing the underlying batch interval requiring a change to the window size.
>>
>> RQ addresses the above:
>>
>> 1. RQ operations support both system time and event time.
>> 2. RQ APIs are decoupled from the underlying execution model. As a matter of fact, it is possible to implement an alternative, non-microbatch-based engine for RQ.
>> 3. In addition, due to the declarative specification of operations, RQ leverages a relational query optimizer and can often generate more efficient query plans.
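The State Store mentioned above can be pictured, roughly, as a versioned per-key store that a streaming aggregation folds each microbatch into, committing a fresh snapshot per batch so earlier versions remain available for recovery. A minimal, Spark-free Python sketch of that idea - the class and method names are invented for illustration and are not Spark's actual internals:

```python
class ToyStateStore:
    """A tiny versioned key-value store, loosely modeled on the idea
    behind SPARK-13809 (not Spark's real implementation)."""

    def __init__(self):
        self.versions = {0: {}}  # version number -> committed snapshot
        self.current = 0

    def update_batch(self, batch):
        # Start from the last committed snapshot and fold in this
        # batch's (key, value) rows as a running sum per key.
        snapshot = dict(self.versions[self.current])
        for key, value in batch:
            snapshot[key] = snapshot.get(key, 0) + value
        # Commit as a new version; older versions stay readable,
        # which is what makes recovery/replay possible.
        self.current += 1
        self.versions[self.current] = snapshot
        return snapshot


store = ToyStateStore()
store.update_batch([("a", 1), ("b", 2)])
result = store.update_batch([("a", 3)])
# result == {"a": 4, "b": 2}; version 1 still holds {"a": 1, "b": 2}
```

Each microbatch only touches the keys it contains, yet every committed version is a complete snapshot - the incremental-update-with-checkpointing trade-off the design doc discusses.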
>>
>> This doesn't seem to address the actual underlying implementation of how things like "mapWithState" are going to be translated into RQ, and I think that's the hole that's causing my misunderstanding.
>>
>> On Mon, May 16, 2016 at 1:36 AM Yuval Itzchakov <yuva...@gmail.com> wrote:
>>
>>> Hi Ofir,
>>> Thanks for the elaborate answer. I have read both documents, where they touch lightly on infinite DataFrames/Datasets. However, they do not go in depth as to how existing transformations on DStreams, for example, will be translated into the Dataset APIs. I've been browsing the 2.0 branch and have not yet been able to understand how they correlate.
>>>
>>> Also, placing SparkSession in the sql package seems like a peculiar choice, since this is going to be the global abstraction over SparkContext/StreamingContext from now on.
>>>
>>> On Sun, May 15, 2016, 23:42 Ofir Manor <ofir.ma...@equalum.io> wrote:
>>>
>>>> Hi Yuval,
>>>> let me share my understanding, based on similar questions I had.
>>>> First, Spark 2.x aims to replace a whole bunch of its APIs with just two main ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset (a merge of Dataset and DataFrame, which is why it inherits all the Spark SQL goodness) - while RDD remains a low-level API only for special cases. The new Dataset should also support both batch and streaming, (eventually) replacing DStream as well. See the design docs in SPARK-13485 (unified API) and SPARK-8360 (Structured Streaming) for a good intro.
>>>> However, as you noted, not all of it will be fully delivered in 2.0. For example, it seems that streaming from / to Kafka using Structured Streaming didn't make it (so far?) into 2.0 (which is a showstopper for me).
>>>> Anyway, as far as I understand, you should be able to apply stateful operators (non-RDD) on Datasets (for example, the new event-time window processing in SPARK-8360).
>>>> The gap I see is mostly that only limited streaming sources / sinks have been migrated to the new (richer) API and semantics.
>>>> Anyway, I'm pretty sure that once 2.0 gets to RC, the documentation and examples will align with the current offering...
>>>>
>>>> Ofir Manor
>>>> Co-Founder & CTO | Equalum
>>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>>
>>>> On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov <yuva...@gmail.com> wrote:
>>>>
>>>>> I've been reading about and watching videos on the upcoming Spark 2.0 release, which brings us Structured Streaming. One thing I've yet to understand is how this relates to the current way of working with streaming in Spark, via the DStream abstraction.
>>>>>
>>>>> All the examples I can find, in the Spark repository and in various videos, show someone streaming local JSON files or reading from HDFS/S3/SQL. Also, when browsing the source, SparkSession seems to be defined inside org.apache.spark.sql, which gives me a hunch that this is all somehow related to SQL and the like, and not really to DStreams.
>>>>>
>>>>> What I'm failing to understand is: will this feature impact how we do streaming today? Will I be able to consume a Kafka source in a streaming fashion (like we do today when we open a stream using KafkaUtils)? Will we be able to do stateful operations on a Dataset[T] like we do today using MapWithStateRDD? Or will there be only a subset of operations that the Catalyst optimizer can understand, such as aggregations?
>>>>>
>>>>> I'd be happy if anyone could shed some light on this.
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-in-Spark-2-0-and-DStreams-tp26959.html
>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
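The mapWithState / updateStateByKey question above has semantics that can be sketched without Spark: keep per-key state across microbatches, and for every key that appears in a batch, call a user-supplied update function with the key's new values and its previous state. A toy Python sketch with invented names (not the DStream API itself):

```python
def run_microbatches(batches, update_fn):
    """Drive an updateStateByKey-style computation: for each microbatch,
    call update_fn(new_values, old_state) once per key seen in the batch.
    A Spark-free toy, not Spark's actual implementation."""
    state = {}
    for batch in batches:
        # Group this batch's (key, value) records by key.
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        # Keys absent from this batch keep their old state untouched,
        # mirroring the incremental per-key updates of mapWithState.
        for key, values in grouped.items():
            state[key] = update_fn(values, state.get(key))
    return state


# Running count per key - the canonical stateful-streaming example.
counts = run_microbatches(
    [[("user1", 1), ("user2", 1)], [("user1", 1)]],
    update_fn=lambda new, old: (old or 0) + sum(new),
)
# counts == {"user1": 2, "user2": 1}
```

The open question in the thread is precisely how this imperative, per-key update style maps onto the declarative Dataset operations that Catalyst can optimize.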
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>> For additional commands, e-mail: user-h...@spark.apache.org
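The event-time-versus-processing-time limitation of DStreams that the quoted design doc raises can be made concrete with a small sketch: group the same records into tumbling windows once by the timestamp embedded in each record (event time) and once by arrival time (processing time). All names below are illustrative, not a Spark API:

```python
def tumbling_windows(events, width, time_of):
    """Count events per fixed-width window, where time_of picks which
    timestamp to window by. A toy illustration only."""
    windows = {}
    for event in events:
        # Align each event to the start of its window.
        start = (time_of(event) // width) * width
        windows[start] = windows.get(start, 0) + 1
    return windows


# Each record: (event_time, arrival_time). The last record was
# *generated* at t=9 but *arrived* late, at t=21.
events = [(9, 10), (11, 12), (9, 21)]

by_event_time = tumbling_windows(events, 10, time_of=lambda e: e[0])
by_arrival = tumbling_windows(events, 10, time_of=lambda e: e[1])
# by_event_time == {0: 2, 10: 1}  -- the late record lands in its true window
# by_arrival    == {10: 2, 20: 1} -- processing time misplaces it
```

A DStream window only sees the batch (processing-time) clock, which is the first challenge the design doc lists; RQ's support for windowing on a record's own timestamp is what closes that gap.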