Oh, that looks neat! Thx, will read up on that. On Mon, May 16, 2016, 14:10 Ofir Manor <ofir.ma...@equalum.io> wrote:
> Yuval,
> Not sure what is in scope to land in 2.0, but there is another new infra bit to manage state more efficiently, called State Store, whose initial version is already committed:
> SPARK-13809 - State Store: A new framework for state management for computing Streaming Aggregates
> https://issues.apache.org/jira/browse/SPARK-13809
> The pull request eventually links to the design doc, which discusses the limits of updateStateByKey and mapWithState and how those will be handled...
>
> At a quick glance at the code, it already seems to be used in streaming aggregations.
>
> Just my two cents,
>
> Ofir Manor
> Co-Founder & CTO | Equalum
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
> On Mon, May 16, 2016 at 11:33 AM, Yuval Itzchakov <yuva...@gmail.com> wrote:
>
>> Also, re-reading the relevant part from the Structured Streaming documentation
>> (https://docs.google.com/document/d/1NHKdRSNCbCmJbinLmZuqNA1Pt6CGpFnLVRbzuDUcZVM/edit#heading=h.335my4b18x6x):
>>
>> Discretized streams (aka DStream)
>>
>> Unlike Storm, DStream exposes a higher-level API similar to RDDs. There are two main challenges with DStream:
>>
>> 1. Similar to Storm, it exposes a monotonic system (processing) time metric, which makes support for event time difficult.
>> 2. Its APIs are tied to the underlying microbatch execution model; as a result, inflexibilities arise, such as changing the underlying batch interval requiring a change to the window size.
>>
>> RQ addresses the above:
>>
>> 1. RQ operations support both system time and event time.
>> 2. RQ APIs are decoupled from the underlying execution model. As a matter of fact, it is possible to implement an alternative, non-microbatch-based engine for RQ.
>> 3. In addition, due to the declarative specification of operations, RQ leverages a relational query optimizer and can often generate more efficient query plans.
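The State Store mentioned above can be pictured, roughly, as a versioned per-key store that a streaming aggregation folds each microbatch into, committing a fresh snapshot per batch so earlier versions remain available for recovery. A minimal, Spark-free Python sketch of that idea - the class and method names are invented for illustration and are not Spark's actual internals:

```python
class ToyStateStore:
    """A tiny versioned key-value store, loosely modeled on the idea
    behind SPARK-13809 (not Spark's real implementation)."""

    def __init__(self):
        self.versions = {0: {}}  # version number -> committed snapshot
        self.current = 0

    def update_batch(self, batch):
        # Start from the last committed snapshot and fold in this
        # batch's (key, value) rows as a running sum per key.
        snapshot = dict(self.versions[self.current])
        for key, value in batch:
            snapshot[key] = snapshot.get(key, 0) + value
        # Commit as a new version; older versions stay readable,
        # which is what makes recovery/replay possible.
        self.current += 1
        self.versions[self.current] = snapshot
        return snapshot


store = ToyStateStore()
store.update_batch([("a", 1), ("b", 2)])
result = store.update_batch([("a", 3)])
# result == {"a": 4, "b": 2}; version 1 still holds {"a": 1, "b": 2}
```

Each microbatch only touches the keys it contains, yet every committed version is a complete snapshot - the incremental-update-with-checkpointing trade-off the design doc discusses.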
>>
>> This doesn't seem to address the actual underlying implementation of how things like "mapWithState" are going to be translated into RQ, and I think that's the hole that's causing my misunderstanding.
>>
>> On Mon, May 16, 2016 at 1:36 AM Yuval Itzchakov <yuva...@gmail.com> wrote:
>>
>>> Hi Ofir,
>>> Thanks for the elaborate answer. I have read both documents, where they touch lightly on infinite DataFrames/Datasets. However, they do not go in depth as to how existing transformations on DStreams, for example, will be translated into the Dataset APIs. I've been browsing the 2.0 branch and have not yet been able to understand how they correlate.
>>>
>>> Also, placing SparkSession in the sql package seems like a peculiar choice, since this is going to be the global abstraction over SparkContext/StreamingContext from now on.
>>>
>>> On Sun, May 15, 2016, 23:42 Ofir Manor <ofir.ma...@equalum.io> wrote:
>>>
>>>> Hi Yuval,
>>>> let me share my understanding, based on similar questions I had.
>>>> First, Spark 2.x aims to replace a whole bunch of its APIs with just two main ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset (a merge of Dataset and DataFrame, which is why it inherits all the Spark SQL goodness) - while RDD remains a low-level API only for special cases. The new Dataset should also support both batch and streaming, (eventually) replacing DStream as well. See the design docs in SPARK-13485 (unified API) and SPARK-8360 (Structured Streaming) for a good intro.
>>>> However, as you noted, not all of it will be fully delivered in 2.0. For example, it seems that streaming from / to Kafka using Structured Streaming didn't make it (so far?) into 2.0 (which is a showstopper for me).
>>>> Anyway, as far as I understand, you should be able to apply stateful operators (non-RDD) on Datasets (for example, the new event-time window processing in SPARK-8360).
>>>> The gap I see is mostly that only limited streaming sources / sinks have been migrated to the new (richer) API and semantics.
>>>> Anyway, I'm pretty sure that once 2.0 gets to RC, the documentation and examples will align with the current offering...
>>>>
>>>> Ofir Manor
>>>> Co-Founder & CTO | Equalum
>>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>>
>>>> On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov <yuva...@gmail.com> wrote:
>>>>
>>>>> I've been reading about and watching videos on the upcoming Spark 2.0 release, which brings us Structured Streaming. One thing I've yet to understand is how this relates to the current way of working with streaming in Spark, via the DStream abstraction.
>>>>>
>>>>> All the examples I can find, in the Spark repository and in various videos, show someone streaming local JSON files or reading from HDFS/S3/SQL. Also, when browsing the source, SparkSession seems to be defined inside org.apache.spark.sql, which gives me a hunch that this is all somehow related to SQL and the like, and not really to DStreams.
>>>>>
>>>>> What I'm failing to understand is: will this feature impact how we do streaming today? Will I be able to consume a Kafka source in a streaming fashion (like we do today when we open a stream using KafkaUtils)? Will we be able to do stateful operations on a Dataset[T] like we do today using MapWithStateRDD? Or will there be only a subset of operations that the Catalyst optimizer can understand, such as aggregations?
>>>>>
>>>>> I'd be happy if anyone could shed some light on this.
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-in-Spark-2-0-and-DStreams-tp26959.html
>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
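The mapWithState / updateStateByKey question above has semantics that can be sketched without Spark: keep per-key state across microbatches, and for every key that appears in a batch, call a user-supplied update function with the key's new values and its previous state. A toy Python sketch with invented names (not the DStream API itself):

```python
def run_microbatches(batches, update_fn):
    """Drive an updateStateByKey-style computation: for each microbatch,
    call update_fn(new_values, old_state) once per key seen in the batch.
    A Spark-free toy, not Spark's actual implementation."""
    state = {}
    for batch in batches:
        # Group this batch's (key, value) records by key.
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        # Keys absent from this batch keep their old state untouched,
        # mirroring the incremental per-key updates of mapWithState.
        for key, values in grouped.items():
            state[key] = update_fn(values, state.get(key))
    return state


# Running count per key - the canonical stateful-streaming example.
counts = run_microbatches(
    [[("user1", 1), ("user2", 1)], [("user1", 1)]],
    update_fn=lambda new, old: (old or 0) + sum(new),
)
# counts == {"user1": 2, "user2": 1}
```

The open question in the thread is precisely how this imperative, per-key update style maps onto the declarative Dataset operations that Catalyst can optimize.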
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>> For additional commands, e-mail: user-h...@spark.apache.org
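The event-time-versus-processing-time limitation of DStreams that the quoted design doc raises can be made concrete with a small sketch: group the same records into tumbling windows once by the timestamp embedded in each record (event time) and once by arrival time (processing time). All names below are illustrative, not a Spark API:

```python
def tumbling_windows(events, width, time_of):
    """Count events per fixed-width window, where time_of picks which
    timestamp to window by. A toy illustration only."""
    windows = {}
    for event in events:
        # Align each event to the start of its window.
        start = (time_of(event) // width) * width
        windows[start] = windows.get(start, 0) + 1
    return windows


# Each record: (event_time, arrival_time). The last record was
# *generated* at t=9 but *arrived* late, at t=21.
events = [(9, 10), (11, 12), (9, 21)]

by_event_time = tumbling_windows(events, 10, time_of=lambda e: e[0])
by_arrival = tumbling_windows(events, 10, time_of=lambda e: e[1])
# by_event_time == {0: 2, 10: 1}  -- the late record lands in its true window
# by_arrival    == {10: 2, 20: 1} -- processing time misplaces it
```

A DStream window only sees the batch (processing-time) clock, which is the first challenge the design doc lists; RQ's support for windowing on a record's own timestamp is what closes that gap.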