Re: [Structured Streaming]Data processing and output trigger should be decoupled

2017-08-31 Thread 张万新
I think something like a state store can be used to keep the intermediate
data. For aggregations the engine keeps processing batches of data and
updating the results in the state store (or something similar), and when a
trigger fires the engine just fetches the current result from the state
store and outputs it to the sink specified by the user.

Alternatively, if the processing time is shorter than the trigger interval,
could the engine first complete most of the jobs or stages, and then, when
the trigger fires, run only the final job or stages to produce the final
result and output it to the sink?
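The decoupled design proposed above can be sketched in plain Scala. This is a toy illustration only: the state store stand-in is just a synchronized map, not Spark's actual state store API, and all names here are invented for the example.

```scala
import scala.collection.mutable

// Toy sketch of the proposal above: processing continuously folds records
// into a state store, while the trigger only snapshots the current result
// and hands it to the sink. Not Spark's real StateStore API.
object DecoupledTrigger {
  private val store = mutable.Map.empty[String, Long]

  // Called continuously as records arrive: update the running aggregation.
  def process(key: String, count: Long): Unit = store.synchronized {
    store(key) = store.getOrElse(key, 0L) + count
  }

  // Called only when the trigger fires: snapshot the current result and
  // output it to the user-specified sink.
  def fireTrigger(sink: Map[String, Long] => Unit): Unit =
    sink(store.synchronized(store.toMap))
}
```

With this split, processing latency no longer delays output: `fireTrigger` always emits whatever the state store holds at the trigger boundary.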

Shixiong(Ryan) Zhu wrote on Thursday, August 31, 2017 at 1:59 AM:

> I don't think that's a good idea. If the engine keeps processing data
> but doesn't output anything, where would it keep the intermediate data?
>
> On Wed, Aug 30, 2017 at 9:26 AM, KevinZwx  wrote:
>
>> Hi,
>>
>> I'm working with Structured Streaming, and I'm wondering whether the
>> trigger behavior could be improved.
>>
>> Currently, when I specify a trigger, e.g.
>> trigger(Trigger.ProcessingTime("10 minutes")), the engine begins
>> processing data when the trigger fires, at 10:00:00, 10:10:00, 10:20:00,
>> etc. If the engine takes 10s to process that batch of data, we get the
>> output result at 10:00:10, and then the engine just waits without
>> processing any data. When the next trigger fires, the engine processes
>> the data that arrived during the interval, and if this time it takes 15s
>> to process the batch, we get the result at 10:10:15. This is the problem.
>>
>> In my understanding, the trigger and data processing should be
>> decoupled: the engine should keep processing data as fast as possible,
>> but only emit output results at each trigger, so that we get results
>> exactly at 10:00:00, 10:10:00, 10:20:00, ... Is there any existing
>> solution or a plan to work on this?
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: [Structured Streaming]Data processing and output trigger should be decoupled

2017-08-30 Thread Shixiong(Ryan) Zhu
I don't think that's a good idea. If the engine keeps processing data
but doesn't output anything, where would it keep the intermediate data?

On Wed, Aug 30, 2017 at 9:26 AM, KevinZwx  wrote:

> Hi,
>
> I'm working with Structured Streaming, and I'm wondering whether the
> trigger behavior could be improved.
>
> Currently, when I specify a trigger, e.g.
> trigger(Trigger.ProcessingTime("10 minutes")), the engine begins
> processing data when the trigger fires, at 10:00:00, 10:10:00, 10:20:00,
> etc. If the engine takes 10s to process that batch of data, we get the
> output result at 10:00:10, and then the engine just waits without
> processing any data. When the next trigger fires, the engine processes
> the data that arrived during the interval, and if this time it takes 15s
> to process the batch, we get the result at 10:10:15. This is the problem.
>
> In my understanding, the trigger and data processing should be decoupled:
> the engine should keep processing data as fast as possible, but only emit
> output results at each trigger, so that we get results exactly at
> 10:00:00, 10:10:00, 10:20:00, ... Is there any existing solution or a
> plan to work on this?
>
>
>


[Structured Streaming]Data processing and output trigger should be decoupled

2017-08-30 Thread KevinZwx
Hi,

I'm working with Structured Streaming, and I'm wondering whether the
trigger behavior could be improved.

Currently, when I specify a trigger, e.g.
trigger(Trigger.ProcessingTime("10 minutes")), the engine begins processing
data when the trigger fires, at 10:00:00, 10:10:00, 10:20:00, etc. If the
engine takes 10s to process that batch of data, we get the output result at
10:00:10, and then the engine just waits without processing any data. When
the next trigger fires, the engine processes the data that arrived during
the interval, and if this time it takes 15s to process the batch, we get
the result at 10:10:15. This is the problem.

In my understanding, the trigger and data processing should be decoupled:
the engine should keep processing data as fast as possible, but only emit
output results at each trigger, so that we get results exactly at 10:00:00,
10:10:00, 10:20:00, ... Is there any existing solution or a plan to work on
this?
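To make the timing above concrete, here is a small plain-Scala sketch (no Spark dependency; the object and function names are invented for the example) computing when each batch's output appears under the current behavior:

```scala
// Under a processing-time trigger, batch i starts at i * interval, and its
// output appears only after that batch finishes processing.
object TriggerTiming {
  // Seconds after 10:00:00 at which each batch's output becomes visible,
  // given the trigger interval and each batch's processing time.
  def outputTimes(intervalSec: Int, processingSec: Seq[Int]): Seq[Int] =
    processingSec.zipWithIndex.map { case (p, i) => i * intervalSec + p }
}
```

For a 10-minute trigger where the two batches take 10s and 15s, `outputTimes(600, Seq(10, 15))` gives `Seq(10, 615)`, i.e. results at 10:00:10 and 10:10:15 rather than at the trigger boundaries.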


