There have been some recent developments and discussions on the schema side
(link below) that warrant a reconsideration of how control tuples get
delivered.
http://apache.markmail.org/search/?q=apex+list%3Aorg.apache.apex.dev+schema+
discovery+support#query:apex%20list%3Aorg.apache.apex.dev%
20sch
Bhupesh, Sandesh, Tushar:
Thanks for volunteering. This task probably needs all three of you to work
closely together.
The subtasks so far are:
https://issues.apache.org/jira/browse/APEXCORE-580
https://issues.apache.org/jira/browse/APEXCORE-581
Please first review the subtasks and see whether
I am also interested working on this feature.
- Tushar.
On Thu, Dec 1, 2016 at 10:27 AM, Bhupesh Chawda wrote:
> I would like to work on https://issues.apache.org/jira/browse/APEXCORE-580.
>
> ~ Bhupesh
>
> On Thu, Dec 1, 2016 at 5:42 AM, Sandesh Hegde
> wrote:
>
>> I am interested in working
I would like to work on https://issues.apache.org/jira/browse/APEXCORE-580.
~ Bhupesh
On Thu, Dec 1, 2016 at 5:42 AM, Sandesh Hegde
wrote:
> I am interested in working on the following subtask
>
> https://issues.apache.org/jira/browse/APEXCORE-581
>
> Thanks
>
>
> On Wed, Nov 30, 2016 at 2:07 P
I am interested in working on the following subtask
https://issues.apache.org/jira/browse/APEXCORE-581
Thanks
On Wed, Nov 30, 2016 at 2:07 PM David Yan wrote:
> I have created an umbrella ticket for control tuple support:
>
> https://issues.apache.org/jira/browse/APEXCORE-579
>
> Currently it
I have created an umbrella ticket for control tuple support:
https://issues.apache.org/jira/browse/APEXCORE-579
Currently it has two subtasks. Please have a look at them and see whether
I'm missing anything or if you have anything to add. You are welcome to add
more subtasks or comment on the exi
+1 for the plan.
I would be interested in contributing to this feature.
~ Bhupesh
On Nov 29, 2016 03:26, "Sandesh Hegde" wrote:
> I am interested in contributing to this feature.
>
> On Mon, Nov 28, 2016 at 1:54 PM David Yan wrote:
>
> > I think we should probably go ahead with option 1 since
I am interested in contributing to this feature.
On Mon, Nov 28, 2016 at 1:54 PM David Yan wrote:
> I think we should probably go ahead with option 1 since this works with
> most use cases and prevents developers from shooting themselves in the foot
> in terms of idempotency.
>
> We can have a c
I think we should probably go ahead with option 1 since this works with
most use cases and prevents developers from shooting themselves in the foot
in terms of idempotency.
We can have a configuration property that enables option 2 later if we have
concrete use cases that call for it.
Please shar
It appears that option 1 is more favored due to unavailability of a use
case which could use option 2.
However, option 2 is problematic in specific cases, like presence of
multiple input ports for example. In case of a linear DAG where control
tuples are flowing in order with the data tuples, it s
Good question Tushar. The callback should be called only once.
The way to implement this is to keep a list of control tuple hashes for the
given streaming window and only do the callback when the operator has not
seen it before.
Other thoughts?
David
On Thu, Nov 10, 2016 at 9:32 AM, Tushar Gosav
Hi David,
What would be the behaviour in case where we have a DAG with following
operators, the number in bracket is number of partitions, X is NxM
partitioning.
A(1) X B(4) X C(2)
If A sends a control tuple, it will be sent to all 4 partition of B,
and from each partition from B it goes to C, i.
Hi Bhupesh,
Since each input port has its own incoming control tuple, I would imagine
there would be an additional DefaultInputPort.processControl method that
operator developers can override.
If we go for option 1, my thinking is that the control tuples would always
be delivered at the next windo
The control tuple could be delivered to the operator only after it is
received from all upstream partitions but still allow other data from an
upstream partition after it's control tuple is received, we don't have to
necessarily block and do complete synchronization like in end window. You
are righ
I have a question regarding the callback for a control tuple. Will it be
similar to InputPort::process() method? Something like
InputPort::processControlTuple(t)
? Or will it be a method of the operator similar to beginWindow()?
When we say that the control tuple will be delivered at window bounda
I don't see how that would work. Suppose you have a file splitter and
multiple partitions of block readers. The "end of file" event cannot be
processed downstream until all block readers are done. I also think that
this is related to the batch demarcation discussion and there should be a
single gen
There is not guarantee about the ordering of events within a streaming
window with multiple upstream partitions. This would require a
synchronization logic similar to what the streaming window provides, hence
I would expect it to be best supported as part of the same window
synchronization.
On Wed
Suppose I am processing data in a file and I want to do something at the
end of a file at the output operator, I would send an end file control
tuple and act on it when I receive it at the output. In a single window I
may end up processing multiple files and if I don't have multiple ports and
logic
With option 2, users can still do idempotent processing by delaying their
processing of the control tuples to end window. They have the flexibility
with this option. In the usual scenarios, you will have one port and given
that control tuples will be sent to all partitions, all the data sent
before
The use cases listed in the original discussion don't call for option 2. It
seems to come with additional complexity and implementation cost.
Can those in favor of option 2 please also provide the use case for it.
Thanks,
Thomas
On Wed, Nov 2, 2016 at 10:36 PM, Siyuan Hua wrote:
> I will vote
I will vote for approach 1.
First of all that one sounds easier to do to me. And I think idempotency is
important. It may run at the cost of higher latency but I think it is ok
And in addition, when in the future if users do need realtime control tuple
processing, we can always add the option on
Pramod,
To answer your questions, the control tuples will be delivered to all
downstream partitions, and an additional emitControl method (actual name
TBD) can be added to DefaultOutputPort without breaking backward
compatibility.
Also, to clarify, each operator should have the ability to block f
A feature that incurs risk with processing order, and more so with
idempotency is a big enough reason to worry about with option 2. Is there
is a critical use case that needs this feature?
Thks
Amol
On Wed, Nov 2, 2016 at 1:25 PM, Pramod Immaneni
wrote:
> I like approach 2 as it gives more fle
As a rule of thumb in any real time operating system, control tuples should
always be handled using Priority Queues.
We may try to control priorities by defining levels. And shall not
be delivered at window boundaries.
In short, control tuples shall never be treated as any other tuples in real
ti
I like approach 2 as it gives more flexibility and also allows for
low-latency options. I think the following are important as well.
1. Delivering control tuples to all downstream partitions.
2. What mechanism will the operator developer use to send the control
tuple? Will it be an additional meho
Hi all,
I would like to renew the discussion of control tuples.
Last time, we were in a debate about whether:
1) the platform should enforce that control tuples are delivered at window
boundaries only
or:
2) the platform should deliver control tuples just as other tuples and it's
the operator
It is not clear how operator will emit custom control tuple at window
boundaries. One way is to cache/accumulate control tuples in the
operator output port till window closes (END_WINDOW is inserted into the
output sink) or only allow an operator to emit control tuples inside the
endWindow(). T
I agree with David. Allowing control tuples within a window (along with
data tuples) creates very dangerous situation where guarantees are
impacted. It is much safer to enable control tuples (send/receive) at
window boundaries (after END_WINDOW of window N, and before BEGIN_WINDOW
for window N+1).
The windowing we discuss here is in general event time based, arrival time
is a special case of it.
I don't think state changes can be made independent of the streaming window
boundary as it would prevent idempotent processing and transitively exactly
once. For that to work, tuples need to be pres
If we allow these flexibilities for control tuples, we are back to square
one and the regular data tuple is specifically for that.
When we talk about control tuples, I think it's safe to make these two
assumptions:
1) The control tuple is always sent and handled at streaming window
boundary.
2)
I think for session tracking, if the session boundaries are allowed to be
not aligned with the streaming window boundaries, the user will have a much
bigger problem with idempotency. And in most cases, session tracking is
event time based, not ingression time or processing time based, so this may
n
I hope I'm not commenting too late on this thread.
>From the above discussion, it looks like the requirement is to have a
custom tuple which has following 2 capabilities to influence streaming
engine on:
1. When to send it (between windows/within window)
2. Where to send it (all partitions, some p
Ability to send custom control tuples within window may be useful, for
example, for sessions tracking, where session boundaries are not aligned
with window boundaries and 500 ms latency is not acceptable for an
application.
Thank you,
Vlad
On 6/25/16 10:52, Thomas Weise wrote:
It should not
It should not matter from where the control tuple is triggered. It will be
good to have a generic mechanism to propagate it and other things can be
accomplished outside the engine. For example, the new comprehensive support
for windowing will all be in Malhar, nothing that the engine needs to know
I did not say that "notably" does not mean "exclusive"
Thks
Amol
On Sat, Jun 25, 2016 at 9:29 AM, Sandesh Hegde
wrote:
> Why restrict the control tuples to input operators?
>
> On Sat, Jun 25, 2016 at 9:07 AM Amol Kekre wrote:
>
> > David,
> > We should avoid control tuple within the window b
Why restrict the control tuples to input operators?
On Sat, Jun 25, 2016 at 9:07 AM Amol Kekre wrote:
> David,
> We should avoid control tuple within the window by simply restricting it
> through API. This can be done by calling something like "sendControlTuple"
> between windows, notably in inp
David,
We should avoid control tuple within the window by simply restricting it
through API. This can be done by calling something like "sendControlTuple"
between windows, notably in input operators.
Thks
Amol
On Sat, Jun 25, 2016 at 7:32 AM, Munagala Ramanath
wrote:
> What would the API look
What would the API look like for option 1 ? Another operator callback
called controlTuple() or does the operator code have to check each
incoming tuple to see if it was data or control ?
Ram
On Fri, Jun 24, 2016 at 11:42 PM, David Yan wrote:
> It looks like option 1 is preferred by the communit
+1
I think this is a great feature, and is needed for all the reasons stated
above, as well as supporting integrations with frameworks like Apache
Beam. Giving users more control by supporting custom serialized objects,
Option 1, would be my preference.
Thanks,
Sasha
On Sat, Jun 25, 2016 at 6:1
For the use cases you mentioned, I think 1) and 2) are more likely to
be controlled directly by the application, 3) and 4) are more likely
going to be triggered externally and directly handled by the engine
and 3) is already being implemented that way (apexcore-163).
The control tuples emitted by
It looks like option 1 is preferred by the community. But let me elaborate
why I brought up the option of piggy backing BEGIN and END_WINDOW
Option 2 implicitly enforces that the operations related to the custom
control tuple be done at the streaming window boundary.
For most operations, it makes
+1 for option 1.
Thank you,
Vlad
On 6/24/16 14:35, Bright Chen wrote:
+1
It also can help to Shutdown the application gracefully.
Bright
On Jun 24, 2016, at 1:35 PM, Siyuan Hua wrote:
+1
I think it's good to have custom control tuple and I prefer the 1 option.
Also I think we should thin
+1
It also can help to Shutdown the application gracefully.
Bright
> On Jun 24, 2016, at 1:35 PM, Siyuan Hua wrote:
>
> +1
>
> I think it's good to have custom control tuple and I prefer the 1 option.
>
> Also I think we should think about couple different callbacks, that could
> be operator l
+1
I think it's good to have custom control tuple and I prefer the 1 option.
Also I think we should think about couple different callbacks, that could
be operator level(triggered when an operator receives an control tuple) or
dag level(triggered when control tuple flow over the whole dag)
Regard
My initial thinking is that the custom control tuples, just like the
existing control tuples, will only be generated from the input operators
and will be propagated downstream to all operators in the DAG. So the NxM
partitioning scenario works just like how other control tuples work, i.e.
the callb
+1 for the feature
I am in favor of option 1, but we may need an helper method to avoid
compiler error on typed port, as calling port.emit(controlTuple) will
be an error if type of control tuple and port does not match. or new
method in outputPort object , emitControlTuple(ControlTuple).
Can you
Overall I am strongly +1 on this ask. During pre-open source days we had
discussed these and had kept it on the table for future. I believe the time
has come
I would prefer option 1 that is sent between previous window's END_WINDOW
and new window's BEGIN_WINDOW control tuple. Piggy backing BEGIN_W
47 matches
Mail list logo