[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression

2023-03-20 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702980#comment-17702980
 ] 

Matthias J. Sax commented on KAFKA-7224:


With 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-825%3A+introduce+a+new+API+to+control+when+aggregated+results+are+produced]
 added in 3.3, do we still want/need this one?

> KIP-328: Add spill-to-disk for Suppression
> --
>
> Key: KAFKA-7224
> URL: https://issues.apache.org/jira/browse/KAFKA-7224
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: John Roesler
>Priority: Major
>
> As described in 
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-328%3A+Ability+to+suppress+updates+for+KTables]
> Following on KAFKA-7223, implement the spill-to-disk buffering strategy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression

2020-05-12 Thread Maatari (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105278#comment-17105278
 ] 

Maatari commented on KAFKA-7224:


Got it. Many thanks.



[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression

2020-05-09 Thread John Roesler (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103420#comment-17103420
 ] 

John Roesler commented on KAFKA-7224:
-

Hi Maatari,

Yeah, it probably seems beside the point because it is beside the point. I 
probably shouldn’t have mentioned it. I guess I was just thinking that when the 
general problem is “I get too many updates in the output”, some of those 
updates are idempotent, while others are non-idempotent. If we eliminate the 
idempotent updates, then maybe the number of updates on the output side drops 
below the “too many” threshold and the problem goes away.

Of course, if you want a guarantee, such as a rate limit, or that you don’t emit 
_any_ result until some specific time, then you need something with those 
semantics, which is orthogonal to whether there are idempotent updates or not.



[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression

2020-05-08 Thread Maatari (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102667#comment-17102667
 ] 

Maatari commented on KAFKA-7224:


[~vvcephei]

Thank you so much for your input on this. I understood everything well except 
the last point, related to emit-on-change. I do not see why
ktable0.join(ktable1.groupby.reduce)
can cause idempotent updates.

I read KIP-557 and the example with the PageViewID and SessionID, and I can 
see the idempotent updates there, but not with the code above. Would you please 
elaborate a bit on this?



[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression

2020-05-01 Thread John Roesler (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097684#comment-17097684
 ] 

John Roesler commented on KAFKA-7224:
-

Hi all,

Thanks for the good points all around.

Just to close the loop on _this_ ticket (disk-based suppression): it had 
_extremely_ poor performance. So much so that my thinking was that for anyone 
with high enough volume to actually need suppression, it would be too slow to 
be useful. The problem is that we need to check the beginning of the 
suppression buffer on every (or almost every) record, to see if we need to emit 
something. For an in-memory store, this is fine, but for RocksDB in particular, 
the scan performance is very slow. There are fundamental reasons why this is 
the case, which we don't need to get into here.

It might be possible to cleverly engineer our way around the problem, but 
anything I came up with just sounded too complicated to be worth it.

However, this is only necessary if you want the semantics of Suppress (each 
record times out individually, based on stream time). If you instead just want 
to buffer everything on disk and then emit everything you've buffered, say once 
an hour, you can do it much more efficiently in a custom FlatTransformValues 
where you put all incoming data into the store, then schedule a wall-clock 
punctuation to scan the entire store and forward everything.

The one complication is that the wall-clock punctuation currently blocks the 
StreamThread, so you need to have some sense of how long it will take (observed 
empirically) and make sure that you set {{max.poll.interval.ms}} with enough 
headroom so you won't drop out of the group.
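To make that concrete, here is a plain-Java sketch of just the buffering logic described above (a toy model, not the Streams API; the class `FlushingBuffer` and its method names are made up). In a real topology, `put` would run once per incoming record inside the transformer, and `flush` would run inside the scheduled wall-clock punctuation, forwarding each entry downstream:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the store-and-flush pattern: each incoming record upserts
// into the "store"; a periodic wall-clock flush scans the whole store and
// hands back the latest value per key.
class FlushingBuffer<K, V> {
    private final Map<K, V> store = new LinkedHashMap<>();

    // Called once per incoming record: only the latest value per key is kept.
    void put(K key, V value) {
        store.put(key, value);
    }

    // Called from the punctuation: snapshot everything, then clear, so each
    // value is emitted at most once per flush interval.
    List<Map.Entry<K, V>> flush() {
        List<Map.Entry<K, V>> out = new ArrayList<>();
        store.forEach((k, v) -> out.add(Map.entry(k, v)));
        store.clear();
        return out;
    }
}
```

Whether the flush should delete what it forwards is a semantic choice; clearing here matches the "emit everything you've buffered, say once an hour" behavior.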

This is bleeding more into the domain of KIP-424, which does seem more like 
what [~maatdeamon] needs (just agreeing with the discussion so far). I don't 
think there was any technical impediment to implementing that one, it was just 
that the KIP discussion petered out (which happens sometimes). I guess, 
building on my last paragraphs, _if_ we had wall-clock-based suppression, 
_then_ it might make more sense to offer on-disk suppression in addition to 
in-memory, as at least the (wall-clock + on-disk) configuration could be 
performant. But it would need much more design. I'm still unsure if on-disk 
suppression is really a good idea to implement in the DSL.

A final thought worth mentioning in this discussion is that KIP-557 ( 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-557%3A+Add+emit+on+change+support+for+Kafka+Streams]
 ) will go a long way toward dropping unnecessary updates. This isn't the same 
thing as suppressing intermediate results, but it will help a great deal to at 
least drop idempotent updates early in the topology and not even have to 
suppress them at the end.
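As a rough illustration of what emit-on-change buys (a sketch of the idea only, not the actual KIP-557 implementation; the names here are hypothetical): an update is only worth forwarding if it differs from the last value recorded for its key.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Toy illustration of emit-on-change: remember the last value seen per key
// and drop any update that does not actually change it.
class EmitOnChangeFilter<K, V> {
    private final Map<K, V> lastSeen = new HashMap<>();

    // Returns true if the update is a real change and should be forwarded,
    // false if it is idempotent and can be dropped early.
    boolean shouldForward(K key, V newValue) {
        V previous = lastSeen.put(key, newValue);
        return !Objects.equals(previous, newValue);
    }
}
```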

Thanks,

-John



[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression

2020-04-30 Thread Maatari (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097043#comment-17097043
 ] 

Maatari commented on KAFKA-7224:


Yes, 

will look at how to go about this. 



[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression

2020-04-30 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097040#comment-17097040
 ] 

Matthias J. Sax commented on KAFKA-7224:


Thanks for the clarification. I missed the point that you can allow `suppress()` 
to also emit if the buffer is full. For this case, having a larger buffer could 
help to reduce intermediate results. My bad.

About KIP-424: I am not sure atm what was proposed, but I agree that only a 
pure wall-clock-time emit strategy makes sense for good rate control.

Btw: I think [~vvcephei] actually worked on RocksDB support for suppress(), but 
the performance was pretty bad and he never finished it.

In any case: if you really need a feature and nobody is working on it, feel 
free to pick it up. Kafka is an open source project, after all.



[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression

2020-04-30 Thread Maatari (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097041#comment-17097041
 ] 

Maatari commented on KAFKA-7224:


Finally, I don't think providing this would only support my use case; I think 
solving it goes in the direction of the statement

 
{quote}Close the gap between the semantics of KTables in streams and tables in 
relational databases. It is common practice to capture changes as they are made 
to tables in a RDBMS into Kafka topics (JDBC-connect, Debezium, Maxwell). These 
entities typically have multiple one-to-many relationship. Usually RDBMSs offer 
good support to resolve this relationship with a join. Streams falls short here 
and the workaround (group by - join - lateral view) is not well supported as 
well and is not in line with the idea of record based processing.
{quote}
found here

[https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+Support+non-key+joining+in+KTable]

 



[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression

2020-04-30 Thread Maatari (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097032#comment-17097032
 ] 

Maatari commented on KAFKA-7224:


{quote}*However, if there was a way to enforce a maximum time a records stay in 
the buffer without being emitted,*
{quote}
{quote}Well, the current suppress does this. Or do you refer to wall-clock time?
{quote}
I think there is a bit of confusion here as well. What I mean is exactly the 
last point I referred to in my last message. So to clarify: if for you a 
*wall-clock-time* emit strategy means not being event driven, as the author 
suggested, but driven by the wall clock only, then yes, I do refer to 
wall-clock time when I say this.



{quote}I cannot follow here. If you buffer and suppress updates to the same key 
and emit update in a certain "frequency" there is no difference if you do this 
in-memory of if you spill to disk. The only difference is, how many unique keys 
the suppress buffer can handle: for in-memory the number of unique keys is 
smaller as all the data must fit into main-memory, while RocksDB would allow to 
process more unique keys. But the number of unique keys is independent to the 
number of intermediate result (that you need to count _per key_ as updates to 
two different keys would never suppress each other).
{quote}
You are spot on about my point when you mention that RocksDB would allow 
processing (suppressing) more unique keys. Besides that, obviously, my thinking 
was: the more unique keys I can hold, the more suppression I can do without 
evicting things. However, I do not understand your last statement.


{quote}But the number of unique keys is independent to the number of 
intermediate result (that you need to count _per key_ as updates to two 
different keys would never suppress each other).
{quote}
Do you not think that the bigger the suppression buffer, whether in memory or 
on disk, the more suppression you can do?


So far, if I understood you well, it sounds like a combination of KIP-328 + 
KIP-424 (wall-clock-time emit strategy) would solve my use case, no? How to get 
there is another question, but at least making sure to go in the right 
direction is important. I like the approach of keeping the semantics of the 
stream separate from the operational concerns: 
[https://www.confluent.io/blog/kafka-streams-take-on-watermarks-and-triggers/]



[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression

2020-04-30 Thread Maatari (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097025#comment-17097025
 ] 

Maatari commented on KAFKA-7224:


Thank you so much for your clarification; it helps a lot. I will try to clarify 
some of my confusing statements.
{quote}What do you mean by "at the end of the topology"? There is nothing like 
this. Note that the input is not a "finite table" but the "infinite table 
changelog stream".
{quote}
I just meant having something like this:
{code:java}
ktable0.join(ktable1.groupby.reduce).suppress(...){code}
It was my language that was misleading. I agree with you, it is not a finite 
table. What I want is to significantly mitigate the intermediary results.


{quote}That is by design. Because the input may contain out-of-order data, time 
cannot easily be advanced if the input stream "stalls". Otherwise, the whole 
operation becomes non-deterministic (what might be ok for your use case 
though). This would require some wall-clock time emit strategy though (as you 
mentioned already, ie, KP-424).{quote}
 
As you suggest above, that is exactly what would put me in the right direction, 
given my use case. I will specifically adopt your language: *wall-clock time 
emit strategy.* Is that really what was intended in KIP-424? On that page 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-424%3A+Allow+suppression+of+intermediate+events+based+on+wall+clock+time]
 the author specifically says: _"However, checks against the wall clock are 
event driven: if no events are received in over 5 seconds, no events will be 
sent downstream"_
Hence, just to clarify: do you mean the same thing when you say wall-clock time 
emit strategy? Because if that is the case, the same problem as above remains 
for me: some records can still stay stuck if nothing else comes in. It is 
important, because I wanted to ask, from your point of view, whether it is even 
feasible to use wall-clock time as I mean it. That is: if a key's record has 
been in the buffer past the configured time, emit the record anyway, even if 
no new records have been ingested.



 



[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression

2020-04-30 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096976#comment-17096976
 ] 

Matthias J. Sax commented on KAFKA-7224:


It seems we use the term "intermediate" result in the same way. However, note 
that for a KTable-KTable join there is no "final" result: the result is by 
definition an infinite changelog stream; for each update to the input tables, a 
new result update record is produced. Hence, the only thing you can do is to 
say: don't give me every update, but (for the same key) only a subset of 
updates.
{quote}cause if i want to suppress all the intermediary result let say at the 
end of the topology above
{quote}
What do you mean by "at the end of the topology"? There is nothing like this. 
Note that the input is not a "finite table" but the "infinite table changelog 
stream".
{quote}given the frequency with which the database is updated, i can find 
myself with records, stuck in the supression buffer. Indeed it is stream time
{quote}
That is by design. Because the input may contain out-of-order data, time cannot 
easily be advanced if the input stream "stalls". Otherwise, the whole operation 
becomes non-deterministic (which might be OK for your use case though). This 
would require some wall-clock time emit strategy though (as you mentioned 
already, i.e., KIP-424).
{quote}However, if there was a way to enforce a maximum time a records stay in 
the buffer without being emitted,
{quote}
Well, the current suppress does this. Or do you refer to wall-clock time?
{quote}and if that buffer was rocksDB, then i think i could massively mitigate 
those intermediary result, and produce despite the frequency of the db i am 
ready the data from.
{quote}
I cannot follow here. If you buffer and suppress updates to the same key and 
emit updates at a certain "frequency", there is no difference between doing 
this in-memory or spilling to disk. The only difference is how many unique keys 
the suppress buffer can handle: for in-memory, the number of unique keys is 
smaller, as all the data must fit into main memory, while RocksDB would allow 
processing more unique keys. But the number of unique keys is independent of 
the number of intermediate results (which you need to count _per key_, as 
updates to two different keys would never suppress each other).



[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression

2020-04-30 Thread Maatari (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096868#comment-17096868
 ] 

Maatari commented on KAFKA-7224:


I have played a bit with suppress using untilTimeLimit, but with no success, 
because if I want to suppress the intermediary results, let's say at the end of 
the topology above, given the frequency with which the database is updated, I 
can find myself with records stuck in the suppression buffer. Indeed, it is 
stream time: if it does not progress, then a record might never be emitted. 
Besides, I would need quite a large time limit to have effective suppression. 
My understanding now is that untilTimeLimit is event driven, which I did not 
know when I first posted my message.

However, if there was a way to enforce a maximum time a record stays in the 
buffer without being emitted, and if that buffer was RocksDB, then I think I 
could massively mitigate those intermediary results, and produce output 
despite the update frequency of the database I am reading the data from.
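The "stuck record" effect can be modeled in a few lines of plain Java (a toy model with made-up names, not Kafka Streams code): eviction is measured against stream time, and stream time only advances when new records arrive, so a lone buffered record never expires on its own.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of stream-time-based eviction (the untilTimeLimit behavior):
// stream time only advances when new records arrive, so a buffered record
// can sit forever if the input stalls.
class StreamTimeBuffer {
    private long streamTime = 0L;
    private final Map<String, Long> bufferedAt = new LinkedHashMap<>();

    // Each incoming record advances stream time to its timestamp (at most).
    void accept(String key, long recordTimestampMs) {
        streamTime = Math.max(streamTime, recordTimestampMs);
        bufferedAt.put(key, recordTimestampMs);
    }

    // Evict (emit) keys whose buffered age, measured in stream time, has
    // reached the limit. No new input => streamTime frozen => no eviction.
    List<String> evict(long limitMs) {
        List<String> emitted = new ArrayList<>();
        bufferedAt.entrySet().removeIf(e -> {
            boolean expired = streamTime - e.getValue() >= limitMs;
            if (expired) emitted.add(e.getKey());
            return expired;
        });
        return emitted;
    }
}
```

Calling `evict` repeatedly without new input never emits anything; only a later record (advancing stream time) releases the stuck key.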



[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression

2020-04-30 Thread Maatari (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096863#comment-17096863
 ] 

Maatari commented on KAFKA-7224:


What I call an intermediate result is in the following context. Let's say you 
have the following topology:
{code:java}
ktable0.join(ktable1.groupby.reduce){code}
where the reduce just acts like collectList in KSQL. This is a use case we 
need. There is a repartition topic at the groupby, and therefore you would emit 
the same record multiple times, while the list collected with the reduce keeps 
growing until the entire topic is consumed. This in turn generates multiple 
results for the join as well, as the same key on the right side of the join 
arrives multiple times. So you end up with systematically ever-growing 
versions of records. That is what I call intermediate results. This is a way 
to build views on normalized data that represent an entity with references to 
all its outgoing links. We used to do that in our databases, but it was not 
scaling.
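A toy model of why the list-collecting reduce multiplies output (hypothetical names, not actual Streams code): each input record produces a new, strictly larger version of the aggregate, and every one of those versions is forwarded downstream as an intermediate result.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the groupBy + reduce (list-collect) step: every input record
// yields a new, larger version of the per-key aggregate, and each of those
// versions is emitted downstream as an "intermediate" join input.
class ListAggregator {
    private final Map<String, List<String>> state = new HashMap<>();
    final List<List<String>> emitted = new ArrayList<>();

    void add(String key, String value) {
        List<String> agg = state.computeIfAbsent(key, k -> new ArrayList<>());
        agg.add(value);
        emitted.add(new ArrayList<>(agg)); // one downstream update per input
    }
}
```

Three inputs for one key produce three downstream updates, only the last of which is the "complete" list; that amplification is what suppression is meant to absorb.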



[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression

2020-04-30 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096237#comment-17096237
 ] 

Matthias J. Sax commented on KAFKA-7224:


This ticket would not reduce intermediate results. I am not sure what issue you 
are facing with "too many intermediate results". Are you using `suppress()` 
already? If yes, what issue do you face?

Also, wall-clock time suppression does not help to reduce intermediate results, 
but it makes suppression non-deterministic. It might be helpful for some use 
cases, i.e., output rate control. But I am not sure how it relates to your use 
case?



[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression

2020-04-29 Thread Maatari (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096139#comment-17096139
 ] 

Maatari commented on KAFKA-7224:


Definitely something our team is longing for. There are serious use cases 
around. This feature would unlock the most critical issue we are facing with 
our Kafka Streams application: too many intermediary results at this point. We 
load entire databases and build views that represent the complete entities of 
the domain through joins. Although functionally things work, operationally 
there are just too many intermediary results. Having this in combination with 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-424%3A+Allow+suppression+of+intermediate+events+based+on+wall+clock+time]
would be the killer feature.



[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression

2020-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054271#comment-17054271
 ] 

ASF GitHub Bot commented on KAFKA-7224:
---

vvcephei commented on pull request #6428: KAFKA-7224: [WIP] Persistent Suppress 
[WIP]
URL: https://github.com/apache/kafka/pull/6428
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression

2020-03-07 Thread John Roesler (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054270#comment-17054270
 ] 

John Roesler commented on KAFKA-7224:
-

I didn't realize I'd left this ticket in progress. I intended to shelve this 
work until there was some concrete ask for it.

After the implementation in the PR, I ran some benchmarks and found that the 
performance with RocksDB-backed suppression was _absolutely terrible_; I think 
it was something like two orders of magnitude slower. Much slower even than 
regular RocksDB-backed persistent store operations. The key problem was that 
the suppression buffer relies on scans, and scans in RocksDB are absurdly slow. 
I looked into RocksDB optimizations, but didn't find anything remotely 
promising.

It might be the case that you'd be fine with a huge performance penalty in 
exchange for the "final result" semantics, but it seems like it would have to 
be a very niche use case: low throughput (so the performance is tolerable) but 
large amounts of intermediate results (so that the in-memory buffer wouldn't be 
sufficient).

I wasn't confident that such a use case would actually exist, and on the other 
hand, it felt like a massive potential for frustration to drop such a 
poor-performing component into the codebase, even if I were to pepper the 
javadocs with warnings about it. So I decided just to pause work on it pending 
more information.

-John



[jira] [Commented] (KAFKA-7224) KIP-328: Add spill-to-disk for Suppression

2019-03-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16789906#comment-16789906
 ] 

ASF GitHub Bot commented on KAFKA-7224:
---

vvcephei commented on pull request #6428: KAFKA-7224: [WIP] Persistent Suppress 
[WIP]
URL: https://github.com/apache/kafka/pull/6428
 
 
   WIP - no need to review. I'm just getting a copy of this onto github.
   
   I'll call for reviews once I think it's ready.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

