This is a timely question. We've just updated the documentation on capacity planning and sizing for Kafka Streams jobs:
http://docs.confluent.io/current/streams/sizing.html
It covers scenarios with windowed stores too. Any feedback is welcome.
Thanks
Eno

> On 3 May 2017, at 18:51, Garrett Barton <garrett.bar...@gmail.com> wrote:
>
> That depends on whether you're using event, processing or ingestion time.
>
> My understanding is that if you play a record through that is T-6, the only
> way that
> TimeWindows.of(TimeUnit.MINUTES.toMillis(1)).until(TimeUnit.MINUTES.toMillis(5))
> would actually process that record in your window is if you're using
> processing time. Otherwise the record is skipped and no data is
> generated/calculated for that operation. So depending on what you're doing,
> you would not use any more memory than when consuming in real time.
>
> On Wed, May 3, 2017 at 3:37 AM, João Peixoto <joao.harti...@gmail.com>
> wrote:
>
>> The base question I'm trying to answer is "how much memory does my instance
>> need".
>>
>> Consider a use case where I want to keep a rolling average on a tumbling
>> window of 1 minute size, allowing for late arrivals up to 5 minutes (lower
>> bound). We would have something like this:
>>
>> TimeWindows.of(TimeUnit.MINUTES.toMillis(1)).until(
>> TimeUnit.MINUTES.toMillis(5))
>>
>> The aggregate key size is 8 bytes, the average value is 8 bytes, and for
>> de-duplication purposes we need to keep track of which messages we have
>> already seen, so a list of keys averaging 10 entries.
>>
>> If I understand correctly, this means that each window will be on average 96
>> bytes in size.
>>
>> A single topic generates 100 messages/minute, which aggregate into 10
>> independent windows.
>>
>> At any given point in time the windowed aggregates require at least 960
>> bytes of memory.
>>
>> Here's the confusing part. Let's say I found an issue with my averaging
>> operation and I want to reprocess the last 10 hours' worth of messages.
>>
>> 1. Windows will be regenerated, since most likely they were cleaned up
>> already.
>> 2. The retention policy now has different semantics? If I had a late
>> arrival of 6 minutes, all of a sudden the reprocessing will account for
>> it, right? Since the window is now active due to recreation (assuming my app
>> is capable of processing all messages in under 5 minutes).
>> 3. I'll be keeping 10 windows * (60 * 10) minutes for the first 5 minutes,
>> so my memory requirement is now 576,000 bytes? This is orders of magnitude
>> bigger.
>>
>> I hope this gets my doubts across clearly; feel free to ask for more details.
>> And thanks in advance.
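
For reference, the back-of-the-envelope arithmetic in the original question can be reproduced with a small sketch. This is plain Java with no Kafka dependency. The class and method names are hypothetical, and it assumes each de-duplication entry is 8 bytes (the same as the key), which is what the 96-byte figure implies. It counts payload bytes only and ignores any state store, serialization or JVM overhead, so treat the results as a lower bound rather than a sizing recommendation.

```java
public class WindowMemoryEstimate {

    // Payload bytes for one window: aggregate key + aggregate value
    // + the list of already-seen message keys kept for de-duplication.
    public static long bytesPerWindow(long keyBytes, long valueBytes,
                                      long dedupEntries, long dedupEntryBytes) {
        return keyBytes + valueBytes + dedupEntries * dedupEntryBytes;
    }

    // Total payload bytes when `windowsPerMinute` windows are held
    // for `minutesHeld` one-minute intervals at the same time.
    public static long totalBytes(long windowsPerMinute, long minutesHeld,
                                  long bytesPerWindow) {
        return windowsPerMinute * minutesHeld * bytesPerWindow;
    }

    public static void main(String[] args) {
        // 8-byte key, 8-byte average, 10 de-dup entries of (assumed) 8 bytes each.
        long perWindow = bytesPerWindow(8, 8, 10, 8);

        // Steady state from the question: 10 independent windows for the current minute.
        long steadyState = totalBytes(10, 1, perWindow);

        // Worst case from the question: every window of the last 10 hours held at once.
        long reprocessing = totalBytes(10, 60 * 10, perWindow);

        System.out.println("bytes per window:   " + perWindow);    // 96
        System.out.println("steady state bytes: " + steadyState);  // 960
        System.out.println("reprocessing bytes: " + reprocessing); // 576000
    }
}
```

Whether the reprocessing figure is ever reached depends on the timestamp semantics discussed above. The sketch only confirms that the numbers in the question are internally consistent.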