Hello Gareth,

A common practice for rolling up aggregations with Kafka Streams is to
materialize only the finest granularity in the processor (5 days in your
case), and to do the coarse-grained rollup at query-serving time through
the interactive query API -- i.e. whenever a query is issued for a 30-day
aggregate, you do a range scan on the 5-day-aggregate store and compute the
rollup on the fly.
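
In case it helps, here is a rough sketch of that query-time rollup. The
store name "agg-5d", the String key / Long value types, and summing as the
rollup are just placeholders for whatever your topology actually
materializes:

import java.time.Duration;
import java.time.Instant;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyWindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

public class ThirtyDayRollup {

    // Sum every 5-day window value for `key` that falls in the last 30 days.
    // "agg-5d" and the String/Long types are illustrative placeholders.
    public static long query(KafkaStreams streams, String key) {
        ReadOnlyWindowStore<String, Long> fiveDayStore = streams.store(
            StoreQueryParameters.fromNameAndType(
                "agg-5d", QueryableStoreTypes.<String, Long>windowStore()));

        Instant to = Instant.now();
        Instant from = to.minus(Duration.ofDays(30));

        long total = 0L;
        // Range scan over this key's 5-day windows and roll them up on the fly.
        try (WindowStoreIterator<Long> iter = fiveDayStore.fetch(key, from, to)) {
            while (iter.hasNext()) {
                total += iter.next().value;
            }
        }
        return total;
    }
}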

If you'd prefer to still materialize all of the granularities, say because
their query frequency is high enough, you could go with three stores built
as three chained aggregations (i.e. a stream aggregation into 5-day windows,
then a table aggregation of the 5-day results into 10 days, and a table
aggregation of the 10-day results into 30 days), as sketched below.
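
A rough sketch of what the chained version could look like. Again, the
topic name "metrics", the store names, and the Long/sum metric are just
placeholders; the 10-day -> 30-day step would chain off the 10-day table in
the same way:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.WindowStore;

public class ChainedRollups {

    static void buildTopology(StreamsBuilder builder) {
        // Illustrative topic name and serdes.
        KStream<String, Long> input =
            builder.stream("metrics", Consumed.with(Serdes.String(), Serdes.Long()));

        // Finest granularity: aggregate the stream into 5-day windows.
        KTable<Windowed<String>, Long> fiveDay = input
            .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
            .windowedBy(TimeWindows.of(Duration.ofDays(5)))
            .reduce(Long::sum,
                Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("agg-5d"));

        // Roll the 5-day table up into 10-day buckets keyed by "key@bucketStart".
        // A table aggregation takes both an adder and a subtractor, so that an
        // updated 5-day value replaces the previous one inside the 10-day total.
        long tenDayMs = Duration.ofDays(10).toMillis();
        KTable<String, Long> tenDay = fiveDay
            .groupBy(
                (wk, v) -> KeyValue.pair(
                    wk.key() + "@" + (wk.window().start() - wk.window().start() % tenDayMs),
                    v),
                Grouped.with(Serdes.String(), Serdes.Long()))
            .aggregate(
                () -> 0L,
                (k, v, agg) -> agg + v,   // adder
                (k, v, agg) -> agg - v,   // subtractor
                Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("agg-10d"));

        // The 10-day -> 30-day step chains off `tenDay` the same way (parse the
        // bucket start back out of the key to pick the 30-day bucket).
    }
}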

Guozhang

On Mon, Mar 15, 2021 at 6:11 PM Gareth Collins <gareth.o.coll...@gmail.com>
wrote:

> Hi,
>
> We have a requirement to calculate metrics on a huge number of keys (could
> be hundreds of millions, perhaps billions of keys - attempting caching on
> individual keys in many cases will have almost a 0% cache hit rate). Is
> Kafka Streams with RocksDB and compacted topics the right tool for a task
> like that?
>
> Also, just from playing with Kafka Streams for a week, it feels like it
> wants to create a lot of separate stores by default (if I want to calculate
> aggregates over 5, 10, and 30 days, I will get three separate stores by
> default for this state data). Coming from a different distributed storage
> solution, I feel like I want to put them together in one store as I/O has
> always been my bottleneck (1 big read and 1 big write is better than three
> small separate reads and three small separate writes).
>
> But am I perhaps missing something here? I don't want to avoid the DSL that
> Kafka Streams provides if I don't have to. Will the Kafka Streams RocksDB
> solution be so much faster than a distributed read that it won't be the
> bottleneck even with huge amounts of data?
>
> Any info/opinions would be greatly appreciated.
>
> thanks in advance,
> Gareth Collins
>


-- 
-- Guozhang
