Hello, we are working on designs for several streaming applications, and a common consideration is the need for occasional external database lookups/updates. For example, we would be processing a stream of events carrying some kind of local-id, and we occasionally need to resolve the local-id to a global-id using an external service (e.g. a database).
A simple approach is the following:

1.) Quantify the expected throughput of the topic(s).
2.) Partition the topic(s) so that each task isn't "overwhelmed".
3.) Combine blocking database calls with an in-memory cache for the external lookups/updates.
4.) If the system isn't keeping up, simply add more partitions and tasks to the application.

Obviously we are assuming that the external database can handle the rate of transactions.

Another approach is to process the messages asynchronously. That is, database callbacks are attached to something like Futures, so the streaming threads are never blocked. Assuming there is no shared data between threads, this seems essentially equivalent to the first approach: with a thread pool of N threads, the effective per-message processing time can never drop below c/N, where c is the average latency of a (database + cache) lookup, so throughput is bounded by N/c. That is the same bound we get by making N partitions and starting N tasks. This approach does have the advantage of decoupling the threads from the partitions (i.e. nothing precludes us from having more database threads than partitions). On the other hand, the application becomes much more complicated (rough sketch in the P.S. below).

Any other approaches? I would appreciate any design suggestions anyone has. Thx!

-David
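P.S. Here is roughly what I have in mind for the async approach, just as a sketch. IdResolver and GlobalId are placeholder names for whatever client and id type the real external service exposes, not an actual API we have:

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the async approach: the stream thread asks for a resolution and
// gets a CompletableFuture back immediately; the blocking database call runs
// on a separate, fixed-size pool. IdResolver and GlobalId are placeholders.
public class AsyncIdResolver {

    // Hypothetical blocking client for the external service.
    public interface IdResolver {
        GlobalId lookup(String localId);
    }

    // Placeholder for whatever the global id really is.
    public static final class GlobalId {
        public final String value;
        public GlobalId(String value) { this.value = value; }
    }

    private final Map<String, GlobalId> cache = new ConcurrentHashMap<>();
    private final ExecutorService dbPool;
    private final IdResolver database;

    public AsyncIdResolver(IdResolver database, int dbThreads) {
        this.database = database;
        this.dbPool = Executors.newFixedThreadPool(dbThreads);
    }

    // Answer from the cache when possible; otherwise hand the blocking lookup
    // to the pool so the calling (stream) thread is never blocked.
    public CompletableFuture<GlobalId> resolve(String localId) {
        GlobalId cached = cache.get(localId);
        if (cached != null) {
            return CompletableFuture.completedFuture(cached);
        }
        return CompletableFuture.supplyAsync(() -> {
            GlobalId resolved = database.lookup(localId);  // blocking call, runs on a pool thread
            cache.put(localId, resolved);
            return resolved;
        }, dbPool);
    }

    public void shutdown() {
        dbPool.shutdown();
    }
}

The complexity I'm worried about is everything around this class: deciding when a record is actually "done", preserving ordering if it matters, and applying back-pressure so the pool doesn't fall arbitrarily far behind the stream threads.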