Interesting. Our original use case was consuming from a multi-tenant topic and dynamically bypassing specific tenants for a period of time, then being able to replay just those tenants' data at a later point in time without having to redeploy topologies or manually change consumer state/spout instances... but what we've built is fairly flexible, and we're guessing it could support lots of use cases we haven't even considered yet.
Here's a snippet from our GitHub README, which we're currently working towards getting public, hopefully in the next couple of weeks.

> Example use case: Multi-tenant processing
>
> When consuming a multi-tenant commit log you may want to postpone
> processing for one or more tenants. Imagine that a subset of your tenants'
> database infrastructure requires downtime for maintenance. Using the
> Kafka-Spout implementation you really only have two options to deal with
> this situation:
>
> 1. You could stop your entire topology for all tenants while the
> maintenance is performed for the small subset of tenants.
>
> 2. You could filter these tenants out from being processed by your
> topology, then after the maintenance is complete start a separate
> topology/Kafka-Spout instance that somehow knows where to start and stop
> consuming, and by way of filter bolts on the consuming topology re-process
> only the events for the tenants that were previously filtered.
>
> Unfortunately both of these solutions are complicated, error-prone, and
> downright painful. The alternative is to represent a use case like this
> with a collection of spouts behind a single spout: what we call
> VirtualSpout instances behind a DynamicSpout that handles the management
> of starting and stopping those VirtualSpout instances.
>
> How does it work?
> <https://git.dev.pardot.com/pardot/storm-sideline-spout#how-does-it-work>
>
> The DynamicSpout is really a container of many VirtualSpout instances,
> each of which handles processing messages from its defined Consumer and
> passes them into Apache Storm as a single stream.
>
> This spout implementation exposes two interfaces for controlling *WHEN*
> and *WHAT* messages from Kafka get skipped and marked for processing at a
> later point in time.
>
> The *Trigger Interface* allows you to hook into the spout to control
> *WHEN* messages are delayed from processing, and *WHEN* the spout will
> resume processing messages that it previously delayed.
>
> The *Filter Interface* allows you to define *WHAT* messages the spout
> will mark for delayed processing.
>
> The spout implementation handles the rest for you! It tracks your filter
> criteria as well as offsets within Kafka topics to know where it started
> and stopped filtering. It then uses this metadata to replay only those
> messages that were filtered.
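To make those two interfaces concrete, here's a minimal sketch of what a tenant-bypass filter could look like for the maintenance scenario above. The interface and method names are hypothetical illustrations, not the actual API from the project:

    import java.util.Set;

    // Hypothetical filter contract (not the project's real API): return
    // true for messages that should be sidelined -- skipped now, recorded,
    // and replayed once the filter is removed.
    interface SidelineFilter {
        boolean shouldSideline(String tenantId);
    }

    // Sidelines every message belonging to a tenant whose database
    // infrastructure is currently down for maintenance.
    class TenantBypassFilter implements SidelineFilter {
        private final Set<String> bypassedTenantIds;

        TenantBypassFilter(Set<String> bypassedTenantIds) {
            this.bypassedTenantIds = bypassedTenantIds;
        }

        @Override
        public boolean shouldSideline(String tenantId) {
            return bypassedTenantIds.contains(tenantId);
        }
    }

A trigger implementation would install a filter like this when maintenance starts and remove it when maintenance ends; at that point the spout spins up a VirtualSpout to replay only the offsets that were filtered.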
On Fri, Sep 29, 2017 at 9:01 AM, Mitchell Rathbun (BLOOMBERG/ 731 LEX) <[email protected]> wrote:

> Basically we are using Zookeeper to coordinate between a producer and
> consumer. When the consumer comes up, it needs a recap from the producer.
> The producer sends this recap to the consumer through Kafka in chunks.
> Ideally we wanted the consumer to be able to jump back to the start of the
> last recap in the queue if the producer is down and the last recap was
> recent. I think we have come up with some other ways around this that
> don't rely on "seek" functionality, but was just wondering if anyone else
> had done something similar already. It seems that the new implementation
> you mentioned would provide this functionality.
>
> From: [email protected]
> Subject: Re: Seek in KafkaSpout
>
> I'm curious about your use case around this? It seems odd to need to
> adjust it on the fly while a topology is running, or I've misunderstood
> you!
>
> If you store your consumer state in Zookeeper, you CAN adjust it between
> topology deploys by manually modifying the stored state, and I've done
> this to deal with maintenance or service issues, to roll back to a
> specific point in time. I'm unsure if you're able to do this when consumer
> state is stored within Kafka itself.
>
> As a side note, I've been toying with a Kafka spout implementation that
> allows dynamically consuming arbitrary ranges from topics, which is to be
> open sourced here soon.
>
> Stephen
>
> On Fri, Sep 29, 2017 at 8:06 AM, Mitchell Rathbun (BLOOMBERG/ 731 LEX) <
> [email protected]> wrote:
>
>> Looking through the documentation, it seems that KafkaSpout does not
>> expose any way to set the offset the spout reads from after the initial
>> poll. This functionality is supported in KafkaConsumer through the seek()
>> method. Am I correct that this isn't supported? Has anyone found a way to
>> mimic the behavior of seek() with KafkaSpout?
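For reference, this is the raw KafkaConsumer seek() behavior being asked about: a minimal, self-contained sketch that jumps a consumer back to an arbitrary offset. The topic name, partition, and offset below are made-up examples; in the recap use case the start offset would be looked up out of band (e.g. from the coordination state in Zookeeper):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class SeekExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "recap-consumer");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Manually assign the partition; seek() requires an
                // assignment, and assign() avoids group rebalancing.
                TopicPartition partition = new TopicPartition("recap", 0);
                consumer.assign(Collections.singletonList(partition));

                // Jump back to the start of the last recap.
                long recapStartOffset = 12345L;
                consumer.seek(partition, recapStartOffset);

                ConsumerRecords<String, String> records = consumer.poll(1000);
                records.forEach(r ->
                        System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
            }
        }
    }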

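And on Stephen's point about adjusting consumer state between deploys: with the classic storm-kafka spout, each partition's state lives as a small JSON znode in Zookeeper, so a rollback can be done by editing that JSON while the topology is stopped. Here's a rough sketch using Apache Curator; the znode path and offset value are assumptions that depend on your SpoutConfig (zkRoot and spout id):

    import java.nio.charset.StandardCharsets;
    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class RewindSpoutState {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "localhost:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            // Assumed path layout: <zkRoot>/<spout id>/partition_<n>.
            String path = "/kafka-state/my-spout/partition_0";

            // Inspect the stored JSON state first...
            String state = new String(
                    client.getData().forPath(path), StandardCharsets.UTF_8);
            System.out.println("current state: " + state);

            // ...then write it back with the offset field rewound. A real
            // tool should use a JSON library rather than a regex.
            String rewound = state.replaceAll("\"offset\":\\d+", "\"offset\":12345");
            client.setData().forPath(path, rewound.getBytes(StandardCharsets.UTF_8));

            client.close();
        }
    }

Only do this while the topology is down; a running spout caches its state in memory and will overwrite edits made underneath it.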