Interesting. Our original use case was consuming from a multi-tenant topic and dynamically bypassing specific tenants for a period of time, then being able to replay just those tenants' data at a later point in time without having to redeploy topologies or manually change consumer state/spout instances... but what we've built is fairly flexible, and we're guessing it could support lots of use cases we haven't even considered yet.
Here's a snippet from our GitHub README, which we're currently working towards getting public, hopefully in the next couple of weeks.

> Example use case: Multi-tenant processing
>
> When consuming a multi-tenant commit log you may want to postpone
> processing for one or more tenants. Imagine that a subset of your tenants'
> database infrastructure requires downtime for maintenance. Using the
> Kafka-Spout implementation you really only have two options to deal with
> this situation:
>
> 1. You could stop your entire topology for all tenants while the
> maintenance is performed for the small subset of tenants.
>
> 2. You could filter these tenants out from being processed by your
> topology, then after the maintenance is complete start a separate
> topology/Kafka-Spout instance that somehow knows where to start and stop
> consuming, and by way of filter bolts on the consuming topology re-process
> only the events for the tenants that were previously filtered.
>
> Unfortunately both of these solutions are complicated, error-prone, and
> downright painful. The alternative is to represent a use case like this
> with a collection of spouts behind a single spout: what we call
> VirtualSpout instances behind a DynamicSpout that handles the management
> of starting and stopping those VirtualSpout instances.
>
> How does it work?
> <https://git.dev.pardot.com/pardot/storm-sideline-spout#how-does-it-work>
>
> The DynamicSpout is really a container of many VirtualSpout instances,
> each of which handles processing messages from its defined Consumer and
> passes them into Apache Storm as a single stream.
>
> This spout implementation exposes two interfaces for controlling *WHEN*
> and *WHAT* messages from Kafka get skipped and marked for processing at a
> later point in time.
>
> The *Trigger Interface* allows you to hook into the spout to control
> *WHEN* messages are delayed from processing, and *WHEN* the spout will
> resume processing messages that it previously delayed.
>
> The *Filter Interface* allows you to define *WHAT* messages the spout
> will mark for delayed processing.
>
> The spout implementation handles the rest for you! It tracks your filter
> criteria as well as offsets within Kafka topics to know where it started
> and stopped filtering. It then uses this metadata to replay only those
> messages that were filtered.
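To make those two interfaces concrete, here's a minimal sketch of what a tenant-bypass filter could look like for the maintenance scenario above. The interface and method names are hypothetical illustrations, not the actual API from the project:

    import java.util.Set;

    // Hypothetical filter contract (not the project's real API): return
    // true for messages that should be sidelined -- skipped now, recorded,
    // and replayed once the filter is removed.
    interface SidelineFilter {
        boolean shouldSideline(String tenantId);
    }

    // Sidelines every message belonging to a tenant whose database
    // infrastructure is currently down for maintenance.
    class TenantBypassFilter implements SidelineFilter {
        private final Set<String> bypassedTenantIds;

        TenantBypassFilter(Set<String> bypassedTenantIds) {
            this.bypassedTenantIds = bypassedTenantIds;
        }

        @Override
        public boolean shouldSideline(String tenantId) {
            return bypassedTenantIds.contains(tenantId);
        }
    }

A trigger implementation would install a filter like this when maintenance starts and remove it when maintenance ends; at that point the spout spins up a VirtualSpout to replay only the offsets that were filtered.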
On Fri, Sep 29, 2017 at 9:01 AM, Mitchell Rathbun (BLOOMBERG/ 731 LEX) <[email protected]> wrote:

> Basically we are using Zookeeper to coordinate between a producer and
> consumer. When the consumer comes up, it needs a recap from the producer.
> The producer sends this recap to the consumer through Kafka in chunks.
> Ideally we wanted the consumer to be able to jump back to the start of the
> last recap in the queue if the producer is down and the last recap was
> recent. I think we have come up with some other ways around this that
> don't rely on "seek" functionality, but was just wondering if anyone else
> had done something similar already. It seems that the new implementation
> you mentioned would provide this functionality.
>
> From: [email protected]
> Subject: Re: Seek in KafkaSpout
>
> I'm curious about your use case around this? It seems odd to need to
> adjust it on the fly while a topology is running, or I've misunderstood
> you!
>
> If you store your consumer state in Zookeeper, you CAN adjust it between
> topology deploys by manually modifying the stored state, and I've done
> this to deal with maintenance or service issues, to roll back to a
> specific point in time. I'm unsure if you're able to do this when consumer
> state is stored within Kafka itself.
>
> As a side note, I've been toying with a Kafka spout implementation that
> allows dynamically consuming arbitrary ranges from topics, which is to be
> open sourced here soon.
>
> Stephen
>
> On Fri, Sep 29, 2017 at 8:06 AM, Mitchell Rathbun (BLOOMBERG/ 731 LEX) <
> [email protected]> wrote:
>
>> Looking through the documentation, it seems that KafkaSpout does not
>> expose any way to set the offset the spout reads from after the initial
>> poll. This functionality is supported in KafkaConsumer through the seek()
>> method. Am I correct that this isn't supported? Has anyone found a way to
>> mimic the behavior of seek() with KafkaSpout?
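For reference, this is the raw KafkaConsumer seek() behavior being asked about: a minimal, self-contained sketch that jumps a consumer back to an arbitrary offset. The topic name, partition, and offset below are made-up examples; in the recap use case the start offset would be looked up out of band (e.g. from the coordination state in Zookeeper):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class SeekExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "recap-consumer");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Manually assign the partition; seek() requires an
                // assignment, and assign() avoids group rebalancing.
                TopicPartition partition = new TopicPartition("recap", 0);
                consumer.assign(Collections.singletonList(partition));

                // Jump back to the start of the last recap.
                long recapStartOffset = 12345L;
                consumer.seek(partition, recapStartOffset);

                ConsumerRecords<String, String> records = consumer.poll(1000);
                records.forEach(r ->
                        System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
            }
        }
    }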

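And on Stephen's point about adjusting consumer state between deploys: with the classic storm-kafka spout, each partition's state lives as a small JSON znode in Zookeeper, so a rollback can be done by editing that JSON while the topology is stopped. Here's a rough sketch using Apache Curator; the znode path and offset value are assumptions that depend on your SpoutConfig (zkRoot and spout id):

    import java.nio.charset.StandardCharsets;
    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class RewindSpoutState {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "localhost:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            // Assumed path layout: <zkRoot>/<spout id>/partition_<n>.
            String path = "/kafka-state/my-spout/partition_0";

            // Inspect the stored JSON state first...
            String state = new String(
                    client.getData().forPath(path), StandardCharsets.UTF_8);
            System.out.println("current state: " + state);

            // ...then write it back with the offset field rewound. A real
            // tool should use a JSON library rather than a regex.
            String rewound = state.replaceAll("\"offset\":\\d+", "\"offset\":12345");
            client.setData().forPath(path, rewound.getBytes(StandardCharsets.UTF_8));

            client.close();
        }
    }

Only do this while the topology is down; a running spout caches its state in memory and will overwrite edits made underneath it.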