If the ordering of these events is important, then putting them in the same topic is not only desirable, it's necessary. See https://www.confluent.io/blog/put-several-event-types-kafka-topic/. However, think hard about whether ordering actually matters in your use case, as things are certainly simpler when a topic contains a single message type.
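As a concrete illustration, here is a minimal producer sketch (plain Java; the 'events' topic name comes from the question below, while the broker address, the document ID, and the JSON payloads are made-up examples). Keying both event types by the same document ID sends them to the same partition, which is what gives you their relative order within a single topic:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class SingleTopicOrderingExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption: local broker
        props.put("enable.idempotence", "true");             // keeps retries from reordering sends
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String docId = "doc-123";  // hypothetical document ID, used as the message key

            // Same key => same partition => consumers see 'received' strictly before
            // 'transformed' for this document, even though both types share one topic.
            producer.send(new ProducerRecord<>("events", docId,
                    "{\"type\":\"received\",\"url\":\"https://example.com/doc-123\"}"));
            producer.send(new ProducerRecord<>("events", docId,
                    "{\"type\":\"transformed\",\"url\":\"https://example.com/doc-123.out\"}"));
        }
    }
}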
To the original question: your transformer process can run as a stream, as Ryanne suggests, which should take care of most crash situations -- Kafka won't advance the stream's consumption offset unless your transformer completes successfully. Make the transformer idempotent, i.e. if the transformed record already exists in S3, it should just overwrite it. If you do put both types of events in the same topic, then the transformer stream can skip "transformed" events and only execute its transformation on "received" events.

Think hard about how you want to handle failures in the transformer stream. Some failures are unrecoverable, and you don't want the stream process to die when they occur, otherwise it won't be able to make progress past the failing message. For these cases, you'll likely want to send the data to a dead-letter queue or something similar, so you can examine why the failure occurred, determine whether it is fixable, and reprocess the event if it is. In our case we actually don't send the original data to the DLQ, but rather just the metadata of the failing message (topic, partition, offset) and error information. For temporary failures, e.g. S3 being unavailable, OutOfMemory errors, and so forth, it's OK for the stream to die, so that the process will start over at the current offset and retry. This is a great intro to DLQs with Kafka: https://eng.uber.com/reliable-reprocessing/.

Finally, if you need an aggregated "status" view, you can use KTables to aggregate all the event information together.

(Rough sketches of a transformer topology, a metadata-only DLQ helper, and a status KTable are appended at the bottom of this message, below the quoted thread.)

Regards,
Raman

On Wed, May 8, 2019 at 3:44 PM Ryanne Dolan <ryannedo...@gmail.com> wrote:

> Pavel, one thing I'd recommend: don't jam multiple event types into a
> single topic. You are better served with multiple topics, each with a
> single schema and event type. In your case, you might have a received
> topic and a transformed topic, with an app consuming received and
> producing transformed.
>
> If your transformer process consumes, produces, and commits in the right
> order, your app can crash and restart without skipping records. Consider
> using Kafka Streams for this purpose, as it takes care of the semantics
> you need to do this correctly.
>
> Ryanne
>
> On Wed, May 8, 2019 at 12:06 PM Pavel Molchanov <pavel.molcha...@infodesk.com> wrote:
>
> > I have an architectural question.
> >
> > I am planning to create a data transformation pipeline for document
> > transformation. Each component will send processing events to the Kafka
> > 'events' topic.
> >
> > It will have the following steps:
> >
> > 1) Upload data to the repository (S3 or other storage). Get a public URL
> > to the uploaded document. Create a 'received' event with the document URL
> > and send the event to the Kafka 'events' topic.
> >
> > 2) The Transformer process will be listening to the Kafka 'events' topic.
> > It will react to a 'received' event in the 'events' topic, download the
> > document, transform it, push the transformed document to the repository
> > (S3 or other storage), create a 'transformed' event and send the
> > 'transformed' event to the same 'events' topic.
> >
> > The Transformer process can break in the middle (exception, process death,
> > crash, etc.). Upon startup, the Transformer process needs to check the
> > 'events' topic for documents that were received but not transformed.
> >
> > Should it read all events from the 'events' topic? Should it join
> > 'received' and 'transformed' events somehow to understand what was
> > received but not transformed?
> >
> > I don't have a clear idea of how it should behave.
> >
> > Please help.
> >
> > *Pavel Molchanov*
> >
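To make the suggestions above concrete, here is a rough sketch of the transformer as a Kafka Streams topology (Java; the 'events' topic and string/JSON payloads follow the original question, while downloadTransformAndUpload, UnrecoverableTransformException and the DeadLetterQueue helper are placeholders I made up, not an existing API). It skips "transformed" events, transforms "received" ones idempotently by overwriting a deterministic S3 key, routes unrecoverable failures to a DLQ carrying only the record's coordinates, and lets transient failures kill the stream so it restarts from the last committed offset:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

import java.util.Collections;
import java.util.Properties;

public class TransformerApp {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> events =
                builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()));

        // Skip everything that is not a "received" event (including our own
        // "transformed" output), run the transformation, and write the result
        // back to the same 'events' topic.
        events.filter((docId, json) -> json.contains("\"type\":\"received\""))
              .flatTransform(() -> new TransformAndUpload())
              .to("events", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "document-transformer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumption
        new KafkaStreams(builder.build(), props).start();
    }

    static class TransformAndUpload
            implements Transformer<String, String, Iterable<KeyValue<String, String>>> {

        private ProcessorContext context;

        @Override
        public void init(ProcessorContext context) {
            this.context = context;
        }

        @Override
        public Iterable<KeyValue<String, String>> transform(String docId, String receivedJson) {
            try {
                // Placeholder for: download the document at the URL in receivedJson,
                // transform it, and PUT it to a deterministic S3 key. Re-running this
                // after a crash just overwrites the same key, so it is idempotent.
                String transformedUrl = downloadTransformAndUpload(receivedJson);
                return Collections.singletonList(KeyValue.pair(docId,
                        "{\"type\":\"transformed\",\"url\":\"" + transformedUrl + "\"}"));
            } catch (UnrecoverableTransformException e) {
                // Poison message: record where it lives (not the payload) and move on,
                // so the stream keeps making progress. See the DLQ helper sketched below.
                DeadLetterQueue.send(context.topic(), context.partition(), context.offset(), e);
                return Collections.emptyList();
            }
            // Transient failures (S3 unavailable, OutOfMemory, ...) are deliberately not
            // caught: the stream thread dies and reprocessing resumes from the last
            // committed offset on restart.
        }

        @Override
        public void close() {}
    }

    static class UnrecoverableTransformException extends RuntimeException {
        UnrecoverableTransformException(String message, Throwable cause) { super(message, cause); }
    }

    private static String downloadTransformAndUpload(String receivedJson) {
        return "https://example.com/doc-123.out";  // stand-in for the real S3 upload
    }
}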
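The DeadLetterQueue helper referenced above could be as simple as the following (the 'events-dlq' topic name and broker address are assumptions). The point is that only the failing record's coordinates and the error go to the DLQ; the payload itself stays in 'events', so the original message can be re-driven once the problem is understood:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class DeadLetterQueue {

    private static final KafkaProducer<String, String> PRODUCER = createProducer();

    private static KafkaProducer<String, String> createProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption: local broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return new KafkaProducer<>(props);
    }

    /** Record where the poison message lives and why it failed; the payload stays in 'events'. */
    public static void send(String topic, int partition, long offset, Exception error) {
        String value = String.format(
                "{\"topic\":\"%s\",\"partition\":%d,\"offset\":%d,\"error\":\"%s\"}",
                topic, partition, offset, error.getMessage());
        PRODUCER.send(new ProducerRecord<>("events-dlq",
                topic + "-" + partition + "-" + offset, value));
    }
}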
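And a rough sketch of the aggregated "status" view as a KTable (same string/JSON assumption; this would typically run as its own Streams application, or reuse the source KStream from the transformer topology rather than re-reading the topic). The table keeps the latest status per document, so documents that were received but never transformed can be found by scanning the 'document-status' store:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class DocumentStatusView {

    /** Latest status per document ID, materialized as a queryable state store. */
    public static KTable<String, String> build(StreamsBuilder builder) {
        return builder
                .stream("events", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                .aggregate(
                        () -> "none",  // status before any event has arrived
                        (docId, eventJson, currentStatus) ->
                                eventJson.contains("\"type\":\"transformed\"")
                                        ? "transformed"
                                        : "received",
                        Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("document-status")
                                .withKeySerde(Serdes.String())
                                .withValueSerde(Serdes.String()));
    }
}

The materialized store can then be read with Kafka Streams interactive queries, or the KTable can be written out to a compacted topic if another service needs the status view.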