If the ordering of these events is important, then putting them in the same topic is not only desirable, it's necessary. See https://www.confluent.io/blog/put-several-event-types-kafka-topic/. However, think hard about whether ordering actually matters in your use case, as things are certainly simpler when a topic contains a single message type.
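As a concrete illustration, here is a minimal producer sketch (plain Java; the 'events' topic name comes from the question below, while the broker address, the document ID, and the JSON payloads are made-up examples). Keying both event types by the same document ID sends them to the same partition, which is what gives you their relative order within a single topic:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class SingleTopicOrderingExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption: local broker
        props.put("enable.idempotence", "true");             // keeps retries from reordering sends
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String docId = "doc-123";  // hypothetical document ID, used as the message key

            // Same key => same partition => consumers see 'received' strictly before
            // 'transformed' for this document, even though both types share one topic.
            producer.send(new ProducerRecord<>("events", docId,
                    "{\"type\":\"received\",\"url\":\"https://example.com/doc-123\"}"));
            producer.send(new ProducerRecord<>("events", docId,
                    "{\"type\":\"transformed\",\"url\":\"https://example.com/doc-123.out\"}"));
        }
    }
}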
To the original question: your transformer process can run as a stream, as Ryanne suggests, which should take care of most crash situations -- Kafka won't advance the stream's consumption offset unless your transformer completes successfully. Make the transformer idempotent, i.e. if the transformed record already exists in S3, it should just overwrite it. If you do put both types of events in the same topic, then the transformer stream can skip "transformed" events and only execute its transformation on "received" events.

Think hard about how you want to handle failures in the transformer stream. Some failures are unrecoverable, and you don't want the stream process to die when they occur, otherwise it won't be able to make progress past the failing message. For these cases, you'll likely want to send the data to a dead-letter queue or something similar, so you can examine why the failure occurred, determine whether it is fixable, and reprocess the event if it is. In our case we actually don't send the original data to the DLQ, but rather just the metadata of the failing message (topic, partition, offset) and error information. For temporary failures, e.g. S3 being unavailable, OutOfMemory errors, and so forth, it's OK for the stream to die, so that the process will start over at the current offset and retry. This is a great intro to DLQs with Kafka: https://eng.uber.com/reliable-reprocessing/.

Finally, if you need an aggregated "status" view, you can use KTables to aggregate all the event information together.

(Rough sketches of a transformer topology, a metadata-only DLQ helper, and a status KTable are appended at the bottom of this message, below the quoted thread.)

Regards,
Raman

On Wed, May 8, 2019 at 3:44 PM Ryanne Dolan <ryannedo...@gmail.com> wrote:

> Pavel, one thing I'd recommend: don't jam multiple event types into a
> single topic. You are better served with multiple topics, each with a
> single schema and event type. In your case, you might have a received
> topic and a transformed topic, with an app consuming received and
> producing transformed.
>
> If your transformer process consumes, produces, and commits in the right
> order, your app can crash and restart without skipping records. Consider
> using Kafka Streams for this purpose, as it takes care of the semantics
> you need to do this correctly.
>
> Ryanne
>
> On Wed, May 8, 2019 at 12:06 PM Pavel Molchanov <pavel.molcha...@infodesk.com> wrote:
>
> > I have an architectural question.
> >
> > I am planning to create a data transformation pipeline for document
> > transformation. Each component will send processing events to the Kafka
> > 'events' topic.
> >
> > It will have the following steps:
> >
> > 1) Upload data to the repository (S3 or other storage). Get a public URL
> > to the uploaded document. Create a 'received' event with the document URL
> > and send the event to the Kafka 'events' topic.
> >
> > 2) The Transformer process will be listening to the Kafka 'events' topic.
> > It will react to a 'received' event in the 'events' topic, download the
> > document, transform it, push the transformed document to the repository
> > (S3 or other storage), create a 'transformed' event and send the
> > 'transformed' event to the same 'events' topic.
> >
> > The Transformer process can break in the middle (exception, process death,
> > crash, etc.). Upon startup, the Transformer process needs to check the
> > 'events' topic for documents that were received but not transformed.
> >
> > Should it read all events from the 'events' topic? Should it join
> > 'received' and 'transformed' events somehow to understand what was
> > received but not transformed?
> >
> > I don't have a clear idea of how it should behave.
> >
> > Please help.
> >
> > *Pavel Molchanov*
> >
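To make the suggestions above concrete, here is a rough sketch of the transformer as a Kafka Streams topology (Java; the 'events' topic and string/JSON payloads follow the original question, while downloadTransformAndUpload, UnrecoverableTransformException and the DeadLetterQueue helper are placeholders I made up, not an existing API). It skips "transformed" events, transforms "received" ones idempotently by overwriting a deterministic S3 key, routes unrecoverable failures to a DLQ carrying only the record's coordinates, and lets transient failures kill the stream so it restarts from the last committed offset:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

import java.util.Collections;
import java.util.Properties;

public class TransformerApp {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> events =
                builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()));

        // Skip everything that is not a "received" event (including our own
        // "transformed" output), run the transformation, and write the result
        // back to the same 'events' topic.
        events.filter((docId, json) -> json.contains("\"type\":\"received\""))
              .flatTransform(() -> new TransformAndUpload())
              .to("events", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "document-transformer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumption
        new KafkaStreams(builder.build(), props).start();
    }

    static class TransformAndUpload
            implements Transformer<String, String, Iterable<KeyValue<String, String>>> {

        private ProcessorContext context;

        @Override
        public void init(ProcessorContext context) {
            this.context = context;
        }

        @Override
        public Iterable<KeyValue<String, String>> transform(String docId, String receivedJson) {
            try {
                // Placeholder for: download the document at the URL in receivedJson,
                // transform it, and PUT it to a deterministic S3 key. Re-running this
                // after a crash just overwrites the same key, so it is idempotent.
                String transformedUrl = downloadTransformAndUpload(receivedJson);
                return Collections.singletonList(KeyValue.pair(docId,
                        "{\"type\":\"transformed\",\"url\":\"" + transformedUrl + "\"}"));
            } catch (UnrecoverableTransformException e) {
                // Poison message: record where it lives (not the payload) and move on,
                // so the stream keeps making progress. See the DLQ helper sketched below.
                DeadLetterQueue.send(context.topic(), context.partition(), context.offset(), e);
                return Collections.emptyList();
            }
            // Transient failures (S3 unavailable, OutOfMemory, ...) are deliberately not
            // caught: the stream thread dies and reprocessing resumes from the last
            // committed offset on restart.
        }

        @Override
        public void close() {}
    }

    static class UnrecoverableTransformException extends RuntimeException {
        UnrecoverableTransformException(String message, Throwable cause) { super(message, cause); }
    }

    private static String downloadTransformAndUpload(String receivedJson) {
        return "https://example.com/doc-123.out";  // stand-in for the real S3 upload
    }
}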
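The DeadLetterQueue helper referenced above could be as simple as the following (the 'events-dlq' topic name and broker address are assumptions). The point is that only the failing record's coordinates and the error go to the DLQ; the payload itself stays in 'events', so the original message can be re-driven once the problem is understood:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class DeadLetterQueue {

    private static final KafkaProducer<String, String> PRODUCER = createProducer();

    private static KafkaProducer<String, String> createProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption: local broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return new KafkaProducer<>(props);
    }

    /** Record where the poison message lives and why it failed; the payload stays in 'events'. */
    public static void send(String topic, int partition, long offset, Exception error) {
        String value = String.format(
                "{\"topic\":\"%s\",\"partition\":%d,\"offset\":%d,\"error\":\"%s\"}",
                topic, partition, offset, error.getMessage());
        PRODUCER.send(new ProducerRecord<>("events-dlq",
                topic + "-" + partition + "-" + offset, value));
    }
}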
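And a rough sketch of the aggregated "status" view as a KTable (same string/JSON assumption; this would typically run as its own Streams application, or reuse the source KStream from the transformer topology rather than re-reading the topic). The table keeps the latest status per document, so documents that were received but never transformed can be found by scanning the 'document-status' store:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class DocumentStatusView {

    /** Latest status per document ID, materialized as a queryable state store. */
    public static KTable<String, String> build(StreamsBuilder builder) {
        return builder
                .stream("events", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                .aggregate(
                        () -> "none",  // status before any event has arrived
                        (docId, eventJson, currentStatus) ->
                                eventJson.contains("\"type\":\"transformed\"")
                                        ? "transformed"
                                        : "received",
                        Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("document-status")
                                .withKeySerde(Serdes.String())
                                .withValueSerde(Serdes.String()));
    }
}

The materialized store can then be read with Kafka Streams interactive queries, or the KTable can be written out to a compacted topic if another service needs the status view.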