Have you read this part of the documentation? http://kafka.apache.org/documentation.html#semantics
Just wondering if that solves your use case.

On Mon, Feb 10, 2014 at 9:11 AM, Garry Turkington <
g.turking...@improvedigital.com> wrote:

> Hi,
>
> I've been doing some prototyping on Kafka for a few months now and like
> what I see. It's a good fit for some of my use cases in the areas of data
> distribution, but also for processing - I like a lot of what I see in
> Samza. I'm now working through some of the operational issues and have a
> question for the community.
>
> I have several data sources that I want to push into Kafka, and some of
> the most important arrive as a stream of files dropped into either an SFTP
> location or S3. Conceptually the data really is a stream, but it's being
> chunked and made more batch-like by the deployment model of the
> operational servers. So pulling the data into Kafka and treating it as a
> stream again is a big plus.
>
> But I really don't want duplicate messages. I know Kafka provides
> at-least-once semantics and that's fine; I'm happy to have the de-dupe
> logic external to Kafka. Looking at my producer, I can build a protocol
> around adding record metadata and using ZooKeeper to give me pretty high
> confidence that my clients will know whether they are reading from a file
> that was fully published into Kafka or not.
>
> I had assumed this wouldn't be a unique use case, but after a bunch of
> searches I really don't find much in the way of either tools or
> best-practice patterns for handling this kind of exactly-once message
> processing.
>
> So now I'm thinking that either I just need better web search skills, or
> this isn't something many others are doing - and if so, there's likely a
> reason for that.
>
> Any thoughts?
>
> Thanks
> Garry
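For what it's worth, the consumer-side de-dupe Garry describes can be sketched very simply: the producer stamps each message with metadata identifying its source file and position, and the consumer drops any (file, index) pair it has already seen. This is a minimal illustrative sketch, not a Kafka API - the message shape, field names, and in-memory `seen` set are all assumptions (a real deployment would persist the key set, e.g. alongside consumer offsets).

```python
def dedupe(messages, seen=None):
    """Yield each message at most once, keyed on its file/record metadata.

    `messages` is an iterable of dicts shaped like
    {"file_id": "...", "record_index": 0, "value": ...} - a hypothetical
    format for illustration. `seen` lets the caller supply a persistent
    key store; an in-memory set is used here for simplicity.
    """
    if seen is None:
        seen = set()
    for msg in messages:
        key = (msg["file_id"], msg["record_index"])
        if key in seen:
            continue  # duplicate delivery under at-least-once; drop it
        seen.add(key)
        yield msg


# Example: a redelivered record is filtered out on the second pass.
batch = [
    {"file_id": "part-0001", "record_index": 0, "value": "a"},
    {"file_id": "part-0001", "record_index": 1, "value": "b"},
    {"file_id": "part-0001", "record_index": 0, "value": "a"},  # redelivery
]
unique = list(dedupe(batch))
print(len(unique))  # 2
```

The key design point is that de-duplication is only as strong as the durability of `seen`: if the key set is kept in memory only, a consumer restart reintroduces duplicates, which is why the thread leans on ZooKeeper (or similar) for file-level completion markers.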