Hi,

On Wed, Sep 24, 2014 at 7:23 PM, Dibyendu Bhattacharya
<[email protected]> wrote:
> So you have a single Kafka topic which has a very high retention period
> (that decides the storage capacity of a given Kafka topic) and you want
> to process all historical data first using Camus and then start the
> streaming process?

I don't necessarily want to process the historical data "using Camus",
but I want to keep it forever (longer than Kafka's retention period) and
process both the stored data and the stream. (I don't really care about
how the data got into HDFS, be it Camus or something else, but I assume
that Kafka can't store it forever.)

Imagine that I receive "all tweets posted to Twitter"; they go into my
Kafka instance and are archived to HDFS. Now a user logs in and I want to
display to that user a) all posts that have ever mentioned him/her and
b) continue to update that list from the current stream, in that order.
This happens for a number of users, so it needs to be a repeatable
process with different Spark operations. (A rough sketch of what I mean
is in the P.S. below.)

> The challenge is, Camus and Spark are two different consumers of a
> Kafka topic, and both maintain their consumed offsets in different
> ways. Camus stores offsets in HDFS, and the Spark consumer stores them
> in ZK. As I understand it, you need something that identifies up to
> which point Camus has pulled (for a given partition of a topic) and
> want to start the Spark receiver from there?

I think I need such a thing. Also, since Camus stores those offsets, in
theory it should be possible to consume all HDFS files, read the
offsets, and then start Kafka processing from those offsets (second
sketch below). That sounds very "lambda architecture"-ish to me, so I
was wondering if someone has realized a similar setup.

Thanks
Tobias
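P.S. To make the scenario concrete, here is a minimal sketch of the two
phases for a single user. All the names here (the archive path, the way
mentions are detected as a substring match) are made up for
illustration; I'm only assuming the archived tweets sit in HDFS as text,
one tweet per line.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object Mentions {
  // a) batch pass over the HDFS archive written by Camus
  //    (path is hypothetical)
  def historical(sc: SparkContext, user: String): RDD[String] =
    sc.textFile("hdfs:///archive/tweets/*")
      .filter(_.contains("@" + user))

  // b) keep that result updated from the live Kafka stream
  def live(tweets: DStream[String], user: String): DStream[String] =
    tweets.filter(_.contains("@" + user))
}
```

The interesting part is making b) start exactly where the data seen by
a) ends, which is the offset question.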
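For that handoff, a sketch that assumes two things I haven't verified:
that the last consumed offset per partition can be extracted from
Camus's history files into a simple "topic,partition,offset" text file
(Camus actually stores them as SequenceFiles in its execution history
directory, so that extraction step is hand-waved here), and a consumer
API that accepts explicit starting offsets, like the
KafkaUtils.createDirectStream variant that takes a fromOffsets map.

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ResumeFromCamus {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext()
    val ssc = new StreamingContext(sc, Seconds(10))

    // Hypothetical: offsets exported from Camus's history as
    // "topic,partition,offset" lines. Depending on whether the stored
    // offset is last-consumed or next-to-consume, the +1 below may or
    // may not be needed.
    val fromOffsets: Map[TopicAndPartition, Long] =
      sc.textFile("hdfs:///camus/offsets/latest.txt")
        .map { line =>
          val Array(topic, partition, offset) = line.split(",")
          TopicAndPartition(topic, partition.toInt) -> (offset.toLong + 1)
        }
        .collect()
        .toMap

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // Direct stream that starts exactly where Camus stopped.
    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    stream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The open question would then be making the cutover atomic, i.e. not
missing or duplicating messages between the last Camus run and the
stream start.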
