Hi,

On Wed, Sep 24, 2014 at 7:23 PM, Dibyendu Bhattacharya
<[email protected]> wrote:
> So you have a single Kafka topic which has a very high retention period
> (that decides the storage capacity of a given Kafka topic) and you want
> to process all historical data first using Camus and then start the
> streaming process?

I don't necessarily want to process the historical data "using Camus",
but I want to keep it forever (longer than Kafka's retention period) and
process both the stored data and the stream. (I don't really care about
how the data got into HDFS, be it Camus or something else, but I assume
that Kafka can't store it forever.)

Imagine that I receive "all tweets posted to Twitter"; they go into my
Kafka instance and are archived to HDFS. Now a user logs in and I want to
display to that user a) all posts that have ever mentioned him/her and
b) continue to update that list from the current stream, in that order.
This happens for a number of users, so it needs to be a repeatable
process with different Spark operations. (A rough sketch of what I mean
is in the P.S. below.)

> The challenge is, Camus and Spark are two different consumers of a
> Kafka topic, and both maintain their consumed offsets in different
> ways. Camus stores offsets in HDFS, and the Spark consumer stores them
> in ZK. As I understand it, you need something that identifies up to
> which point Camus has pulled (for a given partition of a topic) and
> want to start the Spark receiver from there?

I think I need such a thing. Also, since Camus stores those offsets, in
theory it should be possible to consume all HDFS files, read the
offsets, and then start Kafka processing from those offsets (second
sketch below). That sounds very "lambda architecture"-ish to me, so I
was wondering if someone has realized a similar setup.

Thanks
Tobias
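P.S. To make the scenario concrete, here is a minimal sketch of the two
phases for a single user. All the names here (the archive path, the way
mentions are detected as a substring match) are made up for
illustration; I'm only assuming the archived tweets sit in HDFS as text,
one tweet per line.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object Mentions {
  // a) batch pass over the HDFS archive written by Camus
  //    (path is hypothetical)
  def historical(sc: SparkContext, user: String): RDD[String] =
    sc.textFile("hdfs:///archive/tweets/*")
      .filter(_.contains("@" + user))

  // b) keep that result updated from the live Kafka stream
  def live(tweets: DStream[String], user: String): DStream[String] =
    tweets.filter(_.contains("@" + user))
}
```

The interesting part is making b) start exactly where the data seen by
a) ends, which is the offset question.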
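For that handoff, a sketch that assumes two things I haven't verified:
that the last consumed offset per partition can be extracted from
Camus's history files into a simple "topic,partition,offset" text file
(Camus actually stores them as SequenceFiles in its execution history
directory, so that extraction step is hand-waved here), and a consumer
API that accepts explicit starting offsets, like the
KafkaUtils.createDirectStream variant that takes a fromOffsets map.

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ResumeFromCamus {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext()
    val ssc = new StreamingContext(sc, Seconds(10))

    // Hypothetical: offsets exported from Camus's history as
    // "topic,partition,offset" lines. Depending on whether the stored
    // offset is last-consumed or next-to-consume, the +1 below may or
    // may not be needed.
    val fromOffsets: Map[TopicAndPartition, Long] =
      sc.textFile("hdfs:///camus/offsets/latest.txt")
        .map { line =>
          val Array(topic, partition, offset) = line.split(",")
          TopicAndPartition(topic, partition.toInt) -> (offset.toLong + 1)
        }
        .collect()
        .toMap

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // Direct stream that starts exactly where Camus stopped.
    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    stream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The open question would then be making the cutover atomic, i.e. not
missing or duplicating messages between the last Camus run and the
stream start.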
