The offset of a message in Kafka never changes.
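
So the partition id plus the offset makes a stable per-message ID. A
minimal sketch (the ID format here is just an illustration):

    // Topic, partition, and offset uniquely identify a message, and the
    // offset of a committed message never changes, so this ID is stable
    // across restarts and re-reads.
    String messageId = topic + "-" + partition + "-" + offset;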

Thanks,

Jun


On Thu, Jun 5, 2014 at 8:27 AM, Nagesh <nageswara.r...@gmail.com> wrote:

> As Jun Rao said, it is quite possible for multiple publishers to publish
> to a topic, and different groups of consumers can consume the messages and
> apply group-specific logic, for example raw data processing, aggregation,
> etc. Each distinct consumer group will receive a copy.
>
> But the offset cannot be used as a UUID, as the counter may reset in case
> you restart Kafka for some reason. Not sure, can someone throw some light?
>
> Regards,
> Nageswara Rao
>
>
> On Thu, Jun 5, 2014 at 8:18 PM, Jun Rao <jun...@gmail.com> wrote:
>
> > It sounds like you want to write to a data store and a data pipe
> > atomically. Since both the data store and the data pipe that you want
> > to use are highly available, the only case that you need to protect
> > against is the client failing between the two writes. One way to do
> > that is to let the client publish to Kafka first with the strongest
> > ack. Then, run a few consumers to read data from Kafka and write it to
> > the data store. Any one of those consumers can die, and its work will
> > be automatically picked up by the remaining ones. You can use the
> > partition id and the offset of each message as its UUID if needed.
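> >
> > A rough sketch of the consumer side (assuming the Kafka 0.8 high-level
> > consumer; store.upsert() is a hypothetical idempotent write, so a
> > consumer that dies and re-reads the same messages does no harm):
> >
> >     import java.util.*;
> >     import kafka.consumer.*;
> >     import kafka.javaapi.consumer.ConsumerConnector;
> >     import kafka.message.MessageAndMetadata;
> >
> >     Properties props = new Properties();
> >     props.put("zookeeper.connect", "zk1:2181");
> >     props.put("group.id", "store-writers");   // consumers share the work
> >     ConsumerConnector connector =
> >         Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
> >
> >     Map<String, List<KafkaStream<byte[], byte[]>>> streams =
> >         connector.createMessageStreams(Collections.singletonMap("events", 1));
> >     ConsumerIterator<byte[], byte[]> it = streams.get("events").get(0).iterator();
> >     while (it.hasNext()) {
> >         MessageAndMetadata<byte[], byte[]> msg = it.next();
> >         // Partition id + offset never change, so they form a stable UUID.
> >         String uuid = msg.topic() + "-" + msg.partition() + "-" + msg.offset();
> >         store.upsert(uuid, msg.message());    // idempotent write to the store
> >     }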
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Wed, Jun 4, 2014 at 10:56 AM, Jonathan Hodges <hodg...@gmail.com>
> > wrote:
> >
> > > Sorry didn't realize the mailing list wasn't copied...
> > >
> > >
> > > ---------- Forwarded message ----------
> > > From: Jonathan Hodges <hodg...@gmail.com>
> > > Date: Wed, Jun 4, 2014 at 10:56 AM
> > > Subject: Re: Hadoop Summit Meetups
> > > To: Neha Narkhede <neha.narkh...@gmail.com>
> > >
> > >
> > > We have a number of customer-facing online learning applications.
> > > These applications use heterogeneous technologies with different data
> > > models in underlying data stores such as RDBMS, Cassandra, MongoDB,
> > > etc. We would like to run offline analysis on the data contained in
> > > these learning applications with tools like Hadoop and Spark.
> > >
> > > One thought is to use Kafka as a way for these learning applications
> > > to emit data in near real-time for analytics. We developed a common
> > > model, represented as Avro records in HDFS, that spans these learning
> > > applications so that we can accept the same structured message from
> > > each of them. This allows for comparing apples to apples across these
> > > apps as opposed to messy transformations.
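> > >
> > > A hypothetical common event schema along these lines (Avro schemas
> > > are JSON; the field names here are illustrative, not our actual
> > > model):
> > >
> > >     {
> > >       "type": "record",
> > >       "name": "LearningEvent",
> > >       "namespace": "com.example.learning",
> > >       "fields": [
> > >         {"name": "eventId",   "type": "string"},
> > >         {"name": "sourceApp", "type": "string"},
> > >         {"name": "userId",    "type": "string"},
> > >         {"name": "eventType", "type": "string"},
> > >         {"name": "timestamp", "type": "long"}
> > >       ]
> > >     }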
> > >
> > > So this all sounds good until you dig into the details. One pattern
> > > is for these applications to update state locally in their data
> > > stores first and then publish to Kafka. The problem is that these two
> > > operations aren't atomic, so the local persist can succeed while the
> > > publish to Kafka fails, leaving the application and HDFS out of sync.
> > > You can try to add some retry logic to the clients, but this quickly
> > > becomes very complicated and still doesn't solve the underlying
> > > problem.
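> > >
> > > In code, the problematic sequence looks something like this (a
> > > sketch; dao and toAvroBytes are hypothetical, producer is a Kafka 0.8
> > > producer configured as in the next pattern):
> > >
> > >     dao.save(entity);                        // local persist succeeds
> > >     // <-- a crash here loses the event: Kafka never sees it
> > >     producer.send(new KeyedMessage<String, byte[]>(
> > >         "learning-events", entity.getId(), toAvroBytes(entity)));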
> > >
> > > Another pattern is to publish to Kafka first with
> > > request.required.acks set to -1 and wait for the ack from the leader
> > > and replicas before persisting locally. This is probably better than
> > > the other pattern but does add some complexity to the client. The
> > > clients must now generate unique entity IDs/UUIDs for persistence
> > > when they typically rely on the data store to create these. Also, the
> > > publish to Kafka can succeed while the local persist fails, leaving
> > > the stores out of sync. In this case the learning application needs
> > > to determine how to get itself back in sync. It can rely on getting
> > > the data back from Kafka, but it is possible the local store failure
> > > can't be fixed in a timely manner, e.g. hardware failure, a
> > > constraint violation, etc. In that case the application needs to show
> > > an error to the user and likely needs to do something like send a
> > > delete message to Kafka to remove the earlier published message.
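> > >
> > > A sketch of this second pattern (assuming the 0.8 producer API; dao,
> > > toAvroBytes, and deleteMarker are hypothetical helpers):
> > >
> > >     import java.util.Properties;
> > >     import java.util.UUID;
> > >     import kafka.javaapi.producer.Producer;
> > >     import kafka.producer.KeyedMessage;
> > >     import kafka.producer.ProducerConfig;
> > >
> > >     Properties props = new Properties();
> > >     props.put("metadata.broker.list", "broker1:9092,broker2:9092");
> > >     props.put("request.required.acks", "-1"); // wait for leader + replicas
> > >     Producer<String, byte[]> producer =
> > >         new Producer<String, byte[]>(new ProducerConfig(props));
> > >
> > >     String id = UUID.randomUUID().toString(); // client-generated, not from the store
> > >     producer.send(new KeyedMessage<String, byte[]>(
> > >         "learning-events", id, toAvroBytes(id, entity)));
> > >     try {
> > >         dao.save(id, entity);                 // local persist after the durable publish
> > >     } catch (Exception e) {
> > >         // Compensate: publish a delete so downstream consumers drop the event.
> > >         producer.send(new KeyedMessage<String, byte[]>(
> > >             "learning-events", id, deleteMarker(id)));
> > >         throw e;
> > >     }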
> > >
> > > A third, last-resort pattern might be to go the CDC route with
> > > something like Databus. This would require implementing additional
> > > fetchers and relays to support Cassandra and MongoDB. Also, the data
> > > would need to be transformed on the Hadoop/Spark side for virtually
> > > every learning application, since they have different data models.
> > >
> > > I hope this gives enough detail to start discussing transactional
> > > messaging in Kafka. We are willing to help in this effort if it makes
> > > sense for our use cases.
> > >
> > > Thanks
> > > Jonathan
> > >
> > >
> > >
> > > On Wed, Jun 4, 2014 at 9:44 AM, Neha Narkhede
> > > <neha.narkh...@gmail.com> wrote:
> > >
> > > > If you are comfortable, share it on the mailing list. If not, I'm
> > > > happy to have this discussion privately.
> > > >
> > > > Thanks,
> > > > Neha
> > > > On Jun 4, 2014 9:42 AM, "Neha Narkhede" <neha.narkh...@gmail.com>
> > > > wrote:
> > > >
> > > >> Glad it was useful. It will be great if you can share your
> > > >> requirements on atomicity. A couple of us are very interested in
> > > >> thinking about transactional messaging in Kafka.
> > > >>
> > > >> Thanks,
> > > >> Neha
> > > >> On Jun 4, 2014 6:57 AM, "Jonathan Hodges" <hodg...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Hi Neha,
> > > >>>
> > > >>> Thanks so much to you and the Kafka team for putting together
> > > >>> the meetup. It was very nice and gave people from out of town
> > > >>> like us the ability to join in person.
> > > >>>
> > > >>> We are the guys from Pearson Education and we talked a little
> > > >>> about supplying some details on some of our use cases with
> > > >>> respect to atomicity of source systems eventing data and
> > > >>> persisting locally. Should we just post to the list or is there
> > > >>> somewhere else we should send these details?
> > > >>>
> > > >>> Thanks again!
> > > >>> Jonathan
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Fri, Apr 11, 2014 at 9:31 AM, Neha Narkhede
> > > >>> <neha.narkh...@gmail.com> wrote:
> > > >>>
> > > >>> > Yes, that's a great idea. I can help organize the meetup at
> > > >>> > LinkedIn.
> > > >>> >
> > > >>> > Thanks,
> > > >>> > Neha
> > > >>> >
> > > >>> >
> > > >>> > On Fri, Apr 11, 2014 at 8:44 AM, Saurabh Agarwal (BLOOMBERG/
> > > >>> > 731 LEXIN) <sagarwal...@bloomberg.net> wrote:
> > > >>> >
> > > >>> > > great idea. I am interested in attending as well....
> > > >>> > >
> > > >>> > > ----- Original Message -----
> > > >>> > > From: users@kafka.apache.org
> > > >>> > > To: users@kafka.apache.org
> > > >>> > > At: Apr 11 2014 11:40:56
> > > >>> > >
> > > >>> > > With the Hadoop Summit in San Jose 6/3 - 6/5, I wondered if
> > > >>> > > any of the LinkedIn geniuses were thinking of putting
> > > >>> > > together a meet-up on any of the associated technologies like
> > > >>> > > Kafka, Samza, Databus, etc. For us poor souls that don't live
> > > >>> > > on the West Coast, it was a great experience attending the
> > > >>> > > Kafka meetup last year.
> > > >>> > >
> > > >>> > > Jonathan
>
>
> --
> Thanks & Regards,
> Nageswara Rao.V
>
