As Jun Rao said, it is entirely possible for multiple publishers to publish
to a topic, and for different groups of consumers to consume each message and
apply group-specific logic, for example raw data processing, aggregation, etc.
Each distinct group will receive a copy.
But the offset cannot be used as a UUID, since the counter may reset in case
you restart Kafka for some reason. Not sure, can someone throw some light?
Regards,
Nageswara Rao
On Thu, Jun 5, 2014 at 8:18 PM, Jun Rao jun...@gmail.com wrote:
It sounds like you want to write to a data store and a data pipe
atomically. Since both the data store and the data pipe that you want to
use are highly available, the only case that you want to protect is the
client failing between the two writes. One way to do that is to let the client
publish to Kafka first with the strongest ack. Then, run a few consumers to
read data from Kafka and then write the data to the data store. Any one of
those consumers can die and the work will be automatically picked up by the
remaining ones. You can use partition id and the offset of each message as
its UUID if needed.
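
A minimal sketch of the idea above, not taken from the thread: it builds an ID from the partition id and offset, and uses that ID to make the consumer's write to the data store safe to repeat when another consumer picks up the work. `Record`, `message_id`, and `write_idempotently` are hypothetical names, the dict stands in for the real data store, and including the topic name in the key is an added assumption for the case where several topics feed one store.

```python
from typing import Dict, NamedTuple


class Record(NamedTuple):
    """Hypothetical stand-in for a consumed Kafka message."""
    topic: str
    partition: int
    offset: int
    value: bytes


def message_id(record: Record) -> str:
    # An offset is only unique within one partition, so the partition id
    # has to be part of the key. The topic is added as an assumption.
    return f"{record.topic}-{record.partition}-{record.offset}"


def write_idempotently(store: Dict[str, bytes], record: Record) -> bool:
    """Write the record unless its ID is already stored.

    Returns True if the record was written, False if it was a duplicate
    (for example, redelivered after a consumer died mid-batch).
    """
    key = message_id(record)
    if key in store:
        return False
    store[key] = record.value
    return True
```

With this shape, a consumer that reprocesses a message after a failure produces the same key and the second write becomes a no-op.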
Thanks,
Jun
On Wed, Jun 4, 2014 at 10:56 AM, Jonathan Hodges hodg...@gmail.com
wrote:
Sorry didn't realize the mailing list wasn't copied...
-- Forwarded message --
From: Jonathan Hodges hodg...@gmail.com
Date: Wed, Jun 4, 2014 at 10:56 AM
Subject: Re: Hadoop Summit Meetups
To: Neha Narkhede neha.narkh...@gmail.com
We have a number of customer facing online learning applications. These
applications are using heterogeneous technologies with different data
models in underlying data stores such as RDBMS, Cassandra, MongoDB, etc.
We would like to run offline analysis on the data contained in these
learning applications with tools like Hadoop and Spark.
One thought is to use Kafka as a way for these learning applications to
emit data in near real-time for analytics. We developed a common model
represented as Avro records in HDFS that spans these learning applications
so that we can accept the same structured message from them. This allows
for comparing apples to apples across these apps as opposed to messy
transformations.
So this all sounds good until you dig into the details. One pattern is for
these applications to update state locally in their data stores first and
then publish to Kafka. The problem with this is that these two operations
aren't atomic, so the local persist can succeed and the publish to Kafka
fail, leaving the application and HDFS out of sync. You can try to add some
retry logic to the clients, but this quickly becomes very complicated and
still doesn't solve the underlying problem.
Another pattern is to publish to Kafka first with acks of -1 and wait for the
ack from the leader and replicas before persisting locally. This is probably
better than the other pattern but does add some complexity to the client.
The clients must now generate unique entity IDs/UUIDs for persistence when
they typically rely on the data store for creating these. Also, the publish
to Kafka can succeed and the local persist can fail, leaving the stores out
of sync. In this case the learning application needs to determine how to get
itself in sync. It can rely on getting this back from Kafka, but it is
possible the local store failure can't be fixed in a timely manner, e.g.
hardware failure, constraint, etc. In this case the application needs to
show an error to the user and likely needs to do something like send a
delete message to Kafka to remove the earlier published message.
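
A rough sketch of the publish-first pattern described above, under stated assumptions: a plain Python list stands in for the Kafka topic (no real producer API is used), `publish_then_persist` and the `("upsert" | "delete", id, payload)` message shape are hypothetical, and the "delete message" is modeled as a compensating record that downstream consumers would have to honor.

```python
import uuid
from typing import Callable, Dict, List, Tuple

# (kind, entity_id, payload); kind is "upsert" or "delete" (hypothetical shape)
Message = Tuple[str, str, Dict[str, str]]


def publish_then_persist(
    log: List[Message],
    persist: Callable[[str, Dict[str, str]], None],
    payload: Dict[str, str],
) -> str:
    """Publish first, then persist locally, compensating on failure."""
    # The client generates the ID itself, since the data store has not
    # seen the entity yet at publish time.
    entity_id = str(uuid.uuid4())

    # Stand-in for a publish that waits for leader and replica acks.
    log.append(("upsert", entity_id, payload))

    try:
        persist(entity_id, payload)
    except Exception:
        # Local persist failed: emit a compensating delete so consumers
        # can discard the earlier message, then surface the error.
        log.append(("delete", entity_id, {}))
        raise
    return entity_id
```

Note that the compensating publish can itself fail, so this narrows the window for inconsistency without closing it, which is the gap the thread is discussing.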
A third, last-resort pattern might be to go the CDC route with something like
Databus. This would require implementing additional fetchers and relays to
support Cassandra and MongoDB. Also, the data will need to be transformed
on the Hadoop/Spark side for virtually every learning application since
they have different data models.
I hope this gives enough detail to start discussing transactional messaging
in Kafka. We are willing to help in this effort if it makes sense for our
use cases.
Thanks
Jonathan
On Wed, Jun 4, 2014 at 9:44 AM, Neha Narkhede neha.narkh...@gmail.com
wrote:
If you are comfortable, share it on the mailing list. If not, I'm happy to
have this discussion privately.
Thanks,
Neha
On Jun 4, 2014 9:42 AM, Neha Narkhede neha.narkh...@gmail.com
wrote:
Glad it was useful. It will be great if you can share your requirements
on atomicity. A couple of us are very interested in thinking about
transactional messaging in Kafka.
Thanks,
Neha
On Jun 4, 2014 6:57 AM, Jonathan Hodges hodg...@gmail.com wrote:
Hi Neha,
Thanks so much to you and the Kafka team for putting together the meetup.
It was very nice and gave people from out of town like us the ability to
join in person.
We are the guys from Pearson Education and we