RE: Kafka + Storm

Georgy Abraham Mon, 18 Aug 2014 00:32:05 -0700

For all these reasons Kafka is one of the most used data ingestion sources for 
Storm, and in the latest version the most widely used Kafka spout is integrated 
into the Storm code base.

-----Original Message-----
From: Corey Nolet
Sent: 15-08-2014 AM 09:25
To: [email protected]
Subject: Re: Kafka + Storm

Kafka is also distributed in nature, which is not something easily achieved by 
queuing brokers like ActiveMQ or JMS (1.0) in general. Kafka allows data to be 
partitioned across many machines, and the number of machines can grow as your 
data grows. 
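
A minimal sketch of the partitioning idea, not Kafka's actual partitioner: each 
message key is hashed to pick one of N partitions, so the same key always lands 
on the same partition and the load spreads across machines. The function names 
and the choice of CRC32 as the hash are assumptions made for illustration.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index with a stable hash.

    CRC32 is an assumption here; it stands in for whatever hash a real
    partitioner would use.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def distribute(keys, num_partitions):
    """Group keys by the partition they hash to."""
    partitions = {p: [] for p in range(num_partitions)}
    for key in keys:
        partitions[partition_for(key, num_partitions)].append(key)
    return partitions


if __name__ == "__main__":
    # Hypothetical keys; repeated keys always map to the same partition.
    for partition, keys in distribute(["user-1", "user-2", "user-3", "user-1"], 3).items():
        print(partition, keys)
```

Adding partitions (and machines to host them) raises the ceiling on throughput, 
at the cost of keys being remapped when the partition count changes.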

On Thu, Aug 14, 2014 at 11:20 PM, Justin Workman <[email protected]> 
wrote:

Absolutely!

Sent from my iPhone

On Aug 14, 2014, at 9:02 PM, anand nalya <[email protected]> wrote:

I agree. Not for the long run, but for small bursts in the data production 
rate, say at peak hours, Kafka can help provide a somewhat consistent load on 
the Storm cluster.

From: Justin Workman
Sent: 15-08-2014 07:53
To: [email protected]
Subject: Re: Kafka + Storm

I suppose not directly. It depends on the lifetime of your Kafka queues and on 
your latency requirements. You need to make sure you have enough "doctors", or 
in Storm language, workers, in your Storm cluster to process your messages 
within your SLA. 

In our case, we have a 3 hour lifetime, or TTL, configured for our queues, 
meaning records in the queue older than 3 hours are purged. We also have an 
internal SLA (a team goal, not published to the business ;)) of 10 seconds from 
event to end of stream and availability for end user consumption. 

So we need to make sure we have enough Storm workers to: 1) meet the normal 
SLA and 2) be able to "catch up" on the queues when we have to take Storm down 
for maintenance and such and the queues build up. 
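
As a rough illustration of the "catch up" sizing, here is a back-of-the-envelope 
sketch. All rates and durations are hypothetical numbers, not figures from our 
cluster, and the function names are made up for this example. The point is that 
the backlog only drains at the difference between processing rate and incoming 
rate.

```python
def backlog_after_outage(outage_seconds, incoming_rate):
    """Messages that pile up in the queue while the consumers are down."""
    return outage_seconds * incoming_rate


def catch_up_seconds(backlog_msgs, incoming_rate, processing_rate):
    """Seconds needed to drain the backlog while new messages keep arriving.

    Returns None when there is no spare capacity, i.e. the backlog never drains.
    """
    spare = processing_rate - incoming_rate
    if spare <= 0:
        return None
    return backlog_msgs / spare


if __name__ == "__main__":
    # Hypothetical: 1000 msg/s arriving, 30 minute outage, 1500 msg/s capacity.
    backlog = backlog_after_outage(30 * 60, 1000)
    print(backlog, catch_up_seconds(backlog, 1000, 1500))
```

With no headroom above the normal incoming rate, a cluster that meets its SLA in 
steady state would never recover from an outage.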

There are many knobs you can tune for both storm and Kafka. We have spent many 
hours tuning things to meet our SLAs.

Justin

Sent from my iPhone

On Aug 14, 2014, at 8:05 PM, anand nalya <[email protected]> wrote:

Also, since Kafka acts as a buffer, storm is not directly affected by the speed 
of your data sources/producers.

From: Justin Workman
Sent: 15-08-2014 07:12
To: [email protected]
Subject: Re: Kafka + Storm

Good analogy!

Sent from my iPhone

On Aug 14, 2014, at 7:36 PM, "Adaryl \"Bob\" Wakefield, MBA" 
<[email protected]> wrote:

Ah, so Storm is the hospital and Kafka is the waiting room where everybody 
queues up to be seen in turn, yes?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Justin Workman
Sent: Thursday, August 14, 2014 7:47 PM
To: [email protected]
Subject: Re: Kafka + Storm

If you are familiar with Weblogic or ActiveMQ, it is similar. Let's see if I 
can explain; I am definitely not a subject matter expert on this. 

Within Kafka you can create "queues", e.g. a webclicks queue. Your web servers 
can then send click events to this queue in Kafka. The web servers, or agents 
writing the events to this queue, are referred to as "producers". Each 
event, or message, in Kafka is assigned an id. 

On the other side there are "consumers"; in Storm's case this would be the Storm 
Kafka spout, which can subscribe to this webclicks queue to consume the messages 
that are in the queue. The consumer can consume a single message from the 
queue, or a batch of messages, as Storm does. The consumer keeps track of the 
latest offset (the Kafka message id) that it has consumed. This way, the next 
time the consumer checks to see if there are more messages to consume, it will 
ask for messages with a message id greater than its last offset. 
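
The offset bookkeeping described here can be sketched with a small in-memory 
stand-in. This is not the Kafka or Storm API; the `Log` and `Consumer` classes 
and their methods are hypothetical names for illustration only. Message ids are 
simply positions in an append-only list.

```python
class Log:
    """Append-only in-memory stand-in for one queue; ids are list positions."""

    def __init__(self):
        self._messages = []

    def append(self, payload):
        """Producer side: store a message and return the id assigned to it."""
        self._messages.append(payload)
        return len(self._messages) - 1

    def read_after(self, last_offset, max_batch):
        """Return up to max_batch (id, payload) pairs with id > last_offset."""
        start = last_offset + 1
        batch = self._messages[start:start + max_batch]
        return [(start + i, payload) for i, payload in enumerate(batch)]


class Consumer:
    """Remembers the last id it consumed and asks only for newer messages."""

    def __init__(self, log):
        self.log = log
        self.last_offset = -1  # nothing consumed yet

    def poll(self, max_batch=2):
        batch = self.log.read_after(self.last_offset, max_batch)
        if batch:
            self.last_offset = batch[-1][0]  # advance to the newest id seen
        return [payload for _, payload in batch]


if __name__ == "__main__":
    log = Log()
    for click in ["click-a", "click-b", "click-c"]:
        log.append(click)
    consumer = Consumer(log)
    print(consumer.poll(), consumer.poll(), consumer.poll())
```

Because the position lives with the consumer rather than the queue, a consumer 
that restarts with its saved offset picks up where it left off.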

This helps with the reliability of the event stream and helps guarantee that 
your events/messages make it start to finish through your stream, assuming the 
events get to Kafka ;)

Hope this helps and makes some sort of sense. Again, sent from my iPhone ;)

Justin

Sent from my iPhone

On Aug 14, 2014, at 6:28 PM, "Adaryl \"Bob\" Wakefield, MBA" 
<[email protected]> wrote:

I get your reasoning at a high level. I should have specified that I wasn’t 
sure what Kafka does. I don’t have a hard software engineering background. I 
know that Kafka is “a message queuing” system, but I don’t really know what 
that means.

(I can’t believe you wrote all that from your iPhone....)

B.

From: Justin Workman
Sent: Thursday, August 14, 2014 7:22 PM
To: [email protected]
Subject: Re: Kafka + Storm

Personally, we looked at several options, including writing our own storm 
source. There are limited storm sources with community support out there. For 
us, it boiled down to the following:

1) community support and what appeared to be a standard method. Storm has now 
included the Kafka source as a bundled component of Storm. This made the 
implementation much faster, because the code was done. 

2) the durability (replication and clustering) of Kafka. We have a three hour 
retention period on our queues, so if we need to do maintenance on Storm or 
deploy an updated topology, we don't need to stop or replay any sources.

3) the ability to have other tools attach to the Kafka queues to consume the 
same events for other purposes. 

4) to complement point #1, it's easy to write to Kafka. So it was little effort 
to start sending our desired data to Kafka. 

These are our main reasons (I'm sure there were more). Each use case is going 
to be different and Kafka might not be the best choice for everyone. For us it 
made sense. 

Justin 

Sent from my iPhone

On Aug 14, 2014, at 6:08 PM, "Adaryl \"Bob\" Wakefield, MBA" 
<[email protected]> wrote:

Can someone tell me why people put Kafka in front of Storm? Can’t Storm ingest 
messages without having Kafka in the middle?

B.
