It sounds like you want to use Spark / Spark Streaming to do that kind of 
batching output.

From: Milind Vaidya <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, May 11, 2016 at 4:24 PM
To: "[email protected]" <[email protected]>
Subject: Re: Getting Kafka Offset in Storm Bolt

Yeah. We have some micro-batching in place for other topologies. This one is a 
little ambitious, in the sense that each message is 1~2 KB in size, so grouping 
them into a reasonably sized chunk is necessary, say 500 KB ~ 1 GB (this is 
just my guess; I am not sure what S3 supports or what is optimal). Once that 
chunk is uploaded, all of its messages can be acked. But isn't that overkill? I 
guess Storm is not even meant to support that kind of use case.
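As a rough sketch of the grouping step being discussed (class and method names are mine, not from any Storm or Kafka API; the 10-byte threshold in the usage below is just for illustration, and the 500 KB figure above is the poster's guess, not an S3 requirement):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical size-based batcher: accumulates small messages until the
// buffered bytes reach a target chunk size, then hands back the full chunk
// so it can be written out and uploaded as one object.
class SizeBatcher {
    private final long chunkBytes;
    private final List<String> buffer = new ArrayList<>();
    private long bufferedBytes = 0;

    SizeBatcher(long chunkBytes) {
        this.chunkBytes = chunkBytes;
    }

    /** Adds one message; returns the completed chunk when the size threshold
     *  is crossed, or null while the batch is still filling. */
    List<String> add(String message) {
        buffer.add(message);
        bufferedBytes += message.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
        if (bufferedBytes < chunkBytes) {
            return null;
        }
        List<String> chunk = new ArrayList<>(buffer);
        buffer.clear();
        bufferedBytes = 0;
        return chunk;
    }
}
```

In a real bolt, `add` would be called from `execute`, and a returned chunk would trigger the file write / upload before any acking happens.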

On Wed, May 11, 2016 at 12:59 PM, Nathan Leung <[email protected]> wrote:
You can micro-batch the Kafka contents into a file that's replicated (e.g. on 
HDFS) and then ack all of the input tuples after the file has been closed.
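The "ack everything once the file is closed" idea above can be sketched like this (a stand-alone stub, not Storm API; in a real bolt the callback would be `collector::ack` on the stored `Tuple` objects, and `ackAll` would run when the batch file rolls):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical deferred acker: tuples (represented here by their ids) are
// parked while the current batch file is open, and the whole set is acked
// in one go after the file has been closed and replicated.
class DeferredAcker<T> {
    private final List<T> pending = new ArrayList<>();
    private final Consumer<T> ackFn;

    DeferredAcker(Consumer<T> ackFn) {
        this.ackFn = ackFn;
    }

    /** Record a tuple that has been written to the open file but not acked. */
    void record(T tupleId) {
        pending.add(tupleId);
    }

    /** Call after the batch file has been closed; acks every pending tuple. */
    void ackAll() {
        for (T id : pending) {
            ackFn.accept(id);
        }
        pending.clear();
    }
}
```

The point of holding the acks is that if the worker dies before the file is safely closed, none of the buffered tuples were acked, so the Kafka spout replays them.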

On Wed, May 11, 2016 at 3:43 PM, Milind Vaidya <[email protected]> wrote:
In case of a failure to upload a file, or disk corruption leading to loss of a 
file, we have only the current offset in the Kafka spout but no record of which 
offsets were lost with the file and need to be replayed. These offsets can be 
stored externally in ZooKeeper and used to account for the lost data. For them 
to be saved in ZK, they must be available in a bolt.
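A minimal sketch of the bookkeeping that would be stored in ZooKeeper alongside each file, assuming the bolt can see the partition and offset of each message (names are illustrative, not from storm-kafka): track the lowest and highest offset per partition that went into the batch, so that if the file is lost, replaying [min, max] for each partition recovers exactly the lost range.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-batch offset bookkeeping: one min/max offset range per
// Kafka partition that contributed messages to the current batch file.
class BatchOffsetRange {
    static final class Range {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
    }

    private final Map<Integer, Range> byPartition = new HashMap<>();

    /** Record one message's (partition, offset) as it is written to the file. */
    void record(int partition, long offset) {
        Range r = byPartition.computeIfAbsent(partition, p -> new Range());
        if (offset < r.min) r.min = offset;
        if (offset > r.max) r.max = offset;
    }

    /** Compact form to store in ZK next to the file name,
     *  e.g. "0:100-250;1:90-180" (format is made up for this sketch). */
    String toZkPayload() {
        StringBuilder sb = new StringBuilder();
        byPartition.forEach((p, r) -> {
            if (sb.length() > 0) sb.append(';');
            sb.append(p).append(':').append(r.min).append('-').append(r.max);
        });
        return sb.toString();
    }
}
```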

On Wed, May 11, 2016 at 11:10 AM, Nathan Leung <[email protected]> wrote:
Why not just ack the tuple once it's been written to a file? If your topology 
fails, the data will be re-read from Kafka; the Kafka spout already does this 
for you. Uploading files to S3 is then the responsibility of another job, for 
example a Storm topology that monitors the output folder.

Monitoring the data from Kafka all the way out to S3 seems unnecessary.

On Wed, May 11, 2016 at 1:50 PM, Milind Vaidya <[email protected]> wrote:

It does not matter, in the sense that I am ready to upgrade if this feature is 
on the roadmap.

Nonetheless:

kafka_2.9.2-0.8.1.1
apache-storm-0.9.4



On Wed, May 11, 2016 at 5:53 AM, Abhishek Agarwal <[email protected]> wrote:
Which version of storm-kafka are you using?

On Wed, May 11, 2016 at 12:29 AM, Milind Vaidya <[email protected]> wrote:
Anybody? Any thoughts on this?

On Wed, May 4, 2016 at 11:31 AM, Milind Vaidya <[email protected]> wrote:
Is there any way to know which Kafka offset corresponds to the current tuple I 
am processing in a bolt?

Use case: I need to batch events from Kafka, persist them to a local file, and 
eventually upload that file to S3. To manage failure cases, I need to know the 
Kafka offset for each message, so that it can be persisted to ZooKeeper and 
used when writing / uploading the file.





--
Regards,
Abhishek Agarwal






