>> Now if, for any reason, the machine holding the hourly aggregated data goes
>> down, I want the missing hour's tuples to be replayed from my queue.

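An hour is too long, since the Storm spout's tuple timeout
(topology.message.timeout.secs, 30 seconds by default) would expire long
before that. Even though the timeout is configurable, I do not think raising
it that far would be the right way of doing it.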



Many NoSQL products come to mind that can perform aggregations (Couchbase,
Elasticsearch, and most other K/V-type NoSQL stores). You would put Storm in
front of a NoSQL store to reduce the data throughput (event consolidation), to
perform extremely non-standard aggregations (ones not covered by a simple
map-reduce script), or if you must get real-time stats. Since you said your
stats are not real-time, this leaves us with the following questions:

1. What is your raw event throughput?

2. What type of aggregations are you trying to perform?
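
To make the "event consolidation" idea above concrete, here is a minimal
sketch in Python. It assumes simple per-key counts; the Consolidator class,
the flush_every parameter, and the plain dict standing in for the NoSQL store
are all hypothetical names chosen for illustration, not part of Storm or of
any NoSQL client.

```python
from collections import defaultdict

HOUR_SECONDS = 3600


class Consolidator:
    """Collapses raw events into per-(hour, key) count deltas
    before they reach the store (hypothetical helper)."""

    def __init__(self, store, flush_every=1000):
        self.store = store              # assumption: a dict stands in for the NoSQL store
        self.flush_every = flush_every  # flush after this many buffered raw events
        self.pending = defaultdict(int)
        self.buffered = 0

    def add(self, key, epoch_seconds):
        # Bucket the event by the hour it belongs to.
        hour_bucket = int(epoch_seconds) // HOUR_SECONDS
        self.pending[(hour_bucket, key)] += 1
        self.buffered += 1
        if self.buffered >= self.flush_every:
            self.flush()

    def flush(self):
        # One write per distinct (hour, key) instead of one per raw event.
        for bucket_key, delta in self.pending.items():
            self.store[bucket_key] = self.store.get(bucket_key, 0) + delta
        self.pending.clear()
        self.buffered = 0


store = {}
c = Consolidator(store, flush_every=3)
for key, ts in [("a", 0), ("a", 10), ("b", 3599), ("a", 3600)]:
    c.add(key, ts)
c.flush()  # push whatever is still buffered
```

Because the sketch flushes small deltas frequently rather than holding a whole
hour in memory, losing the process would lose at most one unflushed buffer,
not an hour of aggregates.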



Regards,

Itai


________________________________
From: Nipun Batra <[email protected]>
Sent: Wednesday, October 15, 2014 9:28 AM
To: [email protected]
Subject: Re: Batch ID TxId

Hi Yuval

Thanks for responding. Here is what I have in mind: I was thinking of
aggregating the data on an hourly basis in memory and persisting it every
hour. Now if, for any reason, the machine holding the hourly aggregated data
goes down, I want the missing hour's tuples to be replayed from my queue. Any
suggestions?

Regards
Nipun



On Tue, Oct 14, 2014 at 4:33 PM, Yuval Oren 
<[email protected]> wrote:
Nipun,

That seems to be contrary to the typical Storm pattern of continuous 
processing. Is there a reason you can't continuously read new data? That might 
also scale better.

--
Yuval Oren
N3TWORK

On Oct 14, 2014, at 8:52 AM, Nipun Batra 
<[email protected]> wrote:

Hi

I have a never-ending data feed and I want to define batches on an hourly
basis, i.e. set one batch ID for all the tuples arriving in a particular hour.
If I write my own custom spout, how do I set the batch ID / TxId?

Later the data feed will be consumed from a Kafka topic. If I plan to use the
Kafka spout, is there again a way to batch or set the TxId by hour?
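
For illustration only, the hourly grouping described above could be derived
from each tuple's timestamp as sketched below. This is an assumption about
what "batch ID by hour" would mean, not Storm's or Trident's own
transaction-id mechanism; hourly_batch_id and batch_start are hypothetical
helper names, and timestamps are assumed to be Unix epoch seconds in UTC.

```python
from datetime import datetime, timezone

HOUR_SECONDS = 3600


def hourly_batch_id(epoch_seconds):
    """Return the number of whole hours since the Unix epoch.

    Every event whose timestamp falls in the same clock hour (UTC)
    gets the same id, so the id can serve as an hourly batch key.
    """
    return int(epoch_seconds) // HOUR_SECONDS


def batch_start(batch_id):
    """Return the UTC datetime at which the given hourly batch begins."""
    return datetime.fromtimestamp(batch_id * HOUR_SECONDS, tz=timezone.utc)


# Two events 59 minutes apart within the same hour share one batch id.
first = hourly_batch_id(1413363600)
second = hourly_batch_id(1413363600 + 59 * 60)
same_batch = first == second
```

An id computed this way depends only on the event timestamp, so replaying the
same events from the queue would assign them to the same hourly batch again.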

I have looked at many examples but have not been able to find this. I would
appreciate it if you could point me in the right direction or to any example
of a custom spout setting the batch ID.

I apologize if this has already been asked; I looked around but found nothing.

Thank you in advance
Nipun


