Thank you, Florian, for your extremely clarifying response!

|>oug
> From: Florian Hussonnois <[email protected]>
> To: [email protected]
> Cc:
> Date: Wed, 25 Feb 2015 14:41:17 +0100
> Subject: Re: Do I need Kafka to have a reliable Storm spout?
>
> Hi,
>
> Actually, the tuples aren't persisted into ZooKeeper. If your spout
> emits a tuple with a unique id, it is automatically tracked internally
> by Storm (i.e., by the ackers). Thus, if the emitted tuple fails
> because of a bolt failure, Storm invokes the method 'fail' on the
> originating spout task with the unique id as the argument.
>
> It's then up to you to re-emit the failed tuple.
>
> In sample code, spouts use a Map to track which tuples have not yet been
> fully processed by your entire topology, in order to be able to re-emit
> them in case of a bolt failure.
>
> However, if the failure doesn't come from a bolt but from your spout, the
> in-memory Map will be lost and your topology will not be able to re-emit
> the failed tuples.
>
> For such a scenario you can rely on Kafka. The Kafka spout stores its read
> offset in ZooKeeper. That way, if a spout task goes down, it will be able
> to read its offset from ZooKeeper after restarting.
>
> Hope this helps.
>
> 2015-02-24 18:58 GMT+01:00 Douglas Alan <[email protected]>:
>
> As I understand things, ZooKeeper will persist tuples emitted by bolts, so
> if a bolt crashes (or the computer with the bolt crashes, or the entire
> cluster crashes), the tuple emitted by the bolt will not be lost. Once
> everything is restarted, the tuples will be fetched from ZooKeeper, and
> everything will continue on as if nothing bad had ever happened.
>
> What I don't yet understand is whether the same thing is true for spouts.
> If a spout emits a tuple (i.e., the emit() function within a spout is
> executed), and the computer the spout is running on crashes shortly
> thereafter, will that tuple be resurrected by ZooKeeper? Or do we need
> Kafka in order to guarantee this?
>
> |>oug
>
> P.S. I understand that the tuple emitted by the spout must be assigned a
> unique ID in the call to emit().
>
> P.P.S. I see sample code in books that uses something like
> ConcurrentHashMap<UUID, Values> to track which spouted tuples have not yet
> been acked. Is this somehow automatically persisted with ZooKeeper? I
> suspect not, and if not, then I shouldn't really be doing that, should I?
> What should I be doing instead? Using Kafka?
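
For anyone finding this thread later: here is a minimal sketch of the in-memory
tracking pattern Florian describes, i.e., a spout that remembers every tuple it
has emitted (keyed by message id) and re-emits it when Storm calls fail(). The
class name ReplayingSpout and the nextSentence() data source are placeholders,
and the imports assume the pre-1.0 backtype.storm packages that were current in
2015. Note that this only covers the bolt-failure case: the map lives in the
worker's memory and is not persisted anywhere, so it is lost if the spout's
worker dies.

    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;

    // Illustrative spout: tracks in-flight tuples in an in-memory map so it can
    // re-emit them on fail(). The map is NOT persisted, so a spout crash loses it.
    public class ReplayingSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        // messageId -> tuple values, for everything emitted but not yet acked
        private ConcurrentHashMap<UUID, Values> pending;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            this.pending = new ConcurrentHashMap<>();
        }

        @Override
        public void nextTuple() {
            String sentence = nextSentence();  // placeholder for the real data source
            if (sentence == null) {
                return;
            }
            UUID msgId = UUID.randomUUID();
            Values values = new Values(sentence);
            pending.put(msgId, values);
            // Emitting with a message id is what enables ack/fail tracking by Storm
            collector.emit(values, msgId);
        }

        @Override
        public void ack(Object msgId) {
            // Fully processed by the whole topology: forget it
            pending.remove(msgId);
        }

        @Override
        public void fail(Object msgId) {
            // A downstream bolt failed (or the tuple timed out): replay it
            Values values = pending.get(msgId);
            if (values != null) {
                collector.emit(values, msgId);
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }

        private String nextSentence() {
            return null;  // stub; replace with a real source
        }
    }

For the spout-failure case, the usual answer at the time was the storm-kafka
KafkaSpout, which checkpoints its consumer offset in ZooKeeper so that a
restarted spout task resumes where it left off instead of losing in-flight
tuples. A rough configuration sketch, assuming the storm-kafka 0.9.x API; the
ZooKeeper address, topic name, zkRoot, and consumer id below are placeholders:

    import backtype.storm.spout.SchemeAsMultiScheme;
    import storm.kafka.BrokerHosts;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    // Offsets are written under zkRoot/consumer-id in ZooKeeper, so a restarted
    // spout task picks up from its last committed offset.
    BrokerHosts hosts = new ZkHosts("localhost:2181");
    SpoutConfig spoutConfig = new SpoutConfig(hosts, "my-topic", "/kafka-spout", "consumer-id");
    spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
    KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);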
