It can be very tricky to detect failures in a distributed system. In fact, it's not always possible. Suppose your Thrift RPC is just taking a really long time. At some point it will time out (based on whatever timeout you configured). However, the timeout is client-side: the server may have completed the request and been just about to respond when the client gave up. Or maybe the request failed. Or maybe the server is still working on it. From the client's side these three cases look identical. But that discussion goes well beyond the scope of the original question in this thread :)
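To make the timeout ambiguity concrete, here is a minimal sketch in Python. It does not use Thrift at all: SlowServer and call_with_timeout are hypothetical names, and the "server" is just a thread in the same process. It only illustrates that a client-side timeout says nothing about whether the work was eventually done.

```python
import concurrent.futures
import threading
import time


class SlowServer:
    """Hypothetical stand-in for a remote service (not a Thrift API)."""

    def __init__(self, delay):
        self.delay = delay
        self.applied = []              # requests the server actually completed
        self._lock = threading.Lock()

    def handle(self, request_id):
        time.sleep(self.delay)         # simulate slow processing
        with self._lock:
            self.applied.append(request_id)
        return "ok"


def call_with_timeout(server, request_id, timeout):
    """Return the server's reply, or "unknown" if the client gave up waiting.

    "unknown" is deliberate: on timeout the client cannot tell whether the
    request completed, failed, or is still running.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(server.handle, request_id)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return "unknown"
    finally:
        # Stop waiting, but let the in-flight request keep running,
        # just as a real server would after the client disconnects.
        pool.shutdown(wait=False)
```

If you call call_with_timeout(SlowServer(0.2), "req-1", 0.05), the client gets "unknown" back, yet a moment later the request shows up in the server's applied list. The client reported a failure for work that succeeded.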
-- Ilya

Sent from my iPhone

On Jan 25, 2011, at 7:09, Ted Dunning <[email protected]> wrote:

> On Tue, Jan 25, 2011 at 12:01 AM, Phillip B Oldham <[email protected]> wrote:
>
>> I suppose it would be left up to the client then to test whether a
>> failed response actually completed... adding a fair amount of work to
>> the client.
>>
>
> Yes. It does. And if you don't design the system well, then you may not
> even be able to tell if it has completed.
>
>>
>> Would zookeeper be able to "buffer" requests? For instance, if there
>> were two nodes behind it and they were both momentarily unresponsive,
>> could zookeeper (& the client) keep the connection active and wait for
>> a node to signal itself available and complete the request?
>>
>
> ZK isn't really involved in your request. It is merely helping you
> coordinate things. It is completely reasonable to keep "last completed
> transaction id" in ZK, but that doesn't really solve the problem. You don't
> want to wait forever, after all.
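
The "last completed transaction id" idea from the quoted text could look roughly like the sketch below. This is only an illustration under stated assumptions: CoordinationStore is a hypothetical in-memory stand-in, not a real ZooKeeper client, and it assumes transaction ids increase monotonically and complete in order. The bounded deadline reflects the point that you don't want to wait forever.

```python
import time


class CoordinationStore:
    """In-memory stand-in for a coordination service such as ZooKeeper.

    Hypothetical: a real deployment would read and write a znode instead.
    """

    def __init__(self):
        self._last_completed = 0

    def set_last_completed(self, txn_id):
        # Never move backwards; ids are assumed to complete in order.
        self._last_completed = max(self._last_completed, txn_id)

    def get_last_completed(self):
        return self._last_completed


def wait_for_completion(store, txn_id, deadline_s, poll_s=0.01):
    """Poll until txn_id is recorded as completed or the deadline passes.

    Returns True if the transaction is known to have completed, and False
    if its outcome is still unknown when the deadline expires.
    """
    end = time.monotonic() + deadline_s
    while True:
        if store.get_last_completed() >= txn_id:
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(poll_s)
```

Note that a False result still does not mean the request failed. It only means it had not been recorded as completed by the deadline, which is why this doesn't really solve the problem on its own.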
