Hi all,

I have an 8-node cluster (1 name node, 7 data nodes) running Accumulo 1.4.2, ZooKeeper 3.3.6, and Hadoop 1.0.3, and I have it tuned for ingest performance. My question is how to make that performance degrade gracefully under node failure.
1) When a node fails, I assume Accumulo needs to migrate that node's tablets and Hadoop needs to re-replicate the underlying data blocks. This seems to have a rather catastrophic effect on ingest rates. Is there a way to migrate tablets more gradually (starting with the most active ones) and to throttle block re-replication, so that recovery does not interfere with ingest as severely?

2) What happens to a BatchWriter when the tablet server it is writing to fails? Will I need to start catching MutationsRejectedExceptions, will the write block, or is there some other failure mode?

3) I believe this is a separate issue from node failure, but I am seeing some very odd ZooKeeper behavior involving a number of timeouts. I currently have ZooKeeper running on all 7 data nodes, with the BatchWriters running on the name node. Basically, I keep getting sequences like the following in the client logs:

    client session timed out ...
    opening socket connection
    socket connection established
    session establishment complete
    ...
    client session timed out ...
    (repeat)

I also occasionally get "session expired" for /accumulo/fe7... as well as:

    org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/f37.../tables/3b/state
        at accumulo.core.zookeeper.ZooCache$2.run
        at accumulo.core.zookeeper.ZooCache.retry
        at accumulo.core.zookeeper.ZooCache.get
        at core.client.impl.Tables.getTableState
        at core.client.impl.MultiTableBatchWriter.getBatchWriter
        at myIngestorProcess.run

Does anyone know whether this is an Accumulo problem, a ZooKeeper problem, or something else (an overly busy network, etc.)?

Thanks,
David
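P.S. In case it helps with question 2, here is roughly what my ingest path looks like (heavily simplified; the instance name, hosts, credentials, table name, and mutation contents below are placeholders, not my real values):

```java
// Simplified sketch of the ingest loop (Accumulo 1.4 client API).
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.MultiTableBatchWriter;
import org.apache.accumulo.core.client.MutationsRejectedException;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class IngestSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeperInstance instance =
            new ZooKeeperInstance("myInstance", "datanode1:2181,datanode2:2181");
        Connector connector = instance.getConnector("ingest", "secret".getBytes());

        // 100 MB buffer, 60 s max latency, 4 write threads (example values)
        MultiTableBatchWriter mtbw =
            connector.createMultiTableBatchWriter(100000000L, 60000L, 4);
        try {
            BatchWriter writer = mtbw.getBatchWriter("myTable");
            Mutation m = new Mutation(new Text("row1"));
            m.put(new Text("cf"), new Text("cq"), new Value("value".getBytes()));
            writer.addMutation(m);
        } catch (MutationsRejectedException e) {
            // Question 2, restated: is this where a dead tablet server would
            // surface, or does addMutation() just block and retry internally?
            e.printStackTrace();
        } finally {
            mtbw.close(); // close() can also throw MutationsRejectedException
        }
    }
}
```

In particular, I'd like to know whether wrapping addMutation()/close() like this is sufficient, or whether there is some other failure mode I should handle.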

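For reference on question 3, I haven't tuned ZooKeeper at all; as far as I can tell, the relevant timeout knobs are the ones below (the values shown are the shipped defaults as I understand them, so please correct me if I've misread them):

```
# conf/zoo.cfg (on each of the 7 data nodes)
tickTime=2000   # 2 s heartbeat; client session timeouts are negotiated
                # between 2*tickTime (4 s) and 20*tickTime (40 s)

# accumulo-site.xml
# instance.zookeeper.timeout = 30s   (Accumulo's ZooKeeper session timeout)
```

If the network or the data nodes are busy enough during ingest that heartbeats are delayed past those bounds, would that alone explain the session expirations I'm seeing?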