Hi all,

I have an 8-node cluster (1 name node, 7 data nodes) running Accumulo 1.4.2, 
ZooKeeper 3.3.6, and Hadoop 1.0.3, and I have it tuned for ingest performance. 
My question is how to make that performance degrade gracefully under node 
failure.

1) When a node fails, I assume Accumulo needs to migrate the tablets that node 
was hosting, and Hadoop needs to re-replicate the underlying data blocks. This 
seems to have a rather catastrophic effect on ingest rates. Is there a way to 
migrate tablets more gradually (starting with the more active ones) and to 
throttle block re-replication so that neither interferes with ingest as 
severely?
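
The closest knob I've found so far is the balancer bandwidth cap in 
hdfs-site.xml, though as I understand it that only throttles the balancer, not 
re-replication after a failure (the value below is just the default, for 
illustration):

  <property>
    <name>dfs.balance.bandwidthPerSec</name>
    <!-- max bytes/sec each datanode may spend on balancing; default 1048576 (1 MB/s) -->
    <value>1048576</value>
  </property>

Is there an equivalent for failure-triggered re-replication, or for tablet 
reassignment on the Accumulo side?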

2) What happens to a BatchWriter when a tablet server it is writing to fails? 
Will I need to start catching MutationsRejectedException, will the writer 
block, or is there some other failure mode?
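
For context, my ingest path looks roughly like the sketch below (instance 
name, hosts, table, and credentials are placeholders, and I actually go 
through a MultiTableBatchWriter, as in the stack trace further down). At the 
moment nothing catches failures mid-stream, only around close():

  import org.apache.accumulo.core.client.BatchWriter;
  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.client.MutationsRejectedException;
  import org.apache.accumulo.core.client.ZooKeeperInstance;
  import org.apache.accumulo.core.data.Mutation;
  import org.apache.accumulo.core.data.Value;
  import org.apache.hadoop.io.Text;

  public class IngestSketch {
    public static void main(String[] args) throws Exception {
      Connector conn = new ZooKeeperInstance("myInstance", "zk1,zk2,zk3")
          .getConnector("ingest", "secret".getBytes());
      // 50 MB buffer, 60 s max latency, 4 write threads (1.4-style arguments)
      BatchWriter writer = conn.createBatchWriter("mytable", 50000000L, 60000L, 4);
      try {
        Mutation m = new Mutation(new Text("row1"));
        m.put(new Text("cf"), new Text("cq"), new Value("val".getBytes()));
        writer.addMutation(m); // can throw MutationsRejectedException
        writer.close();        // flushes buffered mutations; failures can surface here too
      } catch (MutationsRejectedException e) {
        // e.getConstraintViolationSummaries() / e.getAuthorizationFailures() for details
        e.printStackTrace();
      }
    }
  }

If a tablet server dies mid-write, does the failure show up here, or does the 
writer retry internally until the tablet is reassigned?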

3) This is, I believe, a separate issue from node failure, but I was seeing 
some very odd ZooKeeper behavior involving a number of timeouts. I currently 
have ZooKeeper running on all 7 data nodes, with the BatchWriters running on 
the name node. Basically, I was getting a lot of the following cycle:
  Client session timed out ...
  Opening socket connection ...
  Socket connection established ...
  Session establishment complete ...
  ...
  Client session timed out ...
  (repeat)

I would also occasionally get
  Session expired for /accumulo/fe7...
as well as
  zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
  for /accumulo/f37.../tables/3b/state
    at accumulo.core.zookeeper.ZooCache$2.run
    at accumulo.core.zookeeper.ZooCache.retry
    at accumulo.core.zookeeper.ZooCache.get
    at core.client.impl.Tables.getTableState
    at core.client.impl.MultiTableBatchWriter.getBatchWriter
    at myIngestorProcess.run

Does anyone know whether this is an Accumulo problem, a ZooKeeper problem, or 
something else (an overly busy network, etc.)?
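
In case it matters, these are the timeout settings I've been staring at; the 
values shown are the defaults as I understand them, not what I've tuned them 
to:

  # zoo.cfg (ZooKeeper 3.3.x)
  tickTime=2000
  # session timeouts are negotiated between 2x and 20x tickTime unless overridden:
  # minSessionTimeout=4000
  # maxSessionTimeout=40000

  <!-- accumulo-site.xml: session timeout Accumulo requests from ZooKeeper -->
  <property>
    <name>instance.zookeeper.timeout</name>
    <value>30s</value>
  </property>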

Thanks,
David

