On 5/21/14, 12:00 PM, thomasa wrote:
Increasing the timeout settings helped a little, but when I tried to increase
the number of map tasks for the workers I ran into instability issues.

After re-reading my original post, I think I left out some important
details. The type of job I am trying to run is a map reduce ingest that uses
batch writers to populate an accumulo table. On previous, smaller clouds, I
have had control of disk allocation and made sure to assign a disk per
worker to avoid write conflicts. On this larger cloud, the disk management
is transparent to me, but I believe the physical disks backing the vms are
seen as one large virtual pool. Raw write speed on the big, unstable cloud is
very fast, 3-4 times that of our smaller clouds, but that figure comes from
running dd on just one vm. I suspect that when all 150+ nodes are writing at
once, multiple nodes end up hitting the same physical disk, driving iowait%
to problematic levels (at least 20-50%).

You could always try your `dd` trick across many nodes at once using pdsh or pssh. That may be a quick way to confirm your hypothesis.
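A minimal sketch of that, assuming pdsh is installed and the workers are named worker001-worker150 (the host list, file size, and paths are placeholders; adjust them to your cluster):

```shell
# Run the same dd write test on every worker simultaneously.
# Hostnames, sizes, and paths below are assumptions, not taken from the thread.
pdsh -w 'worker[001-150]' \
  'dd if=/dev/zero of=/tmp/ddtest.$(hostname -s) bs=1M count=1024 conv=fsync 2>&1 | tail -n 1'

# While the writes are in flight, sample per-node CPU/iowait with iostat:
pdsh -w 'worker[001-150]' 'iostat -c 1 3 | tail -n 2'
```

If the per-node throughput collapses relative to the single-VM number while %iowait spikes, that would support the shared-physical-disk theory.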

So, given my situation, what is the best way to configure accumulo knowing
that the workers share disks and will have write conflicts? Do I just bump
resources down for ingest for stability then ramp them up for non-ingest
jobs?

The simple change you could make would be to reduce the amount of memory available to each NodeManager (yarn.nodemanager.resource.memory-mb in yarn-site.xml). That in turn reduces the number of concurrent containers the NodeManagers run, and ultimately the amount of data being sent to Accumulo.
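A minimal sketch of that setting in yarn-site.xml; the 8192 MB value is only an illustration, not a recommendation, so pick a figure based on what your workers can sustain:

```xml
<!-- yarn-site.xml: cap the memory the NodeManager hands out to containers.
     Lowering this value reduces how many containers run concurrently. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
```

For example, with 2048 MB map containers, an 8192 MB cap limits each worker to roughly four concurrent mappers, which throttles the ingest pressure on the shared disks.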

Depending on the data and your ingest process, there may be more you can do on each client, but that's getting a bit into the weeds.


--
View this message in context: 
http://apache-accumulo.1065345.n5.nabble.com/Losing-tservers-Unusually-high-Last-Contact-times-tp9950p10005.html
Sent from the Users mailing list archive at Nabble.com.
