Increasing the timeout settings helped a little, but when I tried to increase the number of map tasks for the workers I ran into instability issues.
After re-reading my original post, I think I left out some important details. The job I am trying to run is a MapReduce ingest that uses batch writers to populate an Accumulo table. On previous, smaller clouds, I had control of disk allocation and made sure to assign one disk per worker to avoid write conflicts. On this larger cloud, the disk management is transparent to me, but I believe the physical disks backing the VMs are presented as one large virtual pool.

Writes on the big, unstable cloud are very fast, 3-4 times faster than on our smaller clouds, but that is what I see when I dd a file on just one VM. I think that when all 150+ nodes are writing to disk, more than one node will try to write to the same physical disk, causing problematic iowait% (20-50% at least).

So, given my situation, what is the best way to configure Accumulo knowing that the workers share disks and will have write conflicts? Do I just bump resources down during ingest for stability, then ramp them back up for non-ingest jobs?
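
To check whether the shared disks really are the bottleneck, iowait% could be sampled on each node while the ingest is running, rather than with dd on a single idle VM. Below is a minimal sketch, assuming Linux nodes that expose /proc/stat; the function names and the 5-second interval are my own hypothetical choices and have nothing to do with Accumulo itself.

```python
import time


def parse_cpu_line(line):
    """Return the integer counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Counter order: user, nice, system, idle, iowait, irq, softirq, steal, ...
    # Only the first eight are summed, because guest time is already
    # included in the user and nice counters.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample_cpu(path="/proc/stat"):
    """Read the aggregate CPU counters from the first line of /proc/stat."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())


if __name__ == "__main__":
    interval_seconds = 5  # hypothetical sampling interval
    first = sample_cpu()
    time.sleep(interval_seconds)
    second = sample_cpu()
    print("iowait%%: %.1f" % iowait_percent(first, second))
```

Running this on several nodes at once during ingest, and again when the cluster is idle, should show whether iowait climbs only when many nodes write at the same time.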
