One thing you can do is reduce the replication factor for the WAL. We have found that this makes a pretty significant difference in write performance. It can be changed with the tserver.wal.replication property. Setting it to 2 instead of the default (probably 3) should give you some performance improvement, of course at some cost to durability.
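As a rough sketch, both settings can be changed from the Accumulo shell (property names as in Accumulo 1.5; the table name "mytable" is just a placeholder, and tserver.wal.replication may need tservers to roll their logs or restart before it takes effect):

```
root@instance> config -s tserver.wal.replication=2
root@instance> config -t mytable -s table.walog.enabled=false
```

You can verify what a table ended up with via `config -t mytable -f walog`, which filters the listing to the walog-related properties.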
Adam

On Wed, Dec 4, 2013 at 5:14 AM, Peter Tillotson <[email protected]> wrote:

> I've been trying to get the most out of streaming data into Accumulo 1.5
> (Hadoop Cloudera CDH4). Having tried a number of settings, rewriting
> client code, etc., I finally switched off the write-ahead log
> (table.walog.enabled=false) and saw a huge leap in ingest performance.
>
> Ingest with table.walog.enabled=true: ~6 MB/s
> Ingest with table.walog.enabled=false: ~28 MB/s
>
> That is roughly a 4.67x speed improvement.
>
> Now, my use case could probably live without a WAL, or work around not
> having one, but I wondered if this was a known issue? (I didn't see
> anything in JIRA.) The WAL seems to be a significant rate limiter; this
> is either endemic to Accumulo or an HDFS/setup issue. Given that
> everything is in HDFS these days and IO otherwise flies, Accumulo's WAL
> looks like the most likely culprit.
>
> I don't believe this to be an IO issue on the box: with the WAL off there
> is significantly more IO (up to 80 MB/s reported by dstat); with the WAL
> on, up to 12 MB/s. Testing the box with FIO, sequential write is
> 160 MB/s.
>
> Further info:
> Hadoop 2.0.0 (Cloudera CDH4)
> Accumulo 1.5.0
> ZooKeeper (with Netty; minor improvement of <1 MB/s)
> Filesystem (HDFS on ZFS, compression=on, dedup=on; otherwise ext4)
>
> With large imports from scratch I start off CPU-bound, and as more
> shuffling is needed the import becomes disk-bound later on, as expected.
> So I know pre-splitting would probably sort it.
>
> Tnx
>
> P
