If the average is around 1 KB per key/value entry, then 400k entries per second works out to roughly 400 MB/s, and I would say that is very good performance for incremental/streaming ingest into Accumulo on that cluster. However, I suspect that your entries are probably not that big on average. Do you have a measurement of your ingest rate in MB/s?
Adam

On Apr 9, 2014 4:42 PM, "Mike Hugo" <[email protected]> wrote:

>
> On Tue, Apr 8, 2014 at 4:35 PM, Adam Fuchs <[email protected]> wrote:
>
>> Mike,
>>
>> What version of Accumulo are you using, how many tablets do you have,
>> and how many threads are you using for minor and major compaction pools?
>> Also, how big are the keys and values that you are using?
>>
>
> 1.4.5
> 6 threads each for minor and major compaction
> Keys and values are not that large; there may be a few outliers, but I
> would estimate that most of them are < 1k
>
>> Here are a few settings that may help you:
>> 1. WAL replication factor (tserver.wal.replication). This defaults to 3
>> replicas (the HDFS default), but if you set it to 2 it will give you a
>> performance boost without a huge hit to reliability.
>> 2. Ingest buffer size (tserver.memory.maps.max), also known as the
>> in-memory map size. Increasing this generally improves the efficiency of
>> minor compactions and reduces the number of major compactions that will
>> be required down the line. 4-8 GB is not unreasonable.
>> 3. Make sure your WAL settings are such that the size of a log
>> (tserver.walog.max.size) multiplied by the number of active logs
>> (table.compaction.minor.logs.threshold) is greater than the in-memory
>> map size. You probably want to accomplish this by bumping up the number
>> of active logs.
>> 4. Increase the buffer size on the BatchWriter that the clients use.
>> This can be done with the setBatchWriterOptions method on the
>> AccumuloOutputFormat.
>>
>
> Thanks for the tips, I'll try these out.
>
>> Cheers,
>> Adam
>>
>> On Tue, Apr 8, 2014 at 4:47 PM, Mike Hugo <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> We have an ingest process that operates via MapReduce, processing a
>>> large set of XML files and inserting mutations based on that data into
>>> a set of tables.
>>>
>>> On a 5 node cluster (each node has 64G RAM, 20 cores, and ~600GB SSD)
>>> I get 400k inserts per second with 20 mapper tasks running concurrently.
>>> Increasing the number of concurrent mapper tasks to 40 doesn't have any
>>> effect (besides causing a little more backup in compactions).
>>>
>>> I've increased table.compaction.major.ratio and increased the number
>>> of concurrent compactions allowed for both minor and major compaction,
>>> but each of those had only a negligible impact on ingest rates.
>>>
>>> Any advice on other settings I can tweak to get things to move more
>>> quickly? Or is 400k/second a reasonable ingest rate? Are we at a point
>>> where we should consider generating RFiles like the bulk ingest example?
>>>
>>> Thanks in advance for any advice.
>>>
>>> Mike
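
For readers landing on this thread later, the sketch below illustrates tip #4 above (enlarging the BatchWriter buffer that the mappers write through). It assumes the Accumulo 1.5+ MapReduce client API, where AccumuloOutputFormat.setBatchWriterOptions takes a BatchWriterConfig; on the 1.4.x line discussed in the thread the equivalent knobs are exposed through different static setters on AccumuloOutputFormat. The 128 MB buffer, 2 minute latency, and 8 write threads are illustrative starting points, not recommendations.

    import java.util.concurrent.TimeUnit;

    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat;
    import org.apache.hadoop.mapreduce.Job;

    public class IngestJobSetup {

        // Hypothetical helper, called once while building the MapReduce job.
        // A larger buffer lets each mapper batch more mutations per flush to
        // the tablet servers, which is what tip #4 is aiming at.
        static void configureOutput(Job job) {
            BatchWriterConfig bwConfig = new BatchWriterConfig()
                .setMaxMemory(128L * 1024 * 1024)   // 128 MB mutation buffer per mapper
                .setMaxLatency(2, TimeUnit.MINUTES) // let mutations accumulate before a forced flush
                .setMaxWriteThreads(8);             // threads pushing batches to the tservers

            // The setBatchWriterOptions method referred to in the thread.
            AccumuloOutputFormat.setBatchWriterOptions(job, bwConfig);
            job.setOutputFormatClass(AccumuloOutputFormat.class);
        }
    }

Tips 1-3 are server-side properties rather than client code; as noted in the thread, the thing to check there is that tserver.walog.max.size multiplied by table.compaction.minor.logs.threshold stays above tserver.memory.maps.max after any increase to the in-memory map size.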
