On Tue, Apr 8, 2014 at 5:35 PM, David Medinets <[email protected]> wrote:
> 20 cores and just one SSD? Is there a standard recommendation for a
> core to SSD ratio?
>
> Other questions:
>
> How are you sharding your data (i.e., what does your row look like)?

we do something kind of like the entity attribute / graph tables example
from the accumulo manual

> Are you pre-splitting the table?

no

> How many tablets are ingesting at the same time?

4

> Are you writing from the map-reduce directly to Accumulo or writing to
> RFiles first?

directly to accumulo

> Are the Accumulo nodes and the Hadoop nodes on the same servers?

yes

> Do you see the server load spike during ingest?
>
> How much memory are you allocating to the tservers?

2GB

> How large are the entries on average?
> What are the largest entries?
> Does the data skew towards large entries?

entries are small. probably <1k for key/value combined in most instances

> Are you querying at the same time as ingesting?

no

> On Tue, Apr 8, 2014 at 5:35 PM, Adam Fuchs <[email protected]> wrote:
>
>> Mike,
>>
>> What version of Accumulo are you using, how many tablets do you have,
>> and how many threads are you using for minor and major compaction
>> pools? Also, how big are the keys and values that you are using?
>>
>> Here are a few settings that may help you:
>> 1. WAL replication factor (tserver.wal.replication). This defaults to
>> 3 replicas (the HDFS default), but if you set it to 2 it will give you
>> a performance boost without a huge hit to reliability.
>> 2. Ingest buffer size (tserver.memory.maps.max), also known as the
>> in-memory map size. Increasing this generally improves the efficiency
>> of minor compactions and reduces the number of major compactions that
>> will be required down the line. 4-8 GB is not unreasonable.
>> 3. Make sure your WAL settings are such that the size of a log
>> (tserver.walog.max.size) multiplied by the number of active logs
>> (table.compaction.minor.logs.threshold) is greater than the in-memory
>> map size. You probably want to accomplish this by bumping up the
>> number of active logs.
>> 4. Increase the buffer size on the BatchWriter that the clients use.
>> This can be done with the setBatchWriterOptions method on the
>> AccumuloOutputFormat.
>>
>> Cheers,
>> Adam
>>
>>
>> On Tue, Apr 8, 2014 at 4:47 PM, Mike Hugo <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> We have an ingest process that operates via MapReduce, processing a
>>> large set of XML files and inserting mutations based on that data
>>> into a set of tables.
>>>
>>> On a 5 node cluster (each node has 64GB RAM, 20 cores, and ~600GB
>>> SSD) I get 400k inserts per second with 20 mapper tasks running
>>> concurrently. Increasing the number of concurrent mapper tasks to 40
>>> doesn't have any effect (besides causing a little more backup in
>>> compactions).
>>>
>>> I've increased the table.compaction.major.ratio and increased the
>>> number of concurrent allowed compactions for both minor and major
>>> compactions, but each of those only had a negligible impact on ingest
>>> rates.
>>>
>>> Any advice on other settings I can tweak to get things to move more
>>> quickly? Or is 400k/second a reasonable ingest rate? Are we at a
>>> point where we should consider generating RFiles like the bulk ingest
>>> example?
>>>
>>> Thanks in advance for any advice.
>>>
>>> Mike
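Adam - thanks, these are really helpful. To make sure I follow
suggestions 1-3, is this roughly what you mean in the shell? The values
are guesses for our hardware, the table name "entities" is just a
stand-in for ours, and none of this is tested yet:

    config -s tserver.wal.replication=2
    config -s tserver.memory.maps.max=4G
    config -s tserver.walog.max.size=1G
    config -t entities -s table.compaction.minor.logs.threshold=6

If I did the math right for #3, 6 active logs x 1G per log = 6G, which
is comfortably bigger than the 4G in-memory map. I assume a map that
large also means keeping the native maps enabled and restarting the
tservers, since we only give them 2GB today?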
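For #4, here's the change I'm planning in our job setup - a sketch
assuming we're on a version with the BatchWriterConfig API (1.5+), with
buffer sizes that are just guesses on my part:

    import java.util.concurrent.TimeUnit;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat;

    // bigger client-side buffer so each mapper ships larger batches to the tservers
    BatchWriterConfig bwConfig = new BatchWriterConfig();
    bwConfig.setMaxMemory(100 * 1024 * 1024);    // 100MB buffer, up from the 50MB default
    bwConfig.setMaxLatency(2, TimeUnit.MINUTES); // flush at least every two minutes
    bwConfig.setMaxWriteThreads(4);              // one more send thread than the default 3
    AccumuloOutputFormat.setBatchWriterOptions(job, bwConfig); // job is our Hadoop Job instance

Does that look sane, or should the buffer be even larger given 20
concurrent mappers?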
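And in case we do end up going the bulk route, my understanding is that
it would look roughly like this (paths and table name are placeholders,
and the reducers would have to emit Keys in fully sorted order):

    import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
    import org.apache.hadoop.fs.Path;

    // write RFiles to HDFS instead of sending live mutations
    job.setOutputFormatClass(AccumuloFileOutputFormat.class);
    AccumuloFileOutputFormat.setOutputPath(job, new Path("/tmp/bulk/files"));

    // after the job finishes, hand the files to Accumulo
    // (connector is an Accumulo Connector; I believe the failures dir must exist and be empty)
    connector.tableOperations().importDirectory("entities", "/tmp/bulk/files", "/tmp/bulk/failures", false);

But I'd rather tune the live ingest first if we can.

Thanks again,

Mike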
