Regarding the referenced paper: pre-splitting the tables, using an optimized ZooKeeper deployment, and increasing concurrent minor/major compactions are good things. I'm not sure that we want to recommend turning off the write-ahead logs and replication for production deployments.
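For anyone who wants to try the pre-splitting suggestion, a minimal sketch against the Accumulo 1.8 Java API is below. The instance name, ZooKeeper quorum, credentials, table name, and split points are all placeholders; in practice the split points should come from your own row-key distribution.

import java.util.TreeSet;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.hadoop.io.Text;

public class PreSplitSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details: instance name, ZooKeeper quorum, user, password.
    Connector conn = new ZooKeeperInstance("accumulo", "zk1:2181,zk2:2181,zk3:2181")
        .getConnector("root", new PasswordToken("secret"));

    // Generate evenly spaced split points so each tablet server hosts
    // tens of tablets instead of just a few; "ingest_test" is a placeholder table.
    TreeSet<Text> splits = new TreeSet<Text>();
    for (int i = 1; i < 200; i++) {
      splits.add(new Text(String.format("row_%03d", i)));
    }
    conn.tableOperations().addSplits("ingest_test", splits);
  }
}

The same thing can be done interactively with the addsplits command in the Accumulo shell.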
-----Original Message-----
From: Jeremy Kepner [mailto:[email protected]]
Sent: Thursday, July 13, 2017 10:05 AM
To: [email protected]
Subject: Re: maximize usage of cluster resources during ingestion

https://arxiv.org/abs/1406.4923 contains a number of tricks for maximizing ingest performance.

On Thu, Jul 13, 2017 at 08:13:40AM -0400, Jonathan Wonders wrote:
> Keep in mind that Accumulo puts a much different kind of load on HDFS than the DFSIO benchmark does. It might be more appropriate to use a tool like dstat to monitor HDD utilization and queue depth. HDD throughput benchmarks usually involve high queue depths, since disks are much more effective when they can pipeline and batch updates. Accumulo's WAL workload will typically call hflush or hsync periodically, which interrupts the IO pipeline much like memory barriers interrupt CPU pipelining, only more severely. This is necessary to provide durability guarantees, but it definitely comes at a cost to throughput. Any database with these durability guarantees will suffer similarly to an extent. For Accumulo it is probably worse than for non-distributed databases, because the flush or sync must happen at each replica before the mutation is added to the in-memory map.
>
> I think one of the reasons the recommendation was made to add more tablet servers is that each tablet server only writes to one WAL at a time and each block lives on N disks, based on the replication factor. If you have a replication factor of 3, there will be 10 x 3 = 30 blocks being appended to at any given time (excluding compactions). Since you have 120 disks, not all of them will be participating in write-ahead logging, so you should not count the IO capacity of those extra disks towards expected ingest throughput. 10 tablet servers per node is probably too many, because there would likely be a lot of contention flushing/syncing WALs. I'm not sure how smart HDFS is about how it distributes the WAL load. You might see more benefit with 2-4 tservers per node. This would most likely require more batch writer threads in the client as well.
>
> I'm not too surprised that snappy did not help, because the WALs are not compressed and are likely a bigger bottleneck than compaction, since you have many disks not participating in WAL.
>
> On Wed, Jul 12, 2017 at 11:16 AM, Josh Elser <[email protected]> wrote:
> > You probably want to split the table further than just 4 tablets per tablet server. Try tens of tablets per server.
> >
> > Also, merging in the content from (who I assume is) your coworker on this Stack Overflow post [1], I don't believe the suggestion [2] to verify the WAL max size, minc threshold, and native maps size has been brought up yet.
> >
> > Also, did you look at the JVM GC logs for the TabletServers, as was previously suggested to you?
> >
> > [1] https://stackoverflow.com/questions/44928354/accumulo-tablet-server-doesnt-utilize-all-available-resources-on-host-machine/
> > [2] https://accumulo.apache.org/1.8/accumulo_user_manual.html#_native_maps_configuration
> >
> > On 7/12/17 10:12 AM, Massimilian Mattetti wrote:
> >> Hi all,
> >>
> >> I ran a few experiments over the last few days trying to identify the bottleneck in the ingestion process.
> >> - Running 10 tservers per node instead of only one gave me a negligible performance improvement of about 15%.
> >> - Running the ingestor processes from the two masters gives the same performance as running one ingestor process on each tablet server (10 ingestors).
> >> - Neither the network limit (10 Gb network) nor the disk throughput limit has been reached (1 GB/s per node was reached while running the TestDFSIO benchmark on HDFS).
> >> - CPU is always around 20% on each tserver.
> >> - Changing compression from GZ to snappy did not provide any benefit.
> >> - Increasing tserver.total.mutation.queue.max to 200MB actually decreased the performance.
> >>
> >> I am going to run some ingestion experiments with Kudu over the next few days, but any other suggestion on how to improve the performance on Accumulo is very welcome.
> >> Thanks.
> >>
> >> Best Regards,
> >> Massimiliano
> >>
> >>
> >> From: Jonathan Wonders <[email protected]>
> >> To: [email protected], Dave Marion <[email protected]>
> >> Date: 07/07/2017 04:02
> >> Subject: Re: maximize usage of cluster resources during ingestion
> >> ------------------------------------------------------------------------
> >>
> >> I've personally never seen full CPU utilization during pure ingest. Typically the bottleneck has been I/O related. The majority of steady-state CPU utilization under a heavy ingest load is probably due to compression, unless you have custom constraints running. This can depend on the compression algorithm you have selected. There is probably a measurable contribution from inserting into the in-memory map. Otherwise, not much computation occurs per mutation during ingest.
> >>
> >> On Thu, Jul 6, 2017 at 8:18 AM, Dave Marion <[email protected]> wrote:
> >> That's a good point. I would also look at increasing tserver.total.mutation.queue.max. Are you seeing hold times? If not, I would keep pushing harder until you do, then move to multiple tablet servers. Do you have any GC logs?
> >>
> >> On July 6, 2017 at 4:47 AM, Cyrille Savelief <[email protected]> wrote:
> >> Are you sure Accumulo is not waiting for your app's data? There might be GC pauses in your ingest code (we have already experienced that).
> >>
> >> On Thu, Jul 6, 2017 at 10:32 AM, Massimilian Mattetti <[email protected]> wrote:
> >> Thank you all for the suggestions.
> >>
> >> About the native memory map, I checked the logs on each tablet server and it was loaded correctly (tserver.memory.maps.native.enabled was set to true, of course), so GC pauses should not be the problem after all. I managed to get a much better ingestion graph by reducing the native map size to 2GB and increasing the number of Batch Writer threads from the default of 3 (which was really bad for my configuration) to 10 (I think it does not make sense to have more threads than tablet servers, am I right?).
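As an illustration of the batch writer settings being discussed here (write threads, buffer size, latency), below is a minimal sketch against the Accumulo 1.8 Java API. The connection details and table name are placeholders, and the 10-thread / 100MB / 2-minute values are examples to adapt, not recommendations.

import java.util.concurrent.TimeUnit;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;

public class BatchWriterSketch {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("accumulo", "zk1:2181,zk2:2181,zk3:2181")
        .getConnector("root", new PasswordToken("secret"));

    BatchWriterConfig cfg = new BatchWriterConfig()
        .setMaxWriteThreads(10)              // threads used to send mutations to tservers
        .setMaxMemory(100L * 1024 * 1024)    // client-side buffer before batches are sent
        .setMaxLatency(2, TimeUnit.MINUTES); // max time a mutation may sit in the buffer

    BatchWriter writer = conn.createBatchWriter("ingest_test", cfg);
    Mutation m = new Mutation("row_001");
    m.put("cf", "cq", new Value("value".getBytes()));
    writer.addMutation(m);
    writer.close(); // flushes any buffered mutations
  }
}

A larger client-side buffer and latency let the batch writer build bigger batches per tablet server, at the cost of more client memory and a longer window of unflushed data.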
> >> The configuration that I used for the table is:
> >> "table.file.replication": "2",
> >> "table.compaction.minor.logs.threshold": "3",
> >> "table.durability": "flush",
> >> "table.split.threshold": "1G"
> >>
> >> while for the tablet servers it is:
> >> "tserver.wal.blocksize": "1G",
> >> "tserver.walog.max.size": "2G",
> >> "tserver.memory.maps.max": "2G",
> >> "tserver.compaction.minor.concurrent.max": "50",
> >> "tserver.compaction.major.concurrent.max": "20",
> >> "tserver.wal.replication": "2",
> >> "tserver.compaction.major.thread.files.open.max": "15"
> >>
> >> The new graph: [inline image omitted]
> >>
> >> I still have the problem of a CPU usage that is less than 20%, so I am thinking of running multiple tablet servers per node (5 or 10) in order to maximize the CPU usage. Besides that, I do not have any other idea on how to stress those servers with ingestion. Any suggestions are very welcome. Meanwhile, thank you all again for your help.
> >>
> >> Best Regards,
> >> Massimiliano
> >>
> >>
> >> From: Jonathan Wonders <[email protected]>
> >> To: [email protected]
> >> Date: 06/07/2017 04:01
> >> Subject: Re: maximize usage of cluster resources during ingestion
> >> ------------------------------------------------------------------------
> >>
> >> Hi Massimilian,
> >>
> >> Are you seeing held commits during the ingest pauses? Just based on having looked at many similar graphs in the past, this might be one of the major culprits. A tablet server has a memory region with a bounded size (tserver.memory.maps.max) where it buffers data that has not yet been written to RFiles (through the process of minor compaction). The region is segmented by tablet, and each tablet can have a buffer that is undergoing ingest as well as a buffer that is undergoing minor compaction. A memory manager decides when to initiate minor compactions for the tablet buffers; the default implementation tries to keep the memory region 80-90% full while preferring to compact the largest tablet buffers. Creating larger RFiles during minor compaction should lead to fewer major compactions.
> >>
> >> During a minor compaction, the tablet buffer still "consumes" memory within the in-memory map, and high ingest rates can exhaust the remaining capacity. The default memory manager uses an adaptive strategy to predict the expected memory usage and makes compaction decisions that should maintain some free memory. Batch writers can be bursty and a bit unpredictable, which could throw off these estimates. Also, depending on the ingest profile, sometimes an in-memory tablet buffer will consume a large percentage of the total buffer. This leads to long minor compactions when the buffer size is large, which can give ingest enough time to exhaust the buffer before that memory can be reclaimed. When a tablet server has to block ingest, it can affect client ingest rates to other tablet servers due to the way that batch writers work. This can lead to other tablet servers underestimating future ingest rates, which can further exacerbate the problem.
> >>
> >> There are some configuration changes that could reduce the severity of held commits, although they might reduce peak ingest rates. Reducing the in-memory map size can reduce the maximum pause time due to held commits.
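To make the kind of property change described above concrete, here is a sketch of applying it through the Accumulo 1.8 Java API. The values shown are examples only, the connection details and table name are placeholders, and note that some per-tserver settings, such as the in-memory map size, generally require a tablet server restart before they take effect.

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class IngestTuningSketch {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("accumulo", "zk1:2181,zk2:2181,zk3:2181")
        .getConnector("root", new PasswordToken("secret"));

    // System-wide (ZooKeeper-backed) tserver properties; the values are examples only.
    // A smaller in-memory map bounds how large (and how long) minor compactions get.
    conn.instanceOperations().setProperty("tserver.memory.maps.max", "2G");
    conn.instanceOperations().setProperty("tserver.total.mutation.queue.max", "50M");

    // Per-table property: a lower split threshold produces more tablets per server,
    // so no single tablet buffer dominates the in-memory map.
    conn.tableOperations().setProperty("ingest_test", "table.split.threshold", "1G");
  }
}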
> >> Adding additional tablets should help avoid the problem of a single tablet buffer consuming a large percentage of the memory region. It might be better to aim for ~20 tablets per server if your problem allows for it. It is also possible to replace the memory manager with a custom one. I've tried this in the past and have seen stability improvements from making the memory thresholds less aggressive (50-75% full). This did reduce the peak ingest rate in some cases, but that was a reasonable tradeoff.
> >>
> >> Based on your current configuration, if a tablet server is serving 4 tablets and has a 32GB buffer, your first minor compactions will be at least 8GB, and they will probably grow larger over time until the tablets naturally split. Consider how long it would take to write an RFile of that size compared to your peak ingest rate. As others have suggested, make sure to use the native maps. Based on your current JVM heap size, using the Java in-memory map would probably lead to OOME or very bad GC performance.
> >>
> >> Accumulo can trace minor compaction durations, so you can get a feel for max pause times or measure the effect of configuration changes.
> >>
> >> Cheers,
> >> --Jonathan
> >>
> >> On Wed, Jul 5, 2017 at 7:16 PM, Dave Marion <[email protected]> wrote:
> >> Based on what Cyrille said, I would look at garbage collection; specifically, I would look at how much of your newly allocated objects spill into the old generation before they are flushed to disk. Additionally, I would turn off the debug log, or log to SSDs if you have them. Another thought, seeing that you have 256GB RAM per node, is to run multiple tablet servers per node. Do you have 10 threads on your Batch Writers? What about the Batch Writer latency? Is it too low, such that you are not filling the buffer?
> >>
> >> From: Massimilian Mattetti [mailto:[email protected]]
> >> Sent: Wednesday, July 05, 2017 8:37 AM
> >> To: [email protected]
> >> Subject: maximize usage of cluster resources during ingestion
> >>
> >> Hi all,
> >>
> >> I have an Accumulo 1.8.1 cluster made of 12 bare-metal servers. Each server has 256GB of RAM and 2 x 10-core CPUs. 2 machines are used as masters (running the HDFS NameNodes, Accumulo Master, and Monitor). The other 10 machines have 12 disks of 1 TB each (11 used by the HDFS DataNode process) and run the Accumulo TServer processes. All the machines are connected via a 10Gb network, and 3 of them are running ZooKeeper. I have run some heavy ingestion tests on this cluster but have never been able to reach more than 20% CPU usage on each tablet server. I am running an ingestion process (using batch writers) on each data node. The table is pre-split in order to have 4 tablets per tablet server. Monitoring the network, I have seen that data is received/sent by each node at a peak rate of about 120MB/s / 100MB/s, while the aggregated disk write throughput on each tablet server is around 120MB/s.
> >> The table configuration I am playing with is:
> >> "table.file.replication": "2",
> >> "table.compaction.minor.logs.threshold": "10",
> >> "table.durability": "flush",
> >> "table.file.max": "30",
> >> "table.compaction.major.ratio": "9",
> >> "table.split.threshold": "1G"
> >>
> >> while the tablet server configuration is:
> >> "tserver.wal.blocksize": "2G",
> >> "tserver.walog.max.size": "8G",
> >> "tserver.memory.maps.max": "32G",
> >> "tserver.compaction.minor.concurrent.max": "50",
> >> "tserver.compaction.major.concurrent.max": "8",
> >> "tserver.total.mutation.queue.max": "50M",
> >> "tserver.wal.replication": "2",
> >> "tserver.compaction.major.thread.files.open.max": "15"
> >>
> >> The tablet server heap has been set to 32GB.
> >>
> >> From the Monitor UI: [inline graph omitted]
> >>
> >> As you can see, I have a lot of valleys in which the ingestion rate reaches 0. What would be a good procedure for identifying the bottleneck that causes these periods of zero ingestion?
> >> Thanks.
> >>
> >> Best Regards,
> >> Max
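A reasonable first step toward the closing question is simply to confirm what configuration is actually in effect, for example how many tablets the table really has and which table properties apply. Below is a small sketch using the Accumulo 1.8 Java API; the connection details and table name are placeholders.

import java.util.Map;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class IngestDiagnosticsSketch {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("accumulo", "zk1:2181,zk2:2181,zk3:2181")
        .getConnector("root", new PasswordToken("secret"));

    // How many split points (and therefore tablets) does the table actually have?
    int splitCount = conn.tableOperations().listSplits("ingest_test").size();
    System.out.println("splits: " + splitCount + ", tablets: " + (splitCount + 1));

    // Which table properties are in effect (defaults plus overrides)?
    for (Map.Entry<String, String> prop : conn.tableOperations().getProperties("ingest_test")) {
      if (prop.getKey().startsWith("table.split") || prop.getKey().startsWith("table.durability")) {
        System.out.println(prop.getKey() + " = " + prop.getValue());
      }
    }
  }
}

From there, correlating hold times, minor compaction durations, and tablet server GC pauses, as suggested earlier in the thread, helps narrow down which resource is stalling ingest.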
