https://arxiv.org/abs/1406.4923 contains a number of tricks for maximizing ingest performance.
On Thu, Jul 13, 2017 at 08:13:40AM -0400, Jonathan Wonders wrote:
> Keep in mind that Accumulo puts a much different kind of load on HDFS than
> the DFSIO benchmark. It might be more appropriate to use a tool like dstat
> to monitor HDD utilization and queue depth. HDD throughput benchmarks
> usually involve high queue depths, as disks are much more effective
> when they can pipeline and batch updates. Accumulo's WAL workload will
> typically call hflush or hsync periodically, which interrupts the IO
> pipeline much like memory barriers can interrupt CPU pipelining, except more
> severely. This is necessary to provide durability guarantees, but definitely
> comes at a cost to throughput. Any database that offers these durability
> guarantees will suffer similarly to an extent. For Accumulo, it is
> probably worse than for non-distributed databases because the flush or sync
> must happen at each replica prior to the mutation being added into the
> in-memory map.
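To make the hflush/hsync point above concrete, here is a minimal sketch
against the plain HDFS client API (the path, batch size, and loop count are
hypothetical; Accumulo's actual WAL code is considerably more involved):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WalFlushSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            try (FSDataOutputStream out = fs.create(new Path("/tmp/wal-sketch"))) {
                byte[] batch = new byte[64 * 1024];
                long start = System.nanoTime();
                for (int i = 0; i < 100; i++) {
                    out.write(batch);
                    // hflush() stalls until every replica in the pipeline has
                    // received the data; hsync() additionally asks each DataNode
                    // to fsync to disk. Either call interrupts the write pipeline,
                    // which is the throughput cost described above.
                    out.hflush();
                }
                System.out.printf("100 flushed batches in %d ms%n",
                    (System.nanoTime() - start) / 1_000_000);
            }
        }
    }

Because each flush round-trips to all replicas, per-flush latency rather than
raw disk bandwidth tends to bound WAL throughput.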
> I think one of the reasons the recommendation was made to add more tablet
> servers is because each tablet server only writes to one WAL at a time, and
> each block will live on N disks based on the replication factor. If you have a
> replication factor of 3, there will be 10x3 blocks being appended to at any
> given time (excluding compactions). Since you have 120 disks, not all will
> be participating in write-ahead logging, so you should not count the IO
> capacity of these extra disks towards expected ingest throughput. 10
> tablet servers per node is probably too many because there would likely be
> a lot of contention flushing/syncing WALs. I'm not sure how smart HDFS is
> about how it distributes the WAL load. You might see more benefit with 2-4
> tservers per node. This would most likely require more batch writer
> threads in the client as well.
>
> I'm not too surprised that snappy did not help, because the WALs are not
> compressed and are likely a bigger bottleneck than compaction since you
> have many disks not participating in WAL.
>
> On Wed, Jul 12, 2017 at 11:16 AM, Josh Elser <[email protected]> wrote:
> > You probably want to split the table further than just 4 tablets per
> > tablet server. Try 10's of tablets per server.
> >
> > Also, merging the content from (who I assume is) your coworker on this
> > stackoverflow post[1], I don't believe the suggestion[2] to verify WAL max
> > size, minc threshold, and native maps size was brought up yet.
> >
> > Also, did you look at the JVM GC logs for the TabletServers like was
> > previously suggested to you?
> >
> > [1] https://stackoverflow.com/questions/44928354/accumulo-tablet-server-doesnt-utilize-all-available-resources-on-host-machine/
> > [2] https://accumulo.apache.org/1.8/accumulo_user_manual.html#_native_maps_configuration
> >
> > On 7/12/17 10:12 AM, Massimilian Mattetti wrote:
> >> Hi all,
> >>
> >> I ran a few experiments over the last few days trying to identify the
> >> bottleneck in the ingestion process.
> >> - Running 10 tservers per node instead of only one gave me a negligible
> >> performance improvement of about 15%.
> >> - Running the ingestor processes from the two masters gives the same
> >> performance as running one ingestor process on each tablet server (10
> >> ingestors).
> >> - Neither the network limit (10 Gb network) nor the disk throughput limit
> >> has been reached (1GB/s per node was reached while running the TestDFSIO
> >> benchmark on HDFS).
> >> - CPU is always around 20% on each tserver.
> >> - Changing compression from GZ to snappy did not provide any benefit.
> >> - Increasing tserver.total.mutation.queue.max to 200MB actually
> >> decreased the performance.
> >> I am going to run some ingestion experiments with Kudu over the next few
> >> days, but any other suggestions on how to improve the performance on
> >> Accumulo are very welcome.
> >> Thanks.
> >>
> >> Best Regards,
> >> Massimiliano
> >>
> >> From: Jonathan Wonders <[email protected]>
> >> To: [email protected], Dave Marion <[email protected]>
> >> Date: 07/07/2017 04:02
> >> Subject: Re: maximize usage of cluster resources during ingestion
> >> ------------------------------------------------------------------------
> >>
> >> I've personally never seen full CPU utilization during pure ingest.
> >> Typically the bottleneck has been I/O related. The majority of steady-state
> >> CPU utilization under a heavy ingest load is probably due to compression,
> >> unless you have custom constraints running. This can depend on the
> >> compression algorithm you have selected. There is probably a measurable
> >> contribution from inserting into the in-memory map. Otherwise, not much
> >> computation occurs during ingest per mutation.
> >>
> >> On Thu, Jul 6, 2017 at 8:18 AM, Dave Marion <[email protected]> wrote:
> >>
> >> That's a good point. I would also look at increasing
> >> tserver.total.mutation.queue.max. Are you seeing hold times? If not, I
> >> would keep pushing harder until you do, then move to multiple tablet
> >> servers. Do you have any GC logs?
> >>
> >> On July 6, 2017 at 4:47 AM Cyrille Savelief <[email protected]> wrote:
> >>
> >> Are you sure Accumulo is not waiting for your app's data? There might be
> >> GC pauses in your ingest code (we have already experienced that).
> >>
> >> On Thu, Jul 6, 2017 at 10:32 AM, Massimilian Mattetti <[email protected]> wrote:
> >>
> >> Thank you all for the suggestions.
> >>
> >> About the native memory map, I checked the logs on each tablet server and
> >> it was loaded correctly (of course tserver.memory.maps.native.enabled
> >> was set to true), so GC pauses should not be the problem after all. I
> >> managed to get a much better ingestion graph by reducing the native map
> >> size to *2GB* and increasing the Batch Writer thread count from the default
> >> (3 was really bad for my configuration) to *10* (I think it does not make
> >> sense to have more threads than tablet servers, am I right?).
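For reference, the batch writer knobs discussed above map onto the client API
roughly like this; a minimal sketch, assuming an existing Connector and a
hypothetical table name:

    import java.util.concurrent.TimeUnit;
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.TableNotFoundException;

    class WriterSetup {
        static BatchWriter makeWriter(Connector conn) throws TableNotFoundException {
            BatchWriterConfig cfg = new BatchWriterConfig()
                .setMaxWriteThreads(10)               // default is 3, as noted above
                .setMaxMemory(64 * 1024 * 1024)       // client-side buffer before sends are forced
                .setMaxLatency(30, TimeUnit.SECONDS); // flush at least this often, even if not full
            return conn.createBatchWriter("mytable", cfg); // "mytable" is a placeholder
        }
    }

The write threads send buffered mutations to tablet servers in parallel, which
is why a larger thread count matters once many tservers are involved.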
> >> The configuration that I used for the table is:
> >>
> >> "table.file.replication": "2",
> >> "table.compaction.minor.logs.threshold": "3",
> >> "table.durability": "flush",
> >> "table.split.threshold": "1G"
> >>
> >> while for the tablet servers it is:
> >>
> >> "tserver.wal.blocksize": "1G",
> >> "tserver.walog.max.size": "2G",
> >> "tserver.memory.maps.max": "2G",
> >> "tserver.compaction.minor.concurrent.max": "50",
> >> "tserver.compaction.major.concurrent.max": "20",
> >> "tserver.wal.replication": "2",
> >> "tserver.compaction.major.thread.files.open.max": "15"
> >>
> >> The new graph: [ingestion graph image not included in the archive]
> >>
> >> I still have the problem of a CPU usage that is less than *20%*. So I am
> >> thinking of running multiple tablet servers per node (like 5 or 10) in
> >> order to maximize the CPU usage. Besides that, I do not have any other
> >> idea on how to stress those servers with ingestion.
> >> Any suggestions are very welcome. Meanwhile, thank you all again for your
> >> help.
> >>
> >> Best Regards,
> >> Massimiliano
> >>
> >> From: Jonathan Wonders <[email protected]>
> >> To: [email protected]
> >> Date: 06/07/2017 04:01
> >> Subject: Re: maximize usage of cluster resources during ingestion
> >> ------------------------------------------------------------------------
> >>
> >> Hi Massimilian,
> >>
> >> Are you seeing held commits during the ingest pauses? Just based on
> >> having looked at many similar graphs in the past, this might be one of the
> >> major culprits. A tablet server has a memory region with a bounded size
> >> (tserver.memory.maps.max) where it buffers data that has not yet been
> >> written to RFiles (through the process of minor compaction). The region is
> >> segmented by tablet, and each tablet can have a buffer that is undergoing
> >> ingest as well as a buffer that is undergoing minor compaction. A memory
> >> manager decides when to initiate minor compactions for the tablet buffers,
> >> and the default implementation tries to keep the memory region 80-90% full
> >> while preferring to compact the largest tablet buffers. Creating larger
> >> RFiles during minor compaction should lead to fewer major compactions.
> >> During a minor compaction, the tablet buffer still "consumes" memory within
> >> the in-memory map, and high ingest rates can exhaust the remaining
> >> capacity. The default memory manager uses an adaptive strategy to predict
> >> the expected memory usage and makes compaction decisions that should
> >> maintain some free memory. Batch writers can be bursty and a bit
> >> unpredictable, which could throw off these estimates. Also, depending on
> >> the ingest profile, sometimes an in-memory tablet buffer will consume a
> >> large percentage of the total buffer. This leads to long minor compactions
> >> when the buffer size is large, which can allow ingest enough time to exhaust
> >> the buffer before that memory can be reclaimed. When a tablet server has to
> >> block ingest, it can affect client ingest rates to other tablet servers due
> >> to the way that batch writers work. This can lead to other tablet servers
> >> underestimating future ingest rates, which can further exacerbate the
> >> problem.
> >>
> >> There are some configuration changes that could reduce the severity of
> >> held commits, although they might reduce peak ingest rates. Reducing the
> >> in-memory map size can reduce the maximum pause time due to held commits.
> >> Adding additional tablets should help avoid the problem of a single tablet
> >> buffer consuming a large percentage of the memory region. It might be
> >> better to aim for ~20 tablets per server if your problem allows for it. It
> >> is also possible to replace the memory manager with a custom one. I've
> >> tried this in the past and have seen stability improvements by making the
> >> memory thresholds less aggressive (50-75% full). This did reduce peak
> >> ingest rate in some cases, but that was a reasonable tradeoff.
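If it helps, the mitigations above can be sketched against the admin API as
well (property names are from this thread; whether a given tserver property
takes effect without a tserver restart varies, so treat this as illustrative):

    import org.apache.accumulo.core.client.Connector;

    class TuningSketch {
        static void applyTuning(Connector conn) throws Exception {
            // Shrink the in-memory map system-wide to bound minor-compaction
            // pause times (tserver properties like this one typically require
            // a tserver restart to actually take effect).
            conn.instanceOperations().setProperty("tserver.memory.maps.max", "2G");
            // Lower the split threshold on a (hypothetical) table so it splits
            // into more, smaller tablets, spreading the buffer across them.
            conn.tableOperations().setProperty("mytable", "table.split.threshold", "500M");
        }
    }

The same changes can be made interactively via the Accumulo shell's config
command.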
> >> Based on your current configuration, if a tablet server is serving 4
> >> tablets and has a 32GB buffer, your first minor compactions will be at
> >> least 8GB, and they will probably grow larger over time until the tablets
> >> naturally split. Consider how long it would take to write this RFile
> >> compared to your peak ingest rate. As others have suggested, make sure to
> >> use the native maps. Based on your current JVM heap size, using the Java
> >> in-memory map would probably lead to OOMEs or very bad GC performance.
> >>
> >> Accumulo can trace minor compaction durations, so you can get a feel for
> >> max pause times or measure the effect of configuration changes.
> >>
> >> Cheers,
> >> --Jonathan
> >>
> >> On Wed, Jul 5, 2017 at 7:16 PM, Dave Marion <[email protected]> wrote:
> >>
> >> Based on what Cyrille said, I would look at garbage collection;
> >> specifically, I would look at how much of your newly allocated objects
> >> spill into the old generation before they are flushed to disk. Additionally,
> >> I would turn off the debug log or log to SSDs if you have them. Another
> >> thought, seeing that you have 256GB RAM per node, is to run multiple tablet
> >> servers per node. Do you have 10 threads on your Batch Writers? What about
> >> the Batch Writer latency? Is it too low, such that you are not filling the
> >> buffer?
> >>
> >> *From:* Massimilian Mattetti <[email protected]>
> >> *Sent:* Wednesday, July 05, 2017 8:37 AM
> >> *To:* [email protected]
> >> *Subject:* maximize usage of cluster resources during ingestion
> >>
> >> Hi all,
> >>
> >> I have an Accumulo 1.8.1 cluster made up of 12 bare-metal servers. Each
> >> server has 256GB of RAM and 2 x 10-core CPUs. 2 machines are used as
> >> masters (running the HDFS NameNodes, Accumulo Master, and Monitor). The
> >> other 10 machines have 12 disks of 1 TB each (11 used by the HDFS DataNode
> >> process) and are running Accumulo TServer processes. All the machines are
> >> connected via a 10Gb network, and 3 of them are running ZooKeeper. I have
> >> run some heavy ingestion tests on this cluster but have never been able to
> >> reach more than *20%* CPU usage on each Tablet Server. I am running an
> >> ingestion process (using batch writers) on each data node. The table is
> >> pre-split in order to have 4 tablets per tablet server. Monitoring the
> >> network, I have seen that data is received/sent from each node at a peak
> >> rate of about 120MB/s / 100MB/s, while the aggregated disk write throughput
> >> on each tablet server is around 120MB/s.
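As an aside, the pre-splitting described here is usually done through
addSplits; a minimal sketch, where the split points are purely illustrative
and depend entirely on the row-key design:

    import java.util.SortedSet;
    import java.util.TreeSet;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.hadoop.io.Text;

    class PreSplitSketch {
        static void preSplit(Connector conn) throws Exception {
            SortedSet<Text> splits = new TreeSet<>();
            // 199 evenly spaced points -> 200 tablets, roughly the ~20 tablets
            // per server suggested elsewhere in this thread for a 10-tserver
            // cluster, assuming uniformly distributed zero-padded numeric keys.
            for (int i = 1; i < 200; i++) {
                splits.add(new Text(String.format("%03d", i)));
            }
            conn.tableOperations().addSplits("mytable", splits); // table name is a placeholder
        }
    }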
> >> The table configuration I am playing with is:
> >>
> >> "table.file.replication": "2",
> >> "table.compaction.minor.logs.threshold": "10",
> >> "table.durability": "flush",
> >> "table.file.max": "30",
> >> "table.compaction.major.ratio": "9",
> >> "table.split.threshold": "1G"
> >>
> >> while the tablet server configuration is:
> >>
> >> "tserver.wal.blocksize": "2G",
> >> "tserver.walog.max.size": "8G",
> >> "tserver.memory.maps.max": "32G",
> >> "tserver.compaction.minor.concurrent.max": "50",
> >> "tserver.compaction.major.concurrent.max": "8",
> >> "tserver.total.mutation.queue.max": "50M",
> >> "tserver.wal.replication": "2",
> >> "tserver.compaction.major.thread.files.open.max": "15"
> >>
> >> The tablet server heap has been set to 32GB.
> >>
> >> From the Monitor UI: [ingest rate graph image not included in the archive]
> >>
> >> As you can see, I have a lot of valleys in which the ingestion rate
> >> reaches 0.
> >> What would be a good procedure to identify the bottleneck that causes
> >> the periods of 0 ingestion rate?
> >> Thanks.
> >>
> >> Best Regards,
> >> Max
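For context, the ingest path being tuned throughout this thread looks roughly
like the following minimal sketch; the instance name, ZooKeeper hosts,
credentials, table name, and key layout are all placeholders:

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Mutation;

    public class IngestSketch {
        public static void main(String[] args) throws Exception {
            Connector conn = new ZooKeeperInstance("accumulo", "zk1:2181,zk2:2181,zk3:2181")
                .getConnector("ingest_user", new PasswordToken("secret"));
            BatchWriter bw = conn.createBatchWriter("mytable",
                new BatchWriterConfig().setMaxWriteThreads(10));
            for (long i = 0; i < 1_000_000; i++) {
                Mutation m = new Mutation(String.format("row%09d", i));
                m.put("cf", "cq", "value-" + i);
                bw.addMutation(m); // buffered; background threads send batches per tserver
            }
            bw.close(); // flushes any remaining buffered mutations
        }
    }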
