Regarding the referenced paper: pre-splitting the tables, using an optimized ZooKeeper deployment, and increasing concurrent minor/major compactions are good things. I'm not sure that we want to recommend turning off the write-ahead logs and replication for production deployments.
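For anyone who wants to try the pre-splitting suggestion, a minimal sketch against the Accumulo 1.8 Java API is below. The instance name, ZooKeeper quorum, credentials, table name, and split points are all placeholders; in practice the split points should come from your own row-key distribution.

import java.util.TreeSet;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.hadoop.io.Text;

public class PreSplitSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details: instance name, ZooKeeper quorum, user, password.
    Connector conn = new ZooKeeperInstance("accumulo", "zk1:2181,zk2:2181,zk3:2181")
        .getConnector("root", new PasswordToken("secret"));

    // Generate evenly spaced split points so each tablet server hosts
    // tens of tablets instead of just a few; "ingest_test" is a placeholder table.
    TreeSet<Text> splits = new TreeSet<Text>();
    for (int i = 1; i < 200; i++) {
      splits.add(new Text(String.format("row_%03d", i)));
    }
    conn.tableOperations().addSplits("ingest_test", splits);
  }
}

The same thing can be done interactively with the addsplits command in the Accumulo shell.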
-----Original Message-----
From: Jeremy Kepner [mailto:[email protected]]
Sent: Thursday, July 13, 2017 10:05 AM
To: [email protected]
Subject: Re: maximize usage of cluster resources during ingestion

https://arxiv.org/abs/1406.4923 contains a number of tricks for maximizing ingest performance.

On Thu, Jul 13, 2017 at 08:13:40AM -0400, Jonathan Wonders wrote:
> Keep in mind that Accumulo puts a much different kind of load on HDFS than the DFSIO benchmark does. It might be more appropriate to use a tool like dstat to monitor HDD utilization and queue depth. HDD throughput benchmarks usually involve high queue depths, since disks are much more effective when they can pipeline and batch updates. Accumulo's WAL workload will typically call hflush or hsync periodically, which interrupts the IO pipeline much like memory barriers interrupt CPU pipelining, only more severely. This is necessary to provide durability guarantees, but it definitely comes at a cost to throughput. Any database with these durability guarantees will suffer similarly to an extent. For Accumulo it is probably worse than for non-distributed databases, because the flush or sync must happen at each replica before the mutation is added to the in-memory map.
>
> I think one of the reasons the recommendation was made to add more tablet servers is that each tablet server only writes to one WAL at a time and each block lives on N disks, based on the replication factor. If you have a replication factor of 3, there will be 10 x 3 = 30 blocks being appended to at any given time (excluding compactions). Since you have 120 disks, not all of them will be participating in write-ahead logging, so you should not count the IO capacity of those extra disks towards expected ingest throughput. 10 tablet servers per node is probably too many, because there would likely be a lot of contention flushing/syncing WALs. I'm not sure how smart HDFS is about how it distributes the WAL load. You might see more benefit with 2-4 tservers per node. This would most likely require more batch writer threads in the client as well.
>
> I'm not too surprised that snappy did not help, because the WALs are not compressed and are likely a bigger bottleneck than compaction, since you have many disks not participating in WAL.
>
> On Wed, Jul 12, 2017 at 11:16 AM, Josh Elser <[email protected]> wrote:
> > You probably want to split the table further than just 4 tablets per tablet server. Try tens of tablets per server.
> >
> > Also, merging in the content from (who I assume is) your coworker on this Stack Overflow post [1], I don't believe the suggestion [2] to verify the WAL max size, minc threshold, and native maps size has been brought up yet.
> >
> > Also, did you look at the JVM GC logs for the TabletServers, as was previously suggested to you?
> >
> > [1] https://stackoverflow.com/questions/44928354/accumulo-tablet-server-doesnt-utilize-all-available-resources-on-host-machine/
> > [2] https://accumulo.apache.org/1.8/accumulo_user_manual.html#_native_maps_configuration
> >
> > On 7/12/17 10:12 AM, Massimilian Mattetti wrote:
> >> Hi all,
> >>
> >> I ran a few experiments over the last few days trying to identify the bottleneck in the ingestion process.
> >> - Running 10 tservers per node instead of only one gave me a negligible performance improvement of about 15%.
> >> - Running the ingestor processes from the two masters gives the same performance as running one ingestor process on each tablet server (10 ingestors).
> >> - Neither the network limit (10 Gb network) nor the disk throughput limit has been reached (1 GB/s per node was reached while running the TestDFSIO benchmark on HDFS).
> >> - CPU is always around 20% on each tserver.
> >> - Changing compression from GZ to snappy did not provide any benefit.
> >> - Increasing tserver.total.mutation.queue.max to 200MB actually decreased the performance.
> >>
> >> I am going to run some ingestion experiments with Kudu over the next few days, but any other suggestion on how to improve the performance on Accumulo is very welcome.
> >> Thanks.
> >>
> >> Best Regards,
> >> Massimiliano
> >>
> >>
> >> From: Jonathan Wonders <[email protected]>
> >> To: [email protected], Dave Marion <[email protected]>
> >> Date: 07/07/2017 04:02
> >> Subject: Re: maximize usage of cluster resources during ingestion
> >> ------------------------------------------------------------------------
> >>
> >> I've personally never seen full CPU utilization during pure ingest. Typically the bottleneck has been I/O related. The majority of steady-state CPU utilization under a heavy ingest load is probably due to compression, unless you have custom constraints running. This can depend on the compression algorithm you have selected. There is probably a measurable contribution from inserting into the in-memory map. Otherwise, not much computation occurs per mutation during ingest.
> >>
> >> On Thu, Jul 6, 2017 at 8:18 AM, Dave Marion <[email protected]> wrote:
> >> That's a good point. I would also look at increasing tserver.total.mutation.queue.max. Are you seeing hold times? If not, I would keep pushing harder until you do, then move to multiple tablet servers. Do you have any GC logs?
> >>
> >> On July 6, 2017 at 4:47 AM, Cyrille Savelief <[email protected]> wrote:
> >> Are you sure Accumulo is not waiting for your app's data? There might be GC pauses in your ingest code (we have already experienced that).
> >>
> >> On Thu, Jul 6, 2017 at 10:32 AM, Massimilian Mattetti <[email protected]> wrote:
> >> Thank you all for the suggestions.
> >>
> >> About the native memory map, I checked the logs on each tablet server and it was loaded correctly (tserver.memory.maps.native.enabled was set to true, of course), so GC pauses should not be the problem after all. I managed to get a much better ingestion graph by reducing the native map size to 2GB and increasing the number of Batch Writer threads from the default of 3 (which was really bad for my configuration) to 10 (I think it does not make sense to have more threads than tablet servers, am I right?).
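As an illustration of the batch writer settings being discussed here (write threads, buffer size, latency), below is a minimal sketch against the Accumulo 1.8 Java API. The connection details and table name are placeholders, and the 10-thread / 100MB / 2-minute values are examples to adapt, not recommendations.

import java.util.concurrent.TimeUnit;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;

public class BatchWriterSketch {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("accumulo", "zk1:2181,zk2:2181,zk3:2181")
        .getConnector("root", new PasswordToken("secret"));

    BatchWriterConfig cfg = new BatchWriterConfig()
        .setMaxWriteThreads(10)              // threads used to send mutations to tservers
        .setMaxMemory(100L * 1024 * 1024)    // client-side buffer before batches are sent
        .setMaxLatency(2, TimeUnit.MINUTES); // max time a mutation may sit in the buffer

    BatchWriter writer = conn.createBatchWriter("ingest_test", cfg);
    Mutation m = new Mutation("row_001");
    m.put("cf", "cq", new Value("value".getBytes()));
    writer.addMutation(m);
    writer.close(); // flushes any buffered mutations
  }
}

A larger client-side buffer and latency let the batch writer build bigger batches per tablet server, at the cost of more client memory and a longer window of unflushed data.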
> >> The configuration that I used for the table is:
> >> "table.file.replication": "2",
> >> "table.compaction.minor.logs.threshold": "3",
> >> "table.durability": "flush",
> >> "table.split.threshold": "1G"
> >>
> >> while for the tablet servers it is:
> >> "tserver.wal.blocksize": "1G",
> >> "tserver.walog.max.size": "2G",
> >> "tserver.memory.maps.max": "2G",
> >> "tserver.compaction.minor.concurrent.max": "50",
> >> "tserver.compaction.major.concurrent.max": "20",
> >> "tserver.wal.replication": "2",
> >> "tserver.compaction.major.thread.files.open.max": "15"
> >>
> >> The new graph: [inline image omitted]
> >>
> >> I still have the problem of a CPU usage that is less than 20%, so I am thinking of running multiple tablet servers per node (5 or 10) in order to maximize the CPU usage. Besides that, I do not have any other idea on how to stress those servers with ingestion. Any suggestions are very welcome. Meanwhile, thank you all again for your help.
> >>
> >> Best Regards,
> >> Massimiliano
> >>
> >>
> >> From: Jonathan Wonders <[email protected]>
> >> To: [email protected]
> >> Date: 06/07/2017 04:01
> >> Subject: Re: maximize usage of cluster resources during ingestion
> >> ------------------------------------------------------------------------
> >>
> >> Hi Massimilian,
> >>
> >> Are you seeing held commits during the ingest pauses? Just based on having looked at many similar graphs in the past, this might be one of the major culprits. A tablet server has a memory region with a bounded size (tserver.memory.maps.max) where it buffers data that has not yet been written to RFiles (through the process of minor compaction). The region is segmented by tablet, and each tablet can have a buffer that is undergoing ingest as well as a buffer that is undergoing minor compaction. A memory manager decides when to initiate minor compactions for the tablet buffers; the default implementation tries to keep the memory region 80-90% full while preferring to compact the largest tablet buffers. Creating larger RFiles during minor compaction should lead to fewer major compactions.
> >>
> >> During a minor compaction, the tablet buffer still "consumes" memory within the in-memory map, and high ingest rates can exhaust the remaining capacity. The default memory manager uses an adaptive strategy to predict the expected memory usage and makes compaction decisions that should maintain some free memory. Batch writers can be bursty and a bit unpredictable, which could throw off these estimates. Also, depending on the ingest profile, sometimes an in-memory tablet buffer will consume a large percentage of the total buffer. This leads to long minor compactions when the buffer size is large, which can give ingest enough time to exhaust the buffer before that memory can be reclaimed. When a tablet server has to block ingest, it can affect client ingest rates to other tablet servers due to the way that batch writers work. This can lead to other tablet servers underestimating future ingest rates, which can further exacerbate the problem.
> >>
> >> There are some configuration changes that could reduce the severity of held commits, although they might reduce peak ingest rates. Reducing the in-memory map size can reduce the maximum pause time due to held commits.
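To make the kind of property change described above concrete, here is a sketch of applying it through the Accumulo 1.8 Java API. The values shown are examples only, the connection details and table name are placeholders, and note that some per-tserver settings, such as the in-memory map size, generally require a tablet server restart before they take effect.

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class IngestTuningSketch {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("accumulo", "zk1:2181,zk2:2181,zk3:2181")
        .getConnector("root", new PasswordToken("secret"));

    // System-wide (ZooKeeper-backed) tserver properties; the values are examples only.
    // A smaller in-memory map bounds how large (and how long) minor compactions get.
    conn.instanceOperations().setProperty("tserver.memory.maps.max", "2G");
    conn.instanceOperations().setProperty("tserver.total.mutation.queue.max", "50M");

    // Per-table property: a lower split threshold produces more tablets per server,
    // so no single tablet buffer dominates the in-memory map.
    conn.tableOperations().setProperty("ingest_test", "table.split.threshold", "1G");
  }
}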
> >> Adding additional tablets should help avoid the problem of a single tablet buffer consuming a large percentage of the memory region. It might be better to aim for ~20 tablets per server if your problem allows for it. It is also possible to replace the memory manager with a custom one. I've tried this in the past and have seen stability improvements from making the memory thresholds less aggressive (50-75% full). This did reduce the peak ingest rate in some cases, but that was a reasonable tradeoff.
> >>
> >> Based on your current configuration, if a tablet server is serving 4 tablets and has a 32GB buffer, your first minor compactions will be at least 8GB, and they will probably grow larger over time until the tablets naturally split. Consider how long it would take to write an RFile of that size compared to your peak ingest rate. As others have suggested, make sure to use the native maps. Based on your current JVM heap size, using the Java in-memory map would probably lead to OOME or very bad GC performance.
> >>
> >> Accumulo can trace minor compaction durations, so you can get a feel for max pause times or measure the effect of configuration changes.
> >>
> >> Cheers,
> >> --Jonathan
> >>
> >> On Wed, Jul 5, 2017 at 7:16 PM, Dave Marion <[email protected]> wrote:
> >> Based on what Cyrille said, I would look at garbage collection; specifically, I would look at how much of your newly allocated objects spill into the old generation before they are flushed to disk. Additionally, I would turn off the debug log, or log to SSDs if you have them. Another thought, seeing that you have 256GB RAM per node, is to run multiple tablet servers per node. Do you have 10 threads on your Batch Writers? What about the Batch Writer latency? Is it too low, such that you are not filling the buffer?
> >>
> >> From: Massimilian Mattetti [mailto:[email protected]]
> >> Sent: Wednesday, July 05, 2017 8:37 AM
> >> To: [email protected]
> >> Subject: maximize usage of cluster resources during ingestion
> >>
> >> Hi all,
> >>
> >> I have an Accumulo 1.8.1 cluster made of 12 bare-metal servers. Each server has 256GB of RAM and 2 x 10-core CPUs. 2 machines are used as masters (running the HDFS NameNodes, Accumulo Master, and Monitor). The other 10 machines have 12 disks of 1 TB each (11 used by the HDFS DataNode process) and run the Accumulo TServer processes. All the machines are connected via a 10Gb network, and 3 of them are running ZooKeeper. I have run some heavy ingestion tests on this cluster but have never been able to reach more than 20% CPU usage on each tablet server. I am running an ingestion process (using batch writers) on each data node. The table is pre-split in order to have 4 tablets per tablet server. Monitoring the network, I have seen that data is received/sent by each node at a peak rate of about 120MB/s / 100MB/s, while the aggregated disk write throughput on each tablet server is around 120MB/s.
> >> The table configuration I am playing with is:
> >> "table.file.replication": "2",
> >> "table.compaction.minor.logs.threshold": "10",
> >> "table.durability": "flush",
> >> "table.file.max": "30",
> >> "table.compaction.major.ratio": "9",
> >> "table.split.threshold": "1G"
> >>
> >> while the tablet server configuration is:
> >> "tserver.wal.blocksize": "2G",
> >> "tserver.walog.max.size": "8G",
> >> "tserver.memory.maps.max": "32G",
> >> "tserver.compaction.minor.concurrent.max": "50",
> >> "tserver.compaction.major.concurrent.max": "8",
> >> "tserver.total.mutation.queue.max": "50M",
> >> "tserver.wal.replication": "2",
> >> "tserver.compaction.major.thread.files.open.max": "15"
> >>
> >> The tablet server heap has been set to 32GB.
> >>
> >> From the Monitor UI: [inline graph omitted]
> >>
> >> As you can see, I have a lot of valleys in which the ingestion rate reaches 0. What would be a good procedure for identifying the bottleneck that causes these periods of zero ingestion?
> >> Thanks.
> >>
> >> Best Regards,
> >> Max
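A reasonable first step toward the closing question is simply to confirm what configuration is actually in effect, for example how many tablets the table really has and which table properties apply. Below is a small sketch using the Accumulo 1.8 Java API; the connection details and table name are placeholders.

import java.util.Map;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class IngestDiagnosticsSketch {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("accumulo", "zk1:2181,zk2:2181,zk3:2181")
        .getConnector("root", new PasswordToken("secret"));

    // How many split points (and therefore tablets) does the table actually have?
    int splitCount = conn.tableOperations().listSplits("ingest_test").size();
    System.out.println("splits: " + splitCount + ", tablets: " + (splitCount + 1));

    // Which table properties are in effect (defaults plus overrides)?
    for (Map.Entry<String, String> prop : conn.tableOperations().getProperties("ingest_test")) {
      if (prop.getKey().startsWith("table.split") || prop.getKey().startsWith("table.durability")) {
        System.out.println(prop.getKey() + " = " + prop.getValue());
      }
    }
  }
}

From there, correlating hold times, minor compaction durations, and tablet server GC pauses, as suggested earlier in the thread, helps narrow down which resource is stalling ingest.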
