Hi all,
I ran a few experiments over the last few days trying to identify the
bottleneck in the ingestion process.
- Running 10 tservers per node instead of only one gave me only a
marginal performance improvement of about 15%.
- Running the ingestor processes from the two masters gives the same
performance as running one ingestor process on each tablet server (10
ingestors).
- Neither the network limit (10 Gb network) nor the disk throughput
limit has been reached (1GB/s per node was reached while running the
TestDFSIO benchmark on HDFS).
- CPU is always around 20% on each tserver.
- Changing compression from gz to Snappy did not provide any benefit.
- Increasing tserver.total.mutation.queue.max to 200MB actually
decreased performance.
I am going to run some ingestion experiments with Kudu over the next few
days, but any other suggestions on how to improve performance on
Accumulo are very welcome.
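For reference, changes of this kind can be applied from the Accumulo
shell, e.g. (the table name is just a placeholder):

    config -t ingest_table -s table.file.compress.type=snappy
    config -s tserver.total.mutation.queue.max=200M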
Thanks.
Best Regards,
Massimiliano
From: Jonathan Wonders <[email protected]>
To: [email protected], Dave Marion <[email protected]>
Date: 07/07/2017 04:02
Subject: Re: maximize usage of cluster resources during ingestion
------------------------------------------------------------------------
I've personally never seen full CPU utilization during pure ingest.
Typically the bottleneck has been I/O related. The majority of
steady-state CPU utilization under a heavy ingest load is probably due
to compression unless you have custom constraints running. This can
depend on the compression algorithm you have selected. There is
probably a measurable contribution from inserting into the in-memory
map. Otherwise, not much computation occurs during ingest per mutation.
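For context, a custom constraint is just per-mutation code that the
tserver runs at ingest time, so it adds CPU in proportion to your write
rate. A purely illustrative sketch (not something from this thread)
would look roughly like:

    import java.util.Collections;
    import java.util.List;
    import org.apache.accumulo.core.constraints.Constraint;
    import org.apache.accumulo.core.data.Mutation;

    // Illustrative constraint: reject mutations whose row id is longer than 1KB.
    public class MaxRowSizeConstraint implements Constraint {
      private static final short ROW_TOO_LONG = 1;

      @Override
      public String getViolationDescription(short violationCode) {
        return violationCode == ROW_TOO_LONG ? "row id longer than 1KB" : "unknown violation";
      }

      @Override
      public List<Short> check(Environment env, Mutation mutation) {
        if (mutation.getRow().length > 1024) {
          return Collections.singletonList(ROW_TOO_LONG);
        }
        return Collections.emptyList(); // no violations
      }
    }

It would be enabled per table (e.g. with the shell's constraint -a
command), and that is where extra per-mutation CPU would come from.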
On Thu, Jul 6, 2017 at 8:18 AM, Dave Marion <[email protected]> wrote:
That's a good point. I would also look at increasing
tserver.total.mutation.queue.max. Are you seeing hold times? If not, I
would keep pushing harder until you do, then move to multiple tablet
servers. Do you have any GC logs?
On July 6, 2017 at 4:47 AM Cyrille Savelief <[email protected]> wrote:
Are you sure Accumulo is not waiting for your app's data? There might be
GC pauses in your ingest code (we have already experienced that).
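A quick way to check is to run the ingest JVMs with GC logging enabled,
for example (Java 8 flags; the jar name and log path are just
placeholders):

    java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
         -Xloggc:/tmp/ingestor-gc.log -jar my-ingestor.jar

Long pauses or a steadily growing old generation in that log would point
at the client side rather than Accumulo.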
On Thu, Jul 6, 2017 at 10:32, Massimilian Mattetti
<[email protected]> wrote:
Thank you all for the suggestions.
Regarding the native memory map, I checked the logs on each tablet
server and it was loaded correctly (tserver.memory.maps.native.enabled
was set to true, of course), so GC pauses should not be the problem. I
managed to get a much better ingestion graph by reducing the native map
size to *2GB* and increasing the number of Batch Writer threads from the
default (3, which was really bad for my configuration) to *10* (I think
it does not make sense to have more threads than tablet servers, am I
right?).
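In code, the thread change is just the BatchWriterConfig setting; a
minimal sketch, assuming an existing Connector (the table name and the
memory/latency values are placeholders, not what I actually tuned):

    import java.util.concurrent.TimeUnit;
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;

    public class IngestSketch {
      static void ingest(Connector connector) throws Exception {
        BatchWriterConfig cfg = new BatchWriterConfig()
            .setMaxWriteThreads(10)                 // raised from the default of 3
            .setMaxMemory(64L * 1024 * 1024)        // 64MB client-side buffer (placeholder)
            .setMaxLatency(30, TimeUnit.SECONDS);   // flush at least every 30s (placeholder)
        BatchWriter writer = connector.createBatchWriter("ingest_table", cfg);
        Mutation m = new Mutation(new Text("row_0001"));
        m.put(new Text("cf"), new Text("cq"), new Value("value".getBytes()));
        writer.addMutation(m);
        writer.close();
      }
    }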
The configuration that I used for the table is:
"table.file.replication": "2",
"table.compaction.minor.logs.threshold": "3",
"table.durability": "flush",
"table.split.threshold": "1G"
while for the tablet servers is:
"tserver.wal.blocksize": "1G",
"tserver.walog.max.size": "2G",
"tserver.memory.maps.max": "2G",
"tserver.compaction.minor.concurrent.max": "50",
"tserver.compaction.major.concurrent.max": "20",
"tserver.wal.replication": "2",
"tserver.compaction.major.thread.files.open.max": "15"
The new graph:
I still have the problem that CPU usage is below *20%*. So I am thinking
of running multiple tablet servers per node (5 or 10) in order to
maximize CPU usage. Beyond that I do not have any other ideas on how to
stress these servers with ingestion.
Any suggestions are very welcome. Meanwhile, thank you all again for
your help.
Best Regards,
Massimiliano
From: Jonathan Wonders <[email protected]>
To: [email protected]
Date: 06/07/2017 04:01
Subject: Re: maximize usage of cluster resources during ingestion
------------------------------------------------------------------------
Hi Massimilian,
Are you seeing held commits during the ingest pauses? Just based on
having looked at many similar graphs in the past, this might be one of
the major culprits.

A tablet server has a memory region with a bounded size
(tserver.memory.maps.max) where it buffers data that has not yet been
written to RFiles (through the process of minor compaction). The region
is segmented by tablet, and each tablet can have a buffer that is
undergoing ingest as well as a buffer that is undergoing minor
compaction. A memory manager decides when to initiate minor compactions
for the tablet buffers; the default implementation tries to keep the
memory region 80-90% full while preferring to compact the largest tablet
buffers. Creating larger RFiles during minor compaction should lead to
fewer major compactions. During a minor compaction, the tablet buffer
still "consumes" memory within the in-memory map, and high ingest rates
can exhaust the remaining capacity. The default memory manager uses an
adaptive strategy to predict the expected memory usage and makes
compaction decisions that should maintain some free memory. Batch
writers can be bursty and a bit unpredictable, which can throw off these
estimates. Also, depending on the ingest profile, sometimes an in-memory
tablet buffer will consume a large percentage of the total buffer. This
leads to long minor compactions when the buffer size is large, which can
give ingest enough time to exhaust the buffer before that memory can be
reclaimed.

When a tablet server has to block ingest, it can affect client ingest
rates to other tablet servers due to the way that batch writers work.
This can lead to other tablet servers underestimating future ingest
rates, which can further exacerbate the problem.
There are some configuration changes that could reduce the severity of
held commits, although they might reduce peak ingest rates. Reducing
the in-memory map size can reduce the maximum pause time due to held
commits. Adding more tablets should help avoid the problem of a single
tablet buffer consuming a large percentage of the memory region; it
might be better to aim for ~20 tablets per server if your problem allows
for it. It is also possible to replace the memory manager with a custom
one. I've tried this in the past and have seen stability improvements
from making the memory thresholds less aggressive (50-75% full). This
did reduce peak ingest rate in some cases, but that was a reasonable
tradeoff.
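Both of those changes can be made from the shell, roughly like this (the
value, table name, and split file are placeholders; the map size change
likely requires a tserver restart to take effect):

    config -s tserver.memory.maps.max=8G
    addsplits -t ingest_table -sf /path/to/splits.txt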
Based on your current configuration, if a tablet server is serving 4
tablets and has a 32GB buffer, your first minor compactions will be at
least 8GB and they will probably grow larger over time until the tablets
naturally split. Consider how long it would take to write this RFile
compared to your peak ingest rate. As others have suggested, make sure
to use the native maps. Based on your current JVM heap size, using the
Java in-memory map would probably lead to OOME or very bad GC performance.
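To put rough numbers on it: writing an 8GB RFile at the ~120MB/s
aggregate disk rate you reported takes on the order of 70 seconds, and
if mutations keep arriving at a similar rate during that window, the
remaining map capacity has to absorb several more GB before the flush
completes; that is exactly the situation where commits end up being
held.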
Accumulo can trace minor compaction durations so you can get a feel for
max pause times or measure the effect of configuration changes.
Cheers,
--Jonathan
On Wed, Jul 5, 2017 at 7:16 PM, Dave Marion <[email protected]> wrote:
Based on what Cyrille said, I would look at garbage collection,
specifically at how much of your newly allocated data spills into the
old generation before it is flushed to disk. Additionally, I would turn
off the debug log, or log to SSDs if you have them. Another thought,
seeing that you have 256GB of RAM per node, is to run multiple tablet
servers per node. Do you have 10 threads on your Batch Writers? What
about the Batch Writer latency? Is it so low that you are not filling
the buffer?
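On the logging side, something along these lines usually does it (file
locations depend on your install, so treat it as a sketch):

    # conf/accumulo-env.sh: add GC logging to the tserver JVMs
    export ACCUMULO_TSERVER_OPTS="${ACCUMULO_TSERVER_OPTS} -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/accumulo/tserver-gc.log"

    # log4j configuration: raise the Accumulo log level from DEBUG to WARN
    log4j.logger.org.apache.accumulo=WARN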
From: Massimilian Mattetti <[email protected]>
Sent: Wednesday, July 05, 2017 8:37 AM
To: [email protected]
Subject: maximize usage of cluster resources during ingestion
Hi all,
I have an Accumulo 1.8.1 cluster made up of 12 bare-metal servers. Each
server has 256GB of RAM and 2 x 10-core CPUs. Two machines are used as
masters (running the HDFS NameNodes, Accumulo Master, and Monitor). The
other 10 machines have 12 disks of 1TB each (11 used by the HDFS
DataNode process) and run the Accumulo TServer processes. All the
machines are connected via a 10Gb network, and 3 of them run ZooKeeper.
I have run some heavy ingestion tests on this cluster but have never
been able to reach more than *20%* CPU usage on any Tablet Server. I am
running an ingestion process (using batch writers) on each data node.
The table is pre-split in order to have 4 tablets per tablet server.
Monitoring the network, I have seen that data is received/sent by each
node with a peak rate of about 120MB/s / 100MB/s, while the aggregated
disk write throughput on each tablet server is around 120MB/s.
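For reference, the pre-splitting can be done through the client API; a
minimal sketch (instance name, ZooKeeper hosts, credentials, and split
points are placeholders; 39 splits give 40 tablets, i.e. 4 per tablet
server across the 10 servers):

    import java.util.SortedSet;
    import java.util.TreeSet;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.hadoop.io.Text;

    public class PreSplit {
      public static void main(String[] args) throws Exception {
        Connector conn = new ZooKeeperInstance("accumulo", "zk1:2181,zk2:2181,zk3:2181")
            .getConnector("ingest_user", new PasswordToken("secret"));
        SortedSet<Text> splits = new TreeSet<>();
        for (int i = 1; i < 40; i++) {
          splits.add(new Text(String.format("row_%02d", i))); // placeholder split points
        }
        conn.tableOperations().addSplits("ingest_table", splits);
      }
    }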
The table configuration I am playing with is:
"table.file.replication": "2",
"table.compaction.minor.logs.threshold": "10",
"table.durability": "flush",
"table.file.max": "30",
"table.compaction.major.ratio": "9",
"table.split.threshold": "1G"
while the tablet server configuration is:
"tserver.wal.blocksize": "2G",
"tserver.walog.max.size": "8G",
"tserver.memory.maps.max": "32G",
"tserver.compaction.minor.concurrent.max": "50",
"tserver.compaction.major.concurrent.max": "8",
"tserver.total.mutation.queue.max": "50M",
"tserver.wal.replication": "2",
"tserver.compaction.major.thread.files.open.max": "15"
The tablet server heap has been set to 32GB.
From the Monitor UI:
As you can see, I have a lot of valleys in which the ingestion rate
drops to 0.
What would be a good procedure to identify the bottleneck that causes
these periods of zero ingestion rate?
Thanks.
Best Regards,
Max