Performance and large numbers of servers

Arthur Naseef Tue, 28 Jun 2022 10:55:24 -0700

Hello - I'm getting started with Ignite and looking seriously at using it
for a specific use-case.


While working on a proof of concept (POC), I have run into a performance
question and am wondering whether a solution built on Ignite Services is a
good fit for the use-case.

In my testing, I am getting the following timings:

   - Startup of 20,000 Ignite services takes 30 seconds
   - Startup of 50,000 Ignite services takes 250 seconds
   - The 2.5x increase from 20,000 to 50,000 services cost more than 8x in
   startup time (growth is clearly superlinear; it looks exponential, though
   two data points cannot confirm that)
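
As a rough check on the timings above, here is a minimal sketch of the arithmetic. It assumes a power-law model (startup time proportional to n^k), which is only an assumption for illustration; with two measurements, a power law and an exponential cannot be told apart. The function name scaling_exponent is hypothetical.

```python
import math

def scaling_exponent(n1, t1, n2, t2):
    """Exponent k such that t2/t1 == (n2/n1)**k, assuming time ~ n**k."""
    return math.log(t2 / t1) / math.log(n2 / n1)

# Measurements from the list above: (service count, startup seconds).
count_ratio = 50_000 / 20_000
time_ratio = 250 / 30
k = scaling_exponent(20_000, 30, 50_000, 250)

print(f"count ratio {count_ratio:.2f}x, time ratio {time_ratio:.2f}x, "
      f"fitted exponent {k:.2f}")
```

Under that assumption the fitted exponent comes out a little above 2, i.e. somewhat worse than quadratic. A third measurement (for example at 100,000 services) would help tell polynomial growth from exponential growth.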

Watching the JVM during this time, I see the following:

   - Heap usage is not significant (do not see signs of GC)
   - CPU usage is only slightly increased - on the order of 20% total
   (system has 12 cores/24 threads)
   - Network utilization is reasonable
   - Futex system call (measured with "strace -r") appears to be taking the
   most time by far.
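
For anyone wanting to reproduce the futex observation, the following is a minimal sketch of one way to total up "strace -r" output per system call. It is not the script used for the measurements above. It assumes the usual "-r" line layout (a relative timestamp followed by the call), and it charges each relative timestamp to the previous call, since "-r" reports the time between the starts of successive calls. The helper name time_by_syscall and the sample lines are made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as "     0.000123 futex(0x7f00, FUTEX_WAIT_PRIVATE, ...) = 0",
# optionally prefixed by "[pid 1234]" as printed when strace follows threads.
LINE_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?\s*(\d+\.\d+)\s+(\w+)\(")

def time_by_syscall(lines):
    """Sum the gap before each line onto the syscall that started the gap."""
    totals = defaultdict(float)
    prev = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip exit markers, signals, "resumed" lines, etc.
        delta, name = float(m.group(1)), m.group(2)
        if prev is not None:
            totals[prev] += delta
        prev = name
    return dict(totals)

# Made-up sample input, only to show the expected shape of the data.
sample = [
    '     0.000000 futex(0x7f00, FUTEX_WAIT_PRIVATE, 0, NULL) = 0',
    '     0.500000 read(3, "", 4096) = 0',
    '     0.250000 futex(0x7f00, FUTEX_WAKE_PRIVATE, 1) = 1',
    '+++ exited with 0 +++',
]
totals = time_by_syscall(sample)
for name, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {seconds:.6f}s")
```

This is only an approximation: when many threads are traced at once their calls interleave, so the gap before a line is not necessarily time spent inside the preceding call.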

The use-case involves the following:

   - Startup of up to hundreds of thousands of services at cluster spin-up
   - Frequent, small adjustments to the services running over time
   - Need to rebalance when a new node joins the cluster, or an old one
   leaves the cluster
   - Once the services are deployed, we do not plan to make cross-cluster
   calls into the services (i.e. we do *not* plan to use Ignite's
   services().serviceProxy() on these)
   - Jobs don't look like a fit because our tasks (1) are "long-running"
   (actually periodically scheduled tasks) and (2) need to be redistributed
   even after they start running

This is starting to get long.  I have more details to share.  Here is the
repo with the code being used to test, and a link to a wiki page with some
of the details:

https://github.com/opennms-forge/distributed-scheduling-poc/

https://github.com/opennms-forge/distributed-scheduling-poc/wiki/Ignite-Startup-Performance


Questions I have in mind:

   - Are services a good fit here?  We expect to reach upwards of 500,000
   services in a cluster with multiple nodes.
   - Any thoughts on tracking down the bottleneck and alleviating it?  (I
   have started taking timing measurements in the Ignite code)

Stopping here - please ask questions and I'll gladly fill in details.  Any
tips are welcome, including ideas for tracking down just where the
bottleneck exists.

Art
