Hello - I'm getting started with Ignite and looking seriously at using it
for a specific use-case.

While working on a proof-of-concept (POC), I have run into a performance
question, and I am wondering whether Ignite Services are a good fit for
the use-case.

In my testing, I am getting the following timings:

   - Startup of 20,000 ignite services takes 30 seconds
   - Startup of 50,000 ignite services takes 250 seconds
   - The 2.5x increase from 20,000 to 50,000 services cost more than 8x in
   startup time (growth is clearly superlinear, possibly exponential)
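
One way to put a number on the growth rate is to assume startup time follows a power law, t ~ n^k, and solve for k from the two measurements above. The power-law model is an assumption made only for this illustration; two data points cannot distinguish polynomial from exponential growth, so a third measurement (say, at 100,000 services) would be needed to tell them apart. The class and method names are hypothetical.

```java
// Estimate the exponent k in t ~ n^k from two (count, seconds) measurements.
final class ScalingExponent {
    /** Returns k such that (t2 / t1) == (n2 / n1)^k. */
    static double exponent(double n1, double t1, double n2, double t2) {
        return Math.log(t2 / t1) / Math.log(n2 / n1);
    }

    public static void main(String[] args) {
        // Figures from the timings above: 20,000 services in 30 s, 50,000 in 250 s.
        double k = exponent(20_000, 30, 50_000, 250);
        System.out.printf("time ratio = %.2f, estimated exponent k = %.2f%n",
            250.0 / 30.0, k);
    }
}
```

With these figures k comes out at about 2.3, i.e. somewhat worse than quadratic under the power-law assumption.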

Watching the JVM during this time, I see the following:

   - Heap usage is not significant (do not see signs of GC)
   - CPU usage is only slightly increased - on the order of 20% total
   (system has 12 cores/24 threads)
   - Network utilization is reasonable
   - Futex system call (measured with "strace -r") appears to be taking the
   most time by far.
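
Low CPU combined with heavy futex time usually points at threads waiting on locks or parked rather than doing work. A minimal sketch for checking this from inside the JVM, using the standard java.lang.management API, is below. The class name ContentionReport and its methods are hypothetical, and nothing here is specific to Ignite.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

final class ContentionReport {
    /** Turns on per-thread blocked/waited time accounting, if the JVM supports it. */
    static boolean enable() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadContentionMonitoringSupported()) return false;
        mx.setThreadContentionMonitoringEnabled(true);
        return true;
    }

    /** Returns up to 'top' lines, one per live thread, largest blocked + waited time first. */
    static List<String> report(int top) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<ThreadInfo> infos = new ArrayList<>();
        for (ThreadInfo ti : mx.dumpAllThreads(false, false)) {
            if (ti != null) infos.add(ti);
        }
        infos.sort((a, b) -> Long.compare(stalledMillis(b), stalledMillis(a)));
        List<String> lines = new ArrayList<>();
        for (ThreadInfo ti : infos.subList(0, Math.min(top, infos.size()))) {
            lines.add(String.format("%-40s %-13s blocked=%dms waited=%dms lock=%s",
                ti.getThreadName(), ti.getThreadState(),
                Math.max(0, ti.getBlockedTime()), Math.max(0, ti.getWaitedTime()),
                ti.getLockName()));
        }
        return lines;
    }

    // Blocked/waited times are -1 when monitoring is disabled; treat that as 0.
    private static long stalledMillis(ThreadInfo ti) {
        return Math.max(0, ti.getBlockedTime()) + Math.max(0, ti.getWaitedTime());
    }

    public static void main(String[] args) throws InterruptedException {
        enable();
        Thread.sleep(200); // stand-in for the service deployment being measured
        report(10).forEach(System.out::println);
    }
}
```

The idea would be to call enable() before starting the deployment and report() once it finishes, or periodically while it runs. Blocked time covers synchronized monitors, while waited time also includes idle pool threads that are simply parked, so the lock name and thread name matter more than the raw totals.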

The use-case involves the following:

   - Startup of up to hundreds of thousands of services at cluster spin-up
   - Frequent, small adjustments to the services running over time
   - Need to rebalance when a new node joins the cluster, or an old one
   leaves the cluster
   - Once the services are deployed, we do not plan to make cross-cluster
   calls into the services (i.e. we do *not* plan to use ignite's
   services().serviceProxy() on these)
   - Jobs don't look like a fit because the units of work (1) are
   "long-running" (actually periodically scheduled tasks) and (2) need to
   be redistributed even after they start running

This is starting to get long.  I have more details to share.  Here is the
repo with the code being used to test, and a link to a wiki page with some
of the details:

https://github.com/opennms-forge/distributed-scheduling-poc/

https://github.com/opennms-forge/distributed-scheduling-poc/wiki/Ignite-Startup-Performance


Questions I have in mind:

   - Are services a good fit here?  We expect to reach upwards of 500,000
   services in a cluster with multiple nodes.
   - Any thoughts on tracking down the bottleneck and alleviating it?  (I
   have started taking timing measurements in the Ignite code)
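
For narrowing down the bottleneck from the outside, one option is to time the deployment in fixed-size batches and check whether later batches take longer than earlier ones. That would indicate a per-service cost that grows with the number already deployed. The sketch below is only a harness: BatchTimer is a hypothetical name, and the action passed in is a placeholder for whatever deploys a single service in the POC.

```java
import java.util.function.IntConsumer;

final class BatchTimer {
    /** Runs action for ids 0..total-1 and returns the elapsed nanoseconds of each batch. */
    static long[] timeBatches(int total, int batchSize, IntConsumer action) {
        int batches = (total + batchSize - 1) / batchSize; // round up
        long[] elapsed = new long[batches];
        for (int b = 0; b < batches; b++) {
            int from = b * batchSize;
            int to = Math.min(total, from + batchSize);
            long start = System.nanoTime();
            for (int id = from; id < to; id++) {
                action.accept(id);
            }
            elapsed[b] = System.nanoTime() - start;
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Placeholder action: replace the lambda body with the real deployment call.
        long[] elapsed = timeBatches(50_000, 5_000, id -> { /* deploy service 'id' */ });
        for (int b = 0; b < elapsed.length; b++) {
            System.out.printf("batch %d: %d ms%n", b, elapsed[b] / 1_000_000);
        }
    }
}
```

If batch times stay flat, the cost is probably in a fixed per-call overhead; if they climb steadily, something in the deployment path is likely scanning or re-processing the already-deployed set.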

Stopping here - please ask questions and I'll gladly fill in details.  Any
tips are welcome, including ideas for tracking down just where the
bottleneck exists.

Art
