Hi, I have a big problem. Ignite is failing catastrophically for me.
This is the scenario: we start a cluster of 15 Ignite server nodes, all initially empty. Then several Kafka feeds are enabled that stream data into 4 independent caches simultaneously (using DataStreamers). Each cache is PARTITIONED and configured with 1 primary and 2 backups. The feeds attempt to load ~0.5M entries into each cache, streamed from a client node on 4 threads.

Almost always, a node will fail during this operation, and this leads to a catastrophic, cascading failure of the entire cluster. But on the failing nodes there is no information whatsoever as to what caused the failure. Nothing. No OOM. No exceptions. The logs simply stop. I have GC logging enabled, and there are no long pauses. So I am baffled.

I have tried increasing memory. I have tried increasing timeouts to ridiculous numbers:

```
COMPUTE_TASK_TIMEOUT=5000
DISCOVERY_ACK_TIMEOUT=30000
DISCOVERY_JOIN_TIMEOUT=120000
DISCOVERY_MAX_ACK_TIMEOUT=37000
DISCOVERY_NETWORK_TIMEOUT=120000
FAILURE_DETECTION_TIMEOUT=120000
IGNITE_LOG_LEVEL=INFO
IGNITE_LONG_OPERATIONS_DUMP_TIMEOUT=200000
IGNITE_QUIET=false
```

But nothing helps. What can I do to get better information out of Ignite? It is basically failing silently. Are there some tuning parameters that I am missing? I would be happy to supply further configuration information; rough sketches of our cache/streamer setup and timeout settings are appended below.

This is with Ignite 2.0.0. We have invested quite a bit of effort to get Ignite running for our application, and this is a show-stopper for us.

NOTE: this does not happen with the smaller feeds that we have in our dev environment.

Thanks,
-- Chris
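For reference, here is a minimal sketch of how one of the four caches and its streaming thread are set up on the client node. The cache name, key/value types, and the synthetic payload loop are placeholders (the real threads push Kafka records into the streamer), but the cache mode, backup count, and DataStreamer usage match what we run.

```
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class FeedLoaderSketch {
    public static void main(String[] args) {
        // Client node that hosts the 4 streaming threads.
        Ignition.setClientMode(true);
        Ignite ignite = Ignition.start();

        // One of the 4 caches: PARTITIONED, 1 primary + 2 backup copies per key.
        CacheConfiguration<Long, byte[]> cacheCfg = new CacheConfiguration<>("feedCache1");
        cacheCfg.setCacheMode(CacheMode.PARTITIONED);
        cacheCfg.setBackups(2);
        ignite.getOrCreateCache(cacheCfg);

        // Each Kafka consumer thread streams its records into its own cache.
        try (IgniteDataStreamer<Long, byte[]> streamer = ignite.dataStreamer("feedCache1")) {
            for (long key = 0; key < 500_000; key++)   // ~0.5M entries per cache
                streamer.addData(key, new byte[256]);  // placeholder payload for a Kafka record
        }
    }
}
```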

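And this is roughly how the timeout values listed above are applied on each server node. The mapping of our variables onto the IgniteConfiguration and TcpDiscoverySpi setters is illustrative only; the remaining entries (COMPUTE_TASK_TIMEOUT, IGNITE_LOG_LEVEL, IGNITE_QUIET, IGNITE_LONG_OPERATIONS_DUMP_TIMEOUT) are set in our launch environment rather than in this code.

```
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

public class TimeoutConfigSketch {
    public static IgniteConfiguration configure() {
        TcpDiscoverySpi disco = new TcpDiscoverySpi();
        disco.setAckTimeout(30_000);       // DISCOVERY_ACK_TIMEOUT
        disco.setMaxAckTimeout(37_000);    // DISCOVERY_MAX_ACK_TIMEOUT
        disco.setJoinTimeout(120_000);     // DISCOVERY_JOIN_TIMEOUT
        disco.setNetworkTimeout(120_000);  // DISCOVERY_NETWORK_TIMEOUT

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(disco);
        cfg.setFailureDetectionTimeout(120_000);  // FAILURE_DETECTION_TIMEOUT
        return cfg;
    }
}
```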