Recently I have been running Spark MLlib KMeans with an Apache Ignite 2.6.0 shared RDD on ten AWS r4.2xlarge workers. It works and runs to completion on 1 billion points (which fit in memory), but fails with 2 billion points (which exceed available memory).
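For context, here is a minimal sketch of what the job does (simplified from my actual code linked below; the cache name, k, and iteration count are illustrative, not the exact values I use):

```scala
import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession

object IgniteKMeansSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ignite-kmeans").getOrCreate()
    val sc = spark.sparkContext

    // Ignite context pointing at the server-side XML configuration
    val ic = new IgniteContext(sc, "ignite/server/example-cache.xml")

    // Shared RDD backed by an Ignite cache of (pointId -> coordinates)
    val sharedRdd: IgniteRDD[Long, Array[Double]] = ic.fromCache("points")

    // Train MLlib KMeans directly on the values of the shared RDD
    val vectors = sharedRdd.map { case (_, coords) => Vectors.dense(coords) }
    val model = KMeans.train(vectors, 10 /* k */, 20 /* maxIterations */)

    model.clusterCenters.foreach(println)
    ic.close(false)
  }
}
```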
My code for loading data into the Ignite shared RDD is here: https://github.com/jiazou-bigdata/SparkBench/blob/master/perf-bench/src/main/scala/edu/rice/bench/KMeansDataGenerator.scala#L64

My code for running Spark MLlib KMeans on the shared RDD is here: https://github.com/jiazou-bigdata/SparkBench/blob/master/perf-bench/src/main/scala/edu/rice/bench/IgniteRDDKMeans.scala

For the 2-billion-point run I enabled swap; the configuration file for the Ignite server is here: https://github.com/jiazou-bigdata/SparkBench/blob/master/ignite/server/example-cache.xml

I have tried loading 2 billion points into memory several times, and it failed every time. One error I hit repeatedly is that, while loading the data into the Ignite shared RDD, one Ignite worker fails for no obvious reason. The screen output matches the one in this post: http://apache-ignite-users.70518.x6.nabble.com/Node-pause-for-no-obvious-reason-td21923.html

The end of the log file looks like this:

[14:17:13,231][INFO][grid-timeout-worker-#23][IgniteKernal] FreeList [name=null, buckets=256, dataPages=12247613, reusePages=0]
[14:17:28,710][WARNING][jvm-pause-detector-worker][] Possible too long JVM pause: 11193 milliseconds.
[14:17:28,834][INFO][tcp-disco-sock-reader-#4][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/172.31.88.4:45550, rmtPort=45550]
[14:17:28,834][INFO][tcp-disco-sock-reader-#9][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/172.31.81.91:42661, rmtPort=42661]
[14:17:28,948][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/172.31.81.91, rmtPort=59539]
[14:17:28,948][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/172.31.81.91, rmtPort=59539]
[14:17:29,039][INFO][tcp-disco-sock-reader-#11][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/172.31.81.91:59539, rmtPort=59539]
[14:17:29,167][WARNING][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Node is out of topology (probably, due to short-time network problems).
[14:17:29,167][INFO][tcp-disco-sock-reader-#11][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/172.31.81.91:59539, rmtPort=59539]
[14:17:29,167][WARNING][disco-event-worker-#41][GridDiscoveryManager] Local node SEGMENTED: TcpDiscoveryNode [id=0c1716fe-3b94-440e-905c-36fdca708ea4, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.90.9], sockAddrs=[ip-172-31-90-9/172.31.90.9:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=8, intOrder=8, lastExchangeTime=1542291449160, loc=true, ver=2.6.0#20180710-sha1:669feacc, isClient=false]
[14:17:29,393][SEVERE][tcp-disco-srvr-#2][] Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.]]
java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:5686)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
[14:17:29,439][SEVERE][tcp-disco-srvr-#2][] JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.]]

If I disable swap and enable persistence instead, I cannot start the Ignite server; it complains that a node with the same consistent ID has already been added to the topology:

Caused by: class org.apache.ignite.spi.IgniteSpiException: Failed to add node to topology because it has the same hash code for partitioned affinity as one of existing nodes

May I know whether, and how, Apache Ignite can run with Spark on data that exceeds memory? Any suggestions are highly appreciated!

Thanks,
Jia

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
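P.S. In case it matters, the persistence change I made was roughly the following addition to the server configuration (a sketch from memory; the exact file is example-cache.xml in the repository linked above):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <property name="defaultDataRegionConfiguration">
                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="name" value="Default_Region"/>
                    <!-- Enable Ignite native persistence so the data region can spill to disk -->
                    <property name="persistenceEnabled" value="true"/>
                </bean>
            </property>
        </bean>
    </property>
</bean>
```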
