Recently I have been running Spark MLlib KMeans with an Apache Ignite 2.6.0 shared RDD on ten AWS r4.2xlarge workers. It works and runs to completion on 1 billion points (which fit in memory), but fails with 2 billion points (which exceed available memory).
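For context, here is a minimal sketch of what the job does (simplified from my actual code linked below; the cache name, k, and iteration count are illustrative, not the exact values I use):

```scala
import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession

object IgniteKMeansSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ignite-kmeans").getOrCreate()
    val sc = spark.sparkContext

    // Ignite context pointing at the server-side XML configuration
    val ic = new IgniteContext(sc, "ignite/server/example-cache.xml")

    // Shared RDD backed by an Ignite cache of (pointId -> coordinates)
    val sharedRdd: IgniteRDD[Long, Array[Double]] = ic.fromCache("points")

    // Train MLlib KMeans directly on the values of the shared RDD
    val vectors = sharedRdd.map { case (_, coords) => Vectors.dense(coords) }
    val model = KMeans.train(vectors, 10 /* k */, 20 /* maxIterations */)

    model.clusterCenters.foreach(println)
    ic.close(false)
  }
}
```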
My code for loading data into the Ignite shared RDD is here: https://github.com/jiazou-bigdata/SparkBench/blob/master/perf-bench/src/main/scala/edu/rice/bench/KMeansDataGenerator.scala#L64

My code for running Spark MLlib KMeans on the shared RDD is here: https://github.com/jiazou-bigdata/SparkBench/blob/master/perf-bench/src/main/scala/edu/rice/bench/IgniteRDDKMeans.scala

For the 2-billion-point run I enabled swap; the configuration file for the Ignite server is here: https://github.com/jiazou-bigdata/SparkBench/blob/master/ignite/server/example-cache.xml

I have tried loading 2 billion points into memory several times, and it failed every time. One error I hit repeatedly is that, while loading the data into the Ignite shared RDD, one Ignite worker fails for no obvious reason. The screen output matches the one in this post: http://apache-ignite-users.70518.x6.nabble.com/Node-pause-for-no-obvious-reason-td21923.html

The end of the log file looks like this:

[14:17:13,231][INFO][grid-timeout-worker-#23][IgniteKernal] FreeList [name=null, buckets=256, dataPages=12247613, reusePages=0]
[14:17:28,710][WARNING][jvm-pause-detector-worker][] Possible too long JVM pause: 11193 milliseconds.
[14:17:28,834][INFO][tcp-disco-sock-reader-#4][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/172.31.88.4:45550, rmtPort=45550]
[14:17:28,834][INFO][tcp-disco-sock-reader-#9][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/172.31.81.91:42661, rmtPort=42661]
[14:17:28,948][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/172.31.81.91, rmtPort=59539]
[14:17:28,948][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/172.31.81.91, rmtPort=59539]
[14:17:29,039][INFO][tcp-disco-sock-reader-#11][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/172.31.81.91:59539, rmtPort=59539]
[14:17:29,167][WARNING][tcp-disco-msg-worker-#3][TcpDiscoverySpi] Node is out of topology (probably, due to short-time network problems).
[14:17:29,167][INFO][tcp-disco-sock-reader-#11][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/172.31.81.91:59539, rmtPort=59539]
[14:17:29,167][WARNING][disco-event-worker-#41][GridDiscoveryManager] Local node SEGMENTED: TcpDiscoveryNode [id=0c1716fe-3b94-440e-905c-36fdca708ea4, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.31.90.9], sockAddrs=[ip-172-31-90-9/172.31.90.9:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=8, intOrder=8, lastExchangeTime=1542291449160, loc=true, ver=2.6.0#20180710-sha1:669feacc, isClient=false]
[14:17:29,393][SEVERE][tcp-disco-srvr-#2][] Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.]]
java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:5686)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
[14:17:29,439][SEVERE][tcp-disco-srvr-#2][] JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated unexpectedly.]]

If I disable swap and enable persistence instead, I cannot start the Ignite server; it complains that a node with the same consistent ID has already been added to the topology:

Caused by: class org.apache.ignite.spi.IgniteSpiException: Failed to add node to topology because it has the same hash code for partitioned affinity as one of existing nodes

May I know whether, and how, Apache Ignite can run with Spark on data that exceeds memory? Any suggestions are highly appreciated!

Thanks,
Jia

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
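P.S. In case it matters, the persistence change I made was roughly the following addition to the server configuration (a sketch from memory; the exact file is example-cache.xml in the repository linked above):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <property name="defaultDataRegionConfiguration">
                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="name" value="Default_Region"/>
                    <!-- Enable Ignite native persistence so the data region can spill to disk -->
                    <property name="persistenceEnabled" value="true"/>
                </bean>
            </property>
        </bean>
    </property>
</bean>
```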
