I'm running a six-node Ignite 2.6 cluster.
The config for each server is as follows:
<bean id="grid.cfg"
class="org.apache.ignite.configuration.IgniteConfiguration">
<property name="segmentationPolicy" value="RESTART_JVM"/>
<property name="peerClassLoadingEnabled" value="true"/>
<property name="failureDetectionTimeout" value="60000"/>
<property name="dataStorageConfiguration">
<bean
class="org.apache.ignite.configuration.DataStorageConfiguration">
<property name="storagePath" value="/data/dc1/ignite"/>
<property name="walPath" value="/data/da1"/>
<property name="walArchivePath" value="/data/da1/archive"/>
<property name="defaultDataRegionConfiguration">
<bean
class="org.apache.ignite.configuration.DataRegionConfiguration">
<property name="name" value="default_Region"/>
<property name="initialSize" value="#{100L * 1024 * 1024 * 1024}"/>
<property name="maxSize" value="#{300L * 1024 * 1024 * 1024}"/>
<property name="persistenceEnabled" value="true"/>
<property name="checkpointPageBufferSize" value="#{8L * 1024 * 1024 * 1024}"/>
</bean>
</property>
<property name="walMode" value="BACKGROUND"/>
<property name="walFlushFrequency" value="5000"/>
<property name="checkpointFrequency" value="600000"/>
</bean>
</property>
<property name="discoverySpi">
<bean
class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
<property name="networkTimeout" value="60000" />
<property name="localPort" value="49500"/>
<property name="ipFinder">
<bean
class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
<property name="addresses">
<list>
<value>10.29.42.231:49500</value>
<value>10.29.42.233:49500</value>
<value>10.29.42.234:49500</value>
<value>10.29.42.235:49500</value>
<value>10.29.42.236:49500</value>
<value>10.29.42.232:49500</value>
</list>
</property>
</bean>
</property>
</bean>
</property>
<property name="gridLogger">
<bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
<constructor-arg type="java.lang.String"
value="config/ignite-log4j2.xml"/>
</bean>
</property>
</bean>
</beans>
I also enabled the Direct I/O plugin.
When I try to ingest data into Ignite using the Spark DataFrame API, the
cluster becomes very slow once the Spark driver connects, and some of the
server nodes eventually go down with this error:
Local node SEGMENTED: TcpDiscoveryNode
[id=8ce23742-702e-4309-934a-affd80bf3653, addrs=[10.29.42.232, 127.0.0.1],
sockAddrs=[/10.29.42.232:49500, /127.0.0.1:49500], discPort=49500, order=2,
intOrder=2, lastExchangeTime=1541571124026, loc=true,
ver=2.6.0#20180709-sha1:5faffcee, isClient=false]
[2018-11-07T06:12:04,032][INFO ][disco-pool-#457][TcpDiscoverySpi] Finished
node ping [nodeId=844fab1e-4189-4f10-bc84-b069bc18a267, res=true, time=6ms]
[2018-11-07T06:12:04,033][ERROR][tcp-disco-srvr-#2][] Critical system error
detected. Will be handled accordingly to configured handler [hnd=class
o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext
[type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread
tcp-disco-srvr-#2 is terminated unexpectedly.]]
java.lang.IllegalStateException: Thread tcp-disco-srvr-#2 is terminated
unexpectedly.
at
org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:5687)
[ignite-core-2.6.0.jar:2.6.0]
at
org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
[ignite-core-2.6.0.jar:2.6.0]
[2018-11-07T06:12:04,036][ERROR][tcp-disco-srvr-#2][] JVM will be halted
immediately due to the failure: [failureCtx=FailureContext
[type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread
tcp-disco-srvr-#2 is terminated unexpectedly.]]
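The server config above does not set a failure handler, so the
StopNodeOrHaltFailureHandler named in the log is presumably the one Ignite
applies by default. As a minimal sketch (not part of the posted config), this
is how that same handler would look if it were declared explicitly inside the
IgniteConfiguration bean; the no-argument form is assumed here:

```xml
<!-- Sketch only: would go inside the "grid.cfg" IgniteConfiguration bean. -->
<!-- Declares explicitly the handler that the log reports as configured. -->
<property name="failureHandler">
    <bean class="org.apache.ignite.failure.StopNodeOrHaltFailureHandler"/>
</property>
```

With this handler, a terminated system worker such as tcp-disco-srvr leads to
the node being stopped or the JVM being halted, which matches the last log
entry above.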
I examined the GC logs, and none of the nodes show long GC pauses.
The network interconnectivity between all these nodes is fine.
The complete logs for all six servers and the client are in the attachment.
From my observation, the PME (partition map exchange) process that runs when
a new thick client from the Spark DataFrame API joins the topology is very
slow and can lead to many problems.
I think the proposal suggested by Nikolay to change thick clients to Java
thin clients is a good way to improve this.
http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-Spark-Data-Frame-through-Thin-Client-td36814.html
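For context, a thin client connects to a server through its client connector
port instead of joining the discovery topology as a node, so it should not
trigger a PME. A minimal sketch of the server-side setting a thin client would
rely on, assuming Ignite's default thin-client port 10800; this fragment is
illustrative and is not part of the config posted above:

```xml
<!-- Sketch only: would go inside the "grid.cfg" IgniteConfiguration bean. -->
<property name="clientConnectorConfiguration">
    <bean class="org.apache.ignite.configuration.ClientConnectorConfiguration">
        <!-- 10800 is the default thin-client port; written out for clarity. -->
        <property name="port" value="10800"/>
    </bean>
</property>
```

The discovery port 49500 used in the config above would stay reserved for
server-to-server (and thick client) traffic.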
iglog.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1346/iglog.zip>