Re: Node FAILED

crenique Fri, 11 May 2018 11:04:29 -0700

Thanks, Anton, for the information.
I have re-summarized the issue here and added more details, along with
both server and client logs from the time of the incident.



[Cluster configuration]

Windows Azure VM scale set
Windows Server 2016 10.0 amd64 VM x 40 nodes

VM information: Java(TM) SE Runtime Environment 1.8.0_162-b12 Oracle
Corporation Java HotSpot(TM) 64-Bit Server VM 25.162-b12

1 Ignite server and 2 Ignite clients on each node.
The full topology is:
2018/05/09 17:53:30.564 [INFO
][disco-event-worker-#56][GridDiscoveryManager] Topology snapshot [ver=120,
servers=40, clients=80, CPUs=640, heap=560.0GB]


[Ignite cache configuration]

Ignite.NET 2.3

ignite-config.xml
<http://apache-ignite-users.70518.x6.nabble.com/file/t1784/ignite-config.xml>

var lifecycleHandler = new LifecycleHandler();
IgniteConfiguration igniteConfiguration = new IgniteConfiguration()
{
        SpringConfigUrl = "ignite-config.xml",
        ClientMode = clientMode,
        JvmOptions = jvmOptions,
        LifecycleHandlers = new[] { lifecycleHandler },
        BinaryConfiguration = binaryConfiguration
};

m_ignite = Ignition.Start(igniteConfiguration);
m_ignite.Stopping += async (sender, args) =>
{
        // <= No console log has been printed
        Console.WriteLine(">>> Ignite node stopping ...");
};

m_ignite.Stopped += async (sender, args) =>
{
        // <= No console log has been printed
        Console.WriteLine(">>> Ignite node stopped.");
};


CacheConfiguration cacheConfig = new CacheConfiguration(cacheName, queryEntity)
{
        SqlSchema = "PUBLIC",
        Backups = 2,
        DataRegionName = "Default_Region",

        CopyOnRead = false,
        EvictionPolicy = new LruEvictionPolicy
        {
                MaxSize = 100000,
                // 2 GB; the long literal keeps the constant from overflowing int
                MaxMemorySize = 2L * 1024 * 1024 * 1024
        }
};

Key type: string
Value type: BinaryObject

One partitioned cache holding up to 1 million entries.
Average cache entry size is between 2 KB and 6 KB or more.
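
For a sense of scale, the figures above put the raw cache payload at
roughly 6 to 18 GB across the cluster once the 2 backups are counted,
which is under 0.5 GB per server if partitions are spread evenly over
the 40 servers. A minimal sketch of that arithmetic follows; the class
and method names are made up for illustration, and the estimate assumes
raw payload only (no keys, indexes, or per-entry overhead):

```csharp
using System;

// Hypothetical helper for a back-of-the-envelope size estimate.
public static class CacheSizeEstimate
{
    // Raw payload only: every entry is stored once as primary plus once per backup.
    public static long TotalBytes(long entries, long bytesPerEntry, int backups)
        => entries * bytesPerEntry * (1 + backups);

    public static void Main()
    {
        const long entries = 1000000;   // "up to 1 million" entries
        const int backups = 2;          // Backups = 2 in the cache configuration
        const int servers = 40;         // servers in the full topology
        const double gib = 1024.0 * 1024 * 1024;

        // Lower and upper bound of the reported average entry size (2 KB and 6 KB).
        foreach (long entrySize in new long[] { 2 * 1024, 6 * 1024 })
        {
            long total = TotalBytes(entries, entrySize, backups);
            Console.WriteLine(
                $"{entrySize} B/entry: total {total / gib:F2} GiB, " +
                $"per server {total / gib / servers:F2} GiB");
        }
    }
}
```

Even the upper bound is small compared with the 10 GB heap configured on
each server, so the data volume by itself does not look like the cause.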


[Problem]

Ignite server processes have been dropping out of the topology one by
one over time.

2018/05/09 17:53:30.564 [INFO
][disco-event-worker-#56][GridDiscoveryManager] Topology snapshot [ver=120,
servers=40, clients=80, CPUs=640, heap=560.0GB]
...
2018/05/10 08:20:44.254 [INFO
][disco-event-worker-#56][GridDiscoveryManager] Topology snapshot [ver=123,
servers=37, clients=80, CPUs=640, heap=530.0GB]
...
2018/05/10 11:29:43.461 [INFO
][disco-event-worker-#56][GridDiscoveryManager] Topology snapshot [ver=128,
servers=32, clients=80, CPUs=640, heap=480.0GB]
...
2018/05/10 19:30:08.519 [INFO
][disco-event-worker-#56][GridDiscoveryManager] Topology snapshot [ver=139,
servers=21, clients=80, CPUs=640, heap=370.0GB]


We have now lost 19 of the 40 Ignite servers from the topology.
It looks like the Ignite.NET server process freezes whenever a server
is dropped.
ignite-jstack-node26.txt
<http://apache-ignite-users.70518.x6.nabble.com/file/t1784/ignite-jstack-node26.txt>
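
As a cross-check on the snapshots above: the reported heap total drops
by exactly 10 GB for each lost server, which matches the -Xmx10g server
and -Xmx2g client settings listed under [JVM Options]. A small sketch
that parses a snapshot entry and verifies this; TopologySnapshot is a
hypothetical helper written for this post, not part of Ignite.NET:

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper: parses a "Topology snapshot" log entry.
public static class TopologySnapshot
{
    // \s* also matches the line breaks introduced by mail wrapping.
    private static readonly Regex Pattern = new Regex(
        @"Topology snapshot \[ver=(\d+),\s*servers=(\d+),\s*clients=(\d+),\s*CPUs=\d+,\s*heap=([\d.]+)GB\]");

    public static (int Ver, int Servers, int Clients, double HeapGb) Parse(string entry)
    {
        Match m = Pattern.Match(entry);
        if (!m.Success)
            throw new FormatException("Not a topology snapshot: " + entry);

        return (int.Parse(m.Groups[1].Value),
                int.Parse(m.Groups[2].Value),
                int.Parse(m.Groups[3].Value),
                double.Parse(m.Groups[4].Value, CultureInfo.InvariantCulture));
    }

    // Assumes every server runs with -Xmx10g and every client with -Xmx2g.
    public static double ExpectedHeapGb(int servers, int clients)
        => servers * 10.0 + clients * 2.0;

    public static void Main()
    {
        var snap = Parse(
            "[GridDiscoveryManager] Topology snapshot [ver=139,\n" +
            "servers=21, clients=80, CPUs=640, heap=370.0GB]");

        bool consistent = snap.HeapGb == ExpectedHeapGb(snap.Servers, snap.Clients);
        Console.WriteLine($"ver={snap.Ver} servers={snap.Servers} heap consistent: {consistent}");
    }
}
```

All four snapshots satisfy this relation, so the missing heap is fully
explained by the missing server JVMs while the client count stays at 80.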
  


[JVM Options]

SERVER

"-Duser.timezone=UTC",
"-DIGNITE_QUIET=false",
"-Djava.net.preferIPv4Stack=true",
"-Djava.awt.headless=true",
"-Xms10g",
"-Xmx10g",
"-XX:+AlwaysPreTouch",
"-XX:+UseG1GC",
"-XX:+ScavengeBeforeFullGC",
"-XX:+DisableExplicitGC"

CLIENT

"-Duser.timezone=UTC",
"-DIGNITE_QUIET=false",
"-Djava.net.preferIPv4Stack=true",
"-Djava.awt.headless=true",
"-Xms2g",
"-Xmx2g",
"-XX:+AlwaysPreTouch",
"-XX:+UseG1GC",
"-XX:+ScavengeBeforeFullGC",
"-XX:+DisableExplicitGC"
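
GC logging is not enabled in either list. To help rule GC pauses in or
out the next time a server freezes, options like the following could be
appended to the SERVER list. This is only a sketch, assuming the Java 8
HotSpot VM listed above; the variable name and the log file name
gc-server.log are examples, not part of the current setup:

```csharp
// Additional JVM options for GC diagnostics (Java 8 HotSpot flags).
// The log file name is a hypothetical example; adjust the path as needed.
string[] gcLoggingOptions =
{
    "-XX:+PrintGCDetails",
    "-XX:+PrintGCDateStamps",
    "-XX:+PrintGCApplicationStoppedTime",
    "-Xloggc:gc-server.log",
    "-XX:+UseGCLogFileRotation",
    "-XX:NumberOfGCLogFiles=10",
    "-XX:GCLogFileSize=100M",
    "-XX:+HeapDumpOnOutOfMemoryError"
};
```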


[Logs]

SERVER

2018/05/10 18:49:56.066 [INFO ][grid-timeout-worker-#39][IgniteKernal] 
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=8a2ce76e, uptime=23:20:53.173]
    ^-- H/N/C [hosts=40, nodes=104, CPUs=640]
    ^-- CPU [cur=2.13%, avg=1.23%, GC=0%]
    ^-- PageMemory [pages=37787]
    ^-- Heap [used=2571MB, free=74.89%, comm=10240MB]
    ^-- Non heap [used=77MB, free=-1%, comm=80MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=6, qSize=0]
    ^-- Outbound messages queue [size=13]
2018/05/10 18:49:56.343 [INFO ][grid-timeout-worker-#39][IgniteKernal]
FreeList [name=null, buckets=256, dataPages=35238, reusePages=12]

PROCESS FROZEN HERE AT 2018/05/10 18:50 !!!
The Ignite.NET server process is still alive, but metrics logging has
stopped, there are no other logs, and there has been no CPU usage since.

ignite-server-node26.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1784/ignite-server-node26.zip>
  


CLIENT

2018/05/10 18:50:29.383 [WARN ][Thread-2561][IgniteH2Indexing] Failed to
send message [node=TcpDiscoveryNode
[id=8a2ce76e-1bf2-4259-8592-81c11af9064f, addrs=[10.0.0.23, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500,
CDNode00000Q.hlbdeyzzwm2ujgdsre0nhzw3sg.dx.internal.cloudapp.net/10.0.0.23:47500],
discPort=47500, order=69, intOrder=69, lastExchangeTime=1525894136807,
loc=false, ver=2.3.0#19700101-sha1:00000000, isClient=false],
msg=GridQueryCancelRequest [qryReqId=5208317], errMsg=Failed to send message
(node left topology): TcpDiscoveryNode
[id=8a2ce76e-1bf2-4259-8592-81c11af9064f, addrs=[10.0.0.23, 127.0.0.1],
sockAddrs=[/127.0.0.1:47500,
CDNode00000Q.hlbdeyzzwm2ujgdsre0nhzw3sg.dx.internal.cloudapp.net/10.0.0.23:47500],
discPort=47500, order=69, intOrder=69, lastExchangeTime=1525894136807,
loc=false, ver=2.3.0#19700101-sha1:00000000, isClient=false]]

ignite-client-node08.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t1784/ignite-client-node08.zip>
  



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
