Re: Failed to wait for initial partition map exchange

2019-01-21 Thread Ilya Kasnacheev
Hello!

It seems that the only way is to make checkpoints more often by decreasing
the checkpointFrequency value (specified in milliseconds).

A smaller WAL means a faster startup.
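
For illustration, a minimal Java sketch of where this knob lives (the
60-second value and the activation step are assumptions for the example, not
a recommendation):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class CheckpointTuning {
    public static void main(String[] args) {
        DataStorageConfiguration ds = new DataStorageConfiguration();

        // Checkpoint more often, so fewer WAL records need replaying on restart.
        ds.setCheckpointFrequency(60_000L); // milliseconds; illustrative value

        // Persistence must be enabled for checkpoints/WAL to matter at all.
        ds.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDataStorageConfiguration(ds);

        try (Ignite ignite = Ignition.start(cfg)) {
            ignite.cluster().active(true); // persistent clusters start inactive
        }
    }
}
```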

Regards,
-- 
Ilya Kasnacheev


Sat, Jan 19, 2019 at 09:42, Justin Ji :

> Ilya -
>
> Thanks for your reply.
>
> Is there any configuration that can help us reduce the WAL recovery time in
> 2.6.0?
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Failed to wait for initial partition map exchange

2019-01-18 Thread Justin Ji
Ilya - 

Thanks for your reply.

Is there any configuration that can help us reduce the WAL recovery time in
2.6.0?



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Failed to wait for initial partition map exchange

2019-01-18 Thread Justin Ji
I also have this problem with Ignite 2.6.0.
It takes more than 350 seconds to restart an Ignite node.

Here is the system usage during partition map exchange:

The CPU usage is low and there is enough memory.

Here are the logs; we can see that applying WAL changes takes a long time,
about 375664 ms.

>>>    __________  ________________
>>>   /  _/ ___/ |/ /  _/_  __/ __/
>>>  _/ // (7 7    // /  / / / _/
>>> /___/\___/_/|_/___/ /_/ /___/
>>> 
>>> ver. 2.6.0#20180710-sha1:669feacc 
>>> 2018 Copyright(C) Apache Software Foundation 
>>> 
>>> Ignite documentation: http://ignite.apache.org

2019-01-18 09:17:27:979 [main] INFO  o.a.i.i.IgniteKernal%ignite-server:478
- Config URL: file:/opt/ignite/config/ignite-config-benchmark-cn.xml 
2019-01-18 09:17:27:996 [main] INFO  o.a.i.i.IgniteKernal%ignite-server:478
- IgniteConfiguration [igniteInstanceName=ignite-server, pubPoolSize=8,
svcPoolSize=8, callbackPoolSize=8, stripedPoolSize=8, sysPoolSize=16,
mgmtPoolSize=4, igfsPoolSize=8, dataStreamerPoolSize=8,
utilityCachePoolSize=8, utilityCacheKeepAliveTime=6, p2pPoolSize=2,
qryPoolSize=16, igniteHome=/opt/ignite/apache-ignite-fabric,
igniteWorkDir=/opt/ignite/apache-ignite-fabric/work,
mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@6f94fa3e,
nodeId=67815159-8a6d-4c3a-b626-a82e5bea1a65,
marsh=org.apache.ignite.internal.binary.BinaryMarshaller@72ade7e3,
marshLocJobs=false, daemon=false, p2pEnabled=false, netTimeout=5000,
sndRetryDelay=1000, sndRetryCnt=3, metricsHistSize=1,
metricsUpdateFreq=2000, metricsExpTime=9223372036854775807,
discoSpi=ZookeeperDiscoverySpi [zkRootPath=/ignite/discovery,
zkConnectionString=10.0.230.12:2181,10.0.223.106:2181,10.0.247.246:2181,
joinTimeout=1, sesTimeout=3, clientReconnectDisabled=false,
internalLsnr=null,
stats=org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoveryStatistics@560348e6],
segPlc=STOP, segResolveAttempts=2, waitForSegOnStart=true,
allResolversPassReq=true, segChkFreq=1, commSpi=TcpCommunicationSpi
[connectGate=null, connPlc=null, enableForcibleNodeKill=false,
enableTroubleshootingLog=false,
srvLsnr=org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$2@1df8b5b8,
locAddr=null, locHost=null, locPort=47174, locPortRange=100, shmemPort=-1,
directBuf=true, directSndBuf=false, idleConnTimeout=60,
connTimeout=5000, maxConnTimeout=60, reconCnt=10, sockSndBuf=32768,
sockRcvBuf=32768, msgQueueLimit=1024, slowClientQueueLimit=0, nioSrvr=null,
shmemSrv=null, usePairedConnections=false, connectionsPerNode=1,
tcpNoDelay=true, filterReachableAddresses=false, ackSndThreshold=32,
unackedMsgsBufSize=0, sockWriteTimeout=2000, lsnr=null, boundTcpPort=-1,
boundTcpShmemPort=-1, selectorsCnt=4, selectorSpins=0, addrRslvr=null,
ctxInitLatch=java.util.concurrent.CountDownLatch@23202fce[Count = 1],
stopping=false,
metricsLsnr=org.apache.ignite.spi.communication.tcp.TcpCommunicationMetricsListener@7b993c65],
evtSpi=org.apache.ignite.spi.eventstorage.NoopEventStorageSpi@37911f88,
colSpi=NoopCollisionSpi [], deploySpi=LocalDeploymentSpi [lsnr=null],
indexingSpi=org.apache.ignite.spi.indexing.noop.NoopIndexingSpi@4ea5b703,
addrRslvr=null, clientMode=false, rebalanceThreadPoolSize=4,
txCfg=org.apache.ignite.configuration.TransactionConfiguration@2a7ed1f,
cacheSanityCheckEnabled=true, discoStartupDelay=6, deployMode=SHARED,
p2pMissedCacheSize=100, locHost=null, timeSrvPortBase=31100,
timeSrvPortRange=100, failureDetectionTimeout=1,
clientFailureDetectionTimeout=3, metricsLogFreq=6, hadoopCfg=null,
connectorCfg=org.apache.ignite.configuration.ConnectorConfiguration@3fa247d1,
odbcCfg=null, warmupClos=null, atomicCfg=AtomicConfiguration
[seqReserveSize=1000, cacheMode=PARTITIONED, backups=1, aff=null,
grpName=null], classLdr=null, sslCtxFactory=null, platformCfg=null,
binaryCfg=null, memCfg=null, pstCfg=null, dsCfg=DataStorageConfiguration
[sysRegionInitSize=41943040, sysCacheMaxSize=104857600, pageSize=0,
concLvl=0, dfltDataRegConf=DataRegionConfiguration [name=default,
maxSize=3119855206, initSize=268435456, swapPath=null,
pageEvictionMode=DISABLED, evictionThreshold=0.9, emptyPagesPoolSize=100,
metricsEnabled=false, metricsSubIntervalCount=5,
metricsRateTimeInterval=6, persistenceEnabled=false,
checkpointPageBufSize=0], storagePath=/data/ignite/persistence,
checkpointFreq=36, lockWaitTime=1, checkpointThreads=4,
checkpointWriteOrder=SEQUENTIAL, walHistSize=10, walSegments=10,
walSegmentSize=67108864, walPath=/ignite-wal/ignite/wal,
walArchivePath=/ignite-wal/ignite/wal/archive, metricsEnabled=false,
walMode=BACKGROUND, walTlbSize=131072, walBuffSize=0, walFlushFreq=2000,
walFsyncDelay=1000, walRecordIterBuffSize=67108864,
alwaysWriteFullPages=false,
fileIOFactory=org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIOFactory@4f4c4b1a,
metricsSubIntervalCnt=5, metricsRateTimeInterval=6,
walAutoArchiveAfterInactivity=-1, writeThrottlingEnabled=true,
walCompactionEnabled=false], activeOnStart=true, autoActi

Re: Failed to wait for initial partition map exchange

2018-09-25 Thread Ilya Kasnacheev
Hello!

Regarding PME problems.
OOM will cause this. High GC could cause this under some circumstances.
High CPU or disk usage should not cause this. Network unavailability (such
as a closed communication port) could also cause it.

But the prime cause is programming errors. Either these are errors on the
Apache Ignite side (caused by some strange circumstances, since all normal
cases should be covered by testing), or they are in your code.

Such as deadlocks. If you have deadlocks in your code exposed to Apache
Ignite, or you manage to lock up Apache Ignite in other ways (listeners,
invokes and continuous queries are notorious for that, since there are
limitations on the operations you can use from within them), you can hit
an infinite PME very easily.
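
As an illustration of the listener pitfall, a minimal sketch (the cache
names and the single-thread executor are assumptions) that offloads cache
operations out of the notification thread instead of calling them inline:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.cache.event.CacheEntryEvent;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.ContinuousQuery;

public class NonBlockingListener {
    static void register(Ignite ignite) {
        IgniteCache<Integer, String> source = ignite.getOrCreateCache("source");
        IgniteCache<Integer, String> target = ignite.getOrCreateCache("target");

        ExecutorService exec = Executors.newSingleThreadExecutor();

        ContinuousQuery<Integer, String> qry = new ContinuousQuery<>();

        // Calling target.put(...) directly in the listener would run a cache
        // operation on an Ignite notification thread and can wedge PME; hand
        // the work to an application thread instead.
        qry.setLocalListener(events -> {
            for (CacheEntryEvent<? extends Integer, ? extends String> e : events)
                exec.submit(() -> target.put(e.getKey(), e.getValue()));
        });

        source.query(qry); // keep the returned cursor if you need to cancel later
    }
}
```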

However, it's hard to say without reviewing logs and thread dumps.

Regards,
-- 
Ilya Kasnacheev


Thu, Sep 13, 2018 at 1:31, ndipiazza3565 :

> I'm trying to build up a list of possible causes for this issue.
>
> I'm only really interested in the issues that occur after successful
> production deployments. Meaning the environment has been up for some time
> successfully, but then later on our Ignite nodes will not start and get stuck.
>
> But as of now, a certain bad behavior from a single node in the ignite
> cluster can cause a deadlock
>
> * Anything that causes one of the ignite nodes to become unresponsive
>   * oom
>   * high gc
>   * high cpu
>   * high disk usage
> * Network issues?
>
> I'm trying to get a list of the causes for this issue so I can troubleshoot
> further.
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Failed to wait for initial partition map exchange

2018-09-12 Thread ndipiazza3565
No. Persistence is disabled in my case. 



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Failed to wait for initial partition map exchange

2018-09-12 Thread eugene miretsky
Do you have persistence enabled?

On Wed, Sep 12, 2018 at 6:31 PM ndipiazza3565 <
nicholas.dipia...@lucidworks.com> wrote:

> I'm trying to build up a list of possible causes for this issue.
>
> I'm only really interested in the issues that occur after successful
> production deployments. Meaning the environment has been up for some time
> successfully, but then later on our Ignite nodes will not start and get stuck.
>
> But as of now, a certain bad behavior from a single node in the ignite
> cluster can cause a deadlock
>
> * Anything that causes one of the ignite nodes to become unresponsive
>   * oom
>   * high gc
>   * high cpu
>   * high disk usage
> * Network issues?
>
> I'm trying to get a list of the causes for this issue so I can troubleshoot
> further.
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Failed to wait for initial partition map exchange

2018-09-12 Thread ndipiazza3565
I'm trying to build up a list of possible causes for this issue.

I'm only really interested in the issues that occur after successful
production deployments. Meaning the environment has been up for some time
successfully, but then later on our Ignite nodes will not start and get stuck.

But as of now, a certain bad behavior from a single node in the ignite
cluster can cause a deadlock 

* Anything that causes one of the ignite nodes to become unresponsive 
  * oom
  * high gc
  * high cpu
  * high disk usage
* Network issues?

I'm trying to get a list of the causes for this issue so I can troubleshoot
further. 



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Failed to wait for initial partition map exchange

2017-07-03 Thread vkulichenko
This looks like a network issue. Please check that connections can be
established between any nodes in the topology and that nothing is blocked by
a firewall. Note that connectivity must be bidirectional, even for clients
(i.e. a server node can establish a TCP connection with a client node). Also
note that both discovery and communication are required to work, and they use
different ports (47500 and 47100 by default).
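
For reference, a minimal Java sketch that pins both SPIs to their default
ports so firewall rules can be written and verified against known values
(the port range is an illustrative assumption):

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

public class PortConfig {
    static IgniteConfiguration configure() {
        TcpDiscoverySpi disco = new TcpDiscoverySpi();
        disco.setLocalPort(47500);      // discovery port (default)
        disco.setLocalPortRange(10);    // open 47500-47509 in the firewall

        TcpCommunicationSpi comm = new TcpCommunicationSpi();
        comm.setLocalPort(47100);       // communication port (default)

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(disco);
        cfg.setCommunicationSpi(comm);
        return cfg;
    }
}
```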

-Val



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p14283.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2017-06-13 Thread Vladimir
INFO  2017-06-13 20:36:22 [localhost-startStop-1]
org.apache.ignite.internal.IgniteKernal%svip - [[OS: Linux
4.1.12-61.1.18.el7uek.x86_64 amd64]]
INFO  2017-06-13 20:36:22 [localhost-startStop-1]
org.apache.ignite.internal.IgniteKernal%svip - [[PID: 7608]]
[20:36:22] VM information: Java(TM) SE Runtime Environment 1.8.0_102-b14
Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 25.102-b14
INFO  2017-06-13 20:36:22 [localhost-startStop-1]
org.apache.ignite.internal.IgniteKernal%svip - [[Language runtime: Java
Platform API Specification ver. 1.8]]
INFO  2017-06-13 20:36:22 [localhost-startStop-1]
org.apache.ignite.internal.IgniteKernal%svip - [[VM information: Java(TM) SE
Runtime Environment 1.8.0_102-b14 Oracle Corporation Java HotSpot(TM) 64-Bit
Server VM 25.102-b14]]
INFO  2017-06-13 20:36:22 [localhost-startStop-1]
org.apache.ignite.internal.IgniteKernal%svip - [[VM total memory: 2.1GB]]
INFO  2017-06-13 20:36:22 [localhost-startStop-1]
org.apache.ignite.internal.IgniteKernal%svip - [[Remote Management [restart:
off, REST: on, JMX (remote: off)]]]
INFO  2017-06-13 20:36:22 [localhost-startStop-1]
org.apache.ignite.internal.IgniteKernal%svip - [[System cache's MemoryPolicy
size is configured to 40 MB. Use MemoryConfiguration.systemCacheMemorySize
property to change the setting.]]
INFO  2017-06-13 20:36:22 [localhost-startStop-1]
org.apache.ignite.internal.IgniteKernal%svip - [[Configured caches [in
'default' memoryPolicy: ['ignite-sys-cache', 'ignite-atomics-sys-cache'
INFO  2017-06-13 20:36:22 [localhost-startStop-1]
org.apache.ignite.internal.IgniteKernal%svip - [[Local node user attribute
[IgSupport_LogicClusterGroups=com.bpcbt.common.support.ignite.beans.IgLogicClusterGroups@0]]]
WARN  2017-06-13 20:36:22 [pub-#14%svip%]
org.apache.ignite.internal.GridDiagnostic - [[Initial heap size is 154MB
(should be no less than 512MB, use -Xms512m -Xmx512m).]]
[20:36:22] Initial heap size is 154MB (should be no less than 512MB, use
-Xms512m -Xmx512m).
[20:36:23] Configured plugins:
INFO  2017-06-13 20:36:23 [localhost-startStop-1]
org.apache.ignite.internal.processors.plugin.IgnitePluginProcessor -
[[Configured plugins:]]
[20:36:23]   ^-- None
INFO  2017-06-13 20:36:23 [localhost-startStop-1]
org.apache.ignite.internal.processors.plugin.IgnitePluginProcessor - [[  ^--
None]]
[20:36:23] 
INFO  2017-06-13 20:36:23 [localhost-startStop-1]
org.apache.ignite.internal.processors.plugin.IgnitePluginProcessor - [[]]
INFO  2017-06-13 20:36:23 [localhost-startStop-1]
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - [[Successfully
bound communication NIO server to TCP port [port=9343, locHost=/127.0.0.1,
selectorsCnt=4, selectorSpins=0, pairedConn=false]]]
WARN  2017-06-13 20:36:23 [localhost-startStop-1]
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi - [[Message
queue limit is set to 0 which may lead to potential OOMEs when running cache
operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth
on sender and receiver sides.]]
[20:36:23] Message queue limit is set to 0 which may lead to potential OOMEs
when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to
message queues growth on sender and receiver sides.
WARN  2017-06-13 20:36:23 [localhost-startStop-1]
org.apache.ignite.spi.checkpoint.noop.NoopCheckpointSpi - [[Checkpoints are
disabled (to enable configure any GridCheckpointSpi implementation)]]
WARN  2017-06-13 20:36:23 [localhost-startStop-1]
org.apache.ignite.internal.managers.collision.GridCollisionManager -
[[Collision resolution is disabled (all jobs will be activated upon
arrival).]]
[20:36:23] Security status [authentication=off, tls/ssl=off]
INFO  2017-06-13 20:36:23 [localhost-startStop-1]
org.apache.ignite.internal.IgniteKernal%svip - [[Security status
[authentication=off, tls/ssl=off]]]
INFO  2017-06-13 20:36:23 [localhost-startStop-1]
org.apache.ignite.internal.processors.rest.protocols.tcp.GridTcpRestProtocol
- [[Command protocol successfully started [name=TCP binary,
host=0.0.0.0/0.0.0.0, port=11214]]]
INFO  2017-06-13 20:36:23 [localhost-startStop-1]
org.apache.ignite.internal.IgniteKernal%svip - [[Non-loopback local IPs:
192.168.122.1, 192.168.209.65]]
INFO  2017-06-13 20:36:23 [localhost-startStop-1]
org.apache.ignite.internal.IgniteKernal%svip - [[Enabled local MACs:
0021F6321229, 5254000A6937]]
INFO  2017-06-13 20:36:23 [localhost-startStop-1]
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - [[Successfully bound
to TCP port [port=9463, localHost=/127.0.0.1,
locNodeId=7fc5ea30-913d-4ff5-932b-62d81c6027db]]]
INFO  2017-06-13 20:36:24 [localhost-startStop-1]
org.apache.ignite.internal.processors.cache.GridCacheProcessor - [[Started
cache [name=ignite-sys-cache, memoryPolicyName=sysMemPlc, mode=REPLICATED]]]
INFO  2017-06-13 20:36:24 [localhost-startStop-1]
org.apache.ignite.internal.processors.cache.GridCacheProcessor - [[Started
cache [name=ignite-atomics-sys-cache, memoryPolicyName=sysMemPlc,
mode=REPLICATED]]]
INFO  2017-06-13 2

Re: Failed to wait for initial partition map exchange

2017-06-07 Thread vkulichenko
Please attach full logs.

-Val



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p13487.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2017-06-07 Thread Vladimir
Unfortunately, it's not a warning: the new node cannot join the cluster. And
there is no heavy load on the cluster, no excessive CPU/memory consumption.
There is no network problem, because I tested all the nodes on a single
machine.



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p13458.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2017-06-06 Thread vkulichenko
Vladimir,

As I said before, this message is very generic and can be caused by multiple
things (sometimes it's not even an issue, BTW - it's a warning, not an error).
The first things to check are memory consumption, GC, and the network.

-Val



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p13439.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2017-06-06 Thread Vladimir
I have met a similar problem too. A node cannot start:

WARN  16:08:36.158 [main]
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager:
Still waiting for initial partition map exchange
[fut=GridDhtPartitionsExchangeFuture [dummy=false, forcePreload=false,
reassign=false, discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode
[id=fa2d0f11-80cf-4891-9533-7a36392157f6, addrs=[127.0.0.1],
sockAddrs=[/127.0.0.1:9463], discPort=9463, order=8, intOrder=6,
lastExchangeTime=1496754516085, loc=true, ver=2.0.0#20170430-sha1:d4eef3c6,
isClient=false], topVer=8, nodeId8=fa2d0f11, msg=null, type=NODE_JOINED,
tstamp=1496754274106], crd=TcpDiscoveryNode
[id=7751c8dc-9bf6-4a21-9ea5-376f2be83913, addrs=[127.0.0.1],
sockAddrs=[/127.0.0.1:9460], discPort=9460, order=1, intOrder=1,
lastExchangeTime=1496754273954, loc=false, ver=2.0.0#20170430-sha1:d4eef3c6,
isClient=false], exchId=GridDhtPartitionExchangeId
[topVer=AffinityTopologyVersion [topVer=8, minorTopVer=0], nodeId=fa2d0f11,
evt=NODE_JOINED], added=false, initFut=GridFutureAdapter
[ignoreInterrupts=false, state=DONE, res=true, hash=1259192920], init=true,
lastVer=null, partReleaseFut=GridCompoundFuture [rdc=null, initFlag=1,
lsnrCalls=4, done=true, cancelled=false, err=null, futs=[true, true, true,
true]], affChangeMsg=null, skipPreload=false, clientOnlyExchange=false,
initTs=1496754276137, centralizedAff=false, changeGlobalStateE=null,
exchangeOnChangeGlobalState=false, forcedRebFut=null, evtLatch=0,
remaining=[bf7a4e48-4867-437b-9cc5-dd598b8795f3,
7751c8dc-9bf6-4a21-9ea5-376f2be83913, 373fd690-4171-4503-ab1b-d622c3ec6fc7],
srvNodes=[TcpDiscoveryNode [id=7751c8dc-9bf6-4a21-9ea5-376f2be83913,
addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:9460], discPort=9460, order=1,
intOrder=1, lastExchangeTime=1496754273954, loc=false,
ver=2.0.0#20170430-sha1:d4eef3c6, isClient=false], TcpDiscoveryNode
[id=bf7a4e48-4867-437b-9cc5-dd598b8795f3, addrs=[127.0.0.1],
sockAddrs=[/127.0.0.1:9461], discPort=9461, order=6, intOrder=4,
lastExchangeTime=1496754273954, loc=false, ver=2.0.0#20170430-sha1:d4eef3c6,
isClient=false], TcpDiscoveryNode [id=373fd690-4171-4503-ab1b-d622c3ec6fc7,
addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:9462], discPort=9462, order=7,
intOrder=5, lastExchangeTime=1496754273954, loc=false,
ver=2.0.0#20170430-sha1:d4eef3c6, isClient=false], TcpDiscoveryNode
[id=fa2d0f11-80cf-4891-9533-7a36392157f6, addrs=[127.0.0.1],
sockAddrs=[/127.0.0.1:9463], discPort=9463, order=8, intOrder=6,
lastExchangeTime=1496754516085, loc=true, ver=2.0.0#20170430-sha1:d4eef3c6,
isClient=false]], super=GridFutureAdapter [ignoreInterrupts=false,
state=INIT, res=null, hash=1079801970]]]
WARN  16:08:36.499 [exchange-worker-#37%svip%]
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager:
Failed to wait for partition map exchange [topVer=AffinityTopologyVersion
[topVer=8, minorTopVer=0], node=fa2d0f11-80cf-4891-9533-7a36392157f6].
Dumping pending objects that might be the cause: 

What can it be?



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p13414.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2017-05-17 Thread vkulichenko
Generally, you can't add a new node while holding a lock. If you're holding
locks for long periods of time, I would recommend revisiting the
architecture. In any case, make sure that you properly release all the
locks. Also, I think you should upgrade to the latest version; I believe
there were some fixes in this area.
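
A minimal sketch of the release discipline being described, using Ignite's
explicit cache locks (the key type and the processing step are placeholders):

```java
import java.util.concurrent.locks.Lock;
import org.apache.ignite.IgniteCache;

public class LockRelease {
    static void process(IgniteCache<String, String> cache, String key) {
        Lock lock = cache.lock(key); // requires a TRANSACTIONAL cache

        lock.lock();
        try {
            // ... do processing while holding the lock ...
        } finally {
            // Always release in finally; otherwise a topology change (a new
            // node joining) can wait on this lock indefinitely during PME.
            lock.unlock();
        }
    }
}
```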

-Val



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p12960.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2017-05-16 Thread jaipal
Val,

It doesn't look like a memory issue. We are using explicit locking (lock the
entry, do processing, and then unlock). Under high load, when we bring down
one of the Ignite server nodes and then add it back to the cluster, we get
this issue. We observed exceptions like "Failed to wait for partition map
exchange"; sometimes it also gives "Failed to wait for partition release
future" and then dumps the pending cache futures, exchange futures and
transactions. Once we encounter this issue, the cluster hangs without
processing. We need to clean-start all Ignite nodes to resolve this. We are
observing this issue quite often and it is preventing us from scaling up and
down. Please find the attached file that contains the Ignite configuration we
used. Is something wrong with the cache configuration, with the way we are
using locking, or with Ignite?

ignite.xml
  



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p12957.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2017-05-16 Thread vkulichenko
Jai,

So what is the result of the investigation? Does it look like a memory issue
or not? As I said earlier, the issue itself doesn't have a generic solution;
you need to find out the reason.

-Val



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p12890.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2017-05-16 Thread jaipal
Hi Val,

I tried the recommended settings and configurations. I still observe this
issue, and it can be reproduced easily.

Is there any fix available for this issue?

Regards
Jai



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p12885.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2017-04-20 Thread Valentin Kulichenko
Usually it's caused by GC or network issues. First, I would recommend
checking that you're not running out of memory, and then collecting and
investigating GC logs:
https://apacheignite.readme.io/docs/jvm-and-system-tuning#debugging-memory-usage-issues-and-gc-pauses
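
As a complement to the GC logs, a small self-contained sketch (standard JMX
beans, nothing Ignite-specific) that prints cumulative GC counts and pause
times from inside the JVM; long or growing collection times point at the GC
pauses mentioned above:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // One line per collector: total collections and total time spent in GC.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
            System.out.printf("%s: count=%d, time=%dms%n",
                gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
    }
}
```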

-Val

On Tue, Apr 18, 2017 at 2:10 PM, jaipal  wrote:

> We are facing a similar issue with version 1.9. Are there any recommended
> configuration parameters to overcome this issue?
>
>
>
> --
> View this message in context: http://apache-ignite-users.
> 70518.x6.nabble.com/Failed-to-wait-for-initial-partition-
> map-exchange-tp6252p12028.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.
>


Re: Failed to wait for initial partition map exchange

2017-04-18 Thread jaipal
We are facing a similar issue with version 1.9. Are there any recommended
configuration parameters to overcome this issue?



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p12028.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2016-12-01 Thread vkulichenko
Hi Yuci,

1.8 is in the final testing/bug-fixing stage, so it should become available
soon. However, "Failed to wait for initial partition map exchange" is a very
generic error and can be caused by different circumstances. I would recommend
building from the 1.8 branch and running your tests on it to check whether
the issue is reproduced or not.

-Val



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p9338.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2016-12-01 Thread yucigou
Hi Vladislav,

We have the same issue, "Failed to wait for initial partition map exchange",
when stopping or restarting one of the nodes in our cluster. In other words,
when stopping or restarting just one of the nodes, the whole cluster screws
up, i.e., the whole cluster stops working.

Just had a look at the ticket you mentioned,
https://issues.apache.org/jira/browse/IGNITE-3748, and as stated it has been
fixed for version 1.8. Do you know when version 1.8 will be released? This
has hit our production, and we really want this fix to be applied as soon as
possible.

Many thanks,
Yuci



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p9333.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2016-09-06 Thread vkulichenko
Hi,

This exception can happen for different reasons. Please attach full logs
from all nodes.

-Val



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp7454p7569.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2016-09-05 Thread Jason
No fix or workaround for this so far.



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p7536.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2016-09-05 Thread Ignitebie
Jason,

The scenario that you shared is the same as ours: we face a similar issue in
a grid of large-heap servers, where a restarting node keeps waiting on the
initial partition map exchange.

Have you received a fix, or were you able to hack a solution? Please share.

Thanks.




--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p7535.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2016-09-03 Thread Ignitebie
Hi,

Can someone please take a look at this? We have to restart our grid, or at
times clean the marshaller dirs for each client and server, as a workaround.

Thanks.



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp7454p7517.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2016-08-24 Thread Vladislav Pyatkov
Thanks Jason, for the information.

I have created Jira ticket[1].

[1]: https://issues.apache.org/jira/browse/IGNITE-3748

On Fri, Aug 19, 2016 at 6:27 PM, Jason  wrote:

> Thanks Vladislav.
>
> Attached the logs & thread_dumps for the new change. Please take a look.
>
> Apache.config
> 
> default-config.xml
>  n7171/default-config.xml>
> logs.zip
> 
> thread_dump.zip
>  >
>
> Thanks,
> -Jason
>
>
>
> --
> View this message in context: http://apache-ignite-users.
> 70518.x6.nabble.com/Failed-to-wait-for-initial-partition-
> map-exchange-tp6252p7171.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.
>



-- 
Vladislav Pyatkov


Re: Failed to wait for initial partition map exchange

2016-08-19 Thread Jason
Thanks Vladislav.

Attached the logs & thread_dumps for the new change. Please take a look.

Apache.config
  
default-config.xml
  
logs.zip
  
thread_dump.zip
  

Thanks,
-Jason



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p7171.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2016-08-19 Thread Vladislav Pyatkov
Also, try to get a dump (using jstack) from the node which is being added to
the topology.

On Fri, Aug 19, 2016 at 3:39 PM, Vladislav Pyatkov 
wrote:

> Jason,
>
> Please, add system property -DIGNITE_THREAD_DUMP_ON_EXCHANGE_TIMEOUT=true
> and attach new logs.
> It prints a dump to the log when the message "Failed to wait for partition
> eviction" appears.
>
> On Thu, Aug 18, 2016 at 6:11 PM, Jason  wrote:
>
>> Thanks for your suggestion, Vladislav.
>>
> After using them, the same issue still happens.
> There must be some logic which causes race contention between eviction
> and cache partition rebalancing.
>
> Attached are the logs for all the server nodes and all the config files.
> Please help take a look; if any further info is needed, don't hesitate to
> ask.
>>
>> Apache.config
>> 
>> cache_config_in_client.cs
>> > cache_config_in_client.cs>
>> CO3SCH010254814_ignite-c90e3d10.log
>> > CO3SCH010254814_ignite-c90e3d10.log>
>> CO3SCH050481640_ignite-ae9e33dc.log
>> > CO3SCH050481640_ignite-ae9e33dc.log>
>> CO3SCH050500219_ignite-da04136c.log
>> > CO3SCH050500219_ignite-da04136c.log>
>> CO3SCH050511031_ignite-b58b1935.log
>> > CO3SCH050511031_ignite-b58b1935.log>
>> CO3SCH050520537_ignite-5672fb54.log
>> > CO3SCH050520537_ignite-5672fb54.log>
>> default-config.xml
>> > default-config.xml>
>>
>> thanks,
>> -Jason
>>
>>
>>
>> --
>> View this message in context: http://apache-ignite-users.705
>> 18.x6.nabble.com/Failed-to-wait-for-initial-partition-map-
>> exchange-tp6252p7153.html
>> Sent from the Apache Ignite Users mailing list archive at Nabble.com.
>>
>
>
>
> --
> Vladislav Pyatkov
>



-- 
Vladislav Pyatkov


Re: Failed to wait for initial partition map exchange

2016-08-19 Thread Vladislav Pyatkov
Jason,

Please add the system property -DIGNITE_THREAD_DUMP_ON_EXCHANGE_TIMEOUT=true
and attach new logs.
It prints a dump to the log when the message "Failed to wait for partition
eviction" appears.
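
For anyone setting the property programmatically rather than on the command
line, a minimal sketch (the empty configuration is just for illustration; the
property must be set before the node starts):

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class StartWithExchangeDumps {
    public static void main(String[] args) {
        // Same effect as passing -DIGNITE_THREAD_DUMP_ON_EXCHANGE_TIMEOUT=true
        // on the JVM command line.
        System.setProperty("IGNITE_THREAD_DUMP_ON_EXCHANGE_TIMEOUT", "true");

        Ignition.start(new IgniteConfiguration());
    }
}
```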

On Thu, Aug 18, 2016 at 6:11 PM, Jason  wrote:

> Thanks for your suggestion, Vladislav.
>
> After using them, the same issue still happens.
> There must be some logic which causes race contention between eviction
> and cache partition rebalancing.
>
> Attached are the logs for all the server nodes and all the config files.
> Please help take a look; if any further info is needed, don't hesitate to
> ask.
>
> Apache.config
> 
> cache_config_in_client.cs
>  n7153/cache_config_in_client.cs>
> CO3SCH010254814_ignite-c90e3d10.log
>  n7153/CO3SCH010254814_ignite-c90e3d10.log>
> CO3SCH050481640_ignite-ae9e33dc.log
>  n7153/CO3SCH050481640_ignite-ae9e33dc.log>
> CO3SCH050500219_ignite-da04136c.log
>  n7153/CO3SCH050500219_ignite-da04136c.log>
> CO3SCH050511031_ignite-b58b1935.log
>  n7153/CO3SCH050511031_ignite-b58b1935.log>
> CO3SCH050520537_ignite-5672fb54.log
>  n7153/CO3SCH050520537_ignite-5672fb54.log>
> default-config.xml
>  n7153/default-config.xml>
>
> thanks,
> -Jason
>
>
>
> --
> View this message in context: http://apache-ignite-users.
> 70518.x6.nabble.com/Failed-to-wait-for-initial-partition-
> map-exchange-tp6252p7153.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.
>



-- 
Vladislav Pyatkov


Re: Failed to wait for initial partition map exchange

2016-08-18 Thread Jason
Thanks for your suggestion, Vladislav.

After using them, the same issue still happens.
There must be some logic which causes race contention between eviction
and cache partition rebalancing.

Attached are the logs for all the server nodes and all the config files.
Please help take a look; if any further info is needed, don't hesitate to
ask.

Apache.config
  
cache_config_in_client.cs

  
CO3SCH010254814_ignite-c90e3d10.log

  
CO3SCH050481640_ignite-ae9e33dc.log

  
CO3SCH050500219_ignite-da04136c.log

  
CO3SCH050511031_ignite-b58b1935.log

  
CO3SCH050520537_ignite-5672fb54.log

  
default-config.xml
  

thanks,
-Jason



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p7153.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2016-08-18 Thread Vladislav Pyatkov
I left a comment in another thread:
http://apache-ignite-users.70518.x6.nabble.com/Fail-to-join-topology-and-repeat-join-process-tt6987.html#a7148

Duplicating it here:

If you suspect a deadlock there, you can increase
IGNITE_LONG_OPERATIONS_DUMP_TIMEOUT (through JVM system properties) and
networkTimeout (through the Ignite configuration XML) to several minutes.

-DIGNITE_LONG_OPERATIONS_DUMP_TIMEOUT=30
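
A minimal sketch combining both knobs (the 5-minute value is an illustrative
assumption, not a recommendation):

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class LongOperationTimeouts {
    public static void main(String[] args) {
        // Equivalent of -DIGNITE_LONG_OPERATIONS_DUMP_TIMEOUT=... on the JVM
        // command line; must be set before the node starts.
        System.setProperty("IGNITE_LONG_OPERATIONS_DUMP_TIMEOUT", "300000");

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setNetworkTimeout(300_000L); // "several minutes", per the advice above

        Ignition.start(cfg);
    }
}
```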





On Wed, Aug 17, 2016 at 5:48 PM, Jason  wrote:

> Hi Val,
>
> I reduced the server nodes to 5 with a big off-heap cache and can
> definitely reproduce this issue when the new node tries to join the
> topology.
> For the new joining node, it takes hundreds of seconds to sync the cache
> partitions, and it says it has finished with the log "Completed (final)
> rebalancing [cache=cache_raw_gbievent", but still "Failed to wait for
> partition map exchange".
>
> From the log, it seems that there are two pending partition futures: one is
> the partition map exchange and the other is the cache eviction.
>
> I've attached the full logs for the 5 server nodes and their config files.
> Would you help take a look and provide some suggestions? If any further
> info is needed, don't hesitate to ask; I can easily reproduce the issue.
>
> FYI, CO3SCH050520537 is the new added node and you can use its time as a
> reference.
>
> Any advice or suggestion should be appreciated.
>
> Apache.config
> 
> default-config.xml
>  n7135/default-config.xml>
> logs.zip
> 
>
> Thanks,
> -Jason
>
>
>
>
>
> --
> View this message in context: http://apache-ignite-users.
> 70518.x6.nabble.com/Failed-to-wait-for-initial-partition-
> map-exchange-tp6252p7135.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.
>



-- 
Vladislav Pyatkov


Re: Failed to wait for initial partition map exchange

2016-08-17 Thread Jason
Hi Val,

I reduced the server nodes to 5 with a big off-heap cache and can definitely
reproduce this issue when the new node tries to join the topology.
For the new joining node, it takes hundreds of seconds to sync the cache
partitions, and it says it has finished with the log "Completed (final)
rebalancing [cache=cache_raw_gbievent", but still "Failed to wait for
partition map exchange".

From the log, it seems that there are two pending partition futures: one is
the partition map exchange and the other is the cache eviction.

I've attached the full logs for the 5 server nodes and their config files.
Would you help take a look and provide some suggestions? If any further
info is needed, don't hesitate to ask; I can easily reproduce the issue.

FYI, CO3SCH050520537 is the new added node and you can use its time as a
reference.

Any advice or suggestion should be appreciated.

Apache.config
  
default-config.xml
  
logs.zip
  

Thanks,
-Jason





--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p7135.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2016-08-14 Thread Jason
Hi Val,

I have tried 1.7.0, and still get the same error.

It seems that it can definitely be reproduced when the cache is very big,
like over 15 GB per node in the off-heap.

Is there any logic that could be correlated with this?

Thanks,
-Jason



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p7046.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.



Re: Failed to wait for initial partition map exchange

2016-08-12 Thread vkulichenko
Hi Jason,

Can you try with 1.7? There were a couple of serious fixes in the
communication SPI implementation for problems which could potentially cause
these issues.

-Val



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p7029.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2016-08-06 Thread Jason
Thanks Val.

But it seems that the failed node cannot be isolated by setting
FailureDetectionTimeout, say to 2000 ms.

When the cluster, with 22 server nodes and 146 clients, runs for some time,
e.g. half a day, and there is a failed server node, e.g. due to a network
issue, the newly started nodes hang with the "Failed to wait for initial
partition map exchange" error.

I've done some debugging on the hanging node and found that after it sends
the GridDhtAffinityAssignmentRequest in the requestFromNextNode method, it is
blocked by GridDhtAssignmentFetchFuture in the "exchange worker" thread with
no response incoming; this keeps up for almost one day, and it still cannot
recover.

And I've also checked the log on the node that was queried in
requestFromNextNode: no errors, except the normal metrics.

BTW, attached are the thread dumps for this failed node and the requested
node; please help take a look, and any suggestion will be appreciated.

FYI, we've spent over one month testing Ignite so far, but it seems that if
this cannot be root-caused and resolved, we can only give up all our effort
on Ignite now.

thread_dump_for_failed_to_wait_for_initial_partition_map_exchange.txt

  
thread_dump_for_requested_node.log

  
failed_node.log
  
reqeusted_node.log
  


Thanks,
-Jason






--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p6830.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2016-08-05 Thread vkulichenko
Hi Jason,

Try the failure detection timeout. This will address the vast majority of
such issues.

-Val



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p6809.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2016-08-03 Thread Jason
Thanks Val.

When will this bug be fixed? It seems very risky to run a cluster stably
with this, especially for an in-memory data fabric.

BTW, has this been fixed in the Enterprise version?

BTW again, is there any simple way to mitigate this before it's really
fixed?

Thanks,
-Jason



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p6730.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2016-08-03 Thread vkulichenko
Jason,

If there is an OOM or a long GC pause, usually a node can't send any
heartbeats and will be removed after the failure detection timeout [1].
However, we have a ticket [2] for some improvements in this area.

[1]
https://apacheignite.readme.io/v1.6/docs/cluster-config#failure-detection-timeout
[2] https://issues.apache.org/jira/browse/IGNITE-3616

-Val



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p6726.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2016-08-02 Thread Jason
Hi Val,

It seems that when there's an assertion failure or OOM in one node, it
doesn't exit, right? So it still sends heartbeats to the others, and then the
whole cluster waits for it to recover (hanging).

Is there any easy way to make a node restart immediately when it encounters
unrecoverable errors, like an OOM or a severe assertion failure?

Thanks,
-Jason  





--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p6691.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2016-07-29 Thread vkulichenko
Hi Jason,

Take a look at the failure detection timeout [1]. This is a single setting
that defines the period of time after which a node that loses connectivity
(due to network issues, GC, etc.) is considered failed.

Also there is a mechanism for removing slow clients [2].

[1]
https://apacheignite.readme.io/docs/cluster-config#failure-detection-timeout
[2]
https://apacheignite.readme.io/docs/clients-vs-servers#managing-slow-clients
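
A minimal sketch of both settings (the timeout and queue-limit values are
illustrative assumptions):

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class FailureHandlingTuning {
    static IgniteConfiguration configure() {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // A node that stays unreachable for this long (network, GC pause, ...)
        // is dropped from the topology. [1]
        cfg.setFailureDetectionTimeout(10_000L);

        // A client whose outbound message queue grows past this limit is
        // considered slow and gets disconnected. [2]
        TcpCommunicationSpi comm = new TcpCommunicationSpi();
        comm.setSlowClientQueueLimit(1_000);
        cfg.setCommunicationSpi(comm);

        return cfg;
    }
}
```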

-Val



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p6633.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2016-07-29 Thread Jason
Thanks Val.

If there's an assertion failure or OOM error in one node, client or server,
can't it be identified as a failed node and be isolated? Or is there some
policy or parameter that can be tuned?

If not, the whole cluster would be very vulnerable to failures in a
distributed environment, right?
E.g. if one client node runs on a very busy machine, runs out of memory, and
fails, then the whole cluster will hang?

Thanks,
-Jason







--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p6624.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Failed to wait for initial partition map exchange

2016-07-13 Thread vkulichenko
Hi Jason,

There are a lot of possible reasons for that. Most likely something bad is
happening (an assertion failure, an out-of-memory error, etc.) which freezes
the cluster. I would recommend collecting full log files and thread dumps
from all the nodes (servers and clients) and investigating them. Are there
any exceptions in the logs? Are there any threads suspiciously hanging on
some operations?

If you attach the info here, I will be able to take a look.

-Val



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Failed-to-wait-for-initial-partition-map-exchange-tp6252p6280.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.