Re: Spark to Ignite Data load, Ignite node crashing
Hi,

Looks like it was killed by the kernel. Check the logs for the OOM killer:

grep -i 'killed process' /var/log/messages

If the process was killed by Linux, correct your config: you may have given Ignite too much page memory, so set it to lower values [1]. If not, try to find the PID in the logs; maybe it was killed for some other reason.

[1] https://apacheignite.readme.io/docs/memory-configuration

Thanks!
-Dmitry

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
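For reference, the page-memory limit mentioned above is the data region maxSize set in the node's Spring XML (the default-config.xml seen in the logs below). A minimal sketch of lowering it, assuming the default region is capped at 6 GB on a 16 GB host (the 6 GB figure is illustrative only, not a value recommended in this thread):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <property name="defaultDataRegionConfiguration">
                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="name" value="default"/>
                    <!-- Illustrative cap: 6 GB instead of the 10 GB shown in the logs. -->
                    <property name="maxSize" value="#{6L * 1024 * 1024 * 1024}"/>
                    <property name="persistenceEnabled" value="true"/>
                </bean>
            </property>
        </bean>
    </property>
</bean>
```

Note that with persistence enabled, the checkpoint page buffer and JVM heap also count toward the node's RAM footprint, so lowering maxSize alone may not be enough.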
Re: Spark to Ignite Data load, Ignite node crashing
Attaching logs of the two nodes that crash every time. I have 4 nodes, but the other two very rarely crash. All nodes (VMs) have 4 CPU / 16 GB RAM / 200 GB HDD (shared storage).

node 3:

[16:35:21,938][INFO][main][IgniteKernal] 
>>>    __________  ________________
>>>   /  _/ ___/ |/ /  _/_  __/ __/
>>>  _/ // (7 7    // /  / / / _/
>>> /___/\___/_/|_/___/ /_/ /___/
>>>
>>> ver. 2.6.0#20180710-sha1:669feacc
>>> 2018 Copyright(C) Apache Software Foundation
>>>
>>> Ignite documentation: http://ignite.apache.org

[16:35:21,946][INFO][main][IgniteKernal] Config URL: file:/data/ignitedata/apache-ignite-fabric-2.6.0-bin/config/default-config.xml
[16:35:21,954][INFO][main][IgniteKernal] IgniteConfiguration [igniteInstanceName=null, pubPoolSize=8, svcPoolSize=8, callbackPoolSize=8, stripedPoolSize=8, sysPoolSize=8, mgmtPoolSize=4, igfsPoolSize=4, dataStreamerPoolSize=8, utilityCachePoolSize=8, utilityCacheKeepAliveTime=6, p2pPoolSize=2, qryPoolSize=8, igniteHome=/data/ignitedata/apache-ignite-fabric-2.6.0-bin, igniteWorkDir=/data/ignitedata/apache-ignite-fabric-2.6.0-bin/work, mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@6f94fa3e, nodeId=df202ccb-356f-426a-8131-e2cc0b9bf98f, marsh=org.apache.ignite.internal.binary.BinaryMarshaller@3023df74, marshLocJobs=false, daemon=false, p2pEnabled=false, netTimeout=5000, sndRetryDelay=1000, sndRetryCnt=3, metricsHistSize=1, metricsUpdateFreq=2000, metricsExpTime=9223372036854775807, discoSpi=TcpDiscoverySpi [addrRslvr=null, sockTimeout=0, ackTimeout=0, marsh=null, reconCnt=10, reconDelay=2000, maxAckTimeout=60, forceSrvMode=false, clientReconnectDisabled=false, internalLsnr=null], segPlc=STOP, segResolveAttempts=2, waitForSegOnStart=true, allResolversPassReq=true, segChkFreq=1, commSpi=TcpCommunicationSpi [connectGate=null, connPlc=null, enableForcibleNodeKill=false, enableTroubleshootingLog=false, srvLsnr=org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$2@6302bbb1, locAddr=null, locHost=null, locPort=47100, locPortRange=100, shmemPort=-1, directBuf=true, directSndBuf=false, 
idleConnTimeout=60, connTimeout=5000, maxConnTimeout=60, reconCnt=10, sockSndBuf=32768, sockRcvBuf=32768, msgQueueLimit=0, slowClientQueueLimit=1000, nioSrvr=null, shmemSrv=null, usePairedConnections=false, connectionsPerNode=1, tcpNoDelay=true, filterReachableAddresses=false, ackSndThreshold=32, unackedMsgsBufSize=0, sockWriteTimeout=2000, lsnr=null, boundTcpPort=-1, boundTcpShmemPort=-1, selectorsCnt=4, selectorSpins=0, addrRslvr=null, ctxInitLatch=java.util.concurrent.CountDownLatch@31304f14[Count = 1], stopping=false, metricsLsnr=org.apache.ignite.spi.communication.tcp.TcpCommunicationMetricsListener@34a3d150], evtSpi=org.apache.ignite.spi.eventstorage.NoopEventStorageSpi@2a4fb17b, colSpi=NoopCollisionSpi [], deploySpi=LocalDeploymentSpi [lsnr=null], indexingSpi=org.apache.ignite.spi.indexing.noop.NoopIndexingSpi@7cc0cdad, addrRslvr=null, clientMode=false, rebalanceThreadPoolSize=1, txCfg=org.apache.ignite.configuration.TransactionConfiguration@7c7b252e, cacheSanityCheckEnabled=true, discoStartupDelay=6, deployMode=SHARED, p2pMissedCacheSize=100, locHost=null, timeSrvPortBase=31100, timeSrvPortRange=100, failureDetectionTimeout=1, clientFailureDetectionTimeout=3, metricsLogFreq=6, hadoopCfg=null, connectorCfg=org.apache.ignite.configuration.ConnectorConfiguration@4d5d943d, odbcCfg=null, warmupClos=null, atomicCfg=AtomicConfiguration [seqReserveSize=1000, cacheMode=PARTITIONED, backups=1, aff=null, grpName=null], classLdr=null, sslCtxFactory=null, platformCfg=null, binaryCfg=null, memCfg=null, pstCfg=null, dsCfg=DataStorageConfiguration [sysRegionInitSize=41943040, sysCacheMaxSize=104857600, pageSize=0, concLvl=0, dfltDataRegConf=DataRegionConfiguration [name=default, maxSize=10737418240, initSize=268435456, swapPath=null, pageEvictionMode=DISABLED, evictionThreshold=0.9, emptyPagesPoolSize=100, metricsEnabled=true, metricsSubIntervalCount=5, metricsRateTimeInterval=6, persistenceEnabled=true, checkpointPageBufSize=0], storagePath=/data/ignitedata/data, 
checkpointFreq=18, lockWaitTime=1, checkpointThreads=4, checkpointWriteOrder=SEQUENTIAL, walHistSize=20, walSegments=10, walSegmentSize=67108864, walPath=/root/ignite/wal, walArchivePath=db/wal/archive, metricsEnabled=true, walMode=LOG_ONLY, walTlbSize=131072, walBuffSize=0, walFlushFreq=2000, walFsyncDelay=1000, walRecordIterBuffSize=67108864, alwaysWriteFullPages=false, fileIOFactory=org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIOFactory@4c583ecf, metricsSubIntervalCnt=5, metricsRateTimeInterval=6, walAutoArchiveAfterInactivity=-1, writeThrottlingEnabled=false, walCompactionEnabled=false], activeOnStart=true, autoActivation=true, longQryWarnTimeout=500, sqlConnCfg=null, cliConnCfg=ClientConnectorConfiguration [host=null, port=10800, portRange=100, sockSndBufSize=0, sockRcvBufSize=0, tcpNoDelay=true, maxOpenCursorsPerConn=128, threadPoolSize=8,
Spark to Ignite Data load, Ignite node crashing
Hello Ignite team, I am writing data from a Spark DataFrame to Ignite, and frequently one node goes down. I don't see any error in the log file; below is the trace. If I restart the node, it doesn't join the cluster unless I stop the Spark job that is writing data to the Ignite cluster. I have 4 nodes with 4 CPU / 16 GB RAM / 200 GB disk space, and persistence is enabled. What could be the reason?

[00:44:33]    __________  ________________
[00:44:33]   /  _/ ___/ |/ /  _/_  __/ __/
[00:44:33]  _/ // (7 7    // /  / / / _/
[00:44:33] /___/\___/_/|_/___/ /_/ /___/
[00:44:33]
[00:44:33] ver. 2.6.0#20180710-sha1:669feacc
[00:44:33] 2018 Copyright(C) Apache Software Foundation
[00:44:33]
[00:44:33] Ignite documentation: http://ignite.apache.org
[00:44:33]
[00:44:33] Quiet mode.
[00:44:33]   ^-- Logging to file '/data/ignitedata/apache-ignite-fabric-2.6.0-bin/work/log/ignite-d90d68c6.0.log'
[00:44:33]   ^-- Logging by 'JavaLogger [quiet=true, config=null]'
[00:44:33]   ^-- To see **FULL** console log here add -DIGNITE_QUIET=false or "-v" to ignite.{sh|bat}
[00:44:33]
[00:44:33] OS: Linux 3.10.0-862.3.2.el7.x86_64 amd64
[00:44:33] VM information: Java(TM) SE Runtime Environment 1.8.0_171-b11 Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 25.171-b11
[00:44:33] Configured plugins:
[00:44:33]   ^-- None
[00:44:33]
[00:44:33] Configured failure handler: [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0]]
[00:44:33] Message queue limit is set to 0 which may lead to potential OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and receiver sides. 
[00:44:33] Security status [authentication=off, tls/ssl=off]
[00:44:35] Nodes started on local machine require more than 20% of physical RAM what can lead to significant slowdown due to swapping (please decrease JVM heap size, data region size or checkpoint buffer size) [required=13412MB, available=15885MB]
[00:44:35] Performance suggestions for grid (fix if possible)
[00:44:35] To disable, set -DIGNITE_PERFORMANCE_SUGGESTIONS_DISABLED=true
[00:44:35]   ^-- Set max direct memory size if getting 'OOME: Direct buffer memory' (add '-XX:MaxDirectMemorySize=<size>[g|G|m|M|k|K]' to JVM options)
[00:44:35]   ^-- Disable processing of calls to System.gc() (add '-XX:+DisableExplicitGC' to JVM options)
[00:44:35]   ^-- Speed up flushing of dirty pages by OS (alter vm.dirty_expire_centisecs parameter by setting to 500)
[00:44:35]   ^-- Reduce pages swapping ratio (set vm.swappiness=10)
[00:44:35] Refer to this page for more performance suggestions: https://apacheignite.readme.io/docs/jvm-and-system-tuning
[00:44:35]
[00:44:35] To start Console Management & Monitoring run ignitevisorcmd.{sh|bat}
[00:44:35]
[00:44:35] Ignite node started OK (id=d90d68c6)
[00:44:35] >>> Ignite cluster is not active (limited functionality available). Use control.(sh|bat) script or IgniteCluster interface to activate. 
[00:44:35] Topology snapshot [ver=4, servers=4, clients=0, CPUs=16, offheap=40.0GB, heap=4.0GB]
[00:44:35]   ^-- Node [id=D90D68C6-C725-43F8-BC32-71363FE3E86F, clusterState=INACTIVE]
[00:44:35]   ^-- Baseline [id=0, size=4, online=3, offline=1]
[00:44:35]   ^-- 1 nodes left for auto-activation [a99529d8-e483-44b3-96eb-a5a773e380e3]
[00:44:35] Data Regions Configured:
[00:44:35]   ^-- default [initSize=256.0 MiB, maxSize=10.0 GiB, persistenceEnabled=true]
[00:48:20] Topology snapshot [ver=5, servers=4, clients=1, CPUs=16, offheap=50.0GB, heap=8.4GB]
[00:48:20]   ^-- Node [id=D90D68C6-C725-43F8-BC32-71363FE3E86F, clusterState=ACTIVE]
[00:48:20]   ^-- Baseline [id=0, size=4, online=3, offline=1]
[00:48:20] Data Regions Configured:
[00:48:20]   ^-- default [initSize=256.0 MiB, maxSize=10.0 GiB, persistenceEnabled=true]
[00:48:37] Topology snapshot [ver=6, servers=4, clients=2, CPUs=16, offheap=60.0GB, heap=12.0GB]
[00:48:37]   ^-- Node [id=D90D68C6-C725-43F8-BC32-71363FE3E86F, clusterState=ACTIVE]
[00:48:37]   ^-- Baseline [id=0, size=4, online=3, offline=1]
[00:48:37] Data Regions Configured:
[00:48:37]   ^-- default [initSize=256.0 MiB, maxSize=10.0 GiB, persistenceEnabled=true]
[00:48:37] Topology snapshot [ver=7, servers=4, clients=3, CPUs=16, offheap=70.0GB, heap=16.0GB]
[00:48:37]   ^-- Node [id=D90D68C6-C725-43F8-BC32-71363FE3E86F, clusterState=ACTIVE]
[00:48:37]   ^-- Baseline [id=0, size=4, online=3, offline=1]
[00:48:37] Data Regions Configured:
[00:48:37]   ^-- default [initSize=256.0 MiB, maxSize=10.0 GiB, persistenceEnabled=true]
[00:48:38] Topology snapshot [ver=8, servers=4, clients=4, CPUs=16, offheap=80.0GB, heap=19.0GB]
[00:48:38]   ^-- Node [id=D90D68C6-C725-43F8-BC32-71363FE3E86F, clusterState=ACTIVE]
[00:48:38]   ^-- Baseline [id=0, size=4, online=3, offline=1]
[00:48:38] Data Regions Configured:
[00:48:38]   ^-- default [initSize=256.0 MiB, maxSize=10.0 GiB, persistenceEnabled=true]
[00:48:40] Topology snapshot [ver=9, servers=4, clients=5, CPUs=16, offheap=90.0GB, heap=23.0GB] 
[00:48:40] ^-- Node [id=D90D68C6-C725-43F8-BC32-71363FE3E86F,
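The [required=13412MB, available=15885MB] warning in the log above can be checked against the configured sizes: the 10.0 GiB default data region, the checkpoint page buffer, the per-node JVM heap, and the 100 MiB system cache region (sysCacheMaxSize=104857600 in the config dump). A back-of-the-envelope sketch; the 2 GiB checkpoint-buffer default (used when checkpointPageBufSize=0 for a region this large) and the 1 GiB per-node heap are assumptions inferred from the logs, not values stated in this thread:

```python
# All figures in MiB.
data_region_max = 10 * 1024   # default region maxSize=10.0 GiB (from the logs)
checkpoint_buf  = 2 * 1024    # assumed default checkpoint page buffer for a large region
jvm_heap        = 1 * 1024    # heap=4.0GB across 4 server nodes, i.e. ~1 GiB per node
sys_region      = 100         # sysCacheMaxSize=104857600 bytes = 100 MiB

required = data_region_max + checkpoint_buf + jvm_heap + sys_region
print(f"{required}MB")  # 13412MB, matching the warning
```

Under these assumptions the total lands at ~84% of the 16 GB VM, leaving little room for the OS page cache and other processes, which is consistent with Dmitry's OOM-killer suggestion.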