Hi,

How many folders do you have in /opt/ignite/wal/? Is there a chance that you have two folders there with different node IDs? Can you share your configuration?
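One thing worth checking in that directory: if the node's consistent ID is not pinned, a restarted pod can come up under a new node ID and leave a second node00-... folder behind, doubling the WAL usage on the mount. Here is a minimal sketch of the relevant Java configuration, assuming the paths from the df output in your message (the consistent ID value is only an illustration):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class WalPathsSketch {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Pin the consistent ID so a restarted pod reuses its existing
        // node00-... folder instead of creating a second one under the WAL
        // directory (this value is hypothetical; on Kubernetes the
        // stateful-set pod name is a convenient stable choice).
        cfg.setConsistentId("ignite-cluster-ignite-server-6");

        DataStorageConfiguration storage = new DataStorageConfiguration();

        // 12GB persistent default data region, as described in the thread.
        storage.getDefaultDataRegionConfiguration()
            .setPersistenceEnabled(true)
            .setMaxSize(12L * 1024 * 1024 * 1024);

        // WAL work directory and archive share the 12GB /opt/ignite/wal
        // mount; persistence files live on the separate 1TB mount.
        storage.setWalPath("/opt/ignite/wal");
        storage.setWalArchivePath("/opt/ignite/wal/archive");
        storage.setStoragePath("/opt/ignite/persistence");

        cfg.setDataStorageConfiguration(storage);
        Ignition.start(cfg);
    }
}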
Thanks,
Evgenii

Mon, 29 Apr 2019 at 11:17, shivakumar <[email protected]>:

> Hi all,
>
> I have a 7-node Ignite cluster running on a Kubernetes platform. Each
> instance is configured with 64GB total RAM (32GB heap + 12GB default data
> region + the remaining 18GB for the Ignite process), 6 CPU cores, a 12GB
> disk mount for the WAL + WAL archive, and a separate 1TB disk mount for
> native persistence.
>
> My problem is that one of the pods (Ignite instances) went into
> CrashLoopBackOff state and is not recovering from the crash:
>
> [root@ignite-stability-controller stability]# kubectl get pods | grep ignite-server
> ignite-cluster-ignite-server-0        3/3   Running            5     3d19h
> ignite-cluster-ignite-server-1        3/3   Running            5     3d19h
> ignite-cluster-ignite-server-2        3/3   Running            5     3d19h
> ignite-cluster-ignite-server-3        3/3   Running            5     3d19h
> ignite-cluster-ignite-server-4        3/3   Running            5     3d19h
> ignite-cluster-ignite-server-5        3/3   Running            5     3d19h
> *ignite-cluster-ignite-server-6       2/3   CrashLoopBackOff   342   3d19h*
> ignite-server-visor-5df679d57-p4rf4   1/1   Running            0     3d19h
>
> If I check the logs of the crashed instance, it says (our logs are in a
> different format):
>
> {"type":"log","host":"ignite-cluster-ignite-server-6","level":"INFO","systemid":"6f058db6","system":"ignite-service-st","time":"2019-04-29T06:47:41,149Z","logger":"FsyncModeFileWriteAheadLogManager","timezone":"UTC","marker":"","log":"Starting to copy WAL segment [absIdx=50008, segIdx=8, origFile=/opt/ignite/wal/node00-18d2aa89-7ae0-495b-a608-f28e8054e00f/0000000000000008.wal, dstFile=/opt/ignite/wal/archive/node00-18d2aa89-7ae0-495b-a608-f28e8054e00f/0000000000050008.wal]"}
>
> {"type":"log","host":"ignite-cluster-ignite-server-6","level":"INFO","systemid":"6f058db6","system":"ignite-service-st","time":"2019-04-29T06:47:41,154Z","logger":"GridClusterStateProcessor","timezone":"UTC","marker":"","log":"Writing BaselineTopology[id=1]"}
>
> {"type":"log","host":"ignite-cluster-ignite-server-6","level":"ERROR","systemid":"6f058db6","system":"ignite-service-st","time":"2019-04-29T06:47:41,170Z","logger":"","timezone":"UTC","marker":"","log":"Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=class o.a.i.IgniteCheckedException: Failed to archive WAL segment [srcFile=/opt/ignite/wal/node00-18d2aa89-7ae0-495b-a608-f28e8054e00f/0000000000000008.wal, dstFile=/opt/ignite/wal/archive/node00-18d2aa89-7ae0-495b-a608-f28e8054e00f/0000000000050008.wal.tmp]]]
> class org.apache.ignite.IgniteCheckedException: Failed to archive WAL segment [srcFile=/opt/ignite/wal/node00-18d2aa89-7ae0-495b-a608-f28e8054e00f/0000000000000008.wal, dstFile=/opt/ignite/wal/archive/node00-18d2aa89-7ae0-495b-a608-f28e8054e00f/0000000000050008.wal.tmp]
>     at org.apache.ignite.internal.processors.cache.persistence.wal.FsyncModeFileWriteAheadLogManager$FileArchiver.archiveSegment(FsyncModeFileWriteAheadLogManager.java:1826)
>     at org.apache.ignite.internal.processors.cache.persistence.wal.FsyncModeFileWriteAheadLogManager$FileArchiver.body(FsyncModeFileWriteAheadLogManager.java:1622)
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.nio.file.FileSystemException: /opt/ignite/wal/node00-18d2aa89-7ae0-495b-a608-f28e8054e00f/0000000000000008.wal -> /opt/ignite/wal/archive/node00-18d2aa89-7ae0-495b-a608-f28e8054e00f/0000000000050008.wal.tmp: No space left on device
>     at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
>     at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>     at sun.nio.fs.UnixCopyFile.copyFile(UnixCopyFile.java:253)
>     at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:581)
>     at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
>     at java.nio.file.Files.copy(Files.java:1274)
>     at org.apache.ignite.internal.processors.cache.persistence.wal.FsyncModeFileWriteAheadLogManager$FileArchiver.archiveSegment(FsyncModeFileWriteAheadLogManager.java:1813)
>     ... 3 more"}
>
> {"type":"log","host":"ignite-cluster-ignite-server-6","level":"WARN","systemid":"6f058db6","system":"ignite-service-st","time":"2019-04-29T06:47:41,171Z","logger":"FailureProcessor","timezone":"UTC","marker":"","log":"No deadlocked threads detected."}
>
> And when I checked disk usage, the volume mounted for the WAL + WAL
> archive is full:
>
> Filesystem      Size   Used  Avail  Use%  Mounted on
> overlay         158G   8.9G  142G     6%  /
> tmpfs            63G      0   63G     0%  /dev
> tmpfs            63G      0   63G     0%  /sys/fs/cgroup
> /dev/vda1       158G   8.9G  142G     6%  /etc/hosts
> tmpfs            63G    12K   63G     1%  /opt/cert
> shm              64M      0   64M     0%  /dev/shm
> */dev/vdc        12G    12G  7.1M   100%  /opt/ignite/wal*
> /dev/vdb       1008G   110G  899G    11%  /opt/ignite/persistence
> tmpfs            63G   8.0K   63G     1%  /etc/ignite-ssl-certs/tls.key
> tmpfs            63G    12K   63G     1%  /run/secrets/kubernetes.io/serviceaccount
> tmpfs            63G      0   63G     0%  /proc/acpi
> tmpfs            63G      0   63G     0%  /proc/scsi
> tmpfs            63G      0   63G     0%  /sys/firmware
>
> According to the Ignite documentation on the WAL archive
> (https://apacheignite.readme.io/docs/write-ahead-log#section-wal-archive),
> the WAL archive size is 4 times the checkpoint buffer size, and the
> checkpoint buffer size is in turn a function of the data region size
> (https://apacheignite.readme.io/docs/durable-memory-tuning#section-checkpointing-buffer-size).
> Since I have a 12GB data region, the checkpoint buffer size defaults to
> 2GB, which means the WAL archive size should be 4 * 2GB = 8GB. But I
> mounted a 12GB disk volume for WAL + WAL archive and it still filled up.
> I am seeing this on only 1 node now; in my earlier deployment it happened
> on a few nodes.
>
> regards,
> shiva
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
