Re: Mesos replicated log fills disk with logging output
Thanks for the hint! The cluster is using ext4, and judging from the linked thread this could indeed have been caused by a stalling hypervisor.

From: Jie Yu <yujie@gmail.com>
Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
Date: Monday, 8 January 2018 at 23:36
To: user <user@mesos.apache.org>
Subject: Re: Mesos replicated log fills disk with logging output

Stephan,

I haven't seen that before. A quick Google search suggests that it might be related to leveldb. The following thread might be related:

https://groups.google.com/d/msg/leveldb/lRrbv4Y0YgU/AtfRTfQXNoYJ

What is the filesystem you're using?

- Jie
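Jie's question about the underlying filesystem can be answered directly from the shell. A minimal sketch, assuming GNU coreutils `df` (the Mesos path below is the one from the thread; any path on the same mount works):

```shell
# Print the filesystem type backing a given path.
# On the cluster above you would pass /var/lib/mesos/replicated_log
# instead of the default ".".
fs_type() {
    df -T "${1:-.}" | awk 'NR==2 {print $2}'
}

fs_type /    # e.g. "ext4" on the cluster described in this thread
```

`stat -f -c %T <path>` gives the same answer without parsing `df` output, if coreutils `stat` is available.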
Mesos replicated log fills disk with logging output
Hi everyone,

a few days ago we bumped into an interesting issue that we had not seen before. Essentially, one of our toy clusters dissolved itself:

* 3 masters, each running Mesos (1.2.1), Aurora (0.19.0), and ZooKeeper (3.4.5) for leader election
* Master 1 and Master 2 had 100% disk usage, because /var/lib/mesos/replicated_log/LOG had grown to about 170 GB
* The replicated log of both Master 1 and 2 was corrupted. A process restart did not fix it.
* The ZooKeeper on Master 2 was corrupted as well. Logs indicated this was caused by the full disk.
* Master 3 was the leading Mesos master and healthy. Its disk usage was normal.

The content of /var/lib/mesos/replicated_log/LOG was an endless stream of:

2018/01/04-12:30:56.776466 7f65aae877c0 Recovering log #1753
2018/01/04-12:30:56.776577 7f65aae877c0 Level-0 table #1756: started
2018/01/04-12:30:56.778885 7f65aae877c0 Level-0 table #1756: 7526 bytes OK
2018/01/04-12:30:56.782433 7f65aae877c0 Delete type=0 #1753
2018/01/04-12:30:56.782484 7f65aae877c0 Delete type=3 #1751
2018/01/04-12:30:56.782642 7f6597fff700 Level-0 table #1759: started
2018/01/04-12:30:56.782686 7f6597fff700 Level-0 table #1759: 0 bytes OK
2018/01/04-12:30:56.783242 7f6597fff700 Delete type=0 #1757
2018/01/04-12:30:56.783312 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783499 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783538 7f6597fff700 Delete type=2 #1760
2018/01/04-12:30:56.783563 7f6597fff700 Compaction error: IO error: /var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783598 7f6597fff700 Manual compaction at level-0 from (begin) .. (end); will stop at '003060' @ 9423 : 1
2018/01/04-12:30:56.783607 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783698 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783728 7f6597fff700 Delete type=2 #1761
2018/01/04-12:30:56.783749 7f6597fff700 Compaction error: IO error: /var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783770 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783900 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783929 7f6597fff700 Delete type=2 #1762
2018/01/04-12:30:56.783950 7f6597fff700 Compaction error: IO error: /var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783970 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.784312 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.785547 7f6597fff700 Delete type=2 #1763

Content of the associated folder:

/var/lib/mesos/replicated_log.corrupted# ls -la
total 964480
drwxr-xr-x 2 mesos mesos         4096 Jan  5 10:12 .
drwxr-xr-x 4 mesos mesos         4096 Jan  5 10:27 ..
-rw-r--r-- 1 mesos mesos          724 Dec 14 16:22 001735.ldb
-rw-r--r-- 1 mesos mesos         7393 Dec 14 16:45 001737.sst
-rw-r--r-- 1 mesos mesos        22129 Jan  3 12:53 001742.sst
-rw-r--r-- 1 mesos mesos        14967 Jan  3 13:00 001747.sst
-rw-r--r-- 1 mesos mesos         7526 Jan  4 12:30 001756.sst
-rw-r--r-- 1 mesos mesos        15113 Jan  5 10:08 001765.sst
-rw-r--r-- 1 mesos mesos        65536 Jan  5 10:09 001767.log
-rw-r--r-- 1 mesos mesos           16 Jan  5 10:08 CURRENT
-rw-r--r-- 1 mesos mesos            0 Aug 25  2015 LOCK
-rw-r--r-- 1 mesos mesos 178303865220 Jan  5 10:12 LOG
-rw-r--r-- 1 mesos mesos    463093282 Jan  5 10:08 LOG.old
-rw-r--r-- 1 mesos mesos        65536 Jan  5 10:08 MANIFEST-001764

Monitoring indicates that the disk usage started to grow shortly after a badly coordinated configuration deployment:

* Master 1 was leading and restarted after a few hours of uptime
* Master 2 was now leading. After a few seconds (30s-60s or so) it got restarted as well
* Master 3 was now leading (and continued to do so)

I have to admit I am a bit surprised that the restart scenario could lead to the issues described above. Has anyone seen similar issues as well?

Thanks and best regards,
Stephan
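Since leveldb's diagnostic LOG file here grew to roughly 170 GB before the disk filled, a simple size guard run from cron would have caught the runaway growth much earlier. A hypothetical sketch, not part of any Mesos tooling; the path and 1 GiB threshold are assumptions:

```shell
# Warn when a file exceeds a size threshold.
# Usage: check_log_size <file> <max_bytes>
check_log_size() {
    size=$(wc -c < "$1")
    if [ "$size" -gt "$2" ]; then
        echo "WARN: $1 is $size bytes (limit $2)"
    fi
}

# On the cluster above this might run periodically, e.g.:
# check_log_size /var/lib/mesos/replicated_log/LOG 1073741824   # 1 GiB
```

Any alert on the WARN line would have fired days before /var hit 100%, while the replicated log on a quorum of masters was still intact.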