Re: [ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4 ) elist.h: 39: FAILED assert(!is_on_list())
On Mon, Feb 11, 2019 at 8:01 PM Jake Grimmett wrote:
> Hi Zheng,
>
> Many, many thanks for your help...
>
> Your suggestion of setting large values for mds_cache_size and
> mds_cache_memory_limit stopped our MDS crashing :)
>
> The values in ceph.conf are now:
>
> mds_cache_size = 8589934592
> mds_cache_memory_limit = 17179869184
>
> Should these values be left in our configuration?

No, you should change them back to their original values.

> again thanks for the assistance,
>
> Jake

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
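The "restore the original values via the admin socket" step above can be sketched as below. This is a sketch, not taken from the thread: the daemon name mds.ceph01 is a placeholder, and the values shown are the Mimic defaults (mds_cache_size = 0, i.e. no inode-count limit, and mds_cache_memory_limit = 1 GiB); substitute whatever your cluster used before the workaround.

```shell
# Check what the running MDS currently has (daemon name is a placeholder):
ceph daemon mds.ceph01 config get mds_cache_memory_limit

# Restore the pre-workaround values; the values below are the Mimic
# defaults and are an assumption -- use your own originals:
ceph daemon mds.ceph01 config set mds_cache_size 0
ceph daemon mds.ceph01 config set mds_cache_memory_limit 1073741824
```

Remember to also remove (or revert) the matching entries in ceph.conf, since admin-socket changes do not survive an MDS restart.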
Re: [ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4 ) elist.h: 39: FAILED assert(!is_on_list())
Hi Zheng,

Sorry - I've just re-read your email and saw your instruction to restore
mds_cache_size and mds_cache_memory_limit to their original values if the
MDS does not crash - I have now done this...

thanks again for your help,

best regards,

Jake
Re: [ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4 ) elist.h: 39: FAILED assert(!is_on_list())
Hi Zheng,

Many, many thanks for your help...

Your suggestion of setting large values for mds_cache_size and
mds_cache_memory_limit stopped our MDS crashing :)

The values in ceph.conf are now:

mds_cache_size = 8589934592
mds_cache_memory_limit = 17179869184

Should these values be left in our configuration?

again thanks for the assistance,

Jake
Re: [ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4 ) elist.h: 39: FAILED assert(!is_on_list())
On Sat, Feb 9, 2019 at 12:36 AM Jake Grimmett wrote:
>
> Dear All,
>
> Unfortunately the MDS has crashed on our Mimic cluster...
>
> First symptoms were rsync giving:
> "No space left on device (28)"
> when trying to rename or delete.
>
> This prompted me to try restarting the MDS, as it reported laggy.
>
> Restarting the MDS shows this error in the log before the crash:
>
> elist.h: 39: FAILED assert(!is_on_list())
>
> A full MDS log showing the crash is here:
>
> http://p.ip.fi/iWlz
>
> I've tried upgrading the cluster to 13.2.4, but the MDS still crashes...
>
> The cluster has 10 nodes and 254 OSDs, uses EC for the data pool and 3x
> replication for metadata. We have a single active MDS, with two standby MDSs.
>
> We have ~2PB of cephfs data here, all of which is currently
> inaccessible; any and all advice gratefully received :)
>

Add mds_cache_size and mds_cache_memory_limit to ceph.conf and set them to
very large values before starting the MDS. If the MDS does not crash,
restore mds_cache_size and mds_cache_memory_limit to their original values
(via the admin socket) about 10 seconds after the MDS becomes active.

If the MDS still crashes, try compiling ceph-mds with the following patch:

diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
index d3461fba2e..c2731e824c 100644
--- a/src/mds/CDir.cc
+++ b/src/mds/CDir.cc
@@ -508,6 +508,8 @@ void CDir::remove_dentry(CDentry *dn)
   // clean?
   if (dn->is_dirty())
     dn->mark_clean();
+  if (inode->is_stray())
+    dn->item_stray.remove_myself();
 
   if (dn->state_test(CDentry::STATE_BOTTOMLRU))
     cache->bottom_lru.lru_remove(dn);

> best regards,
>
> Jake
[ceph-users] MDS crash (Mimic 13.2.2 / 13.2.4 ) elist.h: 39: FAILED assert(!is_on_list())
Dear All,

Unfortunately the MDS has crashed on our Mimic cluster...

First symptoms were rsync giving:
"No space left on device (28)"
when trying to rename or delete.

This prompted me to try restarting the MDS, as it reported laggy.

Restarting the MDS shows this error in the log before the crash:

elist.h: 39: FAILED assert(!is_on_list())

A full MDS log showing the crash is here:

http://p.ip.fi/iWlz

I've tried upgrading the cluster to 13.2.4, but the MDS still crashes...

The cluster has 10 nodes and 254 OSDs, uses EC for the data pool and 3x
replication for metadata. We have a single active MDS, with two standby MDSs.

We have ~2PB of cephfs data here, all of which is currently inaccessible;
any and all advice gratefully received :)

best regards,

Jake