Re: [ceph-users] Removing MDS

2018-11-02 Thread Rhian Resnick
Morning our backup of the metadata is 75% done (rados cppool as the metadata 
export fails by using up all server memory). Before we start working on fixing 
our metadata we wanted our projected procedure to be reviewed.


Does the following sequence look correct for our environment?


  1.  rados cppool cephfs_metadata cephfs_metadata.bk
  2.  cephfs-journal-tool event recover_dentries summary --rank=0
  3.  cephfs-journal-tool event recover_dentries summary --rank=1
  4.  cephfs-journal-tool event recover_dentries summary --rank=2
  5.  cephfs-journal-tool event recover_dentries summary --rank=3
  6.  cephfs-journal-tool event recover_dentries summary --rank=4
  7.  cephfs-journal-tool journal reset --rank=0
  8.  cephfs-journal-tool journal reset --rank=1
  9.  cephfs-journal-tool journal reset --rank=2
  10. cephfs-journal-tool journal reset --rank=3
  11. cephfs-journal-tool journal reset --rank=4
  12. cephfs-table-tool all reset session
  13. Start metadata servers
  14. Scrub mds:
 *   ceph daemon mds.{hostname} scrub_path / recursive
 *   ceph daemon mds.{hostname} scrub_path / force
  15.





Rhian Resnick

Associate Director Research Computing

Enterprise Systems

Office of Information Technology


Florida Atlantic University

777 Glades Road, CM22, Rm 173B

Boca Raton, FL 33431

Phone 561.297.2647

Fax 561.297.0222

 [image] <https://hpc.fau.edu/wp-content/uploads/2015/01/image.jpg>



From: Rhian Resnick
Sent: Thursday, November 1, 2018 10:32 AM
To: Patrick Donnelly
Cc: Ceph Users
Subject: Re: [ceph-users] Removing MDS


Morning all,


This has been a rough couple days. We thought we had resolved all our 
performance issues by moving the ceph metadata to some high intensity write 
disks from Intel but what we didn't notice was that Ceph labeled them as HDD's 
(thanks dell raid controller).


We believe this caused read lock errors and resulted in the journal increasing 
from 700MB to 1 TB in 2 hours. (Basically over lunch) We tried to migrate and 
then stop everything before the OSD's reached full status but failed.


Over the last 12 hours the data has been migrated from the SDD's back to 
spinning disks but the MDS servers are now reporting that two ranks are damaged.


We are running a backup of the metadata pool but wanted to know what the list 
thinks the next steps should be. I have attached the error's we see in the logs 
as well as our OSD Tree, ceph.conf (comments removed), and ceph fs dump.


Combined logs (after marking things as repaired to see if that would rescue us):


Nov  1 10:07:02 ceph-p-mds2 ceph-mds: 2018-11-01 10:07:02.045499 7f68db7a3700 
-1 mds.4.purge_queue operator(): Error -108 loading Journaler
Nov  1 10:07:02 ceph-p-mds2 ceph-mds: 2018-11-01 10:07:02.045499 7f68db7a3700 
-1 mds.4.purge_queue operator(): Error -108 loading Journaler
Nov  1 10:26:40 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:40.968143 7fa3b57ce700 
-1 log_channel(cluster) log [ERR] : Health check update: 1 mds daemon damaged 
(MDS_DAMAGE)
Nov  1 10:26:40 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:40.968143 7fa3b57ce700 
-1 log_channel(cluster) log [ERR] : Health check update: 1 mds daemon damaged 
(MDS_DAMAGE)
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.914934 7f6dacd69700 
-1 mds.1.journaler.mdlog(ro) try_read_entry: decode error from _is_readable
Nov  1 10:26:47 ceph-storage2 ceph-mds: mds.1 10.141.255.202:6898/1492854021 1 
: Error loading MDS rank 1: (22) Invalid argument
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.914949 7f6dacd69700 
 0 mds.1.log _replay journaler got error -22, aborting
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.914934 7f6dacd69700 
-1 mds.1.journaler.mdlog(ro) try_read_entry: decode error from _is_readable
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.915745 7f6dacd69700 
-1 log_channel(cluster) log [ERR] : Error loading MDS rank 1: (22) Invalid 
argument
Nov  1 10:26:47 ceph-storage2 ceph-mds: 2018-11-01 10:26:47.915745 7f6dacd69700 
-1 log_channel(cluster) log [ERR] : Error loading MDS rank 1: (22) Invalid 
argument
Nov  1 10:26:47 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:47.999432 7fa3b57ce700 
-1 log_channel(cluster) log [ERR] : Health check update: 2 mds daemons damaged 
(MDS_DAMAGE)
Nov  1 10:26:47 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:47.999432 7fa3b57ce700 
-1 log_channel(cluster) log [ERR] : Health check update: 2 mds daemons damaged 
(MDS_DAMAGE)
Nov  1 10:26:55 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:55.026231 7fa3b57ce700 
-1 log_channel(cluster) log [ERR] : Health check update: 1 mds daemon damaged 
(MDS_DAMAGE)
Nov  1 10:26:55 ceph-p-mon2 ceph-mon: 2018-11-01 10:26:55.026231 7fa3b57ce700 
-1 log_channel(cluster) log [ERR] : Health check update: 1 mds daemon damaged 
(MDS_DAMAGE)

Ceph OSD Status: (The missing and oud osd's are in a different pool  from all 
data, these were the bad ssds that caused the issue)


  cluster:
id: 6a2e8f21-bca2-492

Re: [ceph-users] Removing MDS

2018-11-01 Thread Rhian Resnick
up  1.0 1.0
149   hdd   3.63199 osd.149up  1.0 1.0
150   hdd   3.63199 osd.150up  1.0 1.0
151   hdd   3.63199 osd.151up  1.0 1.0
152   hdd   3.63199 osd.152up  1.0 1.0
153   hdd   3.63199 osd.153up  1.0 1.0
154   hdd   3.63199 osd.154up  1.0 1.0
155   hdd   3.63199 osd.155up  1.0 1.0
156   hdd   3.63199 osd.156up  1.0 1.0
157   hdd   3.63199 osd.157up  1.0 1.0
158   hdd   3.63199 osd.158up  1.0 1.0
159   hdd   3.63199 osd.159up  1.0 1.0
161   hdd   3.63199 osd.161up  1.0 1.0
162   hdd   3.63199 osd.162up  1.0 1.0
164   hdd   3.63199 osd.164up  1.0 1.0
165   hdd   3.63199 osd.165up  1.0 1.0
167   hdd   3.63199 osd.167up  1.0 1.0
168   hdd   3.63199 osd.168up  1.0 1.0
169   hdd   3.63199 osd.169up  1.0 1.0
170   hdd   3.63199 osd.170up  1.0 1.0
171   hdd   3.63199 osd.171up  1.0 1.0
172   hdd   3.63199 osd.172up  1.0 1.0
173   hdd   3.63199 osd.173up  1.0 1.0
174   hdd   3.63869 osd.174up  1.0 1.0
177   hdd   3.63199 osd.177up  1.0 1.0



# Ceph configuration shared by all nodes


[global]
fsid = 6a2e8f21-bca2-492b-8869-eecc995216cc
public_network = 10.141.0.0/16
cluster_network = 10.85.8.0/22
mon_initial_members = ceph-p-mon1, ceph-p-mon2, ceph-p-mon3
mon_host = 10.141.161.248,10.141.160.250,10.141.167.237
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx


# Cephfs needs these to be set to support larger directories
mds_bal_frag = true
allow_dirfrags = true

rbd_default_format = 2
mds_beacon_grace = 60
mds session timeout = 120

log to syslog = true
err to syslog = true
clog to syslog = true


[mds]

[osd]
osd op threads = 32
osd max backfills = 32





# Old method of moving ssds to a pool

[osd.85]
host = ceph-storage1
crush_location =  root=ssds host=ceph-storage1-ssd

[osd.89]
host = ceph-storage1
crush_location =  root=ssds host=ceph-storage1-ssd

[osd.160]
host = ceph-storage3
crush_location =  root=ssds host=ceph-storage3-ssd

[osd.163]
host = ceph-storage3
crush_location =  root=ssds host=ceph-storage3-ssd

[osd.166]
host = ceph-storage3
crush_location =  root=ssds host=ceph-storage3-ssd

[osd.5]
host = ceph-storage2
crush_location =  root=ssds host=ceph-storage2-ssd

[osd.68]
host = ceph-storage2
crush_location =  root=ssds host=ceph-storage2-ssd

[osd.87]
host = ceph-storage2
crush_location =  root=ssds host=ceph-storage2-ssd






Rhian Resnick

Associate Director Research Computing

Enterprise Systems

Office of Information Technology


Florida Atlantic University

777 Glades Road, CM22, Rm 173B

Boca Raton, FL 33431

Phone 561.297.2647

Fax 561.297.0222

 [image] <https://hpc.fau.edu/wp-content/uploads/2015/01/image.jpg>



From: Patrick Donnelly 
Sent: Tuesday, October 30, 2018 8:40 PM
To: Rhian Resnick
Cc: Ceph Users
Subject: Re: [ceph-users] Removing MDS

On Tue, Oct 30, 2018 at 4:05 PM Rhian Resnick  wrote:
> We are running into issues deactivating mds ranks. Is there a way to safely 
> forcibly remove a rank?

No, there's no "safe" way to force the issue. The rank needs to come
back, flush its journal, and then complete its deactivation. To get
more help, you need to describe your environment, version of Ceph in
use, relevant log snippets, etc.

--
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Removing MDS

2018-10-30 Thread Rhian Resnick
That is what I though. I am increasing debug to see where we are getting stuck. 
I am not sure if it is an issue deactivating or a rdlock issue.


Thanks if we discover more we will post a question with details.


Rhian Resnick

Associate Director Research Computing

Enterprise Systems

Office of Information Technology


Florida Atlantic University

777 Glades Road, CM22, Rm 173B

Boca Raton, FL 33431

Phone 561.297.2647

Fax 561.297.0222

 [image] <https://hpc.fau.edu/wp-content/uploads/2015/01/image.jpg>



From: Patrick Donnelly 
Sent: Tuesday, October 30, 2018 8:40 PM
To: Rhian Resnick
Cc: Ceph Users
Subject: Re: [ceph-users] Removing MDS

On Tue, Oct 30, 2018 at 4:05 PM Rhian Resnick  wrote:
> We are running into issues deactivating mds ranks. Is there a way to safely 
> forcibly remove a rank?

No, there's no "safe" way to force the issue. The rank needs to come
back, flush its journal, and then complete its deactivation. To get
more help, you need to describe your environment, version of Ceph in
use, relevant log snippets, etc.

--
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Removing MDS

2018-10-30 Thread Patrick Donnelly
On Tue, Oct 30, 2018 at 4:05 PM Rhian Resnick  wrote:
> We are running into issues deactivating mds ranks. Is there a way to safely 
> forcibly remove a rank?

No, there's no "safe" way to force the issue. The rank needs to come
back, flush its journal, and then complete its deactivation. To get
more help, you need to describe your environment, version of Ceph in
use, relevant log snippets, etc.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Removing MDS

2018-10-30 Thread Rhian Resnick
Evening,


We are running into issues deactivating mds ranks. Is there a way to safely 
forcibly remove a rank?


Rhian Resnick

Associate Director Research Computing

Enterprise Systems

Office of Information Technology


Florida Atlantic University

777 Glades Road, CM22, Rm 173B

Boca Raton, FL 33431

Phone 561.297.2647

Fax 561.297.0222

 [image] 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Removing MDS

2014-09-12 Thread LaBarre, James (CTR) A6IT
We were building a test cluster here, and I enabled MDS in order to use 
ceph-fuse to fill the cluster with data.   It seems the metadata server is 
having problems, so I figured I'd just remove it and rebuild it.  However, the 
ceph-deploy mds destroy command is not implemented; it appears that once you 
have created an MDS, you can't get rid of it without demolishing your entire 
cluster and building from scratch.   And since the cluster is already out of 
whack, there seems to be no way to even drop OSDs to restart it cleanly.

Should I just reboot all the OSD nodes and the monitor node, and hope the 
cluster comes up in a usable fashion?  Because there seems no other option 
short of  the burn-down and rebuild.

--
CONFIDENTIALITY NOTICE: If you have received this email in error,
please immediately notify the sender by e-mail at the address shown. 
This email transmission may contain confidential information.  This
information is intended only for the use of the individual(s) or entity to
whom it is intended even if addressed incorrectly.  Please delete it from
your files if you are not the intended recipient.  Thank you for your
compliance.  Copyright (c) 2014 Cigna
==
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Removing MDS

2014-09-12 Thread Gregory Farnum
You can turn off the MDS and create a new FS in new pools. The ability
to shut down a filesystem more completely is coming in Giant.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Fri, Sep 12, 2014 at 1:16 PM, LaBarre, James  (CTR)  A6IT
james.laba...@cigna.com wrote:
 We were building a test cluster here, and I enabled MDS in order to use
 ceph-fuse to fill the cluster with data.   It seems the metadata server is
 having problems, so I figured I’d just remove it and rebuild it.  However,
 the “ceph-deploy mds destroy” command is not implemented; it appears that
 once you have created an MDS, you can’t get rid of it without demolishing
 your entire cluster and building from scratch.   And since the cluster is
 already out of whack, there seems to be no way to even drop OSDs to restart
 it cleanly.



 Should I just reboot all the OSD nodes and the monitor node, and hope the
 cluster comes up in a usable fashion?  Because there seems no other option
 short of  the burn-down and rebuild.



 --
 CONFIDENTIALITY NOTICE: If you have received this email in error,
 please immediately notify the sender by e-mail at the address shown.
 This email transmission may contain confidential information.  This
 information is intended only for the use of the individual(s) or entity to
 whom it is intended even if addressed incorrectly.  Please delete it from
 your files if you are not the intended recipient.  Thank you for your
 compliance.  Copyright (c) 2014 Cigna
 ==


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com