Re: [ceph-users] Safely Upgrading OS on a live Ceph Cluster

2017-03-02 Thread Heller, Chris
On Mar 1, 2017, at 9:39 AM, Peter Maloney wrote: > > On 03/01/17 15:36, Heller, Chris wrote: >> I see. My journal is specified in ceph.conf. I'm not removing it from the >> OSD, so it sounds like flushing isn't needed in my case. >> > Okay, but it seems it
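For reference, the flush being discussed is only needed when the journal device is actually being removed or replaced. A minimal sketch, with a placeholder OSD id and an init command that varies by distro:

    sudo service ceph stop osd.2        # stop the OSD before touching its journal
    sudo ceph-osd -i 2 --flush-journal  # write any pending journal entries into the object store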

Re: [ceph-users] Safely Upgrading OS on a live Ceph Cluster

2017-03-01 Thread Heller, Chris
I see. My journal is specified in ceph.conf. I'm not removing it from the OSD, so it sounds like flushing isn't needed in my case. -Chris > On Mar 1, 2017, at 9:31 AM, Peter Maloney wrote: > > On 03/01/17 14:41, Heller, Chris wrote: >> That is a good question, an

Re: [ceph-users] Safely Upgrading OS on a live Ceph Cluster

2017-03-01 Thread Heller, Chris
> > On 02/28/17 18:55, Heller, Chris wrote: >> Quick update. So I'm trying out the procedure as documented here. >> >> So far I've: >> >> 1. Stopped ceph-mds >> 2. set noout, norecover, norebalance, nobackfill >> 3. Stopped all ceph-osd
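The quoted procedure amounts to pausing data movement before the OSDs go down. A minimal sketch of the flag handling, assuming an admin keyring on the node:

    ceph osd set noout        # keep stopped OSDs from being marked out
    ceph osd set norecover    # pause recovery
    ceph osd set norebalance  # pause rebalancing
    ceph osd set nobackfill   # pause backfill
    # ... stop the daemons, upgrade the OS, start the daemons ...
    ceph osd unset nobackfill
    ceph osd unset norebalance
    ceph osd unset norecover
    ceph osd unset noout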

Re: [ceph-users] Antw: Safely Upgrading OS on a live Ceph Cluster

2017-03-01 Thread Heller, Chris
ts start up correctly we set the osd weights back to the normal value > so that the > cluster would rebalance. > > With this procedure the cluster was always up. > > Regards > > Steffen > > >>>> "Heller, Chris" wrote on Monday, 27 February 20
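Steffen's variant drains each host by weight rather than setting cluster-wide flags. A rough sketch, with a placeholder OSD id and weight (restore whatever the original CRUSH weight was):

    ceph osd crush reweight osd.3 0        # drain osd.3 ahead of the upgrade
    # ... upgrade the host, restart the OSD ...
    ceph osd crush reweight osd.3 1.81940  # put the original weight back so data returns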

Re: [ceph-users] Safely Upgrading OS on a live Ceph Cluster

2017-02-28 Thread Heller, Chris
ceph osd stat` shows that the 'norecover' flag is still set. I'm going to wait out the recovery and see if the Ceph FS is OK. That would be huge if it is. But I am curious why I lost an OSD, and why recovery is happening with 'norecover' still set. -Chris > O
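To see which flags are still in force and clear them once recovery is actually wanted, something like:

    ceph osd stat             # the set flags are listed at the end of this line
    ceph -s                   # overall health, plus any recovery/backfill activity
    ceph osd unset norecover  # clear the flag when recovery should be allowed to run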

[ceph-users] Safely Upgrading OS on a live Ceph Cluster

2017-02-27 Thread Heller, Chris
I am attempting an operating system upgrade of a live Ceph cluster. Before I go and screw up my production system, I have been testing on a smaller installation, and I keep running into issues when bringing the Ceph FS metadata server online. My approach here has been to store all Ceph critical

Re: [ceph-users] cephfs slow delete

2016-10-14 Thread Heller, Chris
Just a thought, but since a directory tree is a first class item in cephfs, could the wire protocol be extended with a “recursive delete” operation, specifically for cases like this? On 10/14/16, 4:16 PM, "Gregory Farnum" wrote: On Fri, Oct 14, 2016 at 1:11 PM, Heller, Ch

Re: [ceph-users] cephfs slow delete

2016-10-14 Thread Heller, Chris
Ok. Since I’m running through the Hadoop/ceph api, there is no syscall boundary so there is a simple place to improve the throughput here. Good to know, I’ll work on a patch… On 10/14/16, 3:58 PM, "Gregory Farnum" wrote: On Fri, Oct 14, 2016 at 11:41 AM, Heller, Ch

Re: [ceph-users] cephfs slow delete

2016-10-14 Thread Heller, Chris
? -Chris On 10/13/16, 4:22 PM, "Gregory Farnum" wrote: On Thu, Oct 13, 2016 at 12:44 PM, Heller, Chris wrote: > I have a directory I’ve been trying to remove from cephfs (via > cephfs-hadoop), the directory is a few hundred gigabytes in size and > contains a few

[ceph-users] cephfs slow delete

2016-10-13 Thread Heller, Chris
I have a directory I’ve been trying to remove from cephfs (via cephfs-hadoop); the directory is a few hundred gigabytes in size and contains a few million files, but not in a single subdirectory. I started the delete yesterday at around 6:30 EST, and it’s still progressing. I can see from (ceph
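CephFS unlinks are asynchronous: the MDS moves deleted files into its stray directories and purges the underlying objects in the background, so one rough way to watch progress is the object count of the data pool. A sketch (the perf-counter names vary by release, so treat them as an assumption):

    rados df                          # per-pool object counts; the data pool should shrink as purging proceeds
    ceph daemon mds.<name> perf dump  # look for stray/purge counters under mds_cache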

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
": false, "inst": "client.585194220 192.168.1.157:0\/634334964", "client_metadata": { "ceph_sha1": "d56bdf93ced6b80b07397d57e3fa68fe68304432", "ceph_version": "ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3f

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
I also went and bumped mds_cache_size up to 1 million… still seeing cache pressure, but I might just need to evict those clients… On 9/21/16, 9:24 PM, "Heller, Chris" wrote: What is the interesting value in ‘session ls’? Is it ‘num_leases’ or ‘num_caps’ leases appears to be, on
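A sketch of the runtime bump and the eviction being considered; the admin-socket "session evict" command may not exist on 0.94, so treat that line as an assumption, and also set the option in ceph.conf so it survives a restart:

    ceph tell mds.0 injectargs '--mds_cache_size 1000000'  # raise the inode cache limit at runtime
    ceph daemon mds.0 session ls                           # find the client id holding the caps
    ceph daemon mds.0 session evict <client-id>            # drop that client; it will have to reconnect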

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
What is the interesting value in ‘session ls’? Is it ‘num_leases’ or ‘num_caps’? Leases appear to be, on average, 1, but caps seems to be 16385 for many, many clients! -Chris On 9/21/16, 9:22 PM, "Gregory Farnum" wrote: On Wed, Sep 21, 2016 at 6:13 PM, Heller, Chris wrote:
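One way to pull those two fields out of the session list, assuming jq is available and the MDS admin socket is reachable on the node:

    ceph daemon mds.<name> session ls > sessions.json
    jq '.[] | {inst: .inst, caps: .num_caps, leases: .num_leases}' sessions.json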

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
the ceph tools. I’ll try upping “mds cache size”; are there any other configuration settings I might adjust to perhaps ease the problem while I track it down in the HDFS tools layer? -Chris On 9/21/16, 4:34 PM, "Gregory Farnum" wrote: On Wed, Sep 21, 2016 at 1:16 PM, Heller, Ch

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
ce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.491885178 2016-09-21 20:15:58.159134 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.491885188 -Chris On 9/21/16, 11:23 AM, "Heller, Chris"

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
9baa8780 -1 mds.-1.0 log_to_monitors {default=true} 2016-09-21 15:13:27.329181 7f68969e9700 1 mds.-1.0 handle_mds_map standby 2016-09-21 15:13:28.484148 7f68969e9700 1 mds.-1.0 handle_mds_map standby 2016-09-21 15:13:33.280376 7f68969e9700 1 mds.-1.0 handle_mds_map standby On 9/21/16, 10:

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
the summary). -Chris On 9/21/16, 10:46 AM, "Gregory Farnum" wrote: On Wed, Sep 21, 2016 at 6:30 AM, Heller, Chris wrote: > I’m running a production 0.94.7 Ceph cluster, and have been seeing a > periodic issue arise where in all my MDS clients will become stuck, an

[ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
I’m running a production 0.94.7 Ceph cluster, and have been seeing a periodic issue arise wherein all my MDS clients will become stuck, and the fix so far has been to restart the active MDS (sometimes I need to restart the subsequent active MDS as well). These clients are using the cephfs-hado
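For the restart workaround, finding the active MDS first avoids bouncing a standby for nothing. A sketch (the service name here is the hammer-era sysvinit/upstart form; systemd hosts use ceph-mds@<id> instead):

    ceph mds stat                       # shows which daemon holds the active rank
    sudo service ceph restart mds.<id>  # restart it; a standby should take over the rank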

Re: [ceph-users] How to associate a cephfs client id to its process

2016-09-14 Thread Heller, Chris
Ok. I’ll see about tracking down the logs (set to stderr for these tasks), and the metadata stuff looks interesting for future association. Thanks, Chris On 9/14/16, 5:04 PM, "Gregory Farnum" wrote: On Wed, Sep 14, 2016 at 7:02 AM, Heller, Chris wrote: > I am making

[ceph-users] How to associate a cephfs client id to its process

2016-09-14 Thread Heller, Chris
I am making use of CephFS plus the cephfs-hadoop shim to replace HDFS in a system I’ve been experimenting with. I’ve noticed that a large number of my HDFS clients have a ‘num_caps’ value of 16385, as seen when running ‘session ls’ on the active mds. This appears to be one larger than the defau
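The session list is the starting point for tying a client id back to a machine, though on 0.94 the client_metadata may only carry ceph_sha1/ceph_version, so mapping to a specific process usually still means checking the client-side log (or stderr) for the client id it was assigned. A sketch, assuming jq:

    ceph daemon mds.<name> session ls | jq '.[] | {id: .id, inst: .inst, meta: .client_metadata}'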

[ceph-users] Ceph auth key generation algorithm documentation

2016-08-23 Thread Heller, Chris
I’d like to generate keys for Ceph outside of any system that has ceph-authtool. Looking over the Ceph website and googling have turned up nothing. Is the Ceph auth key generation algorithm documented anywhere? -Chris
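The secret in a keyring is a base64-encoded CryptoKey blob, so it can be produced without ceph-authtool. A minimal sketch, assuming the layout is a little-endian u16 type (1 = AES), an 8-byte creation time, a little-endian u16 secret length, and then 16 random bytes; verify against ceph-authtool output for your release before trusting it:

    ( printf '\x01\x00'                          # type: 1 = AES
      printf '\x00\x00\x00\x00\x00\x00\x00\x00'  # created: sec + nsec (ceph-authtool fills in the current time)
      printf '\x10\x00'                          # secret length: 16
      head -c 16 /dev/urandom                    # the secret itself
    ) | base64

The result has the familiar 40-character "AQ..." shape and can be dropped into a keyring file.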

Re: [ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up

2016-08-16 Thread Heller, Chris
marked as ‘found’ once it returns to the network? -Chris From: Goncalo Borges Date: Monday, August 15, 2016 at 11:36 PM To: "Heller, Chris" , "ceph-users@lists.ceph.com" Subject: Re: [ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up Hi Chris..

Re: [ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up

2016-08-15 Thread Heller, Chris
, August 15, 2016 at 9:03 PM To: "ceph-users@lists.ceph.com" , "Heller, Chris" Subject: Re: [ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up Hi Heller... Can you actually post the result of ceph pg dump_stuck? Cheers G. On 08/15/2016

[ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up

2016-08-15 Thread Heller, Chris
I’d like to better understand the current state of my Ceph cluster. I currently have 2 PGs that are in the ‘stuck unclean’ state: # ceph health detail HEALTH_WARN 2 pgs down; 2 pgs peering; 2 pgs stuck inactive; 2 pgs stuck unclean pg 4.2a8 is stuck inactive for 124516.91, current state down+p
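A sketch of the follow-up commands that usually narrow this down (pg 4.2a8 is taken from the health output above):

    ceph pg dump_stuck inactive unclean  # list the stuck PGs and their acting sets
    ceph pg 4.2a8 query                  # per-PG detail, including what peering is blocked on
    ceph pg map 4.2a8                    # which OSDs the PG currently maps to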