Re: [ceph-users] Safely Upgrading OS on a live Ceph Cluster

2017-03-02 Thread Heller, Chris
> On Mar 1, 2017, at 9:39 AM, Peter Maloney <peter.malo...@brockmann-consult.de> wrote: > > On 03/01/17 15:36, Heller, Chris wrote: >> I see. My journal is specified in ceph.conf. I'm not removing it from the OSD so sounds like flushing isn't needed in my case

Re: [ceph-users] Safely Upgrading OS on a live Ceph Cluster

2017-03-01 Thread Heller, Chris
I see. My journal is specified in ceph.conf. I'm not removing it from the OSD so sounds like flushing isn't needed in my case. -Chris > On Mar 1, 2017, at 9:31 AM, Peter Maloney <peter.malo...@brockmann-consult.de> wrote: > > On 03/01/17 14:41, Heller, Chris wrote: >>

Re: [ceph-users] Safely Upgrading OS on a live Ceph Cluster

2017-03-01 Thread Heller, Chris
eter.malo...@brockmann-consult.de> wrote: > > On 02/28/17 18:55, Heller, Chris wrote: >> Quick update. So I'm trying out the procedure as documented here. >> >> So far I've: >> >> 1. Stopped ceph-mds >> 2. set noout, norecover, norebalance, nobackfill
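
A minimal sketch of scripting step 2 (and the matching cleanup) around the upgrade window, assuming only that the 'ceph' CLI is on PATH with admin credentials; the flag names come straight from the message above:

    #!/usr/bin/env python
    # Sketch: set the maintenance flags before the OS upgrade, clear them after.
    import subprocess

    FLAGS = ["noout", "norecover", "norebalance", "nobackfill"]

    def toggle_flags(action):
        # action is "set" or "unset"
        for flag in FLAGS:
            subprocess.check_call(["ceph", "osd", action, flag])

    toggle_flags("set")    # before stopping ceph-mds / the OSDs on the node
    # ... perform the OS upgrade and bring the daemons back up ...
    toggle_flags("unset")  # once the node's OSDs have rejoined the cluster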

Re: [ceph-users] Antw: Safely Upgrading OS on a live Ceph Cluster

2017-03-01 Thread Heller, Chris
o that all components start up correctly we set the osd weights to the normal value so that the cluster was rebalancing. > > With this procedure the cluster was always up. > > Regards > > Steffen > > >>>> "Heller, Chris" <chel...@akamai

Re: [ceph-users] Safely Upgrading OS on a live Ceph Cluster

2017-02-28 Thread Heller, Chris
'norecover' flag is still set. I'm going to wait out the recovery and see if the Ceph FS is OK. That would be huge if it is. But I am curious why I lost an OSD, and why recovery is happening with 'norecover' still set. -Chris > On Feb 28, 2017, at 4:51 AM, Peter Maloney <peter.ma
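
To confirm which flags the osdmap actually carries (and what the PGs are doing) while waiting out the recovery, a small sketch; the JSON field names ('flags', 'pgmap', 'pgs_by_state') are assumptions about the hammer-era output, so verify against your own cluster:

    import json
    import subprocess

    def ceph_json(*args):
        # Run a ceph CLI command and parse its JSON output.
        out = subprocess.check_output(["ceph"] + list(args) + ["--format", "json"])
        return json.loads(out)

    osd_dump = ceph_json("osd", "dump")
    print("osdmap flags:", osd_dump.get("flags", ""))

    status = ceph_json("status")
    for state in status["pgmap"].get("pgs_by_state", []):
        print(state["state_name"], state["count"])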

[ceph-users] Safely Upgrading OS on a live Ceph Cluster

2017-02-27 Thread Heller, Chris
I am attempting an operating system upgrade of a live Ceph cluster. Before I go and screw up my production system, I have been testing on a smaller installation, and I keep running into issues when bringing the Ceph FS metadata server online. My approach here has been to store all Ceph critical

Re: [ceph-users] cephfs slow delete

2016-10-14 Thread Heller, Chris
Just a thought, but since a directory tree is a first-class item in cephfs, could the wire protocol be extended with a “recursive delete” operation, specifically for cases like this? On 10/14/16, 4:16 PM, "Gregory Farnum" <gfar...@redhat.com> wrote: On Fri, Oct 14, 2016 at

Re: [ceph-users] cephfs slow delete

2016-10-14 Thread Heller, Chris
Ok. Since I’m running through the Hadoop/ceph api, there is no syscall boundary so there is a simple place to improve the throughput here. Good to know, I’ll work on a patch… On 10/14/16, 3:58 PM, "Gregory Farnum" <gfar...@redhat.com> wrote: On Fri, Oct 14, 2016 at 11:41 A

Re: [ceph-users] cephfs slow delete

2016-10-14 Thread Heller, Chris
? -Chris On 10/13/16, 4:22 PM, "Gregory Farnum" <gfar...@redhat.com> wrote: On Thu, Oct 13, 2016 at 12:44 PM, Heller, Chris <chel...@akamai.com> wrote: > I have a directory I’ve been trying to remove from cephfs (via cephfs-hadoop), the directory is a f

[ceph-users] cephfs slow delete

2016-10-13 Thread Heller, Chris
I have a directory I’ve been trying to remove from cephfs (via cephfs-hadoop); the directory is a few hundred gigabytes in size and contains a few million files, but not in a single subdirectory. I started the delete yesterday at around 6:30 EST, and it’s still progressing. I can see from (ceph
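
One coarse way to confirm a delete of this size is still making progress is to poll the object count of the CephFS data pool and watch it fall. A sketch, assuming a data pool named 'cephfs_data' (substitute the real pool name) and the 'ceph df' JSON layout:

    import json
    import subprocess
    import time

    POOL = "cephfs_data"  # hypothetical name; use the actual CephFS data pool

    def pool_objects(pool):
        df = json.loads(subprocess.check_output(["ceph", "df", "--format", "json"]))
        for p in df["pools"]:
            if p["name"] == pool:
                return p["stats"]["objects"]
        raise KeyError(pool)

    prev = pool_objects(POOL)
    while True:
        time.sleep(60)
        cur = pool_objects(POOL)
        print("objects: %d (delta %+d)" % (cur, cur - prev))
        prev = cur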

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
": false, "inst": "client.585194220 192.168.1.157:0\/634334964", "client_metadata": { "ceph_sha1": "d56bdf93ced6b80b07397d57e3fa68fe68304432", "ceph_version": "ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3f

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
I also went and bumped mds_cache_size up to 1 million… still seeing cache pressure, but I might just need to evict those clients… On 9/21/16, 9:24 PM, "Heller, Chris" <chel...@akamai.com> wrote: What is the interesting value in ‘session ls’? Is it ‘num_leases’ or ‘num_caps
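
For reference, the runtime bump described here can be done without restarting the daemon; a sketch, with 'mds.0' as a placeholder for the active MDS name (the new value still needs to go into ceph.conf to survive a restart):

    import subprocess

    # Sketch: raise the MDS inode cache limit at runtime, then persist the same
    # value under [mds] in ceph.conf (mds cache size = 1000000) separately.
    subprocess.check_call(
        ["ceph", "tell", "mds.0", "injectargs", "--mds_cache_size 1000000"])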

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
What is the interesting value in ‘session ls’? Is it ‘num_leases’ or ‘num_caps’? Leases appear to be, on average, 1, but caps seem to be 16385 for many, many clients! -Chris On 9/21/16, 9:22 PM, "Gregory Farnum" <gfar...@redhat.com> wrote: On Wed, Sep 21, 2016 at 6:13 P
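
'num_caps' is the number of capabilities the MDS has issued to that session, which is what the cache-pressure warnings are about. A quick sketch for ranking sessions by it, run on the MDS host via the admin socket ('mds.0' is a placeholder daemon name; the field names are taken from the session ls output quoted in this thread):

    import json
    import subprocess

    # Sketch: list CephFS client sessions sorted by how many caps they hold.
    out = subprocess.check_output(["ceph", "daemon", "mds.0", "session", "ls"])
    sessions = json.loads(out)

    for s in sorted(sessions, key=lambda s: s.get("num_caps", 0), reverse=True)[:20]:
        print(s.get("id"), s.get("num_caps"), s.get("num_leases"), s.get("inst"))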

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
2016 at 1:16 PM, Heller, Chris <chel...@akamai.com> wrote: > Ok. I just ran into this issue again. The mds rolled after many clients were failing to relieve cache pressure. That definitely could have had something to do with it, if say they overloaded the MDS so much it got

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
ce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.491885178 2016-09-21 20:15:58.159134 7fce3b626700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 364) v1 from client.491885188 -Chris On 9/21/16, 11:23 AM, "Heller, Ch

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
1 mds.-1.0 log_to_monitors {default=true} 2016-09-21 15:13:27.329181 7f68969e9700 1 mds.-1.0 handle_mds_map standby 2016-09-21 15:13:28.484148 7f68969e9700 1 mds.-1.0 handle_mds_map standby 2016-09-21 15:13:33.280376 7f68969e9700 1 mds.-1.0 handle_mds_map standby On 9/21/16, 10:48 AM, "He

Re: [ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
the summary). -Chris On 9/21/16, 10:46 AM, "Gregory Farnum" <gfar...@redhat.com> wrote: On Wed, Sep 21, 2016 at 6:30 AM, Heller, Chris <chel...@akamai.com> wrote: > I’m running a production 0.94.7 Ceph cluster, and have been seeing a > periodic issue arise

[ceph-users] Faulting MDS clients, HEALTH_OK

2016-09-21 Thread Heller, Chris
I’m running a production 0.94.7 Ceph cluster, and have been seeing a periodic issue arise wherein all my MDS clients will become stuck, and the fix so far has been to restart the active MDS (sometimes I need to restart the subsequent active MDS as well). These clients are using the
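
Before restarting the active MDS in this situation, it can be worth capturing the MDS map and any requests stuck in flight; a sketch, with 'mds.0' as a placeholder daemon name and the JSON field names assumed from hammer-era output:

    import json
    import subprocess

    # Sketch: record MDS state and in-flight ops before bouncing the daemon.
    print(subprocess.check_output(["ceph", "mds", "stat"]).decode().strip())

    # On the active MDS host, via the admin socket:
    ops = json.loads(subprocess.check_output(
        ["ceph", "daemon", "mds.0", "dump_ops_in_flight"]))
    print("ops in flight:", ops.get("num_ops", 0))
    for op in ops.get("ops", [])[:10]:
        print(op.get("description"), op.get("age"))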

Re: [ceph-users] How to associate a cephfs client id to its process

2016-09-14 Thread Heller, Chris
Ok. I’ll see about tracking down the logs (set to stderr for these tasks), and the metadata stuff looks interesting for future association. Thanks, Chris On 9/14/16, 5:04 PM, "Gregory Farnum" <gfar...@redhat.com> wrote: On Wed, Sep 14, 2016 at 7:02 AM, Heller, Chris <
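
The metadata association can be done by turning the 'client_metadata' block of 'session ls' into an id-to-host lookup; a sketch ('mds.0' is a placeholder, and fields such as 'hostname' and 'entity_id' are only present if the client version reports them):

    import json
    import subprocess

    # Sketch: map each client session id to whatever metadata it reported.
    sessions = json.loads(subprocess.check_output(
        ["ceph", "daemon", "mds.0", "session", "ls"]))

    for s in sessions:
        meta = s.get("client_metadata", {})
        print(s["id"], meta.get("hostname", "?"), meta.get("entity_id", "?"))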

[ceph-users] How to associate a cephfs client id to its process

2016-09-14 Thread Heller, Chris
I am making use of CephFS plus the cephfs-hadoop shim to replace HDFS in a system I’ve been experimenting with. I’ve noticed that a large number of my HDFS clients have a ‘num_caps’ value of 16385, as seen when running ‘session ls’ on the active mds. This appears to be one larger than the

[ceph-users] Ceph auth key generation algorithm documentation

2016-08-23 Thread Heller, Chris
I’d like to generate keys for Ceph outside of any system that has ceph-authtool. Looking over the Ceph website and googling has turned up nothing. Is the Ceph auth key generation algorithm documented anywhere? -Chris
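
For what it's worth, the secret ceph-authtool prints appears to be nothing more than base64 over a small binary header plus 16 random bytes. A sketch of generating a compatible key outside Ceph, assuming the layout is a little-endian u16 key type (1 = AES), u32 created.sec, u32 created.nsec, u16 secret length, then the secret itself; please verify against the Ceph source (src/auth/) before relying on it:

    import base64
    import os
    import struct
    import time

    # Sketch: build a CephX-style secret without ceph-authtool.
    # Assumed layout (verify against the Ceph source): u16 key type (1 = AES),
    # u32 created.sec, u32 created.nsec, u16 secret length, then the secret.
    secret = os.urandom(16)
    header = struct.pack("<HIIH", 1, int(time.time()), 0, len(secret))
    print(base64.b64encode(header + secret).decode())

If the assumed layout is right, the output is a 40-character base64 string beginning with 'AQ', which matches the shape of what ceph-authtool emits.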

Re: [ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up

2016-08-16 Thread Heller, Chris
be marked as ‘found’ once it returns to the network? -Chris From: Goncalo Borges <goncalo.bor...@sydney.edu.au> Date: Monday, August 15, 2016 at 11:36 PM To: "Heller, Chris" <chel...@akamai.com>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com> Subject: Re

Re: [ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up

2016-08-15 Thread Heller, Chris
bor...@sydney.edu.au> Date: Monday, August 15, 2016 at 9:03 PM To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>, "Heller, Chris" <chel...@akamai.com> Subject: Re: [ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up Hi Helle

[ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up

2016-08-15 Thread Heller, Chris
I’d like to better understand the current state of my Ceph cluster. I currently have 2 PGs that are in the ‘stuck unclean’ state: # ceph health detail HEALTH_WARN 2 pgs down; 2 pgs peering; 2 pgs stuck inactive; 2 pgs stuck unclean pg 4.2a8 is stuck inactive for 124516.91, current state
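
The next level of detail for PGs in that state usually comes from the stuck-PG dump plus a per-PG query; a sketch (the PG id 4.2a8 is taken from the health output above, and the 'recovery_state' field names are assumptions about the query output):

    import json
    import subprocess

    # Sketch: list stuck PGs, then query one for its peering history.
    print(subprocess.check_output(["ceph", "pg", "dump_stuck", "unclean"]).decode())

    pg = json.loads(subprocess.check_output(["ceph", "pg", "4.2a8", "query"]))
    print("state:", pg.get("state"))
    print("up:", pg.get("up"), "acting:", pg.get("acting"))
    for ev in pg.get("recovery_state", [])[:3]:
        print(ev.get("name"), ev.get("enter_time"))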