[ceph-users] PG:: recovery optimazation: recovery what is really modified by mslovy · Pull Request #3837 · ceph/ceph · GitHub
yaoning, haomai, Json, what about the "recovery what is really modified" feature? I haven't seen any updates on GitHub recently; will it be developed further? https://github.com/ceph/ceph/pull/3837 (PG:: recovery optimazation: recovery what is really modified) Thanks a lot. donglifec...@gmail.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] XFS attempt to access beyond end of device
An update on this. The "attempt to access beyond end of device" messages are caused by a kernel bug which is rectified by the following patches. - 59d43914ed7b9625 (vfs: make guard_bh_eod() more generic) - 4db96b71e3caea (vfs: guard end of device for mpage interface) An upgraded Red Hat kernel including these patches is pending. There was also discussion of the following upstream tracker http://tracker.ceph.com/issues/14842 however that has been eliminated as being in play for any of the devices analysed whilst investigating this issue since these partitions are correctly aligned. On Sun, Jul 23, 2017 at 10:49 AM, Brad Hubbard wrote: > Blair, > > I should clarify that I am *now* aware of your support case =D > > For anyone willing to run a SystemTap probe, the following should give us more > information about the problem. > > stap --all-modules -e 'probe kernel.function("handle_bad_sector"){ > printf("handle_bad_sector(): ARGS is %s\n",$$parms$$); print_backtrace()}' > > In order to run this you will need to install some non-trivial packages such > as > the kernel debuginfo package and kernel-devel. This is generally best > accomplished as follows, at least on rpm-based systems. > > (yum|dnf) install systemtap > stap-prep > > The SystemTap probe needs to be running when the error is generated, as it monitors > calls to "handle_bad_sector", which is the function generating the error > message. > Once that function is called the probe will dump all information about the bio > structure passed as a parameter to "handle_bad_sector", as well as dumping the > call stack. This would give us a good idea of the specific code involved. > > > On Sat, Jul 22, 2017 at 9:45 AM, Brad Hubbard wrote: >> On Sat, Jul 22, 2017 at 9:38 AM, Blair Bethwaite >> wrote: >>> Hi Brad, >>> >>> On 22 July 2017 at 09:04, Brad Hubbard wrote: Could you share what kernel/distro you are running and also please test whether the error message can be triggered by running the "blkid" command? 
>>> >>> I'm seeing it on RHEL7.3 (3.10.0-514.2.2.el7.x86_64). See Red Hat >>> support case #01891011 for sosreport etc. >> >> Thanks Blair, >> >> I'm aware of your case and the Bugzilla created from it and we are >> investigating further. >> >>> >>> No, blkid does not seem to trigger it. So far I haven't figured out >>> what does. It seems to be showing up roughly once for each disk every >> >> Thanks, that appears to exclude any link to an existing Bugzilla that >> was suggested as being related. >> >>> 1-2 weeks, and there is a clear time correlation across the hosts >>> experiencing it. >>> >>> -- >>> Cheers, >>> ~Blairo >> >> >> >> -- >> Cheers, >> Brad > > > > -- > Cheers, > Brad -- Cheers, Brad
Re: [ceph-users] Networking/naming doubt
The only things that are supposed to use the cluster network are the OSDs. Not even the MONs access the cluster network. I am sure that if you have a need to make this work you can find a way, but I don't know that one exists in the standard tool set. You might try temporarily setting the /etc/hosts reference for vdicnode02 and vdicnode03 to the cluster network and use the proper host names in the ceph-deploy command. Ceph cluster operations do not use DNS at all, so you could probably leave your /etc/hosts in this state. I don't know if it would work though. It's really not intended for any communication to happen on this subnet other than inter-OSD traffic. On Thu, Jul 27, 2017 at 6:31 PM Oscar Segarra wrote: > Sorry! I'd like to add that I want to use the cluster network for both > purposes: > > ceph-deploy --username vdicceph new vdicnode01 --cluster-network > 192.168.100.0/24 --public-network 192.168.100.0/24 > > Thanks a lot > > > 2017-07-28 0:29 GMT+02:00 Oscar Segarra : > >> Hi, >> >> Do you mean that for security reasons ceph-deploy can only be executed >> from the public interface? >> >> It looks strange that one cannot decide which network to use for ceph-deploy... >> I could have a dedicated network for ceph-deploy... :S >> >> Thanks a lot >> >> 2017-07-28 0:03 GMT+02:00 Roger Brown : >> >>> I could be wrong, but I think you cannot achieve this objective. If you >>> declare a cluster network, OSDs will route heartbeat, object replication >>> and recovery traffic over the cluster network. We prefer that the cluster >>> network is NOT reachable from the public network or the Internet for added >>> security. Therefore it will not work with ceph-deploy actions. 
>>> Source: >>> http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/ >>> >>> >>> On Thu, Jul 27, 2017 at 3:53 PM Oscar Segarra >>> wrote: Hi, In my environment I have 3 hosts, every host has 2 network interfaces: public: 192.168.2.0/24 cluster: 192.168.100.0/24 The hostname "vdicnode01", "vdicnode02" and "vdicnode03" are resolved by public DNS through the public interface, that means the "ping vdicnode01" will resolve 192.168.2.1. In my environment the "admin" node is the first node vdicnode01 and I'd like all the deployment "ceph-deploy" and all osd traffic to go from the cluster network. 1) To begin with, I create the cluster and I want all traffic to go from the cluster network: ceph-deploy --username vdicceph new vdicnode01 --cluster-network 192.168.100.0/24 --public-network 192.168.100.0/24 2) The problem comes when I have to launch my commands to the other hosts for example, from node vdicnode01 I execute: 2.1) ceph-deploy --username vdicceph osd create vdicnode02:sdb --> Finishes Ok but communication goes through the public interface 2.2) ceph-deploy --username vdicceph osd create vdicnode02.local:sdb --> vdicnode02.local is added manually in /etc/hosts (assigned a cluster IP) --> It raises some errors/warnings because vdicnode02.local is not the real hostname. Some files are created with vdicnode02.local in the middle of the name of the file and some errors appear when starting up the osd service related to "file does not exist" 2.3) ceph-deploy --username vdicceph osd create vdicnode02-priv:sdb --> vdicnode02-priv is added manually in /etc/hosts (assigned a cluster IP) --> It raises some errors/warnings because vdicnode02-priv is not the real hostname. Some files are created with vdicnode02-priv in the middle of the name of the file and some errors appear when starting up the osd service related to "file does not exist" What would be the right way to achieve my objective? 
If there is any documentation I have not found, please redirect me... Thanks a lot for your help in advance.
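As a concrete illustration of Roger's answer above: the conventional layout keeps the two subnets separate, so that ceph-deploy, clients and monitors use the public network while only inter-OSD replication, recovery and heartbeat traffic uses the cluster network. A sketch of the corresponding ceph.conf fragment, using the subnets from this thread (illustrative only, adapt to your environment):

```ini
[global]
# Client, monitor and ceph-deploy traffic
public network = 192.168.2.0/24
# Inter-OSD replication, recovery and heartbeat traffic only
cluster network = 192.168.100.0/24
```

With this split, hostnames resolving to the public subnet are exactly what ceph-deploy expects, and no /etc/hosts tricks are needed.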
Re: [ceph-users] High iowait on OSD node
My first suspicion would be the HBA. Are you using a RAID HBA? If so, I suggest checking the status of your BBU/FBWC and cache policy.
Re: [ceph-users] Networking/naming doubt
Sorry! I'd like to add that I want to use the cluster network for both purposes: ceph-deploy --username vdicceph new vdicnode01 --cluster-network 192.168.100.0/24 --public-network 192.168.100.0/24 Thanks a lot 2017-07-28 0:29 GMT+02:00 Oscar Segarra: > Hi, > > Do you mean that for security reasons ceph-deploy can only be executed > from the public interface? > > It looks strange that one cannot decide which network to use for ceph-deploy... > I could have a dedicated network for ceph-deploy... :S > > Thanks a lot > > 2017-07-28 0:03 GMT+02:00 Roger Brown : > >> I could be wrong, but I think you cannot achieve this objective. If you >> declare a cluster network, OSDs will route heartbeat, object replication >> and recovery traffic over the cluster network. We prefer that the cluster >> network is NOT reachable from the public network or the Internet for added >> security. Therefore it will not work with ceph-deploy actions. >> Source: http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/ >> >> >> On Thu, Jul 27, 2017 at 3:53 PM Oscar Segarra >> wrote: >> >>> Hi, >>> >>> In my environment I have 3 hosts, every host has 2 network interfaces: >>> >>> public: 192.168.2.0/24 >>> cluster: 192.168.100.0/24 >>> >>> The hostname "vdicnode01", "vdicnode02" and "vdicnode03" are resolved by >>> public DNS through the public interface, that means the "ping vdicnode01" >>> will resolve 192.168.2.1. >>> >>> In my environment the "admin" node is the first node vdicnode01 and I'd >>> like all the deployment "ceph-deploy" and all osd traffic to go from the >>> cluster network. 
>>> >>> 1) To begin with, I create the cluster and I want all traffic to go from >>> the cluster network: >>> ceph-deploy --username vdicceph new vdicnode01 --cluster-network >>> 192.168.100.0/24 --public-network 192.168.100.0/24 >>> >>> 2) The problem comes when I have to launch my commands to the other >>> hosts for example, from node vdicnode01 I execute: >>> >>> 2.1) ceph-deploy --username vdicceph osd create vdicnode02:sdb >>> --> Finishes Ok but communication goes through the public interface >>> >>> 2.2) ceph-deploy --username vdicceph osd create vdicnode02.local:sdb >>> --> vdicnode02.local is added manually in /etc/hosts (assigned a cluster >>> IP) >>> --> It raises some errors/warnings because vdicnode02.local is not the >>> real hostname. Some files are created with vdicnode02.local in the middle >>> of the name of the file and some errors appear when starting up the osd >>> service related to "file does not exist" >>> >>> 2.3) ceph-deploy --username vdicceph osd create vdicnode02-priv:sdb >>> --> vdicnode02-priv is added manually in /etc/hosts (assigned a cluster >>> IP) >>> --> It raises some errors/warnings because vdicnode02-priv is not the real >>> hostname. Some files are created with vdicnode02-priv in the middle of the >>> name of the file and some errors appear when starting up the osd service >>> related to "file does not exist" >>> >>> What would be the right way to achieve my objective? >>> >>> If there is any documentation I have not found, please redirect me... >>> >>> Thanks a lot for your help in advance.
Re: [ceph-users] how to troubleshoot "heartbeat_check: no reply" in OSD log
On Fri, Jul 28, 2017 at 6:06 AM, Jared Watts wrote: > I’ve got a cluster where a bunch of OSDs are down/out (only 6/21 are up/in). > ceph status and ceph osd tree output can be found at: > > https://gist.github.com/jbw976/24895f5c35ef0557421124f4b26f6a12 > > > > In osd.4 log, I see many of these: > > 2017-07-27 19:38:53.468852 7f3855c1c700 -1 osd.4 120 heartbeat_check: no > reply from 10.32.0.3:6807 osd.15 ever on either front or back, first ping > sent 2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850) > > 2017-07-27 19:38:53.468881 7f3855c1c700 -1 osd.4 120 heartbeat_check: no > reply from 10.32.0.3:6811 osd.16 ever on either front or back, first ping > sent 2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850) > > > > From osd.4, those endpoints look reachable: > > / # nc -vz 10.32.0.3 6807 > > 10.32.0.3 (10.32.0.3:6807) open > > / # nc -vz 10.32.0.3 6811 > > 10.32.0.3 (10.32.0.3:6811) open > > > > What else can I look at to determine why most of the OSDs cannot > communicate? http://tracker.ceph.com/issues/16092 indicates this behavior > is a networking or hardware issue, what else can I check there? I can turn > on extra logging as needed. Thanks! Do a packet capture on both machines at the same time and verify the packets are arriving as expected. -- Cheers, Brad
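Complementing the packet-capture suggestion: the heartbeat_check lines themselves can be summarized quickly to see which peers an OSD cannot reach. A minimal sketch (a hypothetical helper, not part of Ceph) that extracts the unreachable peer OSDs from log lines like the ones above:

```python
import re

# Matches Ceph OSD log lines of the form:
#   ... heartbeat_check: no reply from 10.32.0.3:6807 osd.15 ever ...
HEARTBEAT_RE = re.compile(
    r"heartbeat_check: no reply from (?P<addr>[\d.]+:\d+) (?P<peer>osd\.\d+)"
)

def unreachable_peers(log_lines):
    """Return sorted (peer, addr) pairs that never answered a heartbeat."""
    seen = set()
    for line in log_lines:
        m = HEARTBEAT_RE.search(line)
        if m:
            seen.add((m.group("peer"), m.group("addr")))
    return sorted(seen)

# The two example lines from osd.4's log:
log = [
    "2017-07-27 19:38:53.468852 7f3855c1c700 -1 osd.4 120 heartbeat_check: no "
    "reply from 10.32.0.3:6807 osd.15 ever on either front or back",
    "2017-07-27 19:38:53.468881 7f3855c1c700 -1 osd.4 120 heartbeat_check: no "
    "reply from 10.32.0.3:6811 osd.16 ever on either front or back",
]
print(unreachable_peers(log))
# [('osd.15', '10.32.0.3:6807'), ('osd.16', '10.32.0.3:6811')]
```

Feeding a whole OSD log through this gives a quick inventory of peer addresses to target with tcpdump on both ends.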
Re: [ceph-users] Networking/naming doubt
I could be wrong, but I think you cannot achieve this objective. If you declare a cluster network, OSDs will route heartbeat, object replication and recovery traffic over the cluster network. We prefer that the cluster network is NOT reachable from the public network or the Internet for added security. Therefore it will not work with ceph-deploy actions. Source: http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/ On Thu, Jul 27, 2017 at 3:53 PM Oscar Segarra wrote: > Hi, > > In my environment I have 3 hosts, every host has 2 network interfaces: > > public: 192.168.2.0/24 > cluster: 192.168.100.0/24 > > The hostname "vdicnode01", "vdicnode02" and "vdicnode03" are resolved by > public DNS through the public interface, that means the "ping vdicnode01" > will resolve 192.168.2.1. > > In my environment the "admin" node is the first node vdicnode01 and I'd > like all the deployment "ceph-deploy" and all osd traffic to go from the > cluster network. > > 1) To begin with, I create the cluster and I want all traffic to go from > the cluster network: > ceph-deploy --username vdicceph new vdicnode01 --cluster-network > 192.168.100.0/24 --public-network 192.168.100.0/24 > > 2) The problem comes when I have to launch my commands to the other hosts > for example, from node vdicnode01 I execute: > > 2.1) ceph-deploy --username vdicceph osd create vdicnode02:sdb > --> Finishes Ok but communication goes through the public interface > > 2.2) ceph-deploy --username vdicceph osd create vdicnode02.local:sdb > --> vdicnode02.local is added manually in /etc/hosts (assigned a cluster > IP) > --> It raises some errors/warnings because vdicnode02.local is not the real > hostname. 
Some files are created with vdicnode02.local in the middle of the > name of the file and some errors appear when starting up the osd service > related to "file does not exist" > > 2.3) ceph-deploy --username vdicceph osd create vdicnode02-priv:sdb > --> vdicnode02-priv is added manually in /etc/hosts (assigned a cluster IP) > --> It raises some errors/warnings because vdicnode02-priv is not the real > hostname. Some files are created with vdicnode02-priv in the middle of the > name of the file and some errors appear when starting up the osd service > related to "file does not exist" > > What would be the right way to achieve my objective? > > If there is any documentation I have not found, please redirect me... > > Thanks a lot for your help in advance.
Re: [ceph-users] Error in boot.log - Failed to start Ceph disk activation - Luminous
Hi Roger, Thanks a lot, I will try your workaround. I have opened a bug so that the devs can review it as soon as they have availability. http://tracker.ceph.com/issues/20807 2017-07-27 23:39 GMT+02:00 Roger Brown: > I had the same issue on Luminous and worked around it by disabling ceph-disk. > The osds can start without it. > > On Thu, Jul 27, 2017 at 3:36 PM Oscar Segarra > wrote: > >> Hi, >> >> First of all, my version: >> >> [root@vdicnode01 ~]# ceph -v >> ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous >> (rc) >> >> When I boot my ceph node (I have an all in one) I get the following >> message in boot.log: >> >> *[FAILED] Failed to start Ceph disk activation: /dev/sdb2.* >> *See 'systemctl status ceph-disk@dev-sdb2.service' for details.* >> *[FAILED] Failed to start Ceph disk activation: /dev/sdb1.* >> *See 'systemctl status ceph-disk@dev-sdb1.service' for details.* >> >> [root@vdicnode01 ~]# systemctl status ceph-disk@dev-sdb1.service >> ● ceph-disk@dev-sdb1.service - Ceph disk activation: /dev/sdb1 >>Loaded: loaded (/usr/lib/systemd/system/ceph-disk@.service; static; >> vendor preset: disabled) >>Active: failed (Result: exit-code) since Thu 2017-07-27 23:37:23 CEST; >> 1h 52min ago >> Process: 740 ExecStart=/bin/sh -c timeout $CEPH_DISK_TIMEOUT flock >> /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose >> --log-stdout trigger --sync %f (code=exited, status=1/FAILURE) >> Main PID: 740 (code=exited, status=1/FAILURE) >> >> Jul 27 23:37:23 vdicnode01 sh[740]: main(sys.argv[1:]) >> Jul 27 23:37:23 vdicnode01 sh[740]: File >> "/usr/lib/python2.7/site-packages/ceph_disk/main.py", >> line 5682, in main >> Jul 27 23:37:23 vdicnode01 sh[740]: args.func(args) >> Jul 27 23:37:23 vdicnode01 sh[740]: File >> "/usr/lib/python2.7/site-packages/ceph_disk/main.py", >> line 4891, in main_trigger >> Jul 27 23:37:23 vdicnode01 sh[740]: raise Error('return code ' + str(ret)) >> Jul 27 23:37:23 vdicnode01 sh[740]: ceph_disk.main.Error: Error: 
return >> code 1 >> Jul 27 23:37:23 vdicnode01 systemd[1]: ceph-disk@dev-sdb1.service: main >> process exited, code=exited, status=1/FAILURE >> Jul 27 23:37:23 vdicnode01 systemd[1]: Failed to start Ceph disk >> activation: /dev/sdb1. >> Jul 27 23:37:23 vdicnode01 systemd[1]: Unit ceph-disk@dev-sdb1.service >> entered failed state. >> Jul 27 23:37:23 vdicnode01 systemd[1]: ceph-disk@dev-sdb1.service failed. >> >> >> [root@vdicnode01 ~]# systemctl status ceph-disk@dev-sdb2.service >> ● ceph-disk@dev-sdb2.service - Ceph disk activation: /dev/sdb2 >>Loaded: loaded (/usr/lib/systemd/system/ceph-disk@.service; static; >> vendor preset: disabled) >>Active: failed (Result: exit-code) since Thu 2017-07-27 23:37:23 CEST; >> 1h 52min ago >> Process: 744 ExecStart=/bin/sh -c timeout $CEPH_DISK_TIMEOUT flock >> /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose >> --log-stdout trigger --sync %f (code=exited, status=1/FAILURE) >> Main PID: 744 (code=exited, status=1/FAILURE) >> >> Jul 27 23:37:23 vdicnode01 sh[744]: main(sys.argv[1:]) >> Jul 27 23:37:23 vdicnode01 sh[744]: File >> "/usr/lib/python2.7/site-packages/ceph_disk/main.py", >> line 5682, in main >> Jul 27 23:37:23 vdicnode01 sh[744]: args.func(args) >> Jul 27 23:37:23 vdicnode01 sh[744]: File >> "/usr/lib/python2.7/site-packages/ceph_disk/main.py", >> line 4891, in main_trigger >> Jul 27 23:37:23 vdicnode01 sh[744]: raise Error('return code ' + str(ret)) >> Jul 27 23:37:23 vdicnode01 sh[744]: ceph_disk.main.Error: Error: return >> code 1 >> Jul 27 23:37:23 vdicnode01 systemd[1]: ceph-disk@dev-sdb2.service: main >> process exited, code=exited, status=1/FAILURE >> Jul 27 23:37:23 vdicnode01 systemd[1]: Failed to start Ceph disk >> activation: /dev/sdb2. >> Jul 27 23:37:23 vdicnode01 systemd[1]: Unit ceph-disk@dev-sdb2.service >> entered failed state. >> Jul 27 23:37:23 vdicnode01 systemd[1]: ceph-disk@dev-sdb2.service failed. 
>> >> I have created an entry in /etc/fstab in order to mount the journal disk >> automatically: >> >> /dev/sdb1 /var/lib/ceph/osd/ceph-0 xfs defaults,noatime >> 1 2 >> >> But when I boot, I get the same error message. >> >> When I execute ceph -s, the osd looks to be working perfectly: >> >> [root@vdicnode01 ~]# ceph -s >> cluster: >> id: 61881df3-1365-4139-a586-92b5eca9cf18 >> health: HEALTH_WARN >> Degraded data redundancy: 5/10 objects degraded (50.000%), >> 128 pgs unclean, 128 pgs degraded, 128 pgs undersized >> 128 pgs not scrubbed for 86400 >> >> services: >> mon: 1 daemons, quorum vdicnode01 >> mgr: vdicnode01(active) >> osd: 1 osds: 1 up, 1 in >> >> data: >> pools: 1 pools, 128 pgs >> objects: 5 objects, 1349 bytes >> usage: 1073 MB used, 39785 MB / 40858 MB avail >> pgs: 5/10 objects degraded (50.000%) >> 128 active+undersized+degraded >> >> >> Has anybody experienced the same issue? >>
[ceph-users] Networking/naming doubt
Hi, In my environment I have 3 hosts, every host has 2 network interfaces: public: 192.168.2.0/24 cluster: 192.168.100.0/24 The hostname "vdicnode01", "vdicnode02" and "vdicnode03" are resolved by public DNS through the public interface, that means the "ping vdicnode01" will resolve 192.168.2.1. In my environment the "admin" node is the first node vdicnode01 and I'd like all the deployment "ceph-deploy" and all osd traffic to go from the cluster network. 1) To begin with, I create the cluster and I want all traffic to go from the cluster network: ceph-deploy --username vdicceph new vdicnode01 --cluster-network 192.168.100.0/24 --public-network 192.168.100.0/24 2) The problem comes when I have to launch my commands to the other hosts for example, from node vdicnode01 I execute: 2.1) ceph-deploy --username vdicceph osd create vdicnode02:sdb --> Finishes Ok but communication goes through the public interface 2.2) ceph-deploy --username vdicceph osd create vdicnode02.local:sdb --> vdicnode02.local is added manually in /etc/hosts (assigned a cluster IP) --> It raises some errors/warnings because vdicnode02.local is not the real hostname. Some files are created with vdicnode02.local in the middle of the name of the file and some errors appear when starting up the osd service related to "file does not exist" 2.3) ceph-deploy --username vdicceph osd create vdicnode02-priv:sdb --> vdicnode02-priv is added manually in /etc/hosts (assigned a cluster IP) --> It raises some errors/warnings because vdicnode02-priv is not the real hostname. Some files are created with vdicnode02-priv in the middle of the name of the file and some errors appear when starting up the osd service related to "file does not exist" What would be the right way to achieve my objective? If there is any documentation I have not found, please redirect me... Thanks a lot for your help in advance.
Re: [ceph-users] Error in boot.log - Failed to start Ceph disk activation - Luminous
I had the same issue on Luminous and worked around it by disabling ceph-disk. The osds can start without it. On Thu, Jul 27, 2017 at 3:36 PM Oscar Segarra wrote: > Hi, > > First of all, my version: > > [root@vdicnode01 ~]# ceph -v > ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous > (rc) > > When I boot my ceph node (I have an all in one) I get the following > message in boot.log: > > *[FAILED] Failed to start Ceph disk activation: /dev/sdb2.* > *See 'systemctl status ceph-disk@dev-sdb2.service' for details.* > *[FAILED] Failed to start Ceph disk activation: /dev/sdb1.* > *See 'systemctl status ceph-disk@dev-sdb1.service' for details.* > > [root@vdicnode01 ~]# systemctl status ceph-disk@dev-sdb1.service > ● ceph-disk@dev-sdb1.service - Ceph disk activation: /dev/sdb1 >Loaded: loaded (/usr/lib/systemd/system/ceph-disk@.service; static; > vendor preset: disabled) >Active: failed (Result: exit-code) since Thu 2017-07-27 23:37:23 CEST; > 1h 52min ago > Process: 740 ExecStart=/bin/sh -c timeout $CEPH_DISK_TIMEOUT flock > /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose > --log-stdout trigger --sync %f (code=exited, status=1/FAILURE) > Main PID: 740 (code=exited, status=1/FAILURE) > > Jul 27 23:37:23 vdicnode01 sh[740]: main(sys.argv[1:]) > Jul 27 23:37:23 vdicnode01 sh[740]: File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5682, in main > Jul 27 23:37:23 vdicnode01 sh[740]: args.func(args) > Jul 27 23:37:23 vdicnode01 sh[740]: File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4891, in > main_trigger > Jul 27 23:37:23 vdicnode01 sh[740]: raise Error('return code ' + str(ret)) > Jul 27 23:37:23 vdicnode01 sh[740]: ceph_disk.main.Error: Error: return > code 1 > Jul 27 23:37:23 vdicnode01 systemd[1]: ceph-disk@dev-sdb1.service: main > process exited, code=exited, status=1/FAILURE > Jul 27 23:37:23 vdicnode01 systemd[1]: Failed to start Ceph disk > activation: /dev/sdb1. 
> Jul 27 23:37:23 vdicnode01 systemd[1]: Unit ceph-disk@dev-sdb1.service > entered failed state. > Jul 27 23:37:23 vdicnode01 systemd[1]: ceph-disk@dev-sdb1.service failed. > > > [root@vdicnode01 ~]# systemctl status ceph-disk@dev-sdb2.service > ● ceph-disk@dev-sdb2.service - Ceph disk activation: /dev/sdb2 >Loaded: loaded (/usr/lib/systemd/system/ceph-disk@.service; static; > vendor preset: disabled) >Active: failed (Result: exit-code) since Thu 2017-07-27 23:37:23 CEST; > 1h 52min ago > Process: 744 ExecStart=/bin/sh -c timeout $CEPH_DISK_TIMEOUT flock > /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose > --log-stdout trigger --sync %f (code=exited, status=1/FAILURE) > Main PID: 744 (code=exited, status=1/FAILURE) > > Jul 27 23:37:23 vdicnode01 sh[744]: main(sys.argv[1:]) > Jul 27 23:37:23 vdicnode01 sh[744]: File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5682, in main > Jul 27 23:37:23 vdicnode01 sh[744]: args.func(args) > Jul 27 23:37:23 vdicnode01 sh[744]: File > "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4891, in > main_trigger > Jul 27 23:37:23 vdicnode01 sh[744]: raise Error('return code ' + str(ret)) > Jul 27 23:37:23 vdicnode01 sh[744]: ceph_disk.main.Error: Error: return > code 1 > Jul 27 23:37:23 vdicnode01 systemd[1]: ceph-disk@dev-sdb2.service: main > process exited, code=exited, status=1/FAILURE > Jul 27 23:37:23 vdicnode01 systemd[1]: Failed to start Ceph disk > activation: /dev/sdb2. > Jul 27 23:37:23 vdicnode01 systemd[1]: Unit ceph-disk@dev-sdb2.service > entered failed state. > Jul 27 23:37:23 vdicnode01 systemd[1]: ceph-disk@dev-sdb2.service failed. > > I have created an entry in /etc/fstab in order to mount journal disk > automatically: > > /dev/sdb1 /var/lib/ceph/osd/ceph-0 xfs defaults,noatime > 1 2 > > But when I boot, I get the same error message. 
> > When I execute ceph -s, the osd looks to be working perfectly: > > [root@vdicnode01 ~]# ceph -s > cluster: > id: 61881df3-1365-4139-a586-92b5eca9cf18 > health: HEALTH_WARN > Degraded data redundancy: 5/10 objects degraded (50.000%), 128 > pgs unclean, 128 pgs degraded, 128 pgs undersized > 128 pgs not scrubbed for 86400 > > services: > mon: 1 daemons, quorum vdicnode01 > mgr: vdicnode01(active) > osd: 1 osds: 1 up, 1 in > > data: > pools: 1 pools, 128 pgs > objects: 5 objects, 1349 bytes > usage: 1073 MB used, 39785 MB / 40858 MB avail > pgs: 5/10 objects degraded (50.000%) > 128 active+undersized+degraded > > > Has anybody experienced the same issue? > > Thanks a lot. >
[ceph-users] Error in boot.log - Failed to start Ceph disk activation - Luminous
Hi, First of all, my version: [root@vdicnode01 ~]# ceph -v ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc) When I boot my ceph node (I have an all in one) I get the following message in boot.log: *[FAILED] Failed to start Ceph disk activation: /dev/sdb2.* *See 'systemctl status ceph-disk@dev-sdb2.service' for details.* *[FAILED] Failed to start Ceph disk activation: /dev/sdb1.* *See 'systemctl status ceph-disk@dev-sdb1.service' for details.* [root@vdicnode01 ~]# systemctl status ceph-disk@dev-sdb1.service ● ceph-disk@dev-sdb1.service - Ceph disk activation: /dev/sdb1 Loaded: loaded (/usr/lib/systemd/system/ceph-disk@.service; static; vendor preset: disabled) Active: failed (Result: exit-code) since Thu 2017-07-27 23:37:23 CEST; 1h 52min ago Process: 740 ExecStart=/bin/sh -c timeout $CEPH_DISK_TIMEOUT flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f (code=exited, status=1/FAILURE) Main PID: 740 (code=exited, status=1/FAILURE) Jul 27 23:37:23 vdicnode01 sh[740]: main(sys.argv[1:]) Jul 27 23:37:23 vdicnode01 sh[740]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5682, in main Jul 27 23:37:23 vdicnode01 sh[740]: args.func(args) Jul 27 23:37:23 vdicnode01 sh[740]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4891, in main_trigger Jul 27 23:37:23 vdicnode01 sh[740]: raise Error('return code ' + str(ret)) Jul 27 23:37:23 vdicnode01 sh[740]: ceph_disk.main.Error: Error: return code 1 Jul 27 23:37:23 vdicnode01 systemd[1]: ceph-disk@dev-sdb1.service: main process exited, code=exited, status=1/FAILURE Jul 27 23:37:23 vdicnode01 systemd[1]: Failed to start Ceph disk activation: /dev/sdb1. Jul 27 23:37:23 vdicnode01 systemd[1]: Unit ceph-disk@dev-sdb1.service entered failed state. Jul 27 23:37:23 vdicnode01 systemd[1]: ceph-disk@dev-sdb1.service failed. 
[root@vdicnode01 ~]# systemctl status ceph-disk@dev-sdb2.service ● ceph-disk@dev-sdb2.service - Ceph disk activation: /dev/sdb2 Loaded: loaded (/usr/lib/systemd/system/ceph-disk@.service; static; vendor preset: disabled) Active: failed (Result: exit-code) since Thu 2017-07-27 23:37:23 CEST; 1h 52min ago Process: 744 ExecStart=/bin/sh -c timeout $CEPH_DISK_TIMEOUT flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f (code=exited, status=1/FAILURE) Main PID: 744 (code=exited, status=1/FAILURE) Jul 27 23:37:23 vdicnode01 sh[744]: main(sys.argv[1:]) Jul 27 23:37:23 vdicnode01 sh[744]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5682, in main Jul 27 23:37:23 vdicnode01 sh[744]: args.func(args) Jul 27 23:37:23 vdicnode01 sh[744]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4891, in main_trigger Jul 27 23:37:23 vdicnode01 sh[744]: raise Error('return code ' + str(ret)) Jul 27 23:37:23 vdicnode01 sh[744]: ceph_disk.main.Error: Error: return code 1 Jul 27 23:37:23 vdicnode01 systemd[1]: ceph-disk@dev-sdb2.service: main process exited, code=exited, status=1/FAILURE Jul 27 23:37:23 vdicnode01 systemd[1]: Failed to start Ceph disk activation: /dev/sdb2. Jul 27 23:37:23 vdicnode01 systemd[1]: Unit ceph-disk@dev-sdb2.service entered failed state. Jul 27 23:37:23 vdicnode01 systemd[1]: ceph-disk@dev-sdb2.service failed. I have created an entry in /etc/fstab in order to mount journal disk automatically: /dev/sdb1 /var/lib/ceph/osd/ceph-0 xfs defaults,noatime 1 2 But when I boot, I get the same error message. 
When I execute ceph -s, the osd looks to be working perfectly: [root@vdicnode01 ~]# ceph -s cluster: id: 61881df3-1365-4139-a586-92b5eca9cf18 health: HEALTH_WARN Degraded data redundancy: 5/10 objects degraded (50.000%), 128 pgs unclean, 128 pgs degraded, 128 pgs undersized 128 pgs not scrubbed for 86400 services: mon: 1 daemons, quorum vdicnode01 mgr: vdicnode01(active) osd: 1 osds: 1 up, 1 in data: pools: 1 pools, 128 pgs objects: 5 objects, 1349 bytes usage: 1073 MB used, 39785 MB / 40858 MB avail pgs: 5/10 objects degraded (50.000%) 128 active+undersized+degraded Has anybody experienced the same issue? Thanks a lot.
[ceph-users] how to troubleshoot "heartbeat_check: no reply" in OSD log
I’ve got a cluster where a bunch of OSDs are down/out (only 6/21 are up/in). ceph status and ceph osd tree output can be found at: https://gist.github.com/jbw976/24895f5c35ef0557421124f4b26f6a12 In osd.4 log, I see many of these: 2017-07-27 19:38:53.468852 7f3855c1c700 -1 osd.4 120 heartbeat_check: no reply from 10.32.0.3:6807 osd.15 ever on either front or back, first ping sent 2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850) 2017-07-27 19:38:53.468881 7f3855c1c700 -1 osd.4 120 heartbeat_check: no reply from 10.32.0.3:6811 osd.16 ever on either front or back, first ping sent 2017-07-27 19:37:40.857220 (cutoff 2017-07-27 19:38:33.468850) From osd.4, those endpoints look reachable: / # nc -vz 10.32.0.3 6807 10.32.0.3 (10.32.0.3:6807) open / # nc -vz 10.32.0.3 6811 10.32.0.3 (10.32.0.3:6811) open What else can I look at to determine why most of the OSDs cannot communicate? http://tracker.ceph.com/issues/16092 indicates this behavior is a networking or hardware issue, what else can I check there? I can turn on extra logging as needed. Thanks!
Re: [ceph-users] Client behavior when OSD is unreachable
The clients receive up-to-date versions of the osd map, which includes which osds are down. So yes, when an osd is marked down in the cluster the clients know about it. If an osd is unreachable but isn't marked down in the cluster, the result is blocked requests.

On Thu, Jul 27, 2017, 1:21 PM Daniel K wrote:
> Does the client track which OSDs are reachable? How does it behave if some
> are not reachable?
>
> For example:
>
> Cluster network with all OSD hosts on a switch.
> Public network with OSD hosts split between two switches, failure domain
> is switch.
>
> copies=3, so with a failure of the public switch, 1 copy would be reachable
> by the client. Will the client know that it can't reach the OSDs on the
> failed switch?
>
> Well... thinking through this:
> The mons communicate on the public network -- correct? So an unreachable
> public network for some of the OSDs would cause them to be marked down,
> which then the clients would know about.
>
> Correct?
[ceph-users] Client behavior when OSD is unreachable
Does the client track which OSDs are reachable? How does it behave if some are not reachable?

For example:

Cluster network with all OSD hosts on a switch.
Public network with OSD hosts split between two switches, failure domain is switch.

copies=3, so with a failure of the public switch, 1 copy would be reachable by the client. Will the client know that it can't reach the OSDs on the failed switch?

Well... thinking through this:
The mons communicate on the public network -- correct? So an unreachable public network for some of the OSDs would cause them to be marked down, which then the clients would know about.

Correct?
Re: [ceph-users] High iowait on OSD node
I'm using bcache (starting around the middle of December... before that you can see way higher await) for all the 12 hdds on the 2 SSDs, and NVMe for journals. (And some months ago I changed all the 2TB disks to 6TB and added ceph4,5.)

Here's my iostat in ganglia:

just raw per disk await
http://www.brockmann-consult.de/ganglia/graph_all_periods.php?title[]=ceph.*[]=sd[a-z]_await=line=show=1

per host max await
http://www.brockmann-consult.de/ganglia/graph_all_periods.php?title[]=ceph.*[]=max_await=line=show=1

strangely aggregated data (my max metric is the max disk, but ganglia averages out across disk/host or something, so it's not really a max)
http://www.brockmann-consult.de/ganglia/graph_all_periods.php?c=ceph=network_report=week=by%20name=4=2=1501155678=disk_wait_report=large

or to explore and make your own graphs, start from here:
http://www.brockmann-consult.de/ganglia/

I didn't find any ganglia plugins for that, so I wrote some that take 30s averages every minute from iostat and store them. So when you see numbers like 400 in my data, it could have been a steady 400 for 30 seconds, or 4000 for 3 seconds and then 0 for 27 seconds averaged together, and 30s of every minute is missing from the data.

In my data, sda,b,c on ceph1,2,3 are probably always the SSDs, and sdm,n on ceph4,5 are currently the SSDs and possibly were sda,b once; sometimes rebooting changes it (yeah, not ideal, but I'm not sure how to change it... maybe a udev rule to name SSDs differently).

And also note that I found the deadline scheduler instead of CFQ has way lower iowait and latency, but not necessarily more throughput or iops... you could test that; but not using CFQ might disable some ceph priority settings (or maybe that's not relevant since Jewel?).

PS:
Use fixed width on your iostat and it's more readable in HTML-supporting email clients... see below where I changed it.

On 07/27/17 05:48, John Petrini wrote:
> Hello list,
>
> Just curious if anyone has ever seen this behavior and might have some
> ideas on how to troubleshoot it.
>
> We're seeing very high iowait in iostat across all OSDs on a
> single OSD host. It's very spiky - dropping to zero and then shooting
> up to as high as 400 in some cases. Despite this it does not seem to
> be having a major impact on the cluster performance as a whole.
>
> Some more details:
> 3x OSD Nodes - Dell R730's: 24 cores @2.6GHz, 256GB RAM, 20x 1.2TB 10K
> SAS OSDs per node.
>
> We're running ceph hammer.
>
> Here's the output of iostat. Note that this is from a period when the
> cluster is not very busy but you can still see high spikes on a few
> OSDs. It's much worse during high load.
>
> Device:  rrqm/s  wrqm/s     r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda        0.00    0.00    0.00    0.50      0.00     6.00    24.00     0.00    8.00    0.00    8.00   8.00   0.40
> sdb        0.00    0.00    0.00   60.00      0.00   808.00    26.93     0.00    0.07    0.00    0.07   0.03   0.20
> sdc        0.00    0.00    0.00    0.00      0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdd        0.00    0.00    0.00   67.00      0.00  1010.00    30.15     0.01    0.09    0.00    0.09   0.09   0.60
> sde        0.00    0.00    0.00   93.00      0.00   868.00    18.67     0.00    0.04    0.00    0.04   0.04   0.40
> sdf        0.00    0.00    0.00   57.50      0.00   572.00    19.90     0.00    0.03    0.00    0.03   0.03   0.20
> sdg        0.00    1.00    0.00    3.50      0.00    22.00    12.57     0.75   16.00    0.00   16.00   2.86   1.00
> sdh        0.00    0.00    1.50   25.50      6.00   458.50    34.41     2.03   75.26    0.00   79.69   3.04   8.20
> sdi        0.00    0.00    0.00   30.50      0.00   384.50    25.21     2.36   77.51    0.00   77.51   3.28  10.00
> sdj        0.00    1.00    1.50  105.00      6.00   925.75    17.50    10.85  101.84    8.00  103.18   2.35  25.00
> sdl        0.00    0.00    2.00    0.00    320.00     0.00   320.00     0.01    3.00    3.00    0.00   2.00   0.40
> sdk        0.00    1.00    0.00   55.00      0.00   334.50    12.16     7.92  136.91    0.00  136.91   2.51  13.80
> sdm        0.00    0.00    0.00    0.00      0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdn        0.00    0.00    0.00    0.00      0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdo        0.00    0.00    1.00    0.00      4.00     0.00     8.00     0.00    4.00    4.00    0.00   4.00   0.40
> sdp        0.00    0.00    0.00    0.00      0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdq        0.50    0.00  756.00    0.00  93288.00     0.00   246.79     1.47    1.95    1.95    0.00   1.17  88.60
> sdr        0.00    0.00    1.00    0.00      4.00     0.00     8.00     0.00    4.00    4.00    0.00   4.00   0.40
Re: [ceph-users] Fwd: [lca-announce] Call for Proposals for linux.conf.au 2018 in Sydney are open!
On 07/03/2017 02:36 PM, Tim Serong wrote:
> It's that time of year again, folks! Please everyone go submit talks,
> or at least plan to attend this most excellent of F/OSS conferences.

CFP closes in a bit over a week (August 6). Get into it if you didn't already :-)

> (I thought I might put in a proposal to run a ceph miniconf, unless
> anyone else was already thinking of doing that? If accepted, that would
> give us a whole day of cephy goodness in addition to whatever lands in
> the main conference programme.)

I *did* put in a proposal for a Ceph miniconf, BTW. Fingers crossed...

Tim

> Forwarded Message
> Subject: [lca-announce] Call for Proposals for linux.conf.au 2018 in Sydney are open!
> Date: Mon, 03 Jul 2017 11:04:27 +1000
> From: linux.conf.au Announcements
> Reply-To: lca-annou...@lists.linux.org.au
> To: lca-annou...@lists.linux.org.au, annou...@lists.linux.org.au
>
> On behalf of the LCA2018 team we are pleased to announce that the Call
> for Proposals for linux.conf.au 2018 is now open! This Call for
> Proposals will close on August 6 with no extensions expected.
>
> linux.conf.au is one of the best-known community driven Free and Open
> Source Software conferences in the world. In 2018 we welcome you to join
> us in Sydney, New South Wales on Monday 22 January through to Friday 26
> January.
>
> For full details including those not covered by this announcement visit
> https://linux.conf.au/proposals
>
> == IMPORTANT DATES ==
>
> * Call for Proposals Opens: 3 July 2017
> * Call for Proposals Closes: 6 August 2017 (no extensions)
> * Notifications from the programme committee: mid-September 2017
> * Conference Opens: 22nd January 2018
>
> == HOW TO SUBMIT ==
>
> Create an account at https://login.linux.conf.au/manage/public/newuser
> Visit https://linux.conf.au/proposals and click the link to submit your
> proposal
>
> == ABOUT LINUX.CONF.AU ==
>
> linux.conf.au is a conference where people gather to learn about the
> entire world of Free and Open Source Software, directly from the people
> who shape the projects and topics that they’re presenting on.
>
> Our aim is to create a deeply technical conference made up of industry
> leaders and experts on a wide range of subjects.
>
> linux.conf.au welcomes submissions from first-time and seasoned speakers
> from all free and open technology communities and all walks of life. We
> respect and encourage diversity at our conference.
>
> == CONFERENCE THEME ==
>
> The theme for linux.conf.au 2018 is “A Little Bit Of History Repeating”.
> Building on last year’s theme of “The Future of Open Source”, we intend
> to examine the future through the lens of the past.
>
> For some suggestions to get you started with your proposal ideas please
> visit the linux.conf.au website.
>
> == PROPOSAL TYPES ==
>
> We’re accepting submissions for three different types of proposal:
>
> * Presentation (45 minutes): These are generally presented in lecture
> format and form the bulk of the available conference slots.
> * Tutorial (100 minutes): These are generally presented in a classroom
> format. They should be interactive or hands-on in nature. Tutorials are
> expected to have a specific learning outcome for attendees.
> * Miniconf (full-day): Single-track mini-conferences that run for the
> duration of a day on either Monday or Tuesday. We provide the room, and
> you provide the speakers. Together, you can explore a field in Free and
> Open Source software in depth.
>
> == PROPOSER RECOGNITION ==
>
> In recognition of the value that presenters and organisers bring to our
> conference, once a proposal is accepted, one presenter or organiser per
> proposal is entitled to:
>
> * Free registration, which holds all of the benefits of a Professional
> Delegate Ticket
> * A complimentary ticket to the Speakers' Dinner for the speaker, with
> additional tickets for significant others and children of the speaker
> available for purchase.
> * Optionally, recognition as a Fairy Penguin Sponsor, available at 50%
> off the advertised price
>
> If your proposal includes more than one presenter or organiser, these
> additional people will be entitled to:
>
> * Professional or hobbyist registration at the Early Bird rate,
> regardless of whether the Early Bird rate is generally available
> * Speakers’ dinner tickets available for purchase at cost
>
> Important Note for miniconf organisers: These discounts apply to the
> organisers only. All participants in your miniconf must arrange or
> purchase tickets for themselves via the regular ticket sales process or
> they may not be able to attend!
>
> As a volunteer-run non-profit conference, linux.conf.au does not pay
> speakers to present at the conference; but you may be eligible for
> financial assistance.
>
> == FINANCIAL ASSISTANCE ==
>
> linux.conf.au is able to provide limited financial assistance for some
> speakers.
>
> Financial assistance may
[ceph-users] Ceph Developers Monthly - August
Hey Cephers,

This is just a friendly reminder that the next Ceph Developer Monthly meeting is coming up: https://wiki.ceph.com/Planning

If you have work that you're doing that is feature work, significant backports, or anything you would like to discuss with the core team, please add it to the following page: https://wiki.ceph.com/CDM_02-AUG-2017

If you have questions or comments, please let us know.

Kindest regards,

Leo

--
Leonardo Vaz
Ceph Community Manager
Open Source and Standards Team
[ceph-users] High iowait on OSD node
Hello list,

Just curious if anyone has ever seen this behavior and might have some ideas on how to troubleshoot it.

We're seeing very high iowait in iostat across all OSDs on a single OSD host. It's very spiky - dropping to zero and then shooting up to as high as 400 in some cases. Despite this it does not seem to be having a major impact on the cluster performance as a whole.

Some more details:
3x OSD Nodes - Dell R730's: 24 cores @2.6GHz, 256GB RAM, 20x 1.2TB 10K SAS OSDs per node.

We're running ceph hammer.

Here's the output of iostat. Note that this is from a period when the cluster is not very busy but you can still see high spikes on a few OSDs. It's much worse during high load.

Device:  rrqm/s  wrqm/s     r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda        0.00    0.00    0.00    0.50      0.00     6.00    24.00     0.00    8.00    0.00    8.00   8.00   0.40
sdb        0.00    0.00    0.00   60.00      0.00   808.00    26.93     0.00    0.07    0.00    0.07   0.03   0.20
sdc        0.00    0.00    0.00    0.00      0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd        0.00    0.00    0.00   67.00      0.00  1010.00    30.15     0.01    0.09    0.00    0.09   0.09   0.60
sde        0.00    0.00    0.00   93.00      0.00   868.00    18.67     0.00    0.04    0.00    0.04   0.04   0.40
sdf        0.00    0.00    0.00   57.50      0.00   572.00    19.90     0.00    0.03    0.00    0.03   0.03   0.20
sdg        0.00    1.00    0.00    3.50      0.00    22.00    12.57     0.75   16.00    0.00   16.00   2.86   1.00
sdh        0.00    0.00    1.50   25.50      6.00   458.50    34.41     2.03   75.26    0.00   79.69   3.04   8.20
sdi        0.00    0.00    0.00   30.50      0.00   384.50    25.21     2.36   77.51    0.00   77.51   3.28  10.00
sdj        0.00    1.00    1.50  105.00      6.00   925.75    17.50    10.85  101.84    8.00  103.18   2.35  25.00
sdl        0.00    0.00    2.00    0.00    320.00     0.00   320.00     0.01    3.00    3.00    0.00   2.00   0.40
sdk        0.00    1.00    0.00   55.00      0.00   334.50    12.16     7.92  136.91    0.00  136.91   2.51  13.80
sdm        0.00    0.00    0.00    0.00      0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdn        0.00    0.00    0.00    0.00      0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdo        0.00    0.00    1.00    0.00      4.00     0.00     8.00     0.00    4.00    4.00    0.00   4.00   0.40
sdp        0.00    0.00    0.00    0.00      0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdq        0.50    0.00  756.00    0.00  93288.00     0.00   246.79     1.47    1.95    1.95    0.00   1.17  88.60
sdr        0.00    0.00    1.00    0.00      4.00     0.00     8.00     0.00    4.00    4.00    0.00   4.00   0.40
sds        0.00    0.00    0.00    0.00      0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdt        0.00    0.00    0.00   36.50      0.00   643.50    35.26     3.49   95.73    0.00   95.73   2.63   9.60
sdu        0.00    0.00    0.00   21.00      0.00   323.25    30.79     0.78   37.24    0.00   37.24   2.95   6.20
sdv        0.00    0.00    0.00    0.00      0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdw        0.00    0.00    0.00   31.00      0.00   689.50    44.48     2.48   80.06    0.00   80.06   3.29  10.20
sdx        0.00    0.00    0.00    0.00      0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0       0.00    0.00    0.00    0.50      0.00     6.00    24.00     0.00    8.00    0.00    8.00   8.00   0.40
dm-1       0.00    0.00    0.00    0.00      0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
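[Editorial note: with 20+ devices it is easier to spot the spiky disks by filtering on the await column than by eyeballing the whole table. A minimal sketch, with a few of the rows above inlined as sample data and an arbitrary 50 ms threshold; on a real host you would feed it live `iostat -x` output instead:]

```shell
# Print devices whose await exceeds 50 ms, worst first.
cat > /tmp/iostat.txt <<'EOF'
Device:  rrqm/s  wrqm/s     r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdh        0.00    0.00    1.50   25.50      6.00   458.50    34.41     2.03   75.26    0.00   79.69   3.04   8.20
sdj        0.00    1.00    1.50  105.00      6.00   925.75    17.50    10.85  101.84    8.00  103.18   2.35  25.00
sdq        0.50    0.00  756.00    0.00  93288.00     0.00   246.79     1.47    1.95    1.95    0.00   1.17  88.60
EOF
# Column 10 is await in this extended-statistics layout.
awk 'NR > 1 && $10 > 50 { print $1, $10 }' /tmp/iostat.txt | sort -k2 -rn
# → sdj 101.84
#   sdh 75.26
```

Note how sdq (the busy SSD/journal device at 88% util) is correctly excluded: high throughput with ~2 ms await is healthy, while sdh/sdj/sdk-style rows with small queues and 75-136 ms await are the ones worth chasing.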
Re: [ceph-users] Ceph object recovery
So I'm not sure if this was the best or right way to do this, but --

using rados I confirmed the unfound object was in the cephfs_data pool:

# rados -p cephfs_data ls | grep 001c0ed4

using the osdmaptool I found the pg/osd the unfound object was in (osdmap previously exported to file "osdmap"):

# osdmaptool --test-map-object 162.001c0ed4 osdmap
 object '162.001c0ed4' -> 1.21 -> [4]

then told ceph to just delete the unfound object:

ceph pg 1.21 mark_unfound_lost delete

and then used rados to put the object back (from the file I had extracted previously):

# rados -p cephfs_data put 162.001c0ed4 162.001c0ed4.obj

Still have more recovery to do but this seems to have fixed my unfound object problem.

On Tue, Jul 25, 2017 at 12:54 PM, Daniel K wrote:
> I did some bad things to my cluster, broke 5 OSDs and wound up with 1
> unfound object.
>
> I mounted one of the OSD drives and used ceph-objectstore-tool to find and
> export the object:
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-10 162.001c0ed4 get-bytes filename.obj
>
> What's the best way to bring this object back into the active cluster?
>
> Do I need to bring an OSD offline, mount it and do the reverse of the
> above command?
>
> Something like:
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 162.001c0ed4 set-bytes filename.obj
>
> Is there some way to do this without bringing down an osd?