Re: [ceph-users] Ceph + VMWare
Hi all,

we are using Ceph (jewel 10.2.2, 10 GBit Ceph frontend/backend, 3 nodes, each with 8 OSDs and 2 journal SSDs) in our VMware environment, especially for test environments and templates - but currently not for production machines (because of missing FC redundancy & performance).

On our Linux-based SCST 4 GBit fibre channel proxy, 16 ceph-rbd devices (non-caching, in total 10 TB) form a striped LVM volume which is published as an FC target to our VMware cluster. Looks fine, works stable. But currently the proxy is not redundant (only one head).

Performance is ok (a), but not as good as our IBM Storwize 3700 SAN (16 HDDs). Especially for small IOs (4k), the IBM is twice as fast as Ceph.

Native Ceph integration into VMware would be great (-:

Best regards
Daniel

(a) Atto benchmark screenshots - IBM Storwize 3700 vs. Ceph
    https://dtnet.storage.dtnetcloud.com/d/684b330eea/

---
DT Netsolution GmbH - Taläckerstr. 30 - D-70437 Stuttgart
Geschäftsführer: Daniel Schwager, Stefan Hörz - HRB Stuttgart 19870
Tel: +49-711-849910-32, Fax: -932 - Mailto:daniel.schwa...@dtnet.de

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Patrick McGarry
> Sent: Wednesday, October 05, 2016 8:33 PM
> To: Ceph-User; Ceph Devel
> Subject: [ceph-users] Ceph + VMWare
>
> Hey guys,
>
> Starting to buckle down a bit in looking at how we can better set up
> Ceph for VMWare integration, but I need a little info/help from you
> folks.
>
> If you currently are using Ceph+VMWare, or are exploring the option,
> I'd like some simple info from you:
>
> 1) Company
> 2) Current deployment size
> 3) Expected deployment growth
> 4) Integration method (or desired method) ex: iscsi, native, etc
>
> Just casting the net so we know who is interested and might want to
> help us shape and/or test things in the future if we can make it
> better. Thanks.
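As a rough sketch of how such a striped LVM volume over several RBD images can be built (pool name "vmware", image names, sizes and the stripe count of 4 are illustrative assumptions, not Daniel's actual values):

    # create and map a few RBD images (repeat for rbd01..rbd04, or 16 like above)
    rbd create vmware/rbd01 --size 655360       # ~640 GB each
    rbd map vmware/rbd01

    # build one striped logical volume across all mapped devices
    pvcreate /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3
    vgcreate vg_vmware /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3
    lvcreate -n lv_vmware -i 4 -I 1024 -l 100%FREE vg_vmware   # -i = stripes, -I = stripe size in KB (1 MB here)

The resulting /dev/vg_vmware/lv_vmware is what would then be exported by SCST as the FC LUN; a smaller stripe size (e.g. 4 KB) trades bandwidth for small-IO latency.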
Re: [ceph-users] Automount Failovered Multi MDS CephFS
Maybe something like this?

192.168.135.31:6789:/    /cephfs    ceph    name=cephfs,secretfile=/etc/ceph/client.cephfs,noatime    0 0

Best regards
Daniel

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Lazuardi Nasution
Sent: Wednesday, August 03, 2016 6:10 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Automount Failovered Multi MDS CephFS

Hi,

I'm looking for an example of what to put in /etc/fstab if I want to auto-mount CephFS on a failovered multi-MDS setup (only one MDS is active), especially with Jewel. My target is to build load-balanced file/web servers with a CephFS backend.

Best regards,
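For the failover aspect it can also help to list all monitor addresses in the device field - the kernel CephFS client then tries the next monitor if one is unreachable (the addresses below are the ones used elsewhere in this digest; "_netdev" is optional and only delays mounting until the network is up):

    192.168.135.31:6789,192.168.135.32:6789,192.168.135.33:6789:/  /cephfs  ceph  name=cephfs,secretfile=/etc/ceph/client.cephfs,noatime,_netdev  0 0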
Re: [ceph-users] OSD behavior, in case of its journal disk (either HDD or SSD) failure
Hi,

> ok...OSD stop. Any reason why OSD stop (I assume if journal disk
> fails, OSD should work as no journal. Isn't it?)

No. In my understanding - if a journal fails, all OSDs attached to this journal disk fail as well. E.g. if you have 4 OSDs with their 4 journals located on one SSD, the failure of this SSD will also crash/fail your 4 OSDs.

regards
Danny

>
> Not understand, why the OSD data lost. You mean - data lost during the
> transaction time? or total OSD data lost?
>
> Thanks
> Swami
>
> On Mon, Jan 25, 2016 at 7:06 PM, Jan Schermer wrote:
> > OSD stops.
> > And you pretty much lose all data on the OSD if you lose the journal.
> >
> > Jan
> >
> >> On 25 Jan 2016, at 14:04, M Ranga Swami Reddy wrote:
> >>
> >> Hello,
> >>
> >> If a journal disk fails (with crash or power failure, etc), what
> >> happens on OSD operations?
> >>
> >> PS: Assume that journal and OSD is on a separate drive.
> >>
> >> Thanks
> >> Swami
Re: [ceph-users] pg is stuck stale (osd.21 still removed)
Hi ceph-users,

any idea how to fix my cluster? OSD.21 was removed, but some (stale) PGs are still pointing to OSD.21... I don't know how to proceed... Help is very welcome!

Best regards
Daniel

> -----Original Message-----
> From: Daniel Schwager
> Sent: Friday, January 08, 2016 3:10 PM
> To: 'ceph-us...@ceph.com'
> Subject: pg is stuck stale (osd.21 still removed)
>
> Hi,
>
> we had a HW problem with OSD.21 today. The OSD daemon was down and "smartctl" told me about some
> hardware errors.
>
> I decided to remove the HDD:
>
>   ceph osd out 21
>   ceph osd crush remove osd.21
>   ceph auth del osd.21
>   ceph osd rm osd.21
>
> But afterwards I saw that I have some stuck PGs for osd.21:
>
> root@ceph-admin:~# ceph -w
>     cluster c7b12656-15a6-41b0-963f-4f47c62497dc
>      health HEALTH_WARN
>             50 pgs stale
>             50 pgs stuck stale
>      monmap e4: 3 mons at {ceph-mon1=192.168.135.31:6789/0,ceph-mon2=192.168.135.32:6789/0,ceph-mon3=192.168.135.33:6789/0}
>             election epoch 404, quorum 0,1,2 ceph-mon1,ceph-mon2,ceph-mon3
>      mdsmap e136: 1/1/1 up {0=ceph-mon1=up:active}
>      osdmap e18259: 23 osds: 23 up, 23 in
>       pgmap v47879105: 6656 pgs, 10 pools, 23481 GB data, 6072 kobjects
>             54974 GB used, 30596 GB / 85571 GB avail
>                 6605 active+clean
>                   50 stale+active+clean
>                    1 active+clean+scrubbing+deep
>
> root@ceph-admin:~# ceph health
> HEALTH_WARN 50 pgs stale; 50 pgs stuck stale
>
> root@ceph-admin:~# ceph health detail
> HEALTH_WARN 50 pgs stale; 50 pgs stuck stale; noout flag(s) set
> pg 34.225 is stuck stale for 98780.399254, current state stale+active+clean, last acting [21]
> pg 34.186 is stuck stale for 98780.399195, current state stale+active+clean, last acting [21]
> ...
>
> root@ceph-admin:~# ceph pg 34.225 query
> Error ENOENT: i don't have pgid 34.225
>
> root@ceph-admin:~# ceph pg 34.225 list_missing
> Error ENOENT: i don't have pgid 34.225
>
> root@ceph-admin:~# ceph osd lost 21 --yes-i-really-mean-it
> osd.21 is not down or doesn't exist
>
> # checking the crushmap
> ceph osd getcrushmap -o crush.map
> crushtool -d crush.map -o crush.txt
> root@ceph-admin:~# grep 21 crush.txt
> -> nothing here
>
> Of course, I cannot start OSD.21, because it's not available anymore - I removed it.
>
> Is there a way to remap the stuck PGs to other OSDs than osd.21?
>
> One more - I tried to recreate the pg, but now this pg is "stuck inactive":
>
> root@ceph-admin:~# ceph pg force_create_pg 34.225
> pg 34.225 now creating, ok
>
> root@ceph-admin:~# ceph health detail
> HEALTH_WARN 49 pgs stale; 1 pgs stuck inactive; 49 pgs stuck stale; 1 pgs stuck unclean
> pg 34.225 is stuck inactive since forever, current state creating, last acting []
> pg 34.225 is stuck unclean since forever, current state creating, last acting []
> pg 34.186 is stuck stale for 118481.013632, current state stale+active+clean, last acting [21]
> ...
>
> Maybe somebody has an idea how to fix this situation?
Re: [ceph-users] pg is stuck stale (osd.21 still removed) - SOLVED.
Well, ok - I found the solution:

ceph health detail
HEALTH_WARN 50 pgs stale; 50 pgs stuck stale
pg 34.225 is stuck inactive since forever, current state creating, last acting []
pg 34.225 is stuck unclean since forever, current state creating, last acting []
pg 34.226 is stuck stale for 77328.923060, current state stale+active+clean, last acting [21]
pg 34.3cb is stuck stale for 77328.923213, current state stale+active+clean, last acting [21]

root@ceph-admin:~# ceph pg map 34.225
osdmap e18263 pg 34.225 (34.225) -> up [16] acting [16]

After restarting osd.16, pg 34.225 is fine. So I recreated all the broken PGs:

for pg in `ceph health detail | grep stale | cut -d' ' -f2`; do
    ceph pg force_create_pg $pg
done

and restarted all (or the necessary) OSDs. Now the cluster is HEALTH_OK again:

root@ceph-admin:~# ceph health
HEALTH_OK

Best regards
Danny
Re: [ceph-users] pg is stuck stale (osd.21 still removed)
One more - I tried to recreate the pg, but now this pg is "stuck inactive":

root@ceph-admin:~# ceph pg force_create_pg 34.225
pg 34.225 now creating, ok

root@ceph-admin:~# ceph health detail
HEALTH_WARN 49 pgs stale; 1 pgs stuck inactive; 49 pgs stuck stale; 1 pgs stuck unclean
pg 34.225 is stuck inactive since forever, current state creating, last acting []
pg 34.225 is stuck unclean since forever, current state creating, last acting []
pg 34.186 is stuck stale for 118481.013632, current state stale+active+clean, last acting [21]
...

Maybe somebody has an idea how to fix this situation?

regards
Danny
[ceph-users] pg is stuck stale (osd.21 still removed)
Hi,

we had a HW problem with OSD.21 today. The OSD daemon was down and "smartctl" told me about some hardware errors.

I decided to remove the HDD:

ceph osd out 21
ceph osd crush remove osd.21
ceph auth del osd.21
ceph osd rm osd.21

But afterwards I saw that I have some stuck PGs for osd.21:

root@ceph-admin:~# ceph -w
    cluster c7b12656-15a6-41b0-963f-4f47c62497dc
     health HEALTH_WARN
            50 pgs stale
            50 pgs stuck stale
     monmap e4: 3 mons at {ceph-mon1=192.168.135.31:6789/0,ceph-mon2=192.168.135.32:6789/0,ceph-mon3=192.168.135.33:6789/0}
            election epoch 404, quorum 0,1,2 ceph-mon1,ceph-mon2,ceph-mon3
     mdsmap e136: 1/1/1 up {0=ceph-mon1=up:active}
     osdmap e18259: 23 osds: 23 up, 23 in
      pgmap v47879105: 6656 pgs, 10 pools, 23481 GB data, 6072 kobjects
            54974 GB used, 30596 GB / 85571 GB avail
                6605 active+clean
                  50 stale+active+clean
                   1 active+clean+scrubbing+deep

root@ceph-admin:~# ceph health
HEALTH_WARN 50 pgs stale; 50 pgs stuck stale

root@ceph-admin:~# ceph health detail
HEALTH_WARN 50 pgs stale; 50 pgs stuck stale; noout flag(s) set
pg 34.225 is stuck stale for 98780.399254, current state stale+active+clean, last acting [21]
pg 34.186 is stuck stale for 98780.399195, current state stale+active+clean, last acting [21]
...

root@ceph-admin:~# ceph pg 34.225 query
Error ENOENT: i don't have pgid 34.225

root@ceph-admin:~# ceph pg 34.225 list_missing
Error ENOENT: i don't have pgid 34.225

root@ceph-admin:~# ceph osd lost 21 --yes-i-really-mean-it
osd.21 is not down or doesn't exist

# checking the crushmap
ceph osd getcrushmap -o crush.map
crushtool -d crush.map -o crush.txt
root@ceph-admin:~# grep 21 crush.txt
-> nothing here

Of course, I cannot start OSD.21, because it's not available anymore - I removed it.

Is there a way to remap the stuck PGs to other OSDs than osd.21? How can I help my cluster (ceph 0.94.2)?

best regards
Danny
Re: [ceph-users] certificate of `ceph.com' is not trusted!
Hi,

I think the root CA (COMODO RSA Certification Authority) is not available on your Linux host? Using Google Chrome, connecting to https://ceph.com/ works fine.

regards
Danny

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Dietmar Maurer
Sent: Friday, February 13, 2015 8:10 AM
To: ceph-users
Subject: [ceph-users] certificate of `ceph.com' is not trusted!

I get the following error on standard Debian Wheezy:

# wget https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
--2015-02-13 07:19:04-- https://ceph.com/git/?p=ceph.git
Resolving ceph.com (ceph.com)... 208.113.241.137, 2607:f298:4:147::b05:fe2a
Connecting to ceph.com (ceph.com)|208.113.241.137|:443... connected.
ERROR: The certificate of `ceph.com' is not trusted.
ERROR: The certificate of `ceph.com' hasn't got a known issuer.

Previously, this worked without problem.
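If the intermediate/root CA is indeed missing on the Wheezy box, one hedged fix (not verified against that exact system) is to refresh the CA bundle before retrying wget:

    apt-get install --reinstall ca-certificates
    update-ca-certificates
    # last resort only, disables verification entirely:
    # wget --no-check-certificate https://ceph.com/...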
Re: [ceph-users] snapshoting on btrfs vs xfs
Hi Cristian,

> We will try to report back, but I'm not sure our use case is relevant.
> We are trying to use every dirty trick to speed up the VMs.

we have the same use case.

> The second pool is for the test machines and has the journal in RAM, so this part is very
> volatile. We don't really care, because if the worst happens and we have a power loss we
> just redo the pool and start new instances. Journal in RAM did wonders for us in terms of
> read/write speed.

How do you handle a reboot of a node that holds the journals for this pool in RAM? All the mons know about the volatile pool - do you remove and recreate the pool automatically after rebooting this node?

Did you try to enable rbd caching? Is there a write-performance benefit in using a journal in RAM instead of enabling rbd caching on the client (OpenStack) side? I thought that with rbd caching the write performance should be fast enough.

regards
Danny
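For reference, client-side rbd caching can be switched on in ceph.conf on the hypervisor - a minimal sketch (the values are illustrative, not tuned recommendations):

    [client]
        rbd cache = true
        rbd cache writethrough until flush = true
        rbd cache size = 67108864          # 64 MB cache per client
        rbd cache max dirty = 50331648     # 48 MB dirty limit

The VM's disk may also need cache=writeback in its libvirt/qemu definition, otherwise qemu can force the cache into writethrough mode.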
Re: [ceph-users] Number of SSD for OSD journal
Hallo Mike,

> This is also have another way.
> * for CONF 2,3 replace 200Gb SSD to 800Gb and add another 1-2 SSD to each node.
> * make tier1 read-write cache on SSDs
> * also you can add journal partition on them if you wish - then data will moving from SSD
>   to SSD before let down on HDD
> * on HDD you can make erasure pool or replica pool

Do you have some experience (performance?) with SSDs as a tier-1 cache? Maybe some small benchmarks? From the mailing list, I get the feeling that SSD tiering is not used much in production.

regards
Danny
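For anyone who wants to experiment, the basic commands to put an SSD pool in front of an HDD pool look roughly like this (pool names and the size limit are made up for illustration):

    ceph osd tier add hdd-pool ssd-cache
    ceph osd tier cache-mode ssd-cache writeback
    ceph osd tier set-overlay hdd-pool ssd-cache
    ceph osd pool set ssd-cache hit_set_type bloom
    ceph osd pool set ssd-cache target_max_bytes 200000000000   # ~200 GB before flush/evict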
Re: [ceph-users] Tip of the week: don't use Intel 530 SSD's for journals
> Hi. If you are like me, you have the journals for your OSDs with rotating media stored
> separately on an SSD. If you are even more like me, you happen to use Intel 530 SSDs in
> some of your hosts. If so, please do check your S.M.A.R.T. statistics regularly, because
> these SSDs really can't cope with Ceph.

We are using the check_smart_attributes (1,2) Nagios check to handle performance values and thresholds for the different HDD/SSD models of our Ceph cluster.

regards
Danny

(1) http://git.thomas-krenn.com/check_smart_attributes.git/
(2) http://www.thomas-krenn.com/de/wiki/SMART_Attributes_Monitoring_Plugin
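Independently of the Nagios plugin, the relevant wear counters can also be checked by hand with smartctl - the attribute names below are the ones Intel SSDs typically expose and may differ per model:

    smartctl -A /dev/sdX | grep -E 'Media_Wearout_Indicator|Host_Writes|Reallocated_Sector'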
Re: [ceph-users] osds fails to start with mismatch in id
Hi Ramakrishna,

we use the physical path (containing the serial number) to a disk to prevent complexity and wrong mapping. This path will never change:

/etc/ceph/ceph.conf
[osd.16]
    devs        = /dev/disk/by-id/scsi-SATA_ST4000NM0033-9Z_Z1Z0SDCY-part1
    osd_journal = /dev/disk/by-id/scsi-SATA_INTEL_SSDSC2BA1BTTV330609AU100FGN-part1
    ...

regards
Danny

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Irek Fasikhov
Sent: Tuesday, November 11, 2014 6:36 AM
To: Ramakrishna Nishtala (rnishtal); Gregory Farnum
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] osds fails to start with mismatch in id

Hi, Ramakrishna.

I think you understand what the problem is:

[ceph@ceph05 ~]$ cat /var/lib/ceph/osd/ceph-56/whoami
56
[ceph@ceph05 ~]$ cat /var/lib/ceph/osd/ceph-57/whoami
57

Tue Nov 11 2014 at 6:01:40, Ramakrishna Nishtala (rnishtal) <rnish...@cisco.com>:

Hi Greg,

Thanks for the pointer. I think you are right. The full story is like this. After installation, everything works fine until I reboot. I do observe udevadm getting triggered in the logs, but the devices do not come up after reboot. Exact issue as http://tracker.ceph.com/issues/5194. But this has been fixed a while back per the case details.

As a workaround, I copied the contents of /proc/mounts to fstab and that's where I landed into the issue. After your suggestion, I defined them as UUIDs in fstab, but similar problem. blkid.tab has now moved to tmpfs and also isn't consistent, even after issuing blkid explicitly to get the UUIDs. Goes in line with the ceph-disk comments.

Decided to reinstall, dd the partitions, zap disks etc. Did not help. Very weird that the links below change in /dev/disk/by-uuid and /dev/disk/by-partuuid etc.

Before reboot:

lrwxrwxrwx 1 root root 10 Nov 10 06:31 11aca3e2-a9d5-4bcc-a5b0-441c53d473b6 -> ../../sdd2
lrwxrwxrwx 1 root root 10 Nov 10 06:31 89594989-90cb-4144-ac99-0ffd6a04146e -> ../../sde2
lrwxrwxrwx 1 root root 10 Nov 10 06:31 c17fe791-5525-4b09-92c4-f90eaaf80dc6 -> ../../sda2
lrwxrwxrwx 1 root root 10 Nov 10 06:31 c57541a1-6820-44a8-943f-94d68b4b03d4 -> ../../sdc2
lrwxrwxrwx 1 root root 10 Nov 10 06:31 da7030dd-712e-45e4-8d89-6e795d9f8011 -> ../../sdb2

After reboot:

lrwxrwxrwx 1 root root 10 Nov 10 09:50 11aca3e2-a9d5-4bcc-a5b0-441c53d473b6 -> ../../sdd2
lrwxrwxrwx 1 root root 10 Nov 10 09:50 89594989-90cb-4144-ac99-0ffd6a04146e -> ../../sde2
lrwxrwxrwx 1 root root 10 Nov 10 09:50 c17fe791-5525-4b09-92c4-f90eaaf80dc6 -> ../../sda2
lrwxrwxrwx 1 root root 10 Nov 10 09:50 c57541a1-6820-44a8-943f-94d68b4b03d4 -> ../../sdb2
lrwxrwxrwx 1 root root 10 Nov 10 09:50 da7030dd-712e-45e4-8d89-6e795d9f8011 -> ../../sdh2

Essentially, the transformation here is sdb2 -> sdh2 and sdc2 -> sdb2. In fact, I hadn't partitioned my sdh at all before the test. The only difference from the standard procedure is probably that I pre-created the partitions for journal and data with parted.

The osd rules in /lib/udev/rules.d have four different partition GUID codes:
45b0969e-9b03-4f30-b4c6-5ec00ceff106, 45b0969e-9b03-4f30-b4c6-b4b80ceff106,
4fbd7e29-9d25-41b8-afd0-062c0ceff05d, 4fbd7e29-9d25-41b8-afd0-5ec00ceff05d

But all my partitions (journal/data) have ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 as partition GUID code.

Appreciate any help.

Regards,
Rama

-----Original Message-----
From: Gregory Farnum [mailto:g...@gregs42.com]
Sent: Sunday, November 09, 2014 3:36 PM
To: Ramakrishna Nishtala (rnishtal)
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] osds fails to start with mismatch in id

On Sun, Nov 9, 2014 at 3:21 PM, Ramakrishna Nishtala (rnishtal) <rnish...@cisco.com> wrote:
> Hi,
>
> I am on ceph 0.87, RHEL 7. Out of 60, a few OSDs start and the rest complain about
> mismatched IDs as below.
>
> 2014-11-09 07:09:55.501177 7f4633e01880 -1 OSD id 56 != my id 53
> 2014-11-09 07:09:55.810048 7f636edf4880 -1 OSD id 57 != my id 54
> 2014-11-09 07:09:56.122957 7f459a766880 -1 OSD id 58 != my id 55
> 2014-11-09 07:09:56.429771 7f87f8e0c880 -1 OSD id 0 != my id 56
> 2014-11-09 07:09:56.741329 7fadd9b91880 -1 OSD id 2 != my id 57
>
> Found one OSD ID in /var/lib/ceph/cluster-id/keyring. To check this out I manually
> corrected it and turned authentication to none too, but it did not help. Any clues how
> it can be corrected?

It sounds like maybe the symlinks to data and journal aren't matching up with where they're supposed to be. This is usually a result of using unstable /dev links that don't always match to the same physical disks. Have you checked that?
-Greg
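To find the stable by-id path for a given device (so it can be put into ceph.conf as shown at the top of this mail), something like this should work on most distributions:

    ls -l /dev/disk/by-id/ | grep sdd
    udevadm info --query=symlink --name=/dev/sdd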
[ceph-users] Statistic information about rbd bandwith/usage (from a rbd/kvm client)
Hi,

is there a possibility to see which rbd device (used by a KVM hypervisor) produces high load on a Ceph cluster? "ceph -w" shows only the total usage - but I don't see which client or rbd device is responsible for this load.

best regards
Danny
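There was no direct per-image counter in Ceph at that time; two hedged workarounds (domain/device names below are placeholders): libvirt can report per-disk I/O of a VM on the hypervisor, and librbd clients can expose perf counters through an admin socket if one is configured in the [client] section of ceph.conf:

    # per-disk I/O of a running VM (libvirt)
    virsh domblkstat myvm vda

    # librbd perf counters via admin socket (requires "admin socket = ..." for the client)
    ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok perf dump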
[ceph-users] RBD - possible to query used space of images/clones ?
Hi,

is there a way to query the used space of an RBD image created with format 2 (used for KVM)? Also, if I create a linked clone based on this image, how do I get the additional, individual used space of this clone?

In ZFS, I can query this kind of information by calling "zfs get all" (1). "rbd info" (2) does not show that much information about the image.

best regards
Danny

---
(1) Output of "zfs get all" from a Solaris system

root@storage19:~# zfs get all pool5/w2k8.dsk
NAME            PROPERTY              VALUE                  SOURCE
pool5/w2k8.dsk  available             75,3G                  -
pool5/w2k8.dsk  checksum              on                     default
pool5/w2k8.dsk  compression           off                    default
pool5/w2k8.dsk  compressratio         1.00x                  -
pool5/w2k8.dsk  copies                1                      default
pool5/w2k8.dsk  creation              Di. Mai 10 14:44 2011  -
pool5/w2k8.dsk  dedup                 off                    default
pool5/w2k8.dsk  encryption            off                    -
pool5/w2k8.dsk  keychangedate         -                      default
pool5/w2k8.dsk  keysource             none                   default
pool5/w2k8.dsk  keystatus             none                   -
pool5/w2k8.dsk  logbias               latency                default
pool5/w2k8.dsk  primarycache          all                    default
pool5/w2k8.dsk  readonly              off                    default
pool5/w2k8.dsk  referenced            17,4G                  -
pool5/w2k8.dsk  refreservation        none                   default
pool5/w2k8.dsk  rekeydate             -                      default
pool5/w2k8.dsk  reservation           none                   default
pool5/w2k8.dsk  secondarycache        all                    default
pool5/w2k8.dsk  sync                  standard               default
pool5/w2k8.dsk  type                  volume                 -
pool5/w2k8.dsk  used                  18,5G                  -
pool5/w2k8.dsk  usedbychildren        0                      -
pool5/w2k8.dsk  usedbydataset         17,4G                  -
pool5/w2k8.dsk  usedbyrefreservation  0                      -
pool5/w2k8.dsk  usedbysnapshots       1,15G                  -
pool5/w2k8.dsk  volblocksize          8K                     -
pool5/w2k8.dsk  volsize               25G                    local
pool5/w2k8.dsk  zoned                 off                    default

(2) Output of "rbd info"

[root@ceph-admin2 ~]# rbd info rbd/myimage-1
rbd image 'myimage-1':
        size 50000 MB in 12500 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.11e82ae8944a
        format: 2
        features: layering
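One common workaround (a sketch, not an official rbd feature at the time) is to let "rbd diff" list the allocated extents of the image and sum them up:

    rbd diff rbd/myimage-1 | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'

For a clone it is less clear whether this includes extents served from the parent image, so for the "individual" usage of a clone the number should be treated as an approximation.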
Re: [ceph-users] Replacing a disk: Best practices?
Hi,

> I recently had an OSD disk die, and I'm wondering what are the current best practices for
> replacing it. I think I've thoroughly removed the old disk, both physically and logically,
> but I'm having trouble figuring out how to add the new disk into ceph.

I did this today (one disk - osd.16 - died ;-):

# @ceph-node3
/etc/init.d/ceph stop osd.16

# delete osd.16
ceph osd crush remove osd.16
ceph auth del osd.16
ceph osd rm osd.16

# remove the HDD, plug in the new HDD
# /var/log/messages tells me:
Oct 15 09:51:09 ceph-node3 kernel: [1489736.671840] sd 0:0:0:0: [sdd] Synchronizing SCSI cache
Oct 15 09:51:09 ceph-node3 kernel: [1489736.671873] sd 0:0:0:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Oct 15 09:54:56 ceph-node3 kernel: [1489963.094744] sd 0:0:8:0: Attached scsi generic sg4 type 0
Oct 15 09:54:56 ceph-node3 kernel: [1489963.095235] sd 0:0:8:0: [sdd] 7814037168 512-byte logical blocks: (4.00 TB/3.63 TiB)
Oct 15 09:54:57 ceph-node3 kernel: [1489963.343664] sd 0:0:8:0: [sdd] Attached SCSI disk   -> /dev/sdd

# check /dev/sdd
root@ceph-node3:~# smartctl -a /dev/sdd | less
=== START OF INFORMATION SECTION ===
Device Model:     ST4000NM0033-9ZM170
Serial Number:    Z1Z5LGBX
LU WWN Device Id: 5 000c50 079577e1a
Firmware Version: SN04
User Capacity:    4.000.787.030.016 bytes [4,00 TB]
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  4 Start_Stop_Count        0x0032   100   100   020    Old_age  Always  -           1
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail Always  -           0
-> ok

# the new /dev/sdd uses the absolute path: /dev/disk/by-id/scsi-SATA_ST4000NM0033-9Z_Z1Z5LGBX

# create the new OSD (with the old journal partition)
admin@ceph-admin:~/cluster1$ ceph-deploy osd create ceph-node3:sdd:/dev/disk/by-id/scsi-SATA_INTEL_SSDSC2BA1BTTV330609AU100FGN-part1
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/admin/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.17): /usr/bin/ceph-deploy osd create ceph-node3:sdd:/dev/disk/by-id/scsi-SATA_INTEL_SSDSC2BA1BTTV330609AU100FGN-part1
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks ceph-node3:/dev/sdd:/dev/disk/by-id/scsi-SATA_INTEL_SSDSC2BA1BTTV330609AU100FGN-part1
...
[ceph_deploy.osd][DEBUG ] Host ceph-node3 is now ready for osd use.

# @ceph-admin modify config
admin@ceph-admin:~/cluster1$ ceph osd tree
...
admin@ceph-admin:~/cluster1$ emacs -nw ceph.conf
# osd.16 was replaced
[osd.16]
    ...
    devs = /dev/disk/by-id/scsi-SATA_ST4000NM0033-9Z_Z1Z5LGBX-part1
    ...

# deploy the config
ceph-deploy --overwrite-conf config push ceph-mon{1,2,3} ceph-node{1,2,3} ceph-admin

# enable cluster rebalancing again
ceph osd unset noout

# check
ceph -w

regards
Danny
Re: [ceph-users] Replacing a disk: Best practices?
Loic,

> > root@ceph-node3:~# smartctl -a /dev/sdd | less
> > === START OF INFORMATION SECTION ===
> > Device Model:     ST4000NM0033-9ZM170
> > Serial Number:    Z1Z5LGBX
> > ..
> > admin@ceph-admin:~/cluster1$ emacs -nw ceph.conf
> > [osd.16]
> >     ...
> >     devs = /dev/disk/by-id/scsi-SATA_ST4000NM0033-9Z_Z1Z5LGBX-part1
>
> I'm curious about what this is used for.

The normal device path /dev/sdd1 can change depending on the amount/order of disks/controllers. So, using the scsi path (containing the serial number) is always unique:

root@ceph-node3:~# ls -altr /dev/sdd1
brw-rw---T 1 root disk 8, 49 Okt 15 10:06 /dev/sdd1
root@ceph-node3:~# ls -altr /dev/disk/by-id/scsi-SATA_ST4000NM0033-9Z_Z1Z5LGBX-part1
lrwxrwxrwx 1 root root 10 Okt 15 10:06 /dev/disk/by-id/scsi-SATA_ST4000NM0033-9Z_Z1Z5LGBX-part1 -> ../../sdd1

regards
Danny
Re: [ceph-users] Bad Write-Performance on Ceph/Possible bottlenecks?
Try to create e.g. 20 (small) rbd devices, put them all into an LVM VG and create a logical volume (raid0) with 20 stripes and e.g. a stripe size of 1 MB (better bandwidth) or 4 kB (better IO) - or use md-raid0 (it's maybe 10% faster, but not that flexible).

BTW - we use this approach for VMware:
- one LVM LV (raid0: 20 stripes, 1 MB stripe size) LUN, based on
- one VG containing 20 rbd's (each 40 GB), based on
- a ceph pool with 24 OSDs, 3 replicas, inside our
- ceph cluster (3 nodes x 8 x 4 TB OSDs, 2 x 10 GBit),
- published by SCST (fibre channel, 4 GBit QLA) to vSphere ESX.

IOmeter (one worker, one disk) inside a w2k8r2 VM @esx tells me:

iometer: 270/360 MB/sec write/read (1 MByte block size, 4 outstanding IOs)

And - important - other VMs share the bandwidth of the 20 rbd volumes, so now our 4 GBit fibre channel is the bottleneck, not the (one) rbd volume anymore.

Also, we will add a flashcache in front of the raid0 LV to boost the 4k IOs - at the moment, 4k is terribly slow:

iometer: 4/14 MB/sec write/read (4k block size, 8 outstanding IOs)

With a 10 GByte flashcache, it's about:

iometer: 14/60 MB/sec write/read (4k block size, 8 outstanding IOs)

regards
Danny
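A rough sketch of how such a flashcache device could be placed in front of the striped LV (device names are placeholders and the syntax follows the flashcache admin guide, not this thread):

    # writeback cache "cachedev" using an SSD partition in front of the LVM raid0 volume
    flashcache_create -p back cachedev /dev/sdb1 /dev/mapper/vg_rbd-lv_raid0
    # the cached block device then appears as /dev/mapper/cachedev and is what gets exported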
[ceph-users] change order of an rbd image ?
Hi,

I created a 1 TB rbd image formatted with VMFS (VMware) for an ESX server - but with a wrong order (25 instead of 22 ...). The rbd man page tells me that for export/import/cp, rbd will use the order of the source image.

Is there a way to change the order of an rbd image by doing some conversion? Ok - one idea could be to 'dd' the 1 TB mapped rbd device to the same mounted filesystem - but is this the only way?

best regards
Danny
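One hedged possibility (worth testing on a scratch image first, since whether --order is honoured on import may depend on the installed rbd version; image names and the staging path are placeholders):

    # export the image to a file, then re-import it with the desired order
    rbd export rbd/esx-vol /backup/esx-vol.img
    rbd import --image-format 2 --order 22 /backup/esx-vol.img rbd/esx-vol-order22

Otherwise, creating a new image with --order 22, mapping both and copying with dd remains the straightforward route.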
Re: [ceph-users] RBD+KVM problems with sequential read
setup a bigger value for read_ahead_kb ? I tested with 256 MB read ahead cache (

From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ирек Фасихов
Sent: Friday, February 07, 2014 10:55 AM
To: Konrad Gutkowski
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] RBD+KVM problems with sequential read

echo noop > /sys/block/vda/queue/scheduler
echo 1000 > /sys/block/vda/queue/nr_requests
echo 8192 > /sys/block/vda/queue/read_ahead_kb

[root@nfs tmp]# dd if=test of=/dev/null
39062500+0 records in
39062500+0 records out
20000000000 bytes (20 GB) copied, 244.024 s, 82.0 MB/s

Changing these parameters had no effect... Are there other ideas on this problem? Thank you.

2014-02-07, Konrad Gutkowski <konrad.gutkow...@ffs.pl>:

Hi,

On 07.02.2014 at 08:14, Ирек Фасихов <malm...@gmail.com> wrote:
[...]

> Why might such a low speed sequential read? Do ideas on this issue?

Iirc you need to set your readahead for the device higher (inside the vm) to compensate for network rtt.

blockdev --setra x /dev/vda

Thanks.

--
Best regards, Фасихов Ирек Нургаязович
Mob.: +79229045757

Regards,
Konrad Gutkowski

--
Best regards, Фасихов Ирек Нургаязович
Mob.: +79229045757
Re: [ceph-users] RBD+KVM problems with sequential read
> I'm sorry, but I did not understand you :)

Sorry (-: My finger touched the RETURN key too fast...

Try to set up a bigger value for the read ahead cache, maybe 256 MB?

echo 262144 > /sys/block/vda/queue/read_ahead_kb

Try also the fio performance tool - it will show more detailed information:

[global]
ioengine=libaio
invalidate=1
ramp_time=5
#exec_prerun=echo 3 > /proc/sys/vm/drop_caches
iodepth=16
runtime=30
time_based
direct=1
bs=1m
filename=/dev/vda

[seq-write]
stonewall
rw=write

[seq-read]
stonewall
rw=read

Compare the fio result with a fio test against the mounted rbd volume (filename=/dev/rbdX) on your KVM physical host (not inside the VM).

Try this also:

echo 3 > /proc/sys/vm/drop_caches

best regards
Danny
Re: [ceph-users] Crush Maps
Hallo Bradley,

additionally to your question, I'm interested in the following:

5) Can I change all 'type' IDs when adding a new type "host-slow" to distinguish between OSDs with the journal on the same HDD and on a separate SSD? E.g. from

type 0 osd
type 1 host
type 2 rack
..

to

type 0 osd
type 1 host
type 2 host-slow
type 3 rack
..

6) After importing the crush map into the cluster, how can I start rebalancing all existing pools? (This is because all OSDs are now mixed up to other locations in the crush hierarchy.)

best regards
Danny

From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of McNamara, Bradley

1) Do the IDs of the bucket types need to be consecutive, or can I make them up as long as they are negative in value and unique?

2) Is there any way that I can control the assignment of the bucket type IDs if I were to update the crushmap on a running system using the CLI?

3) Is there any harm in adding bucket types that are not currently used, but assigning them a weight of 0, so they aren't used (a row defined, with racks, but the racks have no hosts defined)?

4) Can I have a bucket type with no item lines in it, or does each bucket type need at least one item declaration to be valid?
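For reference, the usual edit/compile/inject cycle for crush map changes like these looks roughly as follows (a generic sketch, not specific to either map discussed here):

    ceph osd getcrushmap -o crush.map
    crushtool -d crush.map -o crush.txt
    # edit crush.txt (types, buckets, rules) ...
    crushtool -c crush.txt -o crush.new
    crushtool -i crush.new --test --show-statistics --rule 0 --num-rep 3   # sanity-check placement
    ceph osd setcrushmap -i crush.new

Injecting the new map triggers rebalancing of the affected pools by itself; progress can be watched with "ceph -w".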
Re: [ceph-users] Meetup in Frankfurt, before the Ceph day
Hi,

> Anyone else from ceph community, willing to join?

I will also visit my first Ceph day (-:

See you all in Frankfurt!

best regards
Danny
[ceph-users] monitor can not rejoin the cluster
Hi all,

my monitor3 is not able to rejoin the cluster (containing mon1, mon2 and mon3 - running stable emperor). I tried to recreate and inject a new monmap on all 3 mons - but only mon1 and mon2 are up and joined.

Enabling debugging on mon3, I got the following:

2014-01-30 08:51:03.823669 7f39b3f56700 10 mon.ceph-mon3@2(probing) e3 handle_probe_reply mon.1 192.168.135.32:6789/0 mon_probe(reply c7b12656-15a6-41b0-963f-4f47c62497dc name ceph-mon2 quorum 0,1 paxos( fc 1 lc 160 )) v5
2014-01-30 08:51:03.823678 7f39b3f56700 10 mon.ceph-mon3@2(probing) e3 monmap is e3: 3 mons at {mon.ceph-mon1=192.168.135.31:6789/0,mon.ceph-mon2=192.168.135.32:6789/0,mon.ceph-mon3=192.168.135.33:6789/0}
2014-01-30 08:51:03.823701 7f39b3f56700 10 mon.ceph-mon3@2(probing) e3 peer name is mon.ceph-mon2
2014-01-30 08:51:03.823706 7f39b3f56700 10 mon.ceph-mon3@2(probing) e3 existing quorum 0,1
2014-01-30 08:51:03.823708 7f39b3f56700 10 mon.ceph-mon3@2(probing) e3 peer paxos version 160 vs my version 154 (ok)
2014-01-30 08:51:03.823711 7f39b3f56700 10 mon.ceph-mon3@2(probing) e3 ready to join, but i'm not in the monmap or my addr is blank, trying to join

But why is mon3 "not in the monmap"?

Checking the sources (https://github.com/ceph/ceph/blob/emperor/src/mon/Monitor.cc):

    if (monmap->contains(name) &&
        !monmap->get_addr(name).is_blank_ip()) {
      // i'm part of the cluster; just initiate a new election
      start_election();
    } else {
      dout(10) << " ready to join, but i'm not in the monmap or my addr is blank, trying to join" << dendl;
      messenger->send_message(new MMonJoin(monmap->fsid, name, messenger->get_myaddr()),
                              monmap->get_inst(*m->quorum.begin()));
    }

My map on mon3 looks like:

root@ceph-mon3:/var/log/ceph# ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-mon3.asok mon_status
{ "name": "ceph-mon3",
  "rank": 2,
  "state": "probing",
  "election_epoch": 0,
  "quorum": [],
  "outside_quorum": [],
  "extra_probe_peers": [],
  "sync_provider": [],
  "monmap": { "epoch": 3,
      "fsid": "c7b12656-15a6-41b0-963f-4f47c62497dc",
      "modified": "2014-01-30 08:27:28.808771",
      "created": "2014-01-30 08:27:28.808771",
      "mons": [
            { "rank": 0,
              "name": "mon.ceph-mon1",
              "addr": "192.168.135.31:6789\/0"},
            { "rank": 1,
              "name": "mon.ceph-mon2",
              "addr": "192.168.135.32:6789\/0"},
            { "rank": 2,
              "name": "mon.ceph-mon3",
              "addr": "192.168.135.33:6789\/0"}]}}

So the condition (monmap->contains(name) && !monmap->get_addr(name).is_blank_ip()) should hold, or? But start_election() is not called.

Can somebody help me here?

regards
Danny

More infos about mon3:

root@ceph-mon3:/var/log/ceph# hostname -a
ceph-mon3
root@ceph-mon3:/var/log/ceph# netstat -tulpen | grep ceph-mon
tcp    0    0 192.168.135.33:6789    0.0.0.0:*    LISTEN    0    635369    2164/ceph-mon
root@ceph-mon3:/var/log/ceph# cat /etc/hosts
127.0.0.1       localhost
192.168.135.33  ceph-mon3.dtnet.de ceph-mon3

admin@ceph-admin:~/cluster1$ ceph -s
    cluster c7b12656-15a6-41b0-963f-4f47c62497dc
     health HEALTH_WARN 192 pgs degraded; 192 pgs stale; 192 pgs stuck stale; 192 pgs stuck unclean; 1 mons down, quorum 0,1 ceph-mon1,ceph-mon2
     monmap e3: 3 mons at {ceph-mon1=192.168.135.31:6789/0,ceph-mon2=192.168.135.32:6789/0,ceph-mon3=192.168.135.33:6789/0}, election epoch 230, quorum 0,1 ceph-mon1,ceph-mon2
     osdmap e28: 1 osds: 1 up, 1 in
      pgmap v38: 192 pgs, 3 pools, 0 bytes data, 0 objects
            36388 kB used, 3724 GB / 3724 GB avail
                 192 stale+active+degraded
Re: [ceph-users] monitor can not rejoin the cluster
OK - found the problem:

root@ceph-mon3:~# ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-mon3.asok mon_status
{ "name": "ceph-mon3",
  ..
  "mons": [
        { "rank": 2,
          "name": "mon.ceph-mon3",      <- NAME is wrong
          "addr": "192.168.135.33:6789\/0"}]}}

In the docs (http://ceph.com/docs/master/man/8/monmaptool/) the creation of the monmap is described like:

monmaptool --create --add mon.a 192.168.0.10:6789 --add mon.b 192.168.0.11:6789 \
           --add mon.c 192.168.0.12:6789 --clobber monmap

But it must be:

monmaptool --create --add a 192.168.0.10:6789 --add b 192.168.0.11:6789 \
           --add c 192.168.0.12:6789 --clobber monmap

So, I recreated my monmap again:

monmaptool --create --add ceph-mon1 192.168.135.31:6789 --add ceph-mon2 192.168.135.32:6789 \
           --add ceph-mon3 192.168.135.33:6789 --fsid c7b12656-15a6-41b0-963f-4f47c62497dc --clobber monmap

and re-injected it into all my mons:

ceph-mon -i ceph-mon3 --inject-monmap /tmp/monmap

Now all looks fine:

root@ceph-mon3:~# ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-mon3.asok mon_status
{ "name": "ceph-mon3",
  "rank": 2,
  "state": "peon",
  "election_epoch": 236,
  "quorum": [ 0, 1, 2],

Maybe somebody will adjust the docs at http://ceph.com/docs/master/man/8/monmaptool/ ?

thx
Danny
[ceph-users] udev names /dev/sd* - what happens if they change ?
Hi,

just a small question: creating a new OSD, I use e.g.

ceph-deploy osd create ceph-node1:sdg:/dev/sdb5

Question: what happens if the mapping of my disks changes (e.g. because of adding new disks to the server):

sdg becomes sdh
sdb becomes sdc

Is this handled (how?) by Ceph? I cannot find any udev rules for dev mapping... Or is it in my scope to add persistent udev rules?

thx
Danny
Re: [ceph-users] odd performance graph
Hi,

> The low points are all ~35 Mbytes/sec and the high points are all ~60 Mbytes/sec. This is
> very reproducible. It occurred to me that just stopping the OSD's selectively would allow
> me to see if there was a change when one was ejected, but at no time was there a change
> to the graph...

Did you configure the pool with 3 copies and try to run the benchmark test with only one OSD? Can you reproduce the values for each OSD?

Also, while doing the benchmarks, check the native IO performance on the Linux side with e.g. iostat (HDD) or iperf (network). Additionally, you can use other benchmark tools like bonnie, fio or the Ceph benchmark on Linux to get values that are not intercepted by the abstract storage layer of a Windows virtual machine (running HDTach).

regards
Danny
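A hedged example of the Ceph-side benchmarks mentioned above (pool name is a placeholder):

    # raw cluster write/read throughput
    rados bench -p testpool 60 write --no-cleanup
    rados bench -p testpool 60 seq

    # rough per-OSD write speed (writes ~1 GB of dummy data per call)
    ceph tell osd.0 bench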
Re: [ceph-users] problem with delete or rename a pool
Hi,

> The problem is: now I want to delete or rename the pool '-help'

Maybe try using a double hyphen (--) [1], e.g. something (not tested) like:

ceph osd pool rename -- -help aaa
ceph osd pool delete -- -help

regards
Danny

[1] http://unix.stackexchange.com/questions/11376/what-does-double-dash-mean
    http://stackoverflow.com/questions/14052892/delete-a-file-in-linux-that-contains-double-dash
Re: [ceph-users] How to replace a failed OSD
Hi Robert,

> What is the easiest way to replace a failed disk / OSD? It looks like the documentation
> here is not really compatible with ceph-deploy:
> http://ceph.com/docs/master/rados/operations/add-or-rm-osds/

I found the following thread useful:
http://www.spinics.net/lists/ceph-users/msg05854.html

best regards
Danny