[ceph-users] ceph can't recognize ext4 extended attributes when --mkfs --mkkey
ceph version 0.80.1
System: CentOS 6.5

[root@dn1 osd.6]# mount
/dev/sde1 on /cache4 type ext4 (rw,noatime,user_xattr)  <-- osd.6
/dev/sdf1 on /cache5 type ext4 (rw,noatime,user_xattr)  <-- osd.7
/dev/sdg1 on /cache6 type ext4 (rw,noatime,user_xattr)  <-- osd.8
/dev/sdh1 on /cache7 type ext4 (rw,noatime,user_xattr)  <-- osd.9
/dev/sdi1 on /cache8 type ext4 (rw,noatime,user_xattr)  <-- osd.10
/dev/sdj1 on /cache9 type ext4 (rw,noatime,user_xattr)  <-- osd.11

[root@dn1 osd.6]# ceph-osd -i 6 --mkfs --mkkey
2015-03-03 15:52:12.156548 7fba6de2b7a0 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2015-03-03 15:52:12.468304 7fba6de2b7a0 -1 filestore(/cache4/osd.6) Extended attributes don't appear to work. Got error (95) Operation not supported. If you are using ext3 or ext4, be sure to mount the underlying file system with the 'user_xattr' option.
2015-03-03 15:52:12.468367 7fba6de2b7a0 -1 filestore(/cache4/osd.6) FileStore::mount : error in _detect_fs: (95) Operation not supported
2015-03-03 15:52:12.468387 7fba6de2b7a0 -1 OSD::mkfs: couldn't mount ObjectStore: error -95
2015-03-03 15:52:12.468470 7fba6de2b7a0 -1 ** ERROR: error creating empty object store in /cache4/osd.6: (95) Operation not supported

[root@dn1 osd.6]# tail -f /var/log/ceph/osd.6.log
2015-03-03 15:52:11.770484 7fba6de2b7a0  0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd, pid 30336
2015-03-03 15:52:12.156548 7fba6de2b7a0 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2015-03-03 15:52:12.224362 7fba6de2b7a0  0 filestore(/cache4/osd.6) mkjournal created journal on /cache4/osd.6/journal
2015-03-03 15:52:12.274706 7fba6de2b7a0  0 genericfilestorebackend(/cache4/osd.6) detect_features: FIEMAP ioctl is supported and appears to work
2015-03-03 15:52:12.274733 7fba6de2b7a0  0 genericfilestorebackend(/cache4/osd.6) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-03-03 15:52:12.468181 7fba6de2b7a0  0 genericfilestorebackend(/cache4/osd.6) detect_features: syscall(SYS_syncfs, fd) fully supported
2015-03-03 15:52:12.468304 7fba6de2b7a0 -1 filestore(/cache4/osd.6) Extended attributes don't appear to work. Got error (95) Operation not supported. If you are using ext3 or ext4, be sure to mount the underlying file system with the 'user_xattr' option.
2015-03-03 15:52:12.468367 7fba6de2b7a0 -1 filestore(/cache4/osd.6) FileStore::mount : error in _detect_fs: (95) Operation not supported
2015-03-03 15:52:12.468387 7fba6de2b7a0 -1 OSD::mkfs: couldn't mount ObjectStore: error -95
2015-03-03 15:52:12.468470 7fba6de2b7a0 -1 ** ERROR: error creating empty object store in /cache4/osd.6: (95) Operation not supported

Thanks!
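A quick way to narrow this down is to check by hand whether user xattrs really work on the OSD data directory before running --mkfs. A minimal sketch, using the paths from the post (the attribute name user.test is arbitrary, not something Ceph requires):

setfattr -n user.test -v ceph /cache4/osd.6 && getfattr -n user.test /cache4/osd.6

# If setfattr itself fails with "Operation not supported", remount the
# filesystem with the user_xattr option and retry:
mount -o remount,user_xattr /cache4

If the manual setfattr succeeds but ceph-osd still reports error 95, the problem is likely elsewhere (for example, the OSD data path not actually living on the ext4 mount it appears to).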
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
What is your replication count?

2015-03-03 15:14 GMT+03:00 Andrija Panic andrija.pa...@gmail.com:

Hi Irek,

Yes, stopping the OSD (or setting it to OUT) resulted in only 3% of data degraded and moved/recovered. It was when I afterwards removed it from the CRUSH map (ceph osd crush rm id) that the 37% happened.

And thanks for the help, Irek - could you kindly let me know the preferred steps when removing a whole node? Do you mean I should first stop all OSDs again, or just remove each OSD from the CRUSH map, or perhaps decompile the CRUSH map, delete the node completely, compile it back in, and let it heal/recover? Do you think this would result in less data being misplaced and moved around?

Sorry for bugging you, I really appreciate your help.

Thanks

On 3 March 2015 at 12:58, Irek Fasikhov malm...@gmail.com wrote:

The large percentage comes from rebuilding the cluster map (but the degradation percentage is low). If you had not run ceph osd crush rm id, the percentage would be low. In your case, the correct option is to remove the entire node, rather than each disk individually.

2015-03-03 14:27 GMT+03:00 Andrija Panic andrija.pa...@gmail.com:

Another question - I mentioned here 37% of objects being moved around; these are MISPLACED objects (degraded objects were 0.001%), after I removed 1 OSD from the CRUSH map (out of 44 OSDs or so).

Can anybody confirm this is normal behaviour, and are there any workarounds? I understand this is because of Ceph's object placement algorithm, but still, 37% of objects misplaced just by removing 1 OSD out of 44 from the CRUSH map makes me wonder why the percentage is so large. It seems not good to me, and I have to remove another 7 OSDs (we are demoting some old hardware nodes). This means I could potentially see 7 x the same number of misplaced objects...?

Any thoughts? Thanks

On 3 March 2015 at 12:14, Andrija Panic andrija.pa...@gmail.com wrote:

Thanks Irek. Does this mean that after peering for each PG there will be a delay of 10 sec, meaning that every once in a while I will have 10 sec of the cluster NOT being stressed/overloaded, then the recovery takes place for that PG, then for another 10 sec the cluster is fine, and then it is stressed again?

I'm trying to understand the process before actually doing anything (the config reference is there on ceph.com but I don't fully understand the process).

Thanks, Andrija

On 3 March 2015 at 11:32, Irek Fasikhov malm...@gmail.com wrote:

Hi. Use the value osd_recovery_delay_start. Example:

[root@ceph08 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.94.asok config show | grep osd_recovery_delay_start
  osd_recovery_delay_start: 10

2015-03-03 13:13 GMT+03:00 Andrija Panic andrija.pa...@gmail.com:

Hi guys,

Yesterday I removed 1 OSD from the cluster (out of 42 OSDs), and it caused over 37% of the data to rebalance - let's say this is fine (this is when I removed it from the CRUSH map).

I'm wondering - I had previously set some throttling mechanisms, but during the first 1h of rebalancing my recovery rate was going up to 1500 MB/s and the VMs were completely unusable; then for the last 4h of the recovery the rate went down to, say, 100-200 MB/s, and during this time VM performance was still heavily impacted, but at least I could work more or less.

So my question: is this behaviour expected, and is the throttling here working as expected? During the first 1h almost no throttling seemed to be applied, judging by the 1500 MB/s recovery rate and the impact on the VMs, while the last 4h seemed pretty fine (although still a lot of impact in general).

I changed these throttling settings on the fly with:

ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
ceph tell osd.* injectargs '--osd_max_backfills 1'

My journals are on SSDs (12 OSDs per server, of which 6 journals are on one SSD and 6 journals on another SSD) - I have 3 of these hosts.

Any thoughts are welcome.

--
Andrija Panić

--
Best regards, Irek Fasikhov
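For reference, the same throttles (plus the osd_recovery_delay_start value Irek quotes) can also be made persistent in ceph.conf rather than injected on the fly. A sketch, assuming the values discussed in this thread rather than recommended defaults:

[osd]
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1
  osd recovery delay start = 10

The runtime equivalent of the last setting would be something like:

ceph tell osd.* injectargs '--osd_recovery_delay_start 10'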
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
Since you only have three nodes in the cluster, I recommend you add the new nodes to the cluster first, and then delete the old ones.

2015-03-03 15:28 GMT+03:00 Irek Fasikhov malm...@gmail.com:

What is your replication count?

2015-03-03 15:14 GMT+03:00 Andrija Panic andrija.pa...@gmail.com:

Hi Irek,

Yes, stopping the OSD (or setting it to OUT) resulted in only 3% of data degraded and moved/recovered. It was when I afterwards removed it from the CRUSH map (ceph osd crush rm id) that the 37% happened.

And thanks for the help, Irek - could you kindly let me know the preferred steps when removing a whole node? Do you mean I should first stop all OSDs again, or just remove each OSD from the CRUSH map, or perhaps decompile the CRUSH map, delete the node completely, compile it back in, and let it heal/recover? Do you think this would result in less data being misplaced and moved around?

--
Best regards, Irek Fasikhov
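For what it's worth, "remove the entire node rather than each disk individually" usually comes down to decommissioning all of that host's OSDs and then dropping the now-empty host bucket from the CRUSH map. A rough sketch of one common sequence, not something spelled out in this thread; the host name node1 and the OSD ids are placeholders:

ceph osd out 6
ceph osd out 7
# wait for the cluster to settle, then stop the ceph-osd daemons on node1
ceph osd crush remove osd.6
ceph osd crush remove osd.7
ceph auth del osd.6
ceph auth del osd.7
ceph osd rm 6
ceph osd rm 7
# finally remove the empty host bucket itself
ceph osd crush remove node1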
[ceph-users] Objects, created with Rados Gateway, have incorrect UTC timestamp
Hi,

I have a problem with the timestamps of objects created through the Rados Gateway. Timestamps are supposed to be in the UTC timezone, but instead I see a strange offset. The server with the Rados Gateway uses the MSK timezone (GMT+3). NTP is set up and running correctly. The Rados Gateway and Ceph have no objects yet (the usage log is empty). Then I use Boto to create some buckets and objects:

$ date
Чтв Фев 26 11:29:05 MSK 2015
$ python fill_smth.py
$ date
Чтв Фев 26 11:29:16 MSK 2015

As you can see, my local time is *11*:29:05, which is *09*:29:05 in UTC. After that I fetch the Rados Gateway usage log:

$ date
Чтв Фев 26 11:35:35 MSK 2015
$ radosgw-admin usage show --uid=2733d594-2f5a-46f7-9174-68000ce754c8
{ entries: [
  { owner: 2733d594-2f5a-46f7-9174-68000ce754c8,
    buckets: [
      { bucket: 0f2f1c7e-f420-4b36-8ff0-333fd9523902,
        time: 2015-02-26 05:00:00.00Z,
        epoch: 1424926800,
        categories: [
          { category: create_bucket, bytes_sent: 0, bytes_received: 0, ops: 1, successful_ops: 1},
          { category: get_obj, bytes_sent: 88, bytes_received: 0, ops: 4, successful_ops: 4},
          { category: list_bucket, bytes_sent: 1585, bytes_received: 0, ops: 1, successful_ops: 1},
          { category: put_obj, bytes_sent: 0, bytes_received: 88, ops: 4, successful_ops: 4}]},
      { bucket: 6ab239b4-1806-441f-8831-85fb3c0cf7a8,
        time: 2015-02-26 05:00:00.00Z,
        epoch: 1424926800,
        categories: [
          { category: create_bucket, bytes_sent: 0, bytes_received: 0, ops: 1, successful_ops: 1},
          { category: get_obj, bytes_sent: 110, bytes_received: 0, ops: 5, successful_ops: 5},
          { category: list_bucket, bytes_sent: 1916, bytes_received: 0, ops: 1, successful_ops: 1},
          { category: put_obj, bytes_sent: 0, bytes_received: 110, ops: 5, successful_ops: 5}]},
      { bucket: b461cb37-c7a0-4e56-8444-b190452f5c6a,
        time: 2015-02-26 05:00:00.00Z,
        epoch: 1424926800,
        categories: [
          { category: create_bucket, bytes_sent: 0, bytes_received: 0, ops: 1, successful_ops: 1},
          { category: get_obj, bytes_sent: 44, bytes_received: 0, ops: 2, successful_ops: 2},
          { category: list_bucket, bytes_sent: 923, bytes_received: 0, ops: 1, successful_ops: 1},
          { category: put_obj, bytes_sent: 0, bytes_received: 44, ops: 2, successful_ops: 2}]},
      { bucket: e7d7ef55-9eeb-4d43-9d58-48dd373261ba,
        time: 2015-02-26 05:00:00.00Z,
        epoch: 1424926800,
        categories: [
          { category: create_bucket, bytes_sent: 0, bytes_received: 0, ops: 1, successful_ops: 1},
          { category: get_obj, bytes_sent: 66, bytes_received: 0, ops: 3, successful_ops: 3},
          { category: list_bucket, bytes_sent: 1254, bytes_received: 0, ops: 1, successful_ops: 1},
          { category: put_obj, bytes_sent: 0, bytes_received: 66, ops: 3, successful_ops: 3}]},
      { bucket:
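One way to cross-check where the skew comes in is to compare the host's UTC clock with what the usage log reports for an explicit window around the test run. A sketch, reusing the uid from the post; the dates are placeholders and, to the best of my understanding, radosgw-admin treats them as UTC, which is worth verifying:

date -u
radosgw-admin usage show --uid=2733d594-2f5a-46f7-9174-68000ce754c8 \
    --start-date="2015-02-26 00:00:00" --end-date="2015-02-27 00:00:00"

Comparing the epoch field (1424926800) against the rendered time string may also help show whether the stored value or only its formatting is off.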
Re: [ceph-users] Question regarding rbd cache
librbd caches data at a buffer / block level. In a simplified example, if you are reading and writing random 4K blocks, the librbd cache would store only those individual 4K blocks. Behind the scenes, it is possible for adjacent block buffers to be merged together within the librbd cache. Therefore, if you read a whole object's worth of adjacent blocks, the whole object could end up stored in the cache as a single entry due to merging -- assuming no cache trimming occurred to evict blocks.

When a flush occurs, only buffers that are flagged as dirty are written back to the OSDs. The whole object would not be written to the OSDs unless you actually wrote data to the whole object.

--
Jason Dillaman
Red Hat
dilla...@redhat.com
http://www.redhat.com

----- Original Message -----
From: Xu (Simon) Chen xche...@gmail.com
To: ceph-users@lists.ceph.com
Sent: Wednesday, February 25, 2015 7:12:01 PM
Subject: [ceph-users] Question regarding rbd cache

Hi folks,

I am curious about how RBD cache works, and whether it caches and writes back entire objects. For example, if my VM images are stored with order 23 (8MB blocks), would a 64MB rbd cache only be able to cache 8 objects at a time? Or does it work in a more granular fashion? Also, when a sync/flush happens, would the entire 8MB block be written back to ceph, or do partial offset writes happen?

Thanks.
-Simon
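For context, the cache Jason describes is governed by a handful of client-side options. A sketch of a ceph.conf [client] section; the values are illustrative (the 64MB figure echoes Simon's example) rather than recommendations from this thread:

[client]
  rbd cache = true
  # total bytes the cache may hold
  rbd cache size = 67108864
  # writeback is throttled once this much dirty data accumulates
  rbd cache max dirty = 50331648
  rbd cache target dirty = 33554432
  # behave as writethrough until the guest issues its first flush
  rbd cache writethrough until flush = true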
[ceph-users] cephfs filesystem layouts : authentication gotchas ?
Hi,

I am attempting to test CephFS file system layouts. I created a user with rights to write only in one pool:

client.puppet
        key: zzz
        caps: [mon] allow r
        caps: [osd] allow rwx pool=puppet

I also created another pool in which I would assume this user is allowed to do nothing, once I have configured things successfully.

By the way: it looks like the "ceph fs ls" command is inconsistent when the cephfs is mounted (I used a locally compiled kmod-ceph rpm):

[root@ceph0 ~]# ceph fs ls
name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet ]

(umount /mnt ...)

[root@ceph0 ~]# ceph fs ls
name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet root ]

So, I have this pool named "root" that I added to the cephfs filesystem. I then edited the filesystem xattrs:

[root@ceph0 ~]# getfattr -n ceph.dir.layout /mnt/root
getfattr: Removing leading '/' from absolute path names
# file: mnt/root
ceph.dir.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=root

I'm therefore assuming client.puppet should not be allowed to write or read anything in /mnt/root, which belongs to the "root" pool... but that is not the case. On another machine, where I mounted cephfs using the client.puppet key (the admin key is not deployed on that node), I can do this:

1.2.3.4:6789:/ on /mnt type ceph (rw,relatime,name=puppet,secret=hidden,nodcache)

[root@dev7248 ~]# echo "not allowed" > /mnt/root/secret.notfailed
[root@dev7248 ~]#
[root@dev7248 ~]# cat /mnt/root/secret.notfailed
not allowed

And I can even see the xattrs inherited from the parent dir:

[root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed
getfattr: Removing leading '/' from absolute path names
# file: mnt/root/secret.notfailed
ceph.file.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=root

Whereas on the node where I mounted cephfs as ceph admin, I get nothing:

[root@ceph0 ~]# cat /mnt/root/secret.notfailed
[root@ceph0 ~]# ls -l /mnt/root/secret.notfailed
-rw-r--r-- 1 root root 12 Mar  3 15:27 /mnt/root/secret.notfailed

After some time, the file also becomes empty on the "puppet client" host:

[root@dev7248 ~]# cat /mnt/root/secret.notfailed
[root@dev7248 ~]#

(but the metadata remained?)

Also, as an unprivileged user, I can take ownership of a "secret" file by changing the extended attribute:

[root@dev7248 ~]# setfattr -n ceph.file.layout.pool -v puppet /mnt/root/secret.notfailed
[root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed
getfattr: Removing leading '/' from absolute path names
# file: mnt/root/secret.notfailed
ceph.file.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=puppet

But fortunately, I haven't succeeded yet (?) in reading that file...

My question therefore is: what am I doing wrong?

Final question for those that read down here: it appears that before creating the cephfs filesystem, I used the "puppet" pool to store a test rbd instance. And it appears I cannot get the list of cephfs objects in that pool, whereas I can get those that are in the newly created "root" pool:

[root@ceph0 ~]# rados -p puppet ls
test.rbd
rbd_directory
[root@ceph0 ~]# rados -p root ls
10a.
10b.

Bug, or feature?

Thanks, regards

P.S.: ceph release:

[root@dev7248 ~]# rpm -qa '*ceph*'
kmod-libceph-3.10.0-0.1.20150130gitee04310.el7.centos.x86_64
libcephfs1-0.87-0.el7.centos.x86_64
ceph-common-0.87-0.el7.centos.x86_64
ceph-0.87-0.el7.centos.x86_64
kmod-ceph-3.10.0-0.1.20150130gitee04310.el7.centos.x86_64
ceph-fuse-0.87.1-0.el7.centos.x86_64
python-ceph-0.87-0.el7.centos.x86_64
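For readers following along, a pool-restricted client key like the one quoted above is typically created along these lines. A sketch only; the cap string mirrors the post, and the output path is an assumption:

ceph auth get-or-create client.puppet \
    mon 'allow r' \
    osd 'allow rwx pool=puppet' \
    -o /etc/ceph/ceph.client.puppet.keyring

Note that these caps restrict what the client may do against the OSDs (data), not what it may do to file metadata, which is relevant to the behaviour discussed in the reply below.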
Re: [ceph-users] qemu-kvm and cloned rbd image
Your procedure appears correct to me. Would you mind re-running your cloned image VM with the following ceph.conf properties:

[client]
rbd cache off
debug rbd = 20
log file = /path/writeable/by/qemu.$pid.log

If you recreate the issue, would you mind opening a ticket at http://tracker.ceph.com/projects/rbd/issues?

Thanks,

--
Jason Dillaman
Red Hat
dilla...@redhat.com
http://www.redhat.com

----- Original Message -----
From: koukou73gr koukou7...@yahoo.com
To: ceph-users@lists.ceph.com
Sent: Monday, March 2, 2015 7:16:08 AM
Subject: [ceph-users] qemu-kvm and cloned rbd image

Hello,

Today I thought I'd experiment with snapshots and cloning. So I did:

rbd import --image-format=2 vm-proto.raw rbd/vm-proto
rbd snap create rbd/vm-proto@s1
rbd snap protect rbd/vm-proto@s1
rbd clone rbd/vm-proto@s1 rbd/server

And then proceeded to create a qemu-kvm guest with rbd/server as its backing store. The guest booted but as soon as it got to mount the root fs, things got weird:

[...]
scsi2 : Virtio SCSI HBA
scsi 2:0:0:0: Direct-Access QEMU QEMU HARDDISK 1.5. PQ: 0 ANSI: 5
sd 2:0:0:0: [sda] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
sd 2:0:0:0: [sda] Write Protect is off
sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sda: sda1 sda2
sd 2:0:0:0: [sda] Attached SCSI disk
dracut: Scanning devices sda2 for LVM logical volumes vg_main/lv_swap vg_main/lv_root
dracut: inactive '/dev/vg_main/lv_swap' [1.00 GiB] inherit
dracut: inactive '/dev/vg_main/lv_root' [6.50 GiB] inherit
EXT4-fs (dm-1): INFO: recovery required on readonly filesystem
EXT4-fs (dm-1): write access will be enabled during recovery
sd 2:0:0:0: [sda] abort
sd 2:0:0:0: [sda] abort
sd 2:0:0:0: [sda] abort
sd 2:0:0:0: [sda] abort
sd 2:0:0:0: [sda] abort
sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:0: [sda] Sense Key : Aborted Command [current]
sd 2:0:0:0: [sda] Add. Sense: I/O process terminated
sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 b0 e0 d8 00 00 08 00
Buffer I/O error on device dm-1, logical block 1058331
lost page write due to I/O error on dm-1
sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:0: [sda] Sense Key : Aborted Command [current]
sd 2:0:0:0: [sda] Add. Sense: I/O process terminated
sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 6f ba c8 00 00 08 00
[ ... snip ... snip ... more or less the same messages ]
end_request: I/O error, dev sda, sector 3129880
end_request: I/O error, dev sda, sector 11518432
end_request: I/O error, dev sda, sector 3194664
end_request: I/O error, dev sda, sector 3129824
end_request: I/O error, dev sda, sector 3194376
end_request: I/O error, dev sda, sector 11579664
end_request: I/O error, dev sda, sector 3129448
end_request: I/O error, dev sda, sector 3197856
end_request: I/O error, dev sda, sector 3129400
end_request: I/O error, dev sda, sector 7385360
end_request: I/O error, dev sda, sector 11515912
end_request: I/O error, dev sda, sector 11514112
__ratelimit: 12 callbacks suppressed
sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:0: [sda] Sense Key : Aborted Command [current]
sd 2:0:0:0: [sda] Add.
Sense: I/O process terminated sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 af b0 80 00 00 10 00 __ratelimit: 12 callbacks suppressed __ratelimit: 13 callbacks suppressed Buffer I/O error on device dm-1, logical block 1048592 lost page write due to I/O error on dm-1 Buffer I/O error on device dm-1, logical block 1048593 lost page write due to I/O error on dm-1 sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE sd 2:0:0:0: [sda] Sense Key : Aborted Command [current] sd 2:0:0:0: [sda] Add. Sense: I/O process terminated sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 2f bf 00 00 00 08 00 Buffer I/O error on device dm-1, logical block 480 lost page write due to I/O error on dm-1 [... snip... more of the same ... ] Buffer I/O error on device dm-1, logical block 475 lost page write due to I/O error on dm-1 Buffer I/O error on device dm-1, logical block 476 lost page write due to I/O error on dm-1 Buffer I/O error on device dm-1, logical block 477 lost page write due to I/O error on dm-1 sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE sd 2:0:0:0: [sda] Sense Key : Aborted Command [current] sd 2:0:0:0: [sda] Add. Sense: I/O process terminated sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 2f be 30 00 00 10 00 Buffer I/O error on device dm-1, logical block 454 lost page write due to I/O error on dm-1 sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE sd 2:0:0:0: [sda] Sense Key : Aborted Command [current] sd 2:0:0:0: [sda] Add. Sense: I/O process terminated sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 2f be 10 00 00 18 00 sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE sd 2:0:0:0: [sda] Sense Key : Aborted Command [current] sd 2:0:0:0: [sda] Add. Sense: I/O process terminated sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 2f be
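When re-running the guest with the debug ceph.conf Jason suggests, the VM can be pointed at the cloned image roughly like this. A hypothetical invocation, not taken from the thread; machine options, client id and conf path are assumptions:

qemu-kvm -m 2048 \
    -drive format=raw,if=virtio,cache=none,file=rbd:rbd/server:id=admin:conf=/etc/ceph/ceph.conf

With "log file" set in the [client] section as above, librbd should then write its debug output to the per-PID log that can be attached to the tracker ticket.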
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
Thanks Irek. Does this mean that after peering for each PG there will be a delay of 10 sec, meaning that every once in a while I will have 10 sec of the cluster NOT being stressed/overloaded, then the recovery takes place for that PG, then for another 10 sec the cluster is fine, and then it is stressed again?

I'm trying to understand the process before actually doing anything (the config reference is there on ceph.com but I don't fully understand the process).

Thanks, Andrija

On 3 March 2015 at 11:32, Irek Fasikhov malm...@gmail.com wrote:

Hi. Use the value osd_recovery_delay_start. Example:

[root@ceph08 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.94.asok config show | grep osd_recovery_delay_start
  osd_recovery_delay_start: 10

2015-03-03 13:13 GMT+03:00 Andrija Panic andrija.pa...@gmail.com:

Hi guys,

Yesterday I removed 1 OSD from the cluster (out of 42 OSDs), and it caused over 37% of the data to rebalance - let's say this is fine (this is when I removed it from the CRUSH map).

I'm wondering - I had previously set some throttling mechanisms, but during the first 1h of rebalancing my recovery rate was going up to 1500 MB/s and the VMs were completely unusable; then for the last 4h of the recovery the rate went down to, say, 100-200 MB/s, and during this time VM performance was still heavily impacted, but at least I could work more or less.

So my question: is this behaviour expected, and is the throttling here working as expected? During the first 1h almost no throttling seemed to be applied, judging by the 1500 MB/s recovery rate and the impact on the VMs, while the last 4h seemed pretty fine (although still a lot of impact in general).

I changed these throttling settings on the fly with:

ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
ceph tell osd.* injectargs '--osd_max_backfills 1'

My journals are on SSDs (12 OSDs per server, of which 6 journals are on one SSD and 6 journals on another SSD) - I have 3 of these hosts.

Any thoughts are welcome.

--
Andrija Panić

--
Best regards, Irek Fasikhov

--
Andrija Panić
Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?
Ah yes, that's a good point :-) Thank you for your assistance Greg, I'm understanding a little more about how Ceph operates under the hood now.

We're probably at a reasonable point for me to say I'll just switch the machines off and forget about them for a while. It's no great loss; I just wanted to see if the cluster would come back to life despite any mis-treatment, and how far it can be pushed with the limited resources on the Microservers.

Getting to the admin socket fails:

root@ceph26:~# ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok help
admin_socket: exception getting command descriptions: [Errno 111] Connection refused

And after activity ceased on /dev/sdb ... (60 second intervals again, snipped many hours of these sorts of figures):

sdb    5.52    0.00   801.27      0  48076
sdb    4.68    0.00   731.80      0  43908
sdb    5.25    0.00   792.80      0  47568
sdb   18.83  483.07   569.53  28984  34172
sdb   28.28  894.60    35.40  53676   2124
sdb    0.00    0.00     0.00      0      0
sdb    0.00    0.00     0.00      0      0
sdb    0.00    0.00     0.00      0      0
sdb    0.00    0.00     0.00      0      0

... the log hadn't progressed beyond the below. Note the last entry was 13 hours prior to activity on sdb ending, so whatever finished writing (then momentarily reading) this morning didn't add anything to the log.

...
2015-03-02 18:24:45.942970 7f27f03ef780 15 filestore(/var/lib/ceph/osd/ceph-1) get_omap_iterator meta/39e3fb/pglog_4.57c/0//-1
2015-03-02 18:24:45.977857 7f27f03ef780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeys meta/39e3fb/pglog_4.57c/0//-1
2015-03-02 18:24:45.978400 7f27f03ef780 10 filestore oid: 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.3
2015-03-02 18:24:45.978414 7f27f03ef780 10 filestore header.spos 0.0.0
2015-03-02 18:24:45.986763 7f27f03ef780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeys meta/39e3fb/pglog_4.57c/0//-1
2015-03-02 18:24:45.987350 7f27f03ef780 10 filestore oid: 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.4
2015-03-02 18:24:45.987363 7f27f03ef780 10 filestore header.spos 0.0.0
2015-03-02 18:24:45.991651 7f27f03ef780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys meta/39e3fb/pglog_4.57c/0//-1
2015-03-02 18:24:45.992119 7f27f03ef780 10 filestore oid: 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.5
2015-03-02 18:24:45.992128 7f27f03ef780 10 filestore header.spos 0.0.0
2015-03-02 18:24:46.016116 7f27f03ef780 10 filestore(/var/lib/ceph/osd/ceph-1) _do_transaction on 0x1a92540
2015-03-02 18:24:46.016133 7f27f03ef780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys meta/16ef7597/infos/head//-1
2015-03-02 18:24:46.016542 7f27f03ef780 10 filestore oid: 16ef7597/infos/head//-1 not skipping op, *spos 13288340.0.1
2015-03-02 18:24:46.016555 7f27f03ef780 10 filestore header.spos 0.0.0
2015-03-02 18:24:48.855098 7f27e2fe0700 20 filestore(/var/lib/ceph/osd/ceph-1) sync_entry woke after 5.000291

The complete file is attached, in case it's of interest to anyone. I get the feeling it's BTRFS which is the 'cause' here. I'm running a scrub in case it highlights anything wrong with the filesystem. If it all springs back to life, I'll post back here with my findings!

Thanks again for the pointers,
Chris

-----Original Message-----
From: Gregory Farnum [mailto:g...@gregs42.com]
Sent: 02 March 2015 18:05
To: Chris Murray
Cc: ceph-users
Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

You can turn the filestore debugging up to 20 instead of 1. ;) You might also explore what information you can get out of the admin socket.

You are correct that those numbers are the OSD epochs, although note that when the system is running you'll get output both for the OSD as a whole and for individual PGs within it (which can be lagging behind). I'm still pretty convinced the OSDs are simply stuck trying to bring their PGs up to date and are thrashing the maps on disk, but we're well past what I can personally diagnose without log diving.
-Greg

On Sat, Feb 28, 2015 at 11:51 AM, Chris Murray chrismurra...@gmail.com wrote:

After noticing that the number increases by 101 on each attempt to start osd.11, I figured I was only 7 iterations away from the output being within 101 of 63675. So, I killed the osd process, started it again, lather, rinse, repeat. I then did the same for other OSDs. Some created very small logs, and some created logs into the gigabytes. Grepping the latter for update_osd_stat showed me where the maps were up to, and therefore which OSDs needed some special attention. Some of the epoch numbers appeared to increase by themselves to a point and then
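As an aside, since injectargs and the admin socket only work while the daemon is up, the debug levels Greg mentions are usually raised for a non-starting OSD via ceph.conf (or on the daemon's command line). A sketch, assuming the usual section names; 20 is the most verbose level discussed here:

[osd]
  debug osd = 20
  debug filestore = 20
  debug journal = 20

For an OSD that is running, the equivalent one-off change would be something like:

ceph tell osd.1 injectargs '--debug-filestore 20'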
Re: [ceph-users] cephfs filesystem layouts : authentication gotchas ?
On 03/03/2015 15:21, SCHAER Frederic wrote:

> By the way: it looks like the "ceph fs ls" command is inconsistent when the cephfs is mounted (I used a locally compiled kmod-ceph rpm):
>
> [root@ceph0 ~]# ceph fs ls
> name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet ]
>
> (umount /mnt ...)
>
> [root@ceph0 ~]# ceph fs ls
> name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet root ]

This is probably #10288, which was fixed in 0.87.1.

> So, I have this pool named "root" that I added to the cephfs filesystem. I then edited the filesystem xattrs:
>
> [root@ceph0 ~]# getfattr -n ceph.dir.layout /mnt/root
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/root
> ceph.dir.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=root
>
> I'm therefore assuming client.puppet should not be allowed to write or read anything in /mnt/root, which belongs to the "root" pool... but that is not the case. On another machine, where I mounted cephfs using the client.puppet key (the admin key is not deployed on that node), I can do this:
>
> 1.2.3.4:6789:/ on /mnt type ceph (rw,relatime,name=puppet,secret=hidden,nodcache)
>
> [root@dev7248 ~]# echo "not allowed" > /mnt/root/secret.notfailed
> [root@dev7248 ~]#
> [root@dev7248 ~]# cat /mnt/root/secret.notfailed
> not allowed

This is data you're seeing from the page cache; it hasn't been written to RADOS. You have used the nodcache setting, but that doesn't mean what you think it does (it was about caching dentries, not data). It's actually not even used in recent kernels (http://tracker.ceph.com/issues/11009). You could try the nofsc option, but I don't know exactly how much caching that turns off -- the safer approach here is probably to do your testing using I/Os that have O_DIRECT set.

> And I can even see the xattrs inherited from the parent dir:
>
> [root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/root/secret.notfailed
> ceph.file.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=root
>
> Whereas on the node where I mounted cephfs as ceph admin, I get nothing:
>
> [root@ceph0 ~]# cat /mnt/root/secret.notfailed
> [root@ceph0 ~]# ls -l /mnt/root/secret.notfailed
> -rw-r--r-- 1 root root 12 Mar  3 15:27 /mnt/root/secret.notfailed
>
> After some time, the file also becomes empty on the "puppet client" host:
>
> [root@dev7248 ~]# cat /mnt/root/secret.notfailed
> [root@dev7248 ~]#
>
> (but the metadata remained?)

Right -- eventually the cache goes away, and you see the true (empty) state of the file.

> Also, as an unprivileged user, I can take ownership of a "secret" file by changing the extended attribute:
>
> [root@dev7248 ~]# setfattr -n ceph.file.layout.pool -v puppet /mnt/root/secret.notfailed
> [root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/root/secret.notfailed
> ceph.file.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=puppet

Well, you're not really getting ownership of anything here: you're modifying the file's metadata, which you are entitled to do (pool permissions have nothing to do with file metadata). There was a recent bug where a file's pool layout could be changed even if it had data, but that was about safety rather than permissions.

> Final question for those that read down here: it appears that before creating the cephfs filesystem, I used the "puppet" pool to store a test rbd instance. And it appears I cannot get the list of cephfs objects in that pool, whereas I can get those that are in the newly created "root" pool:
>
> [root@ceph0 ~]# rados -p puppet ls
> test.rbd
> rbd_directory
> [root@ceph0 ~]# rados -p root ls
> 10a.
> 10b.
>
> Bug, or feature?

I didn't see anything in your earlier steps that would have led to any objects in the puppet pool.

To get closer to the effect you're looking for, you probably need to combine your pool settings with some permissions on the folders, and do your I/O as a user other than root -- your user-level permissions would protect your metadata, and your pool permissions would protect your data. There are also plans to make finer grained access control for the metadata, but that's not there yet.

Cheers,
John
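For readers wanting to try the O_DIRECT approach John suggests, a minimal sketch from the host mounted with the client.puppet key (file name is arbitrary; dd's direct flags bypass the page cache so the I/O actually hits the OSDs):

dd if=/dev/zero of=/mnt/root/secret.direct bs=4k count=1 oflag=direct
dd if=/mnt/root/secret.direct of=/dev/null bs=4k count=1 iflag=direct

With the pool restriction in place one would expect the direct write to fail with an I/O error rather than appear to succeed, though that behaviour is worth verifying on 0.87 rather than taken on faith.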
[ceph-users] Rbd image's data deletion
Hi all,

What happens to the data contained in an rbd image when the image itself gets deleted? Is the data just unlinked, or is it destroyed in a way that makes it unreadable?

Thanks,
Giuseppe
[ceph-users] Question about rados bench
Hi all,

In my reading on the net about various implementations of Ceph, I came across this blog page (it really doesn't give a lot of good information, but it caused me to wonder): http://avengermojo.blogspot.com/2014/12/cubieboard-cluster-ceph-test.html

Near the bottom, the person ran a rados bench test. During the write phase, there were several intervals with a 0 in the "cur MB/s" column. I figure there must have been a bottleneck somewhere slowing down the operation so that data wasn't getting written.

Is something like that during a benchmark something one should be concerned about? Is there a good procedure for tracking down where the bottleneck is (for example, whether it's a given OSD)? Is the data cached and just taking a long time to write, or is it lost in an instance like that?

-Tony
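One common way to chase stalled intervals like that is to run the benchmark while watching per-OSD latencies and the slowest in-flight operations. A sketch; the pool name testpool, the osd.3 id and the concurrency are placeholders, not details from the blog post:

# Run the write benchmark and keep the objects so read tests can follow
rados bench -p testpool 60 write -t 16 --no-cleanup

# While it runs, look for an OSD whose commit/apply latency stands out
ceph osd perf

# On the host running a suspect OSD, inspect its slowest recent ops
ceph daemon osd.3 dump_historic_ops

Intervals showing 0 cur MB/s usually mean no writes completed in that second (the data is still in flight, not lost), but the commands above help show which OSD or journal is holding things up.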
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
Thx Irek. Number of replicas is 3. I have 3 servers with 2 OSDs on them on 1g switch (1 OSD already decommissioned), which is further connected to a new 10G switch/network with 3 servers on it with 12 OSDs each. I'm decommissioning old 3 nodes on 1G network... So you suggest removing whole node with 2 OSDs manually from crush map? Per my knowledge, ceph never places 2 replicas on 1 node, all 3 replicas were originally been distributed over all 3 nodes. So anyway It could be safe to remove 2 OSDs at once together with the node itself...since replica count is 3... ? Thx again for your time On Mar 3, 2015 1:35 PM, Irek Fasikhov malm...@gmail.com wrote: Once you have only three nodes in the cluster. I recommend you add new nodes to the cluster, and then delete the old. 2015-03-03 15:28 GMT+03:00 Irek Fasikhov malm...@gmail.com: You have a number of replication? 2015-03-03 15:14 GMT+03:00 Andrija Panic andrija.pa...@gmail.com: Hi Irek, yes, stoping OSD (or seting it to OUT) resulted in only 3% of data degraded and moved/recovered. When I after that removed it from Crush map ceph osd crush rm id, that's when the stuff with 37% happened. And thanks Irek for help - could you kindly just let me know of the prefered steps when removing whole node? Do you mean I first stop all OSDs again, or just remove each OSD from crush map, or perhaps, just decompile cursh map, delete the node completely, compile back in, and let it heal/recover ? Do you think this would result in less data missplaces and moved arround ? Sorry for bugging you, I really appreaciate your help. Thanks On 3 March 2015 at 12:58, Irek Fasikhov malm...@gmail.com wrote: A large percentage of the rebuild of the cluster map (But low percentage degradation). If you had not made ceph osd crush rm id, the percentage would be low. In your case, the correct option is to remove the entire node, rather than each disk individually 2015-03-03 14:27 GMT+03:00 Andrija Panic andrija.pa...@gmail.com: Another question - I mentioned here 37% of objects being moved arround - this is MISPLACED object (degraded objects were 0.001%, after I removed 1 OSD from cursh map (out of 44 OSD or so). Can anybody confirm this is normal behaviour - and are there any workarrounds ? I understand this is because of the object placement algorithm of CEPH, but still 37% of object missplaces just by removing 1 OSD from crush maps out of 44 make me wonder why this large percentage ? Seems not good to me, and I have to remove another 7 OSDs (we are demoting some old hardware nodes). This means I can potentialy go with 7 x the same number of missplaced objects...? Any thoughts ? Thanks On 3 March 2015 at 12:14, Andrija Panic andrija.pa...@gmail.com wrote: Thanks Irek. Does this mean, that after peering for each PG, there will be delay of 10sec, meaning that every once in a while, I will have 10sec od the cluster NOT being stressed/overloaded, and then the recovery takes place for that PG, and then another 10sec cluster is fine, and then stressed again ? I'm trying to understand process before actually doing stuff (config reference is there on ceph.com but I don't fully understand the process) Thanks, Andrija On 3 March 2015 at 11:32, Irek Fasikhov malm...@gmail.com wrote: Hi. 
Use value osd_recovery_delay_start example: [root@ceph08 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.94.asok config show | grep osd_recovery_delay_start osd_recovery_delay_start: 10 2015-03-03 13:13 GMT+03:00 Andrija Panic andrija.pa...@gmail.com: HI Guys, I yesterday removed 1 OSD from cluster (out of 42 OSDs), and it caused over 37% od the data to rebalance - let's say this is fine (this is when I removed it frm Crush Map). I'm wondering - I have previously set some throtling mechanism, but during first 1h of rebalancing, my rate of recovery was going up to 1500 MB/s - and VMs were unusable completely, and then last 4h of the duration of recover this recovery rate went down to, say, 100-200 MB.s and during this VM performance was still pretty impacted, but at least I could work more or a less So my question, is this behaviour expected, is throtling here working as expected, since first 1h was almoust no throtling applied if I check the recovery rate 1500MB/s and the impact on Vms. And last 4h seemed pretty fine (although still lot of impact in general) I changed these throtling on the fly with: ceph tell osd.* injectargs '--osd_recovery_max_active 1' ceph tell osd.* injectargs '--osd_recovery_op_priority 1' ceph tell osd.* injectargs '--osd_max_backfills 1' My Jorunals are on SSDs (12 OSD per server, of which 6 journals on one SSD, 6 journals on another SSD) - I have 3 of these hosts. Any thought are welcome. -- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com
Re: [ceph-users] problem in cephfs for remove empty directory
On 03/03/2015 14:07, Daniel Takatori Ohara wrote:

> $ ls test-daniel-old/
> total 0
> drwx------ 1 rmagalhaes BioInfoHSL Users            0 Mar  2 10:52 ./
> drwx------ 1 rmagalhaes BioInfoHSL Users 773099838313 Mar  2 11:41 ../
>
> $ rm -rf test-daniel-old/
> rm: cannot remove 'test-daniel-old/': Directory not empty
>
> $ ls test-daniel-old/
> ls: cannot access test-daniel-old/M_S8_L001_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory
> ls: cannot access test-daniel-old/M_S8_L001_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory
> ls: cannot access test-daniel-old/M_S8_L002_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory
> ls: cannot access test-daniel-old/M_S8_L002_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory
> ls: cannot access test-daniel-old/M_S8_L003_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory
> ls: cannot access test-daniel-old/M_S8_L003_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory
> ls: cannot access test-daniel-old/M_S8_L004_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory
> ls: cannot access test-daniel-old/M_S8_L004_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory
> total 0
> drwx------ 1 rmagalhaes BioInfoHSL Users            0 Mar  2 10:52 ./
> drwx------ 1 rmagalhaes BioInfoHSL Users 773099838313 Mar  2 11:41 ../
> l? ? ? ? ? ? M_S8_L001_R1-2_001.fastq.gz_ref.sam_fixed.bam
> l? ? ? ? ? ? M_S8_L001_R1-2_001.fastq.gz_sylvio.sam_fixed.bam
> l? ? ? ? ? ? M_S8_L002_R1-2_001.fastq.gz_ref.sam_fixed.bam
> l? ? ? ? ? ? M_S8_L002_R1-2_001.fastq.gz_sylvio.sam_fixed.bam
> l? ? ? ? ? ? M_S8_L003_R1-2_001.fastq.gz_ref.sam_fixed.bam
> l? ? ? ? ? ? M_S8_L003_R1-2_001.fastq.gz_sylvio.sam_fixed.bam
> l? ? ? ? ? ? M_S8_L004_R1-2_001.fastq.gz_ref.sam_fixed.bam
> l? ? ? ? ? ? M_S8_L004_R1-2_001.fastq.gz_sylvio.sam_fixed.bam

You don't say what version of the client this is (the kernel version, if it's the kernel client). It would appear that the client thinks there are some dentries that don't really exist. You should enable verbose debug logs (with the fuse client, debug client = 20) and reproduce this.

It looks like you had similar issues a while back (subject: "problem for remove files in cephfs"), when Yan Zheng also advised you to get some debug logs.

John
Re: [ceph-users] problem in cephfs for remove empty directory
On Tue, Mar 3, 2015 at 9:24 AM, John Spray john.sp...@redhat.com wrote: On 03/03/2015 14:07, Daniel Takatori Ohara wrote: $ls test-daniel-old/ total 0 drwx-- 1 rmagalhaes BioInfoHSL Users0 Mar 2 10:52 ./ drwx-- 1 rmagalhaes BioInfoHSL Users 773099838313 Mar 2 11:41 ../ $rm -rf test-daniel-old/ rm: cannot remove ‘test-daniel-old/’: Directory not empty $ls test-daniel-old/ ls: cannot access test-daniel-old/M_S8_L001_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L001_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L002_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L002_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L003_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L003_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L004_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L004_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory total 0 drwx-- 1 rmagalhaes BioInfoHSL Users0 Mar 2 10:52 ./ drwx-- 1 rmagalhaes BioInfoHSL Users 773099838313 Mar 2 11:41 ../ l? ? ? ? ?? M_S8_L001_R1-2_001.fastq.gz_ref.sam_fixed.bam l? ? ? ? ?? M_S8_L001_R1-2_001.fastq.gz_sylvio.sam_fixed.bam l? ? ? ? ?? M_S8_L002_R1-2_001.fastq.gz_ref.sam_fixed.bam l? ? ? ? ?? M_S8_L002_R1-2_001.fastq.gz_sylvio.sam_fixed.bam l? ? ? ? ?? M_S8_L003_R1-2_001.fastq.gz_ref.sam_fixed.bam l? ? ? ? ?? M_S8_L003_R1-2_001.fastq.gz_sylvio.sam_fixed.bam l? ? ? ? ?? M_S8_L004_R1-2_001.fastq.gz_ref.sam_fixed.bam l? ? ? ? ?? M_S8_L004_R1-2_001.fastq.gz_sylvio.sam_fixed.bam You don't say what version of the client (version of kernel, if it's the kernel client) this is. It would appear that the client thinks there are some dentries that don't really exist. You should enable verbose debug logs (with fuse client, debug client = 20) and reproduce this. It looks like you had similar issues (subject: problem for remove files in cephfs) a while back, when Yan Zheng also advised you to get some debug logs. In particular this is a known bug in older kernels and is fixed in new enough ones. Unfortunately I don't have the bug link handy though. :( -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Ceph Cluster Address
Hi,

I have a Ceph cluster that is contained within a rack (1 monitor and 5 OSD nodes). I kept the same public and private address in the configuration. I do have 2 NICs and 2 valid IP addresses (one internal only, one external) for each machine.

Is it possible now, after the cluster is up and running, to change the public network address? I had used ceph-deploy for the cluster. If I change the address of the public network in ceph.conf, do I need to propagate it to all the machines in the cluster, or is the monitor node enough?

Thanks
Pankaj
[ceph-users] Unbalanced cluster
Hi All,

I have a cluster that I've been pushing data into in order to get an idea of how full it can get before Ceph marks the cluster full. Unfortunately, each time I fill the cluster I end up with one disk that typically hits the full ratio (0.95) while all other disks still have anywhere from 20-40% free space (my latest attempt resulted in the cluster being marked full at 60% total usage). Any idea why the OSDs would be so unbalanced?

A few notes on the cluster:
- It has 6 storage hosts with 143 total OSDs (normally 144, but one failed disk has been removed from the cluster)
- All OSDs are 4TB drives
- All OSDs are set to the same weight
- The cluster is using host rules
- Using ceph version 0.80.7

In terms of the pool(s), I have been varying the number of pools from run to run, following the PG calculator at http://ceph.com/pgcalc/ to determine the number of placement groups. I have also attempted a few runs bumping up the number of PGs, but that has only resulted in further imbalance.

Any thoughts?

Thanks,
Matt
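Not something raised in the message itself, but a common mitigation on firefly-era clusters with uneven PG distribution is to inspect per-OSD usage and nudge the most-utilized OSDs down with reweight-by-utilization. A sketch; 120 is the usual default threshold (percent of average utilization):

# Per-OSD statistics ("ceph osd df" only exists on newer releases,
# so on 0.80.x these are the usual fallbacks):
ceph osd tree
ceph pg dump osds

# Reweight OSDs that sit above 120% of the average utilization
ceph osd reweight-by-utilization 120

This adjusts the override reweight values rather than the CRUSH weights, so it is reversible, but whether it is appropriate depends on how skewed the PG counts per OSD actually are.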
Re: [ceph-users] problem in cephfs for remove empty directory
Hi John and Gregory, The version of ceph client is 0.87 and the kernel is 3.13. The debug logs here in attach. I see this problem in a older kernel, but i didn't find the solution in the track. Thanks, Att. --- Daniel Takatori Ohara. System Administrator - Lab. of Bioinformatics Molecular Oncology Center Instituto Sírio-Libanês de Ensino e Pesquisa Hospital Sírio-Libanês Phone: +55 11 3155-0200 (extension 1927) R: Cel. Nicolau dos Santos, 69 São Paulo-SP. 01308-060 http://www.bioinfo.mochsl.org.br On Tue, Mar 3, 2015 at 2:26 PM, Gregory Farnum g...@gregs42.com wrote: On Tue, Mar 3, 2015 at 9:24 AM, John Spray john.sp...@redhat.com wrote: On 03/03/2015 14:07, Daniel Takatori Ohara wrote: $ls test-daniel-old/ total 0 drwx-- 1 rmagalhaes BioInfoHSL Users0 Mar 2 10:52 ./ drwx-- 1 rmagalhaes BioInfoHSL Users 773099838313 Mar 2 11:41 ../ $rm -rf test-daniel-old/ rm: cannot remove ‘test-daniel-old/’: Directory not empty $ls test-daniel-old/ ls: cannot access test-daniel-old/M_S8_L001_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L001_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L002_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L002_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L003_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L003_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L004_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L004_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory total 0 drwx-- 1 rmagalhaes BioInfoHSL Users0 Mar 2 10:52 ./ drwx-- 1 rmagalhaes BioInfoHSL Users 773099838313 Mar 2 11:41 ../ l? ? ? ? ?? M_S8_L001_R1-2_001.fastq.gz_ref.sam_fixed.bam l? ? ? ? ?? M_S8_L001_R1-2_001.fastq.gz_sylvio.sam_fixed.bam l? ? ? ? ?? M_S8_L002_R1-2_001.fastq.gz_ref.sam_fixed.bam l? ? ? ? ?? M_S8_L002_R1-2_001.fastq.gz_sylvio.sam_fixed.bam l? ? ? ? ?? M_S8_L003_R1-2_001.fastq.gz_ref.sam_fixed.bam l? ? ? ? ?? M_S8_L003_R1-2_001.fastq.gz_sylvio.sam_fixed.bam l? ? ? ? ?? M_S8_L004_R1-2_001.fastq.gz_ref.sam_fixed.bam l? ? ? ? ?? M_S8_L004_R1-2_001.fastq.gz_sylvio.sam_fixed.bam You don't say what version of the client (version of kernel, if it's the kernel client) this is. It would appear that the client thinks there are some dentries that don't really exist. You should enable verbose debug logs (with fuse client, debug client = 20) and reproduce this. It looks like you had similar issues (subject: problem for remove files in cephfs) a while back, when Yan Zheng also advised you to get some debug logs. In particular this is a known bug in older kernels and is fixed in new enough ones. Unfortunately I don't have the bug link handy though. :( -Greg log_mds.gz Description: GNU Zip compressed data ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS Attributes Question Marks
I did a bit more testing.

1. I tried on a newer kernel and was not able to recreate the problem; maybe it is that kernel bug you mentioned, although it's not an exact replica of the load.
2. I haven't tried the debug yet, since I have to wait for the right moment.

One thing I realized, and maybe it is not an issue, is that we are using a symlink to a folder in the ceph mount:

ceph-fuse on /mnt/ceph type fuse.ceph-fuse (rw,nosuid,nodev,noatime,user_id=0,group_id=0,default_permissions,allow_other)
lrwxrwxrwx 1 root root metadata -> /mnt/ceph/DataCenter/metadata

Not sure if that would create any issues. Anyway, we are going to update the machine soon, so I can report back if we keep having the issue.

Thanks for your support,
Scott

On Mon, Mar 2, 2015 at 4:07 PM Scottix scot...@gmail.com wrote:

I'll try the following things and report back to you.

1. I can get a new kernel on another machine, mount CephFS, and see if I get the same errors.
2. I'll run the debug and see if anything comes up.

I'll report back to you when I can do these things.

Thanks,
Scottie

On Mon, Mar 2, 2015 at 4:04 PM Gregory Farnum g...@gregs42.com wrote:

I bet it's that permission issue combined with a minor bug in FUSE on that kernel, or maybe in the ceph-fuse code (but I've not seen it reported before, so I kind of doubt it). If you run ceph-fuse with debug client = 20 it will output (a whole lot of) logging to the client's log file and you could see what requests are getting processed by the Ceph code and how it's responding. That might let you narrow things down. It's certainly not any kind of timeout.
-Greg

On Mon, Mar 2, 2015 at 3:57 PM, Scottix scot...@gmail.com wrote:

3 Ceph servers on Ubuntu 12.04.5 - kernel 3.13.0-29-generic

We have an old server on which we compiled the ceph-fuse client: Suse 11.4 - kernel 2.6.37.6-0.11. This is the only mount we have right now. We don't have any problems reading the files, the directory shows full 775 permissions, and doing a second ls fixes the problem.

On Mon, Mar 2, 2015 at 3:51 PM Bill Sanders billysand...@gmail.com wrote:

Forgive me if this is unhelpful, but could it be something to do with permissions of the directory and not Ceph at all? http://superuser.com/a/528467

Bill

On Mon, Mar 2, 2015 at 3:47 PM, Gregory Farnum g...@gregs42.com wrote:

On Mon, Mar 2, 2015 at 3:39 PM, Scottix scot...@gmail.com wrote:

We have a file system running CephFS, and for a while we have had this issue: when doing an ls -la we get question marks in the response.

-rw-r--r-- 1 wwwrun root 14761 Feb  9 16:06 data.2015-02-08_00-00-00.csv.bz2
-? ? ? ? ? ? data.2015-02-09_00-00-00.csv.bz2

If we do another directory listing it shows up fine:

-rw-r--r-- 1 wwwrun root 14761 Feb  9 16:06 data.2015-02-08_00-00-00.csv.bz2
-rw-r--r-- 1 wwwrun root 13675 Feb 10 15:21 data.2015-02-09_00-00-00.csv.bz2

It hasn't been a problem, but I just wanted to see if this is an issue. Could the attributes be timing out? We do have a lot of files in the filesystem, so that could be a possible bottleneck.

Huh, that's not something I've seen before. Are the systems you're doing this on the same? What distro and kernel version? Is it reliably one of them showing the question marks, or does it jump between systems?
-Greg

We are using the ceph-fuse mount.
ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
We are planning to do the update to 87.1 soon.

Thanks
Scottie
[ceph-users] import-diff requires snapshot exists?
Hello,

I've been playing with backing up images from my production site (running 0.87) to my backup site (running 0.87.1) using export/import and export-diff/import-diff. After initially exporting and importing the image (rbd/small to backup/small), I took a snapshot (called test1) on the production cluster, ran export-diff from that snapshot, and then attempted to import-diff the diff file on the backup cluster.

# rbd import-diff ./foo.diff backup/small
start snapshot 'test1' does not exist in the image, aborting
Importing image diff: 0% complete...failed.
rbd: import-diff failed: (22) Invalid argument

This works fine if I create a test1 snapshot on the backup cluster before running import-diff. However, it appears that the changes get written into backup/small, not backup/small@test1. So unless I'm not understanding something, it seems like the content of the snapshot on the backup cluster is of no importance, which makes me wonder why it must exist at all. Any thoughts?

Thanks!

-Steve

--
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu
Re: [ceph-users] RadosGW do not populate log file
After changing the ownership of the log file directory everything became fine. Thanks for your help. Regards. Italo Santos http://italosantos.com.br/ On Tuesday, March 3, 2015 at 00:35, zhangdongmao wrote: I have met this before. Because I use apache with rgw, radosgw is executed by the user 'apache', so you have to make sure the apache user has permission to write the log file. On 2015-03-03 07:06, Italo Santos wrote: Hello everyone, I have a radosgw configured with the below ceph.conf file, but this instance isn't generating any log entries at the log file path; the log is always empty, but if I take a look at the apache access.log there are a lot of entries. Anyone know why? Regards. Italo Santos http://italosantos.com.br/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
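For anyone hitting the same problem, the fix boils down to making the radosgw log path writable by whatever user actually runs the rgw process, for example (a sketch; the log directory and the 'apache' user are assumptions taken from this thread, check your own setup):

    chown -R apache:apache /var/log/ceph
    # then restart radosgw and confirm the log file starts growing:
    ls -l /var/log/ceph/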
Re: [ceph-users] Ceph Cluster Address
I had to go through the same experience of changing the public network address and it's not easy. Ceph seems to keep a record of what ip address is associated to what OSD and a port number for the process. I was never able to find out where this record is kept or how to change it manually. Here's what I did, from memory : 1. Remove the network address I didn't want to use anymore from the ceph.conf and put the one I wanted to use instead. Don't worry, modifying the ceph.conf will not affect a currently running cluster unless you issue a command to it, like adding an OSD. 2. Remove each OSD one by one and then reinitialize them right after. You will lose the data that's on the OSD, but if your cluster is replicated properly and do this operation one OSD at a time, you should not lose the copies of that data. 3. Check the OSD status to make sure they use the proper IP. The command ceph osd dump will tell you if your OSDs are detected on the proper IP. 4. Remove and reinstall each monitor one by one. If anybody else has another solution I'd be curious to hear it, but this is how I managed to do it, by basically reinstalling each component one by one. On 3/3/2015 12:26 PM, Garg, Pankaj wrote: Hi, I have ceph cluster that is contained within a rack (1 Monitor and 5 OSD nodes). I kept the same public and private address for configuration. I do have 2 NICS and 2 valid IP addresses (one internal only and one external) for each machine. Is it possible now, to change the Public Network address, after the cluster is up and running? I had used Ceph-deploy for the cluster. If I change the address of the public network in Ceph.conf, do I need to propagate to all the machines in the cluster or just the Monitor Node is enough? Thanks Pankaj ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
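To see which addresses the cluster has actually recorded for each daemon, before and after a change like this, the following is handy (standard commands; the network value is a placeholder):

    ceph osd dump | grep '^osd\.'     # public/cluster address each OSD registered with
    ceph mon dump                     # monitor addresses as stored in the monmap
    # in ceph.conf, the setting being changed would look something like:
    #   public network = 192.168.10.0/24

In many cases a restarted OSD simply binds to whatever local address falls in the configured public network and re-registers it with the monitors; the monitors themselves are the harder part, since their addresses are baked into the monmap.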
Re: [ceph-users] import-diff requires snapshot exists?
Jason, Ah, ok that makes sense. I was forgetting snapshots are read-only. Thanks! My plan was to do something like this. First, create a sync snapshot and seed the backup: rbd snap create rbd/small@sync rbd export rbd/small@sync ./foo rbd import ./foo backup/small rbd snap create backup/small@sync Then each day, create a daily snap on the backup cluster: rbd snap create backup/small@2015-02-03 Then send that day's changes: rbd export-diff --from-snap sync rbd/small ./foo.diff rbd import-diff ./foo.diff rbd/small Then remove and recreate the daily snap marker to prepare for the next sync. rbd snap rm rbd/small@sync rbd snap rm backup/small@sync rbd snap create rbd/small@sync rbd snap create backup/small@sync Finally remove any dated snapshots on the remote cluster outside the retention window. -Steve On 03/03/2015 04:37 PM, Jason Dillaman wrote: Snapshots are read-only, so all changes to the image can only be applied to the HEAD revision. In general, you should take a snapshot prior to export / export-diff to ensure consistent images: rbd snap create rbd/small@snap1 rbd export rbd/small@snap1 ./foo rbd import ./foo backup/small rbd snap create backup/small@snap1 ** rbd/small and backup/small are now consistent through snap1 -- rbd/small might have been modified post snapshot rbd snap create rbd/small@snap2 rbd export-diff --from-snap snap1 rbd/small@snap2 ./foo.diff rbd import-diff ./foo.diff backup/small ** rbd/small and backup/small are now consistent through snap2. import-diff automatically created backup/small@snap2 after importing all changes. -- Jason Dillaman Red Hat dilla...@redhat.com http://www.redhat.com - Original Message - From: Steve Anthony sma...@lehigh.edu To: ceph-users@lists.ceph.com Sent: Tuesday, March 3, 2015 2:06:44 PM Subject: [ceph-users] import-diff requires snapshot exists? Hello, I've been playing with backing up images from my production site (running 0.87) to my backup site (running 0.87.1) using export/import and export-diff/import-diff. After initially exporting and importing the image (rbd/small to backup/small) I took a snapshot (called test1) on the production cluster, ran export-diff from that snapshot, and then attempted to import-diff the diff file on the backup cluster. # rbd import-diff ./foo.diff backup/small start snapshot 'test1' does not exist in the image, aborting Importing image diff: 0% complete...failed. rbd: import-diff failed: (22) Invalid argument This works fine if I create a test1 snapshot on the backup cluster before running import-diff. However, it appears that the changes get written into backup/small not backup/small@test1. So unless I'm not understanding something, it seems like the content of the snapshot on the backup cluster is of no importance, which makes me wonder why it must exist at all. Any thoughts? Thanks! -Steve -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu signature.asc Description: OpenPGP digital signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
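For reference, Steve's daily cycle combined with Jason's correction can be sketched as a small script like the one below (image names and the spool file follow the thread; the previous day's snapshot name is an assumption, and retention cleanup is left out):

    #!/bin/sh
    # incremental copy of rbd/small to backup/small using dated snapshots
    PREV=2015-03-02                 # last snapshot already present on both clusters
    TODAY=$(date +%F)
    rbd snap create rbd/small@$TODAY
    rbd export-diff --from-snap $PREV rbd/small@$TODAY ./foo.diff
    rbd import-diff ./foo.diff backup/small   # creates backup/small@$TODAY after applying the changes
    rm ./foo.diff

Since import-diff creates the end snapshot on the backup image itself, there is no need to pre-create snapshots on the backup cluster.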
[ceph-users] Unexpected OSD down during deep-scrub
Hello everyone, I have a cluster with 5 hosts and 18 OSDs, today I faced with a unexpected issue when multiple OSD goes down. The first OSD go down, was osd.8, feel minutes after, another OSD goes down on the same host, the osd.1. So, I tried restart the OSDs (osd.8 and osd.1) but doesn’t worked and I decided put this OSDs out of cluster and wait the recovery complete. During the recovery, more two OSDs goes down, osd.6 in another host… and seconds after, osd.0 on the same host that first osd goes down too. Looking to the “ceph -w” status I realised some slow/stuck ops and I decided stop the writes on cluster. After that I restarted the OSDs 0 and 6 and bouth became UP and I was able to wait the recovery finish, which happened successfully. I realised that when the first OSD goes down, the cluster was performing a deep-scrub and I found the bellow trace on the logs of osd.8, anyone can help me understand why the osd.8, and other osds, unexpected goes down? Bellow the osd.8 trace: -2 2015-03-03 16:31:48.191796 7f91a388b700 5 -- op tracker -- seq: 2633606, time: 2015-03-03 16:31:48.191796, event: done, op: osd_op(client.3880912.0:236 8430 notify.6 [watch ping cookie 140352686583296] 40.97c520d4 ack+write+known_if_redirected e4231) -1 2015-03-03 16:31:48.192174 7f91af8a3700 1 -- 10.32.30.11:6804/3991 == client.3880912 10.32.30.10:0/1001424 282597 ping magic: 0 v1 0+0+0 (0 0 0) 0xf500 con 0x1535c580 0 2015-03-03 16:31:48.251131 7f91a0084700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)' thread 7 f91a0084700 time 2015-03-03 16:31:48.169895 osd/ReplicatedPG.cc: 7494: FAILED assert(!i-mod_desc.empty()) ceph version 0.92 (00a3ac3b67d93860e7f0b6e07319f11b14d0fec0) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) [0xcc86c2] 2: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x49c) [0x9624fc] 3: (ReplicatedPG::simple_repop_submit(ReplicatedPG::RepGather*)+0x7a) [0x9698ba] 4: (ReplicatedPG::_scrub(ScrubMap)+0x2e62) [0x99b072] 5: (PG::scrub_compare_maps()+0x511) [0x90f0d1] 6: (PG::chunky_scrub(ThreadPool::TPHandle)+0x204) [0x910bb4] 7: (PG::scrub(ThreadPool::TPHandle)+0x3a3) [0x912c53] 8: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle)+0x13) [0x7ebdd3] 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xcbade9] 10: (ThreadPool::WorkThread::entry()+0x10) [0xcbbfe0] 11: (()+0x6b50) [0x7f91bfe46b50] 12: (clone()+0x6d) [0x7f91be8627bd] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. At. Italo Santos http://italosantos.com.br/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS Attributes Question Marks
Ya we are not at 0.87.1 yet, possibly tomorrow. I'll let you know if it still reports the same. Thanks John, --Scottie On Tue, Mar 3, 2015 at 2:57 PM John Spray john.sp...@redhat.com wrote: On 03/03/2015 22:35, Scottix wrote: I was testing a little bit more and decided to run the cephfs-journal-tool I ran across some errors $ cephfs-journal-tool journal inspect 2015-03-03 14:18:54.453981 7f8e29f86780 -1 Bad entry start ptr (0x2aebf6) at 0x2aeb32279b 2015-03-03 14:18:54.539060 7f8e29f86780 -1 Bad entry start ptr (0x2aeb000733) at 0x2aeb322dd8 2015-03-03 14:18:54.584539 7f8e29f86780 -1 Bad entry start ptr (0x2aeb000d70) at 0x2aeb323415 2015-03-03 14:18:54.669991 7f8e29f86780 -1 Bad entry start ptr (0x2aeb0013ad) at 0x2aeb323a52 2015-03-03 14:18:54.707724 7f8e29f86780 -1 Bad entry start ptr (0x2aeb0019ea) at 0x2aeb32408f Overall journal integrity: DAMAGED I expect this is http://tracker.ceph.com/issues/9977, which is fixed in master. You are in *very* bleeding edge territory here, and I'd suggest using the latest development release if you want to experiment with the latest CephFS tooling. Cheers, John ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.80.8 and librbd performance
Is the kernel client affected by the problem? On Tuesday, March 3, 2015 at 15:19 -0800, Sage Weil wrote: Hi, This is just a heads up that we've identified a performance regression in v0.80.8 from previous firefly releases. A v0.80.9 is working its way through QA and should be out in a few days. If you haven't upgraded yet you may want to wait. Thanks! sage -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
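If you want to make sure an unattended update doesn't pull in 0.80.8 before 0.80.9 is released, the packages can be held back; on Debian/Ubuntu something like this works (which package names to hold depends on what is installed on the node):

    echo ceph hold | sudo dpkg --set-selections
    echo librbd1 hold | sudo dpkg --set-selections
    # on RPM-based systems, yum versionlock (yum-plugin-versionlock) serves the same purpose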
Re: [ceph-users] v0.80.8 and librbd performance
On Tuesday, March 3, 2015 at 16:32 -0800, Sage Weil wrote: On Wed, 4 Mar 2015, Olivier Bonvalet wrote: Is the kernel client affected by the problem? Nope. The kernel client is unaffected.. the issue is in librbd. sage Ok, thanks for the clarification. So I have to dig! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS Attributes Question Marks
I was testing a little bit more and decided to run the cephfs-journal-tool I ran across some errors $ cephfs-journal-tool journal inspect 2015-03-03 14:18:54.453981 7f8e29f86780 -1 Bad entry start ptr (0x2aebf6) at 0x2aeb32279b 2015-03-03 14:18:54.539060 7f8e29f86780 -1 Bad entry start ptr (0x2aeb000733) at 0x2aeb322dd8 2015-03-03 14:18:54.584539 7f8e29f86780 -1 Bad entry start ptr (0x2aeb000d70) at 0x2aeb323415 2015-03-03 14:18:54.669991 7f8e29f86780 -1 Bad entry start ptr (0x2aeb0013ad) at 0x2aeb323a52 2015-03-03 14:18:54.707724 7f8e29f86780 -1 Bad entry start ptr (0x2aeb0019ea) at 0x2aeb32408f Overall journal integrity: DAMAGED Corrupt regions: 0x2aeb3226a5-2aeb32279b 0x2aeb32279b-2aeb322dd8 0x2aeb322dd8-2aeb323415 0x2aeb323415-2aeb323a52 0x2aeb323a52-2aeb32408f 0x2aeb32408f-2aeb3246cc $ cephfs-journal-tool header get { magic: ceph fs volume v011, write_pos: 184430420380, expire_pos: 184389995327, trimmed_pos: 184389992448, stream_format: 1, layout: { stripe_unit: 4194304, stripe_count: 4194304, object_size: 4194304, cas_hash: 4194304, object_stripe_unit: 4194304, pg_pool: 4194304}} $ cephfs-journal-tool event get summary 2015-03-03 14:32:50.102863 7f47c3006780 -1 Bad entry start ptr (0x2aee8000e6) at 0x2aee800c25 2015-03-03 14:32:50.242576 7f47c3006780 -1 Bad entry start ptr (0x2aee800b3f) at 0x2aee80167e 2015-03-03 14:32:50.486354 7f47c3006780 -1 Bad entry start ptr (0x2aee800e4f) at 0x2aee80198e 2015-03-03 14:32:50.577443 7f47c3006780 -1 Bad entry start ptr (0x2aee801f65) at 0x2aee802aa4 Events by type: no output here On Tue, Mar 3, 2015 at 12:01 PM Scottix scot...@gmail.com wrote: I did a bit more testing. 1. I tried on a newer kernel and was not able to recreate the problem, maybe it is that kernel bug you mentioned. Although its not an exact replica of the load. 2. I haven't tried the debug yet since I have to wait for the right moment. One thing I realized and maybe it is not an issue is we are using a symlink to a folder in the ceph mount. ceph-fuse on /mnt/ceph type fuse.ceph-fuse (rw,nosuid,nodev,noatime,user_id=0,group_id=0,default_permissions,allow_other) lrwxrwxrwx 1 root root metadata - /mnt/ceph/DataCenter/metadata Not sure if that would create any issues. Anyway we are going to update the machine soon so, I can report if we keep having the issue. Thanks for your support, Scott On Mon, Mar 2, 2015 at 4:07 PM Scottix scot...@gmail.com wrote: I'll try the following things and report back to you. 1. I can get a new kernel on another machine and mount to the CephFS and see if I get the following errors. 2. I'll run the debug and see if anything comes up. I'll report back to you when I can do these things. Thanks, Scottie On Mon, Mar 2, 2015 at 4:04 PM Gregory Farnum g...@gregs42.com wrote: I bet it's that permission issue combined with a minor bug in FUSE on that kernel, or maybe in the ceph-fuse code (but I've not seen it reported before, so I kind of doubt it). If you run ceph-fuse with debug client = 20 it will output (a whole lot of) logging to the client's log file and you could see what requests are getting processed by the Ceph code and how it's responding. That might let you narrow things down. It's certainly not any kind of timeout. -Greg On Mon, Mar 2, 2015 at 3:57 PM, Scottix scot...@gmail.com wrote: 3 Ceph servers on Ubuntu 12.04.5 - kernel 3.13.0-29-generic We have an old server that we compiled the ceph-fuse client on Suse11.4 - kernel 2.6.37.6-0.11 This is the only mount we have right now. 
We don't have any problems reading the files and the directory shows full 775 permissions and doing a second ls fixes the problem. On Mon, Mar 2, 2015 at 3:51 PM Bill Sanders billysand...@gmail.com wrote: Forgive me if this is unhelpful, but could it be something to do with permissions of the directory and not Ceph at all? http://superuser.com/a/528467 Bill On Mon, Mar 2, 2015 at 3:47 PM, Gregory Farnum g...@gregs42.com wrote: On Mon, Mar 2, 2015 at 3:39 PM, Scottix scot...@gmail.com wrote: We have a file system running CephFS and for a while we had this issue when doing an ls -la we get question marks in the response. -rw-r--r-- 1 wwwrun root14761 Feb 9 16:06 data.2015-02-08_00-00-00.csv.bz2 -? ? ? ? ?? data.2015-02-09_00-00-00.csv.bz2 If we do another directory listing it show up fine. -rw-r--r-- 1 wwwrun root14761 Feb 9 16:06 data.2015-02-08_00-00-00.csv.bz2 -rw-r--r-- 1 wwwrun root13675 Feb 10 15:21 data.2015-02-09_00-00-00.csv.bz2 It hasn't been a problem but just wanted to see if this is an issue, could the attributes be timing out? We do have a lot of files in the filesystem so that could be a possible bottleneck. Huh, that's not something I've seen before. Are the systems you're doing this on the same? What
Re: [ceph-users] CephFS Attributes Question Marks
On 03/03/2015 22:35, Scottix wrote: I was testing a little bit more and decided to run the cephfs-journal-tool I ran across some errors $ cephfs-journal-tool journal inspect 2015-03-03 14:18:54.453981 7f8e29f86780 -1 Bad entry start ptr (0x2aebf6) at 0x2aeb32279b 2015-03-03 14:18:54.539060 7f8e29f86780 -1 Bad entry start ptr (0x2aeb000733) at 0x2aeb322dd8 2015-03-03 14:18:54.584539 7f8e29f86780 -1 Bad entry start ptr (0x2aeb000d70) at 0x2aeb323415 2015-03-03 14:18:54.669991 7f8e29f86780 -1 Bad entry start ptr (0x2aeb0013ad) at 0x2aeb323a52 2015-03-03 14:18:54.707724 7f8e29f86780 -1 Bad entry start ptr (0x2aeb0019ea) at 0x2aeb32408f Overall journal integrity: DAMAGED I expect this is http://tracker.ceph.com/issues/9977, which is fixed in master. You are in *very* bleeding edge territory here, and I'd suggest using the latest development release if you want to experiment with the latest CephFS tooling. Cheers, John ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] v0.80.8 and librbd performance
Hi, This is just a heads up that we've identified a performance regression in v0.80.8 from previous firefly releases. A v0.80.9 is working it's way through QA and should be out in a few days. If you haven't upgraded yet you may want to wait. Thanks! sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Fwd: RPM Build Errors
someone in this DL had the thread error? Checking for unpackaged file(s): /usr/lib/rpm/check-files /home/vagrant/rpmbuild/BUILDROOT/calamari-server-1.3-rc_23_g4c41db3.el7.x86_64 Wrote: /home/vagrant/rpmbuild/RPMS/x86_64/calamari-server-1.3-rc_23_g4c41db3.el7.x86_64.rpm Executing(%clean): /bin/sh -e /var/tmp/rpm-tmp.TF02LW -- ID: cp-artifacts-to-share calamari/repobuild/calamari-repo-*.tar.gz Function: cmd.run Name: cp calamari/repobuild/calamari-repo-*.tar.gz /git Result: False Comment: Command cp calamari/repobuild/calamari-repo-*.tar.gz /git run Started: 18:38:22.222920 Duration: 12.591 ms Changes: -- pid: 855 retcode: 1 stderr: cp: cannot stat 'calamari/repobuild/calamari-repo-*.tar.gz': No such file or directory stdout: Begin forwarded message: From: Jesus Chavez (jeschave) jesch...@cisco.commailto:jesch...@cisco.com Subject: RPM Build Errors Date: March 3, 2015 at 5:47:14 PM CST To: ceph-calam...@ceph.commailto:ceph-calam...@ceph.com Hi everyone I am having exactly the same issue, does anybody knew whats going on on this? Thanks! I've seen this kind of compiler error with too little memory. As for salt, it's set up by vagrant because it's using the salt provider. What error are you seeing? On Jan 5, 2015 4:46 AM, John Spray john.spray at redhat.comhttp://lists.ceph.com/listinfo.cgi/ceph-calamari-ceph.com wrote: Forwarding from ceph-users to ceph-calamari. -- Forwarded message -- From: Tony unixfly at gmail.comhttp://lists.ceph.com/listinfo.cgi/ceph-calamari-ceph.com Date: Wed, Dec 24, 2014 at 7:03 PM Subject: [ceph-users] Calamari To: ceph-users at ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-calamari-ceph.com Has anyone else ran into this error? I've tried different versions of GCC and even CentOS and RHEL to compile the calamari but continues to fail and by the way the instructions on the ceph website are not correct because the virtual used with vagrant isn't complete with which ever versions they used to compile with and salt doesn't exist on the virtual. Here is the error message I'm having below: I thought this was a memory issue but I changed memory in the virtual and even cores from 1 to 4, 8, 16 without luck. The below error still looks like a memory issue but I gave it 12G and still failed. gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include/python2.6 -c /home/vagrant/rpmbuild/BUILD/calamari-server-1.2.1/venv/build/cython/Cython/Compiler/Parsing.c -o build/temp.linux-x86_64-2.6/home/vagrant/rpmbuild/BUILD/calamari-server-1.2.1/venv/build/cython/Cython/Compiler/Parsing.o {standard input}: Assembler messages: {standard input}:19186: Warning: end of file not at end of a line; newline inserted {standard input}:19533: Error: unknown pseudo-op: `.lc2' gcc: Internal error: Killed (program cc1) Please submit a full bug report. See http://bugzilla.redhat.com/bugzilla for instructions. error: command 'gcc' failed with exit status 1 Can't roll back Cython; was not uninstalled Cleaning up... 
Command /home/vagrant/rpmbuild/BUILD/calamari-server-1.2.1/venv/bin/python -c import setuptools;__file__='/home/vagrant/rpmbuild/BUILD/calamari-server-1.2.1/venv/build/cython/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec')) install --record /tmp/pip-6R8jlS-record/install-record.txt --single-version-externally-managed --install-headers /home/vagrant/rpmbuild/BUILD/calamari-server-1.2.1/venv/include/site/python2.6 failed with error code 1 in /home/vagrant/rpmbuild/BUILD/calamari-server-1.2.1/venv/build/cython Storing complete log in /home/vagrant/.pip/pip.log RPM build errors: -- ID: cp-artifacts-to-share calamari/repobuild/calamari-repo-rhel6.tar.gz Function: cmd.run Name: cp calamari/repobuild/calamari-repo-rhel6.tar.gz /git/ Result: True Comment: Command cp calamari/repobuild/calamari-repo-rhel6.tar.gz /git/ run Started: 18:54:40.702746 Duration: 123.319 ms Changes: -- pid: 10380 retcode: 0
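One way to confirm or rule out the memory theory inside the build VM is to check whether the kernel OOM killer is what terminated cc1 (a quick diagnostic sketch; log locations assume a RHEL/CentOS guest):

    dmesg | grep -iE 'out of memory|killed process'
    grep -i oom /var/log/messages
    free -m    # how much of the RAM given to the VM is actually visible inside it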
Re: [ceph-users] CephFS Attributes Question Marks
On 03/03/2015 22:57, John Spray wrote: On 03/03/2015 22:35, Scottix wrote: I was testing a little bit more and decided to run the cephfs-journal-tool I ran across some errors $ cephfs-journal-tool journal inspect 2015-03-03 14:18:54.453981 7f8e29f86780 -1 Bad entry start ptr (0x2aebf6) at 0x2aeb32279b 2015-03-03 14:18:54.539060 7f8e29f86780 -1 Bad entry start ptr (0x2aeb000733) at 0x2aeb322dd8 2015-03-03 14:18:54.584539 7f8e29f86780 -1 Bad entry start ptr (0x2aeb000d70) at 0x2aeb323415 2015-03-03 14:18:54.669991 7f8e29f86780 -1 Bad entry start ptr (0x2aeb0013ad) at 0x2aeb323a52 2015-03-03 14:18:54.707724 7f8e29f86780 -1 Bad entry start ptr (0x2aeb0019ea) at 0x2aeb32408f Overall journal integrity: DAMAGED I expect this is http://tracker.ceph.com/issues/9977, which is fixed in master. You are in *very* bleeding edge territory here, and I'd suggest using the latest development release if you want to experiment with the latest CephFS tooling. ...although at the risk of contradicting myself, I now notice that this particular bugfix is one that we did backport for 0.87.1 John ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Unexpected OSD down during deep-scrub
Le 03/03/2015 22:03, Italo Santos a écrit : I realised that when the first OSD goes down, the cluster was performing a deep-scrub and I found the bellow trace on the logs of osd.8, anyone can help me understand why the osd.8, and other osds, unexpected goes down? I'm afraid I've seen this this afternoon too on my test cluster, just after upgrading from 0.87 to 0.93. After an initial migration success, some OSD started to go down : All presented similar stack traces , with magic word scrub in it : ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4) 1: /usr/bin/ceph-osd() [0xbeb3dc] 2: (()+0xf0a0) [0x7f8f3ca130a0] 3: (gsignal()+0x35) [0x7f8f3b37d165] 4: (abort()+0x180) [0x7f8f3b3803e0] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f8f3bbd389d] 6: (()+0x63996) [0x7f8f3bbd1996] 7: (()+0x639c3) [0x7f8f3bbd19c3] 8: (()+0x63bee) [0x7f8f3bbd1bee] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xcd74f0] 10: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x1fc) [0x97259c] 11: (ReplicatedPG::simple_repop_submit(ReplicatedPG::RepGather*)+0x7a) [0x97344a] 12: (ReplicatedPG::_scrub(ScrubMap, std::maphobject_t, std::pairunsigned int, unsigned int, std::lesshobject_t, std::allocatorstd::pairhobject_t const, std::pa irunsigned int, unsigned intconst)+0x2e4d) [0x9a5ded] 13: (PG::scrub_compare_maps()+0x658) [0x916378] 14: (PG::chunky_scrub(ThreadPool::TPHandle)+0x202) [0x917ee2] 15: (PG::scrub(ThreadPool::TPHandle)+0x3a3) [0x919f83] 16: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle)+0x13) [0x7eff93] 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xcc8c49] 18: (ThreadPool::WorkThread::entry()+0x10) [0xccac40] 19: (()+0x6b50) [0x7f8f3ca0ab50] 20: (clone()+0x6d) [0x7f8f3b42695d] As a temporary measure, noscrub and nodeep-scrub are now set for this cluster, and all is working fine right now. So there is probably something wrong here. Need to investigate further. Cheers, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
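For reference, the temporary measure mentioned above is just the cluster-wide scrub flags, which can be set and later cleared like this:

    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # and once the underlying bug is fixed or the cluster is upgraded:
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub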
Re: [ceph-users] v0.80.8 and librbd performance
On 03/03/2015 04:19 PM, Sage Weil wrote: Hi, This is just a heads up that we've identified a performance regression in v0.80.8 from previous firefly releases. A v0.80.9 is working it's way through QA and should be out in a few days. If you haven't upgraded yet you may want to wait. Thanks! sage Hi Sage, I've seen a couple Redmine tickets on this (eg http://tracker.ceph.com/issues/9854 , http://tracker.ceph.com/issues/10956). It's not totally clear to me which of the 70+ unreleased commits on the firefly branch fix this librbd issue. Is it only the three commits in https://github.com/ceph/ceph/pull/3410 , or are there more? - Ken ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Unexpected OSD down during deep-scrub
Hi Yann, That seems related to http://tracker.ceph.com/issues/10536 which seems to be resolved. Could you create a new issue with a link to 10536 ? More logs and ceph report would also be useful to figure out why it resurfaced. Thanks ! On 04/03/2015 00:04, Yann Dupont wrote: Le 03/03/2015 22:03, Italo Santos a écrit : I realised that when the first OSD goes down, the cluster was performing a deep-scrub and I found the bellow trace on the logs of osd.8, anyone can help me understand why the osd.8, and other osds, unexpected goes down? I'm afraid I've seen this this afternoon too on my test cluster, just after upgrading from 0.87 to 0.93. After an initial migration success, some OSD started to go down : All presented similar stack traces , with magic word scrub in it : ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4) 1: /usr/bin/ceph-osd() [0xbeb3dc] 2: (()+0xf0a0) [0x7f8f3ca130a0] 3: (gsignal()+0x35) [0x7f8f3b37d165] 4: (abort()+0x180) [0x7f8f3b3803e0] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f8f3bbd389d] 6: (()+0x63996) [0x7f8f3bbd1996] 7: (()+0x639c3) [0x7f8f3bbd19c3] 8: (()+0x63bee) [0x7f8f3bbd1bee] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xcd74f0] 10: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x1fc) [0x97259c] 11: (ReplicatedPG::simple_repop_submit(ReplicatedPG::RepGather*)+0x7a) [0x97344a] 12: (ReplicatedPG::_scrub(ScrubMap, std::maphobject_t, std::pairunsigned int, unsigned int, std::lesshobject_t, std::allocatorstd::pairhobject_t const, std::pa irunsigned int, unsigned intconst)+0x2e4d) [0x9a5ded] 13: (PG::scrub_compare_maps()+0x658) [0x916378] 14: (PG::chunky_scrub(ThreadPool::TPHandle)+0x202) [0x917ee2] 15: (PG::scrub(ThreadPool::TPHandle)+0x3a3) [0x919f83] 16: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle)+0x13) [0x7eff93] 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xcc8c49] 18: (ThreadPool::WorkThread::entry()+0x10) [0xccac40] 19: (()+0x6b50) [0x7f8f3ca0ab50] 20: (clone()+0x6d) [0x7f8f3b42695d] As a temporary measure, noscrub and nodeep-scrub are now set for this cluster, and all is working fine right now. So there is probably something wrong here. Need to investigate further. Cheers, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Loïc Dachary, Artisan Logiciel Libre signature.asc Description: OpenPGP digital signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
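For anyone attaching data to the tracker issue Loic asks for, the report can be captured with (the output file name is arbitrary):

    ceph report > ceph-report.json
    # plus the crashing OSD's log around the assert, e.g. /var/log/ceph/ceph-osd.8.log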
[ceph-users] Clustering a few NAS into a Ceph cluster
Hi Ceph, Last week-end I discussed with a friend about a use case many of us thought about already: it would be cool to have a simple way to assemble Ceph aware NAS fresh from the store. I summarized the use case and interface we discussed here : https://wiki.ceph.com/Clustering_a_few_NAS_into_a_Ceph_cluster It is far from polished but I hope it will trigger some discussions. The best of comments would be: wait, that already exists at URL ;-) But if that's not the case, maybe we can improve it. Cheers -- Loïc Dachary, Artisan Logiciel Libre signature.asc Description: OpenPGP digital signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.80.8 and librbd performance
On Wed, 4 Mar 2015, Olivier Bonvalet wrote: Is the kernel client affected by the problem? Nope. The kernel client is unaffected.. the issue is in librbd. sage On Tuesday, March 3, 2015 at 15:19 -0800, Sage Weil wrote: Hi, This is just a heads up that we've identified a performance regression in v0.80.8 from previous firefly releases. A v0.80.9 is working its way through QA and should be out in a few days. If you haven't upgraded yet you may want to wait. Thanks! sage -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] EC configuration questions...
Loic, Thank you, I got it created. One of these days, I am going to have to try to understand some of the crush map details... Anyway, on to the next step! -don- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
I would be inclined to shut down both OSDs in a node, let the cluster recover. Once it is recovered, shut down the next two, let it recover. Repeat until all the OSDs are taken out of the cluster. Then I would set nobackfill and norecover. Then remove the hosts/disks from the CRUSH then unset nobackfill and norecover. That should give you a few small changes (when you shut down OSDs) and then one big one to get everything in the final place. If you are still adding new nodes, when nobackfill and norecover is set, you can add them in so that the one big relocate fills the new drives too. On Tue, Mar 3, 2015 at 5:58 AM, Andrija Panic andrija.pa...@gmail.com wrote: Thx Irek. Number of replicas is 3. I have 3 servers with 2 OSDs on them on 1g switch (1 OSD already decommissioned), which is further connected to a new 10G switch/network with 3 servers on it with 12 OSDs each. I'm decommissioning old 3 nodes on 1G network... So you suggest removing whole node with 2 OSDs manually from crush map? Per my knowledge, ceph never places 2 replicas on 1 node, all 3 replicas were originally been distributed over all 3 nodes. So anyway It could be safe to remove 2 OSDs at once together with the node itself...since replica count is 3... ? Thx again for your time On Mar 3, 2015 1:35 PM, Irek Fasikhov malm...@gmail.com wrote: Once you have only three nodes in the cluster. I recommend you add new nodes to the cluster, and then delete the old. 2015-03-03 15:28 GMT+03:00 Irek Fasikhov malm...@gmail.com: You have a number of replication? 2015-03-03 15:14 GMT+03:00 Andrija Panic andrija.pa...@gmail.com: Hi Irek, yes, stoping OSD (or seting it to OUT) resulted in only 3% of data degraded and moved/recovered. When I after that removed it from Crush map ceph osd crush rm id, that's when the stuff with 37% happened. And thanks Irek for help - could you kindly just let me know of the prefered steps when removing whole node? Do you mean I first stop all OSDs again, or just remove each OSD from crush map, or perhaps, just decompile cursh map, delete the node completely, compile back in, and let it heal/recover ? Do you think this would result in less data missplaces and moved arround ? Sorry for bugging you, I really appreaciate your help. Thanks On 3 March 2015 at 12:58, Irek Fasikhov malm...@gmail.com wrote: A large percentage of the rebuild of the cluster map (But low percentage degradation). If you had not made ceph osd crush rm id, the percentage would be low. In your case, the correct option is to remove the entire node, rather than each disk individually 2015-03-03 14:27 GMT+03:00 Andrija Panic andrija.pa...@gmail.com: Another question - I mentioned here 37% of objects being moved arround - this is MISPLACED object (degraded objects were 0.001%, after I removed 1 OSD from cursh map (out of 44 OSD or so). Can anybody confirm this is normal behaviour - and are there any workarrounds ? I understand this is because of the object placement algorithm of CEPH, but still 37% of object missplaces just by removing 1 OSD from crush maps out of 44 make me wonder why this large percentage ? Seems not good to me, and I have to remove another 7 OSDs (we are demoting some old hardware nodes). This means I can potentialy go with 7 x the same number of missplaced objects...? Any thoughts ? Thanks On 3 March 2015 at 12:14, Andrija Panic andrija.pa...@gmail.com wrote: Thanks Irek. 
Does this mean, that after peering for each PG, there will be delay of 10sec, meaning that every once in a while, I will have 10sec od the cluster NOT being stressed/overloaded, and then the recovery takes place for that PG, and then another 10sec cluster is fine, and then stressed again ? I'm trying to understand process before actually doing stuff (config reference is there on ceph.com but I don't fully understand the process) Thanks, Andrija On 3 March 2015 at 11:32, Irek Fasikhov malm...@gmail.com wrote: Hi. Use value osd_recovery_delay_start example: [root@ceph08 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.94.asok config show | grep osd_recovery_delay_start osd_recovery_delay_start: 10 2015-03-03 13:13 GMT+03:00 Andrija Panic andrija.pa...@gmail.com: HI Guys, I yesterday removed 1 OSD from cluster (out of 42 OSDs), and it caused over 37% od the data to rebalance - let's say this is fine (this is when I removed it frm Crush Map). I'm wondering - I have previously set some throtling mechanism, but during first 1h of rebalancing, my rate of recovery was going up to 1500 MB/s - and VMs were unusable completely, and then last 4h of the duration of recover this recovery rate went down to, say, 100-200 MB.s and during this VM performance was still pretty impacted, but at least I could work more or a less So my question, is this behaviour expected, is throtling here working as expected, since first 1h was almoust no
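Spelled out as commands, Robert's sequence would look roughly like this (OSD ids and host bucket names are placeholders, and the stop command depends on the init system in use; treat it as a sketch of the order of operations, not a drop-in script):

    # for each old node: stop its two OSDs (init-system dependent), then wait for
    # 'ceph -s' to report HEALTH_OK before moving on to the next node

    # once all old OSDs are stopped and recovery has finished:
    ceph osd set nobackfill
    ceph osd set norecover
    ceph osd crush remove osd.36      # repeat for every old OSD
    ceph osd crush remove oldnode1    # remove the now-empty host buckets
    ceph auth del osd.36
    ceph osd rm 36
    ceph osd unset nobackfill
    ceph osd unset norecover          # one large rebalance now moves data to its final layout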
[ceph-users] problem in cephfs for remove empty directory
Hi, I have a problem when i will remove a empty directory in cephfs. The directory is empty, but it seems have files crashed in MDS. *$ls test-daniel-old/* total 0 drwx-- 1 rmagalhaes BioInfoHSL Users0 Mar 2 10:52 ./ drwx-- 1 rmagalhaes BioInfoHSL Users 773099838313 Mar 2 11:41 ../ *$rm -rf test-daniel-old/* rm: cannot remove ‘test-daniel-old/’: Directory not empty *$ls test-daniel-old/* ls: cannot access test-daniel-old/M_S8_L001_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L001_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L002_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L002_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L003_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L003_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L004_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L004_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory total 0 drwx-- 1 rmagalhaes BioInfoHSL Users0 Mar 2 10:52 ./ drwx-- 1 rmagalhaes BioInfoHSL Users 773099838313 Mar 2 11:41 ../ l? ? ? ? ?? M_S8_L001_R1-2_001.fastq.gz_ref.sam_fixed.bam l? ? ? ? ?? M_S8_L001_R1-2_001.fastq.gz_sylvio.sam_fixed.bam l? ? ? ? ?? M_S8_L002_R1-2_001.fastq.gz_ref.sam_fixed.bam l? ? ? ? ?? M_S8_L002_R1-2_001.fastq.gz_sylvio.sam_fixed.bam l? ? ? ? ?? M_S8_L003_R1-2_001.fastq.gz_ref.sam_fixed.bam l? ? ? ? ?? M_S8_L003_R1-2_001.fastq.gz_sylvio.sam_fixed.bam l? ? ? ? ?? M_S8_L004_R1-2_001.fastq.gz_ref.sam_fixed.bam l? ? ? ? ?? M_S8_L004_R1-2_001.fastq.gz_sylvio.sam_fixed.bam Att. --- Daniel Takatori Ohara. System Administrator - Lab. of Bioinformatics Molecular Oncology Center Instituto Sírio-Libanês de Ensino e Pesquisa Hospital Sírio-Libanês Phone: +55 11 3155-0200 (extension 1927) R: Cel. Nicolau dos Santos, 69 São Paulo-SP. 01308-060 http://www.bioinfo.mochsl.org.br ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Some long running ops may lock osd
Looking further, i guess what i tried to tell was a simplified version of sharded threadpools, released in giant. Is it possible for that to be backported to firefly? On Tue, Mar 3, 2015 at 9:33 AM, Erdem Agaoglu erdem.agao...@gmail.com wrote: Thank you folks for bringing that up. I had some questions about sharding. We'd like blind buckets too, at least it's on the roadmap. For the current sharded implementation, what are the final details? Is number of shards defined per bucket or globally? Is there a way to split current indexes into shards? On the other hand what i'd like to point here is not necessarily large-bucket-index specific. The problem is the mechanism around thread pools. Any request may require locks on a pg and this should not block the requests for other pgs. I'm no expert but the threads may be able to requeue the requests to a locked pg, processing others for other pgs. Or maybe a thread per pg design was possible. Because, you know, it is somewhat OK not being able to do anything for a locked resource. Then you can go and improve your processing or your locks. But it's a whole different problem when a locked pg blocks requests for a few hundred other pgs in other pools for no good reason. On Tue, Mar 3, 2015 at 5:43 AM, Ben Hines bhi...@gmail.com wrote: Blind-bucket would be perfect for us, as we don't need to list the objects. We only need to list the bucket when doing a bucket deletion. If we could clean out/delete all objects in a bucket (without iterating/listing them) that would be ideal.. On Mon, Mar 2, 2015 at 7:34 PM, GuangYang yguan...@outlook.com wrote: We have had good experience so far keeping each bucket less than 0.5 million objects, by client side sharding. But I think it would be nice you can test at your scale, with your hardware configuration, as well as your expectation over the tail latency. Generally the bucket sharding should help, both for Write throughput and *stall with recovering/scrubbing*, but it comes with a prices - The X shards you have for each bucket, the listing/trimming would be X times weighted, from OSD's load's point of view. There was discussion to implement: 1) blind bucket (for use cases bucket listing is not needed). 2) Un-ordered listing, which could improve the problem I mentioned above. They are on the roadmap... Thanks, Guang From: bhi...@gmail.com Date: Mon, 2 Mar 2015 18:13:25 -0800 To: erdem.agao...@gmail.com CC: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Some long running ops may lock osd We're seeing a lot of this as well. (as i mentioned to sage at SCALE..) Is there a rule of thumb at all for how big is safe to let a RGW bucket get? Also, is this theoretically resolved by the new bucket-sharding feature in the latest dev release? -Ben On Mon, Mar 2, 2015 at 11:08 AM, Erdem Agaoglu erdem.agao...@gmail.com wrote: Hi Gregory, We are not using listomapkeys that way or in any way to be precise. I used it here just to reproduce the behavior/issue. What i am really interested in is if scrubbing-deep actually mitigates the problem and/or is there something that can be further improved. Or i guess we should go upgrade now and hope for the best :) On Mon, Mar 2, 2015 at 8:10 PM, Gregory Farnum g...@gregs42.com wrote: On Mon, Mar 2, 2015 at 7:56 AM, Erdem Agaoglu erdem.agao...@gmail.com wrote: Hi all, especially devs, We have recently pinpointed one of the causes of slow requests in our cluster. It seems deep-scrubs on pg's that contain the index file for a large radosgw bucket lock the osds. 
Incresing op threads and/or disk threads helps a little bit, but we need to increase them beyond reason in order to completely get rid of the problem. A somewhat similar (and more severe) version of the issue occurs when we call listomapkeys for the index file, and since the logs for deep-scrubbing was much harder read, this inspection was based on listomapkeys. In this example osd.121 is the primary of pg 10.c91 which contains file .dir.5926.3 in .rgw.buckets pool. OSD has 2 op threads. Bucket contains ~500k objects. Standard listomapkeys call take about 3 seconds. time rados -p .rgw.buckets listomapkeys .dir.5926.3 /dev/null real 0m2.983s user 0m0.760s sys 0m0.148s In order to lock the osd we request 2 of them simultaneously with something like: rados -p .rgw.buckets listomapkeys .dir.5926.3 /dev/null sleep 1 rados -p .rgw.buckets listomapkeys .dir.5926.3 /dev/null 'debug_osd=30' logs show the flow like: At t0 some thread enqueue_op's my omap-get-keys request. Op-Thread A locks pg 10.c91 and dequeue_op's it and starts reading ~500k keys. Op-Thread B responds to several other requests during that 1 second sleep. They're generally extremely fast subops on other pgs. At t1 (about a second later) my
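For what it's worth, the op/disk thread counts mentioned above can be inspected and bumped at runtime roughly like this (the values are only examples, and as noted raising them is a mitigation rather than a fix for the locking itself):

    ceph --admin-daemon /var/run/ceph/ceph-osd.121.asok config show | grep -E 'osd_op_threads|osd_disk_threads'
    ceph tell osd.\* injectargs '--osd_op_threads 4'
    ceph tell osd.\* injectargs '--osd_disk_threads 2'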
Re: [ceph-users] backfill_toofull, but OSDs not full
ceph 0.80.1 The same quesiton. I have deleted 1/4 data, but the problem didn't disappear Does anyone have other way to solve it? At 2015-01-10 05:31:30,Udo Lembke ulem...@polarzone.de wrote: Hi, I had an similiar effect two weeks ago - 1PG backfill_toofull and due reweighting and delete there was enough free space but the rebuild process stopped after a while. After stop and start ceph on the second node, the rebuild process runs without trouble and the backfill_toofull are gone. This happens with firefly. Udo On 09.01.2015 21:29, c3 wrote: In this case the root cause was half denied reservations. http://tracker.ceph.com/issues/9626 This stopped backfills since, those listed as backfilling were actually half denied and doing nothing. The toofull status is not checked until a free backfill slot happens, so everything was just stuck. Interestingly, the toofull was created by other backfills which were not stoppped. http://tracker.ceph.com/issues/9594 Quite the log jam to clear. Quoting Craig Lewis cle...@centraldesktop.com: What was the osd_backfill_full_ratio? That's the config that controls backfill_toofull. By default, it's 85%. The mon_osd_*_ratio affect the ceph status. I've noticed that it takes a while for backfilling to restart after changing osd_backfill_full_ratio. Backfilling usually restarts for me in 10-15 minutes. Some PGs will stay in that state until the cluster is nearly done recoverying. I've only seen backfill_toofull happen after the OSD exceeds the ratio (so it's reactive, no proactive). Mine usually happen when I'm rebalancing a nearfull cluster, and an OSD backfills itself toofull. On Mon, Jan 5, 2015 at 11:32 AM, c3 ceph-us...@lopkop.com wrote: Hi, I am wondering how a PG gets marked backfill_toofull. I reweighted several OSDs using ceph osd crush reweight. As expected, PG began moving around (backfilling). Some PGs got marked +backfilling (~10), some +wait_backfill (~100). But some are marked +backfill_toofull. My OSDs are between 25% and 72% full. Looking at ceph pg dump, I can find the backfill_toofull PGs and verified the OSDs involved are less than 72% full. Do backfill reservations include a size? Are these OSDs projected to be toofull, once the current backfilling complete? Some of the backfill_toofull and backfilling point to the same OSDs. I did adjust the full ratios, but that did not change the backfill_toofull status. ceph tell mon.\* injectargs '--mon_osd_full_ratio 0.95' ceph tell osd.\* injectargs '--osd_backfill_full_ratio 0.92' ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
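When debugging a stuck backfill_toofull state, it helps to list exactly which PGs are affected and how full the OSDs they map to really are (standard commands):

    ceph health detail | grep -i toofull
    ceph pg dump_stuck unclean
    ceph pg dump | tail -n 60    # the osdstat section at the end shows per-OSD kb used/avail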
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
Hi. Use value osd_recovery_delay_start example: [root@ceph08 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.94.asok config show | grep osd_recovery_delay_start osd_recovery_delay_start: 10 2015-03-03 13:13 GMT+03:00 Andrija Panic andrija.pa...@gmail.com: HI Guys, I yesterday removed 1 OSD from cluster (out of 42 OSDs), and it caused over 37% od the data to rebalance - let's say this is fine (this is when I removed it frm Crush Map). I'm wondering - I have previously set some throtling mechanism, but during first 1h of rebalancing, my rate of recovery was going up to 1500 MB/s - and VMs were unusable completely, and then last 4h of the duration of recover this recovery rate went down to, say, 100-200 MB.s and during this VM performance was still pretty impacted, but at least I could work more or a less So my question, is this behaviour expected, is throtling here working as expected, since first 1h was almoust no throtling applied if I check the recovery rate 1500MB/s and the impact on Vms. And last 4h seemed pretty fine (although still lot of impact in general) I changed these throtling on the fly with: ceph tell osd.* injectargs '--osd_recovery_max_active 1' ceph tell osd.* injectargs '--osd_recovery_op_priority 1' ceph tell osd.* injectargs '--osd_max_backfills 1' My Jorunals are on SSDs (12 OSD per server, of which 6 journals on one SSD, 6 journals on another SSD) - I have 3 of these hosts. Any thought are welcome. -- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- С уважением, Фасихов Ирек Нургаязович Моб.: +79229045757 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Problems with shadow objects
Hello, all I have ceph+RGW installation. And have some problems with shadow objects. For example: #rados ls -p .rgw.buckets|grep default.4507.1 . default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.1_5 default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.2_2 default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.6_4 default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.4_2 default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.3_5 . Please give me advices and answer on my questions 1) How can I rm this shadow files? 2) What does the name of this shadow files? example with normal object: # radosgw-admin object stat --bucket=dev --object=RegExp_tutorial.png and I receive information about this object. with shadow object: default.4507.1_ - bucket-id radosgw-admin object stat --bucket=dev --object=_shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.2_7 ERROR: failed to stat object, returned error: (2) No such file or directory how can I determine name of this object -- Best Regards, Stanislav Butkeev ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
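A word of caution before removing anything by hand: the __shadow_ objects are the tail pieces of large or multipart objects, and they may still be referenced by a head object's manifest, so deleting them blindly can corrupt live data. They can at least be inspected at the rados level like this (a sketch using one of the names above); only remove an object once you are sure nothing references it:

    rados -p .rgw.buckets stat 'default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.1_5'
    # and only for objects known to be orphaned:
    rados -p .rgw.buckets rm 'default.4507.1__shadow_test_s3.2/2vO4WskQNBGMnC8MGaYPSLfGkhQY76U.1_5'

radosgw-admin object stat only understands the logical bucket/key names, which is why it cannot resolve the raw __shadow_ rados names directly.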
[ceph-users] Understand RadosGW logs
Hi! After realizing the problem with log rotation (see http://thread.gmane.org/gmane.comp.file-systems.ceph.user/17708) and fixing it, I now for the first time have some meaningful (and recent) logs to look at. While from an application perspective there seem to be no issues, I would like to understand some messages I find with relatively high frequency in the logs: Exhibit 1 - 2015-03-03 11:14:53.685361 7fcf4bfef700 0 ERROR: flush_read_list(): d-client_c-handle_data() returned -1 2015-03-03 11:15:57.476059 7fcf39ff3700 0 ERROR: flush_read_list(): d-client_c-handle_data() returned -1 2015-03-03 11:17:43.570986 7fcf25fcb700 0 ERROR: flush_read_list(): d-client_c-handle_data() returned -1 2015-03-03 11:22:00.881640 7fcf39ff3700 0 ERROR: flush_read_list(): d-client_c-handle_data() returned -1 2015-03-03 11:22:48.147011 7fcf35feb700 0 ERROR: flush_read_list(): d-client_c-handle_data() returned -1 2015-03-03 11:27:40.572723 7fcf50ff9700 0 ERROR: flush_read_list(): d-client_c-handle_data() returned -1 2015-03-03 11:29:40.082954 7fcf36fed700 0 ERROR: flush_read_list(): d-client_c-handle_data() returned -1 2015-03-03 11:30:32.204492 7fcf4dff3700 0 ERROR: flush_read_list(): d-client_c-handle_data() returned -1 I cannot find anything relevant by Googling for that, apart from the actual line of code that produces this line. What does that mean? Is it an indication of data corruption or are there more benign reasons for this line? Exhibit 2 -- Several of these blocks 2015-03-03 07:06:17.805772 7fcf36fed700 1 == starting new request req=0x7fcf5800f3b0 = 2015-03-03 07:06:17.836671 7fcf36fed700 0 RGWObjManifest::operator++(): result: ofs=4718592 stripe_ofs=4718592 part_ofs=0 rule-part_size=0 2015-03-03 07:06:17.836758 7fcf36fed700 0 RGWObjManifest::operator++(): result: ofs=8912896 stripe_ofs=8912896 part_ofs=0 rule-part_size=0 2015-03-03 07:06:17.836918 7fcf36fed700 0 RGWObjManifest::operator++(): result: ofs=13055243 stripe_ofs=13055243 part_ofs=0 rule-part_size=0 2015-03-03 07:06:18.263126 7fcf36fed700 1 == req done req=0x7fcf5800f3b0 http_status=200 == ... 
2015-03-03 09:27:29.855001 7fcf28fd1700 1 == starting new request req=0x7fcf580102a0 = 2015-03-03 09:27:29.866718 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=4718592 stripe_ofs=4718592 part_ofs=0 rule-part_size=0 2015-03-03 09:27:29.866778 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=8912896 stripe_ofs=8912896 part_ofs=0 rule-part_size=0 2015-03-03 09:27:29.866852 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=13107200 stripe_ofs=13107200 part_ofs=0 rule-part_size=0 2015-03-03 09:27:29.866917 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=17301504 stripe_ofs=17301504 part_ofs=0 rule-part_size=0 2015-03-03 09:27:29.875466 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=21495808 stripe_ofs=21495808 part_ofs=0 rule-part_size=0 2015-03-03 09:27:29.884434 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=25690112 stripe_ofs=25690112 part_ofs=0 rule-part_size=0 2015-03-03 09:27:29.906155 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=29884416 stripe_ofs=29884416 part_ofs=0 rule-part_size=0 2015-03-03 09:27:29.914364 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=34078720 stripe_ofs=34078720 part_ofs=0 rule-part_size=0 2015-03-03 09:27:29.940653 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=38273024 stripe_ofs=38273024 part_ofs=0 rule-part_size=0 2015-03-03 09:27:30.272816 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=42467328 stripe_ofs=42467328 part_ofs=0 rule-part_size=0 2015-03-03 09:27:31.125773 7fcf28fd1700 0 RGWObjManifest::operator++(): result: ofs=46661632 stripe_ofs=46661632 part_ofs=0 rule-part_size=0 2015-03-03 09:27:31.192661 7fcf28fd1700 0 ERROR: flush_read_list(): d-client_c-handle_data() returned -1 2015-03-03 09:27:31.194481 7fcf28fd1700 1 == req done req=0x7fcf580102a0 http_status=200 == ... 2015-03-03 09:28:43.008517 7fcf2a7d4700 1 == starting new request req=0x7fcf580102a0 = 2015-03-03 09:28:43.016414 7fcf2a7d4700 0 RGWObjManifest::operator++(): result: ofs=887579 stripe_ofs=887579 part_ofs=0 rule-part_size=0 2015-03-03 09:28:43.022387 7fcf2a7d4700 1 == req done req=0x7fcf580102a0 http_status=200 == First, what is the req= line? Is that a thread-id? I am asking, because the same id is used over and over in the same file over time. More importantly, what do the RGWObjManifest::operator++():... lines mean? In the middle case above the block even ends with one of the ERROR lines mentioned before, but the HTTP status is still 200, suggesting a succesful operation. Thanks in advance for shedding some light, because I would like to know if I need to take some action or at least keep an eye on these via monitoring? Cheers, Daniel ___ ceph-users
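Regarding the two questions: the req=0x... value is the address of the request structure rather than a thread id, which is why the same value reappears as memory gets reused, and the flush_read_list()/handle_data() errors typically show up when the HTTP client drops the connection partway through a GET (the 200 status was already decided before the transfer was cut short). To get more context around those lines, the rgw debug level can be raised temporarily, which makes each req block name the bucket, object and operation being handled (20 is very verbose, so remember to turn it back down):

    # in ceph.conf, under the radosgw client section (e.g. [client.radosgw.gateway]):
    #   debug rgw = 20
    #   debug ms = 1
    # then restart the radosgw process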
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
A large percentage of the rebuild of the cluster map (But low percentage degradation). If you had not made ceph osd crush rm id, the percentage would be low. In your case, the correct option is to remove the entire node, rather than each disk individually 2015-03-03 14:27 GMT+03:00 Andrija Panic andrija.pa...@gmail.com: Another question - I mentioned here 37% of objects being moved arround - this is MISPLACED object (degraded objects were 0.001%, after I removed 1 OSD from cursh map (out of 44 OSD or so). Can anybody confirm this is normal behaviour - and are there any workarrounds ? I understand this is because of the object placement algorithm of CEPH, but still 37% of object missplaces just by removing 1 OSD from crush maps out of 44 make me wonder why this large percentage ? Seems not good to me, and I have to remove another 7 OSDs (we are demoting some old hardware nodes). This means I can potentialy go with 7 x the same number of missplaced objects...? Any thoughts ? Thanks On 3 March 2015 at 12:14, Andrija Panic andrija.pa...@gmail.com wrote: Thanks Irek. Does this mean, that after peering for each PG, there will be delay of 10sec, meaning that every once in a while, I will have 10sec od the cluster NOT being stressed/overloaded, and then the recovery takes place for that PG, and then another 10sec cluster is fine, and then stressed again ? I'm trying to understand process before actually doing stuff (config reference is there on ceph.com but I don't fully understand the process) Thanks, Andrija On 3 March 2015 at 11:32, Irek Fasikhov malm...@gmail.com wrote: Hi. Use value osd_recovery_delay_start example: [root@ceph08 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.94.asok config show | grep osd_recovery_delay_start osd_recovery_delay_start: 10 2015-03-03 13:13 GMT+03:00 Andrija Panic andrija.pa...@gmail.com: HI Guys, I yesterday removed 1 OSD from cluster (out of 42 OSDs), and it caused over 37% od the data to rebalance - let's say this is fine (this is when I removed it frm Crush Map). I'm wondering - I have previously set some throtling mechanism, but during first 1h of rebalancing, my rate of recovery was going up to 1500 MB/s - and VMs were unusable completely, and then last 4h of the duration of recover this recovery rate went down to, say, 100-200 MB.s and during this VM performance was still pretty impacted, but at least I could work more or a less So my question, is this behaviour expected, is throtling here working as expected, since first 1h was almoust no throtling applied if I check the recovery rate 1500MB/s and the impact on Vms. And last 4h seemed pretty fine (although still lot of impact in general) I changed these throtling on the fly with: ceph tell osd.* injectargs '--osd_recovery_max_active 1' ceph tell osd.* injectargs '--osd_recovery_op_priority 1' ceph tell osd.* injectargs '--osd_max_backfills 1' My Jorunals are on SSDs (12 OSD per server, of which 6 journals on one SSD, 6 journals on another SSD) - I have 3 of these hosts. Any thought are welcome. -- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- С уважением, Фасихов Ирек Нургаязович Моб.: +79229045757 -- Andrija Panić -- Andrija Panić -- С уважением, Фасихов Ирек Нургаязович Моб.: +79229045757 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
Another question: the 37% of objects I mentioned being moved around are MISPLACED objects (degraded objects were 0.001%), after I removed 1 OSD from the CRUSH map (out of 44 OSDs or so). Can anybody confirm this is normal behaviour, and are there any workarounds? I understand this is due to Ceph's object placement algorithm, but 37% of objects misplaced just by removing 1 OSD out of 44 from the CRUSH map makes me wonder why the percentage is so large. It does not seem good to me, and I have to remove another 7 OSDs (we are demoting some old hardware nodes), which means I could potentially see 7 x the same number of misplaced objects...? Any thoughts? Thanks

--
Andrija Panić
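PS: The degraded/misplaced percentages above are simply what the status output reports while recovery is running; I keep an eye on them with nothing fancier than a loop like this:

# watch the degraded / misplaced / recovery figures while the cluster rebalances
watch -n 10 'ceph -s | grep -E "degraded|misplaced|recovery|backfill"'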
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
osd_recovery_delay_start is the delay, in seconds, between recovery iterations (osd_recovery_max_active). It is described here: https://github.com/ceph/ceph/search?utf8=%E2%9C%93q=osd_recovery_delay_start

On 3 March 2015 at 12:14, Andrija Panic andrija.pa...@gmail.com wrote: Thanks Irek. Does this mean that, after peering, there will be a delay of 10 sec for each PG, meaning that every once in a while I will have 10 sec of the cluster NOT being stressed/overloaded, then recovery takes place for that PG, then for another 10 sec the cluster is fine, and then it is stressed again? I'm trying to understand the process before actually doing this (the config reference is there on ceph.com, but I don't fully understand the process). Thanks, Andrija

On 3 March 2015 at 11:32, Irek Fasikhov malm...@gmail.com wrote: Hi. Use the value osd_recovery_delay_start. Example: [root@ceph08 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.94.asok config show | grep osd_recovery_delay_start osd_recovery_delay_start: 10

--
Best regards,
Irek Fasikhov
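PS: For example, to change it at runtime and check that it was applied (plus the ceph.conf form if you want it to survive a daemon restart):

# inject at runtime on all OSDs
ceph tell osd.* injectargs '--osd_recovery_delay_start 10'

# verify on a single OSD via its admin socket
ceph --admin-daemon /var/run/ceph/ceph-osd.94.asok config show | grep osd_recovery_delay_start

# to make it persistent, add to the [osd] section of ceph.conf:
#   osd recovery delay start = 10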
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
Hi Irek, yes, stopping the OSD (or setting it OUT) resulted in only 3% of the data being degraded and moved/recovered. When I afterwards removed it from the CRUSH map with ceph osd crush rm <id>, that's when the 37% happened.

And thanks for the help, Irek. Could you kindly let me know the preferred steps when removing a whole node? Do you mean I should first stop all its OSDs again, or just remove each OSD from the CRUSH map, or perhaps decompile the CRUSH map, delete the node completely, compile it back in, and let the cluster heal/recover? Do you think this would result in less data being misplaced and moved around? Sorry for bugging you, I really appreciate your help. Thanks

--
Andrija Panić
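PS: For context, per OSD I have so far been following more or less the documented removal sequence, roughly the sketch below (osd.6 is just an example id); it is the crush rm step at the end that kicked off the big remap:

# mark the OSD out and let the cluster drain it (this was the ~3% step)
ceph osd out 6

# stop the daemon once the data has moved (CentOS 6 sysvinit)
/etc/init.d/ceph stop osd.6

# remove it from CRUSH, auth and the OSD map (crush rm is what triggered the 37% remap)
ceph osd crush rm osd.6
ceph auth del osd.6
ceph osd rm 6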
[ceph-users] Rebalance/Backfill Throtling - anything missing here?
Hi guys,

Yesterday I removed 1 OSD from the cluster (out of 42 OSDs), and it caused over 37% of the data to rebalance. Let's say this is fine (this happened when I removed it from the CRUSH map).

I'm wondering about this: I had previously set some throttling mechanisms, but during the first hour of rebalancing the recovery rate went up to 1500 MB/s and the VMs were completely unusable. During the last 4 hours of the recovery the rate went down to, say, 100-200 MB/s; VM performance was still quite impacted, but at least I could more or less work.

So my question is: is this behaviour expected, and is the throttling here working as expected? During the first hour there was almost no throttling applied, judging by the 1500 MB/s recovery rate and the impact on the VMs, while the last 4 hours seemed pretty fine (although there was still a lot of impact in general).

I changed the throttling on the fly with:

ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
ceph tell osd.* injectargs '--osd_max_backfills 1'

My journals are on SSDs (12 OSDs per server, 6 journals on one SSD and 6 on the other); I have 3 of these hosts.

Any thoughts are welcome.

--
Andrija Panić
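PS: As far as I understand, injectargs only changes the running daemons, so to keep the same limits across restarts the values would also have to go into ceph.conf, something like the sketch below (osd.0 in the check is only an example):

# [osd] section of ceph.conf - persistent equivalents of the injectargs above
#   osd recovery max active = 1
#   osd recovery op priority = 1
#   osd max backfills = 1

# verify what a running OSD actually uses (osd.0 as an example)
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'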