Re: [ceph-users] Issue with free Inodes
Maybe someone can spot new light on this:

1. Only the SSD-cache OSDs are affected by this issue.
2. Total cache OSD count is 12x60GiB; the backend filesystem is ext4.
3. I have created 2 cache tier pools with replica size=3 on those OSDs, both with pg_num: 400, pgp_num: 400.
4. There was a crush ruleset:

superuser@admin:~$ ceph osd crush rule dump ssd
{ "rule_id": 3,
  "rule_name": "ssd",
  "ruleset": 3,
  "type": 1,
  "min_size": 1,
  "max_size": 10,
  "steps": [
        { "op": "take", "item": -21, "item_name": "ssd"},
        { "op": "chooseleaf_firstn", "num": 0, "type": "disktype"},
        { "op": "emit"}]}

for gathering all SSD OSDs from all nodes by *disktype*. I guess there may be a lot of *directories* created on the filesystem for organizing placement groups; can that account for such a large number of inodes being taken up by directory entries?

On 24.03.2015 16:52, Gregory Farnum wrote:
On Tue, Mar 24, 2015 at 12:13 AM, Christian Balzer ch...@gol.com wrote:
On Tue, 24 Mar 2015 09:41:04 +0300 Kamil Kuramshin wrote:
Yes, I read it and I do not understand what you mean when you say *verify this*. All 3335808 inodes are definitely files and directories created by the ceph OSD process:
What I mean is how/why did Ceph create 3+ million files, where in the tree they actually are, and whether they are evenly distributed in the respective PG sub-directories. Or to ask it differently, how large is your cluster (how many OSDs, objects), in short the output of ceph -s. If cache tiers actually reserve each object that exists on the backing store (even if there isn't data in it yet on the cache tier) and your cluster is large enough, it might explain this.
Nope. As you've said, this doesn't make any sense unless the objects are all ludicrously small (and you can't actually get 10-byte objects in Ceph; the names alone tend to be bigger than that) or something else is using up inodes.
And that should both be mentioned, and precautions against running out of inodes should be taken by the Ceph code. If not, this may be a bug after all. Would be nice if somebody from the Ceph devs could have a gander at this.
Christian

*tune2fs 1.42.5 (29-Jul-2012)*
Filesystem volume name:   none
Last mounted on:          /var/lib/ceph/tmp/mnt.05NAJ3
Filesystem UUID:          e4dcca8a-7b68-4f60-9b10-c164dc7f9e33
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
*Inode count:             3335808*
Block count:              13342945
Reserved block count:     667147
Free blocks:              5674105
*Free inodes:             0*
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1020
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8176
Inode blocks per group:   511
Flex block group size:    16
Filesystem created:       Fri Feb 20 16:44:25 2015
Last mount time:          Tue Mar 24 09:33:19 2015
Last write time:          Tue Mar 24 09:33:27 2015
Mount count:              7
Maximum mount count:      -1
Last checked:             Fri Feb 20 16:44:25 2015
Check interval:           0 (none)
Lifetime writes:          4116 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      148ee5dd-7ee0-470c-a08a-b11c318ff90b
Journal backup:           inode blocks

*fsck.ext4 /dev/sda1*
e2fsck 1.42.5 (29-Jul-2012)
/dev/sda1: clean, 3335808/3335808 files, 7668840/13342945 blocks

On 23.03.2015 17:09, Christian Balzer wrote:
On Mon, 23 Mar 2015 15:26:07 +0300 Kamil Kuramshin wrote:
Yes, I understand that. The initial purpose of the first email was just advice for newcomers. My fault was that I selected ext4 as the backend for the SSD disks; I did not foresee that the inode count could reach its limit before the free space does :) And maybe there should be not only a warning for free space in MiB (GiB, TiB) but also a dedicated warning about free inodes for filesystems with static inode allocation like ext4, because if an OSD reaches the inode limit it becomes totally unusable and immediately goes down, and from that moment there is no way to start it!
While all that is true and should probably be addressed, please re-read what I wrote before. With the 3.3 million inodes used, and thus likely as many files (did you verify this?), and 4MB objects, that would put you somewhere in the 12TB ballpark. Something very, very strange and wrong is going on with your cache tier.
Christian
23.03.2015
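For anyone who wants to catch this earlier: a quick way to keep an eye on inode usage across OSD filesystems, and (when rebuilding) to give ext4 a denser inode ratio, might look like the sketch below. The device name is a placeholder; check the mkfs.ext4 defaults on your distribution before relying on the numbers.

# show inode usage per mounted OSD filesystem
df -i /var/lib/ceph/osd/ceph-*

# when (re)creating an ext4 OSD filesystem, allocate more inodes up front,
# e.g. one inode per 8 KiB of space instead of the usual one per 16 KiB
mkfs.ext4 -i 8192 /dev/sdX1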
[ceph-users] ceph -w: Understanding MB data versus MB used
Hello there,

I started to push data into my ceph cluster. There is something I cannot understand in the output of ceph -w. When I run ceph -w I get this kind of output:

2015-03-25 09:11:36.785909 mon.0 [INF] pgmap v278788: 26056 pgs: 26056 active+clean; 2379 MB data, 19788 MB used, 33497 GB / 33516 GB avail

2379 MB is actually the data I pushed into the cluster; I can see it in the ceph df output as well, and the numbers are consistent. What I don't understand is the 19788 MB used. All my pools have size 3, so I expected something like 2379 * 3. Instead this number is very big. I really need to understand how "MB used" grows, because I need to know how many disks to buy. Any hints? Thank you.

Saverio
[ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded
Hi,

with two more hosts added (now 7 storage nodes) I want to create a new EC pool, and I get a strange effect:

ceph@admin:~$ ceph health detail
HEALTH_WARN 2 pgs degraded; 2 pgs stuck degraded; 2 pgs stuck unclean; 2 pgs stuck undersized; 2 pgs undersized
pg 22.3e5 is stuck unclean since forever, current state active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
pg 22.240 is stuck unclean since forever, current state active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
pg 22.3e5 is stuck undersized for 406.614447, current state active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
pg 22.240 is stuck undersized for 406.616563, current state active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
pg 22.3e5 is stuck degraded for 406.614566, current state active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
pg 22.240 is stuck degraded for 406.616679, current state active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
pg 22.3e5 is active+undersized+degraded, acting [76,15,82,11,57,29,2147483647]
pg 22.240 is active+undersized+degraded, acting [38,85,17,74,2147483647,10,58]

But I have only 91 OSDs (84 SATA + 7 SSDs), not 2147483647! Where the heck did the 2147483647 come from?

These are the commands I ran:
ceph osd erasure-code-profile set 7hostprofile k=5 m=2 ruleset-failure-domain=host
ceph osd pool create ec7archiv 1024 1024 erasure 7hostprofile

My version:
ceph -v
ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)

I found an issue in my crush map; one SSD was in the map twice:

host ceph-061-ssd {
        id -16          # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
}
root ssd {
        id -13          # do not change unnecessarily
        # weight 0.780
        alg straw
        hash 0  # rjenkins1
        item ceph-01-ssd weight 0.170
        item ceph-02-ssd weight 0.170
        item ceph-03-ssd weight 0.000
        item ceph-04-ssd weight 0.170
        item ceph-05-ssd weight 0.170
        item ceph-06-ssd weight 0.050
        item ceph-07-ssd weight 0.050
        item ceph-061-ssd weight 0.000
}

Host ceph-061-ssd doesn't exist and osd-61 is the SSD from ceph-03-ssd, but after fixing the crush map the issue with the OSD 2147483647 still exists.

Any idea how to fix that?

regards

Udo
Re: [ceph-users] ERROR: missing keyring, cannot use cephx for authentication
Hi Jesus,

I encountered a similar problem.
1. I shut down one of the nodes, but none of the OSDs on that node would come back up after the reboot.
2. Running "service ceph restart" manually gave the same error message:

[root@storage4 ~]# /etc/init.d/ceph start
=== osd.15 ===
2015-03-23 14:43:32.399811 7fed0fcf4700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
2015-03-23 14:43:32.399814 7fed0fcf4700 0 librados: osd.15 initialization error (2) No such file or directory
Error connecting to cluster: ObjectNotFound
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.15 --keyring=/var/lib/ceph/osd/ceph-15/keyring osd crush create-or-move -- 15 0.19 host=storage4 root=default'
..
3. ll /var/lib/ceph/osd/ceph-15/
total 0
All files disappeared from /var/lib/ceph/osd/ceph-15/

oyym...@gmail.com

From: Jesus Chavez (jeschave)
Date: 2015-03-24 05:09
To: ceph-users
Subject: [ceph-users] ERROR: missing keyring, cannot use cephx for authentication

Hi all, I did an HA failover test, shutting down 1 node, and I see that only 1 OSD came up after reboot:

[root@geminis ceph]# df -h
Filesystem             Size  Used  Avail  Use%  Mounted on
/dev/mapper/rhel-root   50G  4.5G    46G    9%  /
devtmpfs               126G     0   126G    0%  /dev
tmpfs                  126G   80K   126G    1%  /dev/shm
tmpfs                  126G  9.9M   126G    1%  /run
tmpfs                  126G     0   126G    0%  /sys/fs/cgroup
/dev/sda1              494M  165M   330M   34%  /boot
/dev/mapper/rhel-home   36G   44M    36G    1%  /home
/dev/sdc1              3.7T  134M   3.7T    1%  /var/lib/ceph/osd/ceph-14

If I run "service ceph restart" I get this error message:

Stopping Ceph osd.94 on geminis...done
=== osd.94 ===
2015-03-23 15:05:41.632505 7fe7b9941700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
2015-03-23 15:05:41.632508 7fe7b9941700 0 librados: osd.94 initialization error (2) No such file or directory
Error connecting to cluster: ObjectNotFound
failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.94 --keyring=/var/lib/ceph/osd/ceph-94/keyring osd crush create-or-move -- 94 0.05 host=geminis root=default

I have ceph.conf and ceph.client.admin.keyring under /etc/ceph:

[root@geminis ceph]# ls /etc/ceph
ceph.client.admin.keyring ceph.conf rbdmap tmp1OqNFi tmptQ0a1P
[root@geminis ceph]#

Does anybody know what could be wrong? Thanks

Jesus Chavez
SYSTEMS ENGINEER-C.SALES
jesch...@cisco.com
Phone: +52 55 5267 3146
Mobile: +51 1 5538883255
CCIE - 44433
[ceph-users] Erasure coding
Hi guys,

We've got a very small Ceph cluster (3 hosts, 5 OSDs each for cold data) that we intend to grow later on as more storage is needed. We would very much like to use erasure coding for some pools, but are facing some challenges regarding the optimal initial profile "replication" settings given the limited number of initial hosts we can use to spread the chunks. Could somebody please help me with the following questions?

1. Suppose we initially use replication instead of erasure coding. Can we convert a replicated pool to an erasure coded pool later on?
2. Will Ceph gain the ability to change the K and M values for an existing pool in the near future?
3. Can the failure domain be changed for an existing pool? E.g. can we start with failure domain OSD and then switch it to host after adding more hosts?
4. Where can I find a good comparison of the available erasure code plugins that allows me to properly decide which one suits our needs best?

Many thanks for your help!

Tom
[ceph-users] Snapshots and fstrim with cache tiers ?
Hello,

I have a few questions regarding snapshots and fstrim with cache tiers.

In the cache tier and erasure coding FAQ for ICE 1.2 (based on Firefly), Inktank says "Snapshots are not supported in conjunction with cache tiers." What are the risks of using snapshots with cache tiers? Would this "better not use it" recommendation still be true with Giant or Hammer?

Regarding the fstrim command, it doesn't seem to work with cache tiers: the freed-up blocks don't come back to the ceph cluster. Can someone confirm this? Is there something we can do to get those freed-up blocks back into the cluster? Also, can we run an fstrim task from the cluster side, that is, without having to map and mount each rbd image or rely on the client to run it?

Best regards,

Frédéric Nass
Sous-direction Infrastructures
Direction du Numérique
Université de Lorraine
email : frederic.n...@univ-lorraine.fr
Tél : +33 3 83 68 53 83
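Not an answer to the cluster-side question, but for reference the client-side version of this usually looks like the sketch below (the image and mount point names are made up). As far as I know the discards only reach the OSDs if the rbd client in use is recent enough to pass them through, so results will vary by kernel or librbd version.

# map and mount the image with discard support, then trim it from the client
rbd map rbd/myimage
mount -o discard /dev/rbd0 /mnt/myimage
fstrim -v /mnt/myimage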
Re: [ceph-users] Erasure coding
Hi Tom,

On 25/03/2015 11:31, Tom Verdaat wrote:
Hi guys, We've got a very small Ceph cluster (3 hosts, 5 OSDs each for cold data) that we intend to grow later on as more storage is needed. We would very much like to use erasure coding for some pools, but are facing some challenges regarding the optimal initial profile "replication" settings given the limited number of initial hosts we can use to spread the chunks. Could somebody please help me with the following questions?

1. Suppose we initially use replication instead of erasure coding. Can we convert a replicated pool to an erasure coded pool later on?

What you would do is create an erasure coded pool later and have the initial replicated pool as a cache in front of it. http://docs.ceph.com/docs/master/rados/operations/cache-tiering/ Objects from the replicated pool will move to the erasure coded pool if they are not used, and it will save space. You don't need to create the erasure coded pool on your small cluster. You can do it when it grows larger or when it becomes full.

2. Will Ceph gain the ability to change the K and M values for an existing pool in the near future?

I don't think so.

3. Can the failure domain be changed for an existing pool? E.g. can we start with failure domain OSD and then switch it to host after adding more hosts?

The failure domain, although listed in the erasure code profile for convenience, really belongs to the crush ruleset applied to the pool. It can therefore be changed after the pool is created. It is likely to result in objects moving around a lot during the transition, but it should work fine otherwise.

4. Where can I find a good comparison of the available erasure code plugins that allows me to properly decide which one suits our needs best?

In a nutshell: jerasure is flexible and is likely to be what you want; isa computes faster than jerasure but only works on Intel processors (note however that the erasure code computation does not make a significant difference overall); lrc and shec (to be published in Hammer) minimize network usage during recovery but use more space than jerasure or isa.

Cheers

Many thanks for your help!

Tom

Loïc Dachary, Artisan Logiciel Libre
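In case it helps anyone searching the archives later, the mechanics of "replicated pool in front of an erasure coded pool" that Loïc describes are roughly the commands below. The pool names, PG counts and the target_max_bytes value are placeholders for this example, not recommendations.

# create the EC pool, then put the existing replicated pool in front of it as a writeback cache
ceph osd pool create ecpool 1024 1024 erasure myprofile
ceph osd tier add ecpool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay ecpool cachepool
# the cache tier needs a hit set and a size target so it knows what to flush and evict
ceph osd pool set cachepool hit_set_type bloom
ceph osd pool set cachepool hit_set_count 1
ceph osd pool set cachepool hit_set_period 3600
ceph osd pool set cachepool target_max_bytes 100000000000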
Re: [ceph-users] ERROR: missing keyring, cannot use cephx for authentication
It doesn't look like your OSD is mounted. What do you have when you run mount? How did you create your OSDs?

Robert LeBlanc

Sent from a mobile device, please excuse any typos.

On Mar 25, 2015 1:31 AM, oyym...@gmail.com wrote:
Hi Jesus, I encountered a similar problem.
*1.* I shut down one of the nodes, but none of the OSDs on that node would come back up after the reboot.
*2.* Running "service ceph restart" manually gave the same error message:
[root@storage4 ~]# /etc/init.d/ceph start === osd.15 === 2015-03-23 14:43:32.399811 7fed0fcf4700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication 2015-03-23 14:43:32.399814 7fed0fcf4700 0 librados: osd.15 initialization error (2) No such file or directory Error connecting to cluster: ObjectNotFound failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.15 --keyring=/var/lib/ceph/osd/ceph-15/keyring osd crush create-or-move -- 15 0.19 host=storage4 root=default' ..
3. ll /var/lib/ceph/osd/ceph-15/ total 0
All files *disappeared* from /var/lib/ceph/osd/ceph-15/

oyym...@gmail.com

From: Jesus Chavez (jeschave) jesch...@cisco.com
Date: 2015-03-24 05:09
To: ceph-users ceph-users@lists.ceph.com
Subject: [ceph-users] ERROR: missing keyring, cannot use cephx for authentication
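For the archives, the quickest way to check Robert's theory is something like the commands below. The device name in the mount step is a guess; use ceph-disk list or blkid first to find the right data partition for that OSD.

# is anything mounted at the OSD path, and which partitions does ceph-disk know about?
mount | grep ceph-15
ceph-disk list
# if the data partition isn't mounted, mount it and try the OSD again
mount /dev/sdb1 /var/lib/ceph/osd/ceph-15
/etc/init.d/ceph start osd.15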
Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded
On Wed, Mar 25, 2015 at 1:20 AM, Udo Lembke ulem...@polarzone.de wrote:
Hi, due to two more hosts (now 7 storage nodes) I want to create a new EC pool and get a strange effect:
ceph@admin:~$ ceph health detail
HEALTH_WARN 2 pgs degraded; 2 pgs stuck degraded; 2 pgs stuck unclean; 2 pgs stuck undersized; 2 pgs undersized

This is the big clue: you have two undersized PGs!

pg 22.3e5 is stuck unclean since forever, current state active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]

2147483647 is the largest number you can represent in a signed 32-bit integer. There's an output error of some kind which is fixed elsewhere; this should be -1. So for whatever reason (in general it's hard on CRUSH trying to select N entries out of N choices), CRUSH hasn't been able to map an OSD to this slot for you. You'll want to figure out why that is and fix it.
-Greg
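For what it's worth, a quick way to see whether CRUSH can actually find 7 OSDs for every PG with the current map is crushtool's test mode, roughly as below. The rule number here is a guess; check "ceph osd pool get ec7archiv crush_ruleset" for the real one.

ceph osd getcrushmap -o /tmp/crushmap
crushtool -i /tmp/crushmap --test --show-bad-mappings --rule 1 --num-rep 7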
Re: [ceph-users] error creating image in rbd-erasure-pool
Yes.

On Wed, Mar 25, 2015 at 4:13 AM, Frédéric Nass frederic.n...@univ-lorraine.fr wrote:
Hi Greg,
Thank you for this clarification. It helps a lot. Does this "can't think of any issues" apply to both rbd and pool snapshots?
Frederic.

On Tue, Mar 24, 2015 at 12:09 PM, Brendan Moloney molo...@ohsu.edu wrote:
Hi Loic and Markus,
"By the way, Inktank does not support snapshots of a pool with cache tiering: https://download.inktank.com/docs/ICE%201.2%20-%20Cache%20and%20Erasure%20Coding%20FAQ.pdf"
Hi, you seem to be talking about pool snapshots rather than RBD snapshots, but in the linked document it is not clear that there is a distinction: "Can I use snapshots with a cache tier? Snapshots are not supported in conjunction with cache tiers." Can anyone clarify if this is just pool snapshots?

I think that was just a decision based on the newness and complexity of the feature for product purposes. Snapshots against cache tiered pools certainly should be fine in Giant/Hammer and we can't think of any issues in Firefly off the tops of our heads.
-Greg

Regards,
Frédéric Nass.
Re: [ceph-users] ceph -w: Understanding MB data versus MB used
On Wed, Mar 25, 2015 at 1:24 AM, Saverio Proto ziopr...@gmail.com wrote:
Hello there, I started to push data into my ceph cluster. There is something I cannot understand in the output of ceph -w. When I run ceph -w I get this kind of output:
2015-03-25 09:11:36.785909 mon.0 [INF] pgmap v278788: 26056 pgs: 26056 active+clean; 2379 MB data, 19788 MB used, 33497 GB / 33516 GB avail
2379 MB is actually the data I pushed into the cluster; I can see it in the ceph df output as well, and the numbers are consistent. What I don't understand is the 19788 MB used. All my pools have size 3, so I expected something like 2379 * 3. Instead this number is very big. I really need to understand how "MB used" grows because I need to know how many disks to buy.

"MB used" is the summation of (the programmatic equivalent of) df across all your nodes, whereas "MB data" is calculated by the OSDs based on the data they've written down. Depending on your configuration, "MB used" can include things like the OSD journals, or even totally unrelated data if the disks are shared with other applications. "MB used" including the space used by the OSD journals is my first guess about what you're seeing here, in which case you'll notice that it won't grow any faster than "MB data" does once the journals are fully allocated.
-Greg
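If you want to check how much of that gap the journals account for, something along these lines on each OSD host should give a rough number (assuming the journal is a symlink in each OSD data directory, pointing at either a partition or a file):

for j in /var/lib/ceph/osd/ceph-*/journal; do
  t=$(readlink -f "$j")
  if [ -b "$t" ]; then blockdev --getsize64 "$t"; else stat -c %s "$t"; fi
done | awk '{s+=$1} END {printf "journals: %d MB\n", s/1024/1024}'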
Re: [ceph-users] Radosgw authorization failed
- Original Message - From: Neville neville.tay...@hotmail.co.uk To: ceph-users@lists.ceph.com Sent: Wednesday, March 25, 2015 8:16:39 AM Subject: [ceph-users] Radosgw authorization failed Hi all, I'm testing backup product which supports Amazon S3 as target for Archive storage and I'm trying to setup a Ceph cluster configured with the S3 API to use as an internal target for backup archives instead of AWS. I've followed the online guide for setting up Radosgw and created a default region and zone based on the AWS naming convention US-East-1. I'm not sure if this is relevant but since I was having issues I thought it might need to be the same. I've tested the radosgw using boto.s3 and it seems to work ok i.e. I can create a bucket, create a folder, list buckets etc. The problem is when the backup software tries to create an object I get an authorization failure. It's using the same user/access/secret as I'm using from boto.s3 and I'm sure the creds are right as it lets me create the initial connection, it just fails when trying to create an object (backup folder). Here's the extract from the radosgw log: - 2015-03-25 15:07:26.449227 7f1050dc7700 2 req 5:0.000419:s3:GET /:list_bucket:init op 2015-03-25 15:07:26.449232 7f1050dc7700 2 req 5:0.000424:s3:GET /:list_bucket:verifying op mask 2015-03-25 15:07:26.449234 7f1050dc7700 20 required_mask= 1 user.op_mask=7 2015-03-25 15:07:26.449235 7f1050dc7700 2 req 5:0.000427:s3:GET /:list_bucket:verifying op permissions 2015-03-25 15:07:26.449237 7f1050dc7700 5 Searching permissions for uid=test mask=49 2015-03-25 15:07:26.449238 7f1050dc7700 5 Found permission: 15 2015-03-25 15:07:26.449239 7f1050dc7700 5 Searching permissions for group=1 mask=49 2015-03-25 15:07:26.449240 7f1050dc7700 5 Found permission: 15 2015-03-25 15:07:26.449241 7f1050dc7700 5 Searching permissions for group=2 mask=49 2015-03-25 15:07:26.449242 7f1050dc7700 5 Found permission: 15 2015-03-25 15:07:26.449243 7f1050dc7700 5 Getting permissions id=test owner=test perm=1 2015-03-25 15:07:26.449244 7f1050dc7700 10 uid=test requested perm (type)=1, policy perm=1, user_perm_mask=1, acl perm=1 2015-03-25 15:07:26.449245 7f1050dc7700 2 req 5:0.000437:s3:GET /:list_bucket:verifying op params 2015-03-25 15:07:26.449247 7f1050dc7700 2 req 5:0.000439:s3:GET /:list_bucket:executing 2015-03-25 15:07:26.449252 7f1050dc7700 10 cls_bucket_list test1(@{i=.us-east.rgw.buckets.index}.us-east.rgw.buckets[us-east.280959.2]) start num 1001 2015-03-25 15:07:26.450828 7f1050dc7700 2 req 5:0.002020:s3:GET /:list_bucket:http status=200 2015-03-25 15:07:26.450832 7f1050dc7700 1 == req done req=0x7f107000e2e0 http_status=200 == 2015-03-25 15:07:26.516999 7f1069df9700 20 enqueued request req=0x7f107000f0e0 2015-03-25 15:07:26.517006 7f1069df9700 20 RGWWQ: 2015-03-25 15:07:26.517007 7f1069df9700 20 req: 0x7f107000f0e0 2015-03-25 15:07:26.517010 7f1069df9700 10 allocated request req=0x7f107000f6b0 2015-03-25 15:07:26.517021 7f1058dd7700 20 dequeued request req=0x7f107000f0e0 2015-03-25 15:07:26.517023 7f1058dd7700 20 RGWWQ: empty 2015-03-25 15:07:26.517081 7f1058dd7700 20 CONTENT_LENGTH=88 2015-03-25 15:07:26.517084 7f1058dd7700 20 CONTENT_TYPE=application/octet-stream 2015-03-25 15:07:26.517085 7f1058dd7700 20 CONTEXT_DOCUMENT_ROOT=/var/www 2015-03-25 15:07:26.517086 7f1058dd7700 20 CONTEXT_PREFIX= 2015-03-25 15:07:26.517087 7f1058dd7700 20 DOCUMENT_ROOT=/var/www 2015-03-25 15:07:26.517088 7f1058dd7700 20 FCGI_ROLE=RESPONDER 2015-03-25 15:07:26.517089 7f1058dd7700 20 GATEWAY_INTERFACE=CGI/1.1 2015-03-25 
15:07:26.517090 7f1058dd7700 20 HTTP_AUTHORIZATION=AWS F79L68W19B3GCLOSE3F8:AcXqtvlBzBMpwdL+WuhDRoLT/Bs= 2015-03-25 15:07:26.517091 7f1058dd7700 20 HTTP_CONNECTION=Keep-Alive 2015-03-25 15:07:26.517092 7f1058dd7700 20 HTTP_DATE=Wed, 25 Mar 2015 15:07:26 GMT 2015-03-25 15:07:26.517092 7f1058dd7700 20 HTTP_EXPECT=100-continue 2015-03-25 15:07:26.517093 7f1058dd7700 20 HTTP_HOST=test1.devops-os-cog01.devops.local 2015-03-25 15:07:26.517094 7f1058dd7700 20 HTTP_USER_AGENT=aws-sdk-java/unknown-version Windows_Server_2008_R2/6.1 Java_HotSpot(TM)_Client_VM/24.55-b03 2015-03-25 15:07:26.517096 7f1058dd7700 20 HTTP_X_AMZ_META_CREATIONTIME=2015-03-25T15:07:26 2015-03-25 15:07:26.517097 7f1058dd7700 20 HTTP_X_AMZ_META_SIZE=88 2015-03-25 15:07:26.517098 7f1058dd7700 20 HTTP_X_AMZ_STORAGE_CLASS=STANDARD 2015-03-25 15:07:26.517099 7f1058dd7700 20 HTTPS=on 2015-03-25 15:07:26.517100 7f1058dd7700 20 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2015-03-25 15:07:26.517100 7f1058dd7700 20 QUERY_STRING= 2015-03-25 15:07:26.517101 7f1058dd7700 20 REMOTE_ADDR=10.40.41.106 2015-03-25 15:07:26.517102 7f1058dd7700 20 REMOTE_PORT=55439 2015-03-25 15:07:26.517103 7f1058dd7700 20
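One thing that may be worth ruling out, given the HTTP_HOST=test1.devops-os-cog01.devops.local in the log above: the backup client is sending virtual-hosted-style (bucket-in-hostname) requests, and radosgw only strips the bucket out of the Host header, and signs the request resource the same way the client did, if it knows its own DNS name. If that is the cause, a sketch of the fix (the section name and FQDN below are guesses for this setup, plus a matching wildcard DNS record or Apache ServerAlias) would be:

[client.radosgw.gateway]
    rgw dns name = devops-os-cog01.devops.local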
Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded
Sorry all: my company's e-mail security got in the way there. Try these references...

http://tracker.ceph.com/issues/10350
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon

-don-

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Don Doerner
Sent: 25 March, 2015 08:01
To: Udo Lembke; ceph-us...@ceph.com
Subject: Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded

Assuming you've calculated the number of PGs reasonably, see here (http://tracker.ceph.com/issues/10350) and here (http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon). I'm guessing these will address your issue. That weird number means that no OSD was found/assigned to the PG.

-don-

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Udo Lembke
Sent: 25 March, 2015 01:21
To: ceph-us...@ceph.com
Subject: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded
Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded
Hi Gregory,

thanks for the answer! I had a look at which storage nodes are missing, and it's two different ones:

pg 22.240 is stuck undersized for 24437.862139, current state active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
pg 22.240 is stuck undersized for 24437.862139, current state active+undersized+degraded, last acting [ceph-04,ceph-07,ceph-02,ceph-06,2147483647,ceph-01,ceph-05]
ceph-03 is missing

pg 22.3e5 is stuck undersized for 24437.860025, current state active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
pg 22.3e5 is stuck undersized for 24437.860025, current state active+undersized+degraded, last acting [ceph-06,ceph-02,ceph-07,ceph-01,ceph-05,ceph-03,2147483647]
ceph-04 is missing

Perhaps I hit a PGs-per-OSD maximum?! I checked with the script from http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd

pool :   17   18   19    9   10   20   21   13   22   23   16 | SUM
...
host ceph-03:
osd.24    0   12    2    2    4   76   16    5   74    0   66 | 257
osd.25    0   17    3    4    4   89   16    4   82    0   60 | 279
osd.26    0   20    2    5    3   71   12    5   81    0   61 | 260
osd.27    0   18    2    4    3   73   21    3   76    0   61 | 261
osd.28    0   14    2    9    4   73   23    9   94    0   64 | 292
osd.29    0   19    3    3    4   54   25    4   89    0   62 | 263
osd.30    0   22    2    6    3   80   15    6   92    0   47 | 273
osd.31    0   25    4    2    3   87   20    3   76    0   62 | 282
osd.32    0   13    4    2    2   64   14    1   82    0   69 | 251
osd.33    0   12    2    5    5   89   25    7   83    0   68 | 296
osd.34    0   28    0    8    5   81   18    3   99    0   65 | 307
osd.35    0   17    3    2    4   74   21    3   95    0   58 | 277
host ceph-04:
osd.36    0   13    1    9    6   72   17    5   93    0   56 | 272
osd.37    0   21    2    5    6   83   20    4   78    0   71 | 290
osd.38    0   17    3    2    5   64   22    7   76    0   57 | 253
osd.39    0   21    3    7    6   79   27    4   80    0   68 | 295
osd.40    0   15    1    5    7   71   17    6   93    0   74 | 289
osd.41    0   16    5    5    6   76   18    6   95    0   70 | 297
osd.42    0   13    0    6    1   71   25    4   83    0   56 | 259
osd.43    0   20    2    2    6   81   23    4   89    0   59 | 286
osd.44    0   21    2    5    6   77    9    5   76    0   52 | 253
osd.45    0   11    4    8    3   76   24    6   82    0   49 | 263
osd.46    0   17    2    5    6   57   15    4   84    0   62 | 252
osd.47    0   19    3    2    3   84   19    5   94    0   48 | 277
...
SUM :   768 1536  192  384  384 6144 1536  384 7168   24 5120

Pool 22 is the new ec7archiv, but on ceph-04 there isn't an OSD with more than 300 PGs...

Udo

On 25.03.2015 14:52, Gregory Farnum wrote:
On Wed, Mar 25, 2015 at 1:20 AM, Udo Lembke ulem...@polarzone.de wrote:
Hi, due to two more hosts (now 7 storage nodes) I want to create a new EC pool and get a strange effect:
ceph@admin:~$ ceph health detail
HEALTH_WARN 2 pgs degraded; 2 pgs stuck degraded; 2 pgs stuck unclean; 2 pgs stuck undersized; 2 pgs undersized
This is the big clue: you have two undersized PGs!
pg 22.3e5 is stuck unclean since forever, current state active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
2147483647 is the largest number you can represent in a signed 32-bit integer. There's an output error of some kind which is fixed elsewhere; this should be -1. So for whatever reason (in general it's hard on CRUSH trying to select N entries out of N choices), CRUSH hasn't been able to map an OSD to this slot for you. You'll want to figure out why that is and fix it.
-Greg
[ceph-users] Radosgw authorization failed
Hi all, I'm testing backup product which supports Amazon S3 as target for Archive storage and I'm trying to setup a Ceph cluster configured with the S3 API to use as an internal target for backup archives instead of AWS. I've followed the online guide for setting up Radosgw and created a default region and zone based on the AWS naming convention US-East-1. I'm not sure if this is relevant but since I was having issues I thought it might need to be the same. I've tested the radosgw using boto.s3 and it seems to work ok i.e. I can create a bucket, create a folder, list buckets etc. The problem is when the backup software tries to create an object I get an authorization failure. It's using the same user/access/secret as I'm using from boto.s3 and I'm sure the creds are right as it lets me create the initial connection, it just fails when trying to create an object (backup folder). Here's the extract from the radosgw log: - 2015-03-25 15:07:26.449227 7f1050dc7700 2 req 5:0.000419:s3:GET /:list_bucket:init op 2015-03-25 15:07:26.449232 7f1050dc7700 2 req 5:0.000424:s3:GET /:list_bucket:verifying op mask 2015-03-25 15:07:26.449234 7f1050dc7700 20 required_mask= 1 user.op_mask=7 2015-03-25 15:07:26.449235 7f1050dc7700 2 req 5:0.000427:s3:GET /:list_bucket:verifying op permissions 2015-03-25 15:07:26.449237 7f1050dc7700 5 Searching permissions for uid=test mask=49 2015-03-25 15:07:26.449238 7f1050dc7700 5 Found permission: 15 2015-03-25 15:07:26.449239 7f1050dc7700 5 Searching permissions for group=1 mask=49 2015-03-25 15:07:26.449240 7f1050dc7700 5 Found permission: 15 2015-03-25 15:07:26.449241 7f1050dc7700 5 Searching permissions for group=2 mask=49 2015-03-25 15:07:26.449242 7f1050dc7700 5 Found permission: 15 2015-03-25 15:07:26.449243 7f1050dc7700 5 Getting permissions id=test owner=test perm=1 2015-03-25 15:07:26.449244 7f1050dc7700 10 uid=test requested perm (type)=1, policy perm=1, user_perm_mask=1, acl perm=1 2015-03-25 15:07:26.449245 7f1050dc7700 2 req 5:0.000437:s3:GET /:list_bucket:verifying op params 2015-03-25 15:07:26.449247 7f1050dc7700 2 req 5:0.000439:s3:GET /:list_bucket:executing 2015-03-25 15:07:26.449252 7f1050dc7700 10 cls_bucket_list test1(@{i=.us-east.rgw.buckets.index}.us-east.rgw.buckets[us-east.280959.2]) start num 1001 2015-03-25 15:07:26.450828 7f1050dc7700 2 req 5:0.002020:s3:GET /:list_bucket:http status=200 2015-03-25 15:07:26.450832 7f1050dc7700 1 == req done req=0x7f107000e2e0 http_status=200 == 2015-03-25 15:07:26.516999 7f1069df9700 20 enqueued request req=0x7f107000f0e0 2015-03-25 15:07:26.517006 7f1069df9700 20 RGWWQ: 2015-03-25 15:07:26.517007 7f1069df9700 20 req: 0x7f107000f0e0 2015-03-25 15:07:26.517010 7f1069df9700 10 allocated request req=0x7f107000f6b0 2015-03-25 15:07:26.517021 7f1058dd7700 20 dequeued request req=0x7f107000f0e0 2015-03-25 15:07:26.517023 7f1058dd7700 20 RGWWQ: empty 2015-03-25 15:07:26.517081 7f1058dd7700 20 CONTENT_LENGTH=88 2015-03-25 15:07:26.517084 7f1058dd7700 20 CONTENT_TYPE=application/octet-stream 2015-03-25 15:07:26.517085 7f1058dd7700 20 CONTEXT_DOCUMENT_ROOT=/var/www 2015-03-25 15:07:26.517086 7f1058dd7700 20 CONTEXT_PREFIX= 2015-03-25 15:07:26.517087 7f1058dd7700 20 DOCUMENT_ROOT=/var/www 2015-03-25 15:07:26.517088 7f1058dd7700 20 FCGI_ROLE=RESPONDER 2015-03-25 15:07:26.517089 7f1058dd7700 20 GATEWAY_INTERFACE=CGI/1.1 2015-03-25 15:07:26.517090 7f1058dd7700 20 HTTP_AUTHORIZATION=AWS F79L68W19B3GCLOSE3F8:AcXqtvlBzBMpwdL+WuhDRoLT/Bs= 2015-03-25 15:07:26.517091 7f1058dd7700 20 HTTP_CONNECTION=Keep-Alive 2015-03-25 
15:07:26.517092 7f1058dd7700 20 HTTP_DATE=Wed, 25 Mar 2015 15:07:26 GMT 2015-03-25 15:07:26.517092 7f1058dd7700 20 HTTP_EXPECT=100-continue 2015-03-25 15:07:26.517093 7f1058dd7700 20 HTTP_HOST=test1.devops-os-cog01.devops.local 2015-03-25 15:07:26.517094 7f1058dd7700 20 HTTP_USER_AGENT=aws-sdk-java/unknown-version Windows_Server_2008_R2/6.1 Java_HotSpot(TM)_Client_VM/24.55-b03 2015-03-25 15:07:26.517096 7f1058dd7700 20 HTTP_X_AMZ_META_CREATIONTIME=2015-03-25T15:07:26 2015-03-25 15:07:26.517097 7f1058dd7700 20 HTTP_X_AMZ_META_SIZE=88 2015-03-25 15:07:26.517098 7f1058dd7700 20 HTTP_X_AMZ_STORAGE_CLASS=STANDARD 2015-03-25 15:07:26.517099 7f1058dd7700 20 HTTPS=on 2015-03-25 15:07:26.517100 7f1058dd7700 20 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2015-03-25 15:07:26.517100 7f1058dd7700 20 QUERY_STRING= 2015-03-25 15:07:26.517101 7f1058dd7700 20 REMOTE_ADDR=10.40.41.106 2015-03-25 15:07:26.517102 7f1058dd7700 20 REMOTE_PORT=55439 2015-03-25 15:07:26.517103 7f1058dd7700 20 REQUEST_METHOD=PUT 2015-03-25 15:07:26.517104 7f1058dd7700 20 REQUEST_SCHEME=https 2015-03-25 15:07:26.517105 7f1058dd7700 20 REQUEST_URI=/ca_ccifs_c6dccf63-ec57-45b2-87e7-d9b14b971ca3 2015-03-25 15:07:26.517106 7f1058dd7700 20
Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded
Hi Don,

thanks for the info! It looks like setting choose_tries to 200 did the trick. But the setcrushmap is taking a long, long time (alarming, but the clients still have IO)... hope it finishes soon ;-)

Udo

On 25.03.2015 16:00, Don Doerner wrote:
Assuming you've calculated the number of PGs reasonably, see here (http://tracker.ceph.com/issues/10350) and here (http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon). I'm guessing these will address your issue. That weird number means that no OSD was found/assigned to the PG.
-don-
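For anyone finding this thread later, the edit Udo describes follows the "CRUSH gives up too soon" procedure and looks roughly like this (the rule to edit is whichever rule the EC pool uses; 200 is the value Udo reports using):

ceph osd getcrushmap -o /tmp/crushmap
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
# in the erasure-coded rule, add (or raise) the retry count before the chooseleaf step:
#     step set_choose_tries 200
crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
ceph osd setcrushmap -i /tmp/crushmap.new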
Re: [ceph-users] Uneven CPU usage on OSD nodes
Hi Somnath,

Thanks, the tcmalloc env variable trick definitely had an impact on the FetchFromSpans calls.

export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=1310851072; /etc/init.d/ceph stop; /etc/init.d/ceph start

Nevertheless, even though the FetchFromSpans library call activity is now even on all hosts, the CPU activity of the ceph-osd processes remains twice as high on 2 hosts:
http://www.4shared.com/photo/3IP8jGPWba/UnevenLoad4-perf.html
http://www.4shared.com/photo/XX4C9NHTba/UnevenLoad4-top.html
and this can be observed both under the load of a benchmark and when idling:
http://www.4shared.com/photo/x2Fl_in-ce/UnevenLoad4-top-idle.html

I'm now almost doubting the values reported by the command 'top', as 'perf top' doesn't reveal major differences in calls...

Could you elaborate on your sentence "saw the node consuming more cpus has more memory pressure as well"? You mean on your side? I can't see memory pressure on my hosts (~28GB available mem), but perhaps I'm not looking at the right thing. And there is no swap in use on the hosts.

Here is the osd tree leading to the linear distribution I mentioned:

ceph osd tree
# id    weight  type name       up/down reweight
-1      217.8   root default
-2      54.45           host siggy
0       3.63                    osd.0   up      1
1       3.63                    osd.1   up      1
2       3.63                    osd.2   up      1
3       3.63                    osd.3   up      1
4       3.63                    osd.4   up      1
5       3.63                    osd.5   up      1
6       3.63                    osd.6   up      1
7       3.63                    osd.7   up      1
8       3.63                    osd.8   up      1
9       3.63                    osd.9   up      1
10      3.63                    osd.10  up      1
11      3.63                    osd.11  up      1
12      3.63                    osd.12  up      1
13      3.63                    osd.13  up      1
14      3.63                    osd.14  up      1
-3      54.45           host horik
15      3.63                    osd.15  up      1
16      3.63                    osd.16  up      1
17      3.63                    osd.17  up      1
18      3.63                    osd.18  up      1
19      3.63                    osd.19  up      1
20      3.63                    osd.20  up      1
21      3.63                    osd.21  up      1
22      3.63                    osd.22  up      1
23      3.63                    osd.23  up      1
24      3.63                    osd.24  up      1
25      3.63                    osd.25  up      1
26      3.63                    osd.26  up      1
27      3.63                    osd.27  up      1
28      3.63                    osd.28  up      1
29      3.63                    osd.29  up      1
-4      54.45           host floki
30      3.63                    osd.30  up      1
31      3.63                    osd.31  up      1
32      3.63                    osd.32  up      1
33      3.63                    osd.33  up      1
34      3.63                    osd.34  up      1
35      3.63                    osd.35  up      1
36      3.63                    osd.36  up      1
37      3.63                    osd.37  up      1
38      3.63                    osd.38  up      1
39      3.63                    osd.39  up      1
40      3.63                    osd.40  up      1
41      3.63                    osd.41  up      1
42      3.63                    osd.42  up      1
43      3.63                    osd.43  up      1
44      3.63                    osd.44  up      1
-5      54.45           host borg
45      3.63                    osd.45  up      1
46      3.63                    osd.46  up      1
47      3.63                    osd.47  up      1
48      3.63                    osd.48  up      1
49      3.63                    osd.49  up      1
50      3.63                    osd.50  up      1
51      3.63                    osd.51  up      1
52      3.63                    osd.52  up      1
53      3.63                    osd.53  up      1
54      3.63                    osd.54  up      1
55      3.63                    osd.55  up      1
56      3.63                    osd.56  up      1
57      3.63                    osd.57  up      1
58      3.63                    osd.58  up      1
59      3.63                    osd.59  up      1

Regards,
Frederic

On 23/03/15 17:33, Somnath Roy somnath@sandisk.com wrote:
Yes, we are also facing a similar issue under load (and after running for some time). This is a tcmalloc behavior. You can try setting the following env variable to a bigger value, say 128MB or so: TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES. This env variable is supposed to alleviate the issue, but what we found is that in the Ubuntu 14.04 version of tcmalloc this env variable is a noop. This was a bug in tcmalloc which has been fixed in the latest tcmalloc code base. Not sure about RHEL though. In that case, you may want to try the latest tcmalloc; just pointing LD_LIBRARY_PATH to the new tcmalloc location should work. The latest Ceph master has support for jemalloc, and you may want to try that if this is your test cluster. Another point: I saw the node consuming more cpus has more memory pressure as well (and that's why tcmalloc is also having that issue). Can you give us the output of 'ceph osd tree' to check if the load distribution is even? Also, check if those systems are swapping or not. Hope
Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded
Assuming you've calculated the number of PGs reasonably, see here (http://tracker.ceph.com/issues/10350) and here (http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon). I'm guessing these will address your issue. That weird number means that no OSD was found/assigned to the PG.

-don-

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Udo Lembke
Sent: 25 March, 2015 01:21
To: ceph-us...@ceph.com
Subject: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded
Re: [ceph-users] Uneven CPU usage on OSD nodes
Hi Fredrick,
See my responses inline.

Thanks & Regards
Somnath

From: f...@univ-lr.fr [mailto:f...@univ-lr.fr]
Sent: Wednesday, March 25, 2015 8:07 AM
To: Somnath Roy
Cc: Ceph Users
Subject: Re: [ceph-users] Uneven CPU usage on OSD nodes

Hi Somnath, Thanks, the tcmalloc env variable trick definitely had an impact on the FetchFromSpans calls. export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=1310851072; /etc/init.d/ceph stop; /etc/init.d/ceph start Nevertheless, even though the FetchFromSpans library call activity is now even on all hosts, the CPU activity of the ceph-osd processes remains twice as high on 2 hosts: http://www.4shared.com/photo/3IP8jGPWba/UnevenLoad4-perf.html http://www.4shared.com/photo/XX4C9NHTba/UnevenLoad4-top.html and this can be observed both under the load of a benchmark and when idling: http://www.4shared.com/photo/x2Fl_in-ce/UnevenLoad4-top-idle.html

[Somnath] Hope you are using the latest tcmalloc; as I said, there is a bug in the tcmalloc shipped with Ubuntu 14.04. Not sure about RHEL though. Nevertheless, the tcmalloc stuff seems to have gone away. Now it is all about crc. As you can see from perf top, the crc calculation is taking more cpu on the two nodes; I guess that's the difference now. Please turn off crc calculation by using the following config options:
#ms_nocrc = true    --- This is in Giant and prior
//Following two for the latest master/hammer
ms_crc_data = false
ms_crc_header = false
The idle time cpu difference is not that bad. Need 'perf top' to see what is going on in idle time.

I'm now almost doubting the values reported by the command 'top', as 'perf top' doesn't reveal major differences in calls... Could you elaborate on your sentence "saw the node consuming more cpus has more memory pressure as well"? You mean on your side? I can't see memory pressure on my hosts (~28GB available mem), but perhaps I'm not looking at the right thing. And no swap on the hosts.

[Somnath] In your previous screenshots, the node having more cpu usage was also using more memory; the mem% reported by top was higher for the ceph-osds there. That's what I was pointing at. But now it is similar in both cases.
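For reference, a minimal sketch of where those settings would live: ceph.conf on the OSD nodes, followed by an OSD restart. Whether you need ms_nocrc or the ms_crc_* pair depends on the release, as Somnath notes.

[osd]
    ms_nocrc = true
    # on a recent master/hammer build use these two instead:
    # ms_crc_data = false
    # ms_crc_header = false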
Re: [ceph-users] how do I destroy cephfs? (interested in cephfs + tiering + erasure coding)
On Wed, Mar 25, 2015 at 10:36 AM, Jake Grimmett j...@mrc-lmb.cam.ac.uk wrote:
Dear All,
Please forgive this post if it's naive, I'm trying to familiarise myself with cephfs!
I'm using Scientific Linux 6.6 with Ceph 0.87.1. My first steps with cephfs using a replicated pool worked OK. I'm now trying to test cephfs via a replicated caching tier on top of an erasure pool. I've created an erasure pool, but cannot put it under the existing replicated pool.
My thought was to delete the existing cephfs and start again, however I cannot delete the existing cephfs; the errors are as follows:
[root@ceph1 ~]# ceph fs rm cephfs2
Error EINVAL: all MDS daemons must be inactive before removing filesystem
I've tried killing the ceph-mds process, but this does not prevent the above error.
I've also tried this, which also errors:
[root@ceph1 ~]# ceph mds stop 0
Error EBUSY: must decrease max_mds or else MDS will immediately reactivate

Right, so did you run "ceph mds set_max_mds 0" and then repeat the stop command? :)

This also fails...
[root@ceph1 ~]# ceph-deploy mds destroy
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.21): /usr/bin/ceph-deploy mds destroy
[ceph_deploy.mds][ERROR ] subcommand destroy not implemented

Am I doing the right thing in trying to wipe the original cephfs config before attempting to use an erasure cold tier? Or can I just redefine the cephfs?

Yeah, unfortunately you need to recreate it if you want to try and use an EC pool with cache tiering, because CephFS knows what pools it expects data to belong to. Things are unlikely to behave correctly if you try and stick an EC pool under an existing one. :(
Sounds like this is all just testing, which is good, because the suitability of EC+cache is very dependent on how much hot data you have, etc. Good luck!
-Greg

many thanks, Jake Grimmett
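Putting the pieces of this thread together, the teardown sequence on Giant would look roughly like the sketch below. The filesystem name is Jake's; the placeholder pool name and the confirmation flags are shown because these commands destroy data, so double-check everything first.

ceph mds set_max_mds 0
ceph mds stop 0
# if the rank still shows as active, "ceph mds fail 0" may also be needed
ceph fs rm cephfs2 --yes-i-really-mean-it
# old data/metadata pools can then be removed before recreating the filesystem
ceph osd pool delete <old_pool> <old_pool> --yes-i-really-really-mean-it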
Re: [ceph-users] New deployment: errors starting OSDs: invalid (someone else's?) journal
I don't know much about ceph-deploy, but I know that ceph-disk has problems automatically adding an SSD OSD when there are journals of other disks already on it. I've had to partition the disk ahead of time and pass in the partitions to make ceph-disk work.

Also, unless you are sure that the /dev devices will deterministically get the same names each time, I'd recommend you not use /dev/sd* for pointing to your journals. Instead use something that will always be the same: since Ceph will partition the disks with GPT, you can use the partuuid to point to the journal partition and it will always be right. A while back I used this to fix my journal links when I did it wrong. You will want to double check that it will work right for you. No warranty and all that jazz...

# convert the /dev/sd* links for journals into UUIDs
for lnk in $(ls /var/lib/ceph/osd/); do
  OSD=/var/lib/ceph/osd/$lnk
  DEV=$(readlink $OSD/journal | cut -d'/' -f3)
  echo $DEV
  PUUID=$(ls -lh /dev/disk/by-partuuid/ | grep $DEV | cut -d' ' -f 9)
  ln -sf /dev/disk/by-partuuid/$PUUID $OSD/journal
done

On Wed, Mar 25, 2015 at 10:46 AM, Antonio Messina antonio.s.mess...@gmail.com wrote:
Hi all,
I'm trying to install ceph on a 7-node preproduction cluster. Each node has 24x 4TB SAS disks (2x Dell MD1400 enclosures) and 6x 800GB SSDs (for cache tiering, not journals). I'm using Ubuntu 14.04 and ceph-deploy to install the cluster; I've been trying both Firefly and Giant and getting the same error. However, the logs I'm reporting are from the Firefly installation.
The installation seems to go fine until I try to install the last 2 OSDs (they are SSD disks) of each host. All the OSDs from 0 to 195 are UP and IN, but when I try to deploy the next OSD (no matter on what host) the ceph-osd daemon won't start.
The error I get is: 2015-03-25 17:00:17.130937 7fe231312800 0 ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), process ceph-osd, pid 20280 2015-03-25 17:00:17.133601 7fe231312800 10 filestore(/var/lib/ceph/osd/ceph-196) dump_stop 2015-03-25 17:00:17.133694 7fe231312800 5 filestore(/var/lib/ceph/osd/ceph-196) basedir /var/lib/ceph/osd/ceph-196 journal /var/lib/ceph/osd/ceph-196/journal 2015-03-25 17:00:17.133725 7fe231312800 10 filestore(/var/lib/ceph/osd/ceph-196) mount fsid is 8c2fa707-750a-4773-8918-a368367d9cf5 2015-03-25 17:00:17.133789 7fe231312800 0 filestore(/var/lib/ceph/osd/ceph-196) mount detected xfs (libxfs) 2015-03-25 17:00:17.133810 7fe231312800 1 filestore(/var/lib/ceph/osd/ceph-196) disabling 'filestore replica fadvise' due to known issues with fadvise(DONTNEED) on xfs 2015-03-25 17:00:17.135882 7fe231312800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-196) detect_features: FIEMAP ioctl is supported and appears to work 2015-03-25 17:00:17.135892 7fe231312800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-196) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option 2015-03-25 17:00:17.136318 7fe231312800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-196) detect_features: syncfs(2) syscall fully supported (by glibc and kernel) 2015-03-25 17:00:17.136373 7fe231312800 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-196) detect_feature: extsize is disabled by conf 2015-03-25 17:00:17.136640 7fe231312800 5 filestore(/var/lib/ceph/osd/ceph-196) mount op_seq is 1 2015-03-25 17:00:17.137547 7fe231312800 20 filestore (init)dbobjectmap: seq is 1 2015-03-25 17:00:17.137560 7fe231312800 10 filestore(/var/lib/ceph/osd/ceph-196) open_journal at /var/lib/ceph/osd/ceph-196/journal 2015-03-25 17:00:17.137575 7fe231312800 0 filestore(/var/lib/ceph/osd/ceph-196) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled 2015-03-25 17:00:17.137580 7fe231312800 10 filestore(/var/lib/ceph/osd/ceph-196) list_collections 2015-03-25 17:00:17.137661 7fe231312800 10 journal journal_replay fs op_seq 1 2015-03-25 17:00:17.137668 7fe231312800 2 journal open /var/lib/ceph/osd/ceph-196/journal fsid 8c2fa707-750a-4773-8918-a368367d9cf5 fs_op_seq 1 2015-03-25 17:00:17.137670 7fe22b8b1700 20 filestore(/var/lib/ceph/osd/ceph-196) sync_entry waiting for max_interval 5.00 2015-03-25 17:00:17.137690 7fe231312800 10 journal _open_block_device: ignoring osd journal size. We'll use the entire block device (size: 5367661056) 2015-03-25 17:00:17.162489 7fe231312800 1 journal _open /var/lib/ceph/osd/ceph-196/journal fd 20: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1 2015-03-25 17:00:17.162502 7fe231312800 10 journal read_header 2015-03-25 17:00:17.172249 7fe231312800 10 journal header: block_size 4096 alignment 4096 max_size 5367660544 2015-03-25 17:00:17.172256 7fe231312800 10 journal header: start 50987008 2015-03-25 17:00:17.172257 7fe231312800 10 journal write_pos 4096 2015-03-25 17:00:17.172259 7fe231312800 10 journal open header.fsid = 942f2d62-dd99-42a8-878a-feea443aaa61 2015-03-25 17:00:17.172264
Re: [ceph-users] New deployment: errors starting OSDs: invalid (someone else's?) journal
On Wed, Mar 25, 2015 at 6:06 PM, Robert LeBlanc rob...@leblancnet.us wrote: I don't know much about ceph-deploy, but I know that ceph-disk has problems automatically adding an SSD OSD when there are journals of other disks already on it. I've had to partition the disk ahead of time and pass in the partitions to make ceph-disk work. This is not my case: the journal is created automatically by ceph-deploy on the same disk, so that for each disk, /dev/sdX1 is the data partition and /dev/sdX2 is the journal partition. This is also what I want: I know there is a performance drop, but I expect it to be mitigated by the cache tier. (and I plan to test both configuration anyway) Also, unless you are sure that the dev devices will be deterministicly named the same each time, I'd recommend you not use /dev/sd* for pointing to your journals. Instead use something that will always be the same, since Ceph with partition the disks with GPT, you can use the partuuid to point to the journal partition and it will always be right. A while back I used this to fix my journal links when I did it wrong. You will want to double check that it will work right for you. no warranty and all that jazz... Thank you for pointing this out, it's an important point. However, the links are actually created using the partuuid. The command I posted in my previous email included the output of a pair of nested readlink in order to get the /dev/sd* names, because in this way it's easier to see if there are duplicates and where :) The output of ls -l /var/lib/ceph/osd/ceph-*/journal is actually: lrwxrwxrwx 1 root root 58 Mar 25 11:38 /var/lib/ceph/osd/ceph-0/journal - /dev/disk/by-partuuid/18305316-96b0-4654-aaad-7aeb891429f6 lrwxrwxrwx 1 root root 58 Mar 25 11:49 /var/lib/ceph/osd/ceph-7/journal - /dev/disk/by-partuuid/a263b19a-cb0d-4b4c-bd81-314619d5755d lrwxrwxrwx 1 root root 58 Mar 25 12:21 /var/lib/ceph/osd/ceph-14/journal - /dev/disk/by-partuuid/79734e0e-87dd-40c7-ba83-0d49695a75fb lrwxrwxrwx 1 root root 58 Mar 25 12:31 /var/lib/ceph/osd/ceph-21/journal - /dev/disk/by-partuuid/73a504bc-3179-43fd-942c-13c6bd8633c5 lrwxrwxrwx 1 root root 58 Mar 25 12:42 /var/lib/ceph/osd/ceph-28/journal - /dev/disk/by-partuuid/ecff10df-d757-4b1f-bef4-88dd84d84ef1 lrwxrwxrwx 1 root root 58 Mar 25 12:52 /var/lib/ceph/osd/ceph-35/journal - /dev/disk/by-partuuid/5be30238-3f07-4950-b39f-f5e4c7305e4c lrwxrwxrwx 1 root root 58 Mar 25 13:02 /var/lib/ceph/osd/ceph-42/journal - /dev/disk/by-partuuid/3cdb65f2-474c-47fb-8d07-83e7518418ff lrwxrwxrwx 1 root root 58 Mar 25 13:12 /var/lib/ceph/osd/ceph-49/journal - /dev/disk/by-partuuid/a47fe2b7-e375-4eea-b7a9-0354a24548dc lrwxrwxrwx 1 root root 58 Mar 25 13:22 /var/lib/ceph/osd/ceph-56/journal - /dev/disk/by-partuuid/fb42b7d6-bc6c-4063-8b73-29beb1f65107 lrwxrwxrwx 1 root root 58 Mar 25 13:33 /var/lib/ceph/osd/ceph-63/journal - /dev/disk/by-partuuid/72aff32b-ca56-4c25-b8ea-ff3aba8db507 lrwxrwxrwx 1 root root 58 Mar 25 13:43 /var/lib/ceph/osd/ceph-70/journal - /dev/disk/by-partuuid/b7c17a75-47cd-401e-b963-afe910612bd6 lrwxrwxrwx 1 root root 58 Mar 25 13:53 /var/lib/ceph/osd/ceph-77/journal - /dev/disk/by-partuuid/2c1c2501-fa82-4fc9-a586-03cc4d68faef lrwxrwxrwx 1 root root 58 Mar 25 14:03 /var/lib/ceph/osd/ceph-84/journal - /dev/disk/by-partuuid/46f619a5-3edf-44e9-99a6-24d98bcd174a lrwxrwxrwx 1 root root 58 Mar 25 14:13 /var/lib/ceph/osd/ceph-91/journal - /dev/disk/by-partuuid/5feef832-dd82-4aa0-9264-dc9496a3f93a lrwxrwxrwx 1 root root 58 Mar 25 14:24 /var/lib/ceph/osd/ceph-98/journal - 
/dev/disk/by-partuuid/055793a0-99d4-49c4-9698-bd8880c21d9c lrwxrwxrwx 1 root root 58 Mar 25 14:34 /var/lib/ceph/osd/ceph-105/journal - /dev/disk/by-partuuid/20547f26-6ef3-422b-9732-ad8b0b5b5379 lrwxrwxrwx 1 root root 58 Mar 25 14:44 /var/lib/ceph/osd/ceph-112/journal - /dev/disk/by-partuuid/2abea809-59c4-41da-bb52-28ef1911ec43 lrwxrwxrwx 1 root root 58 Mar 25 14:54 /var/lib/ceph/osd/ceph-119/journal - /dev/disk/by-partuuid/d8d15bb8-4b3d-4375-b6e1-62794971df7e lrwxrwxrwx 1 root root 58 Mar 25 15:05 /var/lib/ceph/osd/ceph-126/journal - /dev/disk/by-partuuid/ff6ee2b2-9c33-4902-a5e3-f6e9db5714e9 lrwxrwxrwx 1 root root 58 Mar 25 15:15 /var/lib/ceph/osd/ceph-133/journal - /dev/disk/by-partuuid/9faccb6e-ada9-4742-aa31-eb1308769205 lrwxrwxrwx 1 root root 58 Mar 25 15:25 /var/lib/ceph/osd/ceph-140/journal - /dev/disk/by-partuuid/2df13c88-ee58-4881-a373-a36a09fb6366 lrwxrwxrwx 1 root root 58 Mar 25 15:36 /var/lib/ceph/osd/ceph-147/journal - /dev/disk/by-partuuid/13cda9d1-0fec-40cc-a6fc-7cc56f7ffb78 lrwxrwxrwx 1 root root 58 Mar 25 15:46 /var/lib/ceph/osd/ceph-154/journal - /dev/disk/by-partuuid/5d37bfe9-c0f9-49e0-a951-b0ed04c5de51 lrwxrwxrwx 1 root root 58 Mar 25 15:57 /var/lib/ceph/osd/ceph-161/journal - /dev/disk/by-partuuid/d34f3abb-3fb7-4875-90d3-d2d3836f6e4d lrwxrwxrwx 1 root root 58 Mar 25 16:07 /var/lib/ceph/osd/ceph-168/journal - /dev/disk/by-partuuid/02c3db3e-159c-47d9-8a63-0389ea89fad1 lrwxrwxrwx 1 root root 58 Mar 25 16:16
Re: [ceph-users] New deployment: errors starting OSDs: invalid (someone else's?) journal
Probably a case of trying to read too fast. Sorry about that.

As for your theory on the cache pool: I haven't tried that, but my gut feeling is that it won't help as much as having the journal on the SSD. The cache tier isn't trying to collate writes the way the journal does, so on the spindle you end up writing to two very different parts of the drive for every piece of data; the journal reduces that somewhat, but I feel it will still be significant. When I see writes coming off my SSD journals to the spindles, I'm still getting a lot of merged IO (at least during a backfill/recovery). I'm interested in your results.

As far as the foreign journal goes, I would run dd over the journal partition and try it again. It sounds like something didn't get cleaned up from a previous run.

On Wed, Mar 25, 2015 at 11:14 AM, Antonio Messina antonio.s.mess...@gmail.com wrote: Thank you for pointing this out, it's an important point. However, the links are actually created using the partuuid.
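A minimal sketch of that dd step, assuming (hypothetically) that /dev/sdf2 is the journal partition of the OSD reporting the foreign journal; double-check which device the OSD's journal symlink actually resolves to before running it, since zeroing the wrong partition is destructive:

  # wipe the stale journal header so the next OSD to use this partition
  # writes a fresh one with its own fsid
  dd if=/dev/zero of=/dev/sdf2 bs=1M count=100 oflag=direct
  # then retry the ceph-deploy / ceph-disk step for that OSD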
[ceph-users] won leader election with quorum during osd setcrushmap
Hi, due to PG trouble with an EC pool I modified the crushmap (adding step set_choose_tries 200) from

rule ec7archiv {
    ruleset 6
    type erasure
    min_size 3
    max_size 20
    step set_chooseleaf_tries 5
    step take default
    step chooseleaf indep 0 type host
    step emit
}

to

rule ec7archiv {
    ruleset 6
    type erasure
    min_size 3
    max_size 20
    step set_chooseleaf_tries 5
    step set_choose_tries 200
    step take default
    step chooseleaf indep 0 type host
    step emit
}

ceph osd setcrushmap has now been running for an hour, and ceph -w gives the following output:

2015-03-25 17:20:18.163295 mon.0 [INF] mdsmap e766: 1/1/1 up {0=b=up:active}, 1 up:standby
2015-03-25 17:20:18.163370 mon.0 [INF] osdmap e130004: 91 osds: 91 up, 91 in
2015-03-25 17:20:28.525445 mon.0 [INF] from='client.? 172.20.2.1:0/1007537' entity='client.admin' cmd=[{prefix: osd setcrushmap}]: dispatch
2015-03-25 17:20:28.525580 mon.0 [INF] mon.0 calling new monitor election
2015-03-25 17:20:28.526263 mon.0 [INF] mon.0@0 won leader election with quorum 0,1,2

Fortunately the clients (kvm) still have access to the cluster!! How long should such a setcrushmap take? Normally it's done in a few seconds. Does the setcrushmap have a chance to finish? Udo

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
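For future changes of this kind, the edited map can be dry-run with crushtool before injecting it, so a rule that cannot map enough OSDs is caught offline; a sketch, with the file names, the ruleset id 6 and an example k+m of 9 used as placeholders:

  # decompile the current map, edit it, recompile it
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  #   ... edit crushmap.txt, e.g. add "step set_choose_tries 200" ...
  crushtool -c crushmap.txt -o crushmap.new
  # dry-run the EC rule: --num-rep should be k+m of the erasure profile
  crushtool -i crushmap.new --test --rule 6 --num-rep 9 --show-bad-mappings
  # only then inject it
  ceph osd setcrushmap -i crushmap.new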
[ceph-users] how do I destroy cephfs? (interested in cephfs + tiering + erasure coding)
Dear All, Please forgive this post if it's naive, I'm trying to familiarise myself with cephfs! I'm using Scientific Linux 6.6 with Ceph 0.87.1. My first steps with cephfs using a replicated pool worked OK. Now trying to test cephfs via a replicated caching tier on top of an erasure pool. I've created an erasure pool, but cannot put it under the existing replicated pool. My thoughts were to delete the existing cephfs and start again, however I cannot delete the existing cephfs; the errors are as follows:

[root@ceph1 ~]# ceph fs rm cephfs2
Error EINVAL: all MDS daemons must be inactive before removing filesystem

I've tried killing the ceph-mds process, but this does not prevent the above error. I've also tried this, which also errors:

[root@ceph1 ~]# ceph mds stop 0
Error EBUSY: must decrease max_mds or else MDS will immediately reactivate

This also fails...

[root@ceph1 ~]# ceph-deploy mds destroy
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.21): /usr/bin/ceph-deploy mds destroy
[ceph_deploy.mds][ERROR ] subcommand destroy not implemented

Am I doing the right thing in trying to wipe the original cephfs config before attempting to use an erasure cold tier? Or can I just redefine the cephfs? many thanks, Jake Grimmett

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] New deployment: errors starting OSDs: invalid (someone else's?) journal
Hi all, I'm trying to install ceph on a 7-nodes preproduction cluster. Each node has 24x 4TB SAS disks (2x dell md1400 enclosures) and 6x 800GB SSDs (for cache tiering, not journals). I'm using Ubuntu 14.04 and ceph-deploy to install the cluster, I've been trying both Firefly and Giant and getting the same error. However, the logs I'm reporting are relative to the Firefly installation. The installation seems to go fine until I try to install the last 2 OSDs (they are SSD disks) of each host. All the OSDs from 0 to 195 are UP and IN, but when I try to deploy the next OSD (no matter what host) ceph-osd daemon won't start. The error I get is: 2015-03-25 17:00:17.130937 7fe231312800 0 ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), process ceph-osd, pid 20280 2015-03-25 17:00:17.133601 7fe231312800 10 filestore(/var/lib/ceph/osd/ceph-196) dump_stop 2015-03-25 17:00:17.133694 7fe231312800 5 filestore(/var/lib/ceph/osd/ceph-196) basedir /var/lib/ceph/osd/ceph-196 journal /var/lib/ceph/osd/ceph-196/journal 2015-03-25 17:00:17.133725 7fe231312800 10 filestore(/var/lib/ceph/osd/ceph-196) mount fsid is 8c2fa707-750a-4773-8918-a368367d9cf5 2015-03-25 17:00:17.133789 7fe231312800 0 filestore(/var/lib/ceph/osd/ceph-196) mount detected xfs (libxfs) 2015-03-25 17:00:17.133810 7fe231312800 1 filestore(/var/lib/ceph/osd/ceph-196) disabling 'filestore replica fadvise' due to known issues with fadvise(DONTNEED) on xfs 2015-03-25 17:00:17.135882 7fe231312800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-196) detect_features: FIEMAP ioctl is supported and appears to work 2015-03-25 17:00:17.135892 7fe231312800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-196) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option 2015-03-25 17:00:17.136318 7fe231312800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-196) detect_features: syncfs(2) syscall fully supported (by glibc and kernel) 2015-03-25 17:00:17.136373 7fe231312800 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-196) detect_feature: extsize is disabled by conf 2015-03-25 17:00:17.136640 7fe231312800 5 filestore(/var/lib/ceph/osd/ceph-196) mount op_seq is 1 2015-03-25 17:00:17.137547 7fe231312800 20 filestore (init)dbobjectmap: seq is 1 2015-03-25 17:00:17.137560 7fe231312800 10 filestore(/var/lib/ceph/osd/ceph-196) open_journal at /var/lib/ceph/osd/ceph-196/journal 2015-03-25 17:00:17.137575 7fe231312800 0 filestore(/var/lib/ceph/osd/ceph-196) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled 2015-03-25 17:00:17.137580 7fe231312800 10 filestore(/var/lib/ceph/osd/ceph-196) list_collections 2015-03-25 17:00:17.137661 7fe231312800 10 journal journal_replay fs op_seq 1 2015-03-25 17:00:17.137668 7fe231312800 2 journal open /var/lib/ceph/osd/ceph-196/journal fsid 8c2fa707-750a-4773-8918-a368367d9cf5 fs_op_seq 1 2015-03-25 17:00:17.137670 7fe22b8b1700 20 filestore(/var/lib/ceph/osd/ceph-196) sync_entry waiting for max_interval 5.00 2015-03-25 17:00:17.137690 7fe231312800 10 journal _open_block_device: ignoring osd journal size. 
We'll use the entire block device (size: 5367661056) 2015-03-25 17:00:17.162489 7fe231312800 1 journal _open /var/lib/ceph/osd/ceph-196/journal fd 20: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1 2015-03-25 17:00:17.162502 7fe231312800 10 journal read_header 2015-03-25 17:00:17.172249 7fe231312800 10 journal header: block_size 4096 alignment 4096 max_size 5367660544 2015-03-25 17:00:17.172256 7fe231312800 10 journal header: start 50987008 2015-03-25 17:00:17.172257 7fe231312800 10 journal write_pos 4096 2015-03-25 17:00:17.172259 7fe231312800 10 journal open header.fsid = 942f2d62-dd99-42a8-878a-feea443aaa61 2015-03-25 17:00:17.172264 7fe231312800 -1 journal FileJournal::open: ondisk fsid 942f2d62-dd99-42a8-878a-feea443aaa61 doesn't match expected 8c2fa707-750a-4773-8918-a368367d9cf5, invalid (someone else's?) journal 2015-03-25 17:00:17.172268 7fe231312800 3 journal journal_replay open failed with (22) Invalid argument 2015-03-25 17:00:17.172284 7fe231312800 -1 filestore(/var/lib/ceph/osd/ceph-196) mount failed to open journal /var/lib/ceph/osd/ceph-196/journal: (22) Invalid argument 2015-03-25 17:00:17.172304 7fe22b8b1700 20 filestore(/var/lib/ceph/osd/ceph-196) sync_entry woke after 0.034632 2015-03-25 17:00:17.172330 7fe22b8b1700 10 journal commit_start max_applied_seq 1, open_ops 0 2015-03-25 17:00:17.172333 7fe22b8b1700 10 journal commit_start blocked, all open_ops have completed 2015-03-25 17:00:17.172334 7fe22b8b1700 10 journal commit_start nothing to do 2015-03-25 17:00:17.172465 7fe231312800 -1 ** ERROR: error converting store /var/lib/ceph/osd/ceph-196: (22) Invalid argument I'm attaching the full log of ceph-deploy osd create osd-l2-05:sde and the /var/log/ceph/ceph-osd.196.log, after trying to re-start the osd with increased verbosing, as long as the ceph.conf I'm using. I've also checked if the journal symlinks were correct, and they all
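For what it's worth, a quick way to confirm a mismatch like this on the affected OSD (osd.196 here): the fsid file in the OSD's data directory is the value the journal header has to match, so comparing it with the partition the journal symlink resolves to shows whether that partition still carries another OSD's journal:

  # the fsid this OSD expects (the "expected ..." value in the log)
  cat /var/lib/ceph/osd/ceph-196/fsid
  # the partition its journal symlink actually resolves to
  readlink -f /var/lib/ceph/osd/ceph-196/journal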
Re: [ceph-users] Erasure coding
Great info! Many thanks! Tom

2015-03-25 13:30 GMT+01:00 Loic Dachary l...@dachary.org: Hi Tom, On 25/03/2015 11:31, Tom Verdaat wrote:

Hi guys, We've got a very small Ceph cluster (3 hosts, 5 OSDs each for cold data) that we intend to grow later on as more storage is needed. We would very much like to use Erasure Coding for some pools but are facing some challenges regarding the optimal initial profile “replication” settings given the limited number of initial hosts that we can use to spread the chunks. Could somebody please help me with the following questions?

1. Suppose we initially use replication instead of erasure. Can we convert a replicated pool to an erasure coded pool later on?

What you would do is create an erasure coded pool later and have the initial replicated pool as a cache in front of it. http://docs.ceph.com/docs/master/rados/operations/cache-tiering/ Objects from the replicated pool will move to the erasure coded pool if they are not used and it will save space. You don't need to create the erasure coded pool on your small cluster. You can do it when it grows larger or when it becomes full.

2. Will Ceph gain the ability to change the K and N values for an existing pool in the near future?

I don't think so.

3. Can the failure domain be changed for an existing pool? E.g. can we start with failure domain OSD and then switch it to Host after adding more hosts?

The failure domain, although listed in the erasure code profile for convenience, really belongs to the crush ruleset applied to the pool. It can therefore be changed after the pool is created. It is likely to result in objects moving a lot during the transition, but it should work fine otherwise.

4. Where can I find a good comparison of the available erasure code plugins that allows me to properly decide which one suits our needs best?

In a nutshell: jerasure is flexible and is likely to be what you want; isa computes faster than jerasure but only works on Intel processors (note however that the erasure code computation does not make a significant difference overall); lrc and shec (to be published in hammer) minimize network usage during recovery but use more space than jerasure or isa.

Cheers

Many thanks for your help! Tom

-- Loïc Dachary, Artisan Logiciel Libre

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
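As a concrete illustration of the answer to question 1, a sketch of the usual setup; the pool names, PG counts and k/m values below are placeholders, not taken from this thread:

  # an erasure coded pool, created later when the cluster is large enough
  ceph osd erasure-code-profile set ec_profile k=4 m=2 ruleset-failure-domain=host
  ceph osd pool create cold 128 128 erasure ec_profile
  # put the existing replicated pool "hot" in front of it as a writeback cache
  ceph osd tier add cold hot
  ceph osd tier cache-mode hot writeback
  ceph osd tier set-overlay cold hot
  ceph osd pool set hot hit_set_type bloom
  # eviction/sizing settings (target_max_bytes etc.) still need to be tuned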
Re: [ceph-users] clients and monitors
On Wed, 25 Mar 2015, Deneau, Tom wrote: A couple of client-monitor questions:

1) When a client contacts a monitor to get the cluster map, how does it decide which monitor to try to contact?

It picks a random monitor from the information it's seeded with at startup (via ceph.conf or the -m command line option). Once it reaches one mon, it gets the MonMap, which tells it who all of the mons are.

2) Having gotten the cluster map, assuming a client wants to do multiple reads and writes, does the client have to re-contact the monitor to get the latest cluster map for each operation?

No, the mons are primarily needed for the initial startup/authentication step. They are also queried as a last resort if the client thinks it has an out-of-date osdmap and hasn't gotten one from an OSD, and to renew authentication tickets. sage

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
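To illustrate the seeding Sage mentions: the client only needs one reachable address from its seed list to bootstrap, after which the MonMap it receives takes over (the addresses below are placeholders):

  # seed a one-off command with an explicit monitor address
  ceph -m 10.0.0.1:6789 -s
  # or list several seed monitors in ceph.conf so any client can bootstrap:
  #   [global]
  #   mon_host = 10.0.0.1,10.0.0.2,10.0.0.3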
[ceph-users] RGW Ceph Tech Talk Tomorrow
Hey cephers, Just a reminder that the monthly Ceph Tech Talk tomorrow at 1p EDT will be by Yehuda on the RADOS Gateway. Make sure you stop by to get a deeper technical understanding of RGW if you're interested. It's an open virtual meeting for those that wish to attend, and will also be recorded and put on YouTube for those unable to make it. http://ceph.com/ceph-tech-talks/ Please let me know if you have any questions. Thanks. -- Best Regards, Patrick McGarry Director Ceph Community || Red Hat http://ceph.com || http://community.redhat.com @scuttlemonkey || @ceph ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] clients and monitors
A couple of client-monitor questions: 1) When a client contacts a monitor to get the cluster map, how does it decide which monitor to try to contact? 2) Having gotten the cluster map, assuming a client wants to do multiple reads and writes, does the client have to re-contact the monitor to get the latest cluster map for each operation? -- Tom Deneau, AMD ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] error creating image in rbd-erasure-pool
Hi Greg, Thank you for this clarification. It helps a lot. Does this "can't think of any issues" apply to both rbd and pool snapshots? Frederic.

----- Original Message -----

On Tue, Mar 24, 2015 at 12:09 PM, Brendan Moloney molo...@ohsu.edu wrote: Hi Loic and Markus, By the way, Inktank does not support snapshots of a pool with cache tiering: * https://download.inktank.com/docs/ICE%201.2%20-%20Cache%20and%20Erasure%20Coding%20FAQ.pdf Hi, You seem to be talking about pool snapshots rather than RBD snapshots. But in the linked document it is not clear that there is a distinction: Can I use snapshots with a cache tier? Snapshots are not supported in conjunction with cache tiers. Can anyone clarify if this is just pool snapshots?

I think that was just a decision based on the newness and complexity of the feature for product purposes. Snapshots against cache tiered pools certainly should be fine in Giant/Hammer, and we can't think of any issues in Firefly off the tops of our heads. -Greg

-- Regards, Frédéric Nass.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
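For reference, the two kinds of snapshot being distinguished in this thread, with placeholder pool and image names; note that a given pool can only ever use one of the two styles, since RBD snapshots are self-managed snaps:

  # pool (RADOS-level) snapshot
  ceph osd pool mksnap mypool mysnap
  # RBD (self-managed) snapshot of a single image
  rbd snap create mypool/myimage@mysnap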