Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
On Wed, Jan 7, 2015 at 9:55 PM, Christian Balzer ch...@gol.com wrote: On Wed, 7 Jan 2015 17:07:46 -0800 Craig Lewis wrote: On Mon, Dec 29, 2014 at 4:49 PM, Alexandre Oliva ol...@gnu.org wrote: However, I suspect that temporarily setting min size to a lower number could be enough for the PGs to recover. If ceph osd pool set <pool> min_size 1 doesn't get the PGs going, I suppose restarting at least one of the OSDs involved in the recovery, so that the PG undergoes peering again, would get you going again. It depends on how incomplete your incomplete PGs are. min_size is defined as "Sets the minimum number of replicas required for I/O." By default, size is 3 and min_size is 2 on recent versions of ceph. If the number of replicas you have drops below min_size, then Ceph will mark the PG as incomplete. As long as you have one copy of the PG, you can recover by lowering the min_size to the number of copies you do have, then restoring the original value after recovery is complete. I did this last week when I deleted the wrong PGs as part of a toofull experiment. Which of course raises the question of why not have min_size at 1 permanently, so that in the (hopefully rare) case of losing 2 OSDs at the same time your cluster still keeps working (as it should with a size of 3). You no longer have write durability if you only have one copy of a PG. Sam is fixing things up so that recovery will work properly as long as you have a whole copy of the PG, which should make things behave as people expect. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
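For reference, a minimal sketch of the recovery sequence Craig and Alexandre describe (the pool name, OSD id and restart command are placeholders and will depend on your setup):

    # check the current replication settings
    ceph osd pool get <pool> size
    ceph osd pool get <pool> min_size

    # temporarily allow peering and I/O with a single surviving copy
    ceph osd pool set <pool> min_size 1

    # if the incomplete PGs still don't peer, restart one of the OSDs
    # involved so the PG goes through peering again, e.g. on sysvinit:
    sudo service ceph restart osd.12

    # once the PGs are active+clean again, restore the original value
    ceph osd pool set <pool> min_size 2

As Greg points out, this is a temporary measure only: leaving min_size at 1 means accepting writes with a single copy and no durability if that OSD fails.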
Re: [ceph-users] Documentation of ceph pg num query
On Fri, Jan 9, 2015 at 1:24 AM, Christian Eichelmann christian.eichelm...@1und1.de wrote: Hi all, as mentioned last year, our ceph cluster is still broken and unusable. We are still investigating what has happened and I am taking a deeper look into the output of ceph pg <pgnum> query. The problem is that I can find some information about what some of the sections mean, but mostly I can only guess. Is there any kind of documentation where I can find an explanation of what's stated there? Because without that the output is barely useful. There is unfortunately not really any documentation around this right now. If you have specific questions someone can probably help you with them, though. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
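In the meantime, the query output is plain JSON, so it can at least be sliced up for inspection; a hedged example (the pg id is a placeholder and key names can differ slightly between releases):

    # dump the whole thing once
    ceph pg 13.4a query > pg.13.4a.json

    # the sections that usually matter for a stuck PG:
    #  - recovery_state: why peering stopped and which OSDs it is waiting on
    #  - info / peer_info: last_update, last_complete and log ranges per replica
    jq '.recovery_state' pg.13.4a.json
    jq '.info.stats.state, .up, .acting' pg.13.4a.json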
Re: [ceph-users] Uniform distribution
100GB objects (or ~40 on a hard drive!) are way too large for you to get an effective random distribution. -Greg On Thu, Jan 8, 2015 at 5:25 PM, Mark Nelson mark.nel...@inktank.com wrote: On 01/08/2015 03:35 PM, Michael J Brewer wrote: Hi all, I'm working on filling a cluster to near capacity for testing purposes. Though I'm noticing that it isn't storing the data uniformly between OSDs during the filling process. I currently have the following levels: Node 1: /dev/sdb1 3904027124 2884673100 1019354024 74% /var/lib/ceph/osd/ceph-0 /dev/sdc1 3904027124 2306909388 1597117736 60% /var/lib/ceph/osd/ceph-1 /dev/sdd1 3904027124 3296767276 607259848 85% /var/lib/ceph/osd/ceph-2 /dev/sde1 3904027124 3670063612 233963512 95% /var/lib/ceph/osd/ceph-3 Node 2: /dev/sdb1 3904027124 3250627172 653399952 84% /var/lib/ceph/osd/ceph-4 /dev/sdc1 3904027124 3611337492 292689632 93% /var/lib/ceph/osd/ceph-5 /dev/sdd1 3904027124 2831199600 1072827524 73% /var/lib/ceph/osd/ceph-6 /dev/sde1 3904027124 2466292856 1437734268 64% /var/lib/ceph/osd/ceph-7 I am using rados put to upload 100g files to the cluster, doing two at a time from two different locations. Is this expected behavior, or can someone shed light on why it is doing this? We're using the opensource version 80.7. We're also using the default CRUSH configuration. So crush utilizes pseudo-random distributions, but sadly random distributions tend to be clumpy and not perfectly uniform until you get to very high sample counts. The gist of it is that if you have a really low density of PGs/OSD and/or are very unlucky, you can end up with a skewed distribution. If you are even more unlucky, you could compound that with a streak of objects landing on PGs associated with some specific OSD. This particular case looks rather bad. How many PGs and OSDs do you have? Regards, *MICHAEL J. BREWER* *Phone:* 1-512-286-5596 | *Tie-Line:* 363-5596* E-mail:*_mjbre...@us.ibm.com_ mailto:mjbre...@us.ibm.com 11501 Burnet Rd Austin, TX 78758-3400 United States ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
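A quick way to tell whether the imbalance comes from the PG-to-OSD mapping itself (too few PGs) or just from unlucky object placement is to count PG replicas per OSD; a rough sketch, assuming the column layout of firefly's pg dump pgs_brief output (pg, state, up set, up primary, acting set, acting primary):

    # count how many PGs include each OSD in their up set
    ceph pg dump pgs_brief 2>/dev/null | awk 'NR>1 {print $3}' | tr -d '[]' | tr ',' '\n' | sort -n | uniq -c | sort -rn

    # blunt remedy once the cluster is already unbalanced (threshold in percent)
    ceph osd reweight-by-utilization 110

A wide spread in the first output means the CRUSH mapping itself is skewed (usually too few PGs per OSD); a narrow spread means the skew is coming from object sizes and placement, as Greg and Mark describe.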
Re: [ceph-users] ceph on peta scale
On Thu, Jan 8, 2015 at 5:46 AM, Zeeshan Ali Shah zas...@pdc.kth.se wrote: I just finished configuring ceph up to 100 TB with openstack ... Since we are also using Lustre in our HPC machines , just wondering what is the bottle neck in ceph going on Peta Scale like Lustre . any idea ? or someone tried it If you're talking about people building a petabyte Ceph system, there are *many* who run clusters of that size. If you're talking about the Ceph filesystem as a replacement for Lustre at that scale, the concern is less about the raw amount of data and more about the resiliency of the current code base at that size...but if you want to try it out and tell us what problems you run into we will love you forever. ;) (The scalable file system use case is what actually spawned the Ceph project, so in theory there shouldn't be any serious scaling bottlenecks. In practice it will depend on what kind of metadata throughput you need because the multi-MDS stuff is improving but still less stable.) -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]
On Fri, Jan 9, 2015 at 2:00 AM, Nico Schottelius nico-ceph-us...@schottelius.org wrote: Lionel, Christian, we have exactly the same trouble as Christian, namely Christian Eichelmann [Fri, Jan 09, 2015 at 10:43:20AM +0100]: We still don't know what caused this specific error... and ...there is currently no way to make ceph forget about the data of this pg and create it as an empty one. So the only way to make this pool usable again is to lose all your data in there. I wonder what is the position of ceph developers regarding dropping (emptying) specific pgs? Is that a use case that was never thought of or tested? I've never worked directly on any of the clusters this has happened to, but I believe every time we've seen issues like this with somebody we have a relationship with it's either: 1) been resolved by using the existing tools to mark stuff lost, or 2) been the result of local filesystems/disks silently losing data due to some fault or other. The second case means the OSDs have corrupted state and trusting them is tricky. Also, most people we've had relationships with that this has happened to really want to not lose all the data in the PG, which necessitates manually mucking around anyway. ;) Mailing list issues are obviously a lot harder to categorize, but the ones we've taken time on where people say the commands don't work have generally fallen into the second bucket. If you want to experiment, I think all the manual mucking around has been done with the objectstore tool and removing bad PGs, moving them around, or faking journal entries, but I've not done it myself so I could be mistaken. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
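For reference, the commands being alluded to look roughly like the following; treat this strictly as a sketch (the pg id, OSD id and paths are placeholders, the tool is named ceph_objectstore_tool in giant, and exporting a copy before removing anything is strongly advised):

    # if the PG is active but missing objects, tell the cluster to give up on them
    ceph pg 13.1f4 mark_unfound_lost revert

    # with the OSD stopped: export a suspect PG from its filestore before touching it
    ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        --pgid 13.1f4 --op export --file /root/pg.13.1f4.export

    # and only then remove the bad copy from that OSD
    ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        --pgid 13.1f4 --op remove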
Re: [ceph-users] ceph on peta scale
On Mon, Jan 12, 2015 at 3:55 AM, Zeeshan Ali Shah zas...@pdc.kth.se wrote: Thanks Greg, No i am more into large scale RADOS system not filesystem . however for geographic distributed datacentres specially when network flactuate how to handle that as i read it seems CEPH need big pipe of network Ceph isn't really suited for WAN-style distribution. Some users have high-enough and consistent-enough bandwidth (with low enough latency) to do it, but otherwise you probably want to use Ceph within the data centers and layer something else on top of it. -Greg /Zee On Fri, Jan 9, 2015 at 7:15 PM, Gregory Farnum g...@gregs42.com wrote: On Thu, Jan 8, 2015 at 5:46 AM, Zeeshan Ali Shah zas...@pdc.kth.se wrote: I just finished configuring ceph up to 100 TB with openstack ... Since we are also using Lustre in our HPC machines , just wondering what is the bottle neck in ceph going on Peta Scale like Lustre . any idea ? or someone tried it If you're talking about people building a petabyte Ceph system, there are *many* who run clusters of that size. If you're talking about the Ceph filesystem as a replacement for Lustre at that scale, the concern is less about the raw amount of data and more about the resiliency of the current code base at that size...but if you want to try it out and tell us what problems you run into we will love you forever. ;) (The scalable file system use case is what actually spawned the Ceph project, so in theory there shouldn't be any serious scaling bottlenecks. In practice it will depend on what kind of metadata throughput you need because the multi-MDS stuff is improving but still less stable.) -Greg -- Regards Zeeshan Ali Shah System Administrator - PDC HPC PhD researcher (IT security) Kungliga Tekniska Hogskolan +46 8 790 9115 http://www.pdc.kth.se/members/zashah ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] reset osd perf counters
perf reset on the admin socket. I'm not sure what version it went in to; you can check the release logs if it doesn't work on whatever you have installed. :) -Greg On Mon, Jan 12, 2015 at 2:26 PM, Shain Miley smi...@npr.org wrote: Is there a way to 'reset' the osd perf counters? The numbers for osd 73 though osd 83 look really high compared to the rest of the numbers I see here. I was wondering if I could clear the counters out, so that I have a fresh set of data to work with. root@cephmount1:/var/log/samba# ceph osd perf osdid fs_commit_latency(ms) fs_apply_latency(ms) 0 0 45 1 0 14 2 0 47 3 0 25 4 1 44 5 12 6 12 7 0 39 8 0 32 9 0 34 10 2 186 11 0 68 12 11 13 0 34 14 01 15 2 37 16 0 23 17 0 28 18 0 26 19 0 22 20 02 21 2 24 22 0 33 23 01 24 3 98 25 2 70 26 01 27 3 99 28 02 29 2 101 30 2 72 31 2 81 32 3 112 33 3 94 34 4 152 35 0 56 36 02 37 2 58 38 01 39 03 40 02 41 02 42 11 43 02 44 1 44 45 02 46 01 47 3 85 48 01 49 2 75 50 4 398 51 3 115 52 01 53 2 47 54 6 290 55 5 153 56 7 453 57 2 66 58 11 59 5 196 60 00 61 0 93 62 09 63 01 64 01 65 04 66 01 67 0 18 68 0 16 69 0 81 70 0 70 71 00 72 01 7374 1217 74 01 7564 1238 7692 1248 77 01 78 01 79 109 1333 8068 1451 8166 1192 8295 1215 8381 1331 84 3 56 85 3 65 86 01 87 3 55 88
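If your build has it, the reset goes through the admin socket on the node that hosts the OSD; a hedged sketch (on some versions the argument is a specific counter name rather than "all"):

    # via the ceph CLI
    ceph daemon osd.73 perf reset all

    # or against the socket directly
    ceph --admin-daemon /var/run/ceph/ceph-osd.73.asok perf reset all

Note that resetting the counters only gives you a clean baseline to measure from; it won't change whatever is making osd.73 through osd.83 slow in the first place.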
Re: [ceph-users] cephfs modification time
Zheng, this looks like a kernel client issue to me, or else something funny is going on with the cap flushing and the timestamps (note how the reading client's ctime is set to an even second, while the mtime is ~.63 seconds later and matches what the writing client sees). Any ideas? -Greg On Mon, Jan 12, 2015 at 12:19 PM, Lorieri lori...@gmail.com wrote: Hi Gregory, $ uname -a Linux coreos2 3.17.7+ #2 SMP Tue Jan 6 08:22:04 UTC 2015 x86_64 Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz GenuineIntel GNU/Linux Kernel Client, using `mount -t ceph ...` core@coreos2 /var/run/systemd/system $ modinfo ceph filename: /lib/modules/3.17.7+/kernel/fs/ceph/ceph.ko license:GPL description:Ceph filesystem for Linux author: Patience Warnick patie...@newdream.net author: Yehuda Sadeh yeh...@hq.newdream.net author: Sage Weil s...@newdream.net alias: fs-ceph depends:libceph intree: Y vermagic: 3.17.7+ SMP mod_unload signer: Magrathea: Glacier signing key sig_key:D4:BB:DE:E9:C6:D8:FC:90:9F:23:59:B2:19:1B:B8:FA:57:A1:AF:D2 sig_hashalgo: sha256 core@coreos2 /var/run/systemd/system $ modinfo libceph filename: /lib/modules/3.17.7+/kernel/net/ceph/libceph.ko license:GPL description:Ceph filesystem for Linux author: Patience Warnick patie...@newdream.net author: Yehuda Sadeh yeh...@hq.newdream.net author: Sage Weil s...@newdream.net depends:libcrc32c intree: Y vermagic: 3.17.7+ SMP mod_unload signer: Magrathea: Glacier signing key sig_key:D4:BB:DE:E9:C6:D8:FC:90:9F:23:59:B2:19:1B:B8:FA:57:A1:AF:D2 sig_hashalgo: sha256 ceph is installed on a ubuntu containers (same kernel): $ dpkg -l |grep ceph ii ceph 0.87-1trusty amd64distributed storage and file system ii ceph-common 0.87-1trusty amd64common utilities to mount and interact with a ceph storage cluster ii ceph-fs-common 0.87-1trusty amd64common utilities to mount and interact with a ceph file system ii ceph-fuse0.87-1trusty amd64FUSE-based client for the Ceph distributed file system ii ceph-mds 0.87-1trusty amd64metadata server for the ceph distributed file system ii libcephfs1 0.87-1trusty amd64Ceph distributed file system client library ii python-ceph 0.87-1trusty amd64Python libraries for the Ceph distributed filesystem Reproducing the error: at machine 1: core@coreos1 /var/lib/deis/store/logs $ test.log core@coreos1 /var/lib/deis/store/logs $ echo 1 test.log core@coreos1 /var/lib/deis/store/logs $ stat test.log File: 'test.log' Size: 2 Blocks: 1 IO Block: 4194304 regular file Device: 0h/0d Inode: 1099511629882 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 500/core) Gid: ( 500/core) Access: 2015-01-12 20:05:03.0 + Modify: 2015-01-12 20:06:09.637234229 + Change: 2015-01-12 20:06:09.637234229 + Birth: - at machine 2: core@coreos2 /var/lib/deis/store/logs $ stat test.log File: 'test.log' Size: 2 Blocks: 1 IO Block: 4194304 regular file Device: 0h/0d Inode: 1099511629882 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 500/core) Gid: ( 500/core) Access: 2015-01-12 20:05:03.0 + Modify: 2015-01-12 20:06:09.637234229 + Change: 2015-01-12 20:06:09.0 + Birth: - Change time is not updated making some tail libs to not show new content until you force the change time be updated, like running a touch in the file. Some tools freeze and trigger other issues in the system. 
Tests, all in the machine #2: FAILED - https://github.com/ActiveState/tail FAILED - /usr/bin/tail of a Google docker image running debian wheezy PASSED - /usr/bin/tail of a ubuntu 14.04 docker image PASSED - /usr/bin/tail of the coreos release 494.5.0 Tests in machine #1 (same machine that is writing the file) all tests pass. On Mon, Jan 12, 2015 at 5:14 PM, Gregory Farnum g...@gregs42.com wrote: What versions of all the Ceph pieces are you using? (Kernel client/ceph-fuse, MDS, etc) Can you provide more details on exactly what the program is doing on which nodes? -Greg On Fri, Jan 9, 2015 at 5:15 PM, Lorieri lori...@gmail.com wrote: first 3 stat commands shows blocks and size changing, but not the times after a touch it changes and tail works I saw some cephfs freezes related to it, it came back after touching the files coreos2 logs # stat deis-router.log File: 'deis-router.log' Size: 148564 Blocks: 291IO Block: 4194304 regular file Device: 0h/0d Inode: 1099511628780 Links: 1 Access: (0644/-rw-r--r--) Uid: (0/root) Gid: (0/root) Access: 2015-01-10 01:13:00.100582619
Re: [ceph-users] cephfs modification time
Awesome, thanks for the bug report and the fix, guys. :) -Greg On Mon, Jan 12, 2015 at 11:18 PM, 严正 z...@redhat.com wrote: I tracked down the bug. Please try the attached patch Regards Yan, Zheng 在 2015年1月13日,07:40,Gregory Farnum g...@gregs42.com 写道: Zheng, this looks like a kernel client issue to me, or else something funny is going on with the cap flushing and the timestamps (note how the reading client's ctime is set to an even second, while the mtime is ~.63 seconds later and matches what the writing client sees). Any ideas? -Greg On Mon, Jan 12, 2015 at 12:19 PM, Lorieri lori...@gmail.com wrote: Hi Gregory, $ uname -a Linux coreos2 3.17.7+ #2 SMP Tue Jan 6 08:22:04 UTC 2015 x86_64 Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz GenuineIntel GNU/Linux Kernel Client, using `mount -t ceph ...` core@coreos2 /var/run/systemd/system $ modinfo ceph filename: /lib/modules/3.17.7+/kernel/fs/ceph/ceph.ko license:GPL description:Ceph filesystem for Linux author: Patience Warnick patie...@newdream.net author: Yehuda Sadeh yeh...@hq.newdream.net author: Sage Weil s...@newdream.net alias: fs-ceph depends:libceph intree: Y vermagic: 3.17.7+ SMP mod_unload signer: Magrathea: Glacier signing key sig_key:D4:BB:DE:E9:C6:D8:FC:90:9F:23:59:B2:19:1B:B8:FA:57:A1:AF:D2 sig_hashalgo: sha256 core@coreos2 /var/run/systemd/system $ modinfo libceph filename: /lib/modules/3.17.7+/kernel/net/ceph/libceph.ko license:GPL description:Ceph filesystem for Linux author: Patience Warnick patie...@newdream.net author: Yehuda Sadeh yeh...@hq.newdream.net author: Sage Weil s...@newdream.net depends:libcrc32c intree: Y vermagic: 3.17.7+ SMP mod_unload signer: Magrathea: Glacier signing key sig_key:D4:BB:DE:E9:C6:D8:FC:90:9F:23:59:B2:19:1B:B8:FA:57:A1:AF:D2 sig_hashalgo: sha256 ceph is installed on a ubuntu containers (same kernel): $ dpkg -l |grep ceph ii ceph 0.87-1trusty amd64distributed storage and file system ii ceph-common 0.87-1trusty amd64common utilities to mount and interact with a ceph storage cluster ii ceph-fs-common 0.87-1trusty amd64common utilities to mount and interact with a ceph file system ii ceph-fuse0.87-1trusty amd64FUSE-based client for the Ceph distributed file system ii ceph-mds 0.87-1trusty amd64metadata server for the ceph distributed file system ii libcephfs1 0.87-1trusty amd64Ceph distributed file system client library ii python-ceph 0.87-1trusty amd64Python libraries for the Ceph distributed filesystem Reproducing the error: at machine 1: core@coreos1 /var/lib/deis/store/logs $ test.log core@coreos1 /var/lib/deis/store/logs $ echo 1 test.log core@coreos1 /var/lib/deis/store/logs $ stat test.log File: 'test.log' Size: 2 Blocks: 1 IO Block: 4194304 regular file Device: 0h/0d Inode: 1099511629882 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 500/core) Gid: ( 500/core) Access: 2015-01-12 20:05:03.0 + Modify: 2015-01-12 20:06:09.637234229 + Change: 2015-01-12 20:06:09.637234229 + Birth: - at machine 2: core@coreos2 /var/lib/deis/store/logs $ stat test.log File: 'test.log' Size: 2 Blocks: 1 IO Block: 4194304 regular file Device: 0h/0d Inode: 1099511629882 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 500/core) Gid: ( 500/core) Access: 2015-01-12 20:05:03.0 + Modify: 2015-01-12 20:06:09.637234229 + Change: 2015-01-12 20:06:09.0 + Birth: - Change time is not updated making some tail libs to not show new content until you force the change time be updated, like running a touch in the file. Some tools freeze and trigger other issues in the system. 
Tests, all in the machine #2: FAILED - https://github.com/ActiveState/tail FAILED - /usr/bin/tail of a Google docker image running debian wheezy PASSED - /usr/bin/tail of a ubuntu 14.04 docker image PASSED - /usr/bin/tail of the coreos release 494.5.0 Tests in machine #1 (same machine that is writing the file) all tests pass. On Mon, Jan 12, 2015 at 5:14 PM, Gregory Farnum g...@gregs42.com wrote: What versions of all the Ceph pieces are you using? (Kernel client/ceph-fuse, MDS, etc) Can you provide more details on exactly what the program is doing on which nodes? -Greg On Fri, Jan 9, 2015 at 5:15 PM, Lorieri lori...@gmail.com wrote: first 3 stat commands shows blocks and size changing, but not the times after a touch it changes and tail works I saw some cephfs freezes related to it, it came back after touching the files coreos2 logs # stat deis
Re: [ceph-users] NUMA zone_reclaim_mode
On Mon, Jan 12, 2015 at 8:25 AM, Dan Van Der Ster daniel.vanders...@cern.ch wrote: On 12 Jan 2015, at 17:08, Sage Weil s...@newdream.net wrote: On Mon, 12 Jan 2015, Dan Van Der Ster wrote: Moving forward, I think it would be good for Ceph to a least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO: On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default. Sounds good to me. Do you mind submitting a patch that prints a warning from either FileStore::_detect_fs()? That will appear in the local ceph-osd.NNN.log. Alternatively, we should send something to the cluster log (osd-clog.warning() ...) but if we go that route we need to be careful that the logger it up and running first, which (I think) rules out FileStore::_detect_fs(). It could go in OSD itself although that seems less clean since the recommendation probably doesn't apply when using a backend that doesn't use a file system… Sure, I’ll try to prepare a patch which warns but isn’t too annoying. MongoDB already solved the heuristic: https://github.com/mongodb/mongo/blob/master/src/mongo/db/startup_warnings_mongod.cpp It’s licensed as AGPLv3 -- do you already know if we can borrow such code into Ceph? https://www.gnu.org/licenses/license-list.html#AGPL I've read that and the linked Affero Article 13 and I actually can't tell if Ceph is safe to integrate or not, but I'm thinking no since the servers are under LGPL. :/ Also I'm not sure if storage system users qualify as remote users but I don't think we're going to print an Affero string every time somebody runs a ceph tool. ;) -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
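Until such a warning exists, checking and fixing this by hand is straightforward; a sketch (paths assume a stock sysctl layout):

    # anything other than 0 here means the kernel may stall in zone reclaim
    cat /proc/sys/vm/zone_reclaim_mode

    # disable it immediately...
    sudo sysctl -w vm.zone_reclaim_mode=0

    # ...and persist the setting across reboots
    echo 'vm.zone_reclaim_mode = 0' | sudo tee /etc/sysctl.d/99-ceph-numa.conf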
Re: [ceph-users] [rbd] Ceph RBD kernel client using with cephx
Unmapping is an operation local to the host and doesn't communicate with the cluster at all (at least, in the kernel you're running...in very new code it might involve doing an unwatch, which will require communication). That means there's no need for a keyring, since its purpose is to validate communication with the cluster. -Greg On Mon, Feb 9, 2015 at 6:58 AM, Vikhyat Umrao vum...@redhat.com wrote: Hi, While using rbd kernel client with cephx , admin user without admin keyring was not able to map the rbd image to a block device and this should be the work flow. But issue is once I unmap rbd image without admin keyring it is allowing to unmap the image and as per my understanding it should not be the case , it should not all and give error as when it has given while mapping. Is it a normal behaviour or I am missing something , may be needed a fix (bug) ? [ceph@dell-per620-1 ceph]$ ls -l /etc/ceph/ total 16 -rw-r--r--. 1 root root 63 Feb 9 22:30 ceph.client.admin.keyring -rw-r--r--. 1 root root 71 Feb 9 22:23 ceph.client.dell-per620-1.keyring -rw-r--r--. 1 root root 467 Feb 9 22:22 ceph.conf -rwxr-xr-x. 1 root root 92 Oct 15 01:03 rbdmap [ceph@dell-per620-1 ceph]$ [ceph@dell-per620-1 ceph]$ sudo mv /etc/ceph/ceph.client.admin.keyring /tmp/. [ceph@dell-per620-1 ceph]$ ls -l /etc/ceph/ total 12 -rw-r--r--. 1 root root 71 Feb 9 22:23 ceph.client.dell-per620-1.keyring -rw-r--r--. 1 root root 467 Feb 9 22:22 ceph.conf -rwxr-xr-x. 1 root root 92 Oct 15 01:03 rbdmap [ceph@dell-per620-1 ceph]$ [ceph@dell-per620-1 ceph]$ sudo rbd map testcephx rbd: add failed: (22) Invalid argument [ceph@dell-per620-1 ceph]$ sudo dmesg [437447.308705] libceph: no secret set (for auth_x protocol) [437447.308761] libceph: error -22 on auth protocol 2 init [437447.308809] libceph: client4954 fsid d57d909f-8adf-46aa-8cc6-3168974df332 [ceph@dell-per620-1 ceph]$ sudo mv /tmp/ceph.client.admin.keyring /etc/ceph/ [ceph@dell-per620-1 ceph]$ ls -l /etc/ceph/ total 16 -rw-r--r--. 1 root root 63 Feb 9 22:30 ceph.client.admin.keyring -rw-r--r--. 1 root root 71 Feb 9 22:23 ceph.client.dell-per620-1.keyring -rw-r--r--. 1 root root 467 Feb 9 22:22 ceph.conf -rwxr-xr-x. 1 root root 92 Oct 15 01:03 rbdmap [ceph@dell-per620-1 ceph]$ sudo rbd map testcephx [ceph@dell-per620-1 ceph]$ sudo rbd showmapped id pool image snap device 0 rbd testcephx -/dev/rbd0 [ceph@dell-per620-1 ceph]$ sudo dmesg [437447.308705] libceph: no secret set (for auth_x protocol) [437447.308761] libceph: error -22 on auth protocol 2 init [437447.308809] libceph: client4954 fsid d57d909f-8adf-46aa-8cc6-3168974df332 [437496.444701] libceph: client4961 fsid d57d909f-8adf-46aa-8cc6-3168974df332 [437496.447833] libceph: mon1 10.65.200.118:6789 session established [437496.482913] rbd0: unknown partition table [437496.483037] rbd: rbd0: added with size 0x800 [ceph@dell-per620-1 ceph]$ [ceph@dell-per620-1 ceph]$ sudo mv /etc/ceph/ceph.client.admin.keyring /tmp/. [ceph@dell-per620-1 ceph]$ ls -l /etc/ceph/ total 12 -rw-r--r--. 1 root root 71 Feb 9 22:23 ceph.client.dell-per620-1.keyring -rw-r--r--. 1 root root 467 Feb 9 22:22 ceph.conf -rwxr-xr-x. 
1 root root 92 Oct 15 01:03 rbdmap [ceph@dell-per620-1 ceph]$ sudo rbd unmap /dev/rbd/rbd/testcephx --- If we see here it has allowed unmaping rbd image without keyring [ceph@dell-per620-1 ceph]$ sudo rbd showmapped --- no mapped image - Regards, Vikhyat ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
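A side note on the mapping step itself: the kernel client can be pointed at a non-admin cephx identity explicitly instead of relying on client.admin being present; a hedged example using the host keyring from the listing above (the identity still needs appropriate mon/osd caps):

    # map using the host's own identity rather than client.admin
    sudo rbd map testcephx --id dell-per620-1 \
        --keyring /etc/ceph/ceph.client.dell-per620-1.keyring

    # unmapping, as noted above, is purely local and needs no key at all
    sudo rbd unmap /dev/rbd0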
Re: [ceph-users] requests are blocked 32 sec woes
There are a lot of next steps on http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/ You probably want to look at the bits about using the admin socket, and diagnosing slow requests. :) -Greg On Sun, Feb 8, 2015 at 8:48 PM, Matthew Monaco m...@monaco.cx wrote: Hello! *** Shameless plug: Sage, I'm working with Dirk Grunwald on this cluster; I believe some of the members of your thesis committee were students of his =) We have a modest cluster at CU Boulder and are frequently plagued by requests are blocked issues. I'd greatly appreciate any insight or pointers. The issue is not specific to any one OSD; I'm pretty sure they've all showed up in ceph health detail at this point. We have 8 identical nodes: - 5 * 1TB Seagate enterprise SAS drives - btrfs - 1 * Intel 480G S3500 SSD - with 5*16G partitions as journals - also hosting the OS, unfortunately - 64G RAM - 2 * Xeon E5-2630 v2 - So 24 hyperthreads @ 2.60 GHz - 10G-ish IPoIB for networking So the cluster has 40TB over 40 OSDs total with a very straightforward crushmap. These nodes are also (unfortunately for the time being) OpenStack compute nodes and 99% of the usage is OpenStack volumes/images. I see a lot of kernel messages like: ib_mthca :02:00.0: Async event 16 for bogus QP 00dc0408 which may or may not be correlated w/ the Ceph hangs. Other info: we have 3 mons on 3 of the 8 nodes listed above. The openstack volumes pool has 4096 pgs and is sized 3. This is probably too many PGs, but came from an initial misunderstanding of the formula in the documentation. Thanks, Matt PS - I'm trying to secure funds to get an additional 8 nodes with a little less RAM and CPU to move the OSDs to, with dual 10G Ethernet, and a SATA DOM for the OS so the SSD will be strictly journal. I may even be able to get an additional SSD or two per-node to use for caching or simply to set a higher primary affinity ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
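For the "requests are blocked" case specifically, the commands that usually narrow things down are roughly these (osd.12 is a placeholder id; run the daemon commands on the node carrying that OSD):

    # which OSDs are currently implicated
    ceph health detail

    # what the stuck ops are waiting on, and the slowest recent ops
    ceph daemon osd.12 dump_ops_in_flight
    ceph daemon osd.12 dump_historic_ops

    # while ops are blocked, rule out a saturated journal SSD or a dying disk
    iostat -x 2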
Re: [ceph-users] Compilation problem
On Fri, Feb 6, 2015 at 3:37 PM, David J. Arias david.ar...@getecsa.co wrote: Hello! I am sysadmin for a small IT consulting enterprise in México. We are trying to integrate three servers running RHEL 5.9 into a new CEPH cluster. I downloaded the source code and tried compiling it, though I got stuck with the requirements for leveldb and libblkid. The versions installed by the OS are behind the ones recommended so I am wondering if it is possible to compile updated ones from source, install them in another location (/usr/local/{} )and use those for CEPH. Upgrading the OS is (although not impossible) difficult since these are production servers which hold critical applications, and some of those are legacy ones :-( I tried googling around but had no luck as to how to accomplish this, ./configure --help doesn't show anyway and tried --system-root without success. I am following the instructions from: https://wiki.ceph.com/FAQs/What_Kind_of_OS_Does_Ceph_Require%3F http://docs.ceph.com/docs/master/install/install-storage-cluster/#installing-a-build http://docs.ceph.com/docs/master/install/#get-software http://wiki.ceph.com/FAQs The only data I've found so far although related doesn't really apply to my case: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041683.html http://article.gmane.org/gmane.comp.file-systems.ceph.user/3010/match=redhat+5.9 Any help/ideas/pointers would be great. I think there's ongoing work to backport (portions of?) Ceph to RHEL5, but it definitely doesn't build out of the box. Even beyond the library dependencies you've noticed you'll find more issues with e.g. the boost and gcc versions. :/ -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
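For the narrower question of pointing the build at newer libraries installed under /usr/local, the autotools build honours the usual environment overrides, so something along these lines is worth trying before giving up (a sketch, untested on RHEL 5.9, and it won't address the gcc/boost problems mentioned above):

    # after building and installing newer leveldb/libblkid into /usr/local:
    ./autogen.sh
    ./configure CPPFLAGS='-I/usr/local/include' \
                LDFLAGS='-L/usr/local/lib -Wl,-rpath,/usr/local/lib'
    make -j4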
Re: [ceph-users] ceph Performance vs PG counts
On Sun, Feb 8, 2015 at 6:00 PM, Sumit Gaur sumitkg...@gmail.com wrote: Hi, I have installed a 6 node ceph cluster and am doing a performance benchmark for it using Nova VMs. What I have observed is that FIO random write reports around 250 MBps for 1M block size and 4096 PGs, and 650 MBps for 1M block size and 2048 PGs. Can somebody let me know if I am missing any ceph architecture point here? As per my understanding PG numbers are mainly involved in calculating the hash and should not affect performance so much. PGs are also serialization points within the codebase, so depending on how you're testing you can run into contention if you have multiple objects within a single PG that you're trying to write to at once. This isn't normally a problem, but for a single benchmark run the random collisions can become noticeable. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
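One way to check whether a benchmark is tripping over this is to see which PG each of the objects being written lands in; a hedged sketch for an RBD-backed test (pool, image and object prefix below are placeholders):

    # the objects backing a format-2 RBD image share a prefix
    rbd info volumes/bench-vol | grep block_name_prefix

    # map a handful of them and look for repeated pg ids in the output
    for i in $(seq 0 9); do
        ceph osd map volumes $(printf 'rbd_data.1234567890ab.%016x' $i)
    done

If the same pg id keeps showing up for objects being written concurrently, those writes are serializing on that PG.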
Re: [ceph-users] requests are blocked 32 sec woes
On Mon, Feb 9, 2015 at 7:12 PM, Matthew Monaco m...@monaco.cx wrote: On 02/09/2015 08:20 AM, Gregory Farnum wrote: There are a lot of next steps on http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/ You probably want to look at the bits about using the admin socket, and diagnosing slow requests. :) -Greg Yeah, I've been through most of that. It's still been difficult to pinpoint what's causing the blocking. Can I get some clarification on this comment: Ceph acknowledges writes after journaling, so fast SSDs are an attractive option to accelerate the response time–particularly when using the ext4 or XFS filesystems. By contrast, the btrfs filesystem can write and journal simultaneously. Does this mean btrfs doesn't need separate journal partition/block device? I.e., is what ceph-disk does when creating with --fs-type btrfs entirely non-optimal (creates a 5G journal partition and the rest a btrfs partition). I just don't get the by contrast. If the OSD is btrfs+rotational, then why doesn't putting the journal on an SSD help (as much?) if writes are returned after journaling? Yeah, that's not quite the best phrasing. btrfs' parallel journaling can be a big advantage in all-spinner cases where under the right kinds of load the filesystem actually has a chance of committing data to disk faster than the journal does. There aren't many situations where that's likely, though — it's more useful for direct librados users who might want to proceed once data is readable rather than when it's durable. That's not an option with xfs. -Greg On Sun, Feb 8, 2015 at 8:48 PM, Matthew Monaco m...@monaco.cx wrote: Hello! *** Shameless plug: Sage, I'm working with Dirk Grunwald on this cluster; I believe some of the members of your thesis committee were students of his =) We have a modest cluster at CU Boulder and are frequently plagued by requests are blocked issues. I'd greatly appreciate any insight or pointers. The issue is not specific to any one OSD; I'm pretty sure they've all showed up in ceph health detail at this point. We have 8 identical nodes: - 5 * 1TB Seagate enterprise SAS drives - btrfs - 1 * Intel 480G S3500 SSD - with 5*16G partitions as journals - also hosting the OS, unfortunately - 64G RAM - 2 * Xeon E5-2630 v2 - So 24 hyperthreads @ 2.60 GHz - 10G-ish IPoIB for networking So the cluster has 40TB over 40 OSDs total with a very straightforward crushmap. These nodes are also (unfortunately for the time being) OpenStack compute nodes and 99% of the usage is OpenStack volumes/images. I see a lot of kernel messages like: ib_mthca :02:00.0: Async event 16 for bogus QP 00dc0408 which may or may not be correlated w/ the Ceph hangs. Other info: we have 3 mons on 3 of the 8 nodes listed above. The openstack volumes pool has 4096 pgs and is sized 3. This is probably too many PGs, but came from an initial misunderstanding of the formula in the documentation. Thanks, Matt PS - I'm trying to secure funds to get an additional 8 nodes with a little less RAM and CPU to move the OSDs to, with dual 10G Ethernet, and a SATA DOM for the OS so the SSD will be strictly journal. I may even be able to get an additional SSD or two per-node to use for caching or simply to set a higher primary affinity ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CRUSHMAP for chassis balance
With sufficiently new CRUSH versions (all the latest point releases on LTS?) I think you can simply have the rule return extra IDs which are dropped if they exceed the number required. So you can choose two chassis, then have each of those choose two leaf OSDs, and return all four from the rule. -Greg On Fri, Feb 13, 2015 at 6:13 AM Luke Kao luke@mycom-osi.com wrote: Dear cepher, Currently I am working on the crushmap to try to make sure that at least one copy goes to a different chassis. Say chassis1 has host1,host2,host3, and chassis2 has host4,host5,host6. With replication=2 it's not a problem, I can use the following steps in the rule: step take chasses1 step chooseleaf firstn 1 type host step emit step take chasses2 step chooseleaf firstn 1 type host step emit But for replication=3, I tried step take chasses1 step chooseleaf firstn 1 type host step emit step take chasses2 step chooseleaf firstn 1 type host step emit step take default step chooseleaf firstn 1 type host step emit At the end, the 3rd osd returned in the rule test always duplicates the first or the second one. Any idea, or what's the direction to move forward? Thanks in advance BR, Luke MYCOM-OSI ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
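A sketch of the rule Greg describes, using the bucket names from Luke's map (with pool size 3 the fourth OSD returned is simply dropped):

    rule replicated_two_chassis {
        ruleset 1
        type replicated
        min_size 2
        max_size 4
        step take default
        step choose firstn 2 type chassis
        step chooseleaf firstn 2 type host
        step emit
    }

Each chosen chassis contributes two OSDs from different hosts, so a size-3 pool always ends up with two copies in one chassis and the third in the other.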
Re: [ceph-users] Random OSDs respawning continuously
It's not entirely clear, but it looks like all the ops are just your caching pool OSDs trying to promote objects, and your backing pool OSD's aren't fast enough to satisfy all the IO demanded of them. You may be overloading the system. -Greg On Fri, Feb 13, 2015 at 6:06 AM Mohamed Pakkeer mdfakk...@gmail.com wrote: Hi all, When i stop the respawning osd on an OSD node, another osd is respawning on the same node. when the OSD is started to respawing, it puts the following info in the osd log. slow request 31.129671 seconds old, received at 2015-02-13 19:09:32.180496: osd_op(*osd.551*.95229:11 191 10005c4.0033 [copy-get max 8388608] 13.f4ccd256 RETRY=50 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e95518) currently reached_pg OSD.551 is part of cache tier. All the respawning osds have the log with different cache tier OSDs. If i restart all the osds in the cache tier osd node, respawning is stopped and cluster become active + clean state. But when i try to write some data on the cluster, random osd starts the respawning. can anyone help me how to solve this issue? 2015-02-13 19:10:02.309848 7f53eef54700 0 log_channel(default) log [WRN] : 11 slow requests, 11 included below; oldest blocked for 30.132629 secs 2015-02-13 19:10:02.309854 7f53eef54700 0 log_channel(default) log [WRN] : slow request 30.132629 seconds old, received at 2015-02-13 19:09:32.177075: osd_op(osd.551.95229:63 10002ae. [copy-from ver 7622] 13.7273b256 RETRY=130 snapc 1=[] ondisk+retry+write+ignore_overlay+enforce_snapc+known_if_redirected e95518) currently reached_pg 2015-02-13 19:10:02.309858 7f53eef54700 0 log_channel(default) log [WRN] : slow request 30.131608 seconds old, received at 2015-02-13 19:09:32.178096: osd_op(osd.551.95229:41 5 10003a0.0006 [copy-get max 8388608] 13.aefb256 RETRY=118 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e95518) currently reached_pg 2015-02-13 19:10:02.309861 7f53eef54700 0 log_channel(default) log [WRN] : slow request 30.130994 seconds old, received at 2015-02-13 19:09:32.178710: osd_op(osd.551.95229:26 83 100029d.003b [copy-get max 8388608] 13.a2be1256 RETRY=115 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e95518) currently reached_pg 2015-02-13 19:10:02.309864 7f53eef54700 0 log_channel(default) log [WRN] : slow request 30.130426 seconds old, received at 2015-02-13 19:09:32.179278: osd_op(osd.551.95229:39 39 10004e9.0032 [copy-get max 8388608] 13.6a25b256 RETRY=105 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e95518) currently reached_pg 2015-02-13 19:10:02.309868 7f53eef54700 0 log_channel(default) log [WRN] : slow request 30.129697 seconds old, received at 2015-02-13 19:09:32.180007: osd_op(osd.551.95229:97 49 1000553.007e [copy-get max 8388608] 13.c8645256 RETRY=59 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e95518) currently reached_pg 2015-02-13 19:10:03.310284 7f53eef54700 0 log_channel(default) log [WRN] : 11 slow requests, 6 included below; oldest blocked for 31.133092 secs 2015-02-13 19:10:03.310305 7f53eef54700 0 log_channel(default) log [WRN] : slow request 31.129671 seconds old, received at 2015-02-13 19:09:32.180496: osd_op(osd.551.95229:11 191 10005c4.0033 [copy-get max 8388608] 13.f4ccd256 RETRY=50 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e95518) currently reached_pg 2015-02-13 19:10:03.310308 7f53eef54700 0 log_channel(default) log [WRN] : slow request 31.128616 
seconds old, received at 2015-02-13 19:09:32.181551: osd_op(osd.551.95229:12 903 10002e4.00d6 [copy-get max 8388608] 13.f56a3256 RETRY=41 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e95518) currently reached_pg 2015-02-13 19:10:03.310322 7f53eef54700 0 log_channel(default) log [WRN] : slow request 31.127807 seconds old, received at 2015-02-13 19:09:32.182360: osd_op(osd.551.95229:14 165 1000480.0110 [copy-get max 8388608] 13.fd8c1256 RETRY=32 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e95518) currently reached_pg 2015-02-13 19:10:03.310327 7f53eef54700 0 log_channel(default) log [WRN] : slow request 31.127320 seconds old, received at 2015-02-13 19:09:32.182847: osd_op(osd.551.95229:15 013 100047f.0133 [copy-get max 8388608] 13.b7b05256 RETRY=27 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e95518) currently reached_pg 2015-02-13 19:10:03.310331 7f53eef54700 0 log_channel(default) log [WRN] : slow request 31.126935 seconds old, received at 2015-02-13 19:09:32.183232: osd_op(osd.551.95229:15 767 100066d.001e [copy-get max 8388608] 13.3b017256 RETRY=25 ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
Re: [ceph-users] kernel crash after 'ceph: mds0 caps stale' and 'mds0 hung' -- issue with timestamps or HVM virtualization on EC2?
On Mon, Feb 9, 2015 at 11:58 AM, Christopher Armstrong ch...@opdemand.com wrote: Hi folks, One of our users is seeing machine crashes almost daily. He's using Ceph v0.87 giant, and is seeing this crash: https://gist.githubusercontent.com/ianblenke/b74e5aa5547130ebc0fb/raw/c3eeab076310d149443fd6118113b9d94f176303/gistfile1.txt It seems easy to trigger this by rsyncing to the CephFS mount. We're using the kernel client here, so I'm wondering if it's related to this timestamp bug: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-January/045838.html These are definitely not related. Does anyone have any insight into the crash? Some confirmation that it's related to system clocks/timestamps would be helpful. Another note is that we're using HVM virtualization on EC2. Not sure if people have run into this before or not. Zheng might have some idea about these, but I'm guessing there's a code issue and some deadlock with file capabilities. If you can look at the MDS' admin socket and dump the ops in flight and the session info that might be helpful too. (ceph daemon mds.a dump_ops_in_flight, etc) -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS removal.
What version of Ceph are you running? It's varied by a bit. But I think you want to just turn off the MDS and run the fail command — deactivate is actually the command for removing a logical MDS from the cluster, and you can't do that for a lone MDS because there's nobody to pass off the data to. I'll make a ticket to clarify this. When you've done that you should be able to delete it. -Greg On Mon, Feb 2, 2015 at 1:40 AM, warren.je...@stfc.ac.uk wrote: Hi All, Having a few problems removing cephfs file systems. I want to remove my current pools (was used for test data) – wiping all current data, and start a fresh file system on my current cluster. I have looked over the documentation but I can’t find anything on this. I have an object store pool, Which I don’t want to remove – but I’d like to remove the cephfs file system pools and remake them. My cephfs is called ‘data’. Running ceph fs delete data returns: Error EINVAL: all MDS daemons must be inactive before removing filesystem To make an MDS inactive I believe the command is: ceph mds deactivate 0 Which returns: telling mds.0 135.248.53.134:6809/16692 to deactivate Checking the status of the mds using: ceph mds stat returns: e105: 1/1/0 up {0=node2=up:stopping} This has been sitting at this status for the whole weekend with no change. I don’t have any clients connected currently. When trying to manually just remove the pools, it’s not allowed as there is a cephfs file system on them. I’m happy that all of the failsafe’s to stop someone removing a pool are all working correctly. If this is currently undoable. Is there a way to quickly wipe a cephfs filesystem – using RM from a kernel client is really slow. Many thanks Warren Jeffs ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
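On giant the sequence described above looks roughly like this; a sketch only (pool names follow Warren's setup, the pg counts are placeholders, and exact flags vary a little between releases):

    # stop the ceph-mds daemon on node2, then:
    ceph mds fail 0
    ceph fs rm data --yes-i-really-mean-it

    # the pools can now be deleted and recreated
    ceph osd pool delete data data --yes-i-really-really-mean-it
    ceph osd pool delete metadata metadata --yes-i-really-really-mean-it
    ceph osd pool create data 512
    ceph osd pool create metadata 512

    # create the new filesystem (arguments are: name, metadata pool, data pool)
    ceph fs new data metadata data

    # restart the MDS and it should come up active on the empty filesystem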
Re: [ceph-users] CephFS removal.
Oh, hah, your initial email had a very delayed message delivery...probably got stuck in the moderation queue. :) On Thu, Feb 12, 2015 at 8:26 AM, warren.je...@stfc.ac.uk wrote: I am running 0.87, In the end I just wiped the cluster and started again - it was quicker. Warren -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: 12 February 2015 16:25 To: Jeffs, Warren (STFC,RAL,ISIS) Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] CephFS removal. What version of Ceph are you running? It's varied by a bit. But I think you want to just turn off the MDS and run the fail command — deactivate is actually the command for removing a logical MDS from the cluster, and you can't do that for a lone MDS because there's nobody to pass off the data to. I'll make a ticket to clarify this. When you've done that you should be able to delete it. -Greg On Mon, Feb 2, 2015 at 1:40 AM, warren.je...@stfc.ac.uk wrote: Hi All, Having a few problems removing cephfs file systems. I want to remove my current pools (was used for test data) – wiping all current data, and start a fresh file system on my current cluster. I have looked over the documentation but I can’t find anything on this. I have an object store pool, Which I don’t want to remove – but I’d like to remove the cephfs file system pools and remake them. My cephfs is called ‘data’. Running ceph fs delete data returns: Error EINVAL: all MDS daemons must be inactive before removing filesystem To make an MDS inactive I believe the command is: ceph mds deactivate 0 Which returns: telling mds.0 135.248.53.134:6809/16692 to deactivate Checking the status of the mds using: ceph mds stat returns: e105: 1/1/0 up {0=node2=up:stopping} This has been sitting at this status for the whole weekend with no change. I don’t have any clients connected currently. When trying to manually just remove the pools, it’s not allowed as there is a cephfs file system on them. I’m happy that all of the failsafe’s to stop someone removing a pool are all working correctly. If this is currently undoable. Is there a way to quickly wipe a cephfs filesystem – using RM from a kernel client is really slow. Many thanks Warren Jeffs ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSDs with btrfs are down
I'm afraid I don't know what would happen if you change those options. Hopefully we've set it up so things continue to work, but we definitely don't test it. -Greg On Tue, Jan 6, 2015 at 8:22 AM Lionel Bouton lionel+c...@bouton.name wrote: On 01/06/15 02:36, Gregory Farnum wrote: [...] filestore btrfs snap controls whether to use btrfs snapshots to keep the journal and backing store in check. WIth that option disabled it handles things in basically the same way we do with xfs. filestore btrfs clone range I believe controls how we do RADOS object clones. With this option enabled we use the btrfs clone range ioctl (? I think that's the interface); without it we do our own copies, again basically the same as we do with xfs. Thanks for these informations I think I have a clearer picture now, the next time I have the opportunity, I'll test BTRFS based OSD using manual defragmentation (which I suspect might help performance) and if I still get stability or performance problems I'll try disabling BTRFS specific features. My impression is that the core of BTRFS is stable and performant enough for Ceph and that lzo compression and checksums are reasons enough to use it instead of XFS but to get stable and performant OSDs some features might have to be disabled. Hopefully we will expand our storage network in the near future and I'll have the opportunity to test my theories with very limited impact on stability and performance. Quick follow-up question: can the options filestore btrfs snap, filestore btrfs clone range and filestore journal parallel be modified on an existing/used OSD? I don't see why not for the last 2 as COW being used or not doesn't change other filesystem semantics, but for snapshots I'm not sure: at startup the available snapshots could have to match a precise OSD filestore state which they wouldn't do after (for example) disabling them and enabling them again. Best regards, Lionel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
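For reference, these are the options under discussion as they would appear in ceph.conf; whether flipping them on an already-populated btrfs OSD is safe is exactly the open question above, so treat this purely as illustration:

    [osd]
        # use write-ahead journaling instead of btrfs snapshots for consistency points
        filestore btrfs snap = false
        # do plain copies for RADOS object clones instead of the btrfs clone-range ioctl
        filestore btrfs clone range = false
        # parallel journaling is the btrfs-specific mode; false falls back to write-ahead
        filestore journal parallel = false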
Re: [ceph-users] Privileges for read-only CephFS access?
On Wed, Feb 18, 2015 at 3:30 PM, Florian Haas flor...@hastexo.com wrote: On Wed, Feb 18, 2015 at 11:41 PM, Gregory Farnum g...@gregs42.com wrote: On Wed, Feb 18, 2015 at 1:58 PM, Florian Haas flor...@hastexo.com wrote: On Wed, Feb 18, 2015 at 10:28 PM, Oliver Schulz osch...@mpp.mpg.de wrote: Dear Ceph Experts, is it possible to define a Ceph user/key with privileges that allow for read-only CephFS access but do not allow write or other modifications to the Ceph cluster? Warning, read this to the end, don't blindly do as I say. :) All you should need to do is define a CephX identity that has only r capabilities on the data pool (assuming you're using a default configuration where your CephFS uses the data and metadata pools): sudo ceph auth get-or-create client.readonly mds 'allow' osd 'allow r pool=data' mon 'allow r' That identity should then be able to mount the filesystem but not write any data (use ceph-fuse -n client.readonly or mount -t ceph -o name=readonly) That said, just touching files or creating them is only a metadata operation that doesn't change anything in the data pool, so I think that might still be allowed under these circumstances. ...and deletes, unfortunately. :( If the file being deleted is empty, yes. If the file has any content, then the removal should hit the data pool before it hits metadata, and should fail there. No? No, all data deletion is handled by the MDS, for two reasons: 1) You don't want clients to have to block on deletes in time linear with the number of objects 2) (IMPORTANT) if clients unlink a file which is still opened elsewhere, it can't be deleted until closed. ;) I don't think this is presently a thing it's possible to do until we get a much better user auth capabilities system into CephFS. However, I've just tried the above with ceph-fuse on firefly, and I was able to mount the filesystem that way and then echo something into a previously existing file. After unmounting, remounting, and trying to cat that file, I/O just hangs. It eventually does complete, but this looks really fishy. This is happening because the CephFS clients don't (can't, really, for all the time we've spent thinking about it) check whether they have read permissions on the underlying pool when buffering writes for a file. I believe if you ran an fsync on the file you'd get an EROFS or similar. Anyway, the client happily buffers up the writes. Depending on how exactly you remount then it might not be able to drop the MDS caps for file access (due to having dirty data it can't get rid of), and those caps have to time out before anybody else can access the file again. So you've found an unpleasant oddity of how the POSIX interfaces map onto this kind of distributed system, but nothing unexpected. :) Oliver's point is valid though; I would be nice if you could somehow make CephFS read-only to some (or all) clients server side, the way an NFS ro export does. Yeah. Yet another thing that would be good but requires real permission bits on the MDS. It'll happen eventually, but we have other bits that seem a lot more important...fsck, stability, single-tenant usability ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?
On Wed, Mar 18, 2015 at 3:28 AM, Chris Murray chrismurra...@gmail.com wrote: Hi again Greg :-) No, it doesn't seem to progress past that point. I started the OSD again a couple of nights ago: 2015-03-16 21:34:46.221307 7fe4a8aa7780 10 journal op_apply_finish 13288339 open_ops 1 - 0, max_applied_seq 13288338 - 13288339 2015-03-16 21:34:46.221445 7fe4a8aa7780 3 journal journal_replay: r = 0, op_seq now 13288339 2015-03-16 21:34:46.221513 7fe4a8aa7780 2 journal read_entry 3951706112 : seq 13288340 1755 bytes 2015-03-16 21:34:46.221547 7fe4a8aa7780 3 journal journal_replay: applying op seq 13288340 2015-03-16 21:34:46.221579 7fe4a8aa7780 10 journal op_apply_start 13288340 open_ops 0 - 1 2015-03-16 21:34:46.221610 7fe4a8aa7780 10 filestore(/var/lib/ceph/osd/ceph-1) _do_transaction on 0x3142480 2015-03-16 21:34:46.221651 7fe4a8aa7780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys meta/16ef7597/infos/head//-1 2015-03-16 21:34:46.222017 7fe4a8aa7780 10 filestore oid: 16ef7597/infos/head//-1 not skipping op, *spos 13288340.0.1 2015-03-16 21:34:46.222053 7fe4a8aa7780 10 filestore header.spos 0.0.0 2015-03-16 21:34:48.096002 7fe49a5ac700 20 filestore(/var/lib/ceph/osd/ceph-1) sync_entry woke after 5.000178 2015-03-16 21:34:48.096037 7fe49a5ac700 10 journal commit_start max_applied_seq 13288339, open_ops 1 2015-03-16 21:34:48.096040 7fe49a5ac700 10 journal commit_start waiting for 1 open ops to drain There's the success line for 13288339, like you mentioned. But not one for 13288340. Intriguing. So, those same 1755 bytes seem problematic every time the journal is replayed? Interestingly, there is a lot (in time, not exactly data mass or IOPs, but still more than 1755 bytes!) of activity while the log is at this line: 2015-03-16 21:34:48.096040 7fe49a5ac700 10 journal commit_start waiting for 1 open ops to drain ... but then the IO ceases and the log still doesn't go any further. I wonder why 13288339 doesn't have that same 'waiting for ... open ops to drain' line. Or the 'woke after' one for that matter. While there is activity on sdb, it 'pulses' every 10 seconds or so, like this: sdb 20.00 0.00 3404.00 0 3404 sdb 16.00 0.00 2100.00 0 2100 sdb 10.00 0.00 1148.00 0 1148 sdb 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 sdb 1.00 0.00 496.00 0496 sdb 32.00 0.00 4940.00 0 4940 sdb 8.00 0.00 1144.00 0 1144 sdb 1.00 0.00 4.00 0 4 sdb 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 sdb 17.00 0.00 3340.00 0 3340 sdb 23.00 0.00 3368.00 0 3368 sdb 1.00 0.00 4.00 0 4 sdb 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 sdb 0.00 0.00 0.00 0 0 sdb 13.00 0.00 3332.00 0 3332 sdb 18.00 0.00 2360.00 0 2360 sdb 59.00 0.00 7464.00 0 7464 sdb 0.00 0.00 0.00 0 0 I was hoping Google may have held some clues, but it seems I'm the only one :-) https://www.google.co.uk/?gws_rd=ssl#q=%22journal+commit_start+waiting+for%22+%22open+ops+to+drain%22 I tried removing compress-force=lzo from osd mount options btrfs in ceph.conf, in case it was the
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
On Wed, Mar 18, 2015 at 8:04 AM, Nick Fisk n...@fisk.me.uk wrote: Hi Greg, Thanks for your input and completely agree that we cannot expect developers to fully document what impact each setting has on a cluster, particularly in a performance related way That said, if you or others could spare some time for a few pointers it would be much appreciated and I will endeavour to create some useful results/documents that are more relevant to end users. I have taken on board what you said about the WB throttle and have been experimenting with it by switching it on and off. I know it's a bit of a blunt configuration change, but it was useful to understand its effect. With it off, I do see initially quite a large performance increase but overtime it actually starts to slow the average throughput down. Like you said, I am guessing this is to do with it making sure the journal doesn't get to far ahead, leaving it with massive sync's to carry out. One thing I do see with the WBT enabled and to some extent with it disabled, is that there are large periods of small block writes at the max speed of the underlying sata disk (70-80iops). Here are 2 blktrace seekwatcher traces of performing an OSD bench (64kb io's for 500MB) where this behaviour can be seen. If you're doing 64k IOs then I believe it's creating a new on-disk file for each of those writes. How that's laid out on-disk will depend on your filesystem and the specific config options that we're using to try to avoid running too far ahead of the journal. I think you're just using these config options in conflict with eachother. You've set the min sync time to 20 seconds for some reason, presumably to try and batch stuff up? So in that case you probably want to let your journal run for twenty seconds worth of backing disk IO before you start throttling it, and probably 10-20 seconds worth of IO before forcing file flushes. That means increasing the throttle limits while still leaving the flusher enabled. -Greg http://www.sys-pro.co.uk/misc/wbt_on.png http://www.sys-pro.co.uk/misc/wbt_off.png I would really appreciate if someone could comment on why this type of behaviour happens? As can be seen in the trace, if the blocks are submitted to the disk as larger IO's and with higher concurrency, hundreds of Mb of data can be flushed in seconds. Is this something specific to the filesystem behaviour which Ceph cannot influence, like dirty filesystem metadata/inodes which can't be merged into larger IO's? For sequential writes, I would have thought that in an optimum scenario, a spinning disk should be able to almost maintain its large block write speed (100MB/s) no matter the underlying block size. That being said, from what I understand when a sync is called it will try and flush all dirty data so the end result is probably slightly different to a traditional battery backed write back cache. Chris, would you be interested in forming a ceph-users based performance team? There's a developer performance meeting which is mainly concerned with improving the internals of Ceph. There is also a raft of information on the mailing list archives where people have said hey look at my SSD speed at x,y,z settings, but making comparisons or recommendations is not that easy. It may also reduce a lot of the repetitive posts of why is X so slowetc ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
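A hedged illustration of "raise the throttle limits but keep the flusher enabled" as ceph.conf settings (option names assume the firefly/giant wbthrottle code, and the numbers are placeholders that should be sized to roughly 10-20 seconds of the backing disk's throughput):

    [osd]
        filestore wbthrottle enable = true
        # let more dirty data accumulate before the background flusher kicks in
        filestore wbthrottle xfs bytes start flusher = 419430400
        filestore wbthrottle xfs ios start flusher = 5000
        filestore wbthrottle xfs inodes start flusher = 5000
        # and move the hard limits (where new writes block) further out as well
        filestore wbthrottle xfs bytes hard limit = 838860800
        filestore wbthrottle xfs ios hard limit = 10000
        filestore wbthrottle xfs inodes hard limit = 10000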
Re: [ceph-users] osd laggy algorithm
On Wed, Mar 11, 2015 at 8:40 AM, Artem Savinov asavi...@asdco.ru wrote: hello. By default ceph marks an osd node down after receiving 3 reports about the failed node. Reports are sent every osd heartbeat grace seconds, but with the settings mon_osd_adjust_heartbeat_grace = true and mon_osd_adjust_down_out_interval = true the timeout for marking nodes down may vary. Tell me please: what algorithm changes the timeout for nodes going into the down/out status, and which parameters affect it? thanks. The monitors keep track of which detected failures are incorrect (based on reports from the marked-down/out OSDs) and build up an expectation about how often the failures are correct based on an exponential backoff of the data points. You can look at the code in OSDMonitor.cc if you're interested, but basically they apply that expectation to modify the down interval and the down-out interval to a value large enough that they believe the OSD is really down (assuming these config options are set). It's not terribly interesting. :) -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
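For reference, these are the knobs involved, as a ceph.conf sketch; the values shown are what I believe the firefly-era defaults to be, so double-check them against your version before relying on them:

[mon]
mon osd adjust heartbeat grace = true      # scale the failure-report grace by the laggy estimate
mon osd adjust down out interval = true    # scale the down -> out timer the same way
mon osd laggy halflife = 3600              # decay window for the laggy estimate, in seconds
mon osd laggy weight = 0.3                 # weight given to each new laggy data point
mon osd down out interval = 300            # base down -> out timer
mon osd min down reports = 3               # the "3 reports" mentioned above

[osd]
osd heartbeat grace = 20                   # base grace before a peer reports an OSD down

Setting the two adjust options to false pins the timeouts to their base values if the adaptive behaviour is unwanted.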
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote: I’m not sure if it’s something I’m doing wrong or just experiencing an oddity, but when my cache tier flushes dirty blocks out to the base tier, the writes seem to hit the OSD’s straight away instead of coalescing in the journals, is this correct? For example if I create a RBD on a standard 3 way replica pool and run fio via librbd 128k writes, I see the journals take all the io’s until I hit my filestore_min_sync_interval and then I see it start writing to the underlying disks. Doing the same on a full cache tier (to force flushing) I immediately see the base disks at a very high utilisation. The journals also have some write IO at the same time. The only other odd thing I can see via iostat is that most of the time whilst I’m running Fio, is that I can see the underlying disks doing very small write IO’s of around 16kb with an occasional big burst of activity. I know erasure coding+cache tier is slower than just plain replicated pools, but even with various high queue depths I’m struggling to get much above 100-150 iops compared to a 3 way replica pool which can easily achieve 1000-1500. The base tier is comprised of 40 disks. It seems quite a marked difference and I’m wondering if this strange journal behaviour is the cause. Does anyone have any ideas? If you're running a full cache pool, then on every operation touching an object which isn't in the cache pool it will try and evict an object. That's probably what you're seeing. Cache pool in general are only a wise idea if you have a very skewed distribution of data hotness and the entire hot zone can fit in cache at once. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
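One way to keep the cache from sitting at completely full, so that not every client op has to wait for an eviction, is to give the tiering agent headroom (a sketch; the pool name and byte count are placeholders):

ceph osd pool set cache-pool target_max_bytes 200000000000   # absolute cap the agent works against
ceph osd pool set cache-pool cache_target_dirty_ratio 0.4    # start flushing dirty objects at 40% of the cap
ceph osd pool set cache-pool cache_target_full_ratio 0.8     # start evicting clean objects at 80% of the cap

With those set, flushing and eviction happen in the background before the pool hits its limit instead of on the critical path of each promotion.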
Re: [ceph-users] PGs stuck unclean active+remapped after an osd marked out
On Wed, Mar 11, 2015 at 3:49 PM, Francois Lafont flafdiv...@free.fr wrote: Hi, I was always in the same situation: I couldn't remove an OSD without have some PGs definitely stuck to the active+remapped state. But I remembered I read on IRC that, before to mark out an OSD, it could be sometimes a good idea to reweight it to 0. So, instead of doing [1]: ceph osd out 3 I have tried [2]: ceph osd crush reweight osd.3 0 # waiting for the rebalancing... ceph osd out 3 and it worked. Then I could remove my osd with the online documentation: http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual Now, the osd is removed and my cluster is HEALTH_OK. \o/ Now, my question is: why my cluster was definitely stuck to active+remapped with [1] but was not with [2]? Personally, I have absolutely no explanation. If you have an explanation, I'd love to know it. If I remember/guess correctly, if you mark an OSD out it won't necessarily change the weight of the bucket above it (ie, the host), whereas if you change the weight of the OSD then the host bucket's weight changes. That makes for different mappings, and since you only have a couple of OSDs per host (normally: hurray!) and not many hosts (normally: sadness) then marking one OSD out makes things harder for the CRUSH algorithm. -Greg Should the reweight command be present in the online documentation? http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual If yes, I can make a pull request on the doc with pleasure. ;) Regards. -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
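A quick way to see the difference described above (a sketch) is to watch the host bucket's weight in the CRUSH tree while you do each step:

ceph osd tree                     # note the weight on the host line above osd.3
ceph osd out 3                    # only osd.3's reweight column drops; the host weight is unchanged
ceph osd crush reweight osd.3 0   # now the host bucket's weight itself shrinks, so CRUSH maps differently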
Re: [ceph-users] CephFS unexplained writes
The information you're giving sounds a little contradictory, but my guess is that you're seeing the impacts of object promotion and flushing. You can sample the operations the OSDs are doing at any given time by running ops_in_progress (or similar, I forget exact phrasing) command on the OSD admin socket. I'm not sure if rados df is going to report cache movement activity or not. That though would mostly be written to the SSDs, not the hard drives — although the hard drives could still get metadata updates written when objects are flushed. What data exactly are you seeing that's leading you to believe writes are happening against these drives? What is the exact CephFS and cache pool configuration? -Greg On Mon, Mar 16, 2015 at 2:36 PM, Erik Logtenberg e...@logtenberg.eu wrote: Hi, I forgot to mention: while I am seeing these writes in iotop and /proc/diskstats for the hdd's, I am -not- seeing any writes in rados df for the pool residing on these disks. There is only one pool active on the hdd's and according to rados df it is getting zero writes when I'm just reading big files from cephfs. So apparently the osd's are doing some non-trivial amount of writing on their own behalf. What could it be? Thanks, Erik. On 03/16/2015 10:26 PM, Erik Logtenberg wrote: Hi, I am getting relatively bad performance from cephfs. I use a replicated cache pool on ssd in front of an erasure coded pool on rotating media. When reading big files (streaming video), I see a lot of disk i/o, especially writes. I have no clue what could cause these writes. The writes are going to the hdd's and they stop when I stop reading. I mounted everything with noatime and nodiratime so it shouldn't be that. On a related note, the Cephfs metadata is stored on ssd too, so metadata-related changes shouldn't hit the hdd's anyway I think. Any thoughts? How can I get more information about what ceph is doing? Using iotop I only see that the osd processes are busy but it doesn't give many hints as to what they are doing. Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
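The admin socket command in question is dump_ops_in_flight (a sketch, assuming default socket paths and osd.12 as an example id); dump_historic_ops keeps a short history of recently completed slow ops, which is handy if the activity comes in bursts:

ceph daemon osd.12 dump_ops_in_flight
ceph daemon osd.12 dump_historic_ops
# equivalent form if the shorthand isn't available on your version:
ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_ops_in_flight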
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
Nothing here particularly surprises me. I don't remember all the details of the filestore's rate limiting off the top of my head, but it goes to great lengths to try and avoid letting the journal get too far ahead of the backing store. Disabling the filestore flusher and increasing the sync intervals without also increasing the filestore_wbthrottle_* limits is not going to work well for you. -Greg On Mon, Mar 16, 2015 at 3:58 PM, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum Sent: 16 March 2015 17:33 To: Nick Fisk Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync? On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote: I’m not sure if it’s something I’m doing wrong or just experiencing an oddity, but when my cache tier flushes dirty blocks out to the base tier, the writes seem to hit the OSD’s straight away instead of coalescing in the journals, is this correct? For example if I create a RBD on a standard 3 way replica pool and run fio via librbd 128k writes, I see the journals take all the io’s until I hit my filestore_min_sync_interval and then I see it start writing to the underlying disks. Doing the same on a full cache tier (to force flushing) I immediately see the base disks at a very high utilisation. The journals also have some write IO at the same time. The only other odd thing I can see via iostat is that most of the time whilst I’m running Fio, is that I can see the underlying disks doing very small write IO’s of around 16kb with an occasional big burst of activity. I know erasure coding+cache tier is slower than just plain replicated pools, but even with various high queue depths I’m struggling to get much above 100-150 iops compared to a 3 way replica pool which can easily achieve 1000- 1500. The base tier is comprised of 40 disks. It seems quite a marked difference and I’m wondering if this strange journal behaviour is the cause. Does anyone have any ideas? If you're running a full cache pool, then on every operation touching an object which isn't in the cache pool it will try and evict an object. That's probably what you're seeing. Cache pool in general are only a wise idea if you have a very skewed distribution of data hotness and the entire hot zone can fit in cache at once. -Greg Hi Greg, It's not the caching behaviour that I confused about, it’s the journal behaviour on the base disks during flushing. I've been doing some more tests and can do something reproducible which seems strange to me. First off 10MB of 4kb writes: time ceph tell osd.1 bench 1000 4096 { bytes_written: 1000, blocksize: 4096, bytes_per_sec: 16009426.00} real0m0.760s user0m0.063s sys 0m0.022s Now split this into 2x5mb writes: time ceph tell osd.1 bench 500 4096 time ceph tell osd.1 bench 500 4096 { bytes_written: 500, blocksize: 4096, bytes_per_sec: 10580846.00} real0m0.595s user0m0.065s sys 0m0.018s { bytes_written: 500, blocksize: 4096, bytes_per_sec: 9944252.00} real0m4.412s user0m0.053s sys 0m0.071s 2nd bench takes a lot longer even though both should easily fit in the 5GB journal. Looking at iostat, I think I can see that no writes happen to the journal whilst the writes from the 1st bench are being flushed. Is this the expected behaviour? I would have thought as long as there is space available in the journal it shouldn't block on new writes. 
Also I see in iostat writes to the underlying disk happening at a QD of 1 and 16kb IO's for a number of seconds, with a large blip or activity just before the flush finishes. Is this the correct behaviour? I would have thought if this tell osd bench is doing sequential IO then the journal should be able to flush 5-10mb of data in a fraction a second. Ceph.conf [osd] filestore max sync interval = 30 filestore min sync interval = 20 filestore flusher = false osd_journal_size = 5120 osd_crush_location_hook = /usr/local/bin/crush-location osd_op_threads = 5 filestore_op_threads = 4 iostat during period where writes seem to be blocked (journal=sda disk=sdd) Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 0.000.000.00 0.00 0.00 0.00 0.000.000.000.00 0.00 0.00 sdb 0.00 0.000.002.00 0.00 4.00 4.00 0.000.000.000.00 0.00 0.00 sdc 0.00 0.000.000.00 0.00 0.00 0.00 0.000.000.000.00 0.00 0.00 sdd 0.00 0.000.00 76.00 0.00 760.0020.00 0.99 13.110.00 13.11 13.05 99.20 iostat during
Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?
On Fri, Mar 20, 2015 at 4:03 PM, Chris Murray chrismurra...@gmail.com wrote: Ah, I was wondering myself if compression could be causing an issue, but I'm reconsidering now. My latest experiment should hopefully help troubleshoot. So, I remembered that ZLIB is slower, but is more 'safe for old kernels'. I try that: find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec btrfs filesystem defragment -v -czlib -- {} + After much, much waiting, all files have been rewritten, but the OSD still gets stuck at the same point. I've now unset the compress attribute on all files and started the defragment process again, but I'm not too hopeful since the files must be readable/writeable if I didn't get some failure during the defrag process. find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec chattr -c -- {} + find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec btrfs filesystem defragment -v -- {} + (latter command still running) Any other ideas at all? In the absence of the problem being spelled out to me with an error of some sort, I'm not sure how to troubleshoot further. Not much, sorry. Is it safe to upgrade a problematic cluster, when the time comes, in case this ultimately is a CEPH bug which is fixed in something later than 0.80.9? In general it should be fine since we're careful about backwards compatibility, but without knowing the actual issue I can't promise anything. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Readonly cache tiering and rbd.
On Thu, Mar 19, 2015 at 4:46 AM, Matthijs Möhlmann matth...@cacholong.nl wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi, - From the documentation: Cache Tier readonly: Read-only Mode: When admins configure tiers with readonly mode, Ceph clients write data to the backing tier. On read, Ceph copies the requested object(s) from the backing tier to the cache tier. Stale objects get removed from the cache tier based on the defined policy. This approach is ideal for immutable data (e.g., presenting pictures/videos on a social network, DNA data, X-Ray imaging, etc.), because reading data from a cache pool that might contain out-of-date data provides weak consistency. Do not use readonly mode for mutable data. Does this mean that when a client (xen / kvm with a RBD volume) writes some data that the OSD does not mark the readonly cache dirty? Yes, exactly. Reads are directed to the cache but writes go directly to the base tier, and there's no attempt at communication about the changed objects. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
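For completeness, read-only mode is configured with roughly the following (a sketch from memory of the docs; pool names are placeholders):

ceph osd tier add base-pool cache-pool
ceph osd tier cache-mode cache-pool readonly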
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
On Wed, Mar 18, 2015 at 11:10 PM, Christian Balzer ch...@gol.com wrote: Hello, On Wed, 18 Mar 2015 11:05:47 -0700 Gregory Farnum wrote: On Wed, Mar 18, 2015 at 8:04 AM, Nick Fisk n...@fisk.me.uk wrote: Hi Greg, Thanks for your input and completely agree that we cannot expect developers to fully document what impact each setting has on a cluster, particularly in a performance related way That said, if you or others could spare some time for a few pointers it would be much appreciated and I will endeavour to create some useful results/documents that are more relevant to end users. I have taken on board what you said about the WB throttle and have been experimenting with it by switching it on and off. I know it's a bit of a blunt configuration change, but it was useful to understand its effect. With it off, I do see initially quite a large performance increase but overtime it actually starts to slow the average throughput down. Like you said, I am guessing this is to do with it making sure the journal doesn't get to far ahead, leaving it with massive sync's to carry out. One thing I do see with the WBT enabled and to some extent with it disabled, is that there are large periods of small block writes at the max speed of the underlying sata disk (70-80iops). Here are 2 blktrace seekwatcher traces of performing an OSD bench (64kb io's for 500MB) where this behaviour can be seen. If you're doing 64k IOs then I believe it's creating a new on-disk file for each of those writes. How that's laid out on-disk will depend on your filesystem and the specific config options that we're using to try to avoid running too far ahead of the journal. Could you elaborate on that a bit? I would have expected those 64KB writes to go to the same object (file) until it is full (4MB). Because this behavior would explain some (if not all) of the write amplification I've seen in the past with small writes (see the SSD Hardware recommendation thread). Ah, no, you're right. With the bench command it all goes in to one object, it's just a separate transaction for each 64k write. But again depending on flusher and throttler settings in the OSD, and the backing FS' configuration, it can be a lot of individual updates — in particular, every time there's a sync it has to update the inode. Certainly that'll be the case in the described configuration, with relatively low writeahead limits on the journal but high sync intervals — once you hit the limits, every write will get an immediate flush request. But none of that should have much impact on your write amplification tests unless you're actually using osd bench to test it. You're more likely to be seeing the overhead of the pg log entry, pg info change, etc that's associated with each write. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
On Thu, Mar 19, 2015 at 2:41 PM, Nick Fisk n...@fisk.me.uk wrote: I'm looking at trialling OSD's with a small flashcache device over them to hopefully reduce the impact of metadata updates when doing small block io. Inspiration from here:- http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/12083 One thing I suspect will happen, is that when the OSD node starts up udev could possibly mount the base OSD partition instead of flashcached device, as the base disk will have the ceph partition uuid type. This could result in quite nasty corruption. I have had a look at the Ceph udev rules and can see that something similar has been done for encrypted OSD's. Am I correct in assuming that what I need to do is to create a new partition uuid type for flashcached OSD's and then create a udev rule to activate these new uuid'd OSD's once flashcache has finished assembling them? I haven't worked with the udev rules in a while, but that sounds like the right way to go. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
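A sketch of what that could end up looking like; the GUID below is a made-up placeholder for the new partition type and the activation script path is hypothetical, but the stock rules in /lib/udev/rules.d/95-ceph-osd.rules are the thing to crib the structure from:

# /etc/udev/rules.d/96-ceph-flashcache.rules
ACTION=="add", SUBSYSTEM=="block", \
  ENV{ID_PART_ENTRY_TYPE}=="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee", \
  RUN+="/usr/local/sbin/ceph-flashcache-activate /dev/$name"

The script would assemble the flashcache device and then run ceph-disk activate (or mount the OSD itself) against the composite device rather than the raw partition.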
Re: [ceph-users] mds log message
On Fri, Mar 20, 2015 at 12:39 PM, Daniel Takatori Ohara dtoh...@mochsl.org.br wrote: Hello, Can anybody help me, please? Some messages appear in the log of my mds, and afterwards the shells on my clients freeze. 2015-03-20 12:23:54.068005 7f1608d49700 0 log_channel(default) log [WRN] : client.3197487 isn't responding to mclientcaps(revoke), ino 11b1696 pending pAsxLsXsxFcb issued pAsxLsXsxFsxcrwb, sent 962.02 Well, this one means that it asked a client to revoke some file capabilities 962 seconds ago, and the client still hasn't. 2015-03-20 12:23:54.068135 7f1608d49700 0 log_channel(default) log [WRN] : 1 slow requests, 1 included below; oldest blocked for 962.028297 secs 2015-03-20 12:23:54.068142 7f1608d49700 0 log_channel(default) log [WRN] : slow request 962.028297 seconds old, received at 2015-03-20 12:07:52.039805: client_request(client.3197487:391527 create #11b And this is a request from the same client to create a file, also received ~962 seconds ago. This is probably blocked by the aforementioned capability drop. Everything that follows these has a good chance of being a follow-on effect. The issue will probably clear itself up if you just restart the MDS. We've fixed a lot of bugs around this recently (although it's an ongoing source of them), so unless you're running very new code I would just restart and not worry about it. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
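Before (or instead of) restarting, the MDS admin socket can show which client is sitting on the capabilities and what is stuck behind it (a sketch; "a" is a placeholder MDS id, and some of these commands only exist on fairly recent releases):

ceph daemon mds.a session ls           # map client.3197487 back to a host/mount
ceph daemon mds.a dump_ops_in_flight   # the blocked create request should show up here
# then bounce the daemon, e.g. on a sysvinit setup:
service ceph restart mds.a

If the hung client is identifiable, unmounting and remounting it is the other way to break the deadlock.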
Re: [ceph-users] hadoop namenode not starting due to bindException while deploying hadoop with cephFS
On Fri, Mar 20, 2015 at 1:05 PM, Ridwan Rashid ridwan...@gmail.com wrote: Gregory Farnum greg@... writes: On Thu, Mar 19, 2015 at 5:57 PM, Ridwan Rashid ridwan064@... wrote: Hi, I have a 5 node ceph(v0.87) cluster and am trying to deploy hadoop with cephFS. I have installed hadoop-1.1.1 in the nodes and changed the conf/core-site.xml file according to the ceph documentation http://ceph.com/docs/master/cephfs/hadoop/ but after changing the file the namenode is not starting (namenode can be formatted) but the other services(datanode, jobtracker, tasktracker) are running in hadoop. The default hadoop works fine but when I change the core-site.xml file as above I get the following bindException as can be seen from the namenode log: 2015-03-19 01:37:31,436 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException: Problem binding to node1/10.242.144.225:6789 : Cannot assign requested address I have one monitor for the ceph cluster (node1/10.242.144.225) and I included in the core-site.xml file ceph://10.242.144.225:6789 as the value of fs.default.name. The 6789 port is the default port being used by the monitor node of ceph, so that may be the reason for the bindException but the ceph documentation mentions that it should be included like this in the core-site.xml file. It would be really helpful to get some pointers to where I am doing wrong in the setup. I'm a bit confused. The NameNode is only used by HDFS, and so shouldn't be running at all if you're using CephFS. Nor do I have any idea why you've changed anything in a way that tells the NameNode to bind to the monitor's IP address; none of the instructions that I see can do that, and they certainly shouldn't be. -Greg Hi Greg, I want to run a hadoop job (e.g. terasort) and want to use cephFS instead of HDFS. In Using Hadoop with cephFS documentation in http://ceph.com/docs/master/cephfs/hadoop/ if you look into the Hadoop configuration section, the first property fs.default.name has to be set as the ceph URI and in the notes it's mentioned as ceph://[monaddr:port]/. My core-site.xml of hadoop conf looks like this configuration property namefs.default.name/name valueceph://10.242.144.225:6789/value /property Yeah, that all makes sense. But I don't understand why or how you're starting up a NameNode at all, nor what config values it's drawing from to try and bind to that port. The NameNode is the problem because it shouldn't even be invoked. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
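In other words, with CephFS underneath there is no HDFS at all, so none of the HDFS daemons get started; only the MapReduce ones do. A sketch for hadoop-1.1.1 (this assumes the cephfs-hadoop jar and the libcephfs JNI library are already on Hadoop's classpath, and that fs.ceph.impl is set to the CephFileSystem class as described on the same doc page):

bin/stop-all.sh          # make sure no stale NameNode/DataNode is running
bin/start-mapred.sh      # jobtracker + tasktrackers only; no start-dfs.sh, no namenode -format
bin/hadoop fs -ls /      # should list the CephFS root via the ceph:// default filesystem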
Re: [ceph-users] hadoop namenode not starting due to bindException while deploying hadoop with cephFS
On Thu, Mar 19, 2015 at 5:57 PM, Ridwan Rashid ridwan...@gmail.com wrote: Hi, I have a 5 node ceph(v0.87) cluster and am trying to deploy hadoop with cephFS. I have installed hadoop-1.1.1 in the nodes and changed the conf/core-site.xml file according to the ceph documentation http://ceph.com/docs/master/cephfs/hadoop/ but after changing the file the namenode is not starting (namenode can be formatted) but the other services(datanode, jobtracker, tasktracker) are running in hadoop. The default hadoop works fine but when I change the core-site.xml file as above I get the following bindException as can be seen from the namenode log: 2015-03-19 01:37:31,436 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException: Problem binding to node1/10.242.144.225:6789 : Cannot assign requested address I have one monitor for the ceph cluster (node1/10.242.144.225) and I included in the core-site.xml file ceph://10.242.144.225:6789 as the value of fs.default.name. The 6789 port is the default port being used by the monitor node of ceph, so that may be the reason for the bindException but the ceph documentation mentions that it should be included like this in the core-site.xml file. It would be really helpful to get some pointers to where I am doing wrong in the setup. I'm a bit confused. The NameNode is only used by HDFS, and so shouldn't be running at all if you're using CephFS. Nor do I have any idea why you've changed anything in a way that tells the NameNode to bind to the monitor's IP address; none of the instructions that I see can do that, and they certainly shouldn't be. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RadosGW Direct Upload Limitation
On Mon, Mar 16, 2015 at 11:14 AM, Georgios Dimitrakakis gior...@acmac.uoc.gr wrote: Hi all! I have recently updated to CEPH version 0.80.9 (latest Firefly release) which presumably supports direct upload. I 've tried to upload a file using this functionality and it seems that is working for files up to 5GB. For files above 5GB there is an error. I believe that this is because of a hardcoded limit: #define RGW_MAX_PUT_SIZE(5ULL*1024*1024*1024) Is there a way to increase that limit other than compiling CEPH from source? No. Could we somehow put it as a configuration parameter? Maybe, but I'm not sure if Yehuda would want to take it upstream or not. This limit is present because it's part of the S3 spec. For larger objects you should use multi-part upload, which can get much bigger. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
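For CLI users, recent s3cmd versions will do the multipart split for you on large files (a sketch; the chunk size flag and value are just illustrative):

s3cmd put --multipart-chunk-size-mb=512 big-file.img s3://my-bucket/big-file.img

Each part then stays under the 5GB RGW_MAX_PUT_SIZE limit while the assembled object can be much larger.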
Re: [ceph-users] Shadow files
On Mon, Mar 16, 2015 at 12:12 PM, Craig Lewis cle...@centraldesktop.com wrote: Out of curiousity, what's the frequency of the peaks and troughs? RadosGW has configs on how long it should wait after deleting before garbage collecting, how long between GC runs, and how many objects it can GC in per run. The defaults are 2 hours, 1 hour, and 32 respectively. Search http://docs.ceph.com/docs/master/radosgw/config-ref/ for rgw gc. If your peaks and troughs have a frequency less than 1 hour, then GC is going to delay and alias the disk usage w.r.t. the object count. If you have millions of objects, you probably need to tweak those values. If RGW is only GCing 32 objects an hour, it's never going to catch up. Now that I think about it, I bet I'm having issues here too. I delete more than (32*24) objects per day... Uh, that's not quite what rgw_gc_max_objs mean. That param configures how the garbage control data objects and internal classes are sharded, and each grouping will only delete one object at a time. So it controls the parallelism, but not the total number of objects! Also, Yehuda says that changing this can be a bit dangerous because it currently needs to be consistent across any program doing or generating GC work. -Greg On Sun, Mar 15, 2015 at 4:41 PM, Ben b@benjackson.email wrote: It is either a problem with CEPH, Civetweb or something else in our configuration. But deletes in user buckets is still leaving a high number of old shadow files. Since we have millions and millions of objects, it is hard to reconcile what should and shouldnt exist. Looking at our cluster usage, there are no troughs, it is just a rising peak. But when looking at users data usage, we can see peaks and troughs as you would expect as data is deleted and added. Our ceph version 0.80.9 Please ideas? On 2015-03-13 02:25, Yehuda Sadeh-Weinraub wrote: - Original Message - From: Ben b@benjackson.email To: ceph-us...@ceph.com Sent: Wednesday, March 11, 2015 8:46:25 PM Subject: Re: [ceph-users] Shadow files Anyone got any info on this? Is it safe to delete shadow files? It depends. Shadow files are badly named objects that represent part of the objects data. They are only safe to remove if you know that the corresponding objects no longer exist. Yehuda On 2015-03-11 10:03, Ben wrote: We have a large number of shadow files in our cluster that aren't being deleted automatically as data is deleted. Is it safe to delete these files? Is there something we need to be aware of when deleting them? Is there a script that we can run that will delete these safely? Is there something wrong with our cluster that it isn't deleting these files when it should be? We are using civetweb with radosgw, with tengine ssl proxy infront of it Any advice please Thanks ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
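For anyone trying to see what the collector is actually doing, radosgw-admin can inspect and trigger it directly (a sketch):

radosgw-admin gc list --include-all   # pending shadow/tail objects and when they become eligible
radosgw-admin gc process              # run a collection pass now instead of waiting for the next cycle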
Re: [ceph-users] Cascading Failure of OSDs
This might be related to the backtrace assert, but that's the problem you need to focus on. In particular, both of these errors are caused by the scrub code, which Sage suggested temporarily disabling — if you're still getting these messages, you clearly haven't done so successfully. That said, it looks like the problem is that the object and/or object info specified here are just totally busted. You probably want to figure out what happened there since these errors are normally a misconfiguration somewhere (e.g., setting nobarrier on fs mount and then losing power). I'm not sure if there's a good way to repair the object, but if you can lose the data I'd grab the ceph-objectstore tool and remove the object from each OSD holding it that way. (There's a walkthrough of using it for a similar situation in a recent Ceph blog post.) On Fri, Mar 6, 2015 at 7:14 PM, Quentin Hartman qhart...@direwolfdigital.com wrote: Alright, tried a few suggestions for repairing this state, but I don't seem to have any PG replicas that have good copies of the missing / zero length shards. What do I do now? telling the pg's to repair doesn't seem to help anything? I can deal with data loss if I can figure out which images might be damaged, I just need to get the cluster consistent enough that the things which aren't damaged can be usable. Also, I'm seeing these similar, but not quite identical, error messages as well. I assume they are referring to the same root problem: -1 2015-03-07 03:12:49.217295 7fc8ab343700 0 log [ERR] : 3.69d shard 22: soid dd85669d/rbd_data.3f7a2ae8944a.19a5/7//3 size 0 != known size 4194304 Mmm, unfortunately that's a different object than the one referenced in the earlier crash. Maybe it's repairable, or it might be the same issue — looks like maybe you've got some widespread data loss. -Greg On Fri, Mar 6, 2015 at 7:54 PM, Quentin Hartman qhart...@direwolfdigital.com wrote: Finally found an error that seems to provide some direction: -1 2015-03-07 02:52:19.378808 7f175b1cf700 0 log [ERR] : scrub 3.18e e08a418e/rbd_data.3f7a2ae8944a.16c8/7//3 on disk size (0) does not match object info size (4120576) ajusted for ondisk to (4120576) I'm diving into google now and hoping for something useful. If anyone has a suggestion, I'm all ears! QH On Fri, Mar 6, 2015 at 6:26 PM, Quentin Hartman qhart...@direwolfdigital.com wrote: Thanks for the suggestion, but that doesn't seem to have made a difference. I've shut the entire cluster down and brought it back up, and my config management system seems to have upgraded ceph to 0.80.8 during the reboot. Everything seems to have come back up, but I am still seeing the crash loops, so that seems to indicate that this is definitely something persistent, probably tied to the OSD data, rather than some weird transient state. On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil s...@newdream.net wrote: It looks like you may be able to work around the issue for the moment with ceph osd set nodeep-scrub as it looks like it is scrub that is getting stuck? sage On Fri, 6 Mar 2015, Quentin Hartman wrote: Ceph health detail - http://pastebin.com/5URX9SsQpg dump summary (with active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ an osd crash log (in github gist because it was too big for pastebin) - https://gist.github.com/qhartman/cb0e290df373d284cfb5 And now I've got four OSDs that are looping. On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman qhart...@direwolfdigital.com wrote: So I'm in the middle of trying to triage a problem with my ceph cluster running 0.80.5. 
I have 24 OSDs spread across 8 machines. The cluster has been running happily for about a year. This last weekend, something caused the box running the MDS to sieze hard, and when we came in on monday, several OSDs were down or unresponsive. I brought the MDS and the OSDs back on online, and managed to get things running again with minimal data loss. Had to mark a few objects as lost, but things were apparently running fine at the end of the day on Monday. This afternoon, I noticed that one of the OSDs was apparently stuck in a crash/restart loop, and the cluster was unhappy. Performance was in the tank and ceph status is reporting all manner of problems, as one would expect if an OSD is misbehaving. I marked the offending OSD out, and the cluster started rebalancing as expected. However, I noticed a short while later, another OSD has started into a crash/restart loop. So, I repeat the process. And it happens again. At this point I notice, that there are actually two at a time which are in this state. It's as if there's some toxic chunk of data that is getting passed around, and when it lands on an OSD it kills it. Contrary to that, however, I tried just stopping an OSD when it's in a bad state, and once the cluster starts
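For reference, the by-hand removal Greg describes looks roughly like the sketch below; the binary is called ceph_objectstore_tool (with underscores) on some 0.80.x/0.87 packages, the OSD must be stopped first, and the PG id and object name need to be the exact ones from your logs rather than the truncated forms shown above:

service ceph stop osd.22
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 \
    --journal-path /var/lib/ceph/osd/ceph-22/journal \
    --pgid 3.69d '<full-object-name>' remove

Repeat on each OSD that holds a copy of the broken object, then restart the OSDs and let the PG re-peer.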
Re: [ceph-users] flock() supported on CephFS through Fuse ?
On Tue, Mar 10, 2015 at 4:20 AM, Florent B flor...@coppint.com wrote: Hi all, I'm testing flock() locking system on CephFS (Giant) using Fuse. It seems that lock works per client, and not over all clients. Am I right or is it supposed to work over different clients ? Does MDS has such a locking system and is it supported through Fuse ? Thank you. P.S.: I use a simple PHP script to test it, attached. flock and fcntl locking has been supported in the kernel client for many years, but was only implemented for ceph-fuse recently. It will be in hammer and was backported for the next firefly point release, but is unlikely to go into giant (unless of course somebody from the community does the backport and enough testing ;). -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Issue with free Inodes
On Tue, Mar 24, 2015 at 12:13 AM, Christian Balzer ch...@gol.com wrote: On Tue, 24 Mar 2015 09:41:04 +0300 Kamil Kuramshin wrote: Yes I read it and do no not understand what you mean when say *verify this*? All 3335808 inodes are definetly files and direcories created by ceph OSD process: What I mean is how/why did Ceph create 3+ million files, where in the tree are they actually or are they evenly distributed in the respective PG sub-directories. Or to ask it differently, how large is your cluster (how many OSDs, objects), in short the output of ceph -s. If cache-tiers actually are reserving each object that exists on the backing store (even if there isn't data in it yet on the cache tier) and your cluster is large enough, it might explain this. Nope. As you've said, this doesn't make any sense unless the objects are all ludicrously small (and you can't actually get 10-byte objects in Ceph; the names alone tend to be bigger than that) or something else is using up inodes. And that should both be mentioned and precautions to not run out of inodes should be made by the Ceph code. If not, this may be a bug after all. Would be nice if somebody from the Ceph devs could have gander at this. Christian *tune2fs 1.42.5 (29-Jul-2012)* Filesystem volume name: none Last mounted on: /var/lib/ceph/tmp/mnt.05NAJ3 Filesystem UUID: e4dcca8a-7b68-4f60-9b10-c164dc7f9e33 Filesystem magic number: 0xEF53 Filesystem revision #:1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize Filesystem flags: signed_directory_hash Default mount options:user_xattr acl Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux *Inode count: 3335808* Block count: 13342945 Reserved block count: 667147 Free blocks: 5674105 *Free inodes: 0* First block: 0 Block size: 4096 Fragment size:4096 Reserved GDT blocks: 1020 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 8176 Inode blocks per group: 511 Flex block group size:16 Filesystem created: Fri Feb 20 16:44:25 2015 Last mount time: Tue Mar 24 09:33:19 2015 Last write time: Tue Mar 24 09:33:27 2015 Mount count: 7 Maximum mount count: -1 Last checked: Fri Feb 20 16:44:25 2015 Check interval: 0 (none) Lifetime writes: 4116 GB Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 256 Required extra isize: 28 Desired extra isize: 28 Journal inode:8 Default directory hash: half_md4 Directory Hash Seed: 148ee5dd-7ee0-470c-a08a-b11c318ff90b Journal backup: inode blocks *fsck.ext4 /dev/sda1* e2fsck 1.42.5 (29-Jul-2012) /dev/sda1: clean, 3335808/3335808 files, 7668840/13342945 blocks 23.03.2015 17:09, Christian Balzer пишет: On Mon, 23 Mar 2015 15:26:07 +0300 Kamil Kuramshin wrote: Yes, I understand that. The initial purpose of first email was just an advise for new comers. My fault was in that I was selected ext4 for SSD disks as backend. But I did not foresee that inode number can reach its limit before the free space :) And maybe there must be some sort of warning not only for free space in MiBs(GiBs,TiBs) and there must be dedicated warning about free inodes for filesystems with static inode allocation like ext4. Because if OSD reach inode limit it becames totally unusable and immediately goes down, and from that moment there is no way to start it! While all that is true and should probably be addressed, please re-read what I wrote before. 
With the 3.3 million inodes used and thus likely as many files (did you verify this?) and 4MB objects that would make something in the 12TB ballpark area. Something very very strange and wrong is going on with your cache tier. Christian 23.03.2015 13:42, Thomas Foster пишет: You could fix this by changing your block size when formatting the mount-point with the mkfs -b command. I had this same issue when dealing with the filesystem using glusterfs and the solution is to either use a filesystem that allocates inodes automatically or change the block size when you build the filesystem. Unfortunately, the only way to fix the problem that I have seen is to reformat On Mon, Mar 23, 2015 at 5:51 AM, Kamil Kuramshin kamil.kurams...@tatar.ru mailto:kamil.kurams...@tatar.ru wrote: In my case there was cache pool for ec-pool serving RBD-images, and object size is 4Mb, and client was an /kernel-rbd /client each SSD disk is 60G disk, 2 disk per node, 6 nodes in total = 12 OSDs in total
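For the record, the knob that matters is set at mkfs time: bytes-per-inode. A sketch (this reformats the disk, so the OSD has to be rebuilt); -i 4096 gives roughly four times as many inodes as the ext4 default of one per 16KiB, and df -i is the quick way to watch how close an OSD is getting to the limit:

mkfs.ext4 -i 4096 /dev/sda1          # one inode per 4KiB of space
df -i /var/lib/ceph/osd/ceph-12      # IFree / IUse% alongside the usual space numbers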
Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded
On Wed, Mar 25, 2015 at 1:20 AM, Udo Lembke ulem...@polarzone.de wrote: Hi, due to two more hosts (now 7 storage nodes) I want to create an new ec-pool and get an strange effect: ceph@admin:~$ ceph health detail HEALTH_WARN 2 pgs degraded; 2 pgs stuck degraded; 2 pgs stuck unclean; 2 pgs stuck undersized; 2 pgs undersized This is the big clue: you have two undersized PGs! pg 22.3e5 is stuck unclean since forever, current state active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647] 2147483647 is the largest number you can represent in a signed 32-bit integer. There's an output error of some kind which is fixed elsewhere; this should be -1. So for whatever reason (in general it's hard on CRUSH trying to select N entries out of N choices), CRUSH hasn't been able to map an OSD to this slot for you. You'll want to figure out why that is and fix it. -Greg pg 22.240 is stuck unclean since forever, current state active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58] pg 22.3e5 is stuck undersized for 406.614447, current state active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647] pg 22.240 is stuck undersized for 406.616563, current state active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58] pg 22.3e5 is stuck degraded for 406.614566, current state active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647] pg 22.240 is stuck degraded for 406.616679, current state active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58] pg 22.3e5 is active+undersized+degraded, acting [76,15,82,11,57,29,2147483647] pg 22.240 is active+undersized+degraded, acting [38,85,17,74,2147483647,10,58] But I have only 91 OSDs (84 Sata + 7 SSDs) not 2147483647! Where the heck came the 2147483647 from? I do following commands: ceph osd erasure-code-profile set 7hostprofile k=5 m=2 ruleset-failure-domain=host ceph osd pool create ec7archiv 1024 1024 erasure 7hostprofile my version: ceph -v ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e) I found an issue in my crush-map - one SSD was twice in the map: host ceph-061-ssd { id -16 # do not change unnecessarily # weight 0.000 alg straw hash 0 # rjenkins1 } root ssd { id -13 # do not change unnecessarily # weight 0.780 alg straw hash 0 # rjenkins1 item ceph-01-ssd weight 0.170 item ceph-02-ssd weight 0.170 item ceph-03-ssd weight 0.000 item ceph-04-ssd weight 0.170 item ceph-05-ssd weight 0.170 item ceph-06-ssd weight 0.050 item ceph-07-ssd weight 0.050 item ceph-061-ssd weight 0.000 } Host ceph-061-ssd don't excist and osd-61 is the SSD from ceph-03-ssd, but after fix the crusmap the issue with the osd 2147483647 still excist. Any idea how to fix that? regards Udo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
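The usual fix when CRUSH gives up on the last slot of a wide EC rule like this (k+m equal to the number of hosts) is to give it more placement attempts. A sketch of the round trip; 100 is just a commonly used value:

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# in the rule used by the ec7archiv pool, add a line before "step take":
#     step set_choose_tries 100
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new

After injecting the new map the two undersized PGs should re-peer with a real OSD in the seventh slot instead of the -1 placeholder.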
Re: [ceph-users] error creating image in rbd-erasure-pool
Yes. On Wed, Mar 25, 2015 at 4:13 AM, Frédéric Nass frederic.n...@univ-lorraine.fr wrote: Hi Greg, Thank you for this clarification. It helps a lot. Does this can't think of any issues apply to both rbd and pool snapshots ? Frederic. On Tue, Mar 24, 2015 at 12:09 PM, Brendan Moloney molo...@ohsu.edu wrote: Hi Loic and Markus, By the way, Inktank do not support snapshot of a pool with cache tiering : * https://download.inktank.com/docs/ICE%201.2%20-%20Cache%20and%20Erasure%20Coding%20FAQ.pdf Hi, You seem to be talking about pool snapshots rather than RBD snapshots. But in the linked document it is not clear that there is a distinction: Can I use snapshots with a cache tier? Snapshots are not supported in conjunction with cache tiers. Can anyone clarify if this is just pool snapshots? I think that was just a decision based on the newness and complexity of the feature for product purposes. Snapshots against cache tiered pools certainly should be fine in Giant/Hammer and we can't think of any issues in Firefly off the tops of our heads. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Cordialement, Frédéric Nass. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph -w: Understanding MB data versus MB used
On Wed, Mar 25, 2015 at 1:24 AM, Saverio Proto ziopr...@gmail.com wrote: Hello there, I started to push data into my ceph cluster. There is something I cannot understand in the output of ceph -w. When I run ceph -w I get this kind of output: 2015-03-25 09:11:36.785909 mon.0 [INF] pgmap v278788: 26056 pgs: 26056 active+clean; 2379 MB data, 19788 MB used, 33497 GB / 33516 GB avail 2379MB is actually the data I pushed into the cluster, I can see it also in the ceph df output, and the numbers are consistent. What I don't understand is 19788MB used. All my pools have size 3, so I expected something like 2379 * 3. Instead this number is very big. I really need to understand how MB used grows because I need to know how many disks to buy. MB used is the summation of (the programmatic equivalent to) df across all your nodes, whereas MB data is calculated by the OSDs based on data they've written down. Depending on your configuration MB used can include things like the OSD journals, or even totally unrelated data if the disks are shared with other applications. MB used including the space used by the OSD journals is my first guess about what you're seeing here, in which case you'll notice that it won't grow any faster than MB data does once the journal is fully allocated. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
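A quick way to check the journal theory (a sketch; paths are the defaults) is to compare what Ceph reports with what the OSD filesystems report, remembering that a journal file living on the same filesystem as the OSD data counts toward "used" but never toward "data":

ceph df                       # GLOBAL used/avail plus per-pool data
df -h /var/lib/ceph/osd/*     # what each OSD's filesystem reports; journals on these disks show up as used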
Re: [ceph-users] how do I destroy cephfs? (interested in cephfs + tiering + erasure coding)
On Wed, Mar 25, 2015 at 10:36 AM, Jake Grimmett j...@mrc-lmb.cam.ac.uk wrote: Dear All, Please forgive this post if it's naive, I'm trying to familiarise myself with cephfs! I'm using Scientific Linux 6.6. with Ceph 0.87.1 My first steps with cephfs using a replicated pool worked OK. Now trying now to test cephfs via a replicated caching tier on top of an erasure pool. I've created an erasure pool, cannot put it under the existing replicated pool. My thoughts were to delete the existing cephfs, and start again, however I cannot delete the existing cephfs: errors are as follows: [root@ceph1 ~]# ceph fs rm cephfs2 Error EINVAL: all MDS daemons must be inactive before removing filesystem I've tried killing the ceph-mds process, but this does not prevent the above error. I've also tried this, which also errors: [root@ceph1 ~]# ceph mds stop 0 Error EBUSY: must decrease max_mds or else MDS will immediately reactivate Right, so did you run ceph mds set_max_mds 0 and then repeating the stop command? :) This also fail... [root@ceph1 ~]# ceph-deploy mds destroy [ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf [ceph_deploy.cli][INFO ] Invoked (1.5.21): /usr/bin/ceph-deploy mds destroy [ceph_deploy.mds][ERROR ] subcommand destroy not implemented Am I doing the right thing in trying to wipe the original cephfs config before attempting to use an erasure cold tier? Or can I just redefine the cephfs? Yeah, unfortunately you need to recreate it if you want to try and use an EC pool with cache tiering, because CephFS knows what pools it expects data to belong to. Things are unlikely to behave correctly if you try and stick an EC pool under an existing one. :( Sounds like this is all just testing, which is good because the suitability of EC+cache is very dependent on how much hot data you have, etc...good luck! -Greg many thanks, Jake Grimmett ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
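For reference, the teardown/recreate sequence on 0.87 is roughly (a sketch; pool names are placeholders, and the EC pool still needs a replicated cache tier configured in front of it before CephFS will accept it):

ceph mds set_max_mds 0
ceph mds stop 0                             # repeat for any other active ranks
ceph fs rm cephfs2 --yes-i-really-mean-it
# set up the EC pool plus its cache tier, then:
ceph fs new cephfs2 <metadata-pool> <data-pool>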
Re: [ceph-users] CephFS Slow writes with 1MB files
On Sat, Mar 28, 2015 at 10:12 AM, Barclay Jameson almightybe...@gmail.com wrote: I redid my entire Ceph build going back to to CentOS 7 hoping to the get the same performance I did last time. The rados bench test was the best I have ever had with a time of 740 MB wr and 1300 MB rd. This was even better than the first rados bench test that had performance equal to PanFS. I find that this does not translate to my CephFS. Even with the following tweaking it still at least twice as slow as PanFS and my first *Magical* build (that had absolutely no tweaking): OSD osd_op_treads 8 /sys/block/sd*/queue/nr_requests 4096 /sys/block/sd*/queue/read_ahead_kb 4096 Client rsize=16777216 readdir_max_bytes=16777216 readdir_max_entries=16777216 ~160 mins to copy 10 (1MB) files for CephFS vs ~50 mins for PanFS. Throughput on CephFS is about 10MB/s vs PanFS 30 MB/s. Strange thing is none of the resources are taxed. CPU, ram, network, disks, are not even close to being taxed on either the client,mon/mds, or the osd nodes. The PanFS client node was a 10Gb network the same as the CephFS client but you can see the huge difference in speed. As per Gregs questions before: There is only one client reading and writing (time cp Small1/* Small2/.) but three clients have cephfs mounted, although they aren't doing anything on the filesystem. I have done another test where I stream data info a file as fast as the processor can put it there. (for (i=0; i 11; i++){ fprintf (out_file, I is : %d\n,i);} ) and it is faster than the PanFS. CephFS 16GB in 105 seconds with the above tuning vs 130 seconds for PanFS. Without the tuning it takes 230 seconds for CephFS although the first build did it in 130 seconds without any tuning. This leads me to believe the bottleneck is the mds. Does anybody have any thoughts on this? Are there any tuning parameters that I would need to speed up the mds? This is pretty likely, but 10 creates/second is just impossibly slow. The only other thing I can think of is that you might have enabled fragmentation but aren't now, which might make an impact on a directory with 100k entries. Or else your hardware is just totally wonky, which we've seen in the past but your server doesn't look quite large enough to be hitting any of the nasty NUMA stuff...but that's something else to look at which I can't help you with, although maybe somebody else can. If you're interested in diving into it and depending on the Ceph version you're running you can also examine the mds perfcounters (http://ceph.com/docs/master/dev/perf_counters/) and the op history (dump_ops_in_flight etc) and look for any operations which are noticeably slow. -Greg On Fri, Mar 27, 2015 at 4:50 PM, Gregory Farnum g...@gregs42.com wrote: On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson almightybe...@gmail.com wrote: Yes it's the exact same hardware except for the MDS server (although I tried using the MDS on the old node). I have not tried moving the MON back to the old node. My default cache size is mds cache size = 1000 The OSDs (3 of them) have 16 Disks with 4 SSD Journal Disks. I created 2048 for data and metadata: ceph osd pool create cephfs_data 2048 2048 ceph osd pool create cephfs_metadata 2048 2048 To your point on clients competing against each other... how would I check that? Do you have multiple clients mounted? Are they both accessing files in the directory(ies) you're testing? Were they accessing the same pattern of files for the old cluster? 
If you happen to be running a hammer rc or something pretty new you can use the MDS admin socket to explore a bit what client sessions there are and what they have permissions on and check; otherwise you'll have to figure it out from the client side. -Greg Thanks for the input! On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum g...@gregs42.com wrote: So this is exactly the same test you ran previously, but now it's on faster hardware and the test is slower? Do you have more data in the test cluster? One obvious possibility is that previously you were working entirely in the MDS' cache, but now you've got more dentries and so it's kicking data out to RADOS and then reading it back in. If you've got the memory (you appear to) you can pump up the mds cache size config option quite dramatically from it's default 10. Other things to check are that you've got an appropriately-sized metadata pool, that you've not got clients competing against each other inappropriately, etc. -Greg On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson almightybe...@gmail.com wrote: Opps I should have said that I am not just writing the data but copying it : time cp Small1/* Small2/* Thanks, BJ On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson almightybe...@gmail.com wrote: I did a Ceph cluster install 2 weeks ago where I was getting great performance (~= PanFS) where I could write 100,000 1MB files
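Concretely, the counters and op history mentioned above come off the MDS admin socket (a sketch; "a" is a placeholder MDS id, and as noted some of these only exist on hammer-era code):

ceph daemon mds.a perf dump            # mds/objecter counters, including cache and request stats
ceph daemon mds.a dump_ops_in_flight   # currently executing requests and how long they've been at it
ceph daemon mds.a dump_historic_ops    # recently completed slow requests
ceph daemon mds.a session ls           # which clients are connected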
Re: [ceph-users] SSD Journaling
On Mon, Mar 30, 2015 at 1:01 PM, Garg, Pankaj pankaj.g...@caviumnetworks.com wrote: Hi, I’m benchmarking my small cluster with HDDs vs HDDs with SSD Journaling. I am using both RADOS bench and Block device (using fio) for testing. I am seeing significant Write performance improvements, as expected. I am however seeing the Reads coming out a bit slower on the SSD Journaling side. They are not terribly different, but sometimes 10% slower. Is that something other folks have also seen, or do I need some settings to be tuned properly? I’m wondering if accessing 2 drives for reads, adds latency and hence the throughput suffers. You're not reading off of the journal in any case (it's only read on restart). If I were to guess then the SSD journaling is just building up enough dirty data ahead of the backing filesystem that if you do a read it takes a little longer for the data to be readable through the local filesystem. There have been a number of threads here about configuring the journal which you might want to grab out of an archiving system and look at. :) -Greg Thanks Pankaj ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Is it possible to change the MDS node after its been created
On Mon, Mar 30, 2015 at 3:15 PM, Francois Lafont flafdiv...@free.fr wrote: Hi, Gregory Farnum wrote: The MDS doesn't have any data tied to the machine you're running it on. You can either create an entirely new one on a different machine, or simply copy the config file and cephx keyring to the appropriate directories. :) Sorry to enter in this post but how can we *remove* a mds daemon of a ceph cluster? Are the commands below enough? stop the daemon rm -r /var/lib/ceph/mds/ceph-$id/ ceph auth del mds.$id Should we edit something in the mds map to remove once and for all the mds ? As long as you turn on another MDS which takes over the logical rank of the MDS you remove, you don't need to remove anything from the cluster store. Note that if you just copy the directory and keyring to the new location you shouldn't do the ceph auth del bit either. ;) -Greg -- François Lafont -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Is it possible to change the MDS node after its been created
On Mon, Mar 30, 2015 at 1:51 PM, Steve Hindle mech...@gmail.com wrote: Hi! I mistakenly created my MDS node on the 'wrong' server a few months back. Now I realized I placed it on a machine lacking IPMI and would like to move it to another node in my cluster. Is it possible to non-destructively move an MDS ? The MDS doesn't have any data tied to the machine you're running it on. You can either create an entirely new one on a different machine, or simply copy the config file and cephx keyring to the appropriate directories. :) -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
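A sketch of standing the MDS up on the new machine (hostname, cluster name, and paths are placeholders; if you copy the existing keyring across instead, skip the auth step so you don't mint a second key):

mkdir -p /var/lib/ceph/mds/ceph-newhost
ceph auth get-or-create mds.newhost mon 'allow profile mds' osd 'allow rwx' mds 'allow' \
    -o /var/lib/ceph/mds/ceph-newhost/keyring
# add an [mds.newhost] section with host = newhost to ceph.conf, then:
service ceph start mds.newhost

Once the new daemon has taken over as the active MDS, the old one can be stopped and cleaned up as described above.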
Re: [ceph-users] One host failure bring down the whole cluster
On Mon, Mar 30, 2015 at 8:02 PM, Lindsay Mathieson lindsay.mathie...@gmail.com wrote: On Tue, 31 Mar 2015 02:42:27 AM Kai KH Huang wrote: Hi, all I have a two-node Ceph cluster, and both are monitor and osd. When they're both up, osd are all up and in, everything is fine... almost: Two things. 1 - You *really* need a min of three monitors. Ceph cannot form a quorum with just two monitors and you run a risk of split brain. You can form quorums with an even number of monitors, and Ceph does so — there's no risk of split brain. The problem with 2 monitors is that a quorum is always 2 — which is exactly what you're seeing right now. You can't run with only one monitor up (assuming you have a non-zero number of them). 2 - You also probably have a min size of two set (the default). This means that you need a minimum of two copies of each data object for writes to work. So with just two nodes, if one goes down you can't write to the other. Also this. So: - Install a extra monitor node - it doesn't have to be powerful, we just use a Intel Celeron NUC for that. - reduce your minimum size to 1 (One). Yep. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
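Concretely (a sketch; the pool and host names are placeholders, and min_size has to be set per pool):

ceph osd pool set rbd min_size 1     # repeat for each pool
ceph-deploy mon add nuc-mon          # small third monitor host
ceph quorum_status                   # confirm all three monitors are in quorum afterwards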
Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?
I don't know why you're mucking about manually with the rbd directory; the rbd tool and rados handle cache pools correctly as far as I know. -Greg On Thu, Mar 26, 2015 at 8:56 AM, Udo Lembke ulem...@polarzone.de wrote: Hi Greg, ok! It's looks like, that my problem is more setomapval-related... I must o something like rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 \0x0f\0x00\0x00\0x002cfc7ce74b0dc51 but rados setomapval don't use the hexvalues - instead of this I got rados -p ssd-archiv listomapvals rbd_directory name_vm-409-disk-2 value: (35 bytes) : : 5c 30 78 30 66 5c 30 78 30 30 5c 30 78 30 30 5c : \0x0f\0x00\0x00\ 0010 : 30 78 30 30 32 63 66 63 37 63 65 37 34 62 30 64 : 0x002cfc7ce74b0d 0020 : 63 35 31: c51 hmm, strange. With rados -p ssd-archiv getomapval rbd_directory name_vm-409-disk-2 name_vm-409-disk-2 I got the binary inside the file name_vm-409-disk-2, but reverse do an rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 name_vm-409-disk-2 fill the variable with name_vm-409-disk-2 and not with the content of the file... Are there other tools for the rbd_directory? regards Udo Am 26.03.2015 15:03, schrieb Gregory Farnum: You shouldn't rely on rados ls when working with cache pools. It doesn't behave properly and is a silly operation to run against a pool of any size even when it does. :) More specifically, rados ls is invoking the pgls operation. Normal read/write ops will go query the backing store for objects if they're not in the cache tier. pgls is different — it just tells you what objects are present in the PG on that OSD right now. So any objects which aren't in cache won't show up when listing on the cache pool. -Greg On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke ulem...@polarzone.de wrote: Hi all, due an very silly approach, I removed the cache tier of an filled EC pool. After recreate the pool and connect with the EC pool I don't see any content. How can I see the rbd_data and other files through the new ssd cache tier? I think, that I must recreate the rbd_directory (and fill with setomapval), but I don't see anything yet! $ rados ls -p ecarchiv | more rbd_data.2e47de674b0dc51.00390074 rbd_data.2e47de674b0dc51.0020b64f rbd_data.2fbb1952ae8944a.0016184c rbd_data.2cfc7ce74b0dc51.00363527 rbd_data.2cfc7ce74b0dc51.0004c35f rbd_data.2fbb1952ae8944a.0008db43 rbd_data.2cfc7ce74b0dc51.0015895a rbd_data.31229f0238e1f29.000135eb ... $ rados ls -p ssd-archiv nothing generation of the cache tier: $ rados mkpool ssd-archiv $ ceph osd pool set ssd-archiv crush_ruleset 5 $ ceph osd tier add ecarchiv ssd-archiv $ ceph osd tier cache-mode ssd-archiv writeback $ ceph osd pool set ssd-archiv hit_set_type bloom $ ceph osd pool set ssd-archiv hit_set_count 1 $ ceph osd pool set ssd-archiv hit_set_period 3600 $ ceph osd pool set ssd-archiv target_max_bytes 500 rule ssd { ruleset 5 type replicated min_size 1 max_size 10 step take ssd step choose firstn 0 type osd step emit } Are there any magic (or which command I missed?) to see the excisting data throug the cache tier? regards - and hoping for answers Udo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Weird cluster restart behavior
On Tue, Mar 31, 2015 at 7:50 AM, Quentin Hartman qhart...@direwolfdigital.com wrote: I'm working on redeploying a 14-node cluster. I'm running giant 0.87.1. Last friday I got everything deployed and all was working well, and I set noout and shut all the OSD nodes down over the weekend. Yesterday when I spun it back up, the OSDs were behaving very strangely, incorrectly marking each other because of missed heartbeats, even though they were up. It looked like some kind of low-level networking problem, but I couldn't find any. After much work, I narrowed the apparent source of the problem down to the OSDs running on the first host I started in the morning. They were the ones that were logged the most messages about not being able to ping other OSDs, and the other OSDs were mostly complaining about them. After running out of other ideas to try, I restarted them, and then everything started working. It's still working happily this morning. It seems as though when that set of OSDs started they got stale OSD map information from the MON boxes, which failed to be updated as the other OSDs came up. Does that make sense? I still don't consider myself an expert on ceph architecture and would appreciate and corrections or other possible interpretations of events (I'm happy to provide whatever additional information I can) so I can get a deeper understanding of things. If my interpretation of events is correct, it seems that might point at a bug. I can't find the ticket now, but I think we did indeed have a bug around heartbeat failures when restarting nodes. This has been fixed in other branches but might have been missed for giant. (Did you by any chance set the nodown flag as well as noout?) In general Ceph isn't very happy with being shut down completely like that and its behaviors aren't validated, so nothing will go seriously wrong but you might find little irritants like this. It's particularly likely when you're prohibiting state changes with the noout/nodown flags. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] One of three monitors can not be started
On Tue, Mar 31, 2015 at 2:50 AM, 张皓宇 zhanghaoyu1...@hotmail.com wrote: Who can help me? One monitor in my ceph cluster can not be started. Before that, I added '[mon] mon_compact_on_start = true' to /etc/ceph/ceph.conf on three monitor hosts. Then I did 'ceph tell mon.computer05 compact ' on computer05, which has a monitor on it. When store.db of computer05 changed from 108G to 1G, mon.computer06 stoped, and it can not be started since that. If I start mon.computer06, it will stop on this state: # /etc/init.d/ceph start mon.computer06 === mon.computer06 === Starting Ceph mon.computer06 on computer06... The process info is like this: root 12149 3807 0 20:46 pts/27 00:00:00 /bin/sh /etc/init.d/ceph start mon.computer06 root 12308 12149 0 20:46 pts/27 00:00:00 bash -c ulimit -n 32768; /usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf root 12309 12308 0 20:46 pts/27 00:00:00 /usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf root 12313 12309 19 20:46 pts/27 00:00:01 /usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf Log on computer06 is like this: 2015-03-30 20:46:54.152956 7fc5379d07a0 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 12309 ... 2015-03-30 20:46:54.759791 7fc5379d07a0 1 mon.computer06@-1(probing) e4 preinit clean up potentially inconsistent store state So I haven't looked at this code in a while, but I think the monitor is trying to validate that it's consistent with the others. You probably want to dig around the monitor admin sockets and see what state each monitor is in, plus its perception of the others. In this case, I think maybe mon.computer06 is trying to examine its whole store, but 100GB is a lot (way too much, in fact), so this can take a lng time. Sorry, my English is not good. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
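A hedged example of the admin-socket digging suggested above; the socket paths assume the default /var/run/ceph layout and a cluster named ceph:

  # on each monitor host
  $ ceph --admin-daemon /var/run/ceph/ceph-mon.computer06.asok mon_status
  $ ceph --admin-daemon /var/run/ceph/ceph-mon.computer06.asok quorum_status

Comparing mon_status output across the three hosts shows each monitor's state (probing, synchronizing, peon, leader) and its view of the others, which is usually enough to tell whether computer06 is still working through its oversized store.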
Re: [ceph-users] Weird cluster restart behavior
On Tue, Mar 31, 2015 at 12:56 PM, Quentin Hartman qhart...@direwolfdigital.com wrote: Thanks for the extra info Gregory. I did not also set nodown. I expect that I will be very rarely shutting everything down in the normal course of things, but it has come up a couple times when having to do some physical re-organizing of racks. Little irritants like this aren't a big deal if people know to expect them, but as it is I lost quite a lot of time troubleshooting a non-existant problem. What's the best way to get notes to that effect added to the docs? It seems something in http://ceph.com/docs/master/rados/operations/operating/ would save some people some headache. I'm happy to propose edits, but a quick look doesn't reveal a process for submitting that sort of thing. Github pull requests. :) My understanding is that the right method to take an entire cluster offline is to set noout and then shutting everything down. Is there a better way? That's probably the best way to do it. Like I said, there was also a bug here that I think is fixed for Hammer but that might not have been backported to Giant. Unfortunately I don't remember the right keywords as I wasn't involved in the fix. -Greg QH On Tue, Mar 31, 2015 at 1:35 PM, Gregory Farnum g...@gregs42.com wrote: On Tue, Mar 31, 2015 at 7:50 AM, Quentin Hartman qhart...@direwolfdigital.com wrote: I'm working on redeploying a 14-node cluster. I'm running giant 0.87.1. Last friday I got everything deployed and all was working well, and I set noout and shut all the OSD nodes down over the weekend. Yesterday when I spun it back up, the OSDs were behaving very strangely, incorrectly marking each other because of missed heartbeats, even though they were up. It looked like some kind of low-level networking problem, but I couldn't find any. After much work, I narrowed the apparent source of the problem down to the OSDs running on the first host I started in the morning. They were the ones that were logged the most messages about not being able to ping other OSDs, and the other OSDs were mostly complaining about them. After running out of other ideas to try, I restarted them, and then everything started working. It's still working happily this morning. It seems as though when that set of OSDs started they got stale OSD map information from the MON boxes, which failed to be updated as the other OSDs came up. Does that make sense? I still don't consider myself an expert on ceph architecture and would appreciate and corrections or other possible interpretations of events (I'm happy to provide whatever additional information I can) so I can get a deeper understanding of things. If my interpretation of events is correct, it seems that might point at a bug. I can't find the ticket now, but I think we did indeed have a bug around heartbeat failures when restarting nodes. This has been fixed in other branches but might have been missed for giant. (Did you by any chance set the nodown flag as well as noout?) In general Ceph isn't very happy with being shut down completely like that and its behaviors aren't validated, so nothing will go seriously wrong but you might find little irritants like this. It's particularly likely when you're prohibiting state changes with the noout/nodown flags. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] More writes on filestore than on journal?
On Mon, Mar 23, 2015 at 6:21 AM, Olivier Bonvalet ceph.l...@daevel.fr wrote: Hi, I'm still trying to find why there is much more write operations on filestore since Emperor/Firefly than from Dumpling. Do you have any history around this? It doesn't sound familiar, although I bet it's because of the WBThrottle and flushing changes. So, I add monitoring of all perf counters values from OSD. From what I see : «filestore.ops» reports an average of 78 operations per seconds. But, block device monitoring reports an average of 113 operations per seconds (+45%). please thoses 2 graphs : - https://daevel.fr/img/firefly/osd-70.filestore-ops.png - https://daevel.fr/img/firefly/osd-70.sda-ops.png That's unfortunate but perhaps not surprising — any filestore op can change a backing file (which requires hitting both the file and the inode: potentially two disk seeks), as well as adding entries to the leveldb instance. -Greg Do you see what can explain this difference ? (this OSD use XFS) Thanks, Olivier ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
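For anyone wanting to pull the same counters by hand, they come from the OSD admin socket; the OSD id and socket path below are examples:

  $ ceph --admin-daemon /var/run/ceph/ceph-osd.70.asok perf dump | python -m json.tool | less
  # compare the filestore section with iostat/collectd numbers for the backing device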
Re: [ceph-users] Ceph in Production: best practice to monitor OSD up/down status
On Sun, Mar 22, 2015 at 2:55 AM, Saverio Proto ziopr...@gmail.com wrote: Hello, I started to work with CEPH few weeks ago, I might ask a very newbie question, but I could not find an answer in the docs or in the ml archive for this. Quick description of my setup: I have a ceph cluster with two servers. Each server has 3 SSD drives I use for journal only. To map to different failure domains SAS disks that keep a journal to the same SSD drive, I wrote my own crushmap. I have now a total of 36OSD. Ceph health returns HEALTH_OK. I run the cluster with a couple of pools with size=3 and min_size=3 Production operations questions: I manually stopped some OSDs to simulate a failure. As far as I understood, an OSD down condition is not enough to make CEPH start making new copies of objects. I noticed that I must mark the OSD as out to make ceph produce new copies. As far as I understood min_size=3 puts the object in readonly if there are not at least 3 copies of the object available. That is correct, but the default with size 3 is 2 and you probably want to do that instead. If you have size==min_size on firefly releases and lose an OSD it can't do recovery so that PG is stuck without manual intervention. :( This is because of some quirks about how the OSD peering and recovery works, so you'd be forgiven for thinking it would recover nicely. (This is changed in the upcoming Hammer release, but you probably still want to allow cluster activity when an OSD fails, unless you're very confident in their uptime and more concerned about durability than availability.) -Greg Is this behavior correct or I made some mistake creating the cluster ? Should I expect ceph to produce automatically a new copy for objects when some OSDs are down ? There is any option to mark automatically out OSDs that go down ? thanks Saverio ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
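In practice that means checking the pool and, if needed, putting min_size back below size (pool name is an example):

  $ ceph osd pool get <pool> size        # typically 3
  $ ceph osd pool get <pool> min_size
  $ ceph osd pool set <pool> min_size 2  # keep min_size below size so recovery can still run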
Re: [ceph-users] Uneven CPU usage on OSD nodes
On Mon, Mar 23, 2015 at 4:31 AM, f...@univ-lr.fr f...@univ-lr.fr wrote: Hi Somnath, Thank you, please find my answers below Somnath Roy somnath@sandisk.com a écrit le 22/03/15 18:16 : Hi Frederick, Need some information here. 1. Just to clarify, you are saying it is happening g in 0.87.1 and not in Firefly ? That's a possibility, others running similar hardware (and possibly OS, I can ask) confirm they dont have such visible comportment on Firefly. I'd need to install Firefly on our hosts to be sure. We run on RHEL. 2. Is it happening after some hours of run or just right away ? It's happening on freshly installed hosts and goes on. 3. Please provide ‘perf top’ output of all the OSD nodes. Here they are : http://www.4shared.com/photo/S9tvbNKEce/UnevenLoad3-perf.html http://www.4shared.com/photo/OHfiAtXKba/UnevenLoad3-top.html The left-hand 'high-cpu' nodes have tmalloc calls able to explain the cpu difference. We don't see them on 'low-cpu' nodes : 12,15% libtcmalloc.so.4.1.2 [.] tcmalloc::CentralFreeList::FetchFromSpans Huh. The tcmalloc (memory allocator) workload should be roughly the same across all nodes, especially if they have equivalent distributions of PGs and primariness as you describe. Are you sure this is a persistent CPU imbalance or are they oscillating? Are there other processes on some of the nodes which could be requiring memory from the system? Either you've found a new bug in our memory allocator or something else is going on in the system to make it behave differently across your nodes. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Can't Start OSD
On Sun, Mar 22, 2015 at 11:22 AM, Somnath Roy somnath@sandisk.com wrote: You should be having replicated copies on other OSDs (disks), so, no need to worry about the data loss. You add a new drive and follow the steps in the following link (either 1 or 2) Except that's not the case if you only had one copy of the PG, as seems to be indicated by the last acting[1] output all over that health warning. :/ You certainly should have a copy of the data elsewhere, but that message means you *didn't*; presumably you had 2 copies of everything and either your CRUSH map was bad (which should have provoked lots of warnings?) or you've lost more than one OSD. -Greg 1. For manual deployment, http://ceph.com/docs/master/rados/operations/add-or-rm-osds/ 2. With ceph-deploy, http://ceph.com/docs/master/rados/deployment/ceph-deploy-osd/ After successful deployment, rebalancing should start and eventually cluster will come to healthy state. Thanks Regards Somnath -Original Message- From: Noah Mehl [mailto:noahm...@combinedpublic.com] Sent: Sunday, March 22, 2015 11:15 AM To: Somnath Roy Cc: ceph-users@lists.ceph.com Subject: Re: Can't Start OSD Somnath, You are correct, there are dmesg errors about the drive. How can I replace the drive? Can I copy all of the readable contents from this drive to a new one? Because I have the following output from “ceph health detail” HEALTH_WARN 43 pgs stale; 43 pgs stuck stale pg 7.5b7 is stuck stale for 5954121.993990, current state stale+active+clean, last acting [1] pg 7.42a is stuck stale for 5954121.993885, current state stale+active+clean, last acting [1] pg 7.669 is stuck stale for 5954121.994072, current state stale+active+clean, last acting [1] pg 7.121 is stuck stale for 5954121.993586, current state stale+active+clean, last acting [1] pg 7.4ec is stuck stale for 5954121.993956, current state stale+active+clean, last acting [1] pg 7.1e4 is stuck stale for 5954121.993670, current state stale+active+clean, last acting [1] pg 7.41f is stuck stale for 5954121.993901, current state stale+active+clean, last acting [1] pg 7.59f is stuck stale for 5954121.994024, current state stale+active+clean, last acting [1] pg 7.39 is stuck stale for 5954121.993490, current state stale+active+clean, last acting [1] pg 7.584 is stuck stale for 5954121.994026, current state stale+active+clean, last acting [1] pg 7.fd is stuck stale for 5954121.993600, current state stale+active+clean, last acting [1] pg 7.6fd is stuck stale for 5954121.994158, current state stale+active+clean, last acting [1] pg 7.4b5 is stuck stale for 5954121.993975, current state stale+active+clean, last acting [1] pg 7.328 is stuck stale for 5954121.993840, current state stale+active+clean, last acting [1] pg 7.4a9 is stuck stale for 5954121.993981, current state stale+active+clean, last acting [1] pg 7.569 is stuck stale for 5954121.994046, current state stale+active+clean, last acting [1] pg 7.629 is stuck stale for 5954121.994119, current state stale+active+clean, last acting [1] pg 7.623 is stuck stale for 5954121.994118, current state stale+active+clean, last acting [1] pg 7.6dd is stuck stale for 5954121.994179, current state stale+active+clean, last acting [1] pg 7.3d5 is stuck stale for 5954121.993935, current state stale+active+clean, last acting [1] pg 7.54b is stuck stale for 5954121.994058, current state stale+active+clean, last acting [1] pg 7.3cf is stuck stale for 5954121.993938, current state stale+active+clean, last acting [1] pg 7.c4 is stuck stale for 5954121.993633, current state 
stale+active+clean, last acting [1] pg 7.178 is stuck stale for 5954121.993719, current state stale+active+clean, last acting [1] pg 7.3b8 is stuck stale for 5954121.993946, current state stale+active+clean, last acting [1] pg 7.b1 is stuck stale for 5954121.993635, current state stale+active+clean, last acting [1] pg 7.5fb is stuck stale for 5954121.994146, current state stale+active+clean, last acting [1] pg 7.236 is stuck stale for 5954121.993801, current state stale+active+clean, last acting [1] pg 7.2f5 is stuck stale for 5954121.993881, current state stale+active+clean, last acting [1] pg 7.ac is stuck stale for 5954121.993643, current state stale+active+clean, last acting [1] pg 7.16d is stuck stale for 5954121.993738, current state stale+active+clean, last acting [1] pg 7.6b7 is stuck stale for 5954121.994223, current state stale+active+clean, last acting [1] pg 7.5ea is stuck stale for 5954121.994166, current state stale+active+clean, last acting [1] pg 7.a3 is stuck stale for 5954121.993654, current state stale+active+clean, last acting [1] pg 7.52d is stuck stale for 5954121.994110, current state stale+active+clean, last acting [1] pg 7.2d8 is stuck stale for 5954121.993904, current state stale+active+clean, last acting [1] pg 7.2db is stuck stale for 5954121.993903,
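For completeness, the removal side of the procedure in the add-or-rm-osds document referenced above looks roughly like this; osd.1 matches the failed disk in this thread, and per Greg's warning it will not bring back data for PGs whose only copy lived there:

  $ ceph osd out osd.1
  $ /etc/init.d/ceph stop osd.1      # or the upstart/systemd equivalent
  $ ceph osd crush remove osd.1
  $ ceph auth del osd.1
  $ ceph osd rm osd.1
  # then add the replacement drive, e.g. with ceph-deploy osd create <host>:<device>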
Re: [ceph-users] How does CRUSH select different OSDs using hash(pg) in different iterations?
On Sat, Mar 21, 2015 at 10:46 AM, shylesh kumar shylesh.mo...@gmail.com wrote: Hi , I was going through this simplified crush algorithm given in ceph website. def crush(pg): all_osds = ['osd.0', 'osd.1', 'osd.2', ...] result = [] # size is the number of copies; primary+replicas while len(result) size: -- r = hash(pg) chosen = all_osds[ r % len(all_osds) ] if chosen in result: # OSD can be picked only once continue result.append(chosen) return result 10:24 PM (51 minutes ago) In the line where r = hash(pg) , will it gives the same hash value in every iteration ? if that is the case we always endup choosing the same osd from the list or will the pg number be used as seed for the hashing so that r value changes in the next iteration. Am I missing something really basic ?? Can somebody please provide me some pointers ? I'm not sure where this bit of documentation came from, but the selection process includes the attempt number as one of the inputs. Where the attempt starts at 0 (or 1, I dunno) and increments each time we try to map a new OSD to the PG. -Greg -- Thanks, Shylesh Kumar M ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
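A corrected version of that simplified pseudocode, with the attempt number mixed into the hash as described above, might look like this; it is illustrative only, since the real CRUSH also walks the bucket hierarchy, honours weights, and retries around failed or overloaded devices:

  def crush(pg, size, all_osds, max_tries=50):
      result = []
      for r in range(max_tries):       # r is the attempt / replica number
          if len(result) >= size:
              break
          chosen = all_osds[hash((pg, r)) % len(all_osds)]
          if chosen not in result:     # an OSD can be picked only once
              result.append(chosen)
      return result

Because r changes on every iteration, hash((pg, r)) lands on different OSDs rather than returning the same index over and over.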
Re: [ceph-users] Ceph in Production: best practice to monitor OSD up/down status
On Mon, Mar 23, 2015 at 7:17 AM, Saverio Proto ziopr...@gmail.com wrote: Hello, thanks for the answers. This was exacly what I was looking for: mon_osd_down_out_interval = 900 I was not waiting long enoght to see my cluster recovering by itself. That's why I tried to increase min_size, because I did not understand what min_size was for. Now that I know what is min_size, I guess the best setting for me is min_size = 1 because I would like to be able to make I/O operations even of only 1 copy is left. I'd strongly recommend leaving it at two — if you reduce it to 1 then you can lose data by having just one disk die at an inopportune moment, whereas if you leave it at 2 the system won't accept any writes to only one hard drive. Leaving it at two the system will still try and re-replicate back up to three copies after mon osd down out interval time has elapsed from a failure. :) -Greg Thanks to all for helping ! Saverio 2015-03-23 14:58 GMT+01:00 Gregory Farnum g...@gregs42.com: On Sun, Mar 22, 2015 at 2:55 AM, Saverio Proto ziopr...@gmail.com wrote: Hello, I started to work with CEPH few weeks ago, I might ask a very newbie question, but I could not find an answer in the docs or in the ml archive for this. Quick description of my setup: I have a ceph cluster with two servers. Each server has 3 SSD drives I use for journal only. To map to different failure domains SAS disks that keep a journal to the same SSD drive, I wrote my own crushmap. I have now a total of 36OSD. Ceph health returns HEALTH_OK. I run the cluster with a couple of pools with size=3 and min_size=3 Production operations questions: I manually stopped some OSDs to simulate a failure. As far as I understood, an OSD down condition is not enough to make CEPH start making new copies of objects. I noticed that I must mark the OSD as out to make ceph produce new copies. As far as I understood min_size=3 puts the object in readonly if there are not at least 3 copies of the object available. That is correct, but the default with size 3 is 2 and you probably want to do that instead. If you have size==min_size on firefly releases and lose an OSD it can't do recovery so that PG is stuck without manual intervention. :( This is because of some quirks about how the OSD peering and recovery works, so you'd be forgiven for thinking it would recover nicely. (This is changed in the upcoming Hammer release, but you probably still want to allow cluster activity when an OSD fails, unless you're very confident in their uptime and more concerned about durability than availability.) -Greg Is this behavior correct or I made some mistake creating the cluster ? Should I expect ceph to produce automatically a new copy for objects when some OSDs are down ? There is any option to mark automatically out OSDs that go down ? thanks Saverio ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
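The setting Saverio found can go into ceph.conf on the monitors or be injected at runtime; 900 seconds is just the value from this thread:

  [mon]
      mon osd down out interval = 900

  # or without restarting:
  $ ceph tell mon.* injectargs '--mon-osd-down-out-interval 900'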
Re: [ceph-users] ceph binary missing from ceph-0.87.1-0.el6.x86_64
The ceph tool got moved into ceph-common at some point, so it shouldn't be in the ceph rpm. I'm not sure what step in the installation process should have handled that, but I imagine it's your problem. -Greg On Mon, Mar 2, 2015 at 11:24 AM, Michael Kuriger mk7...@yp.com wrote: Hi all, When doing a fresh install on a new cluster, and using the latest rpm (0.87.1) ceph-deploy fails right away. I checked the files inside the rpm, and /usr/bin/ceph is not there. Upgrading from the previous rpm seems to work, but ceph-deploy is pulling the latest rpm automatically. [ceph201][DEBUG ] connected to host: ceph201 [ceph201][DEBUG ] detect platform information from remote host [ceph201][DEBUG ] detect machine type [ceph_deploy.install][INFO ] Distro info: CentOS 6.5 Final [ceph201][INFO ] installing ceph on ceph201 [ceph201][INFO ] Running command: yum clean all [ceph201][DEBUG ] Loaded plugins: fastestmirror, security [ceph201][DEBUG ] Cleaning repos: base updates-released ceph-released [ceph201][DEBUG ] Cleaning up Everything [ceph201][DEBUG ] Cleaning up list of fastest mirrors [ceph201][INFO ] Running command: rpm --import https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc [ceph201][INFO ] Running command: rpm -Uvh --replacepkgs http://ceph.com/rpm-firefly/el6/noarch/ceph-release-1-0.el6.noarch.rpm [ceph201][DEBUG ] Retrieving http://ceph.com/rpm-firefly/el6/noarch/ceph-release-1-0.el6.noarch.rpm [ceph201][DEBUG ] Preparing... ## [ceph201][DEBUG ] ceph-release ## [ceph201][WARNIN] ensuring that /etc/yum.repos.d/ceph.repo contains a high priority [ceph201][WARNIN] altered ceph.repo priorities to contain: priority=1 [ceph201][INFO ] Running command: yum -y install ceph [ceph201][DEBUG ] Loaded plugins: fastestmirror, security [ceph201][DEBUG ] Determining fastest mirrors [ceph201][DEBUG ] Setting up Install Process [ceph201][DEBUG ] Resolving Dependencies [ceph201][DEBUG ] -- Running transaction check [ceph201][DEBUG ] --- Package ceph.x86_64 1:0.87.1-0.el6 will be installed [ceph201][DEBUG ] -- Finished Dependency Resolution [ceph201][DEBUG ] [ceph201][DEBUG ] Dependencies Resolved [ceph201][DEBUG ] [ceph201][DEBUG ] [ceph201][DEBUG ] Package Arch Version RepositorySize [ceph201][DEBUG ] [ceph201][DEBUG ] Installing: [ceph201][DEBUG ] ceph x86_64 1:0.87.1-0.el6 ceph-released 13 M [ceph201][DEBUG ] [ceph201][DEBUG ] Transaction Summary [ceph201][DEBUG ] [ceph201][DEBUG ] Install 1 Package(s) [ceph201][DEBUG ] [ceph201][DEBUG ] Total download size: 13 M [ceph201][DEBUG ] Installed size: 50 M [ceph201][DEBUG ] Downloading Packages: [ceph201][DEBUG ] Running rpm_check_debug [ceph201][DEBUG ] Running Transaction Test [ceph201][DEBUG ] Transaction Test Succeeded [ceph201][DEBUG ] Running Transaction Installing : 1:ceph-0.87.1-0.el6.x86_64 1/1 Verifying : 1:ceph-0.87.1-0.el6.x86_64 1/1 [ceph201][DEBUG ] [ceph201][DEBUG ] Installed: [ceph201][DEBUG ] ceph.x86_64 1:0.87.1-0.el6 [ceph201][DEBUG ] [ceph201][DEBUG ] Complete! 
[ceph201][INFO ] Running command: ceph --version [ceph201][ERROR ] Traceback (most recent call last): [ceph201][ERROR ] File /usr/lib/python2.6/site-packages/ceph_deploy/lib/vendor/remoto/process.py, line 87, in run [ceph201][ERROR ] reporting(conn, result, timeout) [ceph201][ERROR ] File /usr/lib/python2.6/site-packages/ceph_deploy/lib/vendor/remoto/log.py, line 13, in reporting [ceph201][ERROR ] received = result.receive(timeout) [ceph201][ERROR ] File /usr/lib/python2.6/site-packages/ceph_deploy/lib/vendor/remoto/lib/vendor/execnet/gateway_base.py, line 704, in receive [ceph201][ERROR ] raise self._getremoteerror() or EOFError() [ceph201][ERROR ] RemoteError: Traceback (most recent call last): [ceph201][ERROR ] File string, line 1036, in executetask [ceph201][ERROR ] File remote exec, line 11, in _remote_run [ceph201][ERROR ] File /usr/lib64/python2.6/subprocess.py, line 642, in __init__ [ceph201][ERROR ] errread, errwrite) [ceph201][ERROR ] File /usr/lib64/python2.6/subprocess.py, line 1234, in _execute_child [ceph201][ERROR ] raise child_exception [ceph201][ERROR ] OSError: [Errno 2] No such file or directory [ceph201][ERROR ] [ceph201][ERROR ] Michael Kuriger ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___
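If you hit this, installing ceph-common alongside ceph restores the CLI; exact package contents vary between releases, so verify afterwards:

  $ yum install -y ceph ceph-common
  $ rpm -ql ceph-common | grep bin/ceph
  $ ceph --version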
Re: [ceph-users] CephFS Attributes Question Marks
I bet it's that permission issue combined with a minor bug in FUSE on that kernel, or maybe in the ceph-fuse code (but I've not seen it reported before, so I kind of doubt it). If you run ceph-fuse with debug client = 20 it will output (a whole lot of) logging to the client's log file and you could see what requests are getting processed by the Ceph code and how it's responding. That might let you narrow things down. It's certainly not any kind of timeout. -Greg On Mon, Mar 2, 2015 at 3:57 PM, Scottix scot...@gmail.com wrote: 3 Ceph servers on Ubuntu 12.04.5 - kernel 3.13.0-29-generic We have an old server that we compiled the ceph-fuse client on Suse11.4 - kernel 2.6.37.6-0.11 This is the only mount we have right now. We don't have any problems reading the files and the directory shows full 775 permissions and doing a second ls fixes the problem. On Mon, Mar 2, 2015 at 3:51 PM Bill Sanders billysand...@gmail.com wrote: Forgive me if this is unhelpful, but could it be something to do with permissions of the directory and not Ceph at all? http://superuser.com/a/528467 Bill On Mon, Mar 2, 2015 at 3:47 PM, Gregory Farnum g...@gregs42.com wrote: On Mon, Mar 2, 2015 at 3:39 PM, Scottix scot...@gmail.com wrote: We have a file system running CephFS and for a while we had this issue when doing an ls -la we get question marks in the response. -rw-r--r-- 1 wwwrun root14761 Feb 9 16:06 data.2015-02-08_00-00-00.csv.bz2 -? ? ? ? ?? data.2015-02-09_00-00-00.csv.bz2 If we do another directory listing it show up fine. -rw-r--r-- 1 wwwrun root14761 Feb 9 16:06 data.2015-02-08_00-00-00.csv.bz2 -rw-r--r-- 1 wwwrun root13675 Feb 10 15:21 data.2015-02-09_00-00-00.csv.bz2 It hasn't been a problem but just wanted to see if this is an issue, could the attributes be timing out? We do have a lot of files in the filesystem so that could be a possible bottleneck. Huh, that's not something I've seen before. Are the systems you're doing this on the same? What distro and kernel version? Is it reliably one of them showing the question marks, or does it jump between systems? -Greg We are using the ceph-fuse mount. ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578) We are planning to do the update soon to 87.1 Thanks Scottie ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
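A minimal way to get that logging out of ceph-fuse is a [client] section on the machine doing the mount; the log path is an example:

  [client]
      debug client = 20
      log file = /var/log/ceph/ceph-fuse.$name.$pid.log

Remount with ceph-fuse, reproduce the ls that shows the question marks, and then look for the lookup/stat requests around the failing entries in the log.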
Re: [ceph-users] CephFS Attributes Question Marks
On Mon, Mar 2, 2015 at 3:39 PM, Scottix scot...@gmail.com wrote: We have a file system running CephFS and for a while we had this issue when doing an ls -la we get question marks in the response. -rw-r--r-- 1 wwwrun root14761 Feb 9 16:06 data.2015-02-08_00-00-00.csv.bz2 -? ? ? ? ?? data.2015-02-09_00-00-00.csv.bz2 If we do another directory listing it show up fine. -rw-r--r-- 1 wwwrun root14761 Feb 9 16:06 data.2015-02-08_00-00-00.csv.bz2 -rw-r--r-- 1 wwwrun root13675 Feb 10 15:21 data.2015-02-09_00-00-00.csv.bz2 It hasn't been a problem but just wanted to see if this is an issue, could the attributes be timing out? We do have a lot of files in the filesystem so that could be a possible bottleneck. Huh, that's not something I've seen before. Are the systems you're doing this on the same? What distro and kernel version? Is it reliably one of them showing the question marks, or does it jump between systems? -Greg We are using the ceph-fuse mount. ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578) We are planning to do the update soon to 87.1 Thanks Scottie ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Update 0.80.5 to 0.80.8 -- the VM's read requests become too slow
On Mon, Mar 2, 2015 at 7:15 PM, Nathan O'Sullivan nat...@mammoth.com.au wrote: On 11/02/2015 1:46 PM, 杨万元 wrote: Hello! We use Ceph+Openstack in our private cloud. Recently we upgrade our centos6.5 based cluster from Ceph Emperor to Ceph Firefly. At first,we use redhat yum repo epel to upgrade, this Ceph's version is 0.80.5. First upgrade monitor,then osd,last client. when we complete this upgrade, we boot a VM on the cluster,then use fio to test the io performance. The io performance is as better as before. Everything is ok! Then we upgrade the cluster from 0.80.5 to 0.80.8,when we completed , we reboot the VM to load the newest librbd. after that we also use fio to test the io performance.then we find the randwrite and write is as good as before.but the randread and read is become worse, randwrite's iops from 4000-5000 to 300-400 ,and the latency is worse. the write's bw from 400MB/s to 115MB/s. then I downgrade the ceph client version from 0.80.8 to 0.80.5, then the reslut become normal. So I think maybe something cause about librbd. I compare the 0.80.8 release notes with 0.80.5 (http://ceph.com/docs/master/release-notes/#v0-80-8-firefly ), I just find this change in 0.80.8 is something about read request : librbd: cap memory utilization for read requests (Jason Dillaman) . Who can explain this? FWIW we are seeing the same thing when switching librbd from 0.80.7 to 0.80.8 - there is a massive performance regression in random reads. In our case, from ~10,000 4k read iops down to less than 1,000. We also tested librbd 0.87.1 , and found it does not have this problem - it appears to be isolated to 0.80.8 only. I'm not familiar with the details of the issue, but we're putting out 0.80.9 as soon as we can and should resolve this. There was an incomplete backport or something that is causing the slowness. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] problem in cephfs when removing an empty directory
On Tue, Mar 3, 2015 at 9:24 AM, John Spray john.sp...@redhat.com wrote: On 03/03/2015 14:07, Daniel Takatori Ohara wrote: $ls test-daniel-old/ total 0 drwx-- 1 rmagalhaes BioInfoHSL Users0 Mar 2 10:52 ./ drwx-- 1 rmagalhaes BioInfoHSL Users 773099838313 Mar 2 11:41 ../ $rm -rf test-daniel-old/ rm: cannot remove ‘test-daniel-old/’: Directory not empty $ls test-daniel-old/ ls: cannot access test-daniel-old/M_S8_L001_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L001_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L002_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L002_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L003_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L003_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L004_R1-2_001.fastq.gz_ref.sam_fixed.bam: No such file or directory ls: cannot access test-daniel-old/M_S8_L004_R1-2_001.fastq.gz_sylvio.sam_fixed.bam: No such file or directory total 0 drwx-- 1 rmagalhaes BioInfoHSL Users0 Mar 2 10:52 ./ drwx-- 1 rmagalhaes BioInfoHSL Users 773099838313 Mar 2 11:41 ../ l? ? ? ? ?? M_S8_L001_R1-2_001.fastq.gz_ref.sam_fixed.bam l? ? ? ? ?? M_S8_L001_R1-2_001.fastq.gz_sylvio.sam_fixed.bam l? ? ? ? ?? M_S8_L002_R1-2_001.fastq.gz_ref.sam_fixed.bam l? ? ? ? ?? M_S8_L002_R1-2_001.fastq.gz_sylvio.sam_fixed.bam l? ? ? ? ?? M_S8_L003_R1-2_001.fastq.gz_ref.sam_fixed.bam l? ? ? ? ?? M_S8_L003_R1-2_001.fastq.gz_sylvio.sam_fixed.bam l? ? ? ? ?? M_S8_L004_R1-2_001.fastq.gz_ref.sam_fixed.bam l? ? ? ? ?? M_S8_L004_R1-2_001.fastq.gz_sylvio.sam_fixed.bam You don't say what version of the client (version of kernel, if it's the kernel client) this is. It would appear that the client thinks there are some dentries that don't really exist. You should enable verbose debug logs (with fuse client, debug client = 20) and reproduce this. It looks like you had similar issues (subject: problem for remove files in cephfs) a while back, when Yan Zheng also advised you to get some debug logs. In particular this is a known bug in older kernels and is fixed in new enough ones. Unfortunately I don't have the bug link handy though. :( -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Shutting down a cluster fully and powering it back up
Sounds good! -Greg On Sat, Feb 28, 2015 at 10:55 AM David da...@visions.se wrote: Hi! I’m about to do maintenance on a Ceph Cluster, where we need to shut it all down fully. We’re currently only using it for rados block devices to KVM Hypervizors. Are these steps sane? Shutting it down 1. Shut down all IO to the cluster. Means turning off all clients (KVM Hypervizors in our case). 2. Set cluster to noout by running: ceph osd set noout 3. Shut down the MON nodes. 4. Shut down the OSD nodes. Starting it up 1. Start the OSD nodes. 2. Start the MON nodes. 3. Check ceph -w to see the status of ceph and take actions if something is wrong. 4. Start up the clients (KVM Hypervizors) 5. Run ceph osd unset noout Kind Regards, David ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
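Condensed into commands, the procedure discussed here is roughly the following, run from any node with an admin keyring:

  $ ceph osd set noout
  # stop client I/O, shut down the OSD nodes, then the monitor nodes
  # ... maintenance ...
  # power the nodes back on, then wait for ceph -s / ceph -w to settle
  $ ceph -s
  $ ceph osd unset noout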
Re: [ceph-users] old OSDs take much longer to start than newer OSDs
This is probably LevelDB being slow. The monitor has some options to compact the store on startup and I thought the osd handled it automatically, but you could try looking for something like that and see if it helps. -Greg On Fri, Feb 27, 2015 at 5:02 AM Corin Langosch corin.lango...@netskin.com wrote: Hi guys, I'm using ceph for a long time now, since bobtail. I always upgraded every few weeks/ months to the latest stable release. Of course I also removed some osds and added new ones. Now during the last few upgrades (I just upgraded from 80.6 to 80.8) I noticed that old osds take much longer to startup than equal newer osds (same amount of data/ disk usage, same kind of storage+journal backing device (ssd), same weight, same number of pgs, ...). I know I observed the same behavior earlier but just didn't really care about it. Here are the relevant log entries (host of osd.0 and osd.15 has less cpu power than the others): old osds (average pgs load time: 1.5 minutes) 2015-02-27 13:44:23.134086 7ffbfdcbe780 0 osd.0 19323 load_pgs 2015-02-27 13:49:21.453186 7ffbfdcbe780 0 osd.0 19323 load_pgs opened 824 pgs 2015-02-27 13:41:32.219503 7f197b0dd780 0 osd.3 19317 load_pgs 2015-02-27 13:42:56.310874 7f197b0dd780 0 osd.3 19317 load_pgs opened 776 pgs 2015-02-27 13:38:43.909464 7f450ac90780 0 osd.6 19309 load_pgs 2015-02-27 13:40:40.080390 7f450ac90780 0 osd.6 19309 load_pgs opened 806 pgs 2015-02-27 13:36:14.451275 7f3c41d33780 0 osd.9 19301 load_pgs 2015-02-27 13:37:22.446285 7f3c41d33780 0 osd.9 19301 load_pgs opened 795 pgs new osds (average pgs load time: 3 seconds) 2015-02-27 13:44:25.529743 7f2004617780 0 osd.15 19325 load_pgs 2015-02-27 13:44:36.197221 7f2004617780 0 osd.15 19325 load_pgs opened 873 pgs 2015-02-27 13:41:29.176647 7fb147fb3780 0 osd.16 19315 load_pgs 2015-02-27 13:41:31.681722 7fb147fb3780 0 osd.16 19315 load_pgs opened 848 pgs 2015-02-27 13:38:41.470761 7f9c404be780 0 osd.17 19307 load_pgs 2015-02-27 13:38:43.737473 7f9c404be780 0 osd.17 19307 load_pgs opened 821 pgs 2015-02-27 13:36:10.997766 7f7315e99780 0 osd.18 19299 load_pgs 2015-02-27 13:36:13.511898 7f7315e99780 0 osd.18 19299 load_pgs opened 815 pgs The old osds also take more memory, here's an example: root 15700 22.8 0.7 1423816 485552 ? Ssl 13:36 4:55 /usr/bin/ceph-osd -i 9 --pid-file /var/run/ceph/osd.9.pid -c /etc/ceph/ceph.conf --cluster ceph root 15270 15.4 0.4 1227140 297032 ? Ssl 13:36 3:20 /usr/bin/ceph-osd -i 18 --pid-file /var/run/ceph/osd.18.pid -c /etc/ceph/ceph.conf --cluster ceph It seems to me there is still some old data around for the old osds which was not properly migrated/ cleaned up during the upgrades. The cluster is healthy, no problems at all the last few weeks. Is there any way to clean this up? Thanks Corin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
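If it is LevelDB, compaction can be forced at start-up with options like the ones below; the [osd] option name is from memory and may not exist in every release, so check what your binaries actually know first:

  [mon]
      mon compact on start = true
  [osd]
      leveldb compact on mount = true    # assumption: present in firefly-era filestore builds

  $ ceph --admin-daemon /var/run/ceph/ceph-osd.9.asok config show | grep -i compact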
Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?
You can turn the filestore up to 20 instead of 1. ;) You might also explore what information you can get out of the admin socket. You are correct that those numbers are the OSD epochs, although note that when the system is running you'll get output both for the OSD as a whole and for individual PGs within it (which can be lagging behind). I'm still pretty convinced the OSDs are simply stuck trying to bring their PGs up to date and are thrashing the maps on disk, but we're well past what I can personally diagnose without log diving. -Greg On Sat, Feb 28, 2015 at 11:51 AM, Chris Murray chrismurra...@gmail.com wrote: After noticing that the number increases by 101 on each attempt to start osd.11, I figured I was only 7 iterations away from the output being within 101 of 63675. So, I killed the osd process, started it again, lather, rinse, repeat. I then did the same for other OSDs. Some created very small logs, and some created logs into the gigabytes. Grepping the latter for update_osd_stat showed me where the maps were up to, and therefore which OSDs needed some special attention. Some of the epoch numbers appeared to increase by themselves to a point and then plateaux, after which I'd kill then start the osd again, and this number would start to increase again. After all either showed 63675, or nothing at all, I turned debugging back off, deleted logs, and tried to bring the cluster back by unsetting noup, nobackfill, norecovery etc. It hasn't got very far before appearing stuck again, with nothing progressing in ceph status. It appears that 11/15 OSDs are now properly up, but four still aren't. A lot of placement groups are stale, so I guess I really need the remaining four to come up. The OSDs in question are 1, 7, 10 12. All have a line similar to this as the last in their log: 2015-02-28 10:35:04.240822 7f375ef40780 1 journal _open /var/lib/ceph/osd/ceph-1/journal fd 21: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1 Even with the following in ceph.conf, I'm not seeing anything after that last line in the log. debug osd = 20 debug filestore = 1 CPU is still being consumed by the ceph-osd process though, but not much memory is being used compared to the other two OSDs which are up on that node. Is there perhaps even further logging that I can use to see why the logs aren't progressing past this point? Osd.1 is on /dev/sdb. iostat still shows some activity as the minutes go on, but not much: (60 second intervals) Device:tpskB_read/skB_wrtn/skB_readkB_wrtn sdb 5.45 0.00 807.33 0 48440 sdb 5.75 0.00 807.33 0 48440 sdb 5.43 0.00 807.20 0 48440 Thanks, Chris -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Chris Murray Sent: 27 February 2015 10:32 To: Gregory Farnum Cc: ceph-users Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy;will the cluster recover without help? 
A little further logging: 2015-02-27 10:27:15.745585 7fe8e3f2f700 20 osd.11 62839 update_osd_stat osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist []) 2015-02-27 10:27:15.745619 7fe8e3f2f700 5 osd.11 62839 heartbeat: osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist []) 2015-02-27 10:27:23.530913 7fe8e8536700 1 -- 192.168.12.25:6800/673078 -- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0}) v2 -- ?+0 0xe5f26380 con 0xe1f0cc60 2015-02-27 10:27:30.645902 7fe8e3f2f700 20 osd.11 62839 update_osd_stat osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist []) 2015-02-27 10:27:30.645938 7fe8e3f2f700 5 osd.11 62839 heartbeat: osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist []) 2015-02-27 10:27:33.531142 7fe8e8536700 1 -- 192.168.12.25:6800/673078 -- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0}) v2 -- ?+0 0xe5f26540 con 0xe1f0cc60 2015-02-27 10:27:43.531333 7fe8e8536700 1 -- 192.168.12.25:6800/673078 -- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0}) v2 -- ?+0 0xe5f26700 con 0xe1f0cc60 2015-02-27 10:27:45.546275 7fe8e3f2f700 20 osd.11 62839 update_osd_stat osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist []) 2015-02-27 10:27:45.546311 7fe8e3f2f700 5 osd.11 62839 heartbeat: osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist []) 2015-02-27 10:27:53.531564 7fe8e8536700 1 -- 192.168.12.25:6800/673078 -- 192.168.12.25:6789/0 -- mon_subscribe({monmap=6+,osd_pg_creates=0}) v2 -- ?+0 0xe5f268c0 con 0xe1f0cc60 2015-02-27 10:27:56.846593 7fe8e3f2f700 20 osd.11 62839 update_osd_stat osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op hist []) 2015-02-27 10:27:56.846627 7fe8e3f2f700 5 osd.11 62839
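One shortcut for watching that map catch-up is the OSD admin socket; on reasonably recent releases the status command reports the oldest and newest map epochs the daemon holds, and debug levels can be raised without a restart (socket path is an example):

  $ ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok status
  $ ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config set debug_filestore 20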
Re: [ceph-users] Some long running ops may lock osd
On Mon, Mar 2, 2015 at 7:56 AM, Erdem Agaoglu erdem.agao...@gmail.com wrote: Hi all, especially devs, We have recently pinpointed one of the causes of slow requests in our cluster. It seems deep-scrubs on pg's that contain the index file for a large radosgw bucket lock the osds. Incresing op threads and/or disk threads helps a little bit, but we need to increase them beyond reason in order to completely get rid of the problem. A somewhat similar (and more severe) version of the issue occurs when we call listomapkeys for the index file, and since the logs for deep-scrubbing was much harder read, this inspection was based on listomapkeys. In this example osd.121 is the primary of pg 10.c91 which contains file .dir.5926.3 in .rgw.buckets pool. OSD has 2 op threads. Bucket contains ~500k objects. Standard listomapkeys call take about 3 seconds. time rados -p .rgw.buckets listomapkeys .dir.5926.3 /dev/null real 0m2.983s user 0m0.760s sys 0m0.148s In order to lock the osd we request 2 of them simultaneously with something like: rados -p .rgw.buckets listomapkeys .dir.5926.3 /dev/null sleep 1 rados -p .rgw.buckets listomapkeys .dir.5926.3 /dev/null 'debug_osd=30' logs show the flow like: At t0 some thread enqueue_op's my omap-get-keys request. Op-Thread A locks pg 10.c91 and dequeue_op's it and starts reading ~500k keys. Op-Thread B responds to several other requests during that 1 second sleep. They're generally extremely fast subops on other pgs. At t1 (about a second later) my second omap-get-keys request gets enqueue_op'ed. But it does not start probably because of the lock held by Thread A. After that point other threads enqueue_op other requests on other pgs too but none of them starts processing, in which i consider the osd is locked. At t2 (about another second later) my first omap-get-keys request is finished. Op-Thread B locks pg 10.c91 and dequeue_op's my second request and starts reading ~500k keys again. Op-Thread A continues to process the requests enqueued in t1-t2. It seems Op-Thread B is waiting on the lock held by Op-Thread A while it can process other requests for other pg's just fine. My guess is a somewhat larger scenario happens in deep-scrubbing, like on the pg containing index for the bucket of 20M objects. A disk/op thread starts reading through the omap which will take say 60 seconds. During the first seconds, other requests for other pgs pass just fine. But in 60 seconds there are bound to be other requests for the same pg, especially since it holds the index file. Each of these requests lock another disk/op thread to the point where there are no free threads left to process any requests for any pg. Causing slow-requests. So first of all thanks if you can make it here, and sorry for the involved mail, i'm exploring the problem as i go. Now, is that deep-scrubbing situation i tried to theorize even possible? If not can you point us where to look further. We are currently running 0.72.2 and know about newer ioprio settings in Firefly and such. While we are planning to upgrade in a few weeks but i don't think those options will help us in any way. Am i correct? Are there any other improvements that we are not aware? This is all basically correct; it's one of the reasons you don't want to let individual buckets get too large. That said, I'm a little confused about why you're running listomapkeys that way. RGW throttles itself by getting only a certain number of entries at a time (1000?) and any system you're also building should do the same. 
That would reduce the frequency of any issues, and I *think* that scrubbing has some mitigating factors to help (although maybe not; it's been a while since I looked at any of that stuff). Although I just realized that my vague memory of deep scrubbing working better might be based on improvements that only got in for firefly...not sure. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RadosGW Log Rotation (firefly)
On Mon, Mar 2, 2015 at 8:44 AM, Daniel Schneller daniel.schnel...@centerdevice.com wrote: On our Ubuntu 14.04/Firefly 0.80.8 cluster we are seeing problem with log file rotation for the rados gateway. The /etc/logrotate.d/radosgw script gets called, but it does not work correctly. It spits out this message, coming from the postrotate portion: /etc/cron.daily/logrotate: reload: Unknown parameter: id invoke-rc.d: initscript radosgw, action reload failed. A new log file actually gets created, but due to the failure in the post-rotate script, the daemon actually continues writing into the now deleted previous file: [B|root@node01] /etc/init ➜ ps aux | grep radosgw root 13077 0.9 0.1 13710396 203256 ? Ssl Feb14 212:27 /usr/bin/radosgw -n client.radosgw.node01 [B|root@node01] /etc/init ➜ ls -l /proc/13077/fd/ total 0 lr-x-- 1 root root 64 Mar 2 15:53 0 - /dev/null lr-x-- 1 root root 64 Mar 2 15:53 1 - /dev/null lr-x-- 1 root root 64 Mar 2 15:53 2 - /dev/null l-wx-- 1 root root 64 Mar 2 15:53 3 - /var/log/radosgw/radosgw.log.1 (deleted) ... Trying manually with service radosgw reload fails with the same message. Running the non-upstart /etc/init.d/radosgw reload works. It will, kind of crudely, just send a SIGHUP to any running radosgw process. To figure out the cause I compared OSDs and RadosGW wrt to upstart and got this: [B|root@node01] /etc/init ➜ initctl list | grep osd ceph-osd-all start/running ceph-osd-all-starter stop/waiting ceph-osd (ceph/8) start/running, process 12473 ceph-osd (ceph/9) start/running, process 12503 ... [B|root@node01] /etc/init ➜ initctl reload radosgw cluster=ceph id=radosgw.node01 initctl: Unknown instance: ceph/radosgw.node01 [B|root@node01] /etc/init ➜ initctl list | grep rados radosgw-instance stop/waiting radosgw stop/waiting radosgw-all-starter stop/waiting radosgw-all start/running Apart from me not being totally clear about what the difference between radosgw-instance and radosgw is, obviously Upstart has no idea about which PID to send the SIGHUP to when I ask it to reload. I can, of course, replace the logrotate config and use the /etc/init.d/radosgw reload approach, but I would like to understand if this is something unique to our system, or if this is a bug in the scripts. FWIW here's an excerpt from /etc/ceph.conf: [client.radosgw.node01] host = node01 rgw print continue = false keyring = /etc/ceph/keyring.radosgw.gateway rgw socket path = /tmp/radosgw.sock log file = /var/log/radosgw/radosgw.log rgw enable ops log = false rgw gc max objs = 31 I'm not very (well, at all, for rgw) familiar with these scripts, but how are you starting up your RGW daemon? There's some way to have Apache handle the process instead of Upstart, but Yehuda says you don't want to do it. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
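One workaround along the lines described above is to point the postrotate stanza at the sysvinit script's reload, which simply SIGHUPs the running radosgw, instead of the upstart job. A sketch, assuming the stock log location:

  /var/log/radosgw/*.log {
      daily
      rotate 7
      compress
      missingok
      sharedscripts
      postrotate
          /etc/init.d/radosgw reload >/dev/null 2>&1 || true
      endscript
  }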
Re: [ceph-users] What does the parameter journal_align_min_size mean?
On Fri, Feb 27, 2015 at 5:03 AM, Mark Wu wud...@gmail.com wrote: I am wondering how the value of journal_align_min_size gives impact on journal padding. Is there any document describing the disk layout of journal? Not much, unfortunately. Just looking at the code, the journal will align any writes which are at least as large as that parameter, apparently based on the page size and the target offset within the destination object. I think this is so that it's more conveniently aligned for transfer into the filesystem later on, whereas smaller writes can just get copied? -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Cluster Address
On Tue, Mar 3, 2015 at 9:26 AM, Garg, Pankaj pankaj.g...@caviumnetworks.com wrote: Hi, I have ceph cluster that is contained within a rack (1 Monitor and 5 OSD nodes). I kept the same public and private address for configuration. I do have 2 NICS and 2 valid IP addresses (one internal only and one external) for each machine. Is it possible now, to change the Public Network address, after the cluster is up and running? I had used Ceph-deploy for the cluster. If I change the address of the public network in Ceph.conf, do I need to propagate to all the machines in the cluster or just the Monitor Node is enough? You'll need to change the config on each node and then restart it so that the OSDs will bind to the new location. The OSDs will let you do this on a rolling basis, but the networks will need to be routable to each other. Note that changing the addresses on the monitors (I can't tell if you want to do that) is much more difficult; it's probably easiest to remove one at a time from the cluster and then recreate it with its new IP. (There are docs on how to do this.) -Greg Thanks Pankaj ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
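For reference, the section being changed looks like this; the subnets are examples, the file needs to be consistent on every node, and the OSDs are then restarted one at a time:

  [global]
      public network  = 203.0.113.0/24     # client-facing subnet (example)
      cluster network = 192.168.100.0/24   # replication subnet (example)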
Re: [ceph-users] cephfs filesystem layouts: authentication gotchas?
Just to get more specific: the reason you can apparently write stuff to a file when you can't write to the pool it's stored in is because the file data is initially stored in cache. The flush out to RADOS, when it happens, will fail. It would definitely be preferable if there was some way to immediately return a permission or IO error in this case, but so far we haven't found one; the relevant interfaces just aren't present and it's unclear how to propagate the data back to users in a way that makes sense even if they were. :/ -Greg On Wed, Mar 4, 2015 at 3:37 AM, SCHAER Frederic frederic.sch...@cea.fr wrote: Hi, Many thanks for the explanations. I haven't used the nodcache option when mounting cephfs, it actually got there by default My mount command is/was : # mount -t ceph 1.2.3.4:6789:/ /mnt -o name=puppet,secretfile=./puppet.secret I don't know what causes this option to be default, maybe it's the kernel module I compiled from git (because there is no kmod-ceph or kmod-rbd in any RHEL-like distributions except RHEV), I'll try to update/check ... Concerning the rados pool ls, indeed : I created empty files in the pool, and they were not showing up probably because they were just empty - but when I create a non empty file, I see things in rados ls... Thanks again Frederic -Message d'origine- De : ceph-users [mailto:ceph-users-boun...@lists.ceph.com] De la part de John Spray Envoyé : mardi 3 mars 2015 17:15 À : ceph-users@lists.ceph.com Objet : Re: [ceph-users] cephfs filesystem layouts : authentication gotchas ? On 03/03/2015 15:21, SCHAER Frederic wrote: By the way : looks like the ceph fs ls command is inconsistent when the cephfs is mounted (I used a locally compiled kmod-ceph rpm): [root@ceph0 ~]# ceph fs ls name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet ] (umount /mnt .) [root@ceph0 ~]# ceph fs ls name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet root ] This is probably #10288, which was fixed in 0.87.1 So, I have this pool named root that I added in the cephfs filesystem. I then edited the filesystem xattrs : [root@ceph0 ~]# getfattr -n ceph.dir.layout /mnt/root getfattr: Removing leading '/' from absolute path names # file: mnt/root ceph.dir.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=root I'm therefore assuming client.puppet should not be allowed to write or read anything in /mnt/root, which belongs to the root pool. but that is not the case. On another machine where I mounted cephfs using the client.puppet key, I can do this : The mount was done with the client.puppet key, not the admin one that is not deployed on that node : 1.2.3.4:6789:/ on /mnt type ceph (rw,relatime,name=puppet,secret=hidden,nodcache) [root@dev7248 ~]# echo not allowed /mnt/root/secret.notfailed [root@dev7248 ~]# [root@dev7248 ~]# cat /mnt/root/secret.notfailed not allowed This is data you're seeing from the page cache, it hasn't been written to RADOS. You have used the nodcache setting, but that doesn't mean what you think it does (it was about caching dentries, not data). It's actually not even used in recent kernels (http://tracker.ceph.com/issues/11009). You could try the nofsc option, but I don't know exactly how much caching that turns off -- the safer approach here is probably to do your testing using I/Os that have O_DIRECT set. 
And I can even see the xattrs inherited from the parent dir : [root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed getfattr: Removing leading '/' from absolute path names # file: mnt/root/secret.notfailed ceph.file.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=root Whereas on the node where I mounted cephfs as ceph admin, I get nothing : [root@ceph0 ~]# cat /mnt/root/secret.notfailed [root@ceph0 ~]# ls -l /mnt/root/secret.notfailed -rw-r--r-- 1 root root 12 Mar 3 15:27 /mnt/root/secret.notfailed After some time, the file also gets empty on the puppet client host : [root@dev7248 ~]# cat /mnt/root/secret.notfailed [root@dev7248 ~]# (but the metadata remained ?) Right -- eventually the cache goes away, and you see the true (empty) state of the file. Also, as an unpriviledged user, I can get ownership of a secret file by changing the extended attribute : [root@dev7248 ~]# setfattr -n ceph.file.layout.pool -v puppet /mnt/root/secret.notfailed [root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed getfattr: Removing leading '/' from absolute path names # file: mnt/root/secret.notfailed ceph.file.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=puppet Well, you're not really getting ownership of anything here: you're modifying the file's metadata, which you are entitled to do (pool permissions have nothing to do with file metadata). There was a recent bug where a file's pool layout could
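A hedged way to double-check where a file's data really ended up: CephFS normally names data objects after the file's inode number in hex (the first block being <inode-hex>.00000000), so you can stat that object in each candidate pool. The file and pool names below are the ones from this thread:

$ ino=$(printf '%x' $(stat -c %i /mnt/root/secret.notfailed))
$ rados -p root stat ${ino}.00000000      # succeeds only if data was actually flushed to the 'root' pool
$ rados -p puppet stat ${ino}.00000000    # ...or to whichever pool ceph.file.layout now points at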
Re: [ceph-users] Does Ceph rebalance OSDs proportionally
Yes. :) -Greg On Wed, Feb 25, 2015 at 8:33 AM Jordan A Eliseo jaeli...@us.ibm.com wrote: Hi all, Quick question, does the CRUSH map always strive for proportionality when rebalancing a cluster? i.e. Say I have 8 OSDs (a two node cluster, 4 OSDs per host) at ~90% utilization (which I know is bad, this is just hypothetical). Now if I add a total of 8 OSDs - 4 new OSDs for each host - will the CRUSH map try to rebalance such that all disks have a utilization of 40-50%? Assumption being all disks are of equal size and weight. Regards, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Strange 'ceph df' output
IIRC these global values for total size and available are just summations from the (programmatic equivalent) of running df on each machine locally, but the used values are based on actual space used by each PG. That has occasionally produced some odd results depending on how you've configured your system and how that translates into df output. (Eg you might be using up space for journals or your OS that aren't considered as used for the purposes of RADOS' df.) -Greg On Wed, Feb 25, 2015 at 6:57 AM Kamil Kuramshin kamil.kurams...@tatar.ru wrote: Cant find out why this can happen: Got an HEALTH_OK cluster. ceph version 0.87, all nodes are Debian Wheezy with a stable kernel 3.2.65-1+deb7u1. ceph df shows me this: *$ ceph df* GLOBAL: SIZE AVAIL RAW USED %RAW USED *242T 221T8519G 3.43 * POOLS: NAME ID USED %USED MAX AVAIL OBJECTS rbd 2 1948G 0.7974902G 498856 ec_backup-storage 4 0 0 146T 0 cache 5 0 0 184G 0 block-devices 6 827G 0.3374902G 211744 Explanation: Total space = Used space + Available space: *242T ** 8,5T + **221T*, but MUST be equal is not it? Where I have lost aproxymately 12,5 Tb of space? *$ ceph -s* cluster 0745bec9-a7a7-4ee1-be5d-bb12db3cdd8f health HEALTH_OK monmap e1: 3 mons at {node04= 10.0.0.14:6789/0,node05=10.0.0.15:6789/0,node06=10.0.0.16:6789/0}, election epoch 48, quorum 0,1,2 node04,node05,node06 osdmap e16866: 102 osds: 102 up, 102 in pgmap v570489: 10200 pgs, 4 pools, 2775 GB data, 693 kobjects * 8518 GB used, 221 TB / 242 TB avail* 10200 active+clean ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
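If you want to see where the mismatch comes from on your own cluster, one way (a sketch, not an exact reconciliation) is to compare what the OSD filesystems report locally with what the cluster accounts for:

# on each OSD host: local filesystem usage, which is what feeds the GLOBAL SIZE/AVAIL figures
$ df -h /var/lib/ceph/osd/ceph-*

# cluster-side accounting: replication sizes per pool, and per-OSD usage as the OSDs report it
$ ceph osd dump | grep ^pool
$ ceph pg dump osds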
Re: [ceph-users] Wrong object and used space count in cache tier pool
On Tue, Feb 24, 2015 at 6:21 AM, Xavier Villaneau xavier.villan...@fr.clara.net wrote: Hello ceph-users, I am currently making tests on a small cluster, and Cache Tiering is one of those tests. The cluster runs Ceph 0.87 Giant on three Ubuntu 14.04 servers with the 3.16.0 kernel, for a total of 8 OSD and 1 MON. Since there are no SSDs in those servers, I am testing Cache Tiering by using an erasure-coded pool as storage and a replicated pool as cache. The cache settings are the defaults ones you'll find in the documentation, and I'm using writeback mode. Also, to simulate the small size of cache data, the hot storage pool has a 1024MB space quota. Then I write 4MB chunks of data to the storage pool using 'rados bench' (with --no-cleanup). Here are my cache pool settings according to InkScope : pool15 pool name test1_ct-cache auid0 type1 (replicated) size2 min size1 crush ruleset 0 (replicated_ruleset) pg num 512 pg placement_num512 quota max_bytes 1 GB quota max_objects 0 flags names hashpspool,incomplete_clones tiers none tier of 14 (test1_ec-data) read tier -1 write tier -1 cache mode writeback cache target_dirty_ratio_micro 40 % cache target_full_ratio_micro 80 % cache min_flush_age 0 s cache min_evict_age 0 s target max_objects 0 target max_bytes960 MB hit set_count 1 hit set_period 3600 s hit set_params target_size :0 seed : 0 type : bloom false_positive_probability : 0.05 I believe the tiering itself works well, I do see objects and bytes being transfered from the cache to the storage when I write data. I checked with 'rados ls', and the object count in the cold storage is always right on spot. But it isn't in the cache, when I do 'ceph df' or 'rados df' the space and object counts do not match with 'rados ls', and are usually much larger : % ceph df … POOLS: NAME ID USED %USED MAX AVAIL OBJECTS … test1_ec-data 14 5576M 0.045G 1394 test1_ct-cache 15 772M 0 7410G 250 % rados -p test1_ec-data ls | wc -l 1394 % rados -p test1_ct-cache ls | wc -l 56 # And this corresponds to 220M of data in test1_ct-cache Not only it prevents me from knowing exactly what the cache is doing, but it is also this value that is applied for the quota. And I've seen writing operations fail because the space count had reached 1G, although I was quite sure there was enough space. The count does not correct itself over time, even by waiting overnight. The count only changes when I poke the pool by changing a setting or writing data, but remains wrong (and not by the same number of objects). The changes in object counts given by 'rados ls' in both pools match with the number of objects written by 'rados bench'. Does anybody know where this mismatch might come from ? Is there a way to see more details about what's going on ? Or is it the normal behavior of a cache pool when 'rados bench' is used ? Well, I don't think the quota stuff is going to interact well with caching pools; the size limits are implemented at different places in the cache. Similarly, rados ls definitely doesn't work properly on cache pools; you shouldn't expect anything sensible to come out of it. Among other things, there are whiteout objects in the cache pool (recording that an object is known not to exist in the base pool) that won't be listed in rados ls, and I'm sure there's other stuff too. If you're trying to limit the cache pool size you want to do that with the target size and dirty targets/limits. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
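Concretely, a sketch of what Greg suggests: drop the pool quota and let the tiering agent enforce the size instead. The byte value and ratios below are only examples matching the ~1 GB test setup:

$ ceph osd pool set-quota test1_ct-cache max_bytes 0              # remove the quota
$ ceph osd pool set test1_ct-cache target_max_bytes 1000000000    # ~1 GB cache target
$ ceph osd pool set test1_ct-cache cache_target_dirty_ratio 0.4   # start flushing at 40% dirty
$ ceph osd pool set test1_ct-cache cache_target_full_ratio 0.8    # start evicting at 80% full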
Re: [ceph-users] MDS [WRN] getattr pAsLsXsFs failed to rdlock
For everybody else's reference, this is addressed in http://tracker.ceph.com/issues/10944. That kernel has several known bugs. -Greg On Tue, Feb 24, 2015 at 12:02 PM, Ilja Slepnev islep...@gmail.com wrote: Dear All, Configuration of MDS and CephFS client is the same: OS: CentOS 7.0.1406 ceph-0.87 Linux 3.10.0-123.20.1.el7.centos.plus.x86_64 dmesg: libceph: loaded (mon/osd proto 15/24) dmesg: ceph: loaded (mds proto 32) Using kernel ceph module, fstab mount options: defaults,_netdev,ro,noatime,name=admin,secret=hidden CephFS mount is exported by NFS. Problem: After period of light activity (reading files, listing dirs) one of the cephfs paths got stuck in directory listing process, on local machine and via NFS. Log messages on MDS (repeating): 2015-02-24 16:02:41.564071 7fdb0055c700 0 log_channel(default) log [WRN] : 9 slow requests, 1 included below; oldest blocked for 14463.448519 secs 2015-02-24 16:02:41.564077 7fdb0055c700 0 log_channel(default) log [WRN] : slow request 1922.318256 seconds old, received at 2015-02-24 15:30:39.245786: client_request(client.66401597:2440 getattr pAsLsXsFs #1002d68) currently failed to rdlock, waiting Could it be a broken metadata, or a bug? How to find out what is going wrong? Is there a workaround? WBR, Ilja Slepnev ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Minor version difference between monitors and OSDs
On Thu, Feb 19, 2015 at 8:30 PM, Christian Balzer ch...@gol.com wrote: Hello, I have a cluster currently at 0.80.1 and would like to upgrade it to 0.80.7 (Debian as you can guess), but for a number of reasons I can't really do it all at the same time. In particular I would like to upgrade the primary monitor node first and the secondary ones as well as the OSDs later. Now my understanding and hope is that unless I change the config to add features that aren't present in 0.80.1, things should work just fine, especially given the main release note blurb about 0.80.7: I don't think we test upgrades between that particular combination of versions, but as a matter of policy there shouldn't be any issues between point releases. The release note is referring to the issue described at http://tracker.ceph.com/issues/9419, which is indeed for pre-Firefly to Firefly upgrades. :) -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
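For reference, a rolling point-release upgrade along the usual lines (monitors first, then OSDs) might look roughly like this on Debian with the stock sysvinit scripts; the exact node order is up to you:

# on each monitor node, one at a time
$ apt-get update && apt-get install ceph ceph-common   # pull the 0.80.7 packages
$ service ceph restart mon
$ ceph -s                                               # confirm the mon rejoined quorum before the next one

# then on each OSD node, one at a time
$ apt-get install ceph ceph-common
$ service ceph restart osd
$ ceph -s                                               # wait for HEALTH_OK before continuing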
Re: [ceph-users] OSD not marked as down or out
That's pretty strange, especially since the monitor is getting the failure reports. What version are you running? Can you bump up the monitor debugging and provide its output from around that time? -Greg On Fri, Feb 20, 2015 at 3:26 AM, Sudarshan Pathak sushan@gmail.com wrote: Hello everyone, I have a cluster running with OpenStack. It has 6 OSD (3 in each 2 different locations). Each pool has 3 replication size with 2 copy in primary location and 1 copy at secondary location. Everything is running as expected but the osd are not marked as down when I poweroff a OSD server. It has been around an hour. I tried changing the heartbeat settings too. Can someone point me in right direction. OSD 0 log = 2015-02-20 16:20:14.009723 7f3fe37d7700 -1 osd.0 451 heartbeat_check: no reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20 16:15:54.607854 (cutoff 2015-02-20 16:19:54.009720) 2015-02-20 16:20:15.009908 7f3fe37d7700 -1 osd.0 451 heartbeat_check: no reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20 16:15:54.607854 (cutoff 2015-02-20 16:19:55.009907) 2015-02-20 16:20:16.010123 7f3fe37d7700 -1 osd.0 451 heartbeat_check: no reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20 16:15:54.607854 (cutoff 2015-02-20 16:19:56.010119) 2015-02-20 16:20:16.648167 7f3fc9a76700 -1 osd.0 451 heartbeat_check: no reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20 16:15:54.607854 (cutoff 2015-02-20 16:19:56.648165) Ceph monitor log 2015-02-20 16:49:16.831548 7f416e4aa700 1 mon.storage1@1(leader).osd e455 prepare_failure osd.2 192.168.100.33:6800/24431 from osd.4 192.168.100.35:6800/1305 is reporting failure:1 2015-02-20 16:49:16.831593 7f416e4aa700 0 log_channel(cluster) log [DBG] : osd.2 192.168.100.33:6800/24431 reported failed by osd.4 192.168.100.35:6800/1305 2015-02-20 16:49:17.080314 7f416e4aa700 1 mon.storage1@1(leader).osd e455 prepare_failure osd.2 192.168.100.33:6800/24431 from osd.3 192.168.100.34:6800/1358 is reporting failure:1 2015-02-20 16:49:17.080527 7f416e4aa700 0 log_channel(cluster) log [DBG] : osd.2 192.168.100.33:6800/24431 reported failed by osd.3 192.168.100.34:6800/1358 2015-02-20 16:49:17.420859 7f416e4aa700 1 mon.storage1@1(leader).osd e455 prepare_failure osd.2 192.168.100.33:6800/24431 from osd.5 192.168.100.36:6800/1359 is reporting failure:1 #ceph osd stat osdmap e455: 6 osds: 6 up, 6 in #ceph -s cluster c8a5975f-4c86-4cfe-a91b-fac9f3126afc health HEALTH_WARN 528 pgs peering; 528 pgs stuck inactive; 528 pgs stuck unclean; 1 requests are blocked 32 sec; 1 mons down, quorum 1,2,3,4 storage1,storage2,compute3,compute4 monmap e1: 5 mons at {admin=192.168.100.39:6789/0,compute3=192.168.100.133:6789/0,compute4=192.168.100.134:6789/0,storage1=192.168.100.120:6789/0,storage2=192.168.100.121:6789/0}, election epoch 132, quorum 1,2,3,4 storage1,storage2,compute3,compute4 osdmap e455: 6 osds: 6 up, 6 in pgmap v48474: 3650 pgs, 19 pools, 27324 MB data, 4420 objects 82443 MB used, 2682 GB / 2763 GB avail 3122 active+clean 528 remapped+peering Ceph.conf file [global] fsid = c8a5975f-4c86-4cfe-a91b-fac9f3126afc mon_initial_members = admin, storage1, storage2, compute3, compute4 mon_host = 192.168.100.39,192.168.100.120,192.168.100.121,192.168.100.133,192.168.100.134 auth_cluster_required = cephx auth_service_required = cephx auth_client_required = cephx filestore_xattr_use_omap = true osd pool default size = 3 osd pool default min size = 3 osd pool default pg num = 300 osd pool default pgp num = 300 public network = 
192.168.100.0/24 rgw print continue = false rgw enable ops log = false mon osd report timeout = 60 mon osd down out interval = 30 mon osd min down reports = 2 osd heartbeat grace = 10 osd mon heartbeat interval = 20 osd mon report interval max = 60 osd mon ack timeout = 15 mon osd min down reports = 2 Regards, Sudarshan Pathak ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
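To capture the kind of monitor-side detail Greg is asking for, something like the following should work while you reproduce the poweroff; the mon name and log path assume the leader is mon.storage1 as in your output:

$ ceph tell mon.storage1 injectargs '--debug-mon 10 --debug-ms 1'   # repeat for each mon if unsure which is leader
$ tail -f /var/log/ceph/ceph-mon.storage1.log                        # watch how the osd failure reports are handled
$ ceph tell mon.storage1 injectargs '--debug-mon 1 --debug-ms 0'    # turn the logging back down afterwards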
Re: [ceph-users] Power failure recovery woes (fwd)
You can try searching the archives and tracker.ceph.com for hints about repairing these issues, but your disk stores have definitely been corrupted and it's likely to be an adventure. I'd recommend examining your local storage stack underneath Ceph and figuring out which part was ignoring barriers. -Greg On Fri, Feb 20, 2015 at 10:39 AM, Jeff j...@usedmoviefinder.com wrote: Should I infer from the silence that there is no way to recover from the FAILED assert(last_e.version.version e.version.version) errors? Thanks, Jeff - Forwarded message from Jeff j...@usedmoviefinder.com - Date: Tue, 17 Feb 2015 09:16:33 -0500 From: Jeff j...@usedmoviefinder.com To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Power failure recovery woes Some additional information/questions: Here is the output of ceph osd tree Some of the down OSD's are actually running, but are down. For example osd.1: root 30158 8.6 12.7 1542860 781288 ? Ssl 07:47 4:40 /usr/bin/ceph-osd --cluster=ceph -i 0 -f Is there any way to get the cluster to recognize them as being up? osd-1 has the FAILED assert(last_e.version.version e.version.version) errors. Thanks, Jeff # idweight type name up/down reweight -1 10.22 root default -2 2.72host ceph1 0 0.91osd.0 up 1 1 0.91osd.1 down0 2 0.9 osd.2 down0 -3 1.82host ceph2 3 0.91osd.3 down0 4 0.91osd.4 down0 -4 2.04host ceph3 5 0.68osd.5 up 1 6 0.68osd.6 up 1 7 0.68osd.7 up 1 8 0.68osd.8 down0 -5 1.82host ceph4 9 0.91osd.9 up 1 10 0.91osd.10 down0 -6 1.82host ceph5 11 0.91osd.11 up 1 12 0.91osd.12 up 1 On 2/17/2015 8:28 AM, Jeff wrote: Original Message Subject: Re: [ceph-users] Power failure recovery woes Date: 2015-02-17 04:23 From: Udo Lembke ulem...@polarzone.de To: Jeff j...@usedmoviefinder.com, ceph-users@lists.ceph.com Hi Jeff, is the osd /var/lib/ceph/osd/ceph-2 mounted? If not, does it helps, if you mounted the osd and start with service ceph start osd.2 ?? Udo Am 17.02.2015 09:54, schrieb Jeff: Hi, We had a nasty power failure yesterday and even with UPS's our small (5 node, 12 OSD) cluster is having problems recovering. We are running ceph 0.87 3 of our OSD's are down consistently (others stop and are restartable, but our cluster is so slow that almost everything we do times out). We are seeing errors like this on the OSD's that never run: ERROR: error converting store /var/lib/ceph/osd/ceph-2: (1) Operation not permitted We are seeing errors like these of the OSD's that run some of the time: osd/PGLog.cc: 844: FAILED assert(last_e.version.version e.version.version) common/HeartbeatMap.cc: 79: FAILED assert(0 == hit suicide timeout) Does anyone have any suggestions on how to recover our cluster? Thanks! Jeff ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com - End forwarded message - ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
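A couple of quick things worth checking when hunting for the layer that ignored barriers (device names are examples; also look at any RAID controller running a write-back cache without a battery):

$ mount | grep /var/lib/ceph/osd     # look for nobarrier / barrier=0 on the OSD filesystems
$ hdparm -W /dev/sdb                 # shows whether the drive's volatile write cache is enabled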
Re: [ceph-users] running giant/hammer mds with firefly osds
On Fri, Feb 20, 2015 at 3:50 AM, Luis Periquito periqu...@gmail.com wrote: Hi Dan, I remember http://tracker.ceph.com/issues/9945 introducing some issues with running cephfs between different versions of giant/firefly. https://www.mail-archive.com/ceph-users@lists.ceph.com/msg14257.html Hmm, yeah, that's been fixed for a while but is still waiting to go out in the next point release. :( Beyond this bug, although the MDS doesn't have any new OSD dependencies that could break things, we don't test cross-version stuff like that at all except during upgrades. Some minimal testing on your side should be enough to make sure it works, but if I were you I'd try it on a test cluster first — the MDS is reporting a lot more to the monitors in Giant and Hammer than it did in Firefly, and everything should be good but there might be issues lurking in the compatibility checks there. -Greg So if you upgrade please be aware that you'll also have to update the clients. On Fri, Feb 20, 2015 at 10:33 AM, Dan van der Ster d...@vanderster.com wrote: Hi all, Back in the dumpling days, we were able to run the emperor MDS with dumpling OSDs -- this was an improvement over the dumpling MDS. Now we have stable firefly OSDs, but I was wondering if we can reap some of the recent CephFS developments by running a giant or ~hammer MDS with our firefly OSDs. Did anyone try that yet? Best Regards, Dan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] mixed ceph versions
On Wed, Feb 25, 2015 at 3:11 PM, Deneau, Tom tom.den...@amd.com wrote: I need to set up a cluster where the rados client (for running rados bench) may be on a different architecture and hence running a different ceph version from the osd/mon nodes. Is there a list of which ceph versions work together for a situation like this? The RADOS protocol is architecture-independent, and while we don't test across a huge version divergence (mostly between LTS releases) the client should also be compatible with pretty much anything you have server-side. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
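A quick sanity check of what each side is actually running (the osd id is just an example):

$ ceph --version            # on the client host that will run rados bench
$ rados --version
$ ceph tell osd.0 version   # what the server side reports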
Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?
On Mon, Feb 23, 2015 at 8:59 AM, Chris Murray chrismurra...@gmail.com wrote: ... Trying to send again after reporting bounce backs to dreamhost ... ... Trying to send one more time after seeing mails come through the list today ... Hi all, First off, I should point out that this is a 'small cluster' issue and may well be due to the stretched resources. If I'm doomed to destroying and starting again, fair be it, but I'm interested to see if things can get up and running again. My experimental ceph cluster now has 5 nodes with 3 osds each. Some drives are big, some drives are small. Most are formatted with BTRFS and two are still formatted with XFS, which I intend to remove and recreate with BTRFS at some point. I gather BTRFS isn't entirely stable yet, but compression suits my use-case, so I'm prepared to stick with it while it matures. I had to set the following, to avoid osds dying as the IO was consumed by the snapshot creation and deletion process (as I understand it): filestore btrfs snap = false and the mount options look like this: osd mount options btrfs = rw,noatime,space_cache,user_subvol_rm_allowed,compress-force=lzo Each node is a HP Microserver n36l or n54l, with 8GB of memory, so CPU horsepower is lacking somewhat. Ceph is version 0.80.8, and each node is also a mon. My issue is: After adding the 15th osd, the cluster went into a spiral of destruction, with osds going down one after another. One might go down on occasion, and usually a start of the osd in question will remedy things. This time, though, it hasn't, and the problem appears to have become worse and worse. I've tried starting osds, restarting whole hosts, to no avail. I've brought all osds back 'in' and set noup, nodown and noout. I've ceased rbd activity since it was getting blocked anyway. The cluster appears to now be 'stuck' in this state: cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a health HEALTH_WARN 1 pgs backfill; 45 pgs backfill_toofull; 1969 pgs degraded; 1226 pgs down; 2 pgs incomplete; 1333 pgs peering; 1445 pgs stale; 1336 pgs stuck inactive; 1445 pgs stuck stale; 4198 pgs stuck unclean; recovery 838948/2578420 objects degraded (32.537%); 2 near full osd(s); 8/15 in osds are down; noup,nodown,noout flag(s) set monmap e5: 5 mons at {0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0,3= 192.168.12.28:6789/0,4=192.168.12.29:6789/0}, election epoch 2618, quorum 0,1,2,3,4 0,1,2,3,4 osdmap e63276: 15 osds: 7 up, 15 in flags noup,nodown,noout pgmap v3371280: 4288 pgs, 5 pools, 3322 GB data, 835 kobjects 4611 GB used, 871 GB / 5563 GB avail 838948/2578420 objects degraded (32.537%) 3 down+remapped+peering 8 stale+active+degraded+remapped 85 active+clean 1 stale+incomplete 1088 stale+down+peering 642 active+degraded+remapped 1 incomplete 33 stale+remapped+peering 135 down+peering 1 stale+degraded 1 stale+active+degraded+remapped+wait_backfill+backfill_toofull 854 active+remapped 234 stale+active+degraded 4 active+degraded+remapped+backfill_toofull 40 active+remapped+backfill_toofull 1079 active+degraded 5 stale+active+clean 74 stale+peering Take one of the nodes. It holds osds 12 (down in), 13 (up in) and 14 (down in). # ceph osd stat osdmap e63276: 15 osds: 7 up, 15 in flags noup,nodown,noout # ceph daemon osd.12 status no valid command found; 10 closest matches: config show help log dump get_command_descriptions git_version config set var val [val...] 
version 2 config get var 0 admin_socket: invalid command # ceph daemon osd.13 status { cluster_fsid: e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a, osd_fsid: d7794b10-2366-4c4f-bb4d-5f11098429b6, whoami: 13, state: active, oldest_map: 48214, newest_map: 63276, num_pgs: 790} # ceph daemon osd.14 status admin_socket: exception getting command descriptions: [Errno 111] Connection refused I'm assuming osds 12 and 14 are acting that way because they're not up, but why are they different? Well, you below indicate that osd.14's log says it crashed on an internal heartbeat timeout (usually, it got stuck waiting for disk IO or the kernel/btrfs hung), so that's why. The osd.12 process exists but isn't up; osd.14 doesn't even have a socket to connect to. In terms of logs, ceph-osd.12.log doesn't go beyond this: 2015-02-22 10:38:29.629407 7fd24952c780 0 ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7), process ceph-osd, pid 3813 2015-02-22 10:38:29.639802 7fd24952c780 0 filestore(/var/lib/ceph/osd/ceph-12) mount detected btrfs
Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?
comment on what might be causing this error for this osd? Many years ago, when ZFS was in its infancy, I had a dedup disaster which I thought would never end, but that just needed to do its thing before the pool came back to life. Could this be a similar scenario perhaps? Is the activity leading up to something, and BTRFS is slowly doing what Ceph is asking of it, or is it just going round and round in circles and I just can't see? :-) -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Chris Murray Sent: 25 February 2015 12:58 To: Gregory Farnum Cc: ceph-users Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy;will the cluster recover without help? Thanks Greg After seeing some recommendations I found in another thread, my impatience got the better of me, and I've start the process again, but there is some logic, I promise :-) I've copied the process from Michael Kidd, I believe, and it goes along the lines of: setting noup, noin, noscrub, nodeep-scrub, norecover, nobackfill stopping all OSDs setting all OSDs down out setting various options in ceph.conf to limit backfill activity etc starting all OSDs wait until all CPU settles to 0% -- I am here unset the noup flag wait until all CPU settles to 0% unset the noin flag wait until all CPU settles to 0% unset the nobackfill flag wait until all CPU settles to 0% unset the norecover flag remove options from ceph.conf unset the noscrub flag unset the nodeep-scrub flag Currently, host CPU usage is approx the following, so something's changed, and I'm tempted to leave things a little longer before my next step, just in case CPU does eventually stop spinning. I read reports of things taking a while even with modern Xeons, so I suppose it's not outside the realms of possibility that an AMD Neo might take days to work things out. We're up to 24.5 hours now: 192.168.12.25 20% 192.168.12.26 1% 192.168.12.27 15% 192.168.12.28 1% 192.168.12.29 12% Interesting, as 192.168.12.26 and .28 are the two which stopped spinning before I restarted this process too. The number of different states is slightly less confusing now, but not by much: :-) 788386/2591752 objects degraded (30.419%) 90 stale+active+clean 2 stale+down+remapped+peering 2 stale+incomplete 1 stale+active+degraded+remapped+wait_backfill+backfill_toofull 1 stale+degraded 1255 stale+active+degraded 32 stale+remapped+peering 773 stale+active+remapped 4 stale+active+degraded+remapped+backfill_toofull 1254 stale+down+peering 278 stale+peering 33 stale+active+remapped+backfill_toofull 563 stale+active+degraded+remapped Well, you below indicate that osd.14's log says it crashed on an internal heartbeat timeout (usually, it got stuck waiting for disk IO or the kernel/btrfs hung), so that's why. The osd.12 process exists but isn't up; osd.14 doesn't even have a socket to connect to. Ah, that does make sense, thank you. That's not what I'd expect to see (it appears to have timed out and not be recognizing it?) but I don't look at these things too often so maybe that's the normal indication that heartbeats are failing. I'm not sure what this means either. 
A google for heartbeat_map is_healthy FileStore::op_tp thread had timed out after doesn't return much, but I did see this quote from Sage on what looks like a similar matter: - the filestore op_queue is blocked on the throttler (too much io queued) - the commit thread is also waiting for ops to finish - i see no actual thread processing the op_queue Usually that's because it hit a kernel bug and got killed. Not sure what else would make that thread disappear... sage Oh! Also, you want to find out why they're dying. That's probably the root cause of your issues I have a sneaking suspicion it's BTRFS, but don't have the evidence or perhaps the knowledge to prove it. If XFS did compression, I'd go with that, but at the moment I need to rely on compression to solve the problem of reclaiming space *within* files which reside on ceph. As far as I remember, overwriting with zeros didn't re-do the thin provisioning on XFS, if that makes sense. Thanks again, Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
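For reference, the procedure Chris describes above (copied from Michael Kidd) looks roughly like this in command form; the backfill/recovery throttles are example values:

$ ceph osd set noup; ceph osd set noin
$ ceph osd set noscrub; ceph osd set nodeep-scrub
$ ceph osd set norecover; ceph osd set nobackfill

# ceph.conf ([osd] section) while recovering, to limit backfill load - example values
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1

# start the OSDs, then unset one flag at a time, waiting for CPU/IO to settle in between
$ ceph osd unset noup
$ ceph osd unset noin
$ ceph osd unset nobackfill
$ ceph osd unset norecover
$ ceph osd unset noscrub
$ ceph osd unset nodeep-scrub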
Re: [ceph-users] ceph -s slow return result
Are all your monitors running? Usually a temporary hang means that the Ceph client tries to reach a monitor that isn't up, then times out and contacts a different one. I have also seen it just be slow if the monitors are processing so many updates that they're behind, but that's usually on a very unhappy cluster. -Greg On Fri, Mar 27, 2015 at 8:50 AM Chu Duc Minh chu.ducm...@gmail.com wrote: On my CEPH cluster, ceph -s returns results quite slowly. Sometimes it returns immediately, sometimes it hangs for a few seconds before returning. Do you think this problem (slow ceph -s returns) relates only to the ceph-mon(s) process, or might it relate to the ceph-osd(s) too? (I am deleting a big bucket, .rgw.buckets, and ceph-osd disk utilization is quite high.) Regards, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
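If you want to narrow it down, a simple test is to point the client at each monitor explicitly and see whether one of them is the slow or unreachable one (substitute each of your mon addresses in turn):

$ ceph mon stat               # lists the monitors and which ones are currently in quorum
$ ceph -m <mon-ip>:6789 -s    # force the client to talk to one specific monitor; a hang here points at that mon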
Re: [ceph-users] CephFS Slow writes with 1MB files
So this is exactly the same test you ran previously, but now it's on faster hardware and the test is slower? Do you have more data in the test cluster? One obvious possibility is that previously you were working entirely in the MDS' cache, but now you've got more dentries and so it's kicking data out to RADOS and then reading it back in. If you've got the memory (you appear to) you can pump up the mds cache size config option quite dramatically from it's default 10. Other things to check are that you've got an appropriately-sized metadata pool, that you've not got clients competing against each other inappropriately, etc. -Greg On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson almightybe...@gmail.com wrote: Opps I should have said that I am not just writing the data but copying it : time cp Small1/* Small2/* Thanks, BJ On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson almightybe...@gmail.com wrote: I did a Ceph cluster install 2 weeks ago where I was getting great performance (~= PanFS) where I could write 100,000 1MB files in 61 Mins (Took PanFS 59 Mins). I thought I could increase the performance by adding a better MDS server so I redid the entire build. Now it takes 4 times as long to write the same data as it did before. The only thing that changed was the MDS server. (I even tried moving the MDS back on the old slower node and the performance was the same.) The first install was on CentOS 7. I tried going down to CentOS 6.6 and it's the same results. I use the same scripts to install the OSDs (which I created because I can never get ceph-deploy to behave correctly. Although, I did use ceph-deploy to create the MDS and MON and initial cluster creation.) I use btrfs on the OSDS as I can get 734 MB/s write and 1100 MB/s read with rados bench -p cephfs_data 500 write --no-cleanup rados bench -p cephfs_data 500 seq (xfs was 734 MB/s write but only 200 MB/s read) Could anybody think of a reason as to why I am now getting a huge regression. Hardware Setup: [OSDs] 64 GB 2133 MHz Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores) 40Gb Mellanox NIC [MDS/MON new] 128 GB 2133 MHz Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores) 40Gb Mellanox NIC [MDS/MON old] 32 GB 800 MHz Dual Proc E5472 @ 3.00GHz (8 Cores) 10Gb Intel NIC ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
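For what it's worth, the option Greg mentions defaults to 100,000 inodes, and bumping it might look like this; the value is just an example, and the MDS needs a restart to pick up the conf change:

# /etc/ceph/ceph.conf on the MDS host
[mds]
    mds cache size = 4000000    # number of inodes the MDS will cache; default is 100000

$ service ceph restart mds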
Re: [ceph-users] Snapshots and fstrim with cache tiers ?
On Wed, Mar 25, 2015 at 3:14 AM, Frédéric Nass frederic.n...@univ-lorraine.fr wrote: Hello, I have a few questions regarding snapshots and fstrim with cache tiers. In the cache tier and erasure coding FAQ related to ICE 1.2 (based on Firefly), Inktank says Snapshots are not supported in conjunction with cache tiers. What are the risks of using snapshots with cache tiers ? Would this better not use it recommandation still be true with Giant or Hammer ? Regarding the fstrim command, it doesn't seem to work with cache tiers. The freed up blocks don't get back in the ceph cluster. Can someone confirm this ? Is there something we can do to get those freed up blocks back in the cluster ? It does work, but there are two effects you're missing here: 1) The object can be deleted in the cache tier, but it won't get deleted from the backing pool until it gets flushed out of the cache pool. Depending on your workload this can take a while. 2) On erasure-coded pool, the OSD makes sure it can roll back a certain number of operations per PG. In the case of deletions, this means keeping the object data around for a while. This can also take a while if you're not doing many operations. This has been discussed on the list before; I think you'll want to look for a thread about rollback and pg log size. -Greg Also, can we run an fstrim task from the cluster side ? That is, without having to map and mount each rbd image or rely on the client to operate this task ? Best regards, -- Frédéric Nass Sous-direction Infrastructures Direction du Numérique Université de Lorraine email : frederic.n...@univ-lorraine.fr Tél : +33 3 83 68 53 83 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
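If you want to see the freed space come back sooner, one hedged option is to force the cache tier to flush and evict everything so the deletes reach the base pool (the pool name is an example):

$ rados -p cache-pool cache-flush-evict-all   # flush dirty objects and evict clean ones from the cache tier
$ ceph df                                     # backing-pool usage should drop once the flush (and pg log trimming) catches up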
Re: [ceph-users] CephFS Slow writes with 1MB files
On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson almightybe...@gmail.com wrote: Yes it's the exact same hardware except for the MDS server (although I tried using the MDS on the old node). I have not tried moving the MON back to the old node. My default cache size is mds cache size = 1000 The OSDs (3 of them) have 16 Disks with 4 SSD Journal Disks. I created 2048 for data and metadata: ceph osd pool create cephfs_data 2048 2048 ceph osd pool create cephfs_metadata 2048 2048 To your point on clients competing against each other... how would I check that? Do you have multiple clients mounted? Are they both accessing files in the directory(ies) you're testing? Were they accessing the same pattern of files for the old cluster? If you happen to be running a hammer rc or something pretty new you can use the MDS admin socket to explore a bit what client sessions there are and what they have permissions on and check; otherwise you'll have to figure it out from the client side. -Greg Thanks for the input! On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum g...@gregs42.com wrote: So this is exactly the same test you ran previously, but now it's on faster hardware and the test is slower? Do you have more data in the test cluster? One obvious possibility is that previously you were working entirely in the MDS' cache, but now you've got more dentries and so it's kicking data out to RADOS and then reading it back in. If you've got the memory (you appear to) you can pump up the mds cache size config option quite dramatically from it's default 10. Other things to check are that you've got an appropriately-sized metadata pool, that you've not got clients competing against each other inappropriately, etc. -Greg On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson almightybe...@gmail.com wrote: Opps I should have said that I am not just writing the data but copying it : time cp Small1/* Small2/* Thanks, BJ On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson almightybe...@gmail.com wrote: I did a Ceph cluster install 2 weeks ago where I was getting great performance (~= PanFS) where I could write 100,000 1MB files in 61 Mins (Took PanFS 59 Mins). I thought I could increase the performance by adding a better MDS server so I redid the entire build. Now it takes 4 times as long to write the same data as it did before. The only thing that changed was the MDS server. (I even tried moving the MDS back on the old slower node and the performance was the same.) The first install was on CentOS 7. I tried going down to CentOS 6.6 and it's the same results. I use the same scripts to install the OSDs (which I created because I can never get ceph-deploy to behave correctly. Although, I did use ceph-deploy to create the MDS and MON and initial cluster creation.) I use btrfs on the OSDS as I can get 734 MB/s write and 1100 MB/s read with rados bench -p cephfs_data 500 write --no-cleanup rados bench -p cephfs_data 500 seq (xfs was 734 MB/s write but only 200 MB/s read) Could anybody think of a reason as to why I am now getting a huge regression. Hardware Setup: [OSDs] 64 GB 2133 MHz Dual Proc E5-2630 v3 @ 2.40GHz (16 Cores) 40Gb Mellanox NIC [MDS/MON new] 128 GB 2133 MHz Dual Proc E5-2650 v3 @ 2.30GHz (20 Cores) 40Gb Mellanox NIC [MDS/MON old] 32 GB 800 MHz Dual Proc E5472 @ 3.00GHz (8 Cores) 10Gb Intel NIC ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
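On hammer, inspecting the sessions from the MDS side might look like this (run on the MDS host; the daemon name is whatever admin socket appears under /var/run/ceph):

$ ls /var/run/ceph/                   # find the mds admin socket name
$ ceph daemon mds.<name> session ls   # one entry per mounted client, with its address and cap counts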
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
Has the OSD actually been detected as down yet? You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) On Thu, Mar 26, 2015 at 1:29 PM, Lee Revell rlrev...@gmail.com wrote: I added the osd pool default min size = 1 to test the behavior when 2 of 3 OSDs are down, but the behavior is exactly the same as without it: when the 2nd OSD is killed, all client writes start to block and these pipe.(stuff).fault messages begin: 2015-03-26 16:08:50.775848 7fce177fe700 0 monclient: hunting for new mon 2015-03-26 16:08:53.781133 7fce1c2f9700 0 -- 192.168.122.111:0/1011003 192.168.122.131:6789/0 pipe(0x7fce0c01d260 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce0c01d4f0).fault 2015-03-26 16:09:00.009092 7fce1c3fa700 0 -- 192.168.122.111:0/1011003 192.168.122.141:6789/0 pipe(0x7fce1802dab0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1802dd40).fault 2015-03-26 16:09:12.013147 7fce1c2f9700 0 -- 192.168.122.111:0/1011003 192.168.122.131:6789/0 pipe(0x7fce1802e740 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1802e9d0).fault 2015-03-26 16:10:06.013113 7fce1c2f9700 0 -- 192.168.122.111:0/1011003 192.168.122.131:6789/0 pipe(0x7fce1802df80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1801e600).fault 2015-03-26 16:10:36.013166 7fce1c3fa700 0 -- 192.168.122.111:0/1011003 192.168.122.141:6789/0 pipe(0x7fce1802ebc0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1802ee50).fault Here is my ceph.conf: [global] fsid = db460aa2-5129-4aaa-8b2e-43eac727124e mon_initial_members = ceph-node-1 mon_host = 192.168.122.121 auth_cluster_required = cephx auth_service_required = cephx auth_client_required = cephx filestore_xattr_use_omap = true osd pool default size = 3 osd pool default min size = 1 public network = 192.168.122.0/24 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
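For existing pools the command is issued per pool, e.g. (the pool name is an example):

$ ceph osd pool set rbd min_size 1
$ ceph osd dump | grep ^pool      # verify each pool now shows min_size 1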
Re: [ceph-users] how do I destroy cephfs? (interested in cephfs + tiering + erasure coding)
There have been bugs here in the recent past which have been fixed for hammer, at least...it's possible we didn't backport it for the giant point release. :( But for users going forward that procedure should be good! -Greg On Thu, Mar 26, 2015 at 11:26 AM, Kyle Hutson kylehut...@ksu.edu wrote: For what it's worth, I don't think being patient was the answer. I was having the same problem a couple of weeks ago, and I waited from before 5pm one day until after 8am the next, and still got the same errors. I ended up adding a new cephfs pool with a newly-created small pool, but was never able to actually remove cephfs altogether. On Thu, Mar 26, 2015 at 12:45 PM, Jake Grimmett j...@mrc-lmb.cam.ac.uk wrote: On 03/25/2015 05:44 PM, Gregory Farnum wrote: On Wed, Mar 25, 2015 at 10:36 AM, Jake Grimmett j...@mrc-lmb.cam.ac.uk wrote: Dear All, Please forgive this post if it's naive, I'm trying to familiarise myself with cephfs! I'm using Scientific Linux 6.6. with Ceph 0.87.1 My first steps with cephfs using a replicated pool worked OK. Now trying now to test cephfs via a replicated caching tier on top of an erasure pool. I've created an erasure pool, cannot put it under the existing replicated pool. My thoughts were to delete the existing cephfs, and start again, however I cannot delete the existing cephfs: errors are as follows: [root@ceph1 ~]# ceph fs rm cephfs2 Error EINVAL: all MDS daemons must be inactive before removing filesystem I've tried killing the ceph-mds process, but this does not prevent the above error. I've also tried this, which also errors: [root@ceph1 ~]# ceph mds stop 0 Error EBUSY: must decrease max_mds or else MDS will immediately reactivate Right, so did you run ceph mds set_max_mds 0 and then repeating the stop command? :) This also fail... [root@ceph1 ~]# ceph-deploy mds destroy [ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf [ceph_deploy.cli][INFO ] Invoked (1.5.21): /usr/bin/ceph-deploy mds destroy [ceph_deploy.mds][ERROR ] subcommand destroy not implemented Am I doing the right thing in trying to wipe the original cephfs config before attempting to use an erasure cold tier? Or can I just redefine the cephfs? Yeah, unfortunately you need to recreate it if you want to try and use an EC pool with cache tiering, because CephFS knows what pools it expects data to belong to. Things are unlikely to behave correctly if you try and stick an EC pool under an existing one. :( Sounds like this is all just testing, which is good because the suitability of EC+cache is very dependent on how much hot data you have, etc...good luck! -Greg many thanks, Jake Grimmett ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Thanks for your help - much appreciated. The set_max_mds 0 command worked, but only after I rebooted the server, and restarted ceph twice. Before this I still got an mds active error, and so was unable to destroy the cephfs. Possibly I was being impatient, and needed to let mds go inactive? there were ~1 million files on the system. [root@ceph1 ~]# ceph mds set_max_mds 0 max_mds = 0 [root@ceph1 ~]# ceph mds stop 0 telling mds.0 10.1.0.86:6811/3249 to deactivate [root@ceph1 ~]# ceph mds stop 0 Error EEXIST: mds.0 not active (up:stopping) [root@ceph1 ~]# ceph fs rm cephfs2 Error EINVAL: all MDS daemons must be inactive before removing filesystem There shouldn't be any other mds servers running.. 
[root@ceph1 ~]# ceph mds stop 1 Error EEXIST: mds.1 not active (down:dne) At this point I rebooted the server, did a service ceph restart twice. Shutdown ceph, then restarted ceph before this command worked: [root@ceph1 ~]# ceph fs rm cephfs2 --yes-i-really-mean-it Anyhow, I've now been able to create an erasure coded pool, with a replicated tier which cephfs is running on :) *Lots* of testing to go! Again, many thanks Jake ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
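For anyone following along, a fresh setup along the lines Jake describes (EC base pool, replicated writeback cache on top, then a new filesystem) might look roughly like this on Giant; pool names and PG counts are only examples, and the last step assumes your release accepts an EC data pool once it has a writeback overlay:

$ ceph osd pool create cephfs-meta 512
$ ceph osd pool create cephfs-ec 1024 1024 erasure
$ ceph osd pool create cephfs-cache 512
$ ceph osd tier add cephfs-ec cephfs-cache
$ ceph osd tier cache-mode cephfs-cache writeback
$ ceph osd tier set-overlay cephfs-ec cephfs-cache
$ ceph fs new cephfs2 cephfs-meta cephfs-ec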
Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?
You shouldn't rely on rados ls when working with cache pools. It doesn't behave properly and is a silly operation to run against a pool of any size even when it does. :) More specifically, rados ls is invoking the pgls operation. Normal read/write ops will go query the backing store for objects if they're not in the cache tier. pgls is different — it just tells you what objects are present in the PG on that OSD right now. So any objects which aren't in cache won't show up when listing on the cache pool. -Greg On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke ulem...@polarzone.de wrote: Hi all, due an very silly approach, I removed the cache tier of an filled EC pool. After recreate the pool and connect with the EC pool I don't see any content. How can I see the rbd_data and other files through the new ssd cache tier? I think, that I must recreate the rbd_directory (and fill with setomapval), but I don't see anything yet! $ rados ls -p ecarchiv | more rbd_data.2e47de674b0dc51.00390074 rbd_data.2e47de674b0dc51.0020b64f rbd_data.2fbb1952ae8944a.0016184c rbd_data.2cfc7ce74b0dc51.00363527 rbd_data.2cfc7ce74b0dc51.0004c35f rbd_data.2fbb1952ae8944a.0008db43 rbd_data.2cfc7ce74b0dc51.0015895a rbd_data.31229f0238e1f29.000135eb ... $ rados ls -p ssd-archiv nothing generation of the cache tier: $ rados mkpool ssd-archiv $ ceph osd pool set ssd-archiv crush_ruleset 5 $ ceph osd tier add ecarchiv ssd-archiv $ ceph osd tier cache-mode ssd-archiv writeback $ ceph osd pool set ssd-archiv hit_set_type bloom $ ceph osd pool set ssd-archiv hit_set_count 1 $ ceph osd pool set ssd-archiv hit_set_period 3600 $ ceph osd pool set ssd-archiv target_max_bytes 500 rule ssd { ruleset 5 type replicated min_size 1 max_size 10 step take ssd step choose firstn 0 type osd step emit } Are there any magic (or which command I missed?) to see the excisting data throug the cache tier? regards - and hoping for answers Udo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
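In other words, named object operations still work fine through the cache tier; it is only enumeration that has to go against the base pool. A quick check, using the pool names from this thread:

$ rados -p ecarchiv ls | head -1           # pick an object name from the base (EC) pool
$ rados -p ssd-archiv stat <that-object>   # a stat/get through the cache tier should still find it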
Re: [ceph-users] Migrating objects from one pool to another?
On Thu, Mar 26, 2015 at 2:53 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 21.07, J-P Methot jpmet...@gtcomm.net wrote: That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. Hm wouldn’t it be possible to do something similar ala: # list object from src pool rados ls objects loop | filter-obj-id | while read obj; do # export $obj to local disk rados -p pool-wth-too-many-pgs get $obj # import $obj from local disk to new pool rados -p better-sized-pool put $obj done You would also have issues with snapshots if you do this on an RBD pool. That's unfortunately not feasible. -Greg possible split/partition list of objects into multiple concurrent loops, possible from multiple boxes as seems fit for resources at hand, cpu, memory, network, ceph perf. /Steffen On 3/26/2015 3:54 PM, Steffen W Sørensen wrote: On 26/03/2015, at 20.38, J-P Methot jpmet...@gtcomm.net wrote: Lately I've been going back to work on one of my first ceph setup and now I see that I have created way too many placement groups for the pools on that setup (about 10 000 too many). I believe this may impact performances negatively, as the performances on this ceph cluster are abysmal. Since it is not possible to reduce the number of PGs in a pool, I was thinking of creating new pools with a smaller number of PGs, moving the data from the old pools to the new pools and then deleting the old pools. I haven't seen any command to copy objects from one pool to another. Would that be possible? I'm using ceph for block storage with openstack, so surely there must be a way to move block devices from a pool to another, right? What I did a one point was going one layer higher in my storage abstraction, and created new Ceph pools and used those for new storage resources/pool in my VM env. (ProxMox) on top of Ceph RBD and then did a live migration of virtual disks there, assume you could do the same in OpenStack. My 0.02$ /Steffen -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Migrating objects from one pool to another?
The procedure you've outlined won't copy snapshots, just the head objects. Preserving the proper snapshot metadata and inter-pool relationships on rbd images I think isn't actually possible when trying to change pools. On Thu, Mar 26, 2015 at 3:05 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 23.01, Gregory Farnum g...@gregs42.com wrote: On Thu, Mar 26, 2015 at 2:53 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 21.07, J-P Methot jpmet...@gtcomm.net wrote: That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. Hm wouldn’t it be possible to do something similar ala: # list object from src pool rados ls objects loop | filter-obj-id | while read obj; do # export $obj to local disk rados -p pool-wth-too-many-pgs get $obj # import $obj from local disk to new pool rados -p better-sized-pool put $obj done You would also have issues with snapshots if you do this on an RBD pool. That's unfortunately not feasible. What isn’t possible, export-import objects out-and-in of pools or snapshots issues? /Steffen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
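If the goal is just to move RBD images between pools, a per-image copy is probably the practical route, with the caveat Greg notes that snapshots and clone relationships are not carried over (pool names are examples):

$ rbd ls pool-with-too-many-pgs | while read img; do
      rbd cp pool-with-too-many-pgs/$img better-sized-pool/$img   # copies the image head only, no snapshots
  done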
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote: On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? I believe it has, however I can't directly check because ceph health starts to hang when I down the second node. Oh. You need to keep a quorum of your monitors running (just the monitor processes, not of everything in the system) or nothing at all is going to work. That's how we prevent split brain issues. You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) I've done this, however the behavior is the same: $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd pool set $f min_size 1; done set pool 0 min_size to 1 set pool 1 min_size to 1 set pool 2 min_size to 1 set pool 3 min_size to 1 set pool 4 min_size to 1 set pool 5 min_size to 1 set pool 6 min_size to 1 set pool 7 min_size to 1 $ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2 monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0,ceph-node-3=192.168.122.141:6789/0}, election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2 mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active} osdmap e362: 3 osds: 2 up, 2 in pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects 25329 MB used, 12649 MB / 40059 MB avail 840 active+clean 2015-03-26 17:23:56.009938 mon.0 [INF] pgmap v5913: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail 2015-03-26 17:25:51.042802 mon.0 [INF] pgmap v5914: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail; 0 B/s rd, 260 kB/s wr, 13 op/s 2015-03-26 17:25:56.046491 mon.0 [INF] pgmap v5915: 840 pgs: 840 active+clean; 7441 MB data, 25333 MB used, 12645 MB / 40059 MB avail; 0 B/s rd, 943 kB/s wr, 38 op/s 2015-03-26 17:26:01.058167 mon.0 [INF] pgmap v5916: 840 pgs: 840 active+clean; 7441 MB data, 25335 MB used, 12643 MB / 40059 MB avail; 0 B/s rd, 10699 kB/s wr, 621 op/s this is where i kill the second OSD 2015-03-26 17:26:26.778461 7f4ebeffd700 0 monclient: hunting for new mon 2015-03-26 17:26:30.701099 7f4ec45f5700 0 -- 192.168.122.111:0/1007741 192.168.122.141:6789/0 pipe(0x7f4ec0023200 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0023490).fault 2015-03-26 17:26:42.701154 7f4ec44f4700 0 -- 192.168.122.111:0/1007741 192.168.122.131:6789/0 pipe(0x7f4ec00251b0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0025440).fault And all writes block until I bring back an OSD. Lee ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy somnath@sandisk.com wrote: Greg, Couple of dumb question may be. 1. If you see , the clients are connecting fine with two monitors in the cluster. 2 monitors can never form a quorum, but, 1 can, so, why with 1 monitor (which is I guess happening after making 2 nodes down) it is not able to connect ? A quorum is a strict majority of the total membership. 2 monitors can form a quorum just fine if there are either 2 or 3 total membership. (As long as those two agree on every action, it cannot be lost.) We don't *recommend* configuring systems with an even number of monitors, because it increases the number of total possible failures without increasing the number of failures that can be tolerated. (3 monitors requires 2 in quorum, 4 does too. Same for 5 and 6, 7 and 8, etc etc.) 2. Also, my understanding is while IO is going on *no* monitor interaction will be on that path, so, why the client io will be stopped because the monitor quorum is not there ? If the min_size =1 is properly set it should able to serve IO as long as 1 OSD (node) is up, isn't it ? Well, the remaining OSD won't be able to process IO because it's lost its peers, and it can't reach any monitors to do updates or get new maps. (Monitors which are not in quorum will not allow clients to connect.) The clients will eventually stop serving IO if they know they can't reach a monitor, although I don't remember exactly how that's triggered. In this particular case, though, the client probably just tried to do an op against the dead osd, realized it couldn't, and tried to fetch a map from the monitors. When that failed it went into search mode, which is what the logs are showing you. -Greg Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum Sent: Thursday, March 26, 2015 2:40 PM To: Lee Revell Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote: On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? I believe it has, however I can't directly check because ceph health starts to hang when I down the second node. Oh. You need to keep a quorum of your monitors running (just the monitor processes, not of everything in the system) or nothing at all is going to work. That's how we prevent split brain issues. You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) 
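The even/odd point above is easy to convince yourself of with a couple of lines of shell (plain arithmetic, not a ceph command): the quorum size for N monitors is floor(N/2) + 1, and the number of failures that can be tolerated is N minus that.

  for n in 1 2 3 4 5 6 7; do
      q=$(( n / 2 + 1 ))
      echo "monitors=$n quorum=$q tolerated_failures=$(( n - q ))"
  done

The output shows that 3 and 4 monitors both tolerate a single failure, and 5 and 6 both tolerate two, which is why an even-numbered monitor buys nothing except one more thing that can break.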
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
On Thu, Mar 26, 2015 at 3:36 PM, Somnath Roy somnath@sandisk.com wrote:

Got most of it, thanks! But I still don't get why, when the second node is down and a single monitor is left in the cluster, the client is not able to connect. 1 monitor can form a quorum and should be sufficient for a cluster to run.

The whole point of the monitor cluster is to ensure a globally consistent view of the cluster state that will never be reversed by a different group of up nodes. If one monitor (out of three) could make changes to the maps by itself, then there's nothing to prevent all three monitors from staying up but getting a network split, and then each issuing different versions of the osdmaps to whichever clients or OSDs happen to be connected to them.

If you want to get down into the math proofs and things, then the Paxos papers do all the proofs. Or you can look at the CAP theorem and the tradeoff between consistency and availability. The monitors are a Paxos cluster and Ceph is a 100% consistent system.

-Greg

Thanks & Regards
Somnath
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
On Thu, Mar 26, 2015 at 3:54 PM, Somnath Roy somnath@sandisk.com wrote:

Greg, I think you got me wrong. I am not saying each monitor of a group of 3 should be able to change the map. Here is the scenario.

1. Cluster up and running with 3 mons (quorum of 3), all fine.
2. One node (and mon) is down, quorum of 2, still connecting.
3. 2 nodes (and 2 mons) are down; it should be a quorum of 1 now and the client should still be able to connect. Shouldn't it?

No. The monitors can't tell the difference between dead monitors and monitors they can't reach over the network. So they say "there are three monitors in my map; therefore it requires two to make any change." That's the case regardless of whether all of them are running, or only one.

A cluster with a single monitor is able to form a quorum and should be working fine. So why not in case of point 3? If this is the way Paxos works, should we say that a cluster with, say, 3 monitors is able to tolerate only one mon failure?

Yes, that is the case.

Let me know if I am missing a point here.

Thanks & Regards
Somnath
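An aside for anyone debugging this situation: once quorum is lost, ordinary ceph commands (ceph health, ceph -s) hang exactly as described earlier in the thread, because they need a quorum to answer. A surviving monitor can still be asked for its local view over its admin socket. A sketch, assuming the default socket path and the monitor names used in this thread:

  # Run on the monitor host itself; this works even when there is no quorum.
  ceph daemon mon.ceph-node-1 mon_status
  # Equivalent form with an explicit admin socket path (default location shown; adjust to your setup):
  ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node-1.asok mon_status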
Re: [ceph-users] error creating image in rbd-erasure-pool
On Tue, Mar 24, 2015 at 12:09 PM, Brendan Moloney molo...@ohsu.edu wrote:

Hi Loic and Markus,

By the way, Inktank do not support snapshot of a pool with cache tiering:
* https://download.inktank.com/docs/ICE%201.2%20-%20Cache%20and%20Erasure%20Coding%20FAQ.pdf

Hi, you seem to be talking about pool snapshots rather than RBD snapshots. But in the linked document it is not clear that there is a distinction: "Can I use snapshots with a cache tier? Snapshots are not supported in conjunction with cache tiers." Can anyone clarify if this is just pool snapshots?

I think that was just a decision based on the newness and complexity of the feature for product purposes. Snapshots against cache-tiered pools certainly should be fine in Giant/Hammer, and we can't think of any issues in Firefly off the tops of our heads.

-Greg
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
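For readers trying to keep the two kinds of snapshot straight, the commands involved are different, which is a quick way to tell which one a document is talking about. A rough illustration, with mypool and myimage as placeholder names:

  # Pool snapshot: a snapshot of an entire RADOS pool
  ceph osd pool mksnap mypool mysnap
  # RBD snapshot: a per-image snapshot (built on RADOS self-managed snapshots)
  rbd snap create mypool/myimage@mysnap

To the best of my understanding, a single pool cannot mix the two styles: once RBD has created self-managed snapshots in a pool, pool-level mksnap is refused, and vice versa, so the wording in the FAQ really does matter.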
Re: [ceph-users] Does crushtool --test --simulate do what cluster should do?
On Tue, Mar 24, 2015 at 10:48 AM, Robert LeBlanc rob...@leblancnet.us wrote:

I'm not sure why crushtool --test --simulate doesn't match what the cluster actually does, but the cluster seems to be executing the rules even though crushtool doesn't. Just kind of stinks that you have to test the rules on actual data. Should I create a ticket for this?

Yes please! I'm not too familiar with the crushtool internals, but the simulator code hasn't had too many eyeballs, so it's hopefully not too hard a bug to fix.

On Mon, Mar 23, 2015 at 6:08 PM, Robert LeBlanc rob...@leblancnet.us wrote:

I'm trying to create a CRUSH ruleset and I'm using crushtool to test the rules, but it doesn't seem to be mapping things correctly. I have two roots, one for spindles and another for SSDs. I have two rules, one for each root. The output of crushtool on rule 0 shows objects being mapped to SSD OSDs when it should only be choosing spindles. I'm pretty sure I'm doing something wrong. I've tested the map on .93 and .80.8. The map is at http://pastebin.com/BjmuASX0

When running

crushtool -i map.crush --test --num-rep 3 --rule 0 --simulate --show-mappings

I'm getting mappings to OSDs 39 which are SSDs. The same happens when I run the SSD rule, I get OSDs from both roots. It is as if crushtool is not selecting the correct root. In fact both rules result in the same mapping:

RNG rule 0 x 0 [0,38,23]
RNG rule 0 x 1 [10,25,1]
RNG rule 0 x 2 [11,40,0]
RNG rule 0 x 3 [5,30,26]
RNG rule 0 x 4 [44,30,10]
RNG rule 0 x 5 [8,26,16]
RNG rule 0 x 6 [24,5,36]
RNG rule 0 x 7 [38,10,9]
RNG rule 0 x 8 [39,9,23]
RNG rule 0 x 9 [12,3,24]
RNG rule 0 x 10 [18,6,41]
...
RNG rule 1 x 0 [0,38,23]
RNG rule 1 x 1 [10,25,1]
RNG rule 1 x 2 [11,40,0]
RNG rule 1 x 3 [5,30,26]
RNG rule 1 x 4 [44,30,10]
RNG rule 1 x 5 [8,26,16]
RNG rule 1 x 6 [24,5,36]
RNG rule 1 x 7 [38,10,9]
RNG rule 1 x 8 [39,9,23]
RNG rule 1 x 9 [12,3,24]
RNG rule 1 x 10 [18,6,41]
...

Thanks,
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
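For anyone wanting to reproduce or sanity-check this against a live cluster, the usual round trip between the cluster's CRUSH map and crushtool looks roughly like the following (a sketch only; flag names should be double-checked against the crushtool shipped with your release):

  # Pull the compiled CRUSH map out of the cluster and decompile it for editing
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  # ...edit crushmap.txt, then recompile it
  crushtool -c crushmap.txt -o crushmap.new
  # Simulate placements for a rule before touching the cluster
  crushtool -i crushmap.new --test --rule 0 --num-rep 3 --show-mappings
  # Only once the mappings look right, inject the new map
  ceph osd setcrushmap -i crushmap.new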
Re: [ceph-users] hadoop namenode not starting due to bindException while deploying hadoop with cephFS
On Wed, Mar 25, 2015 at 8:10 PM, Ridwan Rashid Noel ridwan...@gmail.com wrote:

Hi Greg,

Thank you for your response. I have understood that I should be starting only the mapred daemons when using cephFS instead of HDFS. I have fixed that and am trying to run the hadoop wordcount job with this command:

bin/hadoop jar hadoop*examples*.jar wordcount /tmp/wc-input /tmp/wc-output

but I am getting this error:

15/03/26 02:54:35 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/03/26 02:54:35 INFO input.FileInputFormat: Total input paths to process : 1
15/03/26 02:54:35 WARN snappy.LoadSnappy: Snappy native library not loaded
15/03/26 02:54:35 INFO mapred.JobClient: Running job: job_201503260253_0001
15/03/26 02:54:36 INFO mapred.JobClient: map 0% reduce 0%
15/03/26 02:54:36 INFO mapred.JobClient: Task Id : attempt_201503260253_0001_m_21_0, Status : FAILED
Error initializing attempt_201503260253_0001_m_21_0:
java.io.FileNotFoundException: File file:/tmp/hadoop-ceph/mapred/system/job_201503260253_0001/jobToken does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
    at org.apache.hadoop.mapred.TaskTracker.localizeJobTokenFile(TaskTracker.java:4445)
    at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1272)
    at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1213)
    at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2568)
    at java.lang.Thread.run(Thread.java:745)

I'm not an expert at setting up Hadoop, but these errors are coming out of the RawLocalFileSystem, which I think means that worker node is trying to use a local FS instead of Ceph. Did you set up each node to access Ceph? Have you set up and used Hadoop previously? -Greg

I have used the core-site.xml configuration as mentioned in http://ceph.com/docs/master/cephfs/hadoop/. Please tell me how this problem can be solved.

Regards,
Ridwan Rashid Noel
Doctoral Student, Department of Computer Science, University of Texas at San Antonio
Contact# 210-773-9966

On Fri, Mar 20, 2015 at 4:04 PM, Gregory Farnum g...@gregs42.com wrote:

On Fri, Mar 20, 2015 at 1:05 PM, Ridwan Rashid ridwan...@gmail.com wrote:

Gregory Farnum greg@... writes:

On Thu, Mar 19, 2015 at 5:57 PM, Ridwan Rashid ridwan064@... wrote:

Hi, I have a 5 node ceph (v0.87) cluster and am trying to deploy hadoop with cephFS. I have installed hadoop-1.1.1 on the nodes and changed the conf/core-site.xml file according to the ceph documentation http://ceph.com/docs/master/cephfs/hadoop/, but after changing the file the namenode is not starting (the namenode can be formatted), while the other services (datanode, jobtracker, tasktracker) are running in hadoop. The default hadoop works fine, but when I change the core-site.xml file as above I get the following bindException, as can be seen from the namenode log:

2015-03-19 01:37:31,436 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException: Problem binding to node1/10.242.144.225:6789 : Cannot assign requested address

I have one monitor for the ceph cluster (node1/10.242.144.225) and I included ceph://10.242.144.225:6789 in the core-site.xml file as the value of fs.default.name. The 6789 port is the default port being used by the monitor node of ceph, so that may be the reason for the bindException, but the ceph documentation mentions that it should be included like this in the core-site.xml file.
It would be really helpful to get some pointers to where I am going wrong in the setup.

I'm a bit confused. The NameNode is only used by HDFS, and so shouldn't be running at all if you're using CephFS. Nor do I have any idea why you've changed anything in a way that tells the NameNode to bind to the monitor's IP address; none of the instructions that I see can do that, and they certainly shouldn't be. -Greg

Hi Greg,

I want to run a hadoop job (e.g. terasort) and want to use cephFS instead of HDFS. In the "Using Hadoop with CephFS" documentation at http://ceph.com/docs/master/cephfs/hadoop/, if you look into the Hadoop configuration section, the first property fs.default.name has to be set to the ceph URI, and in the notes it's mentioned as ceph://[monaddr:port]/. My core-site.xml of the hadoop conf looks like this:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>ceph://10.242.144.225:6789</value>
  </property>

Yeah, that all makes sense. But I don't understand why or how you're starting up a NameNode at all, nor what config values it's drawing from to try and bind to that port. The NameNode is the problem because it shouldn't even be invoked. -Greg
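One low-tech way to check whether each node is really going through CephFS, rather than silently falling back to the local filesystem (the RawLocalFileSystem seen in the stack trace earlier in this thread), is to run a plain listing through the Hadoop shell on every node. A sketch, reusing the monitor address from this thread; the exact behaviour depends on the Hadoop 1.1.x CephFS bindings being on that node's classpath:

  # Should list the CephFS root; a failure, or an obviously local /tmp listing,
  # means this node's core-site.xml or classpath is not picking up the Ceph bindings.
  bin/hadoop fs -ls ceph://10.242.144.225:6789/
  # With fs.default.name already pointing at the ceph:// URI, the short form should behave the same:
  bin/hadoop fs -ls /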