Re: [ceph-users] Two osds are spaming dmesg every 900 seconds
This is being output by one of the kernel clients, and it's just saying that the connections to those two OSDs have died from inactivity. Either the other OSD connections are used a lot more, or aren't used at all. In any case, it's not a problem; just a noisy notification. There's not much you can do about it; sorry. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Aug 25, 2014 at 12:01 PM, Andrei Mikhailovsky and...@arhont.com wrote: Hello I am seeing this message every 900 seconds on the osd servers. My dmesg output is all filled with: [256627.683702] libceph: osd3 192.168.168.200:6821 socket closed (con state OPEN) [256627.687663] libceph: osd6 192.168.168.200:6841 socket closed (con state OPEN) Looking at the ceph-osd logs I see the following at the same time: 2014-08-25 19:48:14.869145 7f0752125700 0 -- 192.168.168.200:6821/4097 192.168.168.200:0/2493848861 pipe(0x13b43c80 sd=92 :6821 s=0 pgs=0 cs=0 l=0 c=0x16a606e0).accept peer addr is really 192.168.168.200:0/2493848861 (socket is 192.168.168.200:54457/0) This happens only on two osds and the rest of the osds seem fine. Does anyone know why I am seeing this and how to correct it? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
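For anyone hitting the same noise, a rough sketch of confirming it really is just these two idle connections (the ports here are taken from the messages above and would differ on other clusters):

    dmesg | grep 'socket closed' | tail -n 20
    ss -tn | grep -E ':(6821|6841) '

If only the same couple of OSDs ever show up and the cluster is otherwise healthy, the messages can be ignored, as Greg notes.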
Re: [ceph-users] Ceph-fuse fails to mount
In particular, we changed things post-Firefly so that the filesystem isn't created automatically. You'll need to set it up (and its pools, etc) explicitly to use it. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Aug 25, 2014 at 2:40 PM, Sean Crosby richardnixonsh...@gmail.com wrote: Hi James, On 26 August 2014 07:17, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: [ceph@first_cluster ~]$ ceph -s cluster e0433b49-d64c-4c3e-8ad9-59a47d84142d health HEALTH_OK monmap e1: 1 mons at {first_cluster=10.25.164.192:6789/0}, election epoch 2, quorum 0 first_cluster mdsmap e4: 1/1/1 up {0=first_cluster=up:active} osdmap e13: 3 osds: 3 up, 3 in pgmap v480: 192 pgs, 3 pools, 1417 MB data, 4851 objects 19835 MB used, 56927 MB / 76762 MB avail 192 active+clean This cluster has an MDS. It should mount. [ceph@second_cluster ~]$ ceph -s cluster 06f655b7-e147-4790-ad52-c57dcbf160b7 health HEALTH_OK monmap e1: 1 mons at {second_cluster=10.25.165.91:6789/0}, election epoch 1, quorum 0 cilsdbxd1768 osdmap e16: 7 osds: 7 up, 7 in pgmap v539: 192 pgs, 3 pools, 0 bytes data, 0 objects 252 MB used, 194 GB / 194 GB avail 192 active+clean No mdsmap line for this cluster, and therefore the filesystem won't mount. Have you added an MDS for this cluster, or has the mds daemon died? You'll have to get the mdsmap line to show before it will mount Sean ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
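A quick way to check whether a cluster has an active MDS at all (a minimal sketch; on post-Firefly releases the filesystem also has to be created explicitly, as described in the later message in this thread):

    ceph mds stat
    ceph -s | grep mdsmap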
Re: [ceph-users] MDS dying on Ceph 0.67.10
I don't think the log messages you're showing are the actual cause of the failure. The log file should have a proper stack trace (with specific function references and probably a listed assert failure), can you find that? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Aug 26, 2014 at 9:11 AM, MinhTien MinhTien tientienminh080...@gmail.com wrote: Hi all, I have a cluster of 2 nodes on Centos 6.5 with ceph 0.67.10 (replicate = 2) When I add the 3rd node in the Ceph Cluster, CEPH perform load balancing. I have 3 MDS in 3 nodes,the MDS process is dying after a while with a stack trace: --- 2014-08-26 17:08:34.362901 7f1c2c704700 1 -- 10.20.0.21:6800/22154 == osd.10 10.20.0.21:6802/15917 1 osd_op_reply(230 10003f6. [tmapup 0~0] ondisk = 0) v4 119+0+0 (1770421071 0 0) 0x2aece00 con 0x2aa4200 -54 2014-08-26 17:08:34.362942 7f1c2c704700 1 -- 10.20.0.21:6800/22154 == osd.55 10.20.0.23:6800/2407 10 osd_op_reply(263 100048a. [getxattr] ack = -2 (No such file or directory)) v4 119+0+0 (3908997833 0 0) 0x1e63000 con 0x1e7aaa0 -53 2014-08-26 17:08:34.363001 7f1c2c704700 5 mds.0.log submit_entry 427629603~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs] -52 2014-08-26 17:08:34.363022 7f1c2c704700 1 -- 10.20.0.21:6800/22154 == osd.37 10.20.0.22:6898/11994 6 osd_op_reply(226 1. [tmapput 0~7664] ondisk = 0) v4 109+0+0 (1007110430 0 0) 0x1e64800 con 0x1e7a7e0 -51 2014-08-26 17:08:34.363092 7f1c2c704700 5 mds.0.log _expired segment 293601899 2548 events -50 2014-08-26 17:08:34.363117 7f1c2c704700 1 -- 10.20.0.21:6800/22154 == osd.17 10.20.0.21:6941/17572 9 osd_op_reply(264 1000489. [getxattr] ack = -2 (No such file or directory)) v4 119+0+0 (1979034473 0 0) 0x1e62200 con 0x1e7b180 -49 2014-08-26 17:08:34.363177 7f1c2c704700 5 mds.0.log submit_entry 427631148~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs] -48 2014-08-26 17:08:34.363197 7f1c2c704700 1 -- 10.20.0.21:6800/22154 == osd.1 10.20.0.21:6872/13227 6 osd_op_reply(265 1000491. [getxattr] ack = -2 (No such file or directory)) v4 119+0+0 (1231782695 0 0) 0x1e63400 con 0x1e7ac00 -47 2014-08-26 17:08:34.363255 7f1c2c704700 5 mds.0.log submit_entry 427632693~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs] -46 2014-08-26 17:08:34.363274 7f1c2c704700 1 -- 10.20.0.21:6800/22154 == osd.11 10.20.0.21:6884/7018 5 osd_op_reply(266 100047d. [getxattr] ack = -2 (No such file or directory)) v4 119+0+0 (2737916920 0 0) 0x1e61e00 con 0x1e7bc80 - I try to restart MDSs, but after a few seconds in a state of active, MDS switch to state laggy or crashed. I have a lot of important data on it. I do not want to use the command: ceph mds newfs metadata pool id data pool id --yes-i-really-mean-it :( Tien Bui. -- Bui Minh Tien ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
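A sketch of how one might pull the actual backtrace out of the MDS log (assuming the default log location; the exact assert text will vary):

    grep -B 5 -A 40 -e 'FAILED assert' -e 'Caught signal' /var/log/ceph/ceph-mds.*.log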
Re: [ceph-users] Fresh Firefly install degraded without modified default tunables
Hmm, that all looks basically fine. But why did you decide not to segregate OSDs across hosts (according to your CRUSH rules)? I think maybe it's the interaction of your map, setting choose_local_tries to 0, and trying to go straight to the OSDs instead of choosing hosts. But I'm not super familiar with how the tunables would act under these exact conditions. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Aug 25, 2014 at 12:59 PM, Ripal Nathuji ri...@nathuji.com wrote: Hi Greg, Thanks for helping to take a look. Please find your requested outputs below. ceph osd tree: # id weight type name up/down reweight -1 0 root default -2 0 host osd1 0 0 osd.0 up 1 4 0 osd.4 up 1 8 0 osd.8 up 1 11 0 osd.11 up 1 -3 0 host osd0 1 0 osd.1 up 1 3 0 osd.3 up 1 6 0 osd.6 up 1 9 0 osd.9 up 1 -4 0 host osd2 2 0 osd.2 up 1 5 0 osd.5 up 1 7 0 osd.7 up 1 10 0 osd.10 up 1 ceph -s: cluster 4a158d27-f750-41d5-9e7f-26ce4c9d2d45 health HEALTH_WARN 832 pgs degraded; 832 pgs stuck unclean; recovery 43/86 objects degraded (50.000%) monmap e1: 1 mons at {ceph-mon0=192.168.2.10:6789/0}, election epoch 2, quorum 0 ceph-mon0 osdmap e34: 12 osds: 12 up, 12 in pgmap v61: 832 pgs, 8 pools, 840 bytes data, 43 objects 403 MB used, 10343 MB / 10747 MB avail 43/86 objects degraded (50.000%) 832 active+degraded Thanks, Ripal On Aug 25, 2014, at 12:45 PM, Gregory Farnum g...@inktank.com wrote: What's the output of ceph osd tree? And the full output of ceph -s? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Aug 18, 2014 at 8:07 PM, Ripal Nathuji ri...@nathuji.com wrote: Hi folks, I've come across an issue which I found a fix for, but I'm not sure whether it's correct or if there is some other misconfiguration on my end and this is merely a symptom. I'd appreciate any insights anyone could provide based on the information below, and happy to provide more details as necessary. Summary: A fresh install of Ceph 0.80.5 comes up with all pgs marked as active+degraded. This reproduces on 12.04 as well as CentOS 7 with a varying number of OSD hosts (1, 2, 3), where each OSD host has four storage drives. The configuration file defines a default replica size of 2, and allows leafs of type 0. Specific snippet: [global] ... osd pool default size = 2 osd crush chooseleaf type = 0 I verified the crush rules were as expected: rules: [ { rule_id: 0, rule_name: replicated_ruleset, ruleset: 0, type: 1, min_size: 1, max_size: 10, steps: [ { op: take, item: -1, item_name: default}, { op: choose_firstn, num: 0, type: osd}, { op: emit}]}], Inspecting the pg dump I observed that all pgs had a single osd in the up/acting sets. That seemed to explain why the pgs were degraded, but it was unclear to me why a second OSD wasn't in the set. After trying a variety of things, I noticed that there was a difference between Emperor (which works fine in these configurations) and Firefly with the default tunables, where Firefly comes up with the bobtail profile. The setting choose_local_fallback_tries is 0 in this profile while it used to default to 5 on Emperor. Sure enough, if I modify my crush map and set the parameter to a non-zero value, the cluster remaps and goes healthy with all pgs active+clean. The documentation states the optimal value of choose_local_fallback_tries is 0 for FF, so I'd like to get a better understanding of this parameter and why modifying the default value moves the pgs to a clean state in my scenarios. 
Thanks, Ripal ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
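One way to test how a given CRUSH map and set of tunables actually map PGs, without touching the live cluster, is to run the map through crushtool (a sketch; the rule number and replica count match the pool settings described above):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -i crushmap.bin --test --rule 0 --num-rep 2 --show-bad-mappings

Any bad mappings printed are inputs for which the rule could not find the requested number of OSDs, which is exactly what shows up as degraded PGs in the cluster.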
Re: [ceph-users] Ceph-fuse fails to mount
[Re-added the list.] I believe you'll find everything you need at http://ceph.com/docs/master/cephfs/createfs/ -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Aug 26, 2014 at 1:25 PM, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: So is there a link for documentation on the newer versions? (we're doing evaluations at present, so I had wanted to work with newer versions, since it would be closer to what we would end up using). -Original Message- From: Gregory Farnum [mailto:g...@inktank.com] Sent: Tuesday, August 26, 2014 4:05 PM To: Sean Crosby Cc: LaBarre, James (CTR) A6IT; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Ceph-fuse fails to mount In particular, we changed things post-Firefly so that the filesystem isn't created automatically. You'll need to set it up (and its pools, etc) explicitly to use it. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Aug 25, 2014 at 2:40 PM, Sean Crosby richardnixonsh...@gmail.com wrote: Hi James, On 26 August 2014 07:17, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: [ceph@first_cluster ~]$ ceph -s cluster e0433b49-d64c-4c3e-8ad9-59a47d84142d health HEALTH_OK monmap e1: 1 mons at {first_cluster=10.25.164.192:6789/0}, election epoch 2, quorum 0 first_cluster mdsmap e4: 1/1/1 up {0=first_cluster=up:active} osdmap e13: 3 osds: 3 up, 3 in pgmap v480: 192 pgs, 3 pools, 1417 MB data, 4851 objects 19835 MB used, 56927 MB / 76762 MB avail 192 active+clean This cluster has an MDS. It should mount. [ceph@second_cluster ~]$ ceph -s cluster 06f655b7-e147-4790-ad52-c57dcbf160b7 health HEALTH_OK monmap e1: 1 mons at {second_cluster=10.25.165.91:6789/0}, election epoch 1, quorum 0 cilsdbxd1768 osdmap e16: 7 osds: 7 up, 7 in pgmap v539: 192 pgs, 3 pools, 0 bytes data, 0 objects 252 MB used, 194 GB / 194 GB avail 192 active+clean No mdsmap line for this cluster, and therefore the filesystem won't mount. Have you added an MDS for this cluster, or has the mds daemon died? You'll have to get the mdsmap line to show before it will mount Sean ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- CONFIDENTIALITY NOTICE: If you have received this email in error, please immediately notify the sender by e-mail at the address shown. This email transmission may contain confidential information. This information is intended only for the use of the individual(s) or entity to whom it is intended even if addressed incorrectly. Please delete it from your files if you are not the intended recipient. Thank you for your compliance. Copyright (c) 2014 Cigna == ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
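For reference, the explicit filesystem creation described in that document looks roughly like this (pool names and PG counts are only illustrative):

    ceph osd pool create cephfs_data 128
    ceph osd pool create cephfs_metadata 128
    ceph fs new cephfs cephfs_metadata cephfs_data
    ceph mds stat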
Re: [ceph-users] error ioctl(BTRFS_IOC_SNAP_CREATE) failed: (17) File exists
This looks new to me. Can you try and start up the OSD with debug osd = 20 and debug filestore = 20 in your conf, then put the log somewhere accessible? (You can also use ceph-post-file if it's too large for pastebin or something.) Also, check dmesg and see if btrfs is complaining, and see what the (folder, or more specifically snapshot) contents of the OSD data directory are. Since you *are* on btrfs this is probably reasonably recoverable, but we'll have to see what's going on first. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Aug 26, 2014 at 10:18 PM, John Morris j...@zultron.com wrote: During reorganization of the Ceph system, including an updated CRUSH map and moving to btrfs, some PGs became stuck incomplete+remapped. Before that was resolved, a restart of osd.1 failed while creating a btrfs snapshot. A 'ceph-osd -i 1 --flush-journal' fails with the same error. See the below pasted log. This is a Bad Thing, because two PGs are now stuck down+peering. A 'ceph pg 2.74 query' shows they had been stuck on osd.1 before the btrfs problem, despite what the 'last acting' field shows in the below 'ceph health detail' output. Is there any way to recover from this? Judging from Google searches on the list archives, nobody has run into this problem before, so I'm quite worried that this spells backup recovery exercises for the next few days. Related question: Are outright OSD crashes the reason btrfs is discouraged for production use? Thanks- John pg 2.74 is stuck inactive since forever, current state down+peering, last acting [3,7,0,6] pg 3.73 is stuck inactive since forever, current state down+peering, last acting [3,7,0,6] pg 2.74 is stuck unclean since forever, current state down+peering, last acting [3,7,0,6] pg 3.73 is stuck unclean since forever, current state down+peering, last acting [3,7,0,6] pg 2.74 is down+peering, acting [3,7,0,6] pg 3.73 is down+peering, acting [3,7,0,6] 2014-08-26 22:36:12.641585 7f5b38e507a0 0 ceph version 0.67.10 (9d446bd416c52cd785ccf048ca67737ceafcdd7f), process ceph-osd, pid 10281 2014-08-26 22:36:12.717100 7f5b38e507a0 0 filestore(/ceph/osd.1) mount FIEMAP ioctl is supported and appears to work 2014-08-26 22:36:12.717121 7f5b38e507a0 0 filestore(/ceph/osd.1) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option 2014-08-26 22:36:12.717434 7f5b38e507a0 0 filestore(/ceph/osd.1) mount detected btrfs 2014-08-26 22:36:12.717471 7f5b38e507a0 0 filestore(/ceph/osd.1) mount btrfs CLONE_RANGE ioctl is supported 2014-08-26 22:36:12.765009 7f5b38e507a0 0 filestore(/ceph/osd.1) mount btrfs SNAP_CREATE is supported 2014-08-26 22:36:12.765335 7f5b38e507a0 0 filestore(/ceph/osd.1) mount btrfs SNAP_DESTROY is supported 2014-08-26 22:36:12.765541 7f5b38e507a0 0 filestore(/ceph/osd.1) mount btrfs START_SYNC is supported (transid 3118) 2014-08-26 22:36:12.789600 7f5b38e507a0 0 filestore(/ceph/osd.1) mount btrfs WAIT_SYNC is supported 2014-08-26 22:36:12.808287 7f5b38e507a0 0 filestore(/ceph/osd.1) mount btrfs SNAP_CREATE_V2 is supported 2014-08-26 22:36:12.834144 7f5b38e507a0 0 filestore(/ceph/osd.1) mount syscall(SYS_syncfs, fd) fully supported 2014-08-26 22:36:12.834377 7f5b38e507a0 0 filestore(/ceph/osd.1) mount found snaps 6009082,6009083 2014-08-26 22:36:12.834427 7f5b38e507a0 -1 filestore(/ceph/osd.1) FileStore::mount: error removing old current subvol: (22) Invalid argument 2014-08-26 22:36:12.861045 7f5b38e507a0 -1 filestore(/ceph/osd.1) mount initial op seq is 0; something is wrong 2014-08-26 22:36:12.861428 7f5b38e507a0 -1 
** ERROR: error converting store /ceph/osd.1: (22) Invalid argument ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
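A minimal sketch of the logging settings Greg asks for, plus a look at the btrfs snapshots the mount code is complaining about (paths assume the OSD data directory shown in the log above):

    # in ceph.conf on the affected host, then restart the OSD
    [osd]
        debug osd = 20
        debug filestore = 20

    # inspect the snapshots under the OSD data directory
    btrfs subvolume list /ceph/osd.1

    # optionally upload the resulting log for the developers
    ceph-post-file /var/log/ceph/ceph-osd.1.log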
Re: [ceph-users] 'incomplete' PGs: what does it mean?
On Tue, Aug 26, 2014 at 10:46 PM, John Morris j...@zultron.com wrote: In the docs [1], 'incomplete' is defined thusly: Ceph detects that a placement group is missing a necessary period of history from its log. If you see this state, report a bug, and try to start any failed OSDs that may contain the needed information. However, during an extensive review of list postings related to incomplete PGs, an alternate and oft-repeated definition is something like 'the number of existing replicas is less than the min_size of the pool'. In no list posting was there any acknowledgement of the definition from the docs. While trying to understand what 'incomplete' PGs are, I simply set min_size = 1 on this cluster with incomplete PGs, and they continue to be 'incomplete'. Does this mean that definition #2 is incorrect? In case #1 is correct, how can the cluster be told to forget the lapse in history? In our case, there was nothing writing to the cluster during the OSD reorganization that could have caused this lapse. Yeah, these two meanings can (unfortunately) both lead to the INCOMPLETE state being reported. I think that's going to be fixed in our next major release (so that INCOMPLETE means not enough OSDs hosting and missing log will translate into something else), but for now the not enough OSDs is by far the more common. In your case you probably are missing history, but you don't want to recover from it using any of the cluster tools because they're likely to lose more data than necessary. (Hopefully, you can just roll back to a slightly older btrfs snapshot, but we'll see). -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
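For narrowing down which of the two cases applies, a rough sketch of the usual inspection commands (the PG IDs are the ones from the report above):

    ceph health detail | grep -E 'incomplete|down'
    ceph pg dump_stuck inactive
    ceph pg 2.74 query | less

The query output's recovery_state section shows which OSDs are being probed and what the PG thinks is blocking peering.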
Re: [ceph-users] RAID underlying a Ceph config
There aren't too many people running RAID under Ceph, as it's a second layer of redundancy that in normal circumstances is a bit pointless. But there are scenarios where it might be useful. You might check the list archives for the anti-cephalopod question thread. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Thu, Aug 28, 2014 at 10:19 AM, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: Having heard some suggestions on RAID configuration under Gluster (we have someone else doing that evaluation, I’m doing the Ceph piece), I’m wondering what (if any) RAID configurations would be recommended for Ceph. I have the impression that striping data could counteract/undermine data replication (with PGs potentially being on multiple disks, rather than within a single disk-oriented OSD). -- CONFIDENTIALITY NOTICE: If you have received this email in error, please immediately notify the sender by e-mail at the address shown. This email transmission may contain confidential information. This information is intended only for the use of the individual(s) or entity to whom it is intended even if addressed incorrectly. Please delete it from your files if you are not the intended recipient. Thank you for your compliance. Copyright (c) 2014 Cigna == ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Filesystem - Production?
On Thu, Aug 28, 2014 at 10:36 AM, Brian C. Huffman bhuff...@etinternational.com wrote: Is Ceph Filesystem ready for production servers? The documentation says it's not, but I don't see that mentioned anywhere else. http://ceph.com/docs/master/cephfs/ Everybody has their own standards, but Red Hat isn't supporting it for general production use at this time. If you're brave you could test it under your workload for a while and see how it comes out; the known issues are very much workload-dependent (or just general concerns over polish). -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] MSWin CephFS
On Thu, Aug 28, 2014 at 10:41 AM, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: Just out of curiosity, is there a way to mount a Ceph filesystem directly on a MSWindows system (2008 R2 server)? Just wanted to try something out from a VM. Nope, sorry. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] 'incomplete' PGs: what does it mean?
, last_epoch_started: 10278}, recovery_state: [ { name: Started/Primary/Peering, enter_time: 2014-08-29 01:22:50.132826, past_intervals: [ [...] { first: 12809, last: 13101, maybe_went_rw: 1, up: [ 7, 3, 0, 4], acting: [ 7, 3, 0, 4]}, [...] ], probing_osds: [ 0, 1, 2, 3, 4, 5, 7], down_osds_we_would_probe: [], peering_blocked_by: []}, { name: Started, enter_time: 2014-08-29 01:22:50.132784}]} On Wed, Aug 27, 2014 at 12:40 PM, Gregory Farnum g...@inktank.com wrote: On Tue, Aug 26, 2014 at 10:46 PM, John Morris john at zultron.com wrote: In the docs [1], 'incomplete' is defined thusly: Ceph detects that a placement group is missing a necessary period of history from its log. If you see this state, report a bug, and try to start any failed OSDs that may contain the needed information. However, during an extensive review of list postings related to incomplete PGs, an alternate and oft-repeated definition is something like 'the number of existing replicas is less than the min_size of the pool'. In no list posting was there any acknowledgement of the definition from the docs. While trying to understand what 'incomplete' PGs are, I simply set min_size = 1 on this cluster with incomplete PGs, and they continue to be 'incomplete'. Does this mean that definition #2 is incorrect? In case #1 is correct, how can the cluster be told to forget the lapse in history? In our case, there was nothing writing to the cluster during the OSD reorganization that could have caused this lapse. Yeah, these two meanings can (unfortunately) both lead to the INCOMPLETE state being reported. I think that's going to be fixed in our next major release (so that INCOMPLETE means not enough OSDs hosting and missing log will translate into something else), but for now the not enough OSDs is by far the more common. In your case you probably are missing history, but you don't want to recover from it using any of the cluster tools because they're likely to lose more data than necessary. (Hopefully, you can just roll back to a slightly older btrfs snapshot, but we'll see). -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] question about monitor and paxos relationship
On Thu, Aug 28, 2014 at 9:52 PM, pragya jain prag_2...@yahoo.co.in wrote: I have some basic questions about the monitor and Paxos relationship: As the documents say, a Ceph monitor contains the cluster map; if there is any change in the state of the cluster, the change is updated in the cluster map. Monitors use the Paxos algorithm to create consensus among themselves and establish a quorum. And when we talk about the Paxos algorithm, the documents say that the monitor writes all changes to the Paxos instance and Paxos writes the changes to a key/value store for strong consistency. #1: I am unable to understand what the Paxos algorithm actually does. Are all changes in the cluster map made by the Paxos algorithm? How does it create consensus among monitors? Paxos is an algorithm for making decisions and/or safely replicating data in a distributed system. The Ceph monitor cluster uses it for all changes to any of its data. My assumption is: the cluster map is updated when OSDs report any changes to the monitor, and there is no role for Paxos in it; Paxos writes changes made only for the monitors. Please could somebody elaborate on this point. Every change the monitors incorporate into any data structure, most definitely including the OSD map's changes based on reports from OSDs, is passed through Paxos. #2: Why is an odd number of monitors recommended for a production cluster, rather than an even number? This is because of a property of the Paxos system's durability and uptime guarantees. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
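A tiny illustration of that last point (hypothetical numbers, not Ceph code): Paxos needs a strict majority of monitors to agree, so an even-sized monitor cluster tolerates no more failures than the odd-sized cluster one smaller than it.

    # Python sketch of monitor failure tolerance under majority quorum
    def failures_tolerated(n_mons):
        quorum = n_mons // 2 + 1   # smallest strict majority
        return n_mons - quorum

    for n in range(1, 7):
        print(n, "monitors ->", failures_tolerated(n), "failures tolerated")
    # 3 and 4 monitors both tolerate 1 failure; 5 and 6 both tolerate 2.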
Re: [ceph-users] Misdirected client messages
The clients are sending messages to OSDs which are not the primary for the data. That shouldn't happen — clients which don't understand the whole osdmap ought to be gated and prevented from accessing the cluster at all. What version of Ceph are you running, and what clients? (We've seen this in dev versions but I can't think of any in named releases off the top of my head. It's more likely if you're using something like the primary affinity values or something.) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wed, Sep 3, 2014 at 6:26 AM, Maros Vegh maros.v...@microstep-mis.sk wrote: Hello, last weeks we observed many misdirected client messages in the logs. The messages are similar to this one: 2014-09-03 15:20:55.696752 osd.24 192.168.61.3:6830/25216 234 : [WRN] client.2936377 192.168.61.105:0/983896378 misdirected client.2936377.1:4985727 pg 0.a7459c63 to osd.24 not [5,24] in e22827/22827 Can somebody explain what is the issue and how to solve it? thanks Maros Vegh ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
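A first step for the reporter would be collecting the versions in play, along the lines of (a sketch; the OSD ID and client host are taken from the log excerpt above):

    ceph tell osd.24 version
    ceph --version            # on the client host 192.168.61.105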
Re: [ceph-users] Updating the pg and pgp values
On Mon, Sep 8, 2014 at 10:08 AM, JIten Shah jshah2...@me.com wrote: While checking the health of the cluster, I ran to the following error: warning: health HEALTH_WARN too few pgs per osd (1 min 20) When I checked the pg and php numbers, I saw the value was the default value of 64 ceph osd pool get data pg_num pg_num: 64 ceph osd pool get data pgp_num pgp_num: 64 Checking the ceph documents, I updated the numbers to 2000 using the following commands: ceph osd pool set data pg_num 2000 ceph osd pool set data pgp_num 2000 It started resizing the data and saw health warnings again: health HEALTH_WARN 1 requests are blocked 32 sec; pool data pg_num 2000 pgp_num 64 and then: ceph health detail HEALTH_WARN 6 requests are blocked 32 sec; 3 osds have slow requests 5 ops are blocked 65.536 sec 1 ops are blocked 32.768 sec 1 ops are blocked 32.768 sec on osd.16 1 ops are blocked 65.536 sec on osd.77 4 ops are blocked 65.536 sec on osd.98 3 osds have slow requests This error also went away after a day. ceph health detail HEALTH_OK Now, the question I have is, will this pg number remain effective on the cluster, even if we restart MON or OSD’s on the individual disks? I haven’t changed the values in /etc/ceph/ceph.conf. Do I need to make a change to the ceph.conf and push that change to all the MON, MSD and OSD’s ? It's durable once the commands are successful on the monitors. You're all done. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Updating the pg and pgp values
It's stored in the OSDMap on the monitors. Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Sep 8, 2014 at 10:50 AM, JIten Shah jshah2...@me.com wrote: So, if it doesn’t refer to the entry in ceph.conf. Where does it actually store the new value? —Jiten On Sep 8, 2014, at 10:31 AM, Gregory Farnum g...@inktank.com wrote: On Mon, Sep 8, 2014 at 10:08 AM, JIten Shah jshah2...@me.com wrote: While checking the health of the cluster, I ran to the following error: warning: health HEALTH_WARN too few pgs per osd (1 min 20) When I checked the pg and php numbers, I saw the value was the default value of 64 ceph osd pool get data pg_num pg_num: 64 ceph osd pool get data pgp_num pgp_num: 64 Checking the ceph documents, I updated the numbers to 2000 using the following commands: ceph osd pool set data pg_num 2000 ceph osd pool set data pgp_num 2000 It started resizing the data and saw health warnings again: health HEALTH_WARN 1 requests are blocked 32 sec; pool data pg_num 2000 pgp_num 64 and then: ceph health detail HEALTH_WARN 6 requests are blocked 32 sec; 3 osds have slow requests 5 ops are blocked 65.536 sec 1 ops are blocked 32.768 sec 1 ops are blocked 32.768 sec on osd.16 1 ops are blocked 65.536 sec on osd.77 4 ops are blocked 65.536 sec on osd.98 3 osds have slow requests This error also went away after a day. ceph health detail HEALTH_OK Now, the question I have is, will this pg number remain effective on the cluster, even if we restart MON or OSD’s on the individual disks? I haven’t changed the values in /etc/ceph/ceph.conf. Do I need to make a change to the ceph.conf and push that change to all the MON, MSD and OSD’s ? It's durable once the commands are successful on the monitors. You're all done. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
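In other words, the value can be verified straight from the OSDMap rather than from ceph.conf, e.g. (a sketch):

    ceph osd dump | grep '^pool'
    ceph osd pool get data pg_num

Both should show pg_num 2000 regardless of monitor or OSD restarts.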
Re: [ceph-users] Delays while waiting_for_osdmap according to dump_historic_ops
On Sun, Sep 7, 2014 at 4:28 PM, Alex Moore a...@lspeed.org wrote: I recently found out about the ceph --admin-daemon /var/run/ceph/ceph-osd.id.asok dump_historic_ops command, and noticed something unexpected in the output on my cluster, after checking numerous output samples... It looks to me like normal write ops on my cluster spend roughly: 1ms between received_at and waiting_for_osdmap 1ms between waiting_for_osdmap and reached_pg 15ms between reached_pg and commit_sent 15ms between commit_sent and done For reference, this is a small (3-host) all-SSD cluster, with monitors co-located with OSDs. Each host has: 1 SSD for the OS, 1 SSD for the journal, and 1 SSD for the OSD + monitor data (I initially had the monitor data on the same drive as the OS, but encountered performance problems - which have since been allieviated by moving the monitor data to the same drives as the OSDs. Networking is infiniband (8 Gbps dedicated point-to-point link between each pair of hosts). I'm running v0.80.5. And the OSDs use XFS. Anyway, as this command intentionally shows the worst few recent IOs, I only rarely see examples that match the above norm. Rather, the typical outliers that it highlights are usually write IOs with ~100-300ms latency, where the extra latency exists purely between the received_at and reached_pg timestamps, and mostly in the waiting_for_osdmap step. Also it looks like these slow IOs come in batches. Every write IO arriving within the same ~1 second period will suffer from these strangely slow initial two steps, with the additional latency being almost identical for each one within the same batch. After which things return to normal again in that those steps take 1ms. So compared to the above norm, these look more like: ~50ms between received_at and waiting_for_osdmap ~150ms between waiting_for_osdmap and reached_pg 15ms between reached_pg and commit_sent 15ms between commit_sent and done This seems unexpected to me. I don't see why those initial steps in the IO should ever take such a long time to complete. Where should I be looking next to track down the cause? I'm guessing that waiting_for_osdmap involves OSD-Mon communication, and so perhaps indicates poor performance of the Mons. But for there to be any non-negligible delay between received_at and waiting_for_osdmap makes no sense to me at all. First thing here is to explain what each of these events actually mean. received_at is the point at which we *started* reading the message off the wire. We have to finish reading it off and dispatch it to the OSD before the next one. waiting_for_osdmap is slightly misnamed; it's the point at which the op was submitted to the OSD. It's called that because receiving a message with a newer OSDMap epoch than we have is the most common long-term delay in this phase, but we also have to do some other preprocessing and queue the Op up. reached_pg is the point at which the Op is dequeued by a worker thread and has the necessary mutexes to get processed. After this point we're going to try and actually do the operations described (reads or writes). commit_sent indicates that we've actually sent back the commit to the client or primary OSD. done indicates that the op has been completed (commit_sent doesn't wait for the op to have been applied to the backing filesystem; this does). 
There are probably a bunch of causes for the behavior you're seeing, but the most likely is that you've occasionally got a whole bunch of operations going to a single object/placement group and they're taking some time to process because they have to be serialized. This would prevent the PG from handling newer ops while the old ones are still being processed, and that could back up through the pipeline to slow down the reads off the wire as well. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
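For reference, the commands involved look roughly like this (the OSD ID is illustrative), and the per-event timestamps in their output are what map onto the stages Greg describes:

    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight

Comparing the gap between reached_pg and commit_sent across ops that hit the same PG is one way to confirm the serialization theory.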
Re: [ceph-users] osd crash: trim_objectcould not find coid
On Mon, Sep 8, 2014 at 1:42 AM, Francois Deppierraz franc...@ctrlaltdel.ch wrote: Hi, This issue is on a small 2 servers (44 osds) ceph cluster running 0.72.2 under Ubuntu 12.04. The cluster was filling up (a few osds near full) and I tried to increase the number of pg per pool to 1024 for each of the 14 pools to improve storage space balancing. This increase triggered high memory usage on the servers which were unfortunately under-provisioned (16 GB RAM for 22 osds) and started to swap and crash. After installing memory into the servers, the result is a broken cluster with unfound objects and two osds (osd.6 and osd.43) crashing at startup. $ ceph health HEALTH_WARN 166 pgs backfill; 326 pgs backfill_toofull; 2 pgs backfilling; 765 pgs degraded; 715 pgs down; 1 pgs incomplete; 715 pgs peering; 5 pgs recovering; 2 pgs recovery_wait; 716 pgs stuck inactive; 1856 pgs stuck unclean; 164 requests are blocked 32 sec; recovery 517735/15915673 objects degraded (3.253%); 1241/7910367 unfound (0.016%); 3 near full osd(s); 1/43 in osds are down; noout flag(s) set osd.6 is crashing due to an assertion (trim_objectcould not find coid) which leads to a resolved bug report which unfortunately doesn't give any advise on how to repair the osd. http://tracker.ceph.com/issues/5473 It is much less obvious why osd.43 is crashing, please have a look at the following osd logs: http://paste.ubuntu.com/8288607/ http://paste.ubuntu.com/8288609/ The first one is not caused by the same thing as the ticket you reference (it was fixed well before emperor), so it appears to be some kind of disk corruption. The second one is definitely corruption of some kind as it's missing an OSDMap it thinks it should have. It's possible that you're running into bugs in emperor that were fixed after we stopped doing regular support releases of it, but I'm more concerned that you've got disk corruption in the stores. What kind of crashes did you see previously; are there any relevant messages in dmesg, etc? Given these issues, you might be best off identifying exactly which PGs are missing, carefully copying them to working OSDs (use the osd store tool), and killing these OSDs. Do lots of backups at each stage... -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] osd crash: trim_objectcould not find coid
On Mon, Sep 8, 2014 at 2:53 PM, Francois Deppierraz franc...@ctrlaltdel.ch wrote: Hi Greg, Thanks for your support! On 08. 09. 14 20:20, Gregory Farnum wrote: The first one is not caused by the same thing as the ticket you reference (it was fixed well before emperor), so it appears to be some kind of disk corruption. The second one is definitely corruption of some kind as it's missing an OSDMap it thinks it should have. It's possible that you're running into bugs in emperor that were fixed after we stopped doing regular support releases of it, but I'm more concerned that you've got disk corruption in the stores. What kind of crashes did you see previously; are there any relevant messages in dmesg, etc? Nothing special in dmesg except probably irrelevant XFS warnings: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250) Hmm, I'm not sure what the outcome of that could be. Googling for the error message returns this as the first result, though: http://comments.gmane.org/gmane.comp.file-systems.xfs.general/58429 Which indicates that it's a real deadlock and capable of messing up your OSDs pretty good. All logs from before the disaster are still there, do you have any advise on what would be relevant? Given these issues, you might be best off identifying exactly which PGs are missing, carefully copying them to working OSDs (use the osd store tool), and killing these OSDs. Do lots of backups at each stage... This sounds scary, I'll keep fingers crossed and will do a bunch of backups. There are 17 pg with missing objects. What do you exactly mean by the osd store tool? Is it the 'ceph_filestore_tool' binary? Yeah, that one. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
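A very rough sketch of the export/import workflow being suggested (hedged: in later releases the tool is named ceph-objectstore-tool and these are its flags; the emperor-era ceph_filestore_tool binary may use slightly different options, so check its --help first, and take the backups Greg mentions before touching anything):

    # with osd.6 stopped
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 \
        --journal-path /var/lib/ceph/osd/ceph-6/journal \
        --pgid 2.74 --op export --file /root/pg2.74.export

    # then, with a healthy target OSD stopped, import it there with --op import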
Re: [ceph-users] Remaped osd at remote restart
On Mon, Sep 8, 2014 at 6:33 AM, Eduard Kormann ekorm...@dunkel.de wrote: Hello, have I missed something or is it a feature: When I restart a osd on the belonging server so it restarts normally: root@cephosd10:~# service ceph restart osd.76 === osd.76 === === osd.76 === Stopping Ceph osd.76 on cephosd10...kill 799176...done === osd.76 === create-or-move updating item name 'osd.76' weight 3.64 at location {host=cephosd10,root=default} to crush map Starting Ceph osd.76 on cephosd10... starting osd.76 at :/0 osd_data /var/lib/ceph/osd/ceph-76 /var/lib/ceph/osd/ceph-76/journal But if I trie to restart osd on the admin server...: root@cephadmin:/etc/ceph# service ceph -a restart osd.76 === osd.76 === === osd.76 === Stopping Ceph osd.76 on cephosd10...kill 800262...kill 800262...done === osd.76 === df: `/var/lib/ceph/osd/ceph-76/.': No such file or directory df: no file systems processed create-or-move updating item name 'osd.76' weight 1 at location {host=cephadmin,root=default} to crush map Starting Ceph osd.76 on cephosd10... starting osd.76 at :/0 osd_data /var/lib/ceph/osd/ceph-76 /var/lib/ceph/osd/ceph-76/journal ...it will associated with the admin server in the crush map: -17 0 host cephadmin 76 0 osd.76 up 0 Before that each osd could be started from arbitrary server with option -a. Apparently it no longer works. How do I run any osd from any server without error messages? ...huh. I didn't realize the -a option still existed. You should be able to prevent this from happening by adding osd crush update on start = false to the global section of your ceph.conf on any nodes which you are going to use to restart OSDs from other nodes with. I created a ticket to address this issue: http://tracker.ceph.com/issues/9407 -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
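For reference, the workaround Greg describes is a one-line addition (a sketch):

    [global]
        osd crush update on start = false

With that set on the node used for remote restarts, running service ceph -a restart osd.N should no longer re-home the OSD under that host in the CRUSH map.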
Re: [ceph-users] max_bucket limit -- safe to disable?
On Tue, Sep 9, 2014 at 9:11 AM, Daniel Schneller daniel.schnel...@centerdevice.com wrote: Hi list! Under http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-September/033670.html I found a situation not unlike ours, but unfortunately either the list archive fails me or the discussion ended without a conclusion, so I dare to ask again :) We currently have a setup of 4 servers with 12 OSDs each, combined journal and data. No SSDs. We develop a document management application that accepts user uploads of all kinds of documents and processes them in several ways. For any given document, we might create anywhere from 10s to several hundred dependent artifacts. We are now preparing to move from Gluster to a Ceph based backend. The application uses the Apache JClouds Library to talk to the Rados Gateways that are running on all 4 of these machines, load balanced by haproxy. We currently intend to create one container for each document and put all the dependent and derived artifacts as objects into that container. This gives us a nice compartmentalization per document, also making it easy to remove a document and everything that is connected with it. During the first test runs we ran into the default limit of 1000 containers per user. In the thread mentioned above that limit was removed (setting the max_buckets value to 0). We did that and now can upload more than 1000 documents. I just would like to understand a) if this design is recommended, or if there are reasons to go about the whole issue in a different way, potentially giving up the benefit of having all document artifacts under one convenient handle. b) is there any absolute limit for max_buckets that we will run into? Remember we are talking about 10s of millions of containers over time. c) are any performance issues to be expected with this design and can we tune any parameters to alleviate this? Any feedback would be very much appreciated. Yehuda can talk about this with more expertise than I can, but I think it should be basically fine. By creating so many buckets you're decreasing the effectiveness of RGW's metadata caching, which means the initial lookup in a particular bucket might take longer. The big concern is that we do maintain a per-user list of all their buckets — which is stored in a single RADOS object — so if you have an extreme number of buckets that RADOS object could get pretty big and become a bottleneck when creating/removing/listing the buckets. You should run your own experiments to figure out what the limits are there; perhaps you have an easy way of sharding up documents into different users. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
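For reference, the per-user limit mentioned above can also be lifted per user without editing config files (a sketch; the uid is hypothetical):

    radosgw-admin user modify --uid=docstore --max-buckets=0
    radosgw-admin user info --uid=docstore | grep max_buckets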
Re: [ceph-users] max_bucket limit -- safe to disable?
On Wednesday, September 10, 2014, Daniel Schneller daniel.schnel...@centerdevice.com wrote: On 09 Sep 2014, at 21:43, Gregory Farnum g...@inktank.com wrote: Yehuda can talk about this with more expertise than I can, but I think it should be basically fine. By creating so many buckets you're decreasing the effectiveness of RGW's metadata caching, which means the initial lookup in a particular bucket might take longer. Thanks for your thoughts. With “initial lookup in a particular bucket” do you mean accessing any of the objects in a bucket? If we directly access the object (not enumerating the bucket's content), would that still be an issue? Just trying to understand the inner workings a bit better to make more educated guesses :) When doing an object lookup, the gateway combines the bucket ID with a mangled version of the object name to try and do a read out of RADOS. It first needs to get that bucket ID though -- it will cache the bucket name-ID mapping, but if you have a ton of buckets there could be enough entries to degrade the cache's effectiveness. (So, you're more likely to pay that extra disk access lookup.) The big concern is that we do maintain a per-user list of all their buckets — which is stored in a single RADOS object — so if you have an extreme number of buckets that RADOS object could get pretty big and become a bottleneck when creating/removing/listing the buckets. You Alright. Listing buckets is no problem, that we don’t do. Can you say what “pretty big” would be in terms of MB? How much space does a bucket record consume in there? Based on that I could run a few numbers. Uh, a kilobyte per bucket? You could look it up in the source (I'm on my phone) but I *believe* the bucket name is allowed to be larger than the rest combined... More particularly, though, if you've got a single user uploading documents, each creating a new bucket, then those bucket creates are going to serialize on this one object. -Greg should run your own experiments to figure out what the limits are there; perhaps you have an easy way of sharding up documents into different users. Good advice. We can do that per distributor (an org unit in our software) to at least compartmentalize any potential locking issues in this area to that single entity. Still, there would be quite a lot of buckets/objects per distributor, so some more detail on the above items would be great. Thanks a lot! Daniel -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS roadmap (was Re: NAS on RBD)
On Tue, Sep 9, 2014 at 6:10 PM, Blair Bethwaite blair.bethwa...@gmail.com wrote: Hi Sage, Thanks for weighing into this directly and allaying some concerns. It would be good to get a better understanding about where the rough edges are - if deployers have some knowledge of those then they can be worked around to some extent. It's just a very long process to qualify a filesystem, even in this limited sense. We're still at the point where we're solving bugs that the open-source community brings us rather than setting out to make it stable for a particular identified workload. For the moment most of our development effort is focused on 1) instrumentation that makes it possible for users (and developers!) to identify the cause of problems we run across 2) basic mechanisms for fixing ephemeral bugs (things like booting dead clients, restarting hung metadata ops, etc) 3) general usability issues that our newer developers and users are reporting to us 4) the beginnings of fsck (correctness checking for now, no fixing yet) E.g., for our use-case it may be that whilst Inktank/RedHat won't provide support for CephFS that we are better off using it in a tightly controlled fashion (e.g., no snapshots, restricted set of native clients acting as presentation layer with others coming in via SAMBA Ganesha, no dynamic metadata tree/s, ???) where we're less likely to run into issues. Well, snapshots are definitely going to break your install (they're disabled by default, now). Multi-mds is unstable enough that nobody should be running with it. We run samba and NFS tests in our nightlies and they mostly work, although we've got some odd issues we've not tracked down when *ending* the samba process or unmounting nfs. (Our best guess on these is test or environment issues, rather than actual FS issues.) But these are probably not complete. Related, given there is no fsck, how would one go about backing up the metadata in order to facilitate DR? Is there even a way for that to make sense given the decoupling of data metadata pools...? Uh, depends on the kind of DR you're going for, I guess. There are lots of things that will backup a generic filesystem; you could do something smarter with a bit of custom scripting using Ceph's rstats. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
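The rstats Greg mentions are exposed as virtual extended attributes on directories, so a backup script can cheaply find subtrees that changed (a sketch; the mount point and path are illustrative):

    getfattr -n ceph.dir.rbytes /mnt/cephfs/projects
    getfattr -n ceph.dir.rctime /mnt/cephfs/projects

rbytes is the recursive byte count of the subtree and rctime is the most recent change time anywhere beneath it.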
Re: [ceph-users] why one osd-op from client can get two osd-op-reply?
On Wed, Sep 10, 2014 at 8:29 PM, yuelongguang fasts...@163.com wrote: as for ack and ondisk, ceph has size and min_size to decide there are how many replications. if client receive ack or ondisk, which means there are at least min_size osds have done the ops? i am reading the cource code, could you help me with the two questions. 1. on osd, where is the code that reply ops separately according to ack or ondisk. i check the code, but i thought they always are replied together. It depends on what journaling mode you're in, but generally they're triggered separately (unless it goes on disk first, in which case it will skip the ack — this is the mode it uses for non-btrfs filesystems). The places where it actually replies are pretty clear about doing one or the other, though... 2. now i just know how client write ops to primary osd, inside osd cluster, how it promises min_size copy are reached. i mean when primary osd receives ops , how it spreads ops to others, and how it processes other's reply. That's not how it works. The primary for a PG will not go active with it until it has at least min_size copies that it knows about. Once the OSD is doing any processing of the PG, it requires all participating members to respond before it sends any messages back to the client. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com greg, thanks very much 在 2014-09-11 01:36:39,Gregory Farnum g...@inktank.com 写道: The important bit there is actually near the end of the message output line, where the first says ack and the second says ondisk. I assume you're using btrfs; the ack is returned after the write is applied in-memory and readable by clients. The ondisk (commit) message is returned after it's durable to the journal or the backing filesystem. -Greg On Wednesday, September 10, 2014, yuelongguang fasts...@163.com wrote: hi,all i recently debug ceph rbd, the log tells that one write to osd can get two if its reply. the difference between them is seq. why? thanks ---log- reader got message 6 0x7f58900010a0 osd_op_reply(15 rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~3145728] v211'518 uv518 ack = 0) v6 2014-09-10 08:47:32.348213 7f58bc16b700 20 -- 10.58.100.92:0/1047669 queue 0x7f58900010a0 prio 127 2014-09-10 08:47:32.348230 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader reading tag... 
2014-09-10 08:47:32.348245 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got MSG 2014-09-10 08:47:32.348257 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got envelope type=43 src osd.1 front=247 data=0 off 0 2014-09-10 08:47:32.348269 7f58bc16b700 10 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader wants 247 from dispatch throttler 247/104857600 2014-09-10 08:47:32.348286 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got front 247 2014-09-10 08:47:32.348303 7f58bc16b700 10 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).aborted = 0 2014-09-10 08:47:32.348312 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got 247 + 0 + 0 byte message 2014-09-10 08:47:32.348332 7f58bc16b700 10 check_message_signature: seq # = 7 front_crc_ = 3699418201 middle_crc = 0 data_crc = 0 2014-09-10 08:47:32.348369 7f58bc16b700 10 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got message 7 0x7f5890003660 osd_op_reply(15 rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~3145728] v211'518 uv518 ondisk = 0) v6 -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
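The two replies are also visible from librados as the two completion callbacks; a minimal Python sketch (assumes python-rados, a reachable cluster, and a pool named 'rbd'):

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    def on_ack(completion):      # corresponds to the "ack" osd_op_reply
        print("ack: write applied and readable")

    def on_safe(completion):     # corresponds to the "ondisk" osd_op_reply
        print("ondisk: write committed durably")

    comp = ioctx.aio_write('test-object', 'hello', 0, on_ack, on_safe)
    comp.wait_for_safe()
    ioctx.close()
    cluster.shutdown()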
Re: [ceph-users] osd cpu usage is bigger than 100%
Presumably it's going faster when you have a deeper iodepth? So the reason it's using more CPU is because it's doing more work. That's all there is to it. (And the OSD uses a lot more CPU than some storage systems do, because it does a lot more work than them.) -Greg On Thursday, September 11, 2014, yuelongguang fasts...@163.com wrote: hi,all i am testing rbd performance, now there is only one vm which is using rbd as its disk, and inside it fio is doing r/w. the big diffenence is that i set a big iodepth other than iodepth=1. according to my test, the bigger iodepth, the bigger cpu usage. analyse the output of top command. 1. 12% wa, if it means disk speed is not fast enough? 2. from where we can know whether ceph's number of threads is enough or not? how do you think about it, which part is using up cpu? i want to find the root cause, why big iodepth leads to high cpu usage. ---default options osd_op_threads: 2, osd_disk_threads: 1, osd_recovery_threads: 1, filestore_op_threads: 2, thanks --top---iodepth=16- top - 15:27:34 up 2 days, 6:03, 2 users, load average: 0.49, 0.56, 0.62 Tasks: 97 total, 1 running, 96 sleeping, 0 stopped, 0 zombie Cpu(s): 19.0%us, 8.1%sy, 0.0%ni, 59.3%id, 12.1%wa, 0.0%hi, 0.8%si, 0.7%st Mem: 1922540k total, 1853180k used,69360k free, 7012k buffers Swap: 1048568k total,76796k used, 971772k free, 1034272k cached PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND 2763 root 20 0 1112m 386m 5028 S 60.8 20.6 200:43.47 ceph-osd -top top - 19:50:08 up 1 day, 10:26, 2 users, load average: 1.55, 0.97, 0.81 Tasks: 97 total, 1 running, 96 sleeping, 0 stopped, 0 zombie Cpu(s): 37.6%us, 14.2%sy, 0.0%ni, 37.0%id, 9.4%wa, 0.0%hi, 1.3%si, 0.5%st Mem: 1922540k total, 1820196k used, 102344k free,23100k buffers Swap: 1048568k total,91724k used, 956844k free, 1052292k cached PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND 4312 root 20 0 1100m 337m 5192 S 107.3 18.0 88:33.27 ceph-osd 1704 root 20 0 514m 272m 3648 S 0.7 14.5 3:27.19 ceph-mon --iostat-- Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 5.50 137.50 247.00 782.00 2896.00 8773.00 11.34 7.083.55 0.63 65.05 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 9.50 119.00 327.50 458.50 3940.00 4733.50 11.0312.03 19.66 0.70 55.40 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 15.5010.50 324.00 559.50 3784.00 3398.00 8.13 1.982.22 0.81 71.25 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 4.50 253.50 273.50 803.00 3056.00 12155.00 14.13 4.704.32 0.55 59.55 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 10.00 6.00 294.00 488.00 3200.00 2933.50 7.84 1.101.49 0.70 54.85 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 10.0014.00 333.00 645.00 3780.00 3846.00 7.80 2.132.15 0.90 87.55 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 11.00 240.50 259.00 579.00 3144.00 10035.50 15.73 8.51 10.18 0.84 70.20 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 10.5017.00 318.50 707.00 3876.00 4084.50 7.76 1.321.30 0.61 62.65 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 4.50 208.00 233.50 918.00 2648.00 19214.50 18.99 5.434.71 0.55 63.20 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 7.00 1.50 306.00 212.00 3376.00 2176.50 10.72 1.031.83 0.96 49.70 -- Software Engineer #42 @ 
http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
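For completeness, a sketch of the kind of fio comparison being discussed here (file name and sizes are illustrative; the point is only to vary --iodepth while watching CPU alongside):

    fio --name=qd-test --ioengine=libaio --direct=1 --rw=randwrite \
        --bs=4k --size=1g --runtime=60 --filename=/mnt/rbd/fio.dat --iodepth=1
    # repeat with --iodepth=16 and compare IOPS against `top` and `iostat -x 1`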
Re: [ceph-users] why one osd-op from client can get two osd-op-reply?
It's the recovery and backfill code. There's not one place; it's what most of the OSD code is for. On Thursday, September 11, 2014, yuelongguang fasts...@163.com wrote: as for the second question, could you tell me where the code is. how ceph makes size/min_szie copies? thanks At 2014-09-11 12:19:18, Gregory Farnum g...@inktank.com javascript:_e(%7B%7D,'cvml','g...@inktank.com'); wrote: On Wed, Sep 10, 2014 at 8:29 PM, yuelongguang fasts...@163.com javascript:_e(%7B%7D,'cvml','fasts...@163.com'); wrote: as for ack and ondisk, ceph has size and min_size to decide there are how many replications. if client receive ack or ondisk, which means there are at least min_size osds have done the ops? i am reading the cource code, could you help me with the two questions. 1. on osd, where is the code that reply ops separately according to ack or ondisk. i check the code, but i thought they always are replied together. It depends on what journaling mode you're in, but generally they're triggered separately (unless it goes on disk first, in which case it will skip the ack — this is the mode it uses for non-btrfs filesystems). The places where it actually replies are pretty clear about doing one or the other, though... 2. now i just know how client write ops to primary osd, inside osd cluster, how it promises min_size copy are reached. i mean when primary osd receives ops , how it spreads ops to others, and how it processes other's reply. That's not how it works. The primary for a PG will not go active with it until it has at least min_size copies that it knows about. Once the OSD is doing any processing of the PG, it requires all participating members to respond before it sends any messages back to the client. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com greg, thanks very much 在 2014-09-11 01:36:39,Gregory Farnum g...@inktank.com javascript:_e(%7B%7D,'cvml','g...@inktank.com'); 写道: The important bit there is actually near the end of the message output line, where the first says ack and the second says ondisk. I assume you're using btrfs; the ack is returned after the write is applied in-memory and readable by clients. The ondisk (commit) message is returned after it's durable to the journal or the backing filesystem. -Greg On Wednesday, September 10, 2014, yuelongguang fasts...@163.com javascript:_e(%7B%7D,'cvml','fasts...@163.com'); wrote: hi,all i recently debug ceph rbd, the log tells that one write to osd can get two if its reply. the difference between them is seq. why? thanks ---log- reader got message 6 0x7f58900010a0 osd_op_reply(15 rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~3145728] v211'518 uv518 ack = 0) v6 2014-09-10 08:47:32.348213 7f58bc16b700 20 -- 10.58.100.92:0/1047669 queue 0x7f58900010a0 prio 127 2014-09-10 08:47:32.348230 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader reading tag... 
2014-09-10 08:47:32.348245 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got MSG 2014-09-10 08:47:32.348257 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got envelope type=43 src osd.1 front=247 data=0 off 0 2014-09-10 08:47:32.348269 7f58bc16b700 10 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader wants 247 from dispatch throttler 247/104857600 2014-09-10 08:47:32.348286 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got front 247 2014-09-10 08:47:32.348303 7f58bc16b700 10 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).aborted = 0 2014-09-10 08:47:32.348312 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got 247 + 0 + 0 byte message 2014-09-10 08:47:32.348332 7f58bc16b700 10 check_message_signature: seq # = 7 front_crc_ = 3699418201 middle_crc = 0 data_crc = 0 2014-09-10 08:47:32.348369 7f58bc16b700 10 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got message 7 0x7f5890003660 osd_op_reply(15 rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~3145728] v211'518 uv518 ondisk = 0) v6 -- Software Engineer #42 @ http://inktank.com | http://ceph.com -- Software Engineer #42 @ http://inktank.com | http://ceph.com
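The journaling mode Greg refers to is a filestore setting; as a rough ceph.conf sketch (option names from the firefly-era filestore, so verify them against your version before relying on them):

[osd]
# pick one mode; parallel journaling is only safe on btrfs
filestore journal parallel = true      # apply and commit proceed independently, so a separate ack can arrive before ondisk
# filestore journal writeahead = true  # writes hit the journal first, so the ack is folded into the single ondisk reply

The replica counts the second question asks about are per-pool settings, e.g.:

ceph osd pool get rbd size
ceph osd pool get rbd min_size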
Re: [ceph-users] Cephfs upon Tiering
On Thu, Sep 11, 2014 at 4:13 AM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: Hi all, I am testing the tiering functionality with cephfs. I used a replicated cache with an EC data pool, and a replicated metadata pool like this: ceph osd pool create cache 1024 1024 ceph osd pool set cache size 2 ceph osd pool set cache min_size 1 ceph osd erasure-code-profile set profile11 k=8 m=3 ruleset-failure-domain=osd ceph osd pool create ecdata 128 128 erasure profile11 ceph osd tier add ecdata cache ceph osd tier cache-mode cache writeback ceph osd tier set-overlay ecdata cache ceph osd pool set cache hit_set_type bloom ceph osd pool set cache hit_set_count 1 ceph osd pool set cache hit_set_period 3600 ceph osd pool set cache target_max_bytes $((280*1024*1024*1024)) ceph osd pool create metadata 128 128 ceph osd pool set metadata crush_ruleset 1 # SSD root in crushmap ceph fs new ceph_fs metadata cache -- wrong ? I started testing with this, and this worked, I could write to it with cephfs and the cache was flushing to the ecdata pool as expected. But now I notice I made the fs right upon the cache, instead of the underlying data pool. I suppose I should have done this: ceph fs new ceph_fs metadata ecdata So my question is: Was this wrong and not doing the things I thought it did, or was this somehow handled by ceph and didn't it matter I specified the cache instead of the data pool? Well, it's sort of doing what you want it to. You've told the filesystem to use the cache pool as the location for all of its data. But RADOS is pushing everything in the cache pool down to the ecdata pool. So it'll work for now as you want. But if in future you wanted to stop using the caching pool, or switch it out for a different pool entirely, that wouldn't work (whereas it would if the fs was using ecdata). We should perhaps look at prevent use of cache pools like this...hrm... http://tracker.ceph.com/issues/9435 -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
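If you later do want to detach the cache tier (which is easier when the filesystem points at ecdata rather than at cache), the sequence is roughly the following, with the pool names from Kenneth's example:

# stop absorbing new writes into the cache
ceph osd tier cache-mode cache forward
# flush and evict everything down to the backing pool
rados -p cache cache-flush-evict-all
# detach the overlay and remove the tier relationship
ceph osd tier remove-overlay ecdata
ceph osd tier remove ecdata cache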
Re: [ceph-users] Cephfs upon Tiering
On Thu, Sep 11, 2014 at 11:39 AM, Sage Weil sw...@redhat.com wrote: On Thu, 11 Sep 2014, Gregory Farnum wrote: On Thu, Sep 11, 2014 at 4:13 AM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: Hi all, I am testing the tiering functionality with cephfs. I used a replicated cache with an EC data pool, and a replicated metadata pool like this: ceph osd pool create cache 1024 1024 ceph osd pool set cache size 2 ceph osd pool set cache min_size 1 ceph osd erasure-code-profile set profile11 k=8 m=3 ruleset-failure-domain=osd ceph osd pool create ecdata 128 128 erasure profile11 ceph osd tier add ecdata cache ceph osd tier cache-mode cache writeback ceph osd tier set-overlay ecdata cache ceph osd pool set cache hit_set_type bloom ceph osd pool set cache hit_set_count 1 ceph osd pool set cache hit_set_period 3600 ceph osd pool set cache target_max_bytes $((280*1024*1024*1024)) ceph osd pool create metadata 128 128 ceph osd pool set metadata crush_ruleset 1 # SSD root in crushmap ceph fs new ceph_fs metadata cache -- wrong ? I started testing with this, and this worked, I could write to it with cephfs and the cache was flushing to the ecdata pool as expected. But now I notice I made the fs right upon the cache, instead of the underlying data pool. I suppose I should have done this: ceph fs new ceph_fs metadata ecdata So my question is: Was this wrong and not doing the things I thought it did, or was this somehow handled by ceph and didn't it matter I specified the cache instead of the data pool? Well, it's sort of doing what you want it to. You've told the filesystem to use the cache pool as the location for all of its data. But RADOS is pushing everything in the cache pool down to the ecdata pool. So it'll work for now as you want. But if in future you wanted to stop using the caching pool, or switch it out for a different pool entirely, that wouldn't work (whereas it would if the fs was using ecdata). We should perhaps look at prevent use of cache pools like this...hrm... http://tracker.ceph.com/issues/9435 Should we? I was planning on doing exactly this for my home cluster. Not cache pools under CephFS, but specifying the cache pool as the data pool (rather than some underlying pool). Or is there some reason we might want the cache pool to be the one the filesystem is using for indexing? -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Upgraded now MDS won't start
On Wed, Sep 10, 2014 at 4:24 PM, McNamara, Bradley bradley.mcnam...@seattle.gov wrote: Hello, This is my first real issue since running Ceph for several months. Here's the situation: I've been running an Emperor cluster for several months. All was good. I decided to upgrade since I'm running Ubuntu 13.10 and 0.72.2. I decided to first upgrade Ceph to 0.80.4, which was the last version in the apt repository for 13.10. I upgrade the MON's, then the OSD servers to 0.80.4; all went as expected with no issues. The last thing I did was upgrade the MDS using the same process, but now the MDS won't start. I've tried to manually start the MDS with debugging on, and I have attached the file. It complains that it's looking for mds.0.20 need osdmap epoch 3602, have 3601. Anyway, I'd don't really use CephFS or RGW, so I don't need the MDS, but I'd like to have it. Can someone tell me how to fix it, or delete it, so I can start over when I do need it? Right now my cluster is HEALTH_WARN because of it. Uh, the log is from an MDS running Emperor. That one looks like it's complaining because the mds data formats got updated for Firefly. ;) You'll need to run debugging from a Firefly mds to try and get something useful. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
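To capture the log Greg is asking for, first make sure the binary being started really is the firefly one, then raise MDS debugging before restarting; a sketch (the mds id and init commands depend on your deployment):

ceph-mds --version            # should report 0.80.x
# in ceph.conf, temporarily:
#   [mds]
#   debug mds = 20
#   debug journaler = 20
service ceph start mds        # or the upstart form: start ceph-mds id=<name>
tail -f /var/log/ceph/ceph-mds.*.log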
[ceph-users] Cephfs upon Tiering
On Fri, Sep 12, 2014 at 1:53 AM, Kenneth Waegeman kenneth.waege...@ugent.be javascript:; wrote: - Message from Sage Weil sw...@redhat.com javascript:; - Date: Thu, 11 Sep 2014 14:10:46 -0700 (PDT) From: Sage Weil sw...@redhat.com javascript:; Subject: Re: [ceph-users] Cephfs upon Tiering To: Gregory Farnum g...@inktank.com javascript:; Cc: Kenneth Waegeman kenneth.waege...@ugent.be javascript:;, ceph-users ceph-users@lists.ceph.com javascript:; On Thu, 11 Sep 2014, Gregory Farnum wrote: On Thu, Sep 11, 2014 at 11:39 AM, Sage Weil sw...@redhat.com javascript:; wrote: On Thu, 11 Sep 2014, Gregory Farnum wrote: On Thu, Sep 11, 2014 at 4:13 AM, Kenneth Waegeman kenneth.waege...@ugent.be javascript:; wrote: Hi all, I am testing the tiering functionality with cephfs. I used a replicated cache with an EC data pool, and a replicated metadata pool like this: ceph osd pool create cache 1024 1024 ceph osd pool set cache size 2 ceph osd pool set cache min_size 1 ceph osd erasure-code-profile set profile11 k=8 m=3 ruleset-failure-domain=osd ceph osd pool create ecdata 128 128 erasure profile11 ceph osd tier add ecdata cache ceph osd tier cache-mode cache writeback ceph osd tier set-overlay ecdata cache ceph osd pool set cache hit_set_type bloom ceph osd pool set cache hit_set_count 1 ceph osd pool set cache hit_set_period 3600 ceph osd pool set cache target_max_bytes $((280*1024*1024*1024)) ceph osd pool create metadata 128 128 ceph osd pool set metadata crush_ruleset 1 # SSD root in crushmap ceph fs new ceph_fs metadata cache -- wrong ? I started testing with this, and this worked, I could write to it with cephfs and the cache was flushing to the ecdata pool as expected. But now I notice I made the fs right upon the cache, instead of the underlying data pool. I suppose I should have done this: ceph fs new ceph_fs metadata ecdata So my question is: Was this wrong and not doing the things I thought it did, or was this somehow handled by ceph and didn't it matter I specified the cache instead of the data pool? Well, it's sort of doing what you want it to. You've told the filesystem to use the cache pool as the location for all of its data. But RADOS is pushing everything in the cache pool down to the ecdata pool. So it'll work for now as you want. But if in future you wanted to stop using the caching pool, or switch it out for a different pool entirely, that wouldn't work (whereas it would if the fs was using ecdata). After this I tried with the 'ecdata' pool, which is not working because itself is an EC pool. So I guess specifying the cache pool is then indeed the only way, but that's ok then if that works. It is just a bit confusing to specify the cache pool rather than the data:) *blinks* Uh, yeah. I forgot about that check, which was added because somebody tried to use CephFS on an EC pool without a cache on top. We've obviously got some UI work to do. Thanks for the reminder! -Greg -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Showing package loss in ceph main log
Ceph messages are transmitted using tcp, so the system isn't directly aware of packet loss at any level. I suppose we could try and export messenger reconnect counts via the admin socket, but that'd be a very noisy measure -- it seems simplest to just query the OS or hardware directly? -Greg On Friday, September 12, 2014, Josef Johansson jo...@oderland.se wrote: Hi, I've stumpled upon this a couple of times, where Ceph just stops responding, but still works. The cause has been package loss on the network layer, but Ceph doesn't say anything. Is there a debug flag for showing retransmission of package, or someway to see that packages are lost? Regards, Josef ___ ceph-users mailing list ceph-users@lists.ceph.com javascript:; http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
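Until something like that exists, the OS counters are the quickest check on each Ceph host; for example (the interface name is an assumption):

netstat -s | grep -i retrans             # TCP retransmissions since boot
ethtool -S eth0 | grep -iE 'drop|err'    # NIC-level drops and errors
ip -s link show eth0                     # per-interface RX/TX error and drop counters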
Re: [ceph-users] a question regarding sparse file
On Fri, Sep 12, 2014 at 9:26 AM, brandon li brandon.li@gmail.com wrote: Hi, I am new to ceph file system, and have got a newbie question: For a sparse file, how could ceph file system know the hole in the file was never created or some stripe was just simply lost? CephFS does not keep any metadata to try and track that; it assumes that non-existent objects are supposed to be holes. It relies on RADOS not losing data. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
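The inode-plus-offset naming can be seen directly with rados; a sketch, assuming the file lives in a pool named data and uses the default 4 MB object size:

# the object name prefix is the file's inode number in hex
printf '%x\n' $(stat -c %i /mnt/cephfs/somefile)
# data objects are then named <inode-hex>.<object-index>, e.g. 10003f6.00000000, 10003f6.00000001, ...
rados -p data ls | grep '^10003f6\.'
# an object that was never written (a hole) simply does not exist, which is why CephFS cannot
# distinguish it from a lost object and relies on RADOS not losing anything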
Re: [ceph-users] CephFS : rm file does not remove object in rados
On Fri, Sep 12, 2014 at 6:49 AM, Florent Bautista flor...@coppint.com wrote: Hi all, Today I have a problem using CephFS. I use firefly last release, with kernel 3.16 client (Debian experimental). I have a directory in CephFS, associated to a pool pool2 (with set_layout). All is working fine, I can add and remove files, objects are stored in the right pool. But when Ceph cluster is overloaded (or for another reason, I don't know), sometimes when I remove a file, objects are not deleted in rados ! CephFS file removal is asynchronous with you removing it from the filesystem. The files get moved into a stray directory and will get deleted once nobody holds references to them any more. I explain : I want to remove a large directory, containing millions of files. For a moment, objects are really deleted in rados (I see it in rados df), but when I start to do some heavy operations (like moving volumes in rdb), objects are not deleted anymore, rados df returns a fixed number of objects. I can see that files are still deleting because I use rsync (rsync -avP --stats --delete /empty/dir/ /dir/to/delete/). What do you mean you're rsyncing and can see files deleting? I don't understand. Anyway, It's *possible* that the client is holding capabilities on the deleted files and isn't handing them back, in which case unmounting it would drop them (and then you could remount). I don't think we have any commands designed to hasten that, though. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
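A rough way to confirm this is deferred deletion rather than leaked data is to drop the client's references and then watch the pool drain; pool name as in Florent's setup:

umount /mnt/cephfs                      # the only client releases its caps on the unlinked inodes
watch -n 10 'rados df | grep pool2'     # the object count should fall as the MDS purges the strays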
Re: [ceph-users] osd crash: trim_objectcould not find coid
On Fri, Sep 12, 2014 at 4:41 AM, Francois Deppierraz franc...@ctrlaltdel.ch wrote: Hi, Following-up this issue, I've identified that almost all unfound objects belongs to a single RBD volume (with the help of the script below). Now what's the best way to try to recover the filesystem stored on this RBD volume? 'mark_unfound_lost revert' or 'mark_unfound_lost lost' and then running fsck? By the way, I'm also still interested to know whether the procedure I've tried with ceph_objectstore_tool was correct? Yeah, that was the correct procedure. I believe you should just need to mark osd.6 as lost and remove it from the cluster and it will give up on getting the pg back. (You may also need to force_create_pgs or something; I don't recall. The docs should discuss that, though.) Once you've given up on the objects, recovering data from rbd images which included them is just like recovering from a lost hard drive sector or whatever. Hopefully fsck in the VM leaves you with a working filesystem, and however many files are still present... -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com Thanks! François [1] ceph-list-unfound.sh #!/bin/sh for pg in $(ceph health detail | awk '/unfound$/ { print $2; }'); do ceph pg $pg list_missing | jq .objects done | jq -s add | jq '.[] | .oid.oid' On 11. 09. 14 11:05, Francois Deppierraz wrote: Hi Greg, An attempt to recover pg 3.3ef by copying it from broken osd.6 to working osd.32 resulted in one more broken osd :( Here's what was actually done: root@storage1:~# ceph pg 3.3ef list_missing | head { offset: { oid: , key: , snapid: 0, hash: 0, max: 0, pool: -1, namespace: }, num_missing: 219, num_unfound: 219, objects: [ [...] root@storage1:~# ceph pg 3.3ef query [...] might_have_unfound: [ { osd: 6, status: osd is down}, { osd: 19, status: already probed}, { osd: 32, status: already probed}, { osd: 42, status: already probed}], [...] 
# Exporting pg 3.3ef from broken osd.6 root@storage2:~# ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-6/ --journal-path /var/lib/ceph/osd/ssd0/6.journal --pgid 3.3ef --op export --file ~/backup/osd-6.pg-3.3ef.export # Remove an empty pg 3.3ef which was already present on this OSD root@storage2:~# service ceph stop osd.32 root@storage2:~# ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-32/ --journal-path /var/lib/ceph/osd/ssd0/32.journal --pgid 3.3ef --op remove # Import pg 3.3ef from dump root@storage2:~# ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-32/ --journal-path /var/lib/ceph/osd/ssd0/32.journal --op import --file ~/backup/osd-6.pg-3.3ef.export root@storage2:~# service ceph start osd.32 -1 2014-09-10 18:53:37.196262 7f13fdd7d780 5 osd.32 pg_epoch: 48366 pg[3.3ef(unlocked)] enter Initial 0 2014-09-10 18:53:37.239479 7f13fdd7d780 -1 *** Caught signal (Aborted) ** in thread 7f13fdd7d780 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60) 1: /usr/bin/ceph-osd() [0x8843da] 2: (()+0xfcb0) [0x7f13fcfabcb0] 3: (gsignal()+0x35) [0x7f13fb98a0d5] 4: (abort()+0x17b) [0x7f13fb98d83b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f13fc2dc69d] 6: (()+0xb5846) [0x7f13fc2da846] 7: (()+0xb5873) [0x7f13fc2da873] 8: (()+0xb596e) [0x7f13fc2da96e] 9: /usr/bin/ceph-osd() [0x94b34f] 10: (pg_log_entry_t::decode_with_checksum(ceph::buffer::list::iterator)+0x12c) [0x691b6c] 11: (PGLog::read_log(ObjectStore*, coll_t, hobject_t, pg_info_t const, std::mapeversion_t, hobject_t, std::lesseversion_t, std::allocatorstd::paireversion_t const, hobject_t , PGLog::IndexedLog, pg_missing_t, std::basic_ostringstreamchar, std::char_traitschar, std::allocatorchar , std::setstd::string, std::lessstd:: string, std::allocatorstd::string *)+0x16d4) [0x7d3ef4] 12: (PG::read_state(ObjectStore*, ceph::buffer::list)+0x2c1) [0x7951b1] 13: (OSD::load_pgs()+0x18f3) [0x61e143] 14: (OSD::init()+0x1b9a) [0x62726a] 15: (main()+0x1e8d) [0x5d2d0d] 16: (__libc_start_main()+0xed) [0x7f13fb97576d] 17: /usr/bin/ceph-osd() [0x5d69d9] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. Fortunately it was possible to bring back osd.32 into a working state simply be removing this pg. root@storage2:~# ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-32/ --journal-path /var/lib/ceph/osd/ssd0/32.journal --pgid 3.3ef --op remove Did I miss something from this procedure or does it mean that this pg is definitely lost? Thanks! François On 09. 09. 14 00:23, Gregory Farnum wrote: On Mon, Sep 8, 2014 at 2:53 PM, Francois Deppierraz franc...@ctrlaltdel.ch wrote: Hi Greg, Thanks for your support! On 08. 09. 14 20:20, Gregory Farnum wrote
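For reference, the "give up on the objects" path looks roughly like this, with the osd and pg ids from François' mails (adjust to your cluster, and note these steps discard data):

ceph osd lost 6 --yes-i-really-mean-it      # declare osd.6 permanently gone
ceph osd crush remove osd.6
ceph auth del osd.6
ceph osd rm 6
# then roll the unfound objects back to an older version
ceph pg 3.3ef mark_unfound_lost revert
# and finally fsck the filesystem inside the affected RBD-backed VM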
Re: [ceph-users] Removing MDS
You can turn off the MDS and create a new FS in new pools. The ability to shut down a filesystem more completely is coming in Giant. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Fri, Sep 12, 2014 at 1:16 PM, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: We were building a test cluster here, and I enabled MDS in order to use ceph-fuse to fill the cluster with data. It seems the metadata server is having problems, so I figured I’d just remove it and rebuild it. However, the “ceph-deploy mds destroy” command is not implemented; it appears that once you have created an MDS, you can’t get rid of it without demolishing your entire cluster and building from scratch. And since the cluster is already out of whack, there seems to be no way to even drop OSDs to restart it cleanly. Should I just reboot all the OSD nodes and the monitor node, and hope the cluster comes up in a usable fashion? Because there seems no other option short of the burn-down and rebuild. -- CONFIDENTIALITY NOTICE: If you have received this email in error, please immediately notify the sender by e-mail at the address shown. This email transmission may contain confidential information. This information is intended only for the use of the individual(s) or entity to whom it is intended even if addressed incorrectly. Please delete it from your files if you are not the intended recipient. Thank you for your compliance. Copyright (c) 2014 Cigna == ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
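In pre-Giant releases the practical route is to stop the daemon and then point the MDS map at fresh pools; a sketch (pool ids are placeholders, and note that newfs throws away the existing filesystem metadata):

service ceph stop mds
ceph osd pool create newmeta 128 128
ceph osd pool create newdata 128 128
ceph osd lspools                          # note the numeric ids of the two new pools
ceph mds newfs <metadata-pool-id> <data-pool-id> --yes-i-really-mean-it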
Re: [ceph-users] why no likely() and unlikely() used in Ceph's source code?
I don't know where the file came from, but likely/unlikely markers are the kind of micro-optimization that isn't worth the cost in Ceph dev resources right now. -Greg On Monday, September 15, 2014, Tim Zhang cofol1...@gmail.com wrote: Hey guys, After reading ceph source code, I find that there is a file named common/likely.h and it implements the function likely() and unlikey() which will optimize the prediction of code branch for cpu. But there isn't any place using these two functions, I am curious about why the developer of ceph not using these two functions to achieve more performance. Can anyone provide some hints? BR -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Dumpling cluster can't resolve peering failures, ceph pg query blocks, auth failures in logs
Not sure, but have you checked the clocks on their nodes? Extreme clock drift often results in strange cephx errors. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Sun, Sep 14, 2014 at 11:03 PM, Florian Haas flor...@hastexo.com wrote: Hi everyone, [Keeping this on the -users list for now. Let me know if I should cross-post to -devel.] I've been asked to help out on a Dumpling cluster (a system bequeathed by one admin to the next, currently on 0.67.10, was originally installed with 0.67.5 and subsequently updated a few times), and I'm seeing a rather odd issue there. The cluster is relatively small, 3 MONs, 4 OSD nodes; each OSD node hosts a rather non-ideal 12 OSDs but its performance issues aren't really the point here. ceph health detail shows a bunch of PGs peering, but the usual troubleshooting steps don't really seem to work. For some PGs, ceph pg pgid query just blocks, doesn't return anything. Adding --debug_ms=10 shows that it's simply not getting a response back from one of the OSDs it's trying to talk to, as if packets dropped on the floor or were filtered out. However, opening a simple TCP connection to the OSD's IP and port works perfectly fine (netcat returns a Ceph signature). (Note, though, that because of a daemon flapping issue they at some point set both noout and nodown, so the cluster may not be behaving as normally expected when OSDs fail to respond in time.) Then there are some PGs where ceph pg pgid query is a little more verbose, though not exactly more successful: From ceph health detail: pg 6.c10 is stuck inactive for 1477.781394, current state peering, last acting [85,16] ceph pg 6.b1 query: 2014-09-15 01:06:48.200418 7f29a6efc700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption 2014-09-15 01:06:48.200428 7f29a6efc700 0 -- 10.47.17.1:0/1020420 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43263 s=1 pgs=0 cs=0 l=1 c=0x2c00d90).failed verifying authorize reply 2014-09-15 01:06:48.200465 7f29a6efc700 0 -- 10.47.17.1:0/1020420 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43263 s=1 pgs=0 cs=0 l=1 c=0x2c00d90).fault 2014-09-15 01:06:48.201000 7f29a6efc700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption 2014-09-15 01:06:48.201008 7f29a6efc700 0 -- 10.47.17.1:0/1020420 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43264 s=1 pgs=0 cs=0 l=1 c=0x2c00d90).failed verifying authorize reply Oops. Now the admins swear they didn't touch the keys, but they are also (understandably) reluctant to just kill and redeploy all those OSDs, as these issues are basically scattered over a bunch of PGs touching many OSDs. How would they pinpoint this to be sure that they're not being bitten by a bug or misconfiguration? Not sure if people have seen this before — if so, I'd be grateful for some input. Loïc, Sébastien perhaps? Or João, Greg, Sage? Thanks in advance for any insight people might be able to share. :) Cheers, Florian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
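Concretely that means comparing clocks across the mons and OSD hosts and making sure ntp is actually synchronized; for example (the host list is an assumption):

for h in mon1 mon2 mon3 osd1 osd2 osd3 osd4; do ssh $h date +%s.%N; done
ntpq -pn                              # run on each node; look for a selected (*) peer and small offsets
ceph health detail | grep -i clock    # the mons flag skew beyond mon_clock_drift_allowed themselves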
Re: [ceph-users] OSD troubles on FS+Tiering
The pidfile bug is already fixed in master/giant branches. As for the crashing, I'd try killing all the osd processes and turning them back on again. It might just be some daemon restart failed, or your cluster could be sufficiently overloaded that the node disks are going unresponsive and they're suiciding, or... -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Sep 15, 2014 at 5:43 AM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: Hi, I have some strange OSD problems. Before the weekend I started some rsync tests over CephFS, on a cache pool with underlying EC KV pool. Today the cluster is completely degraded: [root@ceph003 ~]# ceph status cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d health HEALTH_WARN 19 pgs backfill_toofull; 403 pgs degraded; 168 pgs down; 8 pgs incomplete; 168 pgs peering; 61 pgs stale; 403 pgs stuck degraded; 176 pgs stuck inactive; 61 pgs stuck stale; 589 pgs stuck unclean; 403 pgs stuck undersized; 403 pgs undersized; 300 requests are blocked 32 sec; recovery 15170/27902361 objects degraded (0.054%); 1922/27902361 objects misplaced (0.007%); 1 near full osd(s) monmap e1: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 8, quorum 0,1,2 ceph001,ceph002,ceph003 mdsmap e5: 1/1/1 up {0=ceph003=up:active}, 2 up:standby osdmap e719: 48 osds: 18 up, 18 in pgmap v144887: 1344 pgs, 4 pools, 4139 GB data, 2624 kobjects 2282 GB used, 31397 GB / 33680 GB avail 15170/27902361 objects degraded (0.054%); 1922/27902361 objects misplaced (0.007%) 68 down+remapped+peering 1 active 754 active+clean 1 stale+incomplete 1 stale+active+clean+scrubbing 14 active+undersized+degraded+remapped 7 incomplete 100 down+peering 9 active+remapped 59 stale+active+undersized+degraded 19 active+undersized+degraded+remapped+backfill_toofull 311 active+undersized+degraded I tried to figure out what happened in the global logs: 2014-09-13 08:01:19.433313 mon.0 10.141.8.180:6789/0 66076 : [INF] pgmap v65892: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 4159 kB/s wr, 45 op/s 2014-09-13 08:01:20.443019 mon.0 10.141.8.180:6789/0 66078 : [INF] pgmap v65893: 1344 pgs: 1344 2014-09-13 08:01:20.443019 mon.0 10.141.8.180:6789/0 66078 : [INF] pgmap v65893: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 561 kB/s wr, 11 op/s 2014-09-13 08:01:20.777988 mon.0 10.141.8.180:6789/0 66081 : [INF] osd.19 10.141.8.181:6809/29664 failed (3 reports from 3 peers after 20.79 = grace 20.00) 2014-09-13 08:01:21.455887 mon.0 10.141.8.180:6789/0 66083 : [INF] osdmap e117: 48 osds: 47 up, 48 in 2014-09-13 08:01:21.462084 mon.0 10.141.8.180:6789/0 66084 : [INF] pgmap v65894: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 1353 kB/s wr, 13 op/s 2014-09-13 08:01:21.477007 mon.0 10.141.8.180:6789/0 66085 : [INF] pgmap v65895: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 2300 kB/s wr, 21 op/s 2014-09-13 08:01:22.456055 mon.0 10.141.8.180:6789/0 66086 : [INF] osdmap e118: 48 osds: 47 up, 48 in 2014-09-13 08:01:22.462590 mon.0 10.141.8.180:6789/0 66087 : [INF] pgmap v65896: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 13686 kB/s wr, 5 op/s 2014-09-13 08:01:23.464302 mon.0 10.141.8.180:6789/0 66088 : [INF] pgmap v65897: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 11075 kB/s wr, 
4 op/s 2014-09-13 08:01:24.477467 mon.0 10.141.8.180:6789/0 66089 : [INF] pgmap v65898: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 4932 kB/s wr, 38 op/s 2014-09-13 08:01:25.481027 mon.0 10.141.8.180:6789/0 66090 : [INF] pgmap v65899: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 5726 kB/s wr, 64 op/s 2014-09-13 08:01:19.336173 osd.1 10.141.8.180:6803/26712 54442 : [WRN] 1 slow requests, 1 included below; oldest blocked for 30.000137 secs 2014-09-13 08:01:19.336341 osd.1 10.141.8.180:6803/26712 54443 : [WRN] slow request 30.000137 seconds old, received at 2014-09-13 08:00:49.335339: osd_op(client.7448.1:17751783 1203eac.000e [write 0~319488 [1@-1],startsync 0~0] 1.b 6c3a3a9 snapc 1=[] ondisk+write e116) currently reached pg 2014-09-13 08:01:20.337602 osd.1 10.141.8.180:6803/26712 5 : [WRN] 7 slow requests, 6 included below; oldest blocked for 31.001947 secs 2014-09-13 08:01:20.337688 osd.1 10.141.8.180:6803/26712 54445 : [WRN] slow
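A sketch of the "restart everything and find out why they died" approach, run on each OSD host (init commands vary by distro, so adjust):

service ceph restart osd                  # restart every OSD defined on this host
# look for heartbeat/suicide-timeout aborts or assertion failures in the daemon logs
grep -lE 'suicide timeout|had timed out|FAILED assert' /var/log/ceph/ceph-osd.*.log
# and check the kernel log for disks going unresponsive
dmesg | grep -iE 'blocked for more than|i/o error'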
Re: [ceph-users] Cephfs upon Tiering
On Mon, Sep 15, 2014 at 6:32 AM, Berant Lemmenes ber...@lemmenes.com wrote: Greg, So is the consensus that the appropriate way to implement this scenario is to have the fs created on the EC backing pool vs. the cache pool but that the UI check needs to be tweaked to distinguish between this scenario and just trying to use a EC pool alone? Yeah, we'll fix this for Giant. In practical terms it doesn't make much difference right now; just want to be consistent for the future. :) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com I'm also interested in the scenario of having a EC backed pool fronted by a replicated cache for use with cephfs. Thanks, Berant On Fri, Sep 12, 2014 at 12:37 PM, Gregory Farnum g...@inktank.com wrote: On Fri, Sep 12, 2014 at 1:53 AM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: - Message from Sage Weil sw...@redhat.com - Date: Thu, 11 Sep 2014 14:10:46 -0700 (PDT) From: Sage Weil sw...@redhat.com Subject: Re: [ceph-users] Cephfs upon Tiering To: Gregory Farnum g...@inktank.com Cc: Kenneth Waegeman kenneth.waege...@ugent.be, ceph-users ceph-users@lists.ceph.com On Thu, 11 Sep 2014, Gregory Farnum wrote: On Thu, Sep 11, 2014 at 11:39 AM, Sage Weil sw...@redhat.com wrote: On Thu, 11 Sep 2014, Gregory Farnum wrote: On Thu, Sep 11, 2014 at 4:13 AM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: Hi all, I am testing the tiering functionality with cephfs. I used a replicated cache with an EC data pool, and a replicated metadata pool like this: ceph osd pool create cache 1024 1024 ceph osd pool set cache size 2 ceph osd pool set cache min_size 1 ceph osd erasure-code-profile set profile11 k=8 m=3 ruleset-failure-domain=osd ceph osd pool create ecdata 128 128 erasure profile11 ceph osd tier add ecdata cache ceph osd tier cache-mode cache writeback ceph osd tier set-overlay ecdata cache ceph osd pool set cache hit_set_type bloom ceph osd pool set cache hit_set_count 1 ceph osd pool set cache hit_set_period 3600 ceph osd pool set cache target_max_bytes $((280*1024*1024*1024)) ceph osd pool create metadata 128 128 ceph osd pool set metadata crush_ruleset 1 # SSD root in crushmap ceph fs new ceph_fs metadata cache -- wrong ? I started testing with this, and this worked, I could write to it with cephfs and the cache was flushing to the ecdata pool as expected. But now I notice I made the fs right upon the cache, instead of the underlying data pool. I suppose I should have done this: ceph fs new ceph_fs metadata ecdata So my question is: Was this wrong and not doing the things I thought it did, or was this somehow handled by ceph and didn't it matter I specified the cache instead of the data pool? Well, it's sort of doing what you want it to. You've told the filesystem to use the cache pool as the location for all of its data. But RADOS is pushing everything in the cache pool down to the ecdata pool. So it'll work for now as you want. But if in future you wanted to stop using the caching pool, or switch it out for a different pool entirely, that wouldn't work (whereas it would if the fs was using ecdata). After this I tried with the 'ecdata' pool, which is not working because itself is an EC pool. So I guess specifying the cache pool is then indeed the only way, but that's ok then if that works. It is just a bit confusing to specify the cache pool rather than the data:) *blinks* Uh, yeah. I forgot about that check, which was added because somebody tried to use CephFS on an EC pool without a cache on top. We've obviously got some UI work to do. 
Thanks for the reminder! -Greg -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] does CephFS still have no fsck utility?
CephFS in general has a lot fewer metadata structures than traditional filesystems generally do; about the only thing that could go wrong without users noticing directly is: 1) The data gets corrupted 2) Files somehow get removed from folders. Data corruption is something RADOS is responsible for detecting through its scrub processes and things. If CephFS actually dropped a file, yeah, that'd be a problem which we don't have other mechanisms of detecting at this time. But the more traditional sort of fsck activities like looking for doubly-linked blocks or multiply-allocated inodes are more or less impossible given our decentralized architecture and lack of stored metadata (for instance, data blocks are just objects whose name is calculated based on the inode number and the offset within the file). If it makes you feel better, fsck is something I've been working on a lot recently, based on the design discussions we had early last year. The first pass is just a scrubbing mechanism to make sure that the hierarchy is self-consistent and the referenced RADOS objects actually exist; later we'll move on to checking that each RADOS object is associated with a particular file. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Sep 15, 2014 at 4:15 PM, brandon li brandon.li@gmail.com wrote: Thanks for the reply, Greg. With traditional file system experience, I have to admit it will take me some time to get used to the way CephFS works. Considering it as part of my learning curve. :-) One of concerns I have it that, without tools like fsck, how could we know the file system is still consistent? Even RADOS doesn't report error, could there be any miscommunication(e.g., due to bug, networking issue, disk bit flip, ...) between metadata operation and stripe I/O? For example, the first stripe of a file is created but its inode id(on RADOS) is wrong for some reason, and thus RADOS doesn't think it belongs to the correct file. This may never happen and I just use it here to explain my concern. Thanks, Brandon On Mon, Sep 15, 2014 at 3:49 PM, Gregory Farnum g...@inktank.com wrote: On Mon, Sep 15, 2014 at 3:23 PM, brandon li brandon.li@gmail.com wrote: If it's true, is there any other tools I can use to check and repair the file system? Not much, no. That said, you shouldn't really need an fsck unless the underlying RADOS store went through some catastrophic event. Is there anything in particular you're worried about? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
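The RADOS-level scrubbing mentioned above can also be driven by hand if you want to force a consistency pass; for example:

ceph pg scrub <pgid>          # compare object metadata/sizes across replicas
ceph pg deep-scrub <pgid>     # read and checksum the actual data
ceph osd deep-scrub <osd-id>  # push every pg on one OSD through a deep scrub
# mismatches then show up as inconsistent pgs / scrub errors in ceph health detail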
Re: [ceph-users] OSD troubles on FS+Tiering
Heh, you'll have to talk to Haomai about issues with the KeyValueStore, but I know he's found a number of issues in the version of it that went to 0.85. In future please flag when you're running with experimental stuff; it helps direct attention to the right places! ;) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Sep 16, 2014 at 5:28 AM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: - Message from Gregory Farnum g...@inktank.com - Date: Mon, 15 Sep 2014 10:37:07 -0700 From: Gregory Farnum g...@inktank.com Subject: Re: [ceph-users] OSD troubles on FS+Tiering To: Kenneth Waegeman kenneth.waege...@ugent.be Cc: ceph-users ceph-users@lists.ceph.com The pidfile bug is already fixed in master/giant branches. As for the crashing, I'd try killing all the osd processes and turning them back on again. It might just be some daemon restart failed, or your cluster could be sufficiently overloaded that the node disks are going unresponsive and they're suiciding, or... I restarted them that way, and they eventually got clean again. 'ceph status' printed that 'ecdata' pool had too few pgs, so I changed the amount of pgs from 128 to 256 (with EC k+m=11) After a few minutes I checked the cluster state again: [root@ceph001 ~]# ceph status cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d health HEALTH_WARN 100 pgs down; 155 pgs peering; 81 pgs stale; 240 pgs stuck inactive; 81 pgs stuck stale; 240 pgs stuck unclean; 746 requests are blocked 32 sec; 'cache' at/near target max; pool ecdata pg_num 256 pgp_num 128 monmap e1: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 8, quorum 0,1,2 ceph001,ceph002,ceph003 mdsmap e6993: 1/1/1 up {0=ceph003=up:active}, 2 up:standby osdmap e11023: 48 osds: 14 up, 14 in pgmap v160466: 1472 pgs, 4 pools, 3899 GB data, 2374 kobjects 624 GB used, 7615 GB / 8240 GB avail 75 creating 1215 active+clean 100 down+peering 1 active+clean+scrubbing 10 stale 16 stale+active+clean Again 34 OSDS are down.. 
This time I have the error log, I checked a few osd logs : I checked the first host that was marked down: -17 2014-09-16 13:27:49.962938 7f5dfe6a3700 5 osd.7 pg_epoch: 8912 pg[2.b0s3(unlocked)] enter Initial -16 2014-09-16 13:27:50.008842 7f5e02eac700 1 -- 10.143.8.180:6833/53810 == osd.30 10.141.8.181:0/37396 2524 osd_ping(ping e8912 stamp 2014-09-16 13:27:50.008514) v2 47+0+0 (386299 0 0) 0x18ef7080 con 0x6961600 -15 2014-09-16 13:27:50.008892 7f5e02eac700 1 -- 10.143.8.180:6833/53810 -- 10.141.8.181:0/37396 -- osd_ping(ping_reply e8912 stamp 2014-09-16 13:27:50.008514) v2 -- ?+0 0x7326900 con 0x6961600 -14 2014-09-16 13:27:50.009159 7f5e046af700 1 -- 10.141.8.180:6847/53810 == osd.30 10.141.8.181:0/37396 2524 osd_ping(ping e8912 stamp 2014-09-16 13:27:50.008514) v2 47+0+0 (386299 0 0) 0x2210a760 con 0xadd0420 -13 2014-09-16 13:27:50.009202 7f5e046af700 1 -- 10.141.8.180:6847/53810 -- 10.141.8.181:0/37396 -- osd_ping(ping_reply e8912 stamp 2014-09-16 13:27:50.008514) v2 -- ?+0 0x14e35a00 con 0xadd0420 -12 2014-09-16 13:27:50.034378 7f5dfeea4700 5 osd.7 pg_epoch: 8912 pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104 les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912 pi=104-8911/54 crt=8864'33359 inactive NOTIFY] exit Reset 0.127612 1 0.000123 -11 2014-09-16 13:27:50.034432 7f5dfeea4700 5 osd.7 pg_epoch: 8912 pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104 les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912 pi=104-8911/54 crt=8864'33359 inactive NOTIFY] enter Started -10 2014-09-16 13:27:50.034452 7f5dfeea4700 5 osd.7 pg_epoch: 8912 pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104 les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912 pi=104-8911/54 crt=8864'33359 inactive NOTIFY] enter Start -9 2014-09-16 13:27:50.034469 7f5dfeea4700 1 osd.7 pg_epoch: 8912 pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104 les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912 pi=104-8911/54 crt=8864'33359 inactive NOTIFY] stateStart: transitioning to Stray -8 2014-09-16 13:27:50.034491 7f5dfeea4700 5 osd.7 pg_epoch: 8912 pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104 les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912 pi=104-8911/54 crt=8864'33359 inactive NOTIFY] exit Start 0.38 0 0.00 -7 2014-09-16 13:27:50.034521 7f5dfeea4700 5 osd.7 pg_epoch: 8912 pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104 les/c 813/815 805/8912
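If you are not sure which OSDs are running on the experimental backend, a running daemon will tell you; e.g.:

ceph daemon osd.7 config get osd_objectstore     # reports the backend, e.g. filestore or the experimental key/value store
grep -i objectstore /etc/ceph/ceph.conf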
Re: [ceph-users] does CephFS still have no fsck utility?
http://tracker.ceph.com/issues/4137 contains links to all the tasks we have so far. You can also search any of the ceph-devel list archives for forward scrub. On Mon, Sep 15, 2014 at 10:16 PM, brandon li brandon.li@gmail.com wrote: Great to know you are working on it! I am new to the mailing list. Is there any reference of discussion last year, so I can look into. or any bug number I can watch to keep track of the development? Thanks, Brandon ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] what are these files for mon?
I don't really know; Joao has handled all these cases. I *think* they've been tied to a few bad versions of LevelDB, but I'm not certain. (There were a number of discussions about it on the public mailing lists.) -Greg On Tuesday, September 16, 2014, Florian Haas flor...@hastexo.com wrote: Hi Greg, just picked up this one from the archive while researching a different issue and thought I'd follow up. On Tue, Aug 19, 2014 at 6:24 PM, Gregory Farnum g...@inktank.com javascript:; wrote: The sst files are files used by leveldb to store its data; you cannot remove them. Are you running on a very small VM? How much space are the files taking up in aggregate? Speaking generally, I think you should see something less than a GB worth of data there, but some versions of leveldb under some scenarios are known to misbehave and grow pretty large. Can you elaborate on the scenarios where leveldb is misbehaving? I've also seen reports of this before, with .sst files growing to several GB in size. Is this a cause for concern (for example, would you expect mons to slow down) and if so, how would you recover? Would you essentially nuke the mon and replace it with another? Cheers, Florian -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
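For anyone hitting this, two quick things to look at are the size of the store and whether an explicit compaction helps; a sketch (mon id and paths are assumptions, and the compact command may not exist on very old releases):

du -sh /var/lib/ceph/mon/ceph-*/store.db      # how big has leveldb grown?
ceph tell mon.<id> compact                    # ask the monitor to compact its store online
# or in ceph.conf:
#   [mon]
#   mon compact on start = true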
Re: [ceph-users] Mount ceph block device over specific NIC
Assuming you're using the kernel? In any case, Ceph generally doesn't do anything to select between different NICs; it just asks for a connection to a given IP. So you should just be able to set up a route for that IP. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Sep 16, 2014 at 4:13 AM, Arne K. Haaje a...@drlinux.no wrote: Hello, We have a machine that mounts a rbd image as a block device, then rsync files from another server to this mount. As this rsync traffic will have to share bandwith with the writing to the RBD, I wonder if it is possible to specify which NIC to mount the RBD through? We are using 0.85.5 on Ubuntu 14.04. Regards, Arne ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
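So the practical approach is a host route that steers traffic for the Ceph public network out of the NIC you want; a sketch with made-up addresses:

ip route add 10.10.10.0/24 dev eth1 src 10.10.10.25   # send traffic for the mon/OSD network via eth1
ip route get 10.10.10.55                              # verify which interface a given OSD/mon IP will use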
Re: [ceph-users] Still seing scrub errors in .80.5
On Tue, Sep 16, 2014 at 12:03 AM, Marc m...@shoowin.de wrote: Hello fellow cephalopods, every deep scrub seems to dig up inconsistencies (i.e. scrub errors) that we could use some help with diagnosing. I understand there used to be a data corruption issue before .80.3 so we made sure that all the nodes were upgraded to .80.5 and all the daemons were restarted (they all report .80.5 when contacted via socket). *After* that we ran a deep scrub, which obviously found errors, which we then repaired. But unfortunately, it's now a week later, and the next deep scrub has dug up new errors, which shouldn't have happened I think...? ceph.log shows these errors in between the deep scrub messages: 2014-09-15 07:56:23.164818 osd.15 10.10.10.55:6804/23853 364 : [ERR] 3.335 shard 2: soid 6ba68735/rbd_data.59e3c2ae8944a.06b1/head//3 digest 3090820441 != known digest 3787996302 2014-09-15 07:56:23.164827 osd.15 10.10.10.55:6804/23853 365 : [ERR] 3.335 shard 6: soid 6ba68735/rbd_data.59e3c2ae8944a.06b1/head//3 digest 3259686791 != known digest 3787996302 2014-09-15 07:56:28.485713 osd.15 10.10.10.55:6804/23853 366 : [ERR] 3.335 deep-scrub 0 missing, 1 inconsistent objects 2014-09-15 07:56:28.485734 osd.15 10.10.10.55:6804/23853 367 : [ERR] 3.335 deep-scrub 2 errors Uh, I'm afraid those errors were never output as a result of bugs in Firefly. These are indicating actual data differences between the nodes, whereas the Firefly issue was a metadata flag that wasn't handled properly in mixed-version OSD clusters. I don't think Ceph has ever had a bug that would change the data payload between OSDs. Searching the tracker logs, the only entries with this error message are: 1) The local filesystem is not misbehaving under the workload we give it (and there are no known filesystem issues that are exposed by running firefly OSDs in default config that I can think of — certainly none with this error) 2) The disks themselves are bad. :/ -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
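Checking for option 2 (failing hardware) is cheap and worth doing first on the hosts holding the affected OSDs; the device names below are assumptions:

dmesg | grep -iE 'i/o error|ata.*error|xfs'           # kernel-level I/O or filesystem errors
smartctl -a /dev/sdb | grep -iE 'reallocated|pending|uncorrect|overall-health'
df -h /var/lib/ceph/osd/ceph-15                       # map an OSD id back to its mount and device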
Re: [ceph-users] Still seing scrub errors in .80.5
Ah, you're right — it wasn't popping up in the same searches and I'd forgotten that was so recent. In that case, did you actually deep scrub *everything* in the cluster, Marc? You'll need to run and fix every PG in the cluster, and the background deep scrubbing doesn't move through the data very quickly. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Sep 16, 2014 at 11:32 AM, Dan Van Der Ster daniel.vanders...@cern.ch wrote: Hi Greg, I believe Marc is referring to the corruption triggered by set_extsize on xfs. That option was disabled by default in 0.80.4... See the thread firefly scrub error. Cheers, Dan From: Gregory Farnum g...@inktank.com Sent: Sep 16, 2014 8:15 PM To: Marc Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Still seing scrub errors in .80.5 On Tue, Sep 16, 2014 at 12:03 AM, Marc m...@shoowin.de wrote: Hello fellow cephalopods, every deep scrub seems to dig up inconsistencies (i.e. scrub errors) that we could use some help with diagnosing. I understand there used to be a data corruption issue before .80.3 so we made sure that all the nodes were upgraded to .80.5 and all the daemons were restarted (they all report .80.5 when contacted via socket). *After* that we ran a deep scrub, which obviously found errors, which we then repaired. But unfortunately, it's now a week later, and the next deep scrub has dug up new errors, which shouldn't have happened I think...? ceph.log shows these errors in between the deep scrub messages: 2014-09-15 07:56:23.164818 osd.15 10.10.10.55:6804/23853 364 : [ERR] 3.335 shard 2: soid 6ba68735/rbd_data.59e3c2ae8944a.06b1/head//3 digest 3090820441 != known digest 3787996302 2014-09-15 07:56:23.164827 osd.15 10.10.10.55:6804/23853 365 : [ERR] 3.335 shard 6: soid 6ba68735/rbd_data.59e3c2ae8944a.06b1/head//3 digest 3259686791 != known digest 3787996302 2014-09-15 07:56:28.485713 osd.15 10.10.10.55:6804/23853 366 : [ERR] 3.335 deep-scrub 0 missing, 1 inconsistent objects 2014-09-15 07:56:28.485734 osd.15 10.10.10.55:6804/23853 367 : [ERR] 3.335 deep-scrub 2 errors Uh, I'm afraid those errors were never output as a result of bugs in Firefly. These are indicating actual data differences between the nodes, whereas the Firefly issue was a metadata flag that wasn't handled properly in mixed-version OSD clusters. I don't think Ceph has ever had a bug that would change the data payload between OSDs. Searching the tracker logs, the only entries with this error message are: 1) The local filesystem is not misbehaving under the workload we give it (and there are no known filesystem issues that are exposed by running firefly OSDs in default config that I can think of — certainly none with this error) 2) The disks themselves are bad. :/ -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
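To push the whole cluster through a deep scrub instead of waiting for the background schedule, one per-OSD loop is the following (expect a lot of extra read I/O while it runs):

for osd in $(ceph osd ls); do ceph osd deep-scrub $osd; done
ceph -w | grep -i deep-scrub      # watch completions and any new errors roll in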
Re: [ceph-users] Packages for 0.85?
Thanks for the poke; looks like something went wrong during the release build last week. We're investigating now. -Greg On Tue, Sep 16, 2014 at 11:08 AM, Daniel Swarbrick daniel.swarbr...@profitbricks.com wrote: Hi, I saw that the development snapshot 0.85 was released last week, and have been patiently waiting for packages to appear, so that I can upgrade a test cluster here. Can we still expect packages (wheezy, in my case) of 0.85 to be published? Thanks! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Replication factor of 50 on a 1000 OSD node cluster
On Tue, Sep 16, 2014 at 5:10 PM, JIten Shah jshah2...@me.com wrote: Hi Guys, We have a cluster with 1000 OSD nodes and 5 MON nodes and 1 MDS node. In order to be able to loose quite a few OSD’s and still survive the load, we were thinking of making the replication factor to 50. Is that too big of a number? what is the performance implications and any other issues that we should consider before setting it to that. Also, do we need the same number of metadata copies too or it can be less? Don't do that. Every write has to be synchronously copied to every replica, so 50x replication will give you very high latencies and very low write bandwidth to each object. If you're just worried about not losing data, there are a lot of people with big clusters running 3x replication and it's been fine. If you have some use case where you think you're going to be turning off a bunch of nodes simultaneously without planning, Ceph might not be the storage system for your needs. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Replication factor of 50 on a 1000 OSD node cluster
Yeah, so generally those will be correlated with some failure domain, and if you spread your replicas across failure domains you won't hit any issues. And if hosts are down for any length of time the OSDs will re-replicate data to keep it at proper redundancy. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Sep 16, 2014 at 5:53 PM, JIten Shah jshah2...@icloud.com wrote: Thanks Greg. We may not turn off the nodes randomly without planning but with a 1000 node cluster, there could be 5 to 10 hosts that might crash or go down in case of an event. —Jiten On Sep 16, 2014, at 5:35 PM, Gregory Farnum g...@inktank.com wrote: On Tue, Sep 16, 2014 at 5:10 PM, JIten Shah jshah2...@me.com wrote: Hi Guys, We have a cluster with 1000 OSD nodes and 5 MON nodes and 1 MDS node. In order to be able to loose quite a few OSD’s and still survive the load, we were thinking of making the replication factor to 50. Is that too big of a number? what is the performance implications and any other issues that we should consider before setting it to that. Also, do we need the same number of metadata copies too or it can be less? Don't do that. Every write has to be synchronously copied to every replica, so 50x replication will give you very high latencies and very low write bandwidth to each object. If you're just worried about not losing data, there are a lot of people with big clusters running 3x replication and it's been fine. If you have some use case where you think you're going to be turning off a bunch of nodes simultaneously without planning, Ceph might not be the storage system for your needs. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
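In practice that means leaving the pool at 3 copies and letting CRUSH put each copy in a different failure domain; a sketch:

ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2
# and in the crush rule, separate replicas by host (or rack) rather than by osd:
#   step chooseleaf firstn 0 type host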
Re: [ceph-users] [Ceph-community] Can't Start-up MDS
That looks like the beginning of an mds creation to me. What's your problem in more detail, and what's the output of ceph -s? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Sep 15, 2014 at 5:34 PM, Shun-Fa Yang shu...@gmail.com wrote: Hi all, I'm installed ceph v 0.80.5 on Ubuntu 14.04 server version by using apt-get... The log of mds shows as following: 2014-09-15 17:24:58.291305 7fd6f6d47800 0 ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6), process ceph-mds, pid 10487 2014-09-15 17:24:58.302164 7fd6f6d47800 -1 mds.-1.0 *** no OSDs are up as of epoch 8, waiting 2014-09-15 17:25:08.302930 7fd6f6d47800 -1 mds.-1.-1 *** no OSDs are up as of epoch 8, waiting 2014-09-15 17:25:19.322092 7fd6f1938700 1 mds.-1.0 handle_mds_map standby 2014-09-15 17:25:19.325024 7fd6f1938700 1 mds.0.3 handle_mds_map i am now mds.0.3 2014-09-15 17:25:19.325026 7fd6f1938700 1 mds.0.3 handle_mds_map state change up:standby -- up:creating 2014-09-15 17:25:19.325196 7fd6f1938700 0 mds.0.cache creating system inode with ino:1 2014-09-15 17:25:19.325377 7fd6f1938700 0 mds.0.cache creating system inode with ino:100 2014-09-15 17:25:19.325381 7fd6f1938700 0 mds.0.cache creating system inode with ino:600 2014-09-15 17:25:19.325449 7fd6f1938700 0 mds.0.cache creating system inode with ino:601 2014-09-15 17:25:19.325489 7fd6f1938700 0 mds.0.cache creating system inode with ino:602 2014-09-15 17:25:19.325538 7fd6f1938700 0 mds.0.cache creating system inode with ino:603 2014-09-15 17:25:19.325564 7fd6f1938700 0 mds.0.cache creating system inode with ino:604 2014-09-15 17:25:19.325603 7fd6f1938700 0 mds.0.cache creating system inode with ino:605 2014-09-15 17:25:19.325627 7fd6f1938700 0 mds.0.cache creating system inode with ino:606 2014-09-15 17:25:19.325655 7fd6f1938700 0 mds.0.cache creating system inode with ino:607 2014-09-15 17:25:19.325682 7fd6f1938700 0 mds.0.cache creating system inode with ino:608 2014-09-15 17:25:19.325714 7fd6f1938700 0 mds.0.cache creating system inode with ino:609 2014-09-15 17:25:19.325738 7fd6f1938700 0 mds.0.cache creating system inode with ino:200 Could someone tell me how to solve it? Thanks. -- 楊順發(yang shun-fa) ___ Ceph-community mailing list ceph-commun...@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph mds unable to start with 0.85
On Wed, Sep 17, 2014 at 9:59 PM, 廖建锋 de...@f-club.cn wrote: dear, my ceph cluster worked for about two weeks, mds crashed every 2-3 days, Now it stuck on replay , looks like replay crash and restart mds process again what can i do for this? 1015 = # ceph -s cluster 07df7765-c2e7-44de-9bb3-0b13f6517b18 health HEALTH_ERR 56 pgs inconsistent; 56 scrub errors; mds cluster is degraded; noscrub,nodeep-scrub flag(s) set monmap e1: 2 mons at {storage-1-213=10.1.0.213:6789/0,storage-1-214=10.1.0.214:6789/0}, election epoch 26, quorum 0,1 storage-1-213,storage-1-214 mdsmap e624: 1/1/1 up {0=storage-1-214=up:replay}, 1 up:standby osdmap e1932: 18 osds: 18 up, 18 in flags noscrub,nodeep-scrub pgmap v732381: 500 pgs, 3 pools, 2155 GB data, 39187 kobjects 4479 GB used, 32292 GB / 36772 GB avail 444 active+clean 56 active+clean+inconsistent client io 125 MB/s rd, 31 op/s MDS log here: 014-09-18 12:36:10.684841 7f8240512700 5 mds.-1.-1 handle_mds_map epoch 620 from mon.0 2014-09-18 12:36:10.684888 7f8240512700 1 mds.-1.0 handle_mds_map standby 2014-09-18 12:38:55.584370 7f8240512700 5 mds.-1.0 handle_mds_map epoch 621 from mon.0 2014-09-18 12:38:55.584432 7f8240512700 1 mds.0.272 handle_mds_map i am now mds.0.272 2014-09-18 12:38:55.584436 7f8240512700 1 mds.0.272 handle_mds_map state change up:standby -- up:replay 2014-09-18 12:38:55.584440 7f8240512700 1 mds.0.272 replay_start 2014-09-18 12:38:55.584456 7f8240512700 7 mds.0.cache set_recovery_set 2014-09-18 12:38:55.584460 7f8240512700 1 mds.0.272 recovery set is 2014-09-18 12:38:55.584464 7f8240512700 1 mds.0.272 need osdmap epoch 1929, have 1927 2014-09-18 12:38:55.584467 7f8240512700 1 mds.0.272 waiting for osdmap 1929 (which blacklists prior instance) 2014-09-18 12:38:55.584523 7f8240512700 5 mds.0.272 handle_mds_failure for myself; not doing anything 2014-09-18 12:38:55.585662 7f8240512700 2 mds.0.272 boot_start 0: opening inotable 2014-09-18 12:38:55.585864 7f8240512700 2 mds.0.272 boot_start 0: opening sessionmap 2014-09-18 12:38:55.586003 7f8240512700 2 mds.0.272 boot_start 0: opening mds log 2014-09-18 12:38:55.586049 7f8240512700 5 mds.0.log open discovering log bounds 2014-09-18 12:38:55.586136 7f8240512700 2 mds.0.272 boot_start 0: opening snap table 2014-09-18 12:38:55.586984 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.213:6806/6114 2014-09-18 12:38:55.587037 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.213:6811/6385 2014-09-18 12:38:55.587285 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.213:6801/6110 2014-09-18 12:38:55.591700 7f823ca08700 4 mds.0.log Waiting for journal 200 to recover... 2014-09-18 12:38:55.593297 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.214:6806/6238 2014-09-18 12:38:55.600952 7f823ca08700 4 mds.0.log Journal 200 recovered. 
2014-09-18 12:38:55.600967 7f823ca08700 4 mds.0.log Recovered journal 200 in format 1 2014-09-18 12:38:55.600973 7f823ca08700 2 mds.0.272 boot_start 1: loading/discovering base inodes 2014-09-18 12:38:55.600979 7f823ca08700 0 mds.0.cache creating system inode with ino:100 2014-09-18 12:38:55.601279 7f823ca08700 0 mds.0.cache creating system inode with ino:1 2014-09-18 12:38:55.602557 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.214:6811/6276 2014-09-18 12:38:55.607234 7f8240512700 2 mds.0.272 boot_start 2: replaying mds log 2014-09-18 12:38:55.675025 7f823ca08700 7 mds.0.cache adjust_subtree_auth -1,-2 - -2,-2 on [dir 1 / [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 0x5da] 2014-09-18 12:38:55.675055 7f823ca08700 7 mds.0.cache current root is [dir 1 / [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 | subtree=1 0x5da] 2014-09-18 12:38:55.675065 7f823ca08700 7 mds.0.cache adjust_subtree_auth -1,-2 - -2,-2 on [dir 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 0x5da03b8] 2014-09-18 12:38:55.675076 7f823ca08700 7 mds.0.cache current root is [dir 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 | subtree=1 0x5da03b8] 2014-09-18 12:38:55.675087 7f823ca08700 7 mds.0.cache adjust_bounded_subtree_auth -2,-2 - 0,-2 on [dir 1 / [2,head] auth v=1076158 cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09 17:49:20.00 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135 31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226 b1824464654503 31746894=31708437+38457) hs=0+0,ss=0+0 | subtree=1 0x5da] bound_dfs [] 2014-09-18 12:38:55.675116 7f823ca08700 7 mds.0.cache adjust_bounded_subtree_auth -2,-2 - 0,-2 on [dir 1 / [2,head] auth v=1076158 cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09 17:49:20.00 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135 31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226 b1824464654503 31746894=31708437+38457) hs=0+0,ss=0+0 | subtree=1 0x5da] bounds 2014-09-18 12:38:55.675129 7f823ca08700 7 mds.0.cache
Re: [ceph-users] Still seeing scrub errors in .80.5
On Thu, Sep 18, 2014 at 3:09 AM, Marc m...@shoowin.de wrote: Hi, we did run a deep scrub on everything yesterday, and a repair afterwards. Then a new deep scrub today, which brought new scrub errors. I did check the osd config, they report filestore_xfs_extsize: false, as it should be if I understood things correctly. FTR the deep scrub has been initiated like this: for pgnum in `ceph pg dump|grep active|awk '{print $1}'`; do ceph pg deep-scrub $pgnum; done How do we proceed from here? Did the deep scrubs all actually complete yesterday, so these are new errors and not just scrubs which weren't finished until now? If so, I'd start looking at the scrub errors and which OSDs are involved. Hopefully they'll have one or a few OSDs in common that you can examine more closely. But like I said before, my money's on faulty hardware or local filesystems. Depending on how you're set up it's probably a good idea to just start checking dmesg for any indications of trouble before you start tackling it from the RADOS side. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
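One way to follow that advice is to pull the inconsistent PGs together with their acting sets and see whether a particular OSD keeps turning up. A rough sketch (the health detail output format may differ slightly between releases):

# list every inconsistent PG with its acting OSD set
ceph health detail | grep inconsistent
# count how often each OSD appears in those acting sets; an OSD common to
# most of the bad PGs is the prime hardware/filesystem suspect
ceph health detail | grep 'active+clean+inconsistent' \
  | sed -e 's/.*acting \[//' -e 's/\]//' | tr ',' '\n' | sort -n | uniq -c | sort -rn
# then, on the suspect OSD's host, look for disk or controller trouble
dmesg | grep -iE 'error|fail|ata|xfs'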
Re: [ceph-users] CephFS : rm file does not remove object in rados
On Thu, Sep 18, 2014 at 10:39 AM, Florent B flor...@coppint.com wrote: On 09/12/2014 07:38 PM, Gregory Farnum wrote: On Fri, Sep 12, 2014 at 6:49 AM, Florent Bautista flor...@coppint.com wrote: Hi all, Today I have a problem using CephFS. I use firefly last release, with kernel 3.16 client (Debian experimental). I have a directory in CephFS, associated to a pool pool2 (with set_layout). All is working fine, I can add and remove files, objects are stored in the right pool. But when Ceph cluster is overloaded (or for another reason, I don't know), sometimes when I remove a file, objects are not deleted in rados ! CephFS file removal is asynchronous with you removing it from the filesystem. The files get moved into a stray directory and will get deleted once nobody holds references to them any more. My client is the only mounted and does not use files. does not use files...what? This problems occurs when I delete files with rm, but not when I use given rsync command. I explain : I want to remove a large directory, containing millions of files. For a moment, objects are really deleted in rados (I see it in rados df), but when I start to do some heavy operations (like moving volumes in rdb), objects are not deleted anymore, rados df returns a fixed number of objects. I can see that files are still deleting because I use rsync (rsync -avP --stats --delete /empty/dir/ /dir/to/delete/). What do you mean you're rsyncing and can see files deleting? I don't understand. When you run command I gave, syncing an empty dir with the dir you want deleted, rsync is telling you Deleting (file) for each file to unlink. Anyway, It's *possible* that the client is holding capabilities on the deleted files and isn't handing them back, in which case unmounting it would drop them (and then you could remount). I don't think we have any commands designed to hasten that, though. unmounting does not help. When I unlink() via rsync, objects are deleted in rados (it makes all cluster slow down, and have slow requests). When I use rm command, it is much faster but objects are not deleted in rados ! I think you're not doing what you think you're doing, then...those two actions should look the same to CephFS. When I re-mount root CephFS, there are no files, all empty. But still have 125 MB of objects in metadata pool and 21.57 GB in my data pool (and it does not decrease...)... Well, the metadata pool is never going to be emptied; that holds your MDS journals. The data pool might not get entirely empty either; how many objects does it say it has? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
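To see whether stray files are actually being purged, watching the object counts directly in RADOS is the simplest check. A sketch, assuming the default pool names (data and metadata):

# per-pool object counts and usage; re-run to see whether the data pool shrinks
rados df
ceph df detail
# count the remaining objects in the data pool directly
rados -p data ls | wc -l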
Re: [ceph-users] ceph mds unable to start with 0.85
On Thu, Sep 18, 2014 at 5:35 PM, 廖建锋 de...@f-club.cn wrote: if i turn on debug=20, the log will be more than 100G, looks no way to put, do you have any other good way to figure it out? It should compress well and you can use ceph-post-file if you don't have a place to host it yourself. -Greg would you like to log into the server to check? From: Gregory Farnum Date: 2014-09-19 02:33 To: 廖建锋 CC: ceph-users Subject: Re: [ceph-users] ceph mds unable to start with 0.85 On Wed, Sep 17, 2014 at 9:59 PM, 廖建锋 de...@f-club.cn wrote: dear, my ceph cluster worked for about two weeks, mds crashed every 2-3 days, Now it stuck on replay , looks like replay crash and restart mds process again what can i do for this? 1015 = # ceph -s cluster 07df7765-c2e7-44de-9bb3-0b13f6517b18 health HEALTH_ERR 56 pgs inconsistent; 56 scrub errors; mds cluster is degraded; noscrub,nodeep-scrub flag(s) set monmap e1: 2 mons at {storage-1-213=10.1.0.213:6789/0,storage-1-214=10.1.0.214:6789/0}, election epoch 26, quorum 0,1 storage-1-213,storage-1-214 mdsmap e624: 1/1/1 up {0=storage-1-214=up:replay}, 1 up:standby osdmap e1932: 18 osds: 18 up, 18 in flags noscrub,nodeep-scrub pgmap v732381: 500 pgs, 3 pools, 2155 GB data, 39187 kobjects 4479 GB used, 32292 GB / 36772 GB avail 444 active+clean 56 active+clean+inconsistent client io 125 MB/s rd, 31 op/s MDS log here: 014-09-18 12:36:10.684841 7f8240512700 5 mds.-1.-1 handle_mds_map epoch 620 from mon.0 2014-09-18 12:36:10.684888 7f8240512700 1 mds.-1.0 handle_mds_map standby 2014-09-18 12:38:55.584370 7f8240512700 5 mds.-1.0 handle_mds_map epoch 621 from mon.0 2014-09-18 12:38:55.584432 7f8240512700 1 mds.0.272 handle_mds_map i am now mds.0.272 2014-09-18 12:38:55.584436 7f8240512700 1 mds.0.272 handle_mds_map state change up:standby -- up:replay 2014-09-18 12:38:55.584440 7f8240512700 1 mds.0.272 replay_start 2014-09-18 12:38:55.584456 7f8240512700 7 mds.0.cache set_recovery_set 2014-09-18 12:38:55.584460 7f8240512700 1 mds.0.272 recovery set is 2014-09-18 12:38:55.584464 7f8240512700 1 mds.0.272 need osdmap epoch 1929, have 1927 2014-09-18 12:38:55.584467 7f8240512700 1 mds.0.272 waiting for osdmap 1929 (which blacklists prior instance) 2014-09-18 12:38:55.584523 7f8240512700 5 mds.0.272 handle_mds_failure for myself; not doing anything 2014-09-18 12:38:55.585662 7f8240512700 2 mds.0.272 boot_start 0: opening inotable 2014-09-18 12:38:55.585864 7f8240512700 2 mds.0.272 boot_start 0: opening sessionmap 2014-09-18 12:38:55.586003 7f8240512700 2 mds.0.272 boot_start 0: opening mds log 2014-09-18 12:38:55.586049 7f8240512700 5 mds.0.log open discovering log bounds 2014-09-18 12:38:55.586136 7f8240512700 2 mds.0.272 boot_start 0: opening snap table 2014-09-18 12:38:55.586984 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.213:6806/6114 2014-09-18 12:38:55.587037 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.213:6811/6385 2014-09-18 12:38:55.587285 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.213:6801/6110 2014-09-18 12:38:55.591700 7f823ca08700 4 mds.0.log Waiting for journal 200 to recover... 2014-09-18 12:38:55.593297 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.214:6806/6238 2014-09-18 12:38:55.600952 7f823ca08700 4 mds.0.log Journal 200 recovered. 
2014-09-18 12:38:55.600967 7f823ca08700 4 mds.0.log Recovered journal 200 in format 1 2014-09-18 12:38:55.600973 7f823ca08700 2 mds.0.272 boot_start 1: loading/discovering base inodes 2014-09-18 12:38:55.600979 7f823ca08700 0 mds.0.cache creating system inode with ino:100 2014-09-18 12:38:55.601279 7f823ca08700 0 mds.0.cache creating system inode with ino:1 2014-09-18 12:38:55.602557 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.214:6811/6276 2014-09-18 12:38:55.607234 7f8240512700 2 mds.0.272 boot_start 2: replaying mds log 2014-09-18 12:38:55.675025 7f823ca08700 7 mds.0.cache adjust_subtree_auth -1,-2 - -2,-2 on [dir 1 / [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 0x5da] 2014-09-18 12:38:55.675055 7f823ca08700 7 mds.0.cache current root is [dir 1 / [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 | subtree=1 0x5da] 2014-09-18 12:38:55.675065 7f823ca08700 7 mds.0.cache adjust_subtree_auth -1,-2 - -2,-2 on [dir 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 0x5da03b8] 2014-09-18 12:38:55.675076 7f823ca08700 7 mds.0.cache current root is [dir 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 | subtree=1 0x5da03b8] 2014-09-18 12:38:55.675087 7f823ca08700 7 mds.0.cache adjust_bounded_subtree_auth -2,-2 - 0,-2 on [dir 1 / [2,head] auth v=1076158 cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09 17:49:20.00 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135 31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226 b1824464654503 31746894=31708437+38457) hs=0+0,ss=0+0
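For reference, capturing and uploading such a log could look roughly like this (the mds name and log path are assumptions; the debug settings go into the [mds] section of ceph.conf before restarting the daemon):

# after reproducing the problem with "debug mds = 20" and "debug ms = 1" set,
# compress the log (it compresses very well) and upload it for the developers
gzip -c /var/log/ceph/ceph-mds.storage-1-214.log > /tmp/mds-replay.log.gz
ceph-post-file -d "mds stuck in up:replay on 0.85" /tmp/mds-replay.log.gz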
Re: [ceph-users] Renaming pools used by CephFS
On Fri, Sep 19, 2014 at 10:21 AM, Jeffrey Ollie j...@ocjtech.us wrote: I've got a Ceph system (running 0.80.5) at home that I've been messing around with, partly to learn Ceph, but also as reliable storage for all of my media. During the process I deleted the data and metadata pools used by CephFS and recreated them. However, when I recreated the filesystem, the pool called data got assigned as a metadata pool and the pool called metadata got assigned as a data pool. Is there a safe way to rename the pools? It's purely an aesthetic thing (I think), so if it's difficult/dangerous to do I'll leave it be. You can rename pools with ceph osd pool rename current_name new_name. Generally it's not a good idea to mess around with the CephFS pools, though — in Giant you'll be prevented from deleting them. ;) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
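Since CephFS refers to its pools by ID rather than by name, the rename should indeed be purely cosmetic. A sketch of swapping the two misnamed pools through a temporary name:

# swap the names; the filesystem keeps using the same pool IDs throughout
ceph osd pool rename data data_tmp
ceph osd pool rename metadata data
ceph osd pool rename data_tmp metadata
# confirm which pool IDs the filesystem actually uses
ceph mds dump | grep -E 'data_pools|metadata_pool'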
Re: [ceph-users] Reassigning admin server
On Mon, Sep 22, 2014 at 1:22 PM, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: If I have a machine/VM I am using as an Admin node for a ceph cluster, can I relocate that admin to another machine/VM after I’ve built a cluster? I would expect as the Admin isn’t an actual operating part of the cluster itself (other than Calamari, if it happens to be running) the rest of the cluster should be adequately served with a –update-conf. The admin node really just has the default ceph.conf and the keyrings for admin access to your cluster. You just need to copy that data to whatever other node(s) you want; there's no updating to do for the rest of the cluster. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
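In practice that copy is just a couple of files. A minimal sketch, assuming the usual /etc/ceph locations and a new host called new-admin:

# copy the cluster config and the admin keyring to the new admin node
scp /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring new-admin:/etc/ceph/
# with ceph-deploy the same push is: ceph-deploy admin new-admin
# then verify admin access from the new node
ssh new-admin ceph -s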
Re: [ceph-users] pgs stuck in active+clean+replay state
I imagine you aren't actually using the data/metadata pool that these PGs are in, but it's a previously-reported bug we haven't identified: http://tracker.ceph.com/issues/8758 They should go away if you restart the OSDs that host them (or just remove those pools), but it's not going to hurt anything as long as you aren't using them. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Thu, Sep 25, 2014 at 3:37 AM, Pavel V. Kaygorodov pa...@inasan.ru wrote: Hi! 16 pgs in our ceph cluster are in active+clean+replay state more then one day. All clients are working fine. Is this ok? root@bastet-mon1:/# ceph -w cluster fffeafa2-a664-48a7-979a-517e3ffa0da1 health HEALTH_OK monmap e3: 3 mons at {1=10.92.8.80:6789/0,2=10.92.8.81:6789/0,3=10.92.8.82:6789/0}, election epoch 2570, quorum 0,1,2 1,2,3 osdmap e3108: 16 osds: 16 up, 16 in pgmap v1419232: 8704 pgs, 6 pools, 513 GB data, 125 kobjects 2066 GB used, 10879 GB / 12945 GB avail 8688 active+clean 16 active+clean+replay client io 3237 kB/s wr, 68 op/s root@bastet-mon1:/# ceph pg dump | grep replay dumped all in format plain 0.fd0 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:29.902766 0'0 3108:2628 [0,7,14,8] [0,7,14,8] 0 0'0 2014-09-23 02:23:49.463704 0'0 2014-09-23 02:23:49.463704 0.e80 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:21.945082 0'0 3108:1823 [2,7,9,10] [2,7,9,10] 2 0'0 2014-09-22 14:37:32.910787 0'0 2014-09-22 14:37:32.910787 0.aa0 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:29.326607 0'0 3108:2451 [0,7,15,12][0,7,15,12] 0 0'0 2014-09-23 00:39:10.717363 0'0 2014-09-23 00:39:10.717363 0.9c0 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:29.325229 0'0 3108:1917 [0,7,9,12] [0,7,9,12] 0 0'0 2014-09-22 14:40:06.694479 0'0 2014-09-22 14:40:06.694479 0.9a0 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:29.325074 0'0 3108:2486 [0,7,14,11][0,7,14,11] 0 0'0 2014-09-23 01:14:55.825900 0'0 2014-09-23 01:14:55.825900 0.910 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:28.839148 0'0 3108:1962 [0,7,9,10] [0,7,9,10] 0 0'0 2014-09-22 14:37:44.652796 0'0 2014-09-22 14:37:44.652796 0.8c0 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:28.838683 0'0 3108:2635 [0,2,9,11] [0,2,9,11] 0 0'0 2014-09-23 01:52:52.390529 0'0 2014-09-23 01:52:52.390529 0.8b0 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:21.215964 0'0 3108:1636 [2,0,8,14] [2,0,8,14] 2 0'0 2014-09-23 01:31:38.134466 0'0 2014-09-23 01:31:38.134466 0.500 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:35.869160 0'0 3108:1801 [7,2,15,10][7,2,15,10] 7 0'0 2014-09-20 08:38:53.963779 0'0 2014-09-13 10:27:26.977929 0.440 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:35.871409 0'0 3108:1819 [7,2,15,10][7,2,15,10] 7 0'0 2014-09-20 11:59:05.208164 0'0 2014-09-20 11:59:05.208164 0.390 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:28.653190 0'0 3108:1827 [0,2,9,10] [0,2,9,10] 0 0'0 2014-09-22 14:40:50.697850 0'0 2014-09-22 14:40:50.697850 0.320 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:10.970515 0'0 3108:1719 [2,0,14,9] [2,0,14,9] 2 0'0 2014-09-20 12:06:23.716480 0'0 2014-09-20 12:06:23.716480 0.2c0 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:28.647268 0'0 3108:2540 [0,7,12,8] [0,7,12,8] 0 0'0 2014-09-22 23:44:53.387815 0'0 2014-09-22 23:44:53.387815 0.1f0 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:28.651059 0'0 3108:2522 [0,2,14,11][0,2,14,11] 0 0'0 2014-09-22 23:38:16.315755 0'0 2014-09-22 23:38:16.315755 0.7 0 0 0 0 0 0 0
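A sketch of that workaround, using one of the PG IDs from the dump above and assuming those pools really are unused:

# find which OSDs host the stuck PG; the first OSD listed is the primary
ceph pg map 0.fd
# restart that primary OSD on its host, repeating for the other replay PGs
sudo /etc/init.d/ceph restart osd.0
# or, if the data/metadata pools are genuinely unused, drop them instead
ceph osd pool delete data data --yes-i-really-really-mean-it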
Re: [ceph-users] PG stuck creating
Yeah, the last acting set there is probably from prior to your lost data and forced pg creation, so it might not have any bearing on what's happening now. Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Sep 30, 2014 at 10:07 AM, Robert LeBlanc rob...@leblancnet.us wrote: I rebuilt the primary OSD (29) in the hopes it would unblock whatever it was, but no luck. I'll check the admin socket and see if there is anything I can find there. On Tue, Sep 30, 2014 at 10:36 AM, Gregory Farnum g...@inktank.com wrote: On Tuesday, September 30, 2014, Robert LeBlanc rob...@leblancnet.us wrote: On our dev cluster, I've got a PG that won't create. We had a host fail with 10 OSDs that needed to be rebuilt. A number of other OSDs were down for a few days (did I mention this was a dev cluster?). The other OSDs eventually came up once the OSD maps caught up on them. I rebuilt the OSDs on all the hosts because we were running into XFS lockups with bcache. There were a number of PGs that could not be found when all the hosts were rebuilt. I tried restarting all the OSDs, the MONs, and deep scrubbing the OSDs they were on as well as the PGs. I performed a repair on the OSDs as well without any luck. One of pools had a recommendation to increase the PGs, so I increased it thinking it might be able to help. Nothing was helping and I could not find any reference to them so I force created them. That cleared up all but one that is creating due to the new PG number. Now, there is nothing I can do to unstick this one PG, I can't force create it, I can't increase the pgp_num, nada. At one point when recreating the OSDs, some of the number got out of order and to calm my OCD, I fixed it requiring me to manually modify the CRUSH map as the OSD appeared in both hosts, this was before I increased the PGs. There is nothing critical on this cluster, but I'm using this as an opportunity to understand Ceph in case we run into something similar in our future production environment. HEALTH_WARN 1 pgs stuck inactive; 1 pgs stuck unclean; pool libvirt-pool pg_num 256 pgp_num 128 pg 4.bf is stuck inactive since forever, current state creating, last acting [29,15,32] pg 4.bf is stuck unclean since forever, current state creating, last acting [29,15,32] pool libvirt-pool pg_num 256 pgp_num 128 [root@nodea ~]# ceph-osd --version ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187) More output http://pastebin.com/ajgpU7Zx Thanks You should find out which OSD the PG maps to, and see if ceph pg query or the osd admin socket will expose anything useful about its state. -Greg -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
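Concretely, for the PG above that would look something like this (osd.29 is the current primary according to the acting set):

# confirm where pg 4.bf maps and which OSD is primary
ceph pg map 4.bf
# ask for the PG's full peering/recovery state
ceph pg 4.bf query
# on the primary's host, check the admin socket for anything stuck
ceph daemon osd.29 dump_ops_in_flight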
Re: [ceph-users] PG stuck creating
On Tuesday, September 30, 2014, Robert LeBlanc rob...@leblancnet.us wrote: On our dev cluster, I've got a PG that won't create. We had a host fail with 10 OSDs that needed to be rebuilt. A number of other OSDs were down for a few days (did I mention this was a dev cluster?). The other OSDs eventually came up once the OSD maps caught up on them. I rebuilt the OSDs on all the hosts because we were running into XFS lockups with bcache. There were a number of PGs that could not be found when all the hosts were rebuilt. I tried restarting all the OSDs, the MONs, and deep scrubbing the OSDs they were on as well as the PGs. I performed a repair on the OSDs as well without any luck. One of pools had a recommendation to increase the PGs, so I increased it thinking it might be able to help. Nothing was helping and I could not find any reference to them so I force created them. That cleared up all but one that is creating due to the new PG number. Now, there is nothing I can do to unstick this one PG, I can't force create it, I can't increase the pgp_num, nada. At one point when recreating the OSDs, some of the number got out of order and to calm my OCD, I fixed it requiring me to manually modify the CRUSH map as the OSD appeared in both hosts, this was before I increased the PGs. There is nothing critical on this cluster, but I'm using this as an opportunity to understand Ceph in case we run into something similar in our future production environment. HEALTH_WARN 1 pgs stuck inactive; 1 pgs stuck unclean; pool libvirt-pool pg_num 256 pgp_num 128 pg 4.bf is stuck inactive since forever, current state creating, last acting [29,15,32] pg 4.bf is stuck unclean since forever, current state creating, last acting [29,15,32] pool libvirt-pool pg_num 256 pgp_num 128 [root@nodea ~]# ceph-osd --version ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187) More output http://pastebin.com/ajgpU7Zx Thanks You should find out which OSD the PG maps to, and see if ceph pg query or the osd admin socket will expose anything useful about its state. -Greg -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Why performance of benchmarks with small blocks is extremely small?
On Wed, Oct 1, 2014 at 5:24 AM, Andrei Mikhailovsky and...@arhont.com wrote: Timur, As far as I know, the latest master has a number of improvements for ssd disks. If you check the mailing list discussion from a couple of weeks back, you can see that the latest stable firefly is not that well optimised for ssd drives and IO is limited. However changes are being made to address that. I am well surprised that you can get 10K IOps as in my tests I was not getting over 3K IOPs on the ssd disks which are capable of doing 90K IOps. P.S. does anyone know if the ssd optimisation code will be added to the next maintenance release of firefly? Not a chance. The changes enabling that improved throughput are very invasive and sprinkled all over the OSD; they aren't the sort of thing that one does backport or that one could put on top of a stable release for any meaningful definition of stable. :) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
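For numbers like these it helps to have a client-independent baseline; a small-block rados bench run against a scratch pool is one rough way to get it (pool name, PG count, and runtimes are just examples):

# 30-second 4 KiB write test with 16 concurrent ops, then a random-read pass
ceph osd pool create benchpool 128 128
rados bench -p benchpool 30 write -b 4096 -t 16 --no-cleanup
rados bench -p benchpool 30 rand -t 16
# clean up afterwards
ceph osd pool delete benchpool benchpool --yes-i-really-really-mean-it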
Re: [ceph-users] Why performance of benchmarks with small blocks is extremely small?
All the stuff I'm aware of is part of the testing we're doing for Giant. There is probably ongoing work in the pipeline, but the fast dispatch, sharded work queues, and sharded internal locking structures that Somnath has discussed all made it. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wed, Oct 1, 2014 at 7:07 AM, Andrei Mikhailovsky and...@arhont.com wrote: Greg, are they going to be a part of the next stable release? Cheers From: Gregory Farnum g...@inktank.com To: Andrei Mikhailovsky and...@arhont.com Cc: Timur Nurlygayanov tnurlygaya...@mirantis.com, ceph-users ceph-us...@ceph.com Sent: Wednesday, 1 October, 2014 3:04:51 PM Subject: Re: [ceph-users] Why performance of benchmarks with small blocks is extremely small? On Wed, Oct 1, 2014 at 5:24 AM, Andrei Mikhailovsky and...@arhont.com wrote: Timur, As far as I know, the latest master has a number of improvements for ssd disks. If you check the mailing list discussion from a couple of weeks back, you can see that the latest stable firefly is not that well optimised for ssd drives and IO is limited. However changes are being made to address that. I am well surprised that you can get 10K IOps as in my tests I was not getting over 3K IOPs on the ssd disks which are capable of doing 90K IOps. P.S. does anyone know if the ssd optimisation code will be added to the next maintenance release of firefly? Not a chance. The changes enabling that improved throughput are very invasive and sprinkled all over the OSD; they aren't the sort of thing that one does backport or that one could put on top of a stable release for any meaningful definition of stable. :) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Why performance of benchmarks with small blocks is extremely small?
On Wed, Oct 1, 2014 at 9:21 AM, Mark Nelson mark.nel...@inktank.com wrote: On 10/01/2014 11:18 AM, Gregory Farnum wrote: All the stuff I'm aware of is part of the testing we're doing for Giant. There is probably ongoing work in the pipeline, but the fast dispatch, sharded work queues, and sharded internal locking structures that Somnath has discussed all made it. I seem to recall there was a deadlock issue or something with fast dispatch. Were we able to get that solved for Giant? Fast dispatch is not enabled in librados, but I don't think most users should be able to tell on that end. If they can, it'll be switched on at some point in the Hammer development process. If it's small enough we may backport eventually (we know how to go about it, but the change will require more testing than we were comfortable with assigning at this stage in an LTS). -Greg Mark -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wed, Oct 1, 2014 at 7:07 AM, Andrei Mikhailovsky and...@arhont.com wrote: Greg, are they going to be a part of the next stable release? Cheers From: Gregory Farnum g...@inktank.com To: Andrei Mikhailovsky and...@arhont.com Cc: Timur Nurlygayanov tnurlygaya...@mirantis.com, ceph-users ceph-us...@ceph.com Sent: Wednesday, 1 October, 2014 3:04:51 PM Subject: Re: [ceph-users] Why performance of benchmarks with small blocks is extremely small? On Wed, Oct 1, 2014 at 5:24 AM, Andrei Mikhailovsky and...@arhont.com wrote: Timur, As far as I know, the latest master has a number of improvements for ssd disks. If you check the mailing list discussion from a couple of weeks back, you can see that the latest stable firefly is not that well optimised for ssd drives and IO is limited. However changes are being made to address that. I am well surprised that you can get 10K IOps as in my tests I was not getting over 3K IOPs on the ssd disks which are capable of doing 90K IOps. P.S. does anyone know if the ssd optimisation code will be added to the next maintenance release of firefly? Not a chance. The changes enabling that improved throughput are very invasive and sprinkled all over the OSD; they aren't the sort of thing that one does backport or that one could put on top of a stable release for any meaningful definition of stable. :) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] mds isn't working anymore after osd's running full
Sorry; I guess this fell off my radar. The issue here is not that it's waiting for an osdmap; it got the requested map and went into replay mode almost immediately. In fact the log looks good except that it seems to finish replaying the log and then simply fail to transition into active. Generate a new one, adding in debug journaled = 20 and debug filer = 20, and we can probably figure out how to fix it. (This diagnosis is much easier in the upcoming Giant!) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Oct 7, 2014 at 7:55 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hello Gregory, We still have the same problems with our test ceph cluster and didn't receive a reply from you after I send you the requested log files. Do you know if it's possible to get our cephfs filesystem working again or is it better to give up the files on cephfs and start over again? We restarted the cluster serveral times but it's still degraded: [root@th1-mon001 ~]# ceph -w cluster c78209f5-55ea-4c70-8968-2231d2b05560 health HEALTH_WARN mds cluster is degraded monmap e3: 3 mons at {th1-mon001=10.1.2.21:6789/0,th1-mon002=10.1.2.22:6789/0,th1-mon003=10.1.2.23:6789/0}, election epoch 432, quorum 0,1,2 th1-mon001,th1-mon002,th1-mon003 mdsmap e190: 1/1/1 up {0=th1-mon001=up:replay}, 1 up:standby osdmap e2248: 12 osds: 12 up, 12 in pgmap v197548: 492 pgs, 4 pools, 60297 MB data, 470 kobjects 124 GB used, 175 GB / 299 GB avail 491 active+clean 1 active+clean+scrubbing+deep One placement group stays in the deep scrubbing fase. Kind regards, Jasper Siero Van: Jasper Siero Verzonden: donderdag 21 augustus 2014 16:43 Aan: Gregory Farnum Onderwerp: RE: [ceph-users] mds isn't working anymore after osd's running full I did restart it but you are right about the epoch number which has changed but the situation looks the same. 2014-08-21 16:33:06.032366 7f9b5f3cd700 1 mds.0.27 need osdmap epoch 1994, have 1993 2014-08-21 16:33:06.032368 7f9b5f3cd700 1 mds.0.27 waiting for osdmap 1994 (which blacklists prior instance) I started the mds with the debug options and attached the log. Thanks, Jasper Van: Gregory Farnum [g...@inktank.com] Verzonden: woensdag 20 augustus 2014 18:38 Aan: Jasper Siero CC: ceph-users@lists.ceph.com Onderwerp: Re: [ceph-users] mds isn't working anymore after osd's running full After restarting your MDS, it still says it has epoch 1832 and needs epoch 1833? I think you didn't really restart it. If the epoch numbers have changed, can you restart it with debug mds = 20, debug objecter = 20, debug ms = 1 in the ceph.conf and post the resulting log file somewhere? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wed, Aug 20, 2014 at 12:49 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Unfortunately that doesn't help. I restarted both the active and standby mds but that doesn't change the state of the mds. Is there a way to force the mds to look at the 1832 epoch (or earlier) instead of 1833 (need osdmap epoch 1833, have 1832)? Thanks, Jasper Van: Gregory Farnum [g...@inktank.com] Verzonden: dinsdag 19 augustus 2014 19:49 Aan: Jasper Siero CC: ceph-users@lists.ceph.com Onderwerp: Re: [ceph-users] mds isn't working anymore after osd's running full On Mon, Aug 18, 2014 at 6:56 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hi all, We have a small ceph cluster running version 0.80.1 with cephfs on five nodes. Last week some osd's were full and shut itself down. 
To help de osd's start again I added some extra osd's and moved some placement group directories on the full osd's (which has a copy on another osd) to another place on the node (as mentioned in http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/) After clearing some space on the full osd's I started them again. After a lot of deep scrubbing and two pg inconsistencies which needed to be repaired everything looked fine except the mds which still is in the replay state and it stays that way. The log below says that mds need osdmap epoch 1833 and have 1832. 2014-08-18 12:29:22.268248 7fa786182700 1 mds.-1.0 handle_mds_map standby 2014-08-18 12:29:22.273995 7fa786182700 1 mds.0.25 handle_mds_map i am now mds.0.25 2014-08-18 12:29:22.273998 7fa786182700 1 mds.0.25 handle_mds_map state change up:standby -- up:replay 2014-08-18 12:29:22.274000 7fa786182700 1 mds.0.25 replay_start 2014-08-18 12:29:22.274014 7fa786182700 1 mds.0.25 recovery set is 2014-08-18 12:29:22.274016 7fa786182700 1 mds.0.25 need osdmap epoch 1833, have 1832 2014-08-18 12:29:22.274017 7fa786182700 1 mds.0.25 waiting for osdmap 1833 (which blacklists prior instance) # ceph status
Re: [ceph-users] Regarding Primary affinity configuration
On Thu, Oct 9, 2014 at 10:55 AM, Johnu George (johnugeo) johnu...@cisco.com wrote: Hi All, I have few questions regarding the Primary affinity. In the original blueprint (https://wiki.ceph.com/Planning/Blueprints/Firefly/osdmap%3A_primary_role_affinity ), one example has been given. For PG x, CRUSH returns [a, b, c] If a has primary_affinity of .5, b and c have 1 , with 50% probability, we will choose b or c instead of a. (25% for b, 25% for c) A) I was browsing through the code, but I could not find this logic of splitting the rest of configured primary affinity value between other osds. How is this handled? if (a CEPH_OSD_MAX_PRIMARY_AFFINITY (crush_hash32_2(CRUSH_HASH_RJENKINS1, seed, o) 16) = a) { // we chose not to use this primary. note it anyway as a // fallback in case we don't pick anyone else, but keep looking. if (pos 0) pos = i; } else { pos = i; break; } } It's a fallback mechanism — if the chosen primary for a PG has primary affinity less than the default (max), we (probabilistically) look for a different OSD to be the primary. We decide whether to offload by running a hash and discarding the OSD if the output value is greater than the OSDs affinity, and then we go through the list and run that calculation in order (obviously if the affinity is 1, then it passes without needing to run the hash). If no OSD in the list has a high enough hash value, we take the originally-chosen primary. B) Since, primary affinity value is configured independently, there can be a situation with [0.1,0.1,0.1] with total value that don’t add to 1. How is this taken care of? These primary affinity values are just compared against the hash output I mentioned, so the sum doesn't matter. In general we simply expect that OSDs which don't have the max weight value will be chosen as primary in proportion to their share of the total weight of their PG membership (ie, if they have a weight of .5 and everybody else has weight 1, they will be primary in half the normal number of PGs. If everybody has a weight of .5, they will be primary in the normal proportions. Etc). C) Slightly confused. What happens for a situation with [1,0.5,1] ? Is osd.0 always returned? If the first OSD in the PG list has primary affinity of 1 then it is always the primary for that OSD, yes. That's not osd.0, though; just the first OSD in the PG list. ;) D) After calculating primary based on the affinity values, I see a shift of osds so that primary comes to the front. Why is this needed?. I thought, primary affinity value affects only reads and hence, osd ordering need not be changed. Primary affinity impacts which OSD is chosen to be primary; the primary is the ordering point for *all* access to the PG. That includes writes as well as reads, plus coordination of the cluster on map changes. We move the primary to the front of the list...well, I think it's just because we were lazy and there are a bunch of places that assume the first OSD in a replicated pool is the primary. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
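For completeness, the weight being discussed is set per OSD from the CLI; a sketch (on Firefly-era clusters the monitors must first be told to accept non-default values):

# allow non-default primary affinity values, either via
# "mon osd allow primary affinity = true" in ceph.conf or at runtime:
ceph tell mon.\* injectargs '--mon_osd_allow_primary_affinity=true'
# give osd.3 half the normal chance of being chosen primary in its PGs
ceph osd primary-affinity osd.3 0.5
# check the stored value in the osdmap
ceph osd dump | grep '^osd.3 '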
Re: [ceph-users] Blueprints
On Thu, Oct 9, 2014 at 4:01 PM, Robert LeBlanc rob...@leblancnet.us wrote: I have a question regarding submitting blueprints. Should only people who intend to do the work of adding/changing features of Ceph submit blueprints? I'm not primarily a programmer (but can do programming if needed), but have a feature request for Ceph. Blueprints are documents *for* developers. If you as a user have enough information about the feature you want, and the things it needs to do in Ceph, to generate a reasonable description of the feature, its user interface, and a skeleton of how it could be implemented, we'd love a blueprint. Blueprints which are backed by developers are more likely to get time at CDS, I think (Patrick/Sage could confirm), but even just having them is helpful. If that sounds intimidating, we take less detailed feature requests in our Redmine at tracker.ceph.com too. ;) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] mds isn't working anymore after osd's running full
Ugh, debug journaler, not debug journaled. That said, the filer output tells me that you're missing an object out of the MDS log. (200.08f5) I think this issue should be resolved if you dump the journal to a file, reset it, and then undump it. (These are commands you can invoke from ceph-mds.) I haven't done this myself in a long time, so there may be some hard edges around it. In particular, I'm not sure if the dumped journal file will stop when the data stops, or if it will be a little too long. If so, we can fix that by truncating the dumped file to the proper length and resetting and undumping again. (And just to harp on it, this journal manipulation is a lot simpler in Giant... ;) ) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wed, Oct 8, 2014 at 7:11 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hello Greg, No problem thanks for looking into the log. I attached the log to this email. I'm looking forward for the new release because it would be nice to have more possibilities to diagnose problems. Kind regards, Jasper Siero Van: Gregory Farnum [g...@inktank.com] Verzonden: dinsdag 7 oktober 2014 19:45 Aan: Jasper Siero CC: ceph-users Onderwerp: Re: [ceph-users] mds isn't working anymore after osd's running full Sorry; I guess this fell off my radar. The issue here is not that it's waiting for an osdmap; it got the requested map and went into replay mode almost immediately. In fact the log looks good except that it seems to finish replaying the log and then simply fail to transition into active. Generate a new one, adding in debug journaled = 20 and debug filer = 20, and we can probably figure out how to fix it. (This diagnosis is much easier in the upcoming Giant!) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Oct 7, 2014 at 7:55 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hello Gregory, We still have the same problems with our test ceph cluster and didn't receive a reply from you after I send you the requested log files. Do you know if it's possible to get our cephfs filesystem working again or is it better to give up the files on cephfs and start over again? We restarted the cluster serveral times but it's still degraded: [root@th1-mon001 ~]# ceph -w cluster c78209f5-55ea-4c70-8968-2231d2b05560 health HEALTH_WARN mds cluster is degraded monmap e3: 3 mons at {th1-mon001=10.1.2.21:6789/0,th1-mon002=10.1.2.22:6789/0,th1-mon003=10.1.2.23:6789/0}, election epoch 432, quorum 0,1,2 th1-mon001,th1-mon002,th1-mon003 mdsmap e190: 1/1/1 up {0=th1-mon001=up:replay}, 1 up:standby osdmap e2248: 12 osds: 12 up, 12 in pgmap v197548: 492 pgs, 4 pools, 60297 MB data, 470 kobjects 124 GB used, 175 GB / 299 GB avail 491 active+clean 1 active+clean+scrubbing+deep One placement group stays in the deep scrubbing fase. Kind regards, Jasper Siero Van: Jasper Siero Verzonden: donderdag 21 augustus 2014 16:43 Aan: Gregory Farnum Onderwerp: RE: [ceph-users] mds isn't working anymore after osd's running full I did restart it but you are right about the epoch number which has changed but the situation looks the same. 2014-08-21 16:33:06.032366 7f9b5f3cd700 1 mds.0.27 need osdmap epoch 1994, have 1993 2014-08-21 16:33:06.032368 7f9b5f3cd700 1 mds.0.27 waiting for osdmap 1994 (which blacklists prior instance) I started the mds with the debug options and attached the log. 
Thanks, Jasper Van: Gregory Farnum [g...@inktank.com] Verzonden: woensdag 20 augustus 2014 18:38 Aan: Jasper Siero CC: ceph-users@lists.ceph.com Onderwerp: Re: [ceph-users] mds isn't working anymore after osd's running full After restarting your MDS, it still says it has epoch 1832 and needs epoch 1833? I think you didn't really restart it. If the epoch numbers have changed, can you restart it with debug mds = 20, debug objecter = 20, debug ms = 1 in the ceph.conf and post the resulting log file somewhere? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wed, Aug 20, 2014 at 12:49 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Unfortunately that doesn't help. I restarted both the active and standby mds but that doesn't change the state of the mds. Is there a way to force the mds to look at the 1832 epoch (or earlier) instead of 1833 (need osdmap epoch 1833, have 1832)? Thanks, Jasper Van: Gregory Farnum [g...@inktank.com] Verzonden: dinsdag 19 augustus 2014 19:49 Aan: Jasper Siero CC: ceph-users@lists.ceph.com Onderwerp: Re: [ceph-users] mds isn't working anymore after osd's running full On Mon, Aug 18, 2014 at 6:56 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hi all, We have
Re: [ceph-users] ceph tell osd.6 version : hang
On Sun, Oct 12, 2014 at 7:46 AM, Loic Dachary l...@dachary.org wrote: Hi, On a 0.80.6 cluster the command ceph tell osd.6 version hangs forever. I checked that it establishes a TCP connection to the OSD, raised the OSD debug level to 20 and I do not see https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L4991 in the logs. All other OSDs answer to the same version command as they should. And ceph daemon osd.6 version on the machine running OSD 6 responds as it should. There also are an ever growing number of slow requests on this OSD. But not error in the logs. In other words, except for taking forever to answer any kind of request the OSD looks fine. Another OSD running on the same machine is behaving well. Any idea what that behaviour relates to ? What commands have you run? The admin socket commands don't require nearly as many locks, nor do they go through the same event loops that messages do. You might have found a deadlock or something. (In which case just restarting the OSD would probably fix it, but you should grab a core dump first.) -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph tell osd.6 version : hang
On Sun, Oct 12, 2014 at 9:10 AM, Loic Dachary l...@dachary.org wrote: On 12/10/2014 17:48, Gregory Farnum wrote: On Sun, Oct 12, 2014 at 7:46 AM, Loic Dachary l...@dachary.org wrote: Hi, On a 0.80.6 cluster the command ceph tell osd.6 version hangs forever. I checked that it establishes a TCP connection to the OSD, raised the OSD debug level to 20 and I do not see https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L4991 in the logs. All other OSDs answer to the same version command as they should. And ceph daemon osd.6 version on the machine running OSD 6 responds as it should. There also are an ever growing number of slow requests on this OSD. But not error in the logs. In other words, except for taking forever to answer any kind of request the OSD looks fine. Another OSD running on the same machine is behaving well. Any idea what that behaviour relates to ? What commands have you run? The admin socket commands don't require nearly as many locks, nor do they go through the same event loops that messages do. You might have found a deadlock or something. (In which case just restarting the OSD would probably fix it, but you should grab a core dump first.) # /etc/init.d/ceph stop osd.6 === osd.6 === Stopping Ceph osd.6 on g3...kill 23690...kill 23690...done root@g3:/var/lib/ceph/osd/ceph-6/current# /etc/init.d/ceph start osd.6 === osd.6 === Starting Ceph osd.6 on g3... starting osd.6 at :/0 osd_data /var/lib/ceph/osd/ceph-6 /var/lib/ceph/osd/ceph-6/journal root@g3:/var/lib/ceph/osd/ceph-6/current# ceph tell osd.6 version { version: ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)} root@g3:/var/lib/ceph/osd/ceph-6/current# ceph tell osd.6 version and now it blocks. It looks like a deadlock happens shortly after it boots. Is this the same cluster you're reporting on in the tracker? Anyway, apparently it's a disk state issue. I have no idea what kind of bug in Ceph could cause this, so my guess is that a syscall is going out to lunch — although that should get caught up in the internal heartbeat checkin code. Like I said, grab a core dump and look for deadlocks or blocked sys calls in the filestore. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph tell osd.6 version : hang
On Sun, Oct 12, 2014 at 9:29 AM, Loic Dachary l...@dachary.org wrote: On 12/10/2014 18:22, Gregory Farnum wrote: On Sun, Oct 12, 2014 at 9:10 AM, Loic Dachary l...@dachary.org wrote: On 12/10/2014 17:48, Gregory Farnum wrote: On Sun, Oct 12, 2014 at 7:46 AM, Loic Dachary l...@dachary.org wrote: Hi, On a 0.80.6 cluster the command ceph tell osd.6 version hangs forever. I checked that it establishes a TCP connection to the OSD, raised the OSD debug level to 20 and I do not see https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L4991 in the logs. All other OSDs answer to the same version command as they should. And ceph daemon osd.6 version on the machine running OSD 6 responds as it should. There also are an ever growing number of slow requests on this OSD. But not error in the logs. In other words, except for taking forever to answer any kind of request the OSD looks fine. Another OSD running on the same machine is behaving well. Any idea what that behaviour relates to ? What commands have you run? The admin socket commands don't require nearly as many locks, nor do they go through the same event loops that messages do. You might have found a deadlock or something. (In which case just restarting the OSD would probably fix it, but you should grab a core dump first.) # /etc/init.d/ceph stop osd.6 === osd.6 === Stopping Ceph osd.6 on g3...kill 23690...kill 23690...done root@g3:/var/lib/ceph/osd/ceph-6/current# /etc/init.d/ceph start osd.6 === osd.6 === Starting Ceph osd.6 on g3... starting osd.6 at :/0 osd_data /var/lib/ceph/osd/ceph-6 /var/lib/ceph/osd/ceph-6/journal root@g3:/var/lib/ceph/osd/ceph-6/current# ceph tell osd.6 version { version: ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)} root@g3:/var/lib/ceph/osd/ceph-6/current# ceph tell osd.6 version and now it blocks. It looks like a deadlock happens shortly after it boots. Is this the same cluster you're reporting on in the tracker? Yes, it is the same cluster as http://tracker.ceph.com/issues/9750 although I can't imagine how the two could be related, they probably are. Anyway, apparently it's a disk state issue. I have no idea what kind of bug in Ceph could cause this, so my guess is that a syscall is going out to lunch — although that should get caught up in the internal heartbeat checkin code. Like I said, grab a core dump and look for deadlocks or blocked sys calls in the filestore. I created http://tracker.ceph.com/issues/9751 and attached the log with debug_filestore = 20. There are many slow requests but I can't relate them to any kind of error. It does not core dump, should I kill it to get a coredump and then examine it ? I've never tried that ;-) That's what I was thinking; you send it a SIGQUIT signal and it'll dump. Or apparently you can use gcore instead, which won't quit it. The log doesn't have anything glaringly obvious; was it already hung when you packaged that? If so, it must be some kind of deadlock and the backtraces from the core dump will probably tell us what happened. One way or the other the problem will be fixed soon (tonight). I'd like to take advantage of the broken state we have to figure it out. Resurecting the OSD that may unblock http://tracker.ceph.com/issues/9751 and may also unblock http://tracker.ceph.com/issues/9750 and we'll lose a chance to diagnose this rare condition. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
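A sketch of capturing that state without restarting the daemon first (the pid-file path is an assumption; having the ceph debug symbol packages installed makes the backtraces far more useful):

# find osd.6's pid and take a core image while it keeps running
pid=$(cat /var/run/ceph/osd.6.pid)
gcore -o /tmp/ceph-osd.6.core "$pid"
# or just dump all thread backtraces, usually enough to spot a deadlock
gdb -p "$pid" --batch -ex 'thread apply all bt' > /tmp/ceph-osd.6.backtraces.txt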
Re: [ceph-users] Handling of network failures in the cluster network
On Mon, Oct 13, 2014 at 11:32 AM, Martin Mailand mar...@tuxadero.com wrote: Hi List, I have a ceph cluster setup with two networks, one for public traffic and one for cluster traffic. Network failures in the public network are handled quite well, but network failures in the cluster network are handled very badly. I found several discussions on the ml about this topic and they stated that the problem should be fixed, but I still have problems. I use ceph v0.86 with a standard crushmap, 4 osds per host and 6 hosts in the root default therefore I have 24 osds overall. Each storage node has 2 10Gbit nics one for public and one for cluster traffic, if I take down ONE of the links in the cluster network the cluster stops working. I tested it several times and I could observe following different behaviors. 1. Cluster stops forever. 2. After a timeout of around 120 seconds all other osds gets marked down. The osds on the storage node with the link failure stays up. Then all other osds boot and come back and the osds on the node with the failure are marked down and the cluster starts to work again. 3. After a timeout of around 120 seconds the osds on the node with the link failure gets marked down and the cluster starts to work again. Therefore a link failure in the cluster network has a very severe impact on the cluster availability. Is this a configuration mistake on my side? Any help would be greatly appreciated. How did you test taking down the connection? What config options have you specified on the OSDs and in the monitor? None of the scenarios you're describing make much sense on a semi-recent (post-dumpling-release) version of Ceph. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Misconfigured caps on client.admin key, any way to recover from EACCES denied?
On Mon, Oct 13, 2014 at 4:04 PM, Wido den Hollander w...@42on.com wrote: On 14-10-14 00:53, Anthony Alba wrote: Following the manual starter guide, I set up a Ceph cluster with HEALTH_OK, (1 mon, 2 osd). In testing out auth commands I misconfigured the client.admin key by accidentally deleting mon 'allow *'. Now I'm getting EACESS denied for all ceph actions. Is there a way to recover or recreate a new client.admin key. You can disable cephx completely, fix the key and enable cephx again. auth_cluster_required, auth_service_required and auth_client_required Set it to 'none' and restart the monitors and OSDs. You can also inject it through the admin socket if you want to. Mmm, I don't think that will work — Ceph still refers to the stored client capabilities; it just doesn't validate them. I *believe* if you grab the monitor key you can use that to make the necessary changes, though. Otherwise hacking at the monitor stores is an option. -Greg Key was: client.admin key: ABCDEFG... caps: [mon] allow * caps: [osd] allow * Misconfigured key: ABCDEFG... caps: [osd] allow * ...now all ceph commands fail, so I'm not sure how to start fixing the key on the mons/osds. - anthony ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph OSD very slow startup
On Monday, October 13, 2014, Lionel Bouton lionel+c...@bouton.name wrote: Hi, # First a short description of our Ceph setup You can skip to the next section (Main questions) to save time and come back to this one if you need more context. We are currently moving away from DRBD-based storage backed by RAID arrays to Ceph for some of our VMs. Our focus is on resiliency and capacity (one VM was outgrowing the largest RAID10 we had) and not maximum performance (at least not yet). Our Ceph OSDs are fairly unbalanced because 2 are on 2 historic hosts each with 4 disks in a hardware RAID10 configuration and no place available for new disks in the chassis. 12 additional OSD are on 2 new systems with 6 disk drives dedicated to one OSD each (CPU and RAM configurations are nearly identical on the 4 hosts). All hosts are used for running VMs too, we took some precautions to avoid too much interference: each host has CPU and RAM to spare for the OSD. CPU usage exhibits some bursts on occasions but as we only have one or two VM on each host, they can't starve the OSD which have between 2 and 8 full fledge cores (4 to 16 hardware threads) for them depending on the current load. We have at least 4GB of free RAM per OSD on each host at all times (including room for at least a 4GB OS cache). To sum up we have a total of 14 OSDs, the 2 largest ones on RAID10 are clearly our current bottleneck. That said until we have additional hardware they allow us to maintain availability even if 2 servers are down (default crushmap with pool configured with 3 replicas on 3 different hosts) and performance is acceptable (backfilling/scrubing/... pgs required some tuning though and I'm eagerly waiting for 0.80.7 to begin tests of the new io priority tunables). Everything is based on SATA/SAS 7200t/min disk drives behind P410 Raid controllers (HP Proliant systems) with battery backed memory to help with write bursts. The OSDs are a mix of: - Btrfs on 3.17.0 kernels on individual disks, 450GB use on 2TB (3.17.0 fixes a filesystem lockup we had with earlier kernels manifesting itself with concurrent accesses to several Btrfs filesystems according to recent lkml posts), - Btrfs on 3.12.21 kernels on the 2 systems with RAID10, 1.5TB used on 3TB (no lockup on these yet but they will migrate to 3.17.0 when we'll have enough experience with it). - XFS for a minority of individual disks (with a dedicated partition for the journal). Most of them have the same history (all being created at the same time), only two of them have been created later (following Btrfs corruption and/or conversion to XFS) and are avoided when comparing behaviours. All Btrfs volumes use these mount options: rw,noatime,nodiratime,compress=lzo,space_cache,autodefrag,recovery All OSDs use a 5GB journal. We slowly add monitoring to the setup to see what are the benefits of Btrfs in our case (ceph osd perf, kernel io wait per devices, osd CPU usage, ...). One long term objective is to slowly raise the performance both by migrating to/adding more suitable hardware and tuning the software side. Detailed monitoring is supposed to help us study the behaviour of isolated OSDs with different settings and being warned early if they generate performance problems to take them out with next to no impact on the whole storage network (we are strong believers in slow, incremental and continuous change and distributed storage with redundancy makes it easy to implement). 
# Main questions The system works well but I just realised when restarting one of the 2 large Btrfs OSD that it was very slow to rejoin the network (ceph osd set noout was used for the restart). I stopped the OSD init after 5 minutes to investigate what was going on and didn't find any obvious problem (filesystem sane, no swapping, CPU hogs, concurrent IO not able to starve the system by itself, ...). Next restarts took between 43s (nearly no concurrent disk access and warm caches after an earlier restart without umounting the filesystem) and 3mn57s (one VM still on DRBD doing ~30 IO/s on the same volume and cold caches after a filesystem mount). It seems that the startup time is getting longer on the 2 large Btrfs filesystems (the other one gives similar results: 3mn48s on the first try for example). I noticed that it was a bit slow a week ago but not as much (there was ~half as much data on them at the time). OSDs on individual disks don't exhibit this problem (with warm caches init finishes in ~4s on the small Btrfs volumes, ~3s on the XFS volumes) but they are on dedicated disks with less data. With warm caches most of the time is spent between: osd.n osdmap load_pgs osd.n osdmap load_pgs opened m pgs log lines in /var/log/ceph/ceph-osd.n.log (m is ~650 for both OSD). So it seems most of the time is spent opening pgs. What could explain such long startup times? Is the OSD init doing a lot of random disk accesses? Is it dependant on
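As a rough way to see where that startup time goes, the number of PGs the OSD has to open and the load_pgs timing in its own log can be checked directly; a sketch (OSD id and paths are assumptions):

# how many PG directories does this OSD hold? each is opened during load_pgs
ls /var/lib/ceph/osd/ceph-2/current | grep -c '_head$'
# how long the last load_pgs pass took, according to the OSD log
grep 'load_pgs' /var/log/ceph/ceph-osd.2.log | tail -4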
Re: [ceph-users] Misconfigured caps on client.admin key, any way to recover from EACCES denied?
On Monday, October 13, 2014, Anthony Alba ascanio.al...@gmail.com wrote: You can disable cephx completely, fix the key and enable cephx again. auth_cluster_required, auth_service_required and auth_client_required That did not work: i.e disabling cephx in the cluster conf and restarting the cluster. The cluster still complained about failed authentication. I *believe* if you grab the monitor key you can use that to make the necessary changes, though. Otherwise hacking at the monitor stores is an option. You mean use the mon. key but as the client.admin user? It's been a while since I've done this, but once upon a time you could use the mon key and the ID mon. and then send mon commands from the ceph cli. I'd try that. -Greg - anthony -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
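A sketch of that recovery path, assuming the default monitor data directory layout; the monitor's own keyring authenticates the command and the admin entity's caps are simply rewritten:

# on a monitor host (mon id assumed to be 'a'), check what client.admin currently has
ceph -n mon. -k /var/lib/ceph/mon/ceph-a/keyring auth get client.admin
# restore the missing mon capability
ceph -n mon. -k /var/lib/ceph/mon/ceph-a/keyring \
    auth caps client.admin mon 'allow *' osd 'allow *'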
Re: [ceph-users] Handling of network failures in the cluster network
On Mon, Oct 13, 2014 at 1:37 PM, Martin Mailand mar...@tuxadero.com wrote: Hi Greg, I took down the interface with ifconfig p7p1 down. I attached the config of the first monitor and the first osd. I created the cluster with ceph-deploy. The version is ceph version 0.86 (97dcc0539dfa7dac3de74852305d51580b7b1f82). On 13.10.2014 21:45, Gregory Farnum wrote: How did you test taking down the connection? What config options have you specified on the OSDs and in the monitor? None of the scenarios you're describing make much sense on a semi-recent (post-dumpling-release) version of Ceph. Best Regards, martin Hmm, do you have any logs? 120 seconds is just way longer than the failure detection should normally take, unless you've been playing with it enough to stretch out the extra time the monitor waits to be certain. But I did realize that in your configuration you probably want to set one or both of mon_osd_min_down_reporters and mon_osd_min_down_reports to a number greater than the number of OSDs you have on a single host. (They default to 1 and 3, respectively.) That's probably how the disconnected node managed to fail all of the other nodes — its failure reports to the monitor arrived first. You can also run tests with the mon_osd_adjust_heartbeat_grace option set to false, to get more predictable results. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
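As a sketch of the settings Greg mentions, something like the following in ceph.conf would do it; the reporter/report numbers assume a hypothetical 6 OSDs per host, so size them to your own layout:

  [mon]
      # require failure reports from more OSDs than live on any single host
      mon osd min down reporters = 7
      mon osd min down reports = 9
      # make failure-detection timing more predictable during tests
      mon osd adjust heartbeat grace = false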
Re: [ceph-users] mds isn't working anymore after osd's running full
ceph-mds --undump-journal rank journal-file Looks like it accidentally (or on purpose? you can break things with it) got left out of the help text. On Tue, Oct 14, 2014 at 8:19 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hello Greg, I dumped the journal successful to a file: journal is 9483323613~134215459 read 134213311 bytes at offset 9483323613 wrote 134213311 bytes at offset 9483323613 to journaldumptgho NOTE: this is a _sparse_ file; you can $ tar cSzf journaldumptgho.tgz journaldumptgho to efficiently compress it while preserving sparseness. I see the option for resetting the mds journal but I can't find the option for undumping /importing the journal: usage: ceph-mds -i name [flags] [[--journal_check rank]|[--hot-standby][rank]] -m monitorip:port connect to monitor at given address --debug_mds n debug MDS level (e.g. 10) --dump-journal rank filename dump the MDS journal (binary) for rank. --dump-journal-entries rank filename dump the MDS journal (JSON) for rank. --journal-check rank replay the journal for rank, then exit --hot-standby rank start up as a hot standby for rank --reset-journal rank discard the MDS journal for rank, and replace it with a single event that updates/resets inotable and sessionmap on replay. Do you know how to undump the journal back into ceph? Jasper Van: Gregory Farnum [g...@inktank.com] Verzonden: vrijdag 10 oktober 2014 23:45 Aan: Jasper Siero CC: ceph-users Onderwerp: Re: [ceph-users] mds isn't working anymore after osd's running full Ugh, debug journaler, not debug journaled. That said, the filer output tells me that you're missing an object out of the MDS log. (200.08f5) I think this issue should be resolved if you dump the journal to a file, reset it, and then undump it. (These are commands you can invoke from ceph-mds.) I haven't done this myself in a long time, so there may be some hard edges around it. In particular, I'm not sure if the dumped journal file will stop when the data stops, or if it will be a little too long. If so, we can fix that by truncating the dumped file to the proper length and resetting and undumping again. (And just to harp on it, this journal manipulation is a lot simpler in Giant... ;) ) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wed, Oct 8, 2014 at 7:11 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hello Greg, No problem thanks for looking into the log. I attached the log to this email. I'm looking forward for the new release because it would be nice to have more possibilities to diagnose problems. Kind regards, Jasper Siero Van: Gregory Farnum [g...@inktank.com] Verzonden: dinsdag 7 oktober 2014 19:45 Aan: Jasper Siero CC: ceph-users Onderwerp: Re: [ceph-users] mds isn't working anymore after osd's running full Sorry; I guess this fell off my radar. The issue here is not that it's waiting for an osdmap; it got the requested map and went into replay mode almost immediately. In fact the log looks good except that it seems to finish replaying the log and then simply fail to transition into active. Generate a new one, adding in debug journaled = 20 and debug filer = 20, and we can probably figure out how to fix it. (This diagnosis is much easier in the upcoming Giant!) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Oct 7, 2014 at 7:55 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hello Gregory, We still have the same problems with our test ceph cluster and didn't receive a reply from you after I send you the requested log files. 
Do you know if it's possible to get our cephfs filesystem working again or is it better to give up the files on cephfs and start over again? We restarted the cluster several times but it's still degraded: [root@th1-mon001 ~]# ceph -w cluster c78209f5-55ea-4c70-8968-2231d2b05560 health HEALTH_WARN mds cluster is degraded monmap e3: 3 mons at {th1-mon001=10.1.2.21:6789/0,th1-mon002=10.1.2.22:6789/0,th1-mon003=10.1.2.23:6789/0}, election epoch 432, quorum 0,1,2 th1-mon001,th1-mon002,th1-mon003 mdsmap e190: 1/1/1 up {0=th1-mon001=up:replay}, 1 up:standby osdmap e2248: 12 osds: 12 up, 12 in pgmap v197548: 492 pgs, 4 pools, 60297 MB data, 470 kobjects 124 GB used, 175 GB / 299 GB avail 491 active+clean 1 active+clean+scrubbing+deep One placement group stays in the deep scrubbing phase. Kind regards, Jasper Siero From: Jasper Siero Sent: Thursday, 21 August 2014 16:43 To: Gregory Farnum Subject: RE: [ceph-users] mds isn't working anymore after osd's running full I did restart it but you are right about
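For reference, the dump/reset/undump cycle discussed in this thread boils down to something like the following (rank 0 and the file name are just examples, and the MDS should be stopped first):

  # back up the journal for rank 0, then reset and re-import it
  ceph-mds -i <name> -c /etc/ceph/ceph.conf --dump-journal 0 journal-backup.bin
  ceph-mds -i <name> -c /etc/ceph/ceph.conf --reset-journal 0
  ceph-mds -i <name> -c /etc/ceph/ceph.conf --undump-journal 0 journal-backup.bin
  service ceph start mds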
Re: [ceph-users] Firefly maintenance release schedule
On Wed, Oct 15, 2014 at 9:39 AM, Dmitry Borodaenko dborodae...@mirantis.com wrote: On Tue, Sep 30, 2014 at 6:49 PM, Dmitry Borodaenko dborodae...@mirantis.com wrote: Last stable Firefly release (v0.80.5) was tagged on July 29 (over 2 months ago). Since then, there were twice as many commits merged into the firefly branch than there existed on the branch before v0.80.5: $ git log --oneline --no-merges v0.80..v0.80.5|wc -l 122 $ git log --oneline --no-merges v0.80.5..firefly|wc -l 227 Is this a one time aberration in the process or should we expect the gap between maintenance updates for LTS releases of Ceph to keep growing? I didn't get a response to that nag other than the v0.80.6 release announcement on the day after, so I guess it wasn't completely ignored :) Except it turned out v0.80.6 was slightly less than useful as a maintenance release: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-October/043701.html Two weeks later we have v0.80.7 with 3 more commits that hopefully make it actually usable. There are many ways to look at that from release management perspective. Good: 2 weeks is much better than 2 months. Bad: that's 2.5 months since last *stable* Firefly release. Ugly: that's 2 weeks for 3 commits, and now we have 54 more waiting for the next release... Wait what?! Oh right, 54 more commits were merged from firefly-next as soon as v0.80.7 was tagged: $ git log --oneline --no-merges v0.80.7..firefly|wc -l 54 Some of these are fixes for Urgent priority bugs, crashes, and data loss: http://tracker.ceph.com/issues/9492 http://tracker.ceph.com/issues/9039 http://tracker.ceph.com/issues/9582 http://tracker.ceph.com/issues/9307 etc. So what a Ceph deployer supposed to do with this? Wait another couple of weeks (hopefully) for v0.80.8? Take v0.80.7 and hope not to encounter any of these bugs? Or label Firefly as not production ready yet and go back to Dumpling? My personal preference obviously would be the first option, but waiting for 2.5 more months is not going to fit my schedule :( Take .80.7. All of the bugs you've cited, you are supremely unlikely to run into. The Urgent tag is a measure of planning priority, not of impact to users; here it generally means we found a bug on a stable branch that we can reproduce. Taking them in order: http://tracker.ceph.com/issues/9492: only happens if you try and cheat with your CRUSH rules, and obviously nobody did that until Sage suggested it as a solution to the problem somebody had 29 days ago when this was discovered. http://tracker.ceph.com/issues/9039: The most serious here, but only happens if you're using RGW, and storing user data in multiple pools, and issue a COPY command to copy data between different pools. http://tracker.ceph.com/issues/9582: Only happens if you're using the op timeout feature of librados with the C bindings OR the op timeout feature *and* the user-provided buffers in the C++ interface. (To the best of my knowledge, the people who discovered this are the only ones using op timeouts.) http://tracker.ceph.com/issues/9307: I'm actually not sure what's going on here; looks like some kind of extremely rare race when authorizing requests? (ie, fixed by a retry) We messed up the v0.80.6 release in a very specific way (and if you were deploying a new cluster it wasn't a problem), but you're extrapolating too much from the presence of patches about what their impact is and what the system's stability is. 
These are largely cleaning up rough edges around user interfaces, and smoothing out issues in the new functionality that a standard deployment isn't going to experience. :) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Performance doesn't scale well on a full ssd cluster.
[Re-added the list.] I assume you added more clients and checked that it didn't scale past that? You might look through the list archives; there are a number of discussions about how and how far you can scale SSD-backed cluster performance. Just scanning through the config options you set, you might want to bump up all the filestore and journal queue values a lot farther. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Thu, Oct 16, 2014 at 9:51 AM, Mark Wu wud...@gmail.com wrote: Thanks for the reply. I am not using a single client. Writing to 5 rbd volumes on 3 hosts can reach the peak. The client is fio and it also runs on the osd nodes. But there are no bottlenecks on CPU or network. I also tried running the clients on two non-osd servers, but got the same result. On Oct 17, 2014, at 12:29 AM, Gregory Farnum g...@inktank.com wrote: If you're running a single client to drive these tests, that's your bottleneck. Try running multiple clients and aggregating their numbers. -Greg On Thursday, October 16, 2014, Mark Wu wud...@gmail.com wrote: Hi list, During my test, I found ceph doesn't scale as I expected on a 30-osd cluster. The following is the information about my setup: HW configuration: 15 Dell R720 servers, and each server has: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 20 cores and hyper-threading enabled. 128GB memory two Intel 3500 SSD disks, connected with MegaRAID SAS 2208 controller, each disk is configured as raid0 separately. bonding with two 10GbE nics, used for both the public network and cluster network. SW configuration: OS CentOS 6.5, Kernel 3.17, Ceph 0.86 XFS as file system for data. each SSD disk has two partitions, one is osd data and the other is osd journal. the pool has 2048 pgs. 2 replicas. 5 monitors running on 5 of the 15 servers. Ceph configuration (in memory debugging options are disabled) [osd] osd data = /var/lib/ceph/osd/$cluster-$id osd journal = /var/lib/ceph/osd/$cluster-$id/journal osd mkfs type = xfs osd mkfs options xfs = -f -i size=2048 osd mount options xfs = rw,noatime,logbsize=256k,delaylog osd journal size = 20480 osd mon heartbeat interval = 30 # Performance tuning filestore osd_max_backfills = 10 osd_recovery_max_active = 15 merge threshold = 40 filestore split multiple = 8 filestore fd cache size = 1024 osd op threads = 64 # Recovery tuning osd recovery max active = 1 osd max backfills = 1 osd recovery op priority = 1 throttler perf counter = false osd enable op tracker = false filestore_queue_max_ops = 5000 filestore_queue_committing_max_ops = 5000 journal_max_write_entries = 1000 journal_queue_max_ops = 5000 objecter_inflight_ops = 8192 When I test with 7 servers (14 osds), the maximum iops of 4k random write I saw is 17k on a single volume and 44k on the whole cluster. I expected that the 30-osd cluster could approach 90k. But unfortunately, I found that with 30 osds it provides almost the same performance as 14 osds, sometimes even worse. I checked the iostat output on all the nodes, which have similar numbers. It's well distributed but disk utilization is low. In the test with 14 osds, I can see higher utilization of the disks (80%~90%). So do you have any tuning suggestions to improve the performance with 30 osds? Any feedback is appreciated.
iostat output:
Device:  rrqm/s   wrqm/s    r/s       w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00     0.00   0.00      0.00     0.00       0.00      0.00      0.00   0.00   0.00   0.00
sdb        0.00    88.50   0.00   5188.00     0.00   93397.00     18.00      0.90   0.17   0.09  47.85
sdc        0.00   443.50   0.00   5561.50     0.00   97324.00     17.50      4.06   0.73   0.09  47.90
dm-0       0.00     0.00   0.00      0.00     0.00       0.00      0.00      0.00   0.00   0.00   0.00
dm-1       0.00     0.00   0.00      0.00     0.00       0.00      0.00      0.00   0.00   0.00   0.00
Device:  rrqm/s   wrqm/s    r/s       w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00    17.50   0.00     28.00     0.00    3948.00    141.00      0.01   0.29   0.05   0.15
sdb        0.00    69.50   0.00   4932.00     0.00   87067.50     17.65      2.27   0.46   0.09  43.45
sdc        0.00    69.00   0.00   4855.50     0.00  105771.50     21.78      0.95   0.20   0.10  46.40
dm-0       0.00     0.00   0.00      0.00     0.00       0.00      0.00      0.00   0.00   0.00   0.00
dm-1       0.00     0.00   0.00     42.50     0.00    3948.00     92.89      0.01   0.19   0.04   0.15
Device:  rrqm/s   wrqm/s    r/s       w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00    12.00   0.00      8.00     0.00     568.00     71.00      0.00   0.12   0.12   0.10
sdb        0.00    72.50   0.00   5046.50     0.00  113198.50     22.43      1.09   0.22   0.10  51.40
sdc        0.00    72.50
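To make Greg's "bump up the queue values" suggestion concrete, a hedged ceph.conf sketch might look like this; the numbers below are illustrative guesses, not recommendations from the thread, and should be sized to the RAM you can spare:

  [osd]
      filestore queue max ops = 20000
      filestore queue max bytes = 536870912
      filestore queue committing max ops = 20000
      filestore queue committing max bytes = 536870912
      journal max write entries = 10000
      journal max write bytes = 1073741824
      journal queue max ops = 20000
      journal queue max bytes = 536870912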
Re: [ceph-users] why the erasure code pool not support random write?
This is a common constraint in many erasure-coded storage systems. It arises because random writes turn into a read-modify-write cycle (in order to redo the parity calculations). So we simply disallow them in EC pools, which works fine for the target use cases right now. -Greg On Monday, October 20, 2014, 池信泽 xmdx...@gmail.com wrote: hi, cephers: When I looked into the ceph source code, I found that the erasure code pool does not support random writes; it only supports append writes. Why? Is it that random writes on erasure code are high cost and the performance of deep scrub would be very poor? Thanks. -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
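For anyone wanting to see the restriction in practice, a minimal sketch of creating an EC pool follows (the profile values are just examples); full-object writes and appends to it succeed, while a write at a non-zero offset into an existing object is rejected, which is exactly the read-modify-write case described above:

  ceph osd erasure-code-profile set ecdemo k=2 m=1 ruleset-failure-domain=host
  ceph osd pool create ecpool 128 128 erasure ecdemo
  # a full-object write works fine on the EC pool
  rados -p ecpool put demo-object /etc/hosts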
Re: [ceph-users] Ceph OSD very slow startup
On Mon, Oct 20, 2014 at 8:25 AM, Lionel Bouton lionel+c...@bouton.name wrote: Hi, More information on our Btrfs tests. On 14/10/2014 19:53, Lionel Bouton wrote: Current plan: wait at least a week to study 3.17.0 behavior and upgrade the 3.12.21 nodes to 3.17.0 if all goes well. 3.17.0 and 3.17.1 have a bug which remounts Btrfs filesystems read-only (no corruption but the OSD goes down) on some access patterns with snapshots: https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36483.html The bug may be present in earlier kernels (at least the 3.16.4 code in fs/btrfs/qgroup.c doesn't handle the case differently than 3.17.0 and 3.17.1) but seems at least less likely to show up (never saw it with 3.16.4 in several weeks but it happened with 3.17.1 three times in just a few hours). As far as I can tell from its Changelog, 3.17.1 didn't patch any vfs/btrfs path vs 3.17.0, so I assume 3.17.0 has the same behaviour. I switched all servers to 3.16.4, which I had previously tested without any problem. The performance problem is still there with 3.16.4. In fact one of the 2 large OSDs was so slow it was repeatedly marked out and generated lots of latencies when in. I just had to remove it: when this OSD is shut down with noout to avoid backfills slowing down the storage network, latencies are back to normal. I chose to reformat this one with XFS. The other big node has a nearly perfectly identical system (same hardware, same software configuration, same logical volume configuration, same weight in the crush map, comparable disk usage in the OSD fs, ...) but is behaving itself (maybe slower than our smaller XFS and Btrfs OSDs, but usable). The only notable difference is that it was formatted more recently. So the performance problem might be linked to the cumulative amount of data access to the OSD over time. Yeah; we've seen this before and it appears to be related to our aggressive use of btrfs snapshots; it seems that btrfs doesn't defrag well under our use case. The btrfs developers make sporadic concerted efforts to improve things (and succeed!), but it apparently still hasn't gotten enough better yet. :( -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CRUSH depends on host + OSD?
On Tuesday, October 21, 2014, Chad Seys cws...@physics.wisc.edu wrote: Hi Craig, It's part of the way the CRUSH hashing works. Any change to the CRUSH map causes the algorithm to change slightly. Dan@cern could not replicate my observations, so I plan to follow his procedure (fake create an OSD, wait for rebalance, remove fake OSD) in the near future to see if I can replicate his! :) BTW, it's safer to remove OSDs and hosts by first marking the OSDs UP and OUT (ceph osd out OSDID). That will trigger the remapping, while keeping the OSDs in the pool so you have all of your replicas. I am under the impression that the procedure I posted does leave the OSDs in the pool while an additional replication takes place: After ceph osd crush remove osd.osdnum I see that the used % on the removed OSD slowly decreases as the relocation of blocks takes place. If my ceph-fu were strong enough I would try to find some block replicated num_replicas+1 times so that my belief would be well-founded. :) Also ceph osd crush remove osd.osdnum still shows the OSD in ceph osd tree, but it is not attached to any server. I think it might even be marked UP and DOWN, but I cannot confirm. So I believe so far the approaches are equivalent. BUT, I think that to keep an OSD out after using ceph osd out OSDID one needs to turn off auto in or something. I don't want to turn that off b/c in the past I had some slow drives which would occasionally be marked out. If they stayed out that could increase load on other drives, making them unresponsive, getting them marked out as well, leading to a domino effect where too many drives get marked out and the cluster goes down. Now I have better hardware, but since the scenario exists, I'd rather avoid it! :) There are separate options for automatically marking new drives in versus marking in established ones. Should be in the docs! :) -Greg If you mark the OSDs OUT, wait for the remapping to finish, and remove the OSDs and host from the CRUSH map, there will still be some data migration. Yep, this is what I see. But I find it weird. Ceph is also really good at handling multiple changes in a row. For example, I had to reformat all of my OSDs because I chose my mkfs.xfs parameters poorly. I removed the OSDS, without draining them first, which caused a lot of remapping. I then quickly formatted the OSDs, and put them back in. The CRUSH map went back to what it started with, and the only remapping required was to re-populate the newly formatted OSDs. In this case you'd be living with num_replicas-1 for a while. Sounds exciting! :) Thanks, Chad. ___ ceph-users mailing list ceph-users@lists.ceph.com javascript:; http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
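To the best of my knowledge, the separate options Greg alludes to are the ones below (the defaults are quoted from memory, so verify them against your release):

  [mon]
      # mark any booting OSD in, even one an administrator marked out
      mon osd auto mark in = false
      # mark booting OSDs back in only if they were automatically marked out
      mon osd auto mark auto out in = true
      # mark freshly created OSDs in on their first boot
      mon osd auto mark new in = true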
Re: [ceph-users] Question/idea about performance problems with a few overloaded OSDs
On Tue, Oct 21, 2014 at 10:15 AM, Lionel Bouton lionel+c...@bouton.name wrote: Hi, I've yet to install 0.80.7 on one node to confirm its stability and use the new IO priority tuning parameters enabling prioritized access to data from client requests. In the meantime, faced with large slowdowns caused by resync or external IO load (although external IO load is not expected, it can happen in migrations from other storage solutions, as in our recent experience), I've got an idea related to the underlying problem (IO load concurrent with client requests or even concentrated client requests) that might already be implemented (or not be of much value), so I'll write it down to get feedback. When IO load is not balanced correctly across OSDs, the most loaded OSD becomes a bottleneck for both write and read requests, and for many (most?) workloads will become a bottleneck for the whole storage network as seen by the client. This happened to us on numerous occasions (low filesystem performance, OSD restarts triggering backfills or recoveries). For read requests, would it be beneficial for OSDs to communicate with their peers to find out their recent IO mean/median/... service time, and make OSDs able to proxy requests to less loaded nodes when they are substantially more loaded than their peers? If the additional network load generated by proxying requests proves detrimental to the overall performance, maybe an update to librados to accept a hint to redirect read requests for a given PG and a given period might be a solution. I understand that even if this is possible for read requests, it doesn't apply to write requests because they are synchronized across all replicas. That said, diminishing read load on one OSD without modifying write behavior will obviously help the OSD process write requests faster. If the general idea isn't bad or already obsoleted by another, it's obviously not trivial. For example it can create unstable feedback loops, so if I were to try and implement it I'd probably start with a selective proxy/redirect, with the probability of proxying/redirecting being computed from the respective loads of all OSDs storing a given PG, to avoid ping-pong situations where read requests overload one OSD before overloading another and coming round again. Any thoughts? Is it based on wrong assumptions? Would it prove to be a can of worms if someone tried to implement it? Yeah, there's one big thing you're missing: we strictly order reads and writes to an object, and the primary is the serialization point. If we were to proxy reads to another replica it would be easy enough for the primary to continue handling the ordering, but if it were just a redirect it wouldn't be able to do so (the primary doesn't know when the read is completed, allowing it to start a write). Setting up the proxy of course requires a lot more code, but more importantly it's more resource-intensive on the primary, so I'm not sure if it's worth it. :/ The primary affinity value we recently introduced is designed to help alleviate persistent balancing problems around this by letting you reduce how many PGs an OSD is primary for without changing the location of the actual data in the cluster. But dynamic updates to that aren't really feasible either (it's a map change and requires repeering). There are also relaxed consistency mechanisms that let clients read from a replica (randomly, or the one closest to them, etc), but with these there's no good way to get load data from the OSDs to the clients. 
So redirects of some kind sound like a good feature, but I'm not sure how one could go about implementing them reasonably. I think the actual proxy is probably the best bet, but that's an awful lot of code in critical places and with lots of dependencies whose performance/balancing benefits I'm a little dubious of. :/ -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
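A short sketch of the primary-affinity knob Greg refers to (the OSD id and weight here are made up); it shifts primary duties away from an OSD without moving any data:

  # the monitors have to allow the feature first
  ceph tell mon.\* injectargs '--mon_osd_allow_primary_affinity=true'
  # roughly halve the number of PGs for which osd.12 acts as primary
  ceph osd primary-affinity osd.12 0.5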
Re: [ceph-users] Extremely slow small files rewrite performance
Are these tests conducted using a local fs on RBD, or using CephFS? If CephFS, do you have multiple clients mounting the FS, and what are they doing? What client (kernel or ceph-fuse)? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Oct 21, 2014 at 9:05 AM, Sergey Nazarov nataraj...@gmail.com wrote: Hi I just built a new cluster using these quickstart instructions: http://ceph.com/docs/master/start/ And here is what I am seeing: # time for i in {1..10}; do echo $i > $i.txt ; done real 0m0.081s user 0m0.000s sys 0m0.004s And if I try to repeat the same command (when the files are already created): # time for i in {1..10}; do echo $i > $i.txt ; done real 0m48.894s user 0m0.000s sys 0m0.004s I was very surprised and then just tried to rewrite a single file: # time echo 1 > 1.txt real 0m3.133s user 0m0.000s sys 0m0.000s BTW, I don't think it is a problem with OSD speed or network: # time sysbench --num-threads=1 --test=fileio --file-total-size=1G --file-test-mode=rndrw prepare 1073741824 bytes written in 23.52 seconds (43.54 MB/sec). Here is my ceph cluster status and version: # ceph -w cluster d3dcacc3-89fb-4db0-9fa9-f1f6217280cb health HEALTH_OK monmap e4: 4 mons at {atl-fs10=10.44.101.70:6789/0,atl-fs11=10.44.101.91:6789/0,atl-fs12=10.44.101.92:6789/0,atl-fs9=10.44.101.69:6789/0}, election epoch 40, quorum 0,1,2,3 atl-fs9,atl-fs10,atl-fs11,atl-fs12 mdsmap e33: 1/1/1 up {0=atl-fs12=up:active}, 3 up:standby osdmap e92: 4 osds: 4 up, 4 in pgmap v8091: 192 pgs, 3 pools, 123 MB data, 1658 objects 881 GB used, 1683 GB / 2564 GB avail 192 active+clean client io 1820 B/s wr, 1 op/s # ceph -v ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) All nodes are connected with a gigabit network. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Extremely slow small files rewrite performance
Can you enable debugging on the client (debug ms = 1, debug client = 20) and mds (debug ms = 1, debug mds = 20), run this test again, and post them somewhere for me to look at? While you're at it, can you try rados bench and see what sort of results you get? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Oct 21, 2014 at 10:57 AM, Sergey Nazarov nataraj...@gmail.com wrote: It is CephFS mounted via ceph-fuse. I am getting the same results not depending on how many other clients are having this fs mounted and their activity. Cluster is working on Debian Wheezy, kernel 3.2.0-4-amd64. On Tue, Oct 21, 2014 at 1:44 PM, Gregory Farnum g...@inktank.com wrote: Are these tests conducted using a local fs on RBD, or using CephFS? If CephFS, do you have multiple clients mounting the FS, and what are they doing? What client (kernel or ceph-fuse)? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Oct 21, 2014 at 9:05 AM, Sergey Nazarov nataraj...@gmail.com wrote: Hi I just built a new cluster using this quickstart instructions: http://ceph.com/docs/master/start/ And here is what I am seeing: # time for i in {1..10}; do echo $i $i.txt ; done real 0m0.081s user 0m0.000s sys 0m0.004s And if I try to repeat the same command (when files already created): # time for i in {1..10}; do echo $i $i.txt ; done real 0m48.894s user 0m0.000s sys 0m0.004s I was very surprised and then just tried to rewrite a single file: # time echo 1 1.txt real 0m3.133s user 0m0.000s sys 0m0.000s BTW, I dont think it is the problem with OSD speed or network: # time sysbench --num-threads=1 --test=fileio --file-total-size=1G --file-test-mode=rndrw prepare 1073741824 bytes written in 23.52 seconds (43.54 MB/sec). Here is my ceph cluster status and verion: # ceph -w cluster d3dcacc3-89fb-4db0-9fa9-f1f6217280cb health HEALTH_OK monmap e4: 4 mons at {atl-fs10=10.44.101.70:6789/0,atl-fs11=10.44.101.91:6789/0,atl-fs12=10.44.101.92:6789/0,atl-fs9=10.44.101.69:6789/0}, election epoch 40, quorum 0,1,2,3 atl-fs9,atl-fs10,atl-fs11,atl-fs12 mdsmap e33: 1/1/1 up {0=atl-fs12=up:active}, 3 up:standby osdmap e92: 4 osds: 4 up, 4 in pgmap v8091: 192 pgs, 3 pools, 123 MB data, 1658 objects 881 GB used, 1683 GB / 2564 GB avail 192 active+clean client io 1820 B/s wr, 1 op/s # ceph -v ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) All nodes connected with gigabit network. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
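A sketch of what that debugging setup could look like, assuming the settings go into the ceph.conf read by the MDS and the ceph-fuse client (which then need a restart/remount), plus a rados bench run against a hypothetical test pool:

  [mds]
      debug ms = 1
      debug mds = 20
  [client]
      debug ms = 1
      debug client = 20

  # 30-second write benchmark, 16 concurrent ops, keeping the objects for a follow-up read test
  rados bench -p testpool 30 write -t 16 --no-cleanup
  rados bench -p testpool 30 seq -t 16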
Re: [ceph-users] Fio rbd stalls during 4M reads
There's a temporary issue in the master branch that makes rbd reads greater than the cache size hang (if the cache is on). This might be that. (Jason is working on it: http://tracker.ceph.com/issues/9854) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Thu, Oct 23, 2014 at 5:09 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: I'm doing some fio tests on Giant using the fio rbd driver to measure performance on a new ceph cluster. However with block sizes of 1M and above (initially noticed with 4M) I am seeing absolutely no IOPS for *reads* - and the fio process becomes non-interruptible (needs kill -9): $ ceph -v ceph version 0.86-467-g317b83d (317b831a917f70838870b31931a79bdd4dd0) $ fio --version fio-2.1.11-20-g9a44 $ fio read-busted.fio env-read-4M: (g=0): rw=read, bs=4M-4M/4M-4M/4M-4M, ioengine=rbd, iodepth=32 fio-2.1.11-20-g9a44 Starting 1 process rbd engine: RBD version: 0.1.8 Jobs: 1 (f=1): [R(1)] [inf% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158050441d:06h:58m:03s] This appears to be a pure fio rbd driver issue, as I can attach the relevant rbd volume to a VM and dd from it using 4M blocks no problem. Any ideas? Cheers Mark ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
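One quick way to check whether the cache bug is what's biting (this is my assumption, not a workaround confirmed in the tracker) is to rerun the fio job with the client-side RBD cache disabled:

  [client]
      rbd cache = false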
Re: [ceph-users] error when executing ceph osd pool set foo-hot cache-mode writeback
On Tue, Oct 28, 2014 at 3:24 AM, Cristian Falcas cristi.fal...@gmail.com wrote: Hello, In the documentation about creating an cache pool, you find this: Cache mode The most important policy is the cache mode: ceph osd pool set foo-hot cache-mode writeback But when trying to run the above command, I get an error: ceph osd pool set ssd_cache cache-mode writeback Invalid command: cache-mode not in size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|debug_fake_ec_pool|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|auid osd pool set poolname size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|debug_fake_ec_pool|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|auid val {--yes-i-really-mean-it} : set pool parameter var to val Error EINVAL: invalid command Is this deprecated? I'm using version 0.80. Those are the commands I used to create the cache: # Set up a read/write cache pool ssd_cache for pool images: ceph osd tier add images ssd_cache ceph osd tier cache-mode ssd_cache writeback # Direct all traffic for images to ssd_cache: ceph osd tier set-overlay images ssd_cache ceph osd pool set ssd_cache cache-mode writeback # Set the target size and enable the tiering agent for ssd_cache: ceph osd pool set ssd_cache hit_set_type bloom ceph osd pool set ssd_cache hit_set_count 1 ceph osd pool set ssd_cache hit_set_period 3600 # 1 hour ceph osd pool set ssd_cache target_max_bytes 4000 # 500 GB # will begin flushing dirty objects when 40% of the pool is dirty and begin evicting clean objects when we reach 80% of the target size. ceph osd pool set ssd_cache cache_target_dirty_ratio .4 ceph osd pool set ssd_cache cache_target_full_ratio .8 Where are you seeing the ceph osd pool set ssd_cache cache-mode writeback from? You're setting that with the ceph osd tier cache-mode ssd_cache writeback command. :) -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Adding a monitor to
On Mon, Oct 27, 2014 at 11:37 AM, Patrick Darley patrick.dar...@codethink.co.uk wrote: Hi there Over the last week or so, I've been trying to connect a ceph monitor node running on a baserock system to connect to a simple 3-node ubuntu ceph cluster. The 3 node ubunutu cluster was created by following the documented Quick installation guide using 3 VMs running ubuntu Trusty. After the ubuntu cluster has been deployed I would then follow the directions below, which I derived from comparing the ceph-deploy debug information, the ceph documentation on adding monitor nodes to an existing system and the ceph documentation on bootstrapping monitor nodes. 1. scp the /etc/ceph/* from admin node 2. create the dir: mkdir /var/lib/ceph/mon/ceph-bcc08 3. generate mon keyring: sudo ceph auth get mon. -o /var/lib/ceph/tmp/ceph-bcc08.mon.keyring 4. generate monmap: sudo ceph mon getmap -o /var/lib/ceph/tmp/monmap Yeah, this is wrong. You're here giving the monitor its own keyring which it is going to expect anybody to talk to to be encrypting with. The docs have a section on adding monitors which should work verbatim; if not it's a doc bug: http://ceph.com/docs/master/rados/operations/add-or-rm-mons/#adding-monitors -Greg 5. That filesystem thingy: sudo ceph-mon --cluster ceph --mkfs -i bcc08 --keyring /var/lib/ceph/tmp/ceph-bcc08.mon.keyring --monmap /var/lib/ceph/tmp/monmap 6. Unlink keys and old monmap: rm /var/lib/ceph/tmp/* 7. touch things: touch /var/lib/ceph/mon/ceph-bcc08/done and touch /var/lib/ceph/mon/ceph-bcc08/sysvinit 8. Then start the mon: sudo /etc/init.d/ceph start mon.bcc08 When I carry out these steps in the attempt to add a baserock system to the ubuntu cluster, the monitor node has not been added to the cluster and the admin socket mon_status gives the following output. ~ # ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.bcc07.asok mon_status { name: bcc07, rank: -1, state: probing, election_epoch: 0, quorum: [], outside_quorum: [], extra_probe_peers: [], sync_provider: [], monmap: { epoch: 0, fsid: 4460079d-42f4-4e3a-8ce3-e2a7fa2685e6, modified: 2014-10-27 12:37:25.531542, created: 2014-10-27 12:37:25.531542, mons: [ { rank: 0, name: ucc01, addr: 192.168.122.95:6789\/0}]}} And the newly added monitor remains stuck in the probing state indefinitely. To try and resolve this issue I have looked at the problems monitor troubleshooting page of the ceph documentation, eg. ntp sychronisation and checking network connectivity (to the best of my ability :-s ). It is also worth mentioning that I have created a 3 node ceph cluster on baserock machines (1 mon, 2 osds) then successfully added monitor nodes running baserock and ubuntu systems using the same 8 step process given above. This leaves me confused as to why adding the monitor run on baserock to the all ubuntu cluster specifically is causing problems. Are there any reasons why this 'probing' problem could be occuring? Im feeling a little stuck of how to proceed and would welcome any suggestions. Thanks for your help, Patrick ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] mds isn't working anymore after osd's running full
You'll need to gather a log with the offsets visible; you can do this with debug ms = 1; debug mds = 20; debug journaler = 20. -Greg On Fri, Oct 24, 2014 at 7:03 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hello Greg and John, I used the patch on the ceph cluster and tried it again: /usr/bin/ceph-mds -i th1-mon001 -c /etc/ceph/ceph.conf --cluster ceph --undump-journal 0 journaldumptgho-mon001 undump journaldumptgho-mon001 start 9483323613 len 134213311 writing header 200. writing 9483323613~1048576 writing 9484372189~1048576 writing 9614395613~1048576 writing 9615444189~1048576 writing 9616492765~1044159 done. It went well without errors and after that I restarted the mds. The status went from up:replay to up:reconnect to up:rejoin(lagged or crashed) In the log there is an error about trim_to trimming_pos and its like Greg mentioned that maybe the dumpfile needs to be truncated to the proper length and resetting and undumping again. How can I truncate the dumped file to the correct length? The mds log during the undumping and starting the mds: http://pastebin.com/y14pSvM0 Kind Regards, Jasper Van: john.sp...@inktank.com [john.sp...@inktank.com] namens John Spray [john.sp...@redhat.com] Verzonden: donderdag 16 oktober 2014 12:23 Aan: Jasper Siero CC: Gregory Farnum; ceph-users Onderwerp: Re: [ceph-users] mds isn't working anymore after osd's running full Following up: firefly fix for undump is: https://github.com/ceph/ceph/pull/2734 Jasper: if you still need to try undumping on this existing firefly cluster, then you can download ceph-mds packages from this wip-firefly-undump branch from http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/ Cheers, John On Wed, Oct 15, 2014 at 8:15 PM, John Spray john.sp...@redhat.com wrote: Sadly undump has been broken for quite some time (it was fixed in giant as part of creating cephfs-journal-tool). If there's a one line fix for this then it's probably worth putting in firefly since it's a long term supported branch -- I'll do that now. John On Wed, Oct 15, 2014 at 8:23 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hello Greg, The dump and reset of the journal was succesful: [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 --pid-file /var/run/ceph/mds.th1-mon001.pid -c /etc/ceph/ceph.conf --cluster ceph --dump-journal 0 journaldumptgho-mon001 journal is 9483323613~134215459 read 134213311 bytes at offset 9483323613 wrote 134213311 bytes at offset 9483323613 to journaldumptgho-mon001 NOTE: this is a _sparse_ file; you can $ tar cSzf journaldumptgho-mon001.tgz journaldumptgho-mon001 to efficiently compress it while preserving sparseness. [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 --pid-file /var/run/ceph/mds.th1-mon001.pid -c /etc/ceph/ceph.conf --cluster ceph --reset-journal 0 old journal was 9483323613~134215459 new journal start will be 9621733376 (4194304 bytes past old end) writing journal head writing EResetJournal entry done Undumping the journal was not successful and looking into the error client_lock.is_locked() is showed several times. The mds is not running when I start the undumping so maybe have forgot something? [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 --pid-file /var/run/ceph/mds.th1-mon001.pid -c /etc/ceph/ceph.conf --cluster ceph --undump-journal 0 journaldumptgho-mon001 undump journaldumptgho-mon001 start 9483323613 len 134213311 writing header 200. 
osdc/Objecter.cc: In function 'ceph_tid_t Objecter::op_submit(Objecter::Op*)' thread 7fec3e5ad7a0 time 2014-10-15 09:09:32.020287 osdc/Objecter.cc: 1225: FAILED assert(client_lock.is_locked()) ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6) 1: /usr/bin/ceph-mds() [0x80f15e] 2: (Dumper::undump(char const*)+0x65d) [0x56c7ad] 3: (main()+0x1632) [0x569c62] 4: (__libc_start_main()+0xfd) [0x7fec3ca68d5d] 5: /usr/bin/ceph-mds() [0x567d99] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. 2014-10-15 09:09:32.021313 7fec3e5ad7a0 -1 osdc/Objecter.cc: In function 'ceph_tid_t Objecter::op_submit(Objecter::Op*)' thread 7fec3e5ad7a0 time 2014-10-15 09:09:32.020287 osdc/Objecter.cc: 1225: FAILED assert(client_lock.is_locked()) ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6) 1: /usr/bin/ceph-mds() [0x80f15e] 2: (Dumper::undump(char const*)+0x65d) [0x56c7ad] 3: (main()+0x1632) [0x569c62] 4: (__libc_start_main()+0xfd) [0x7fec3ca68d5d] 5: /usr/bin/ceph-mds() [0x567d99] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. 0 2014-10-15 09:09:32.021313 7fec3e5ad7a0 -1 osdc/Objecter.cc: In function 'ceph_tid_t Objecter::op_submit(Objecter::Op*)' thread 7fec3e5ad7a0 time 2014-10-15 09:09:32.020287 osdc/Objecter.cc: 1225: FAILED
Re: [ceph-users] Adding a monitor to
I'm sorry, you're right — I misread it. :( But indeed step 6 is the crucial one, which tells the existing monitors to accept the new one into the cluster. You'll need to run it with an admin client keyring that can connect to the existing cluster; that's probably the part that has gone wrong. You don't need to run it from the new monitor, so if you're having trouble getting the keys to behave I'd just run it from an existing system. :) -Greg On Tue, Oct 28, 2014 at 10:11 AM, Patrick Darley patrick.dar...@codethink.co.uk wrote: On 2014-10-28 16:08, Gregory Farnum wrote: On Mon, Oct 27, 2014 at 11:37 AM, Patrick Darley patrick.dar...@codethink.co.uk wrote: Hi there Over the last week or so, I've been trying to connect a ceph monitor node running on a baserock system to connect to a simple 3-node ubuntu ceph cluster. The 3 node ubunutu cluster was created by following the documented Quick installation guide using 3 VMs running ubuntu Trusty. After the ubuntu cluster has been deployed I would then follow the directions below, which I derived from comparing the ceph-deploy debug information, the ceph documentation on adding monitor nodes to an existing system and the ceph documentation on bootstrapping monitor nodes. 1. scp the /etc/ceph/* from admin node 2. create the dir: mkdir /var/lib/ceph/mon/ceph-bcc08 3. generate mon keyring: sudo ceph auth get mon. -o /var/lib/ceph/tmp/ceph-bcc08.mon.keyring 4. generate monmap: sudo ceph mon getmap -o /var/lib/ceph/tmp/monmap Yeah, this is wrong. You're here giving the monitor its own keyring which it is going to expect anybody to talk to to be encrypting with. If you are referring to steps 3 and 4 above, I believe these are synonymous with steps 3 and 4 of the documentation you recommended. The monitor keyring and the current monmap are retrieved from the initial monitor. they are then used in step 5 to prepare the monitor's data directory. The docs have a section on adding monitors which should work verbatim; if not it's a doc bug: http://ceph.com/docs/master/rados/operations/add-or-rm-mons/#adding-monitors Thanks for the recommendation. I have tried to use this procedure a couple of times but got stuck at step number 6 of this method. The command given hangs then times out, causing the rest of the cluster to fail. -Greg Thanks for the reply! Much appreciated, Patrick 5. That filesystem thingy: sudo ceph-mon --cluster ceph --mkfs -i bcc08 --keyring /var/lib/ceph/tmp/ceph-bcc08.mon.keyring --monmap /var/lib/ceph/tmp/monmap 6. Unlink keys and old monmap: rm /var/lib/ceph/tmp/* 7. touch things: touch /var/lib/ceph/mon/ceph-bcc08/done and touch /var/lib/ceph/mon/ceph-bcc08/sysvinit 8. Then start the mon: sudo /etc/init.d/ceph start mon.bcc08 When I carry out these steps in the attempt to add a baserock system to the ubuntu cluster, the monitor node has not been added to the cluster and the admin socket mon_status gives the following output. ~ # ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.bcc07.asok mon_status { name: bcc07, rank: -1, state: probing, election_epoch: 0, quorum: [], outside_quorum: [], extra_probe_peers: [], sync_provider: [], monmap: { epoch: 0, fsid: 4460079d-42f4-4e3a-8ce3-e2a7fa2685e6, modified: 2014-10-27 12:37:25.531542, created: 2014-10-27 12:37:25.531542, mons: [ { rank: 0, name: ucc01, addr: 192.168.122.95:6789\/0}]}} And the newly added monitor remains stuck in the probing state indefinitely. To try and resolve this issue I have looked at the problems monitor troubleshooting page of the ceph documentation, eg. 
ntp synchronisation and checking network connectivity (to the best of my ability :-s ). It is also worth mentioning that I have created a 3 node ceph cluster on baserock machines (1 mon, 2 osds) and then successfully added monitor nodes running baserock and ubuntu systems using the same 8 step process given above. This leaves me confused as to why adding the monitor running on baserock to the all-ubuntu cluster specifically is causing problems. Are there any reasons why this 'probing' problem could be occurring? I'm feeling a little stuck on how to proceed and would welcome any suggestions. Thanks for your help, Patrick ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
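For comparison, the documented add-a-monitor procedure condenses to roughly the commands below (the host name bcc08 follows the thread; the new monitor's address is a placeholder):

  # on a node whose client.admin key already works against the cluster
  ceph auth get mon. -o /tmp/mon.keyring
  ceph mon getmap -o /tmp/monmap
  # on the new monitor host
  sudo ceph-mon -i bcc08 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
  # the crucial step: tell the existing quorum about the new monitor (can be run from any working admin node)
  ceph mon add bcc08 <new-mon-ip>:6789
  # then start the daemon on the new host
  sudo /etc/init.d/ceph start mon.bcc08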
Re: [ceph-users] Troubleshooting Incomplete PGs
On Thu, Oct 23, 2014 at 6:41 AM, Chris Kitzmiller ckitzmil...@hampshire.edu wrote: On Oct 22, 2014, at 8:22 PM, Craig Lewis wrote: Shot in the dark: try manually deep-scrubbing the PG. You could also try marking various osd's OUT, in an attempt to get the acting set to include osd.25 again, then do the deep-scrub again. That probably won't help though, because the pg query says it probed osd.25 already... actually , it doesn't. osd.25 is in probing_osds not probed_osds. The deep-scrub might move things along. Re-reading your original post, if you marked the slow osds OUT, but left them running, you should not have lost data. That's true. I just marked them out. I did lose osd.10 (in addition to out'ting those other two OSDs) so I'm not out of the woods yet. If the scrubs don't help, it's probably time to hop on IRC. When I issue the deep-scrub command the cluster just doesn't scrub it. Same for regular scrub. :( This pool was offering an RBD which I've lost my connection to and it won't remount so my data is totally inaccessible at the moment. Thanks for your help so far! It looks like you are suffering from http://tracker.ceph.com/issues/9752, which we've not yet seen in-house but have had reported a few times. I suspect that Loic (CC'ed) would like to discuss your cluster's history with you to try and narrow it down. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Adding a monitor to
[Re-adding the list, so this is archived for future posterity.] On Wed, Oct 29, 2014 at 6:11 AM, Patrick Darley patrick.dar...@codethink.co.uk wrote: Thanks again for the reply Greg! On 2014-10-28 17:39, Gregory Farnum wrote: I'm sorry, you're right — I misread it. :( No worries, I had included some misleading words like generate in my rough description where retrive would have been more appropriate. Sorry! But indeed step 6 is the crucial one, which tells the existing monitors to accept the new one into the cluster. You'll need to run it with an admin client keyring that can connect to the existing cluster; that's probably the part that has gone wrong. You don't need to run it from the new monitor, I think, in order to carry out the 5th step you also need the client.admin keyring present, that'd be preparing the monitors data directory. I had scp-ed it across to the monitor along with the ceph.conf file and pu them in the expected location, /etc/ceph/, prior to running that command. so if you're having trouble getting the keys to behave I'd just run it from an existing system. :) I tried running this command, step 6, from the admin node of my ubuntu ceph cluster. As I had experienced before, the command hung. Then trying to run any ceph commands on the rest of the cluster I get a long hang then the following error: cc@ucc01:~$ ceph -s 2014-10-29 10:40:33.748334 7ffaec051700 0 monclient(hunting): authenticate timed out after 300 2014-10-29 10:40:33.748499 7ffaec051700 0 librados: client.admin authentication error (110) Connection timed out Error connecting to cluster: TimedOut The monitor that I was trying to add can be started ok after this (once I have touched the done and sysvinit files) but also gives the above error when attempting to run the ceph -s. Checking the log file I see the following lines repeated: 2014-10-29 10:01:01.721905 7ffd548ac700 0 mon.bcc07@-1(probing) e0 handle_probe ignoring fsid 5021163c-3c0b-4ec5-83fe-f0622c0e9447 != f2d609ef-2065-4862-a821-55c484d61dca 2014-10-29 10:01:01.809991 7ffd550ad700 1 mon.bcc07@-1(probing).paxos(paxos recovering c 0..0) is_readable now=2014-10-29 10:01:01.809996 lease_expire=0.00 has v0 lc 0 2014-10-29 10:01:03.721559 7ffd548ac700 0 mon.bcc07@-1(probing) e0 handle_probe ignoring fsid 5021163c-3c0b-4ec5-83fe-f0622c0e9447 != f2d609ef-2065-4862-a821-55c484d61dca 2014-10-29 10:01:03.810466 7ffd550ad700 1 mon.bcc07@-1(probing).paxos(paxos recovering c 0..0) is_readable now=2014-10-29 10:01:03.810467 lease_expire=0.00 has v0 lc 0 The initial monitor has the following log at around a similar time: 2014-10-29 10:01:02.169655 7f52e7408700 0 mon.ucc01@1(probing) e2 handle_probe ignoring fsid f2d609ef-2065-4862-a821-55c484d61dca != 5021163c-3c0b-4ec5-83fe-f0622c0e9447 2014-10-29 10:01:04.170153 7f52e7408700 0 mon.ucc01@1(probing) e2 handle_probe ignoring fsid f2d609ef-2065-4862-a821-55c484d61dca != 5021163c-3c0b-4ec5-83fe-f0622c0e9447 2014-10-29 10:01:06.169300 7f52e7408700 0 mon.ucc01@1(probing) e2 handle_probe ignoring fsid f2d609ef-2065-4862-a821-55c484d61dca != 5021163c-3c0b-4ec5-83fe-f0622c0e9447 It looks to me like there might be conflicting fsid values being compared somewhere, but checking the ceph.conf files on the nodes I found them to be declared as the same. The log files recorded a similar output on both monitors for some time. 
I then turned off the monitor I was attempting to add at approximately 12:39:30 and the log file of the initial monitor has the following output around this time: 2014-10-29 12:39:30.304639 7f52e7408700 0 mon.ucc01@1(probing) e2 handle_probe ignoring fsid f2d609ef-2065-4862-a821-55c484d61dca != 5021163c-3c0b-4ec5-83fe-f0622c0e9447 Okay, that's indeed not right. I suspect this is your issue but I'm not entirely certain because your other symptoms are a bit weird. I bet Joao can help though; he maintains the monitor and deals with these issues a lot more often than I do. :) -Greg 2014-10-29 12:39:32.023964 7f52e7c09700 0 mon.ucc01@1(probing).data_health(1) update_stats avail 68% total 14318640 used 3748076 avail 9820180 2014-10-29 12:39:32.303740 7f52e7408700 0 mon.ucc01@1(probing) e2 handle_probe ignoring fsid f2d609ef-2065-4862-a821-55c484d61dca != 5021163c-3c0b-4ec5-83fe-f0622c0e9447 2014-10-29 12:39:32.394606 7f52e53fd700 0 -- 192.168.122.95:6789/0 192.168.122.42:6789/0 pipe(0x55e5180 sd=24 :6789 s=2 pgs=1 cs=1 l=0 c=0x39bfde0).fault with nothing to send, going to standby 2014-10-29 12:39:33.862400 7f52e5902700 0 -- 192.168.122.95:6789/0 192.168.122.42:6789/0 pipe(0x55e5180 sd=13 :6789 s=1 pgs=1 cs=2 l=0 c=0x39bfde0).fault 2014-10-29 12:40:32.024807 7f52e7c09700 0 mon.ucc01@1(probing).data_health(1) update_stats avail 68% total 14318640 used 3748072 avail 9820184 2014-10-29 12:41:32.025632 7f52e7c09700 0 mon.ucc01@1
Re: [ceph-users] mds isn't working anymore after osd's running full
On Wed, Oct 29, 2014 at 7:51 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hello Greg, I added the debug options which you mentioned and started the process again: [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 --pid-file /var/run/ceph/mds.th1-mon001.pid -c /etc/ceph/ceph.conf --cluster ceph --reset-journal 0 old journal was 9483323613~134233517 new journal start will be 9621733376 (4176246 bytes past old end) writing journal head writing EResetJournal entry done [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 -c /etc/ceph/ceph.conf --cluster ceph --undump-journal 0 journaldumptgho-mon001 undump journaldumptgho-mon001 start 9483323613 len 134213311 writing header 200. writing 9483323613~1048576 writing 9484372189~1048576 writing 9485420765~1048576 writing 9486469341~1048576 writing 9487517917~1048576 writing 9488566493~1048576 writing 9489615069~1048576 writing 9490663645~1048576 writing 9491712221~1048576 writing 9492760797~1048576 writing 9493809373~1048576 writing 9494857949~1048576 writing 9495906525~1048576 writing 9496955101~1048576 writing 9498003677~1048576 writing 9499052253~1048576 writing 9500100829~1048576 writing 9501149405~1048576 writing 9502197981~1048576 writing 9503246557~1048576 writing 9504295133~1048576 writing 9505343709~1048576 writing 9506392285~1048576 writing 9507440861~1048576 writing 9508489437~1048576 writing 9509538013~1048576 writing 9510586589~1048576 writing 9511635165~1048576 writing 9512683741~1048576 writing 9513732317~1048576 writing 9514780893~1048576 writing 9515829469~1048576 writing 9516878045~1048576 writing 9517926621~1048576 writing 9518975197~1048576 writing 9520023773~1048576 writing 9521072349~1048576 writing 9522120925~1048576 writing 9523169501~1048576 writing 9524218077~1048576 writing 9525266653~1048576 writing 9526315229~1048576 writing 9527363805~1048576 writing 9528412381~1048576 writing 9529460957~1048576 writing 9530509533~1048576 writing 9531558109~1048576 writing 9532606685~1048576 writing 9533655261~1048576 writing 9534703837~1048576 writing 9535752413~1048576 writing 9536800989~1048576 writing 9537849565~1048576 writing 9538898141~1048576 writing 9539946717~1048576 writing 9540995293~1048576 writing 9542043869~1048576 writing 9543092445~1048576 writing 9544141021~1048576 writing 9545189597~1048576 writing 9546238173~1048576 writing 9547286749~1048576 writing 9548335325~1048576 writing 9549383901~1048576 writing 9550432477~1048576 writing 9551481053~1048576 writing 9552529629~1048576 writing 9553578205~1048576 writing 9554626781~1048576 writing 9555675357~1048576 writing 9556723933~1048576 writing 9557772509~1048576 writing 9558821085~1048576 writing 9559869661~1048576 writing 9560918237~1048576 writing 9561966813~1048576 writing 9563015389~1048576 writing 9564063965~1048576 writing 9565112541~1048576 writing 9566161117~1048576 writing 9567209693~1048576 writing 9568258269~1048576 writing 9569306845~1048576 writing 9570355421~1048576 writing 9571403997~1048576 writing 9572452573~1048576 writing 9573501149~1048576 writing 9574549725~1048576 writing 9575598301~1048576 writing 9576646877~1048576 writing 9577695453~1048576 writing 9578744029~1048576 writing 9579792605~1048576 writing 9580841181~1048576 writing 9581889757~1048576 writing 9582938333~1048576 writing 9583986909~1048576 writing 9585035485~1048576 writing 9586084061~1048576 writing 9587132637~1048576 writing 9588181213~1048576 writing 9589229789~1048576 writing 9590278365~1048576 writing 9591326941~1048576 writing 9592375517~1048576 writing 
9593424093~1048576 writing 9594472669~1048576 writing 9595521245~1048576 writing 9596569821~1048576 writing 9597618397~1048576 writing 9598666973~1048576 writing 9599715549~1048576 writing 9600764125~1048576 writing 9601812701~1048576 writing 9602861277~1048576 writing 9603909853~1048576 writing 9604958429~1048576 writing 9606007005~1048576 writing 9607055581~1048576 writing 9608104157~1048576 writing 9609152733~1048576 writing 9610201309~1048576 writing 9611249885~1048576 writing 9612298461~1048576 writing 9613347037~1048576 writing 9614395613~1048576 writing 9615444189~1048576 writing 9616492765~1044159 done. [root@th1-mon001 ~]# service ceph start mds === mds.th1-mon001 === Starting Ceph mds.th1-mon001 on th1-mon001... starting mds.th1-mon001 at :/0 The new logs: http://pastebin.com/wqqjuEpy These don't have the increased debugging levels set. :( I'm not sure where you could have put them that they didn't get picked up, but make sure it's in the ceph.conf that this mds daemon is referring to. (You can see the debug levels in use in the --- logging levels --- section; they appear to all be default.) -Greg
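To make sure the settings reach the daemon, they can go into the [mds] section of the ceph.conf that this ceph-mds reads, or be passed on the command line when starting it by hand (a sketch; the host name matches the thread):

  [mds]
      debug ms = 1
      debug mds = 20
      debug journaler = 20

  # or equivalently, as command-line overrides:
  /usr/bin/ceph-mds -i th1-mon001 -c /etc/ceph/ceph.conf --debug-ms 1 --debug-mds 20 --debug-journaler 20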
Re: [ceph-users] Delete pools with low priority?
Dan (who wrote that slide deck) is probably your best bet here, but I believe pool deletion is not very configurable and fairly expensive right now. I suspect that it will get better in Hammer or Infernalis, once we have a unified op work queue that we can independently prioritize all IO through (this was a blueprint in CDS today!). Similar problems with snap trimming and scrubbing were resolved by introducing sleeps between ops, but that's a bit of a hack itself and should be going away once proper IO prioritization is available. -Greg On Wed, Oct 29, 2014 at 8:19 AM, Daniel Schneller daniel.schnel...@centerdevice.com wrote: Bump :-) Any ideas on this? They would be much appreciated. Also: Sorry for a possible double post, client had forgotten its email config. On 2014-10-22 21:21:54 +, Daniel Schneller said: We have been running several rounds of benchmarks through the Rados Gateway. Each run creates several hundred thousand objects and similarly many containers. The cluster consists of 4 machines, 12 OSD disks (spinning, 4TB) — 48 OSDs total. After running a set of benchmarks we renamed the pools used by the gateway pools to get a clean baseline. In total we now have several million objects and containers in 3 pools. Redundancy for all pools is set to 3. Today we started deleting the benchmark data. Once the first renamed set of RGW pools was executed, cluster performance started to go down the drain. Using iotop we can see that the disks are all working furiously. As the command to delete the pools came back very quickly, the assumption is that we are now seeing the effects of the actual objects being removed, causing lots and lots of IO activity on the disks, negatively impacting regular operations. We are running OpenStack on top of Ceph, and we see drastic reduction in responsiveness of these machines as well as in CephFS. Fortunately this is still a test setup, so no production systems are affected. Nevertheless I would like to ask a few questions: 1) Is it possible to have the object deletion run in some low-prio mode? 2) If not, is there another way to delete lots and lots of objects without affecting the rest of the cluster so badly? 3) Can we somehow determine the progress of the deletion so far? We would like to estimate if this is going to take hours, days or weeks? 4) Even if not possible for the already running deletion, could be get a progress for the remaining pools we still want to delete? 5) Are there any parameters that we might tune — even if just temporarily - to speed this up? Slide 18 of http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern describes a very similar situation. Thanks, Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
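For reference, the "sleep between ops" knobs Greg mentions for snap trimming and scrubbing look like the following; note that these do not throttle pool deletion itself, they are only the existing precedent for that style of hack (the values are illustrative):

  [osd]
      osd snap trim sleep = 0.1
      osd scrub sleep = 0.1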
Re: [ceph-users] Crash with rados cppool and snapshots
On Wed, Oct 29, 2014 at 7:49 AM, Daniel Schneller daniel.schnel...@centerdevice.com wrote: Hi! We are exploring options to regularly preserve (i.e. backup) the contents of the pools backing our rados gateways. For that we create nightly snapshots of all the relevant pools when there is no activity on the system to get consistent states. In order to restore the whole pools back to a specific snapshot state, we tried to use the rados cppool command (see below) to copy a snapshot state into a new pool. Unfortunately this causes a segfault. Are we doing anything wrong? This command: rados cppool --snap snap-1 deleteme.lp deleteme.lp2 2> segfault.txt Produces this output: *** Caught signal (Segmentation fault) ** in thread 7f8f49a927c0 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) 1: rados() [0x43eedf] 2: (()+0x10340) [0x7f8f48738340] 3: (librados::IoCtxImpl::snap_lookup(char const*, unsigned long*)+0x17) [0x7f8f48aff127] 4: (main()+0x1385) [0x411e75] 5: (__libc_start_main()+0xf5) [0x7f8f4795fec5] 6: rados() [0x41c6f7] 2014-10-29 12:03:22.761653 7f8f49a927c0 -1 *** Caught signal (Segmentation fault) ** in thread 7f8f49a927c0 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) 1: rados() [0x43eedf] 2: (()+0x10340) [0x7f8f48738340] 3: (librados::IoCtxImpl::snap_lookup(char const*, unsigned long*)+0x17) [0x7f8f48aff127] 4: (main()+0x1385) [0x411e75] 5: (__libc_start_main()+0xf5) [0x7f8f4795fec5] 6: rados() [0x41c6f7] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. Full segfault file and the objdump output for the rados command can be found here: - https://public.centerdevice.de/53bddb80-423e-4213-ac62-59fe8dbb9bea - https://public.centerdevice.de/50b81566-41fb-439a-b58b-e1e32d75f32a We updated to the 0.80.7 release (saw the issue with 0.80.5 before and had hoped that the long list of bugfixes in the release notes would include a fix for this) but are still seeing it. Rados gateways, OSDs, MONs etc. have all been restarted after the update. Package versions as follows: daniel.schneller@node01 [~] $ ➜ dpkg -l | grep ceph ii ceph 0.80.7-1trusty ii ceph-common 0.80.7-1trusty ii ceph-fs-common 0.80.7-1trusty ii ceph-fuse 0.80.7-1trusty ii ceph-mds 0.80.7-1trusty ii libcephfs1 0.80.7-1trusty ii python-ceph 0.80.7-1trusty daniel.schneller@node01 [~] $ ➜ uname -a Linux node01 3.13.0-27-generic #50-Ubuntu SMP Thu May 15 18:06:16 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Copying without the snapshot works. Should this work at least in theory? Well, that's interesting. I'm not sure if this can be expected to work properly, but it certainly shouldn't crash there. Looking at it a bit, you can make it not crash by specifying -p deleteme.lp as well, but it simply copies the current state of the pool, not the snapped state. If you could generate a ticket or two at tracker.ceph.com, that would be helpful! -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
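In other words, based on the pools above, an invocation like this should avoid the segfault, but it copies the pool's current contents rather than the snap-1 state:

    rados -p deleteme.lp cppool --snap snap-1 deleteme.lp deleteme.lp2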
Re: [ceph-users] Swift + radosgw: How do I find accounts/containers/objects limitation?
On Fri, Oct 31, 2014 at 9:55 AM, Narendra Trivedi (natrived) natri...@cisco.com wrote: Hi All, I have been working with Openstack Swift + radosgw to stress the whole object storage from the Swift side (I have been creating containers and objects for days now) but can’t actually find the limitation when it comes to the number of accounts, containers, objects that can be created in the entire object storage. I have tried radosgw-admin but without any luck. Does anyone know how this can be found? There are no hard limits on any of these entities, except for a configurable one on the number of buckets per user. There is slow performance degradation as things like the number of objects in a bucket or number of buckets per user grows too large, but the thresholds will vary depending on your cluster. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Swift + radosgw: How do I find accounts/containers/objects limitation?
It defaults to 1000 and can be set per user via the radosgw-admin utility or the admin API, using the max-buckets param. On Fri, Oct 31, 2014 at 10:01 AM, Narendra Trivedi (natrived) natri...@cisco.com wrote: Thanks, Gregory. Do you know how I can find out where the number of buckets for a particular user has been configured? --Narendra -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Friday, October 31, 2014 11:58 AM To: Narendra Trivedi (natrived) Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Swift + radosgw: How do I find accounts/containers/objects limitation? On Fri, Oct 31, 2014 at 9:55 AM, Narendra Trivedi (natrived) natri...@cisco.com wrote: Hi All, I have been working with Openstack Swift + radosgw to stress the whole object storage from the Swift side (I have been creating containers and objects for days now) but can’t actually find the limitation when it comes to the number of accounts, containers, objects that can be created in the entire object storage. I have tried radosgw-admin but without any luck. Does anyone know how this can be found? There are no hard limits on any of these entities, except for a configurable one on the number of buckets per user. There is slow performance degradation as things like the number of objects in a bucket or number of buckets per user grows too large, but the thresholds will vary depending on your cluster. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
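For example (the uid is hypothetical), the per-user limit can be inspected and changed with radosgw-admin like this:

    radosgw-admin user info --uid=benchmark-user      # look for max_buckets in the output
    radosgw-admin user modify --uid=benchmark-user --max-buckets=5000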
Re: [ceph-users] giant release osd down
What happened when you did the OSD prepare and activate steps? Since your OSDs are either not running or can't communicate with the monitors, there should be some indication from those steps. -Greg On Sun, Nov 2, 2014 at 6:44 AM Shiv Raj Singh virk.s...@gmail.com wrote: Hi All I am new to ceph and I have been trying to configure a 3 node ceph cluster with 1 monitor and 2 osd nodes. I have reinstalled and recreated the cluster three times and I am stuck against the wall. My monitor is working as desired (I guess) but the status of the osds is down. I am following this link http://docs.ceph.com/docs/v0.80.5/install/manual-deployment/ for configuring the osds. The reason why I am not using ceph-deploy is because I want to understand the technology. Can someone please help me understand what I'm doing wrong !! :-) !! *Some useful diagnostic information* ceph2:~$ ceph osd tree # id weight type name up/down reweight -1 2 root default -3 1 host ceph2 0 1 osd.0 down 0 -2 1 host ceph3 1 1 osd.1 down 0 ceph health detail HEALTH_WARN 64 pgs stuck inactive; 64 pgs stuck unclean pg 0.22 is stuck inactive since forever, current state creating, last acting [] pg 0.21 is stuck inactive since forever, current state creating, last acting [] pg 0.20 is stuck inactive since forever, current state creating, last acting [] ceph -s cluster a04ee359-82f8-44c4-89b5-60811bef3f19 health HEALTH_WARN 64 pgs stuck inactive; 64 pgs stuck unclean monmap e1: 1 mons at {ceph1=192.168.101.41:6789/0}, election epoch 1, quorum 0 ceph1 osdmap e9: 2 osds: 0 up, 0 in pgmap v10: 64 pgs, 1 pools, 0 bytes data, 0 objects 0 kB used, 0 kB / 0 kB avail 64 creating My configurations are as below: sudo nano /etc/ceph/ceph.conf [global] fsid = a04ee359-82f8-44c4-89b5-60811bef3f19 mon initial members = ceph1 mon host = 192.168.101.41 public network = 192.168.101.0/24 auth cluster required = cephx auth service required = cephx auth client required = cephx [osd] osd journal size = 1024 filestore xattr use omap = true osd pool default size = 2 osd pool default min size = 1 osd pool default pg num = 333 osd pool default pgp num = 333 osd crush chooseleaf type = 1 [mon.ceph1] host = ceph1 mon addr = 192.168.101.41:6789 [osd.0] host = ceph2 #devs = {path-to-device} [osd.1] host = ceph3 #devs = {path-to-device} .. OSD mount location On ceph2 /dev/sdb1 5.0G 1.1G 4.0G 21% /var/lib/ceph/osd/ceph-0 On ceph3 /dev/sdb1 5.0G 1.1G 4.0G 21% /var/lib/ceph/osd/ceph-1 My Linux OS lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 14.04 LTS Release: 14.04 Codename: trusty Regards Shiv ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
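As a first check (assuming the sysvinit/upstart layout from the manual deployment guide; paths and service names may differ on your setup), confirm the ceph-osd daemons are actually running on ceph2/ceph3 and look at their logs:

    # on ceph2
    ps aux | grep ceph-osd                # is osd.0 running at all?
    sudo /etc/init.d/ceph start osd.0     # or on Ubuntu/upstart: sudo start ceph-osd id=0
    tail /var/log/ceph/ceph-osd.0.log     # look for authentication or network errors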
Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem
On Mon, Nov 3, 2014 at 7:46 AM, Chad Seys cws...@physics.wisc.edu wrote: Hi All, I upgraded from emperor to firefly. Initial upgrade went smoothly and all placement groups were active+clean. Next I executed 'ceph osd crush tunables optimal' to upgrade CRUSH mapping. Okay...you know that's a data movement command, right? So you should expect it to impact operations. (Although not the crashes you're witnessing.) Now I keep having OSDs go down or have requests blocked for long periods of time. I start back up the down OSDs and recovery eventually stops, but with 100s of incomplete and down+incomplete pgs remaining. The ceph web page says "If you see this state [incomplete], report a bug, and try to start any failed OSDs that may contain the needed information." Well, all the OSDs are up, though some have blocked requests. Also, the logs of the OSDs which go down have this message: 2014-11-02 21:46:33.615829 7ffcf0421700 0 -- 192.168.164.192:6810/31314 >> 192.168.164.186:6804/20934 pipe(0x2faa0280 sd=261 :6810 s=2 pgs=9 19 cs=25 l=0 c=0x2ed022c0).fault with nothing to send, going to standby 2014-11-02 21:49:11.440142 7ffce4cf3700 0 -- 192.168.164.192:6810/31314 >> 192.168.164.186:6804/20934 pipe(0xe512a00 sd=249 :6810 s=0 pgs=0 cs=0 l=0 c=0x2a308b00).accept connect_seq 26 vs existing 25 state standby 2014-11-02 21:51:20.085676 7ffcf6e3e700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7ffcf6e3e700 time 2014-11-02 21:51:20.052242 osd/PG.cc: 5424: FAILED assert(0 == "we got a bad state machine event") These failures are usually the result of adjusting tunables without having upgraded all the machines in the cluster — although they should also be fixed in v0.80.7. Are you still seeing crashes, or just the PG state issues? -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
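If it helps, a quick way to see which tunables the CRUSH map is currently using (command availability depends on your release):

    ceph osd crush show-tunables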
Re: [ceph-users] 0.87 rados df fault
On Mon, Nov 3, 2014 at 4:40 AM, Thomas Lemarchand thomas.lemarch...@cloud-solutions.fr wrote: Update : /var/log/kern.log.1:Oct 31 17:19:17 c-mon kernel: [17289149.746084] [21787] 0 21780 492110 185044 920 240143 0 ceph-mon /var/log/kern.log.1:Oct 31 17:19:17 c-mon kernel: [17289149.746115] [13136] 0 1313652172 1753 590 0 ceph /var/log/kern.log.1:Oct 31 17:19:17 c-mon kernel: [17289149.746126] Out of memory: Kill process 21787 (ceph-mon) score 827 or sacrifice child /var/log/kern.log.1:Oct 31 17:19:17 c-mon kernel: [17289149.746262] Killed process 21787 (ceph-mon) total-vm:1968440kB, anon-rss:740176kB, file-rss:0kB OOM kill. I have 1GB memory on my mons, and 1GB swap. It's the only mon that crashed. Is there a change in memory requirement from Firefly ? There generally shouldn't be, but I don't think it's something we monitored closely. More likely your monitor was running near its memory limit already and restarting all the OSDs (and servicing the resulting changes) pushed it over the edge. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem
[ Re-adding the list. ] On Mon, Nov 3, 2014 at 10:49 AM, Chad Seys cws...@physics.wisc.edu wrote: Next I executed 'ceph osd crush tunables optimal' to upgrade CRUSH mapping. Okay...you know that's a data movement command, right? Yes. So you should expect it to impact operations. These failures are usually the result of adjusting tunables without having upgraded all the machines in the cluster — although they should also be fixed in v0.80.7. Are you still seeing crashes, or just the PG state issues? Still getting crashes. I believe all nodes are running 0.80.7. Does ceph have a command to check this? (Otherwise I'll do an ssh-many to check.) There's a ceph osd metadata command, but I don't recall if it's in Firefly or only giant. :) Thanks! C. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
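Either of these avoids the ssh loop, with the caveat above about ceph osd metadata (the osd id is just an example):

    ceph tell osd.* version                    # ask every running OSD for its version
    ceph osd metadata 0 | grep ceph_version    # per-OSD, where supported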
Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem
Okay, assuming this is semi-predictable, can you start up one of the OSDs that is going to fail with debug osd = 20, debug filestore = 20, and debug ms = 1 in the config file and then put the OSD log somewhere accessible after it's crashed? Can you also verify that all of your monitors are running firefly, and then issue the command ceph scrub and report the output? -Greg On Mon, Nov 3, 2014 at 11:07 AM, Chad Seys cws...@physics.wisc.edu wrote: There's a ceph osd metadata command, but i don't recall if it's in Firefly or only giant. :) It's in firefly. Thanks, very handy. All the OSDs are running 0.80.7 at the moment. What next? Thanks again, Chad. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
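That is, roughly this in the ceph.conf on the affected OSD's host before restarting it (placement shown is illustrative):

    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1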
Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem
On Mon, Nov 3, 2014 at 11:41 AM, Chad Seys cws...@physics.wisc.edu wrote: On Monday, November 03, 2014 13:22:47 you wrote: Okay, assuming this is semi-predictable, can you start up one of the OSDs that is going to fail with debug osd = 20, debug filestore = 20, and debug ms = 1 in the config file and then put the OSD log somewhere accessible after it's crashed? Alas, I have not yet noticed a pattern. Only thing I think is true is that they go down when I first make CRUSH changes. Then after restarting, they run without going down again. All the OSDs are running at the moment. Oh, interesting. What CRUSH changes exactly are you making that are spawning errors? What I've been doing is marking OUT the OSDs on which a request is blocked, letting the PGs recover, (drain the OSD of PGs completely), then remove and readd the OSD. So far OSDs treated this way no longer have blocked requests. Also, seems as though that slowly decreases the number of incomplete and down+incomplete PGs . Can you also verify that all of your monitors are running firefly, and then issue the command ceph scrub and report the output? Sure, should I wait until the current rebalancing is finished? I don't think it should matter, although I confess I'm not sure how much monitor load the scrubbing adds. (It's a monitor check; doesn't hit the OSDs at all.) ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
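For the record, the drain-and-replace cycle described above usually looks roughly like this (the osd id is hypothetical; check each step against the docs for your release):

    ceph osd out 12                # let the PGs drain off the OSD
    # wait for recovery to finish, then remove it completely
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm 12
    # afterwards re-create the OSD and let it backfill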