Re: [ceph-users] Two osds are spaming dmesg every 900 seconds
This is being output by one of the kernel clients, and it's just saying that the connections to those two OSDs have died from inactivity. Either the other OSD connections are used a lot more, or aren't used at all. In any case, it's not a problem; just a noisy notification. There's not much you can do about it; sorry. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Aug 25, 2014 at 12:01 PM, Andrei Mikhailovsky and...@arhont.com wrote: Hello I am seeing this message every 900 seconds on the osd servers. My dmesg output is all filled with: [256627.683702] libceph: osd3 192.168.168.200:6821 socket closed (con state OPEN) [256627.687663] libceph: osd6 192.168.168.200:6841 socket closed (con state OPEN) Looking at the ceph-osd logs I see the following at the same time: 2014-08-25 19:48:14.869145 7f0752125700 0 -- 192.168.168.200:6821/4097 192.168.168.200:0/2493848861 pipe(0x13b43c80 sd=92 :6821 s=0 pgs=0 cs=0 l=0 c=0x16a606e0).accept peer addr is really 192.168.168.200:0/2493848861 (socket is 192.168.168.200:54457/0) This happens only on two osds and the rest of the osds seem fine. Does anyone know why I am seeing this and how to correct it? Thanks Andrei ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
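For anyone hitting the same noise, a rough sketch of confirming it really is just these two idle connections (the ports here are taken from the messages above and would differ on other clusters):

    dmesg | grep 'socket closed' | tail -n 20
    ss -tn | grep -E ':(6821|6841) '

If only the same couple of OSDs ever show up and the cluster is otherwise healthy, the messages can be ignored, as Greg notes.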
Re: [ceph-users] Ceph-fuse fails to mount
In particular, we changed things post-Firefly so that the filesystem isn't created automatically. You'll need to set it up (and its pools, etc) explicitly to use it. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Aug 25, 2014 at 2:40 PM, Sean Crosby richardnixonsh...@gmail.com wrote: Hi James, On 26 August 2014 07:17, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: [ceph@first_cluster ~]$ ceph -s cluster e0433b49-d64c-4c3e-8ad9-59a47d84142d health HEALTH_OK monmap e1: 1 mons at {first_cluster=10.25.164.192:6789/0}, election epoch 2, quorum 0 first_cluster mdsmap e4: 1/1/1 up {0=first_cluster=up:active} osdmap e13: 3 osds: 3 up, 3 in pgmap v480: 192 pgs, 3 pools, 1417 MB data, 4851 objects 19835 MB used, 56927 MB / 76762 MB avail 192 active+clean This cluster has an MDS. It should mount. [ceph@second_cluster ~]$ ceph -s cluster 06f655b7-e147-4790-ad52-c57dcbf160b7 health HEALTH_OK monmap e1: 1 mons at {second_cluster=10.25.165.91:6789/0}, election epoch 1, quorum 0 cilsdbxd1768 osdmap e16: 7 osds: 7 up, 7 in pgmap v539: 192 pgs, 3 pools, 0 bytes data, 0 objects 252 MB used, 194 GB / 194 GB avail 192 active+clean No mdsmap line for this cluster, and therefore the filesystem won't mount. Have you added an MDS for this cluster, or has the mds daemon died? You'll have to get the mdsmap line to show before it will mount Sean ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
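A quick way to check whether a cluster has an active MDS at all (a minimal sketch; on post-Firefly releases the filesystem also has to be created explicitly, as described in the later message in this thread):

    ceph mds stat
    ceph -s | grep mdsmap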
Re: [ceph-users] MDS dying on Ceph 0.67.10
I don't think the log messages you're showing are the actual cause of the failure. The log file should have a proper stack trace (with specific function references and probably a listed assert failure), can you find that? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Aug 26, 2014 at 9:11 AM, MinhTien MinhTien tientienminh080...@gmail.com wrote: Hi all, I have a cluster of 2 nodes on Centos 6.5 with ceph 0.67.10 (replicate = 2) When I add the 3rd node in the Ceph Cluster, CEPH perform load balancing. I have 3 MDS in 3 nodes,the MDS process is dying after a while with a stack trace: --- 2014-08-26 17:08:34.362901 7f1c2c704700 1 -- 10.20.0.21:6800/22154 == osd.10 10.20.0.21:6802/15917 1 osd_op_reply(230 10003f6. [tmapup 0~0] ondisk = 0) v4 119+0+0 (1770421071 0 0) 0x2aece00 con 0x2aa4200 -54 2014-08-26 17:08:34.362942 7f1c2c704700 1 -- 10.20.0.21:6800/22154 == osd.55 10.20.0.23:6800/2407 10 osd_op_reply(263 100048a. [getxattr] ack = -2 (No such file or directory)) v4 119+0+0 (3908997833 0 0) 0x1e63000 con 0x1e7aaa0 -53 2014-08-26 17:08:34.363001 7f1c2c704700 5 mds.0.log submit_entry 427629603~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs] -52 2014-08-26 17:08:34.363022 7f1c2c704700 1 -- 10.20.0.21:6800/22154 == osd.37 10.20.0.22:6898/11994 6 osd_op_reply(226 1. [tmapput 0~7664] ondisk = 0) v4 109+0+0 (1007110430 0 0) 0x1e64800 con 0x1e7a7e0 -51 2014-08-26 17:08:34.363092 7f1c2c704700 5 mds.0.log _expired segment 293601899 2548 events -50 2014-08-26 17:08:34.363117 7f1c2c704700 1 -- 10.20.0.21:6800/22154 == osd.17 10.20.0.21:6941/17572 9 osd_op_reply(264 1000489. [getxattr] ack = -2 (No such file or directory)) v4 119+0+0 (1979034473 0 0) 0x1e62200 con 0x1e7b180 -49 2014-08-26 17:08:34.363177 7f1c2c704700 5 mds.0.log submit_entry 427631148~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs] -48 2014-08-26 17:08:34.363197 7f1c2c704700 1 -- 10.20.0.21:6800/22154 == osd.1 10.20.0.21:6872/13227 6 osd_op_reply(265 1000491. [getxattr] ack = -2 (No such file or directory)) v4 119+0+0 (1231782695 0 0) 0x1e63400 con 0x1e7ac00 -47 2014-08-26 17:08:34.363255 7f1c2c704700 5 mds.0.log submit_entry 427632693~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs] -46 2014-08-26 17:08:34.363274 7f1c2c704700 1 -- 10.20.0.21:6800/22154 == osd.11 10.20.0.21:6884/7018 5 osd_op_reply(266 100047d. [getxattr] ack = -2 (No such file or directory)) v4 119+0+0 (2737916920 0 0) 0x1e61e00 con 0x1e7bc80 - I try to restart MDSs, but after a few seconds in a state of active, MDS switch to state laggy or crashed. I have a lot of important data on it. I do not want to use the command: ceph mds newfs metadata pool id data pool id --yes-i-really-mean-it :( Tien Bui. -- Bui Minh Tien ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
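A sketch of how one might pull the actual backtrace out of the MDS log (assuming the default log location; the exact assert text will vary):

    grep -B 5 -A 40 -e 'FAILED assert' -e 'Caught signal' /var/log/ceph/ceph-mds.*.log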
Re: [ceph-users] Fresh Firefly install degraded without modified default tunables
Hmm, that all looks basically fine. But why did you decide not to segregate OSDs across hosts (according to your CRUSH rules)? I think maybe it's the interaction of your map, setting choose_local_tries to 0, and trying to go straight to the OSDs instead of choosing hosts. But I'm not super familiar with how the tunables would act under these exact conditions. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Aug 25, 2014 at 12:59 PM, Ripal Nathuji ri...@nathuji.com wrote: Hi Greg, Thanks for helping to take a look. Please find your requested outputs below. ceph osd tree: # id weight type name up/down reweight -1 0 root default -2 0 host osd1 0 0 osd.0 up 1 4 0 osd.4 up 1 8 0 osd.8 up 1 11 0 osd.11 up 1 -3 0 host osd0 1 0 osd.1 up 1 3 0 osd.3 up 1 6 0 osd.6 up 1 9 0 osd.9 up 1 -4 0 host osd2 2 0 osd.2 up 1 5 0 osd.5 up 1 7 0 osd.7 up 1 10 0 osd.10 up 1 ceph -s: cluster 4a158d27-f750-41d5-9e7f-26ce4c9d2d45 health HEALTH_WARN 832 pgs degraded; 832 pgs stuck unclean; recovery 43/86 objects degraded (50.000%) monmap e1: 1 mons at {ceph-mon0=192.168.2.10:6789/0}, election epoch 2, quorum 0 ceph-mon0 osdmap e34: 12 osds: 12 up, 12 in pgmap v61: 832 pgs, 8 pools, 840 bytes data, 43 objects 403 MB used, 10343 MB / 10747 MB avail 43/86 objects degraded (50.000%) 832 active+degraded Thanks, Ripal On Aug 25, 2014, at 12:45 PM, Gregory Farnum g...@inktank.com wrote: What's the output of ceph osd tree? And the full output of ceph -s? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Aug 18, 2014 at 8:07 PM, Ripal Nathuji ri...@nathuji.com wrote: Hi folks, I've come across an issue which I found a fix for, but I'm not sure whether it's correct or if there is some other misconfiguration on my end and this is merely a symptom. I'd appreciate any insights anyone could provide based on the information below, and happy to provide more details as necessary. Summary: A fresh install of Ceph 0.80.5 comes up with all pgs marked as active+degraded. This reproduces on 12.04 as well as CentOS 7 with a varying number of OSD hosts (1, 2, 3), where each OSD host has four storage drives. The configuration file defines a default replica size of 2, and allows leafs of type 0. Specific snippet: [global] ... osd pool default size = 2 osd crush chooseleaf type = 0 I verified the crush rules were as expected: rules: [ { rule_id: 0, rule_name: replicated_ruleset, ruleset: 0, type: 1, min_size: 1, max_size: 10, steps: [ { op: take, item: -1, item_name: default}, { op: choose_firstn, num: 0, type: osd}, { op: emit}]}], Inspecting the pg dump I observed that all pgs had a single osd in the up/acting sets. That seemed to explain why the pgs were degraded, but it was unclear to me why a second OSD wasn't in the set. After trying a variety of things, I noticed that there was a difference between Emperor (which works fine in these configurations) and Firefly with the default tunables, where Firefly comes up with the bobtail profile. The setting choose_local_fallback_tries is 0 in this profile while it used to default to 5 on Emperor. Sure enough, if I modify my crush map and set the parameter to a non-zero value, the cluster remaps and goes healthy with all pgs active+clean. The documentation states the optimal value of choose_local_fallback_tries is 0 for FF, so I'd like to get a better understanding of this parameter and why modifying the default value moves the pgs to a clean state in my scenarios. 
Thanks, Ripal ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
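One way to test how a given CRUSH map and set of tunables actually map PGs, without touching the live cluster, is to run the map through crushtool (a sketch; the rule number and replica count match the pool settings described above):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -i crushmap.bin --test --rule 0 --num-rep 2 --show-bad-mappings

Any bad mappings printed are inputs for which the rule could not find the requested number of OSDs, which is exactly what shows up as degraded PGs in the cluster.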
Re: [ceph-users] Ceph-fuse fails to mount
[Re-added the list.] I believe you'll find everything you need at http://ceph.com/docs/master/cephfs/createfs/ -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Aug 26, 2014 at 1:25 PM, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: So is there a link for documentation on the newer versions? (we're doing evaluations at present, so I had wanted to work with newer versions, since it would be closer to what we would end up using). -Original Message- From: Gregory Farnum [mailto:g...@inktank.com] Sent: Tuesday, August 26, 2014 4:05 PM To: Sean Crosby Cc: LaBarre, James (CTR) A6IT; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Ceph-fuse fails to mount In particular, we changed things post-Firefly so that the filesystem isn't created automatically. You'll need to set it up (and its pools, etc) explicitly to use it. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Aug 25, 2014 at 2:40 PM, Sean Crosby richardnixonsh...@gmail.com wrote: Hi James, On 26 August 2014 07:17, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: [ceph@first_cluster ~]$ ceph -s cluster e0433b49-d64c-4c3e-8ad9-59a47d84142d health HEALTH_OK monmap e1: 1 mons at {first_cluster=10.25.164.192:6789/0}, election epoch 2, quorum 0 first_cluster mdsmap e4: 1/1/1 up {0=first_cluster=up:active} osdmap e13: 3 osds: 3 up, 3 in pgmap v480: 192 pgs, 3 pools, 1417 MB data, 4851 objects 19835 MB used, 56927 MB / 76762 MB avail 192 active+clean This cluster has an MDS. It should mount. [ceph@second_cluster ~]$ ceph -s cluster 06f655b7-e147-4790-ad52-c57dcbf160b7 health HEALTH_OK monmap e1: 1 mons at {second_cluster=10.25.165.91:6789/0}, election epoch 1, quorum 0 cilsdbxd1768 osdmap e16: 7 osds: 7 up, 7 in pgmap v539: 192 pgs, 3 pools, 0 bytes data, 0 objects 252 MB used, 194 GB / 194 GB avail 192 active+clean No mdsmap line for this cluster, and therefore the filesystem won't mount. Have you added an MDS for this cluster, or has the mds daemon died? You'll have to get the mdsmap line to show before it will mount Sean ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- CONFIDENTIALITY NOTICE: If you have received this email in error, please immediately notify the sender by e-mail at the address shown. This email transmission may contain confidential information. This information is intended only for the use of the individual(s) or entity to whom it is intended even if addressed incorrectly. Please delete it from your files if you are not the intended recipient. Thank you for your compliance. Copyright (c) 2014 Cigna == ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
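For reference, the explicit filesystem creation described in that document looks roughly like this (pool names and PG counts are only illustrative):

    ceph osd pool create cephfs_data 128
    ceph osd pool create cephfs_metadata 128
    ceph fs new cephfs cephfs_metadata cephfs_data
    ceph mds stat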
Re: [ceph-users] error ioctl(BTRFS_IOC_SNAP_CREATE) failed: (17) File exists
This looks new to me. Can you try and start up the OSD with debug osd = 20 and debug filestore = 20 in your conf, then put the log somewhere accessible? (You can also use ceph-post-file if it's too large for pastebin or something.) Also, check dmesg and see if btrfs is complaining, and see what the (folder, or more specifically snapshot) contents of the OSD data directory are. Since you *are* on btrfs this is probably reasonably recoverable, but we'll have to see what's going on first. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Aug 26, 2014 at 10:18 PM, John Morris j...@zultron.com wrote: During reorganization of the Ceph system, including an updated CRUSH map and moving to btrfs, some PGs became stuck incomplete+remapped. Before that was resolved, a restart of osd.1 failed while creating a btrfs snapshot. A 'ceph-osd -i 1 --flush-journal' fails with the same error. See the below pasted log. This is a Bad Thing, because two PGs are now stuck down+peering. A 'ceph pg 2.74 query' shows they had been stuck on osd.1 before the btrfs problem, despite what the 'last acting' field shows in the below 'ceph health detail' output. Is there any way to recover from this? Judging from Google searches on the list archives, nobody has run into this problem before, so I'm quite worried that this spells backup recovery exercises for the next few days. Related question: Are outright OSD crashes the reason btrfs is discouraged for production use? Thanks- John pg 2.74 is stuck inactive since forever, current state down+peering, last acting [3,7,0,6] pg 3.73 is stuck inactive since forever, current state down+peering, last acting [3,7,0,6] pg 2.74 is stuck unclean since forever, current state down+peering, last acting [3,7,0,6] pg 3.73 is stuck unclean since forever, current state down+peering, last acting [3,7,0,6] pg 2.74 is down+peering, acting [3,7,0,6] pg 3.73 is down+peering, acting [3,7,0,6] 2014-08-26 22:36:12.641585 7f5b38e507a0 0 ceph version 0.67.10 (9d446bd416c52cd785ccf048ca67737ceafcdd7f), process ceph-osd, pid 10281 2014-08-26 22:36:12.717100 7f5b38e507a0 0 filestore(/ceph/osd.1) mount FIEMAP ioctl is supported and appears to work 2014-08-26 22:36:12.717121 7f5b38e507a0 0 filestore(/ceph/osd.1) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option 2014-08-26 22:36:12.717434 7f5b38e507a0 0 filestore(/ceph/osd.1) mount detected btrfs 2014-08-26 22:36:12.717471 7f5b38e507a0 0 filestore(/ceph/osd.1) mount btrfs CLONE_RANGE ioctl is supported 2014-08-26 22:36:12.765009 7f5b38e507a0 0 filestore(/ceph/osd.1) mount btrfs SNAP_CREATE is supported 2014-08-26 22:36:12.765335 7f5b38e507a0 0 filestore(/ceph/osd.1) mount btrfs SNAP_DESTROY is supported 2014-08-26 22:36:12.765541 7f5b38e507a0 0 filestore(/ceph/osd.1) mount btrfs START_SYNC is supported (transid 3118) 2014-08-26 22:36:12.789600 7f5b38e507a0 0 filestore(/ceph/osd.1) mount btrfs WAIT_SYNC is supported 2014-08-26 22:36:12.808287 7f5b38e507a0 0 filestore(/ceph/osd.1) mount btrfs SNAP_CREATE_V2 is supported 2014-08-26 22:36:12.834144 7f5b38e507a0 0 filestore(/ceph/osd.1) mount syscall(SYS_syncfs, fd) fully supported 2014-08-26 22:36:12.834377 7f5b38e507a0 0 filestore(/ceph/osd.1) mount found snaps 6009082,6009083 2014-08-26 22:36:12.834427 7f5b38e507a0 -1 filestore(/ceph/osd.1) FileStore::mount: error removing old current subvol: (22) Invalid argument 2014-08-26 22:36:12.861045 7f5b38e507a0 -1 filestore(/ceph/osd.1) mount initial op seq is 0; something is wrong 2014-08-26 22:36:12.861428 7f5b38e507a0 -1 
** ERROR: error converting store /ceph/osd.1: (22) Invalid argument ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
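A minimal sketch of the logging settings Greg asks for, plus a look at the btrfs snapshots the mount code is complaining about (paths assume the OSD data directory shown in the log above):

    # in ceph.conf on the affected host, then restart the OSD
    [osd]
        debug osd = 20
        debug filestore = 20

    # inspect the snapshots under the OSD data directory
    btrfs subvolume list /ceph/osd.1

    # optionally upload the resulting log for the developers
    ceph-post-file /var/log/ceph/ceph-osd.1.log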
Re: [ceph-users] 'incomplete' PGs: what does it mean?
On Tue, Aug 26, 2014 at 10:46 PM, John Morris j...@zultron.com wrote: In the docs [1], 'incomplete' is defined thusly: Ceph detects that a placement group is missing a necessary period of history from its log. If you see this state, report a bug, and try to start any failed OSDs that may contain the needed information. However, during an extensive review of list postings related to incomplete PGs, an alternate and oft-repeated definition is something like 'the number of existing replicas is less than the min_size of the pool'. In no list posting was there any acknowledgement of the definition from the docs. While trying to understand what 'incomplete' PGs are, I simply set min_size = 1 on this cluster with incomplete PGs, and they continue to be 'incomplete'. Does this mean that definition #2 is incorrect? In case #1 is correct, how can the cluster be told to forget the lapse in history? In our case, there was nothing writing to the cluster during the OSD reorganization that could have caused this lapse. Yeah, these two meanings can (unfortunately) both lead to the INCOMPLETE state being reported. I think that's going to be fixed in our next major release (so that INCOMPLETE means not enough OSDs hosting and missing log will translate into something else), but for now the not enough OSDs is by far the more common. In your case you probably are missing history, but you don't want to recover from it using any of the cluster tools because they're likely to lose more data than necessary. (Hopefully, you can just roll back to a slightly older btrfs snapshot, but we'll see). -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
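For narrowing down which of the two cases applies, a rough sketch of the usual inspection commands (the PG IDs are the ones from the report above):

    ceph health detail | grep -E 'incomplete|down'
    ceph pg dump_stuck inactive
    ceph pg 2.74 query | less

The query output's recovery_state section shows which OSDs are being probed and what the PG thinks is blocking peering.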
Re: [ceph-users] RAID underlying a Ceph config
There aren't too many people running RAID under Ceph, as it's a second layer of redundancy that in normal circumstances is a bit pointless. But there are scenarios where it might be useful. You might check the list archives for the anti-cephalopod question thread. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Thu, Aug 28, 2014 at 10:19 AM, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: Having heard some suggestions on RAID configuration under Gluster (we have someone else doing that evaluation, I’m doing the Ceph piece), I’m wondering what (if any) RAID configurations would be recommended for Ceph. I have the impression that striping data could counteract/undermine data replication (with PGs potentially being on multiple disks, rather than within a single disk-oriented OSD). -- CONFIDENTIALITY NOTICE: If you have received this email in error, please immediately notify the sender by e-mail at the address shown. This email transmission may contain confidential information. This information is intended only for the use of the individual(s) or entity to whom it is intended even if addressed incorrectly. Please delete it from your files if you are not the intended recipient. Thank you for your compliance. Copyright (c) 2014 Cigna == ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Filesystem - Production?
On Thu, Aug 28, 2014 at 10:36 AM, Brian C. Huffman bhuff...@etinternational.com wrote: Is Ceph Filesystem ready for production servers? The documentation says it's not, but I don't see that mentioned anywhere else. http://ceph.com/docs/master/cephfs/ Everybody has their own standards, but Red Hat isn't supporting it for general production use at this time. If you're brave you could test it under your workload for a while and see how it comes out; the known issues are very much workload-dependent (or just general concerns over polish). -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] MSWin CephFS
On Thu, Aug 28, 2014 at 10:41 AM, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: Just out of curiosity, is there a way to mount a Ceph filesystem directly on a MSWindows system (2008 R2 server)? Just wanted to try something out from a VM. Nope, sorry. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] 'incomplete' PGs: what does it mean?
, last_epoch_started: 10278}, recovery_state: [ { name: Started/Primary/Peering, enter_time: 2014-08-29 01:22:50.132826, past_intervals: [ [...] { first: 12809, last: 13101, maybe_went_rw: 1, up: [ 7, 3, 0, 4], acting: [ 7, 3, 0, 4]}, [...] ], probing_osds: [ 0, 1, 2, 3, 4, 5, 7], down_osds_we_would_probe: [], peering_blocked_by: []}, { name: Started, enter_time: 2014-08-29 01:22:50.132784}]} On Wed, Aug 27, 2014 at 12:40 PM, Gregory Farnum g...@inktank.com wrote: On Tue, Aug 26, 2014 at 10:46 PM, John Morris john at zultron.com wrote: In the docs [1], 'incomplete' is defined thusly: Ceph detects that a placement group is missing a necessary period of history from its log. If you see this state, report a bug, and try to start any failed OSDs that may contain the needed information. However, during an extensive review of list postings related to incomplete PGs, an alternate and oft-repeated definition is something like 'the number of existing replicas is less than the min_size of the pool'. In no list posting was there any acknowledgement of the definition from the docs. While trying to understand what 'incomplete' PGs are, I simply set min_size = 1 on this cluster with incomplete PGs, and they continue to be 'incomplete'. Does this mean that definition #2 is incorrect? In case #1 is correct, how can the cluster be told to forget the lapse in history? In our case, there was nothing writing to the cluster during the OSD reorganization that could have caused this lapse. Yeah, these two meanings can (unfortunately) both lead to the INCOMPLETE state being reported. I think that's going to be fixed in our next major release (so that INCOMPLETE means not enough OSDs hosting and missing log will translate into something else), but for now the not enough OSDs is by far the more common. In your case you probably are missing history, but you don't want to recover from it using any of the cluster tools because they're likely to lose more data than necessary. (Hopefully, you can just roll back to a slightly older btrfs snapshot, but we'll see). -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] question about monitor and paxos relationship
On Thu, Aug 28, 2014 at 9:52 PM, pragya jain prag_2...@yahoo.co.in wrote: I have some basic questions about the monitor and Paxos relationship: As the documents say, a Ceph monitor contains the cluster map; if there is any change in the state of the cluster, the change is updated in the cluster map. Monitors use the Paxos algorithm to create consensus among themselves and establish a quorum. And when we talk about the Paxos algorithm, the documents say that the monitor writes all changes to the Paxos instance and Paxos writes the changes to a key/value store for strong consistency. #1: I am unable to understand what the Paxos algorithm actually does. Are all changes in the cluster map made by the Paxos algorithm? How does it create consensus among monitors? Paxos is an algorithm for making decisions and/or safely replicating data in a distributed system. The Ceph monitor cluster uses it for all changes to any of its data. My assumption is: the cluster map is updated when OSDs report any changes to the monitor, and there is no role for Paxos in it; Paxos writes changes made only for the monitors. Please could somebody elaborate on this point. Every change the monitors incorporate into any data structure, most definitely including the OSD map's changes based on reports from OSDs, is passed through Paxos. #2: Why is an odd number of monitors recommended for a production cluster, rather than an even number? This is because of a property of the Paxos system's durability and uptime guarantees. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
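A tiny illustration of that last point (hypothetical numbers, not Ceph code): Paxos needs a strict majority of monitors to agree, so an even-sized monitor cluster tolerates no more failures than the odd-sized cluster one smaller than it.

    # Python sketch of monitor failure tolerance under majority quorum
    def failures_tolerated(n_mons):
        quorum = n_mons // 2 + 1   # smallest strict majority
        return n_mons - quorum

    for n in range(1, 7):
        print(n, "monitors ->", failures_tolerated(n), "failures tolerated")
    # 3 and 4 monitors both tolerate 1 failure; 5 and 6 both tolerate 2.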
Re: [ceph-users] Misdirected client messages
The clients are sending messages to OSDs which are not the primary for the data. That shouldn't happen — clients which don't understand the whole osdmap ought to be gated and prevented from accessing the cluster at all. What version of Ceph are you running, and what clients? (We've seen this in dev versions but I can't think of any in named releases off the top of my head. It's more likely if you're using something like the primary affinity values or something.) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wed, Sep 3, 2014 at 6:26 AM, Maros Vegh maros.v...@microstep-mis.sk wrote: Hello, last weeks we observed many misdirected client messages in the logs. The messages are similar to this one: 2014-09-03 15:20:55.696752 osd.24 192.168.61.3:6830/25216 234 : [WRN] client.2936377 192.168.61.105:0/983896378 misdirected client.2936377.1:4985727 pg 0.a7459c63 to osd.24 not [5,24] in e22827/22827 Can somebody explain what is the issue and how to solve it? thanks Maros Vegh ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
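A first step for the reporter would be collecting the versions in play, along the lines of (a sketch; the OSD ID and client host are taken from the log excerpt above):

    ceph tell osd.24 version
    ceph --version            # on the client host 192.168.61.105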
Re: [ceph-users] Updating the pg and pgp values
On Mon, Sep 8, 2014 at 10:08 AM, JIten Shah jshah2...@me.com wrote: While checking the health of the cluster, I ran to the following error: warning: health HEALTH_WARN too few pgs per osd (1 min 20) When I checked the pg and php numbers, I saw the value was the default value of 64 ceph osd pool get data pg_num pg_num: 64 ceph osd pool get data pgp_num pgp_num: 64 Checking the ceph documents, I updated the numbers to 2000 using the following commands: ceph osd pool set data pg_num 2000 ceph osd pool set data pgp_num 2000 It started resizing the data and saw health warnings again: health HEALTH_WARN 1 requests are blocked 32 sec; pool data pg_num 2000 pgp_num 64 and then: ceph health detail HEALTH_WARN 6 requests are blocked 32 sec; 3 osds have slow requests 5 ops are blocked 65.536 sec 1 ops are blocked 32.768 sec 1 ops are blocked 32.768 sec on osd.16 1 ops are blocked 65.536 sec on osd.77 4 ops are blocked 65.536 sec on osd.98 3 osds have slow requests This error also went away after a day. ceph health detail HEALTH_OK Now, the question I have is, will this pg number remain effective on the cluster, even if we restart MON or OSD’s on the individual disks? I haven’t changed the values in /etc/ceph/ceph.conf. Do I need to make a change to the ceph.conf and push that change to all the MON, MSD and OSD’s ? It's durable once the commands are successful on the monitors. You're all done. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Updating the pg and pgp values
It's stored in the OSDMap on the monitors. Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Sep 8, 2014 at 10:50 AM, JIten Shah jshah2...@me.com wrote: So, if it doesn’t refer to the entry in ceph.conf. Where does it actually store the new value? —Jiten On Sep 8, 2014, at 10:31 AM, Gregory Farnum g...@inktank.com wrote: On Mon, Sep 8, 2014 at 10:08 AM, JIten Shah jshah2...@me.com wrote: While checking the health of the cluster, I ran to the following error: warning: health HEALTH_WARN too few pgs per osd (1 min 20) When I checked the pg and php numbers, I saw the value was the default value of 64 ceph osd pool get data pg_num pg_num: 64 ceph osd pool get data pgp_num pgp_num: 64 Checking the ceph documents, I updated the numbers to 2000 using the following commands: ceph osd pool set data pg_num 2000 ceph osd pool set data pgp_num 2000 It started resizing the data and saw health warnings again: health HEALTH_WARN 1 requests are blocked 32 sec; pool data pg_num 2000 pgp_num 64 and then: ceph health detail HEALTH_WARN 6 requests are blocked 32 sec; 3 osds have slow requests 5 ops are blocked 65.536 sec 1 ops are blocked 32.768 sec 1 ops are blocked 32.768 sec on osd.16 1 ops are blocked 65.536 sec on osd.77 4 ops are blocked 65.536 sec on osd.98 3 osds have slow requests This error also went away after a day. ceph health detail HEALTH_OK Now, the question I have is, will this pg number remain effective on the cluster, even if we restart MON or OSD’s on the individual disks? I haven’t changed the values in /etc/ceph/ceph.conf. Do I need to make a change to the ceph.conf and push that change to all the MON, MSD and OSD’s ? It's durable once the commands are successful on the monitors. You're all done. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
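In other words, the value can be verified straight from the OSDMap rather than from ceph.conf, e.g. (a sketch):

    ceph osd dump | grep '^pool'
    ceph osd pool get data pg_num

Both should show pg_num 2000 regardless of monitor or OSD restarts.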
Re: [ceph-users] Delays while waiting_for_osdmap according to dump_historic_ops
On Sun, Sep 7, 2014 at 4:28 PM, Alex Moore a...@lspeed.org wrote: I recently found out about the ceph --admin-daemon /var/run/ceph/ceph-osd.id.asok dump_historic_ops command, and noticed something unexpected in the output on my cluster, after checking numerous output samples... It looks to me like normal write ops on my cluster spend roughly: 1ms between received_at and waiting_for_osdmap 1ms between waiting_for_osdmap and reached_pg 15ms between reached_pg and commit_sent 15ms between commit_sent and done For reference, this is a small (3-host) all-SSD cluster, with monitors co-located with OSDs. Each host has: 1 SSD for the OS, 1 SSD for the journal, and 1 SSD for the OSD + monitor data (I initially had the monitor data on the same drive as the OS, but encountered performance problems - which have since been allieviated by moving the monitor data to the same drives as the OSDs. Networking is infiniband (8 Gbps dedicated point-to-point link between each pair of hosts). I'm running v0.80.5. And the OSDs use XFS. Anyway, as this command intentionally shows the worst few recent IOs, I only rarely see examples that match the above norm. Rather, the typical outliers that it highlights are usually write IOs with ~100-300ms latency, where the extra latency exists purely between the received_at and reached_pg timestamps, and mostly in the waiting_for_osdmap step. Also it looks like these slow IOs come in batches. Every write IO arriving within the same ~1 second period will suffer from these strangely slow initial two steps, with the additional latency being almost identical for each one within the same batch. After which things return to normal again in that those steps take 1ms. So compared to the above norm, these look more like: ~50ms between received_at and waiting_for_osdmap ~150ms between waiting_for_osdmap and reached_pg 15ms between reached_pg and commit_sent 15ms between commit_sent and done This seems unexpected to me. I don't see why those initial steps in the IO should ever take such a long time to complete. Where should I be looking next to track down the cause? I'm guessing that waiting_for_osdmap involves OSD-Mon communication, and so perhaps indicates poor performance of the Mons. But for there to be any non-negligible delay between received_at and waiting_for_osdmap makes no sense to me at all. First thing here is to explain what each of these events actually mean. received_at is the point at which we *started* reading the message off the wire. We have to finish reading it off and dispatch it to the OSD before the next one. waiting_for_osdmap is slightly misnamed; it's the point at which the op was submitted to the OSD. It's called that because receiving a message with a newer OSDMap epoch than we have is the most common long-term delay in this phase, but we also have to do some other preprocessing and queue the Op up. reached_pg is the point at which the Op is dequeued by a worker thread and has the necessary mutexes to get processed. After this point we're going to try and actually do the operations described (reads or writes). commit_sent indicates that we've actually sent back the commit to the client or primary OSD. done indicates that the op has been completed (commit_sent doesn't wait for the op to have been applied to the backing filesystem; this does). 
There are probably a bunch of causes for the behavior you're seeing, but the most likely is that you've occasionally got a whole bunch of operations going to a single object/placement group and they're taking some time to process because they have to be serialized. This would prevent the PG from handling newer ops while the old ones are still being processed, and that could back up through the pipeline to slow down the reads off the wire as well. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
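For reference, the commands involved look roughly like this (the OSD ID is illustrative), and the per-event timestamps in their output are what map onto the stages Greg describes:

    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight

Comparing the gap between reached_pg and commit_sent across ops that hit the same PG is one way to confirm the serialization theory.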
Re: [ceph-users] osd crash: trim_objectcould not find coid
On Mon, Sep 8, 2014 at 1:42 AM, Francois Deppierraz franc...@ctrlaltdel.ch wrote: Hi, This issue is on a small 2 servers (44 osds) ceph cluster running 0.72.2 under Ubuntu 12.04. The cluster was filling up (a few osds near full) and I tried to increase the number of pg per pool to 1024 for each of the 14 pools to improve storage space balancing. This increase triggered high memory usage on the servers which were unfortunately under-provisioned (16 GB RAM for 22 osds) and started to swap and crash. After installing memory into the servers, the result is a broken cluster with unfound objects and two osds (osd.6 and osd.43) crashing at startup. $ ceph health HEALTH_WARN 166 pgs backfill; 326 pgs backfill_toofull; 2 pgs backfilling; 765 pgs degraded; 715 pgs down; 1 pgs incomplete; 715 pgs peering; 5 pgs recovering; 2 pgs recovery_wait; 716 pgs stuck inactive; 1856 pgs stuck unclean; 164 requests are blocked 32 sec; recovery 517735/15915673 objects degraded (3.253%); 1241/7910367 unfound (0.016%); 3 near full osd(s); 1/43 in osds are down; noout flag(s) set osd.6 is crashing due to an assertion (trim_objectcould not find coid) which leads to a resolved bug report which unfortunately doesn't give any advise on how to repair the osd. http://tracker.ceph.com/issues/5473 It is much less obvious why osd.43 is crashing, please have a look at the following osd logs: http://paste.ubuntu.com/8288607/ http://paste.ubuntu.com/8288609/ The first one is not caused by the same thing as the ticket you reference (it was fixed well before emperor), so it appears to be some kind of disk corruption. The second one is definitely corruption of some kind as it's missing an OSDMap it thinks it should have. It's possible that you're running into bugs in emperor that were fixed after we stopped doing regular support releases of it, but I'm more concerned that you've got disk corruption in the stores. What kind of crashes did you see previously; are there any relevant messages in dmesg, etc? Given these issues, you might be best off identifying exactly which PGs are missing, carefully copying them to working OSDs (use the osd store tool), and killing these OSDs. Do lots of backups at each stage... -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] osd crash: trim_objectcould not find coid
On Mon, Sep 8, 2014 at 2:53 PM, Francois Deppierraz franc...@ctrlaltdel.ch wrote: Hi Greg, Thanks for your support! On 08. 09. 14 20:20, Gregory Farnum wrote: The first one is not caused by the same thing as the ticket you reference (it was fixed well before emperor), so it appears to be some kind of disk corruption. The second one is definitely corruption of some kind as it's missing an OSDMap it thinks it should have. It's possible that you're running into bugs in emperor that were fixed after we stopped doing regular support releases of it, but I'm more concerned that you've got disk corruption in the stores. What kind of crashes did you see previously; are there any relevant messages in dmesg, etc? Nothing special in dmesg except probably irrelevant XFS warnings: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250) Hmm, I'm not sure what the outcome of that could be. Googling for the error message returns this as the first result, though: http://comments.gmane.org/gmane.comp.file-systems.xfs.general/58429 Which indicates that it's a real deadlock and capable of messing up your OSDs pretty good. All logs from before the disaster are still there, do you have any advise on what would be relevant? Given these issues, you might be best off identifying exactly which PGs are missing, carefully copying them to working OSDs (use the osd store tool), and killing these OSDs. Do lots of backups at each stage... This sounds scary, I'll keep fingers crossed and will do a bunch of backups. There are 17 pg with missing objects. What do you exactly mean by the osd store tool? Is it the 'ceph_filestore_tool' binary? Yeah, that one. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
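A very rough sketch of the export/import workflow being suggested (hedged: in later releases the tool is named ceph-objectstore-tool and these are its flags; the emperor-era ceph_filestore_tool binary may use slightly different options, so check its --help first, and take the backups Greg mentions before touching anything):

    # with osd.6 stopped
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 \
        --journal-path /var/lib/ceph/osd/ceph-6/journal \
        --pgid 2.74 --op export --file /root/pg2.74.export

    # then, with a healthy target OSD stopped, import it there with --op import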
Re: [ceph-users] Remaped osd at remote restart
On Mon, Sep 8, 2014 at 6:33 AM, Eduard Kormann ekorm...@dunkel.de wrote: Hello, have I missed something or is it a feature: When I restart a osd on the belonging server so it restarts normally: root@cephosd10:~# service ceph restart osd.76 === osd.76 === === osd.76 === Stopping Ceph osd.76 on cephosd10...kill 799176...done === osd.76 === create-or-move updating item name 'osd.76' weight 3.64 at location {host=cephosd10,root=default} to crush map Starting Ceph osd.76 on cephosd10... starting osd.76 at :/0 osd_data /var/lib/ceph/osd/ceph-76 /var/lib/ceph/osd/ceph-76/journal But if I trie to restart osd on the admin server...: root@cephadmin:/etc/ceph# service ceph -a restart osd.76 === osd.76 === === osd.76 === Stopping Ceph osd.76 on cephosd10...kill 800262...kill 800262...done === osd.76 === df: `/var/lib/ceph/osd/ceph-76/.': No such file or directory df: no file systems processed create-or-move updating item name 'osd.76' weight 1 at location {host=cephadmin,root=default} to crush map Starting Ceph osd.76 on cephosd10... starting osd.76 at :/0 osd_data /var/lib/ceph/osd/ceph-76 /var/lib/ceph/osd/ceph-76/journal ...it will associated with the admin server in the crush map: -17 0 host cephadmin 76 0 osd.76 up 0 Before that each osd could be started from arbitrary server with option -a. Apparently it no longer works. How do I run any osd from any server without error messages? ...huh. I didn't realize the -a option still existed. You should be able to prevent this from happening by adding osd crush update on start = false to the global section of your ceph.conf on any nodes which you are going to use to restart OSDs from other nodes with. I created a ticket to address this issue: http://tracker.ceph.com/issues/9407 -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
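For reference, the workaround Greg describes is a one-line addition (a sketch):

    [global]
        osd crush update on start = false

With that set on the node used for remote restarts, running service ceph -a restart osd.N should no longer re-home the OSD under that host in the CRUSH map.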
Re: [ceph-users] max_bucket limit -- safe to disable?
On Tue, Sep 9, 2014 at 9:11 AM, Daniel Schneller daniel.schnel...@centerdevice.com wrote: Hi list! Under http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-September/033670.html I found a situation not unlike ours, but unfortunately either the list archive fails me or the discussion ended without a conclusion, so I dare to ask again :) We currently have a setup of 4 servers with 12 OSDs each, combined journal and data. No SSDs. We develop a document management application that accepts user uploads of all kinds of documents and processes them in several ways. For any given document, we might create anywhere from 10s to several hundred dependent artifacts. We are now preparing to move from Gluster to a Ceph based backend. The application uses the Apache JClouds Library to talk to the Rados Gateways that are running on all 4 of these machines, load balanced by haproxy. We currently intend to create one container for each document and put all the dependent and derived artifacts as objects into that container. This gives us a nice compartmentalization per document, also making it easy to remove a document and everything that is connected with it. During the first test runs we ran into the default limit of 1000 containers per user. In the thread mentioned above that limit was removed (setting the max_buckets value to 0). We did that and now can upload more than 1000 documents. I just would like to understand a) if this design is recommended, or if there are reasons to go about the whole issue in a different way, potentially giving up the benefit of having all document artifacts under one convenient handle. b) is there any absolute limit for max_buckets that we will run into? Remember we are talking about 10s of millions of containers over time. c) are any performance issues to be expected with this design and can we tune any parameters to alleviate this? Any feedback would be very much appreciated. Yehuda can talk about this with more expertise than I can, but I think it should be basically fine. By creating so many buckets you're decreasing the effectiveness of RGW's metadata caching, which means the initial lookup in a particular bucket might take longer. The big concern is that we do maintain a per-user list of all their buckets — which is stored in a single RADOS object — so if you have an extreme number of buckets that RADOS object could get pretty big and become a bottleneck when creating/removing/listing the buckets. You should run your own experiments to figure out what the limits are there; perhaps you have an easy way of sharding up documents into different users. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
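For reference, the per-user limit mentioned above can also be lifted per user without editing config files (a sketch; the uid is hypothetical):

    radosgw-admin user modify --uid=docstore --max-buckets=0
    radosgw-admin user info --uid=docstore | grep max_buckets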
Re: [ceph-users] max_bucket limit -- safe to disable?
On Wednesday, September 10, 2014, Daniel Schneller daniel.schnel...@centerdevice.com wrote: On 09 Sep 2014, at 21:43, Gregory Farnum g...@inktank.com wrote: Yehuda can talk about this with more expertise than I can, but I think it should be basically fine. By creating so many buckets you're decreasing the effectiveness of RGW's metadata caching, which means the initial lookup in a particular bucket might take longer. Thanks for your thoughts. With “initial lookup in a particular bucket” do you mean accessing any of the objects in a bucket? If we directly access the object (not enumerating the bucket's content), would that still be an issue? Just trying to understand the inner workings a bit better to make more educated guesses :) When doing an object lookup, the gateway combines the bucket ID with a mangled version of the object name to try and do a read out of RADOS. It first needs to get that bucket ID though -- it will cache the bucket name-ID mapping, but if you have a ton of buckets there could be enough entries to degrade the cache's effectiveness. (So, you're more likely to pay that extra disk access lookup.) The big concern is that we do maintain a per-user list of all their buckets — which is stored in a single RADOS object — so if you have an extreme number of buckets that RADOS object could get pretty big and become a bottleneck when creating/removing/listing the buckets. You Alright. Listing buckets is no problem, that we don’t do. Can you say what “pretty big” would be in terms of MB? How much space does a bucket record consume in there? Based on that I could run a few numbers. Uh, a kilobyte per bucket? You could look it up in the source (I'm on my phone) but I *believe* the bucket name is allowed to be larger than the rest combined... More particularly, though, if you've got a single user uploading documents, each creating a new bucket, then those bucket creates are going to serialize on this one object. -Greg should run your own experiments to figure out what the limits are there; perhaps you have an easy way of sharding up documents into different users. Good advice. We can do that per distributor (an org unit in our software) to at least compartmentalize any potential locking issues in this area to that single entity. Still, there would be quite a lot of buckets/objects per distributor, so some more detail on the above items would be great. Thanks a lot! Daniel -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS roadmap (was Re: NAS on RBD)
On Tue, Sep 9, 2014 at 6:10 PM, Blair Bethwaite blair.bethwa...@gmail.com wrote: Hi Sage, Thanks for weighing into this directly and allaying some concerns. It would be good to get a better understanding about where the rough edges are - if deployers have some knowledge of those then they can be worked around to some extent. It's just a very long process to qualify a filesystem, even in this limited sense. We're still at the point where we're solving bugs that the open-source community brings us rather than setting out to make it stable for a particular identified workload. For the moment most of our development effort is focused on 1) instrumentation that makes it possible for users (and developers!) to identify the cause of problems we run across 2) basic mechanisms for fixing ephemeral bugs (things like booting dead clients, restarting hung metadata ops, etc) 3) general usability issues that our newer developers and users are reporting to us 4) the beginnings of fsck (correctness checking for now, no fixing yet) E.g., for our use-case it may be that whilst Inktank/RedHat won't provide support for CephFS that we are better off using it in a tightly controlled fashion (e.g., no snapshots, restricted set of native clients acting as presentation layer with others coming in via SAMBA Ganesha, no dynamic metadata tree/s, ???) where we're less likely to run into issues. Well, snapshots are definitely going to break your install (they're disabled by default, now). Multi-mds is unstable enough that nobody should be running with it. We run samba and NFS tests in our nightlies and they mostly work, although we've got some odd issues we've not tracked down when *ending* the samba process or unmounting nfs. (Our best guess on these is test or environment issues, rather than actual FS issues.) But these are probably not complete. Related, given there is no fsck, how would one go about backing up the metadata in order to facilitate DR? Is there even a way for that to make sense given the decoupling of data metadata pools...? Uh, depends on the kind of DR you're going for, I guess. There are lots of things that will backup a generic filesystem; you could do something smarter with a bit of custom scripting using Ceph's rstats. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
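The rstats Greg mentions are exposed as virtual extended attributes on directories, so a backup script can cheaply find subtrees that changed (a sketch; the mount point and path are illustrative):

    getfattr -n ceph.dir.rbytes /mnt/cephfs/projects
    getfattr -n ceph.dir.rctime /mnt/cephfs/projects

rbytes is the recursive byte count of the subtree and rctime is the most recent change time anywhere beneath it.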
Re: [ceph-users] why one osd-op from client can get two osd-op-reply?
On Wed, Sep 10, 2014 at 8:29 PM, yuelongguang fasts...@163.com wrote: as for ack and ondisk, ceph has size and min_size to decide there are how many replications. if client receive ack or ondisk, which means there are at least min_size osds have done the ops? i am reading the cource code, could you help me with the two questions. 1. on osd, where is the code that reply ops separately according to ack or ondisk. i check the code, but i thought they always are replied together. It depends on what journaling mode you're in, but generally they're triggered separately (unless it goes on disk first, in which case it will skip the ack — this is the mode it uses for non-btrfs filesystems). The places where it actually replies are pretty clear about doing one or the other, though... 2. now i just know how client write ops to primary osd, inside osd cluster, how it promises min_size copy are reached. i mean when primary osd receives ops , how it spreads ops to others, and how it processes other's reply. That's not how it works. The primary for a PG will not go active with it until it has at least min_size copies that it knows about. Once the OSD is doing any processing of the PG, it requires all participating members to respond before it sends any messages back to the client. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com greg, thanks very much 在 2014-09-11 01:36:39,Gregory Farnum g...@inktank.com 写道: The important bit there is actually near the end of the message output line, where the first says ack and the second says ondisk. I assume you're using btrfs; the ack is returned after the write is applied in-memory and readable by clients. The ondisk (commit) message is returned after it's durable to the journal or the backing filesystem. -Greg On Wednesday, September 10, 2014, yuelongguang fasts...@163.com wrote: hi,all i recently debug ceph rbd, the log tells that one write to osd can get two if its reply. the difference between them is seq. why? thanks ---log- reader got message 6 0x7f58900010a0 osd_op_reply(15 rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~3145728] v211'518 uv518 ack = 0) v6 2014-09-10 08:47:32.348213 7f58bc16b700 20 -- 10.58.100.92:0/1047669 queue 0x7f58900010a0 prio 127 2014-09-10 08:47:32.348230 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader reading tag... 
2014-09-10 08:47:32.348245 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got MSG 2014-09-10 08:47:32.348257 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got envelope type=43 src osd.1 front=247 data=0 off 0 2014-09-10 08:47:32.348269 7f58bc16b700 10 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader wants 247 from dispatch throttler 247/104857600 2014-09-10 08:47:32.348286 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got front 247 2014-09-10 08:47:32.348303 7f58bc16b700 10 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).aborted = 0 2014-09-10 08:47:32.348312 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got 247 + 0 + 0 byte message 2014-09-10 08:47:32.348332 7f58bc16b700 10 check_message_signature: seq # = 7 front_crc_ = 3699418201 middle_crc = 0 data_crc = 0 2014-09-10 08:47:32.348369 7f58bc16b700 10 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got message 7 0x7f5890003660 osd_op_reply(15 rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~3145728] v211'518 uv518 ondisk = 0) v6 -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
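The two replies are also visible from librados as the two completion callbacks; a minimal Python sketch (assumes python-rados, a reachable cluster, and a pool named 'rbd'):

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    def on_ack(completion):      # corresponds to the "ack" osd_op_reply
        print("ack: write applied and readable")

    def on_safe(completion):     # corresponds to the "ondisk" osd_op_reply
        print("ondisk: write committed durably")

    comp = ioctx.aio_write('test-object', 'hello', 0, on_ack, on_safe)
    comp.wait_for_safe()
    ioctx.close()
    cluster.shutdown()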
Re: [ceph-users] osd cpu usage is bigger than 100%
Presumably it's going faster when you have a deeper iodepth? So the reason it's using more CPU is because it's doing more work. That's all there is to it. (And the OSD uses a lot more CPU than some storage systems do, because it does a lot more work than them.) -Greg On Thursday, September 11, 2014, yuelongguang fasts...@163.com wrote: hi,all i am testing rbd performance, now there is only one vm which is using rbd as its disk, and inside it fio is doing r/w. the big diffenence is that i set a big iodepth other than iodepth=1. according to my test, the bigger iodepth, the bigger cpu usage. analyse the output of top command. 1. 12% wa, if it means disk speed is not fast enough? 2. from where we can know whether ceph's number of threads is enough or not? how do you think about it, which part is using up cpu? i want to find the root cause, why big iodepth leads to high cpu usage. ---default options osd_op_threads: 2, osd_disk_threads: 1, osd_recovery_threads: 1, filestore_op_threads: 2, thanks --top---iodepth=16- top - 15:27:34 up 2 days, 6:03, 2 users, load average: 0.49, 0.56, 0.62 Tasks: 97 total, 1 running, 96 sleeping, 0 stopped, 0 zombie Cpu(s): 19.0%us, 8.1%sy, 0.0%ni, 59.3%id, 12.1%wa, 0.0%hi, 0.8%si, 0.7%st Mem: 1922540k total, 1853180k used,69360k free, 7012k buffers Swap: 1048568k total,76796k used, 971772k free, 1034272k cached PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND 2763 root 20 0 1112m 386m 5028 S 60.8 20.6 200:43.47 ceph-osd -top top - 19:50:08 up 1 day, 10:26, 2 users, load average: 1.55, 0.97, 0.81 Tasks: 97 total, 1 running, 96 sleeping, 0 stopped, 0 zombie Cpu(s): 37.6%us, 14.2%sy, 0.0%ni, 37.0%id, 9.4%wa, 0.0%hi, 1.3%si, 0.5%st Mem: 1922540k total, 1820196k used, 102344k free,23100k buffers Swap: 1048568k total,91724k used, 956844k free, 1052292k cached PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND 4312 root 20 0 1100m 337m 5192 S 107.3 18.0 88:33.27 ceph-osd 1704 root 20 0 514m 272m 3648 S 0.7 14.5 3:27.19 ceph-mon --iostat-- Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 5.50 137.50 247.00 782.00 2896.00 8773.00 11.34 7.083.55 0.63 65.05 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 9.50 119.00 327.50 458.50 3940.00 4733.50 11.0312.03 19.66 0.70 55.40 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 15.5010.50 324.00 559.50 3784.00 3398.00 8.13 1.982.22 0.81 71.25 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 4.50 253.50 273.50 803.00 3056.00 12155.00 14.13 4.704.32 0.55 59.55 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 10.00 6.00 294.00 488.00 3200.00 2933.50 7.84 1.101.49 0.70 54.85 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 10.0014.00 333.00 645.00 3780.00 3846.00 7.80 2.132.15 0.90 87.55 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 11.00 240.50 259.00 579.00 3144.00 10035.50 15.73 8.51 10.18 0.84 70.20 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 10.5017.00 318.50 707.00 3876.00 4084.50 7.76 1.321.30 0.61 62.65 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 4.50 208.00 233.50 918.00 2648.00 19214.50 18.99 5.434.71 0.55 63.20 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util vdd 7.00 1.50 306.00 212.00 3376.00 2176.50 10.72 1.031.83 0.96 49.70 -- Software Engineer #42 @ 
http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
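For completeness, a sketch of the kind of fio comparison being discussed here (file name and sizes are illustrative; the point is only to vary --iodepth while watching CPU alongside):

    fio --name=qd-test --ioengine=libaio --direct=1 --rw=randwrite \
        --bs=4k --size=1g --runtime=60 --filename=/mnt/rbd/fio.dat --iodepth=1
    # repeat with --iodepth=16 and compare IOPS against `top` and `iostat -x 1`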
Re: [ceph-users] why one osd-op from client can get two osd-op-reply?
It's the recovery and backfill code. There's not one place; it's what most of the OSD code is for. On Thursday, September 11, 2014, yuelongguang fasts...@163.com wrote: as for the second question, could you tell me where the code is. how ceph makes size/min_szie copies? thanks At 2014-09-11 12:19:18, Gregory Farnum g...@inktank.com javascript:_e(%7B%7D,'cvml','g...@inktank.com'); wrote: On Wed, Sep 10, 2014 at 8:29 PM, yuelongguang fasts...@163.com javascript:_e(%7B%7D,'cvml','fasts...@163.com'); wrote: as for ack and ondisk, ceph has size and min_size to decide there are how many replications. if client receive ack or ondisk, which means there are at least min_size osds have done the ops? i am reading the cource code, could you help me with the two questions. 1. on osd, where is the code that reply ops separately according to ack or ondisk. i check the code, but i thought they always are replied together. It depends on what journaling mode you're in, but generally they're triggered separately (unless it goes on disk first, in which case it will skip the ack — this is the mode it uses for non-btrfs filesystems). The places where it actually replies are pretty clear about doing one or the other, though... 2. now i just know how client write ops to primary osd, inside osd cluster, how it promises min_size copy are reached. i mean when primary osd receives ops , how it spreads ops to others, and how it processes other's reply. That's not how it works. The primary for a PG will not go active with it until it has at least min_size copies that it knows about. Once the OSD is doing any processing of the PG, it requires all participating members to respond before it sends any messages back to the client. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com greg, thanks very much 在 2014-09-11 01:36:39,Gregory Farnum g...@inktank.com javascript:_e(%7B%7D,'cvml','g...@inktank.com'); 写道: The important bit there is actually near the end of the message output line, where the first says ack and the second says ondisk. I assume you're using btrfs; the ack is returned after the write is applied in-memory and readable by clients. The ondisk (commit) message is returned after it's durable to the journal or the backing filesystem. -Greg On Wednesday, September 10, 2014, yuelongguang fasts...@163.com javascript:_e(%7B%7D,'cvml','fasts...@163.com'); wrote: hi,all i recently debug ceph rbd, the log tells that one write to osd can get two if its reply. the difference between them is seq. why? thanks ---log- reader got message 6 0x7f58900010a0 osd_op_reply(15 rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~3145728] v211'518 uv518 ack = 0) v6 2014-09-10 08:47:32.348213 7f58bc16b700 20 -- 10.58.100.92:0/1047669 queue 0x7f58900010a0 prio 127 2014-09-10 08:47:32.348230 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader reading tag... 
2014-09-10 08:47:32.348245 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got MSG 2014-09-10 08:47:32.348257 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got envelope type=43 src osd.1 front=247 data=0 off 0 2014-09-10 08:47:32.348269 7f58bc16b700 10 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader wants 247 from dispatch throttler 247/104857600 2014-09-10 08:47:32.348286 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got front 247 2014-09-10 08:47:32.348303 7f58bc16b700 10 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).aborted = 0 2014-09-10 08:47:32.348312 7f58bc16b700 20 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got 247 + 0 + 0 byte message 2014-09-10 08:47:32.348332 7f58bc16b700 10 check_message_signature: seq # = 7 front_crc_ = 3699418201 middle_crc = 0 data_crc = 0 2014-09-10 08:47:32.348369 7f58bc16b700 10 -- 10.58.100.92:0/1047669 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 c=0xfae940).reader got message 7 0x7f5890003660 osd_op_reply(15 rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~3145728] v211'518 uv518 ondisk = 0) v6 -- Software Engineer #42 @ http://inktank.com | http://ceph.com -- Software Engineer #42 @ http://inktank.com | http://ceph.com
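The journaling mode Greg refers to is a filestore setting; as a rough ceph.conf sketch (option names from the firefly-era filestore, so verify them against your version before relying on them):

[osd]
# pick one mode; parallel journaling is only safe on btrfs
filestore journal parallel = true      # apply and commit proceed independently, so a separate ack can arrive before ondisk
# filestore journal writeahead = true  # writes hit the journal first, so the ack is folded into the single ondisk reply

The replica counts the second question asks about are per-pool settings, e.g.:

ceph osd pool get rbd size
ceph osd pool get rbd min_size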
Re: [ceph-users] Cephfs upon Tiering
On Thu, Sep 11, 2014 at 4:13 AM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: Hi all, I am testing the tiering functionality with cephfs. I used a replicated cache with an EC data pool, and a replicated metadata pool like this: ceph osd pool create cache 1024 1024 ceph osd pool set cache size 2 ceph osd pool set cache min_size 1 ceph osd erasure-code-profile set profile11 k=8 m=3 ruleset-failure-domain=osd ceph osd pool create ecdata 128 128 erasure profile11 ceph osd tier add ecdata cache ceph osd tier cache-mode cache writeback ceph osd tier set-overlay ecdata cache ceph osd pool set cache hit_set_type bloom ceph osd pool set cache hit_set_count 1 ceph osd pool set cache hit_set_period 3600 ceph osd pool set cache target_max_bytes $((280*1024*1024*1024)) ceph osd pool create metadata 128 128 ceph osd pool set metadata crush_ruleset 1 # SSD root in crushmap ceph fs new ceph_fs metadata cache -- wrong ? I started testing with this, and this worked, I could write to it with cephfs and the cache was flushing to the ecdata pool as expected. But now I notice I made the fs right upon the cache, instead of the underlying data pool. I suppose I should have done this: ceph fs new ceph_fs metadata ecdata So my question is: Was this wrong and not doing the things I thought it did, or was this somehow handled by ceph and didn't it matter I specified the cache instead of the data pool? Well, it's sort of doing what you want it to. You've told the filesystem to use the cache pool as the location for all of its data. But RADOS is pushing everything in the cache pool down to the ecdata pool. So it'll work for now as you want. But if in future you wanted to stop using the caching pool, or switch it out for a different pool entirely, that wouldn't work (whereas it would if the fs was using ecdata). We should perhaps look at prevent use of cache pools like this...hrm... http://tracker.ceph.com/issues/9435 -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
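If you later do want to detach the cache tier (which is easier when the filesystem points at ecdata rather than at cache), the sequence is roughly the following, with the pool names from Kenneth's example:

# stop absorbing new writes into the cache
ceph osd tier cache-mode cache forward
# flush and evict everything down to the backing pool
rados -p cache cache-flush-evict-all
# detach the overlay and remove the tier relationship
ceph osd tier remove-overlay ecdata
ceph osd tier remove ecdata cache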
Re: [ceph-users] Cephfs upon Tiering
On Thu, Sep 11, 2014 at 11:39 AM, Sage Weil sw...@redhat.com wrote: On Thu, 11 Sep 2014, Gregory Farnum wrote: On Thu, Sep 11, 2014 at 4:13 AM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: Hi all, I am testing the tiering functionality with cephfs. I used a replicated cache with an EC data pool, and a replicated metadata pool like this: ceph osd pool create cache 1024 1024 ceph osd pool set cache size 2 ceph osd pool set cache min_size 1 ceph osd erasure-code-profile set profile11 k=8 m=3 ruleset-failure-domain=osd ceph osd pool create ecdata 128 128 erasure profile11 ceph osd tier add ecdata cache ceph osd tier cache-mode cache writeback ceph osd tier set-overlay ecdata cache ceph osd pool set cache hit_set_type bloom ceph osd pool set cache hit_set_count 1 ceph osd pool set cache hit_set_period 3600 ceph osd pool set cache target_max_bytes $((280*1024*1024*1024)) ceph osd pool create metadata 128 128 ceph osd pool set metadata crush_ruleset 1 # SSD root in crushmap ceph fs new ceph_fs metadata cache -- wrong ? I started testing with this, and this worked, I could write to it with cephfs and the cache was flushing to the ecdata pool as expected. But now I notice I made the fs right upon the cache, instead of the underlying data pool. I suppose I should have done this: ceph fs new ceph_fs metadata ecdata So my question is: Was this wrong and not doing the things I thought it did, or was this somehow handled by ceph and didn't it matter I specified the cache instead of the data pool? Well, it's sort of doing what you want it to. You've told the filesystem to use the cache pool as the location for all of its data. But RADOS is pushing everything in the cache pool down to the ecdata pool. So it'll work for now as you want. But if in future you wanted to stop using the caching pool, or switch it out for a different pool entirely, that wouldn't work (whereas it would if the fs was using ecdata). We should perhaps look at prevent use of cache pools like this...hrm... http://tracker.ceph.com/issues/9435 Should we? I was planning on doing exactly this for my home cluster. Not cache pools under CephFS, but specifying the cache pool as the data pool (rather than some underlying pool). Or is there some reason we might want the cache pool to be the one the filesystem is using for indexing? -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Upgraded now MDS won't start
On Wed, Sep 10, 2014 at 4:24 PM, McNamara, Bradley bradley.mcnam...@seattle.gov wrote: Hello, This is my first real issue since running Ceph for several months. Here's the situation: I've been running an Emperor cluster for several months. All was good. I decided to upgrade since I'm running Ubuntu 13.10 and 0.72.2. I decided to first upgrade Ceph to 0.80.4, which was the last version in the apt repository for 13.10. I upgrade the MON's, then the OSD servers to 0.80.4; all went as expected with no issues. The last thing I did was upgrade the MDS using the same process, but now the MDS won't start. I've tried to manually start the MDS with debugging on, and I have attached the file. It complains that it's looking for mds.0.20 need osdmap epoch 3602, have 3601. Anyway, I'd don't really use CephFS or RGW, so I don't need the MDS, but I'd like to have it. Can someone tell me how to fix it, or delete it, so I can start over when I do need it? Right now my cluster is HEALTH_WARN because of it. Uh, the log is from an MDS running Emperor. That one looks like it's complaining because the mds data formats got updated for Firefly. ;) You'll need to run debugging from a Firefly mds to try and get something useful. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
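To capture the log Greg is asking for, first make sure the binary being started really is the firefly one, then raise MDS debugging before restarting; a sketch (the mds id and init commands depend on your deployment):

ceph-mds --version            # should report 0.80.x
# in ceph.conf, temporarily:
#   [mds]
#   debug mds = 20
#   debug journaler = 20
service ceph start mds        # or the upstart form: start ceph-mds id=<name>
tail -f /var/log/ceph/ceph-mds.*.log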
[ceph-users] Cephfs upon Tiering
On Fri, Sep 12, 2014 at 1:53 AM, Kenneth Waegeman kenneth.waege...@ugent.be javascript:; wrote: - Message from Sage Weil sw...@redhat.com javascript:; - Date: Thu, 11 Sep 2014 14:10:46 -0700 (PDT) From: Sage Weil sw...@redhat.com javascript:; Subject: Re: [ceph-users] Cephfs upon Tiering To: Gregory Farnum g...@inktank.com javascript:; Cc: Kenneth Waegeman kenneth.waege...@ugent.be javascript:;, ceph-users ceph-users@lists.ceph.com javascript:; On Thu, 11 Sep 2014, Gregory Farnum wrote: On Thu, Sep 11, 2014 at 11:39 AM, Sage Weil sw...@redhat.com javascript:; wrote: On Thu, 11 Sep 2014, Gregory Farnum wrote: On Thu, Sep 11, 2014 at 4:13 AM, Kenneth Waegeman kenneth.waege...@ugent.be javascript:; wrote: Hi all, I am testing the tiering functionality with cephfs. I used a replicated cache with an EC data pool, and a replicated metadata pool like this: ceph osd pool create cache 1024 1024 ceph osd pool set cache size 2 ceph osd pool set cache min_size 1 ceph osd erasure-code-profile set profile11 k=8 m=3 ruleset-failure-domain=osd ceph osd pool create ecdata 128 128 erasure profile11 ceph osd tier add ecdata cache ceph osd tier cache-mode cache writeback ceph osd tier set-overlay ecdata cache ceph osd pool set cache hit_set_type bloom ceph osd pool set cache hit_set_count 1 ceph osd pool set cache hit_set_period 3600 ceph osd pool set cache target_max_bytes $((280*1024*1024*1024)) ceph osd pool create metadata 128 128 ceph osd pool set metadata crush_ruleset 1 # SSD root in crushmap ceph fs new ceph_fs metadata cache -- wrong ? I started testing with this, and this worked, I could write to it with cephfs and the cache was flushing to the ecdata pool as expected. But now I notice I made the fs right upon the cache, instead of the underlying data pool. I suppose I should have done this: ceph fs new ceph_fs metadata ecdata So my question is: Was this wrong and not doing the things I thought it did, or was this somehow handled by ceph and didn't it matter I specified the cache instead of the data pool? Well, it's sort of doing what you want it to. You've told the filesystem to use the cache pool as the location for all of its data. But RADOS is pushing everything in the cache pool down to the ecdata pool. So it'll work for now as you want. But if in future you wanted to stop using the caching pool, or switch it out for a different pool entirely, that wouldn't work (whereas it would if the fs was using ecdata). After this I tried with the 'ecdata' pool, which is not working because itself is an EC pool. So I guess specifying the cache pool is then indeed the only way, but that's ok then if that works. It is just a bit confusing to specify the cache pool rather than the data:) *blinks* Uh, yeah. I forgot about that check, which was added because somebody tried to use CephFS on an EC pool without a cache on top. We've obviously got some UI work to do. Thanks for the reminder! -Greg -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Showing package loss in ceph main log
Ceph messages are transmitted using tcp, so the system isn't directly aware of packet loss at any level. I suppose we could try and export messenger reconnect counts via the admin socket, but that'd be a very noisy measure -- it seems simplest to just query the OS or hardware directly? -Greg On Friday, September 12, 2014, Josef Johansson jo...@oderland.se wrote: Hi, I've stumpled upon this a couple of times, where Ceph just stops responding, but still works. The cause has been package loss on the network layer, but Ceph doesn't say anything. Is there a debug flag for showing retransmission of package, or someway to see that packages are lost? Regards, Josef ___ ceph-users mailing list ceph-users@lists.ceph.com javascript:; http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
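Until something like that exists, the OS counters are the quickest check on each Ceph host; for example (the interface name is an assumption):

netstat -s | grep -i retrans             # TCP retransmissions since boot
ethtool -S eth0 | grep -iE 'drop|err'    # NIC-level drops and errors
ip -s link show eth0                     # per-interface RX/TX error and drop counters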
Re: [ceph-users] a question regarding sparse file
On Fri, Sep 12, 2014 at 9:26 AM, brandon li brandon.li@gmail.com wrote: Hi, I am new to ceph file system, and have got a newbie question: For a sparse file, how could ceph file system know the hole in the file was never created or some stripe was just simply lost? CephFS does not keep any metadata to try and track that; it assumes that non-existent objects are supposed to be holes. It relies on RADOS not losing data. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
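The inode-plus-offset naming can be seen directly with rados; a sketch, assuming the file lives in a pool named data and uses the default 4 MB object size:

# the object name prefix is the file's inode number in hex
printf '%x\n' $(stat -c %i /mnt/cephfs/somefile)
# data objects are then named <inode-hex>.<object-index>, e.g. 10003f6.00000000, 10003f6.00000001, ...
rados -p data ls | grep '^10003f6\.'
# an object that was never written (a hole) simply does not exist, which is why CephFS cannot
# distinguish it from a lost object and relies on RADOS not losing anything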
Re: [ceph-users] CephFS : rm file does not remove object in rados
On Fri, Sep 12, 2014 at 6:49 AM, Florent Bautista flor...@coppint.com wrote: Hi all, Today I have a problem using CephFS. I use firefly last release, with kernel 3.16 client (Debian experimental). I have a directory in CephFS, associated to a pool pool2 (with set_layout). All is working fine, I can add and remove files, objects are stored in the right pool. But when Ceph cluster is overloaded (or for another reason, I don't know), sometimes when I remove a file, objects are not deleted in rados ! CephFS file removal is asynchronous with you removing it from the filesystem. The files get moved into a stray directory and will get deleted once nobody holds references to them any more. I explain : I want to remove a large directory, containing millions of files. For a moment, objects are really deleted in rados (I see it in rados df), but when I start to do some heavy operations (like moving volumes in rdb), objects are not deleted anymore, rados df returns a fixed number of objects. I can see that files are still deleting because I use rsync (rsync -avP --stats --delete /empty/dir/ /dir/to/delete/). What do you mean you're rsyncing and can see files deleting? I don't understand. Anyway, It's *possible* that the client is holding capabilities on the deleted files and isn't handing them back, in which case unmounting it would drop them (and then you could remount). I don't think we have any commands designed to hasten that, though. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
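A rough way to confirm this is deferred deletion rather than leaked data is to drop the client's references and then watch the pool drain; pool name as in Florent's setup:

umount /mnt/cephfs                      # the only client releases its caps on the unlinked inodes
watch -n 10 'rados df | grep pool2'     # the object count should fall as the MDS purges the strays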
Re: [ceph-users] osd crash: trim_objectcould not find coid
On Fri, Sep 12, 2014 at 4:41 AM, Francois Deppierraz franc...@ctrlaltdel.ch wrote: Hi, Following-up this issue, I've identified that almost all unfound objects belongs to a single RBD volume (with the help of the script below). Now what's the best way to try to recover the filesystem stored on this RBD volume? 'mark_unfound_lost revert' or 'mark_unfound_lost lost' and then running fsck? By the way, I'm also still interested to know whether the procedure I've tried with ceph_objectstore_tool was correct? Yeah, that was the correct procedure. I believe you should just need to mark osd.6 as lost and remove it from the cluster and it will give up on getting the pg back. (You may also need to force_create_pgs or something; I don't recall. The docs should discuss that, though.) Once you've given up on the objects, recovering data from rbd images which included them is just like recovering from a lost hard drive sector or whatever. Hopefully fsck in the VM leaves you with a working filesystem, and however many files are still present... -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com Thanks! François [1] ceph-list-unfound.sh #!/bin/sh for pg in $(ceph health detail | awk '/unfound$/ { print $2; }'); do ceph pg $pg list_missing | jq .objects done | jq -s add | jq '.[] | .oid.oid' On 11. 09. 14 11:05, Francois Deppierraz wrote: Hi Greg, An attempt to recover pg 3.3ef by copying it from broken osd.6 to working osd.32 resulted in one more broken osd :( Here's what was actually done: root@storage1:~# ceph pg 3.3ef list_missing | head { offset: { oid: , key: , snapid: 0, hash: 0, max: 0, pool: -1, namespace: }, num_missing: 219, num_unfound: 219, objects: [ [...] root@storage1:~# ceph pg 3.3ef query [...] might_have_unfound: [ { osd: 6, status: osd is down}, { osd: 19, status: already probed}, { osd: 32, status: already probed}, { osd: 42, status: already probed}], [...] 
# Exporting pg 3.3ef from broken osd.6 root@storage2:~# ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-6/ --journal-path /var/lib/ceph/osd/ssd0/6.journal --pgid 3.3ef --op export --file ~/backup/osd-6.pg-3.3ef.export # Remove an empty pg 3.3ef which was already present on this OSD root@storage2:~# service ceph stop osd.32 root@storage2:~# ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-32/ --journal-path /var/lib/ceph/osd/ssd0/32.journal --pgid 3.3ef --op remove # Import pg 3.3ef from dump root@storage2:~# ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-32/ --journal-path /var/lib/ceph/osd/ssd0/32.journal --op import --file ~/backup/osd-6.pg-3.3ef.export root@storage2:~# service ceph start osd.32 -1 2014-09-10 18:53:37.196262 7f13fdd7d780 5 osd.32 pg_epoch: 48366 pg[3.3ef(unlocked)] enter Initial 0 2014-09-10 18:53:37.239479 7f13fdd7d780 -1 *** Caught signal (Aborted) ** in thread 7f13fdd7d780 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60) 1: /usr/bin/ceph-osd() [0x8843da] 2: (()+0xfcb0) [0x7f13fcfabcb0] 3: (gsignal()+0x35) [0x7f13fb98a0d5] 4: (abort()+0x17b) [0x7f13fb98d83b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f13fc2dc69d] 6: (()+0xb5846) [0x7f13fc2da846] 7: (()+0xb5873) [0x7f13fc2da873] 8: (()+0xb596e) [0x7f13fc2da96e] 9: /usr/bin/ceph-osd() [0x94b34f] 10: (pg_log_entry_t::decode_with_checksum(ceph::buffer::list::iterator)+0x12c) [0x691b6c] 11: (PGLog::read_log(ObjectStore*, coll_t, hobject_t, pg_info_t const, std::mapeversion_t, hobject_t, std::lesseversion_t, std::allocatorstd::paireversion_t const, hobject_t , PGLog::IndexedLog, pg_missing_t, std::basic_ostringstreamchar, std::char_traitschar, std::allocatorchar , std::setstd::string, std::lessstd:: string, std::allocatorstd::string *)+0x16d4) [0x7d3ef4] 12: (PG::read_state(ObjectStore*, ceph::buffer::list)+0x2c1) [0x7951b1] 13: (OSD::load_pgs()+0x18f3) [0x61e143] 14: (OSD::init()+0x1b9a) [0x62726a] 15: (main()+0x1e8d) [0x5d2d0d] 16: (__libc_start_main()+0xed) [0x7f13fb97576d] 17: /usr/bin/ceph-osd() [0x5d69d9] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. Fortunately it was possible to bring back osd.32 into a working state simply be removing this pg. root@storage2:~# ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-32/ --journal-path /var/lib/ceph/osd/ssd0/32.journal --pgid 3.3ef --op remove Did I miss something from this procedure or does it mean that this pg is definitely lost? Thanks! François On 09. 09. 14 00:23, Gregory Farnum wrote: On Mon, Sep 8, 2014 at 2:53 PM, Francois Deppierraz franc...@ctrlaltdel.ch wrote: Hi Greg, Thanks for your support! On 08. 09. 14 20:20, Gregory Farnum wrote
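For reference, the "give up on the objects" path looks roughly like this, with the osd and pg ids from François' mails (adjust to your cluster, and note these steps discard data):

ceph osd lost 6 --yes-i-really-mean-it      # declare osd.6 permanently gone
ceph osd crush remove osd.6
ceph auth del osd.6
ceph osd rm 6
# then roll the unfound objects back to an older version
ceph pg 3.3ef mark_unfound_lost revert
# and finally fsck the filesystem inside the affected RBD-backed VM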
Re: [ceph-users] Removing MDS
You can turn off the MDS and create a new FS in new pools. The ability to shut down a filesystem more completely is coming in Giant. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Fri, Sep 12, 2014 at 1:16 PM, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: We were building a test cluster here, and I enabled MDS in order to use ceph-fuse to fill the cluster with data. It seems the metadata server is having problems, so I figured I’d just remove it and rebuild it. However, the “ceph-deploy mds destroy” command is not implemented; it appears that once you have created an MDS, you can’t get rid of it without demolishing your entire cluster and building from scratch. And since the cluster is already out of whack, there seems to be no way to even drop OSDs to restart it cleanly. Should I just reboot all the OSD nodes and the monitor node, and hope the cluster comes up in a usable fashion? Because there seems no other option short of the burn-down and rebuild. -- CONFIDENTIALITY NOTICE: If you have received this email in error, please immediately notify the sender by e-mail at the address shown. This email transmission may contain confidential information. This information is intended only for the use of the individual(s) or entity to whom it is intended even if addressed incorrectly. Please delete it from your files if you are not the intended recipient. Thank you for your compliance. Copyright (c) 2014 Cigna == ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
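In pre-Giant releases the practical route is to stop the daemon and then point the MDS map at fresh pools; a sketch (pool ids are placeholders, and note that newfs throws away the existing filesystem metadata):

service ceph stop mds
ceph osd pool create newmeta 128 128
ceph osd pool create newdata 128 128
ceph osd lspools                          # note the numeric ids of the two new pools
ceph mds newfs <metadata-pool-id> <data-pool-id> --yes-i-really-mean-it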
Re: [ceph-users] why no likely() and unlikely() used in Ceph's source code?
I don't know where the file came from, but likely/unlikely markers are the kind of micro-optimization that isn't worth the cost in Ceph dev resources right now. -Greg On Monday, September 15, 2014, Tim Zhang cofol1...@gmail.com wrote: Hey guys, After reading ceph source code, I find that there is a file named common/likely.h and it implements the function likely() and unlikey() which will optimize the prediction of code branch for cpu. But there isn't any place using these two functions, I am curious about why the developer of ceph not using these two functions to achieve more performance. Can anyone provide some hints? BR -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Dumpling cluster can't resolve peering failures, ceph pg query blocks, auth failures in logs
Not sure, but have you checked the clocks on their nodes? Extreme clock drift often results in strange cephx errors. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Sun, Sep 14, 2014 at 11:03 PM, Florian Haas flor...@hastexo.com wrote: Hi everyone, [Keeping this on the -users list for now. Let me know if I should cross-post to -devel.] I've been asked to help out on a Dumpling cluster (a system bequeathed by one admin to the next, currently on 0.67.10, was originally installed with 0.67.5 and subsequently updated a few times), and I'm seeing a rather odd issue there. The cluster is relatively small, 3 MONs, 4 OSD nodes; each OSD node hosts a rather non-ideal 12 OSDs but its performance issues aren't really the point here. ceph health detail shows a bunch of PGs peering, but the usual troubleshooting steps don't really seem to work. For some PGs, ceph pg pgid query just blocks, doesn't return anything. Adding --debug_ms=10 shows that it's simply not getting a response back from one of the OSDs it's trying to talk to, as if packets dropped on the floor or were filtered out. However, opening a simple TCP connection to the OSD's IP and port works perfectly fine (netcat returns a Ceph signature). (Note, though, that because of a daemon flapping issue they at some point set both noout and nodown, so the cluster may not be behaving as normally expected when OSDs fail to respond in time.) Then there are some PGs where ceph pg pgid query is a little more verbose, though not exactly more successful: From ceph health detail: pg 6.c10 is stuck inactive for 1477.781394, current state peering, last acting [85,16] ceph pg 6.b1 query: 2014-09-15 01:06:48.200418 7f29a6efc700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption 2014-09-15 01:06:48.200428 7f29a6efc700 0 -- 10.47.17.1:0/1020420 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43263 s=1 pgs=0 cs=0 l=1 c=0x2c00d90).failed verifying authorize reply 2014-09-15 01:06:48.200465 7f29a6efc700 0 -- 10.47.17.1:0/1020420 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43263 s=1 pgs=0 cs=0 l=1 c=0x2c00d90).fault 2014-09-15 01:06:48.201000 7f29a6efc700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption 2014-09-15 01:06:48.201008 7f29a6efc700 0 -- 10.47.17.1:0/1020420 10.47.16.33:6818/15630 pipe(0x2c00b00 sd=4 :43264 s=1 pgs=0 cs=0 l=1 c=0x2c00d90).failed verifying authorize reply Oops. Now the admins swear they didn't touch the keys, but they are also (understandably) reluctant to just kill and redeploy all those OSDs, as these issues are basically scattered over a bunch of PGs touching many OSDs. How would they pinpoint this to be sure that they're not being bitten by a bug or misconfiguration? Not sure if people have seen this before — if so, I'd be grateful for some input. Loïc, Sébastien perhaps? Or João, Greg, Sage? Thanks in advance for any insight people might be able to share. :) Cheers, Florian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
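Concretely that means comparing clocks across the mons and OSD hosts and making sure ntp is actually synchronized; for example (the host list is an assumption):

for h in mon1 mon2 mon3 osd1 osd2 osd3 osd4; do ssh $h date +%s.%N; done
ntpq -pn                              # run on each node; look for a selected (*) peer and small offsets
ceph health detail | grep -i clock    # the mons flag skew beyond mon_clock_drift_allowed themselves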
Re: [ceph-users] OSD troubles on FS+Tiering
The pidfile bug is already fixed in master/giant branches. As for the crashing, I'd try killing all the osd processes and turning them back on again. It might just be some daemon restart failed, or your cluster could be sufficiently overloaded that the node disks are going unresponsive and they're suiciding, or... -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Sep 15, 2014 at 5:43 AM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: Hi, I have some strange OSD problems. Before the weekend I started some rsync tests over CephFS, on a cache pool with underlying EC KV pool. Today the cluster is completely degraded: [root@ceph003 ~]# ceph status cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d health HEALTH_WARN 19 pgs backfill_toofull; 403 pgs degraded; 168 pgs down; 8 pgs incomplete; 168 pgs peering; 61 pgs stale; 403 pgs stuck degraded; 176 pgs stuck inactive; 61 pgs stuck stale; 589 pgs stuck unclean; 403 pgs stuck undersized; 403 pgs undersized; 300 requests are blocked 32 sec; recovery 15170/27902361 objects degraded (0.054%); 1922/27902361 objects misplaced (0.007%); 1 near full osd(s) monmap e1: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 8, quorum 0,1,2 ceph001,ceph002,ceph003 mdsmap e5: 1/1/1 up {0=ceph003=up:active}, 2 up:standby osdmap e719: 48 osds: 18 up, 18 in pgmap v144887: 1344 pgs, 4 pools, 4139 GB data, 2624 kobjects 2282 GB used, 31397 GB / 33680 GB avail 15170/27902361 objects degraded (0.054%); 1922/27902361 objects misplaced (0.007%) 68 down+remapped+peering 1 active 754 active+clean 1 stale+incomplete 1 stale+active+clean+scrubbing 14 active+undersized+degraded+remapped 7 incomplete 100 down+peering 9 active+remapped 59 stale+active+undersized+degraded 19 active+undersized+degraded+remapped+backfill_toofull 311 active+undersized+degraded I tried to figure out what happened in the global logs: 2014-09-13 08:01:19.433313 mon.0 10.141.8.180:6789/0 66076 : [INF] pgmap v65892: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 4159 kB/s wr, 45 op/s 2014-09-13 08:01:20.443019 mon.0 10.141.8.180:6789/0 66078 : [INF] pgmap v65893: 1344 pgs: 1344 2014-09-13 08:01:20.443019 mon.0 10.141.8.180:6789/0 66078 : [INF] pgmap v65893: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 561 kB/s wr, 11 op/s 2014-09-13 08:01:20.777988 mon.0 10.141.8.180:6789/0 66081 : [INF] osd.19 10.141.8.181:6809/29664 failed (3 reports from 3 peers after 20.79 = grace 20.00) 2014-09-13 08:01:21.455887 mon.0 10.141.8.180:6789/0 66083 : [INF] osdmap e117: 48 osds: 47 up, 48 in 2014-09-13 08:01:21.462084 mon.0 10.141.8.180:6789/0 66084 : [INF] pgmap v65894: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 1353 kB/s wr, 13 op/s 2014-09-13 08:01:21.477007 mon.0 10.141.8.180:6789/0 66085 : [INF] pgmap v65895: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 2300 kB/s wr, 21 op/s 2014-09-13 08:01:22.456055 mon.0 10.141.8.180:6789/0 66086 : [INF] osdmap e118: 48 osds: 47 up, 48 in 2014-09-13 08:01:22.462590 mon.0 10.141.8.180:6789/0 66087 : [INF] pgmap v65896: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 13686 kB/s wr, 5 op/s 2014-09-13 08:01:23.464302 mon.0 10.141.8.180:6789/0 66088 : [INF] pgmap v65897: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 11075 kB/s wr, 
4 op/s 2014-09-13 08:01:24.477467 mon.0 10.141.8.180:6789/0 66089 : [INF] pgmap v65898: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 4932 kB/s wr, 38 op/s 2014-09-13 08:01:25.481027 mon.0 10.141.8.180:6789/0 66090 : [INF] pgmap v65899: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 5726 kB/s wr, 64 op/s 2014-09-13 08:01:19.336173 osd.1 10.141.8.180:6803/26712 54442 : [WRN] 1 slow requests, 1 included below; oldest blocked for 30.000137 secs 2014-09-13 08:01:19.336341 osd.1 10.141.8.180:6803/26712 54443 : [WRN] slow request 30.000137 seconds old, received at 2014-09-13 08:00:49.335339: osd_op(client.7448.1:17751783 1203eac.000e [write 0~319488 [1@-1],startsync 0~0] 1.b 6c3a3a9 snapc 1=[] ondisk+write e116) currently reached pg 2014-09-13 08:01:20.337602 osd.1 10.141.8.180:6803/26712 5 : [WRN] 7 slow requests, 6 included below; oldest blocked for 31.001947 secs 2014-09-13 08:01:20.337688 osd.1 10.141.8.180:6803/26712 54445 : [WRN] slow
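A sketch of the "restart everything and find out why they died" approach, run on each OSD host (init commands vary by distro, so adjust):

service ceph restart osd                  # restart every OSD defined on this host
# look for heartbeat/suicide-timeout aborts or assertion failures in the daemon logs
grep -lE 'suicide timeout|had timed out|FAILED assert' /var/log/ceph/ceph-osd.*.log
# and check the kernel log for disks going unresponsive
dmesg | grep -iE 'blocked for more than|i/o error'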
Re: [ceph-users] Cephfs upon Tiering
On Mon, Sep 15, 2014 at 6:32 AM, Berant Lemmenes ber...@lemmenes.com wrote: Greg, So is the consensus that the appropriate way to implement this scenario is to have the fs created on the EC backing pool vs. the cache pool but that the UI check needs to be tweaked to distinguish between this scenario and just trying to use a EC pool alone? Yeah, we'll fix this for Giant. In practical terms it doesn't make much difference right now; just want to be consistent for the future. :) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com I'm also interested in the scenario of having a EC backed pool fronted by a replicated cache for use with cephfs. Thanks, Berant On Fri, Sep 12, 2014 at 12:37 PM, Gregory Farnum g...@inktank.com wrote: On Fri, Sep 12, 2014 at 1:53 AM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: - Message from Sage Weil sw...@redhat.com - Date: Thu, 11 Sep 2014 14:10:46 -0700 (PDT) From: Sage Weil sw...@redhat.com Subject: Re: [ceph-users] Cephfs upon Tiering To: Gregory Farnum g...@inktank.com Cc: Kenneth Waegeman kenneth.waege...@ugent.be, ceph-users ceph-users@lists.ceph.com On Thu, 11 Sep 2014, Gregory Farnum wrote: On Thu, Sep 11, 2014 at 11:39 AM, Sage Weil sw...@redhat.com wrote: On Thu, 11 Sep 2014, Gregory Farnum wrote: On Thu, Sep 11, 2014 at 4:13 AM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: Hi all, I am testing the tiering functionality with cephfs. I used a replicated cache with an EC data pool, and a replicated metadata pool like this: ceph osd pool create cache 1024 1024 ceph osd pool set cache size 2 ceph osd pool set cache min_size 1 ceph osd erasure-code-profile set profile11 k=8 m=3 ruleset-failure-domain=osd ceph osd pool create ecdata 128 128 erasure profile11 ceph osd tier add ecdata cache ceph osd tier cache-mode cache writeback ceph osd tier set-overlay ecdata cache ceph osd pool set cache hit_set_type bloom ceph osd pool set cache hit_set_count 1 ceph osd pool set cache hit_set_period 3600 ceph osd pool set cache target_max_bytes $((280*1024*1024*1024)) ceph osd pool create metadata 128 128 ceph osd pool set metadata crush_ruleset 1 # SSD root in crushmap ceph fs new ceph_fs metadata cache -- wrong ? I started testing with this, and this worked, I could write to it with cephfs and the cache was flushing to the ecdata pool as expected. But now I notice I made the fs right upon the cache, instead of the underlying data pool. I suppose I should have done this: ceph fs new ceph_fs metadata ecdata So my question is: Was this wrong and not doing the things I thought it did, or was this somehow handled by ceph and didn't it matter I specified the cache instead of the data pool? Well, it's sort of doing what you want it to. You've told the filesystem to use the cache pool as the location for all of its data. But RADOS is pushing everything in the cache pool down to the ecdata pool. So it'll work for now as you want. But if in future you wanted to stop using the caching pool, or switch it out for a different pool entirely, that wouldn't work (whereas it would if the fs was using ecdata). After this I tried with the 'ecdata' pool, which is not working because itself is an EC pool. So I guess specifying the cache pool is then indeed the only way, but that's ok then if that works. It is just a bit confusing to specify the cache pool rather than the data:) *blinks* Uh, yeah. I forgot about that check, which was added because somebody tried to use CephFS on an EC pool without a cache on top. We've obviously got some UI work to do. 
Thanks for the reminder! -Greg -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] does CephFS still have no fsck utility?
CephFS in general has a lot fewer metadata structures than traditional filesystems generally do; about the only thing that could go wrong without users noticing directly is: 1) The data gets corrupted 2) Files somehow get removed from folders. Data corruption is something RADOS is responsible for detecting through its scrub processes and things. If CephFS actually dropped a file, yeah, that'd be a problem which we don't have other mechanisms of detecting at this time. But the more traditional sort of fsck activities like looking for doubly-linked blocks or multiply-allocated inodes are more or less impossible given our decentralized architecture and lack of stored metadata (for instance, data blocks are just objects whose name is calculated based on the inode number and the offset within the file). If it makes you feel better, fsck is something I've been working on a lot recently, based on the design discussions we had early last year. The first pass is just a scrubbing mechanism to make sure that the hierarchy is self-consistent and the referenced RADOS objects actually exist; later we'll move on to checking that each RADOS object is associated with a particular file. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Sep 15, 2014 at 4:15 PM, brandon li brandon.li@gmail.com wrote: Thanks for the reply, Greg. With traditional file system experience, I have to admit it will take me some time to get used to the way CephFS works. Considering it as part of my learning curve. :-) One of concerns I have it that, without tools like fsck, how could we know the file system is still consistent? Even RADOS doesn't report error, could there be any miscommunication(e.g., due to bug, networking issue, disk bit flip, ...) between metadata operation and stripe I/O? For example, the first stripe of a file is created but its inode id(on RADOS) is wrong for some reason, and thus RADOS doesn't think it belongs to the correct file. This may never happen and I just use it here to explain my concern. Thanks, Brandon On Mon, Sep 15, 2014 at 3:49 PM, Gregory Farnum g...@inktank.com wrote: On Mon, Sep 15, 2014 at 3:23 PM, brandon li brandon.li@gmail.com wrote: If it's true, is there any other tools I can use to check and repair the file system? Not much, no. That said, you shouldn't really need an fsck unless the underlying RADOS store went through some catastrophic event. Is there anything in particular you're worried about? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
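The RADOS-level scrubbing mentioned above can also be driven by hand if you want to force a consistency pass; for example:

ceph pg scrub <pgid>          # compare object metadata/sizes across replicas
ceph pg deep-scrub <pgid>     # read and checksum the actual data
ceph osd deep-scrub <osd-id>  # push every pg on one OSD through a deep scrub
# mismatches then show up as inconsistent pgs / scrub errors in ceph health detail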
Re: [ceph-users] OSD troubles on FS+Tiering
Heh, you'll have to talk to Haomai about issues with the KeyValueStore, but I know he's found a number of issues in the version of it that went to 0.85. In future please flag when you're running with experimental stuff; it helps direct attention to the right places! ;) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Sep 16, 2014 at 5:28 AM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: - Message from Gregory Farnum g...@inktank.com - Date: Mon, 15 Sep 2014 10:37:07 -0700 From: Gregory Farnum g...@inktank.com Subject: Re: [ceph-users] OSD troubles on FS+Tiering To: Kenneth Waegeman kenneth.waege...@ugent.be Cc: ceph-users ceph-users@lists.ceph.com The pidfile bug is already fixed in master/giant branches. As for the crashing, I'd try killing all the osd processes and turning them back on again. It might just be some daemon restart failed, or your cluster could be sufficiently overloaded that the node disks are going unresponsive and they're suiciding, or... I restarted them that way, and they eventually got clean again. 'ceph status' printed that 'ecdata' pool had too few pgs, so I changed the amount of pgs from 128 to 256 (with EC k+m=11) After a few minutes I checked the cluster state again: [root@ceph001 ~]# ceph status cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d health HEALTH_WARN 100 pgs down; 155 pgs peering; 81 pgs stale; 240 pgs stuck inactive; 81 pgs stuck stale; 240 pgs stuck unclean; 746 requests are blocked 32 sec; 'cache' at/near target max; pool ecdata pg_num 256 pgp_num 128 monmap e1: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 8, quorum 0,1,2 ceph001,ceph002,ceph003 mdsmap e6993: 1/1/1 up {0=ceph003=up:active}, 2 up:standby osdmap e11023: 48 osds: 14 up, 14 in pgmap v160466: 1472 pgs, 4 pools, 3899 GB data, 2374 kobjects 624 GB used, 7615 GB / 8240 GB avail 75 creating 1215 active+clean 100 down+peering 1 active+clean+scrubbing 10 stale 16 stale+active+clean Again 34 OSDS are down.. 
This time I have the error log, I checked a few osd logs : I checked the first host that was marked down: -17 2014-09-16 13:27:49.962938 7f5dfe6a3700 5 osd.7 pg_epoch: 8912 pg[2.b0s3(unlocked)] enter Initial -16 2014-09-16 13:27:50.008842 7f5e02eac700 1 -- 10.143.8.180:6833/53810 == osd.30 10.141.8.181:0/37396 2524 osd_ping(ping e8912 stamp 2014-09-16 13:27:50.008514) v2 47+0+0 (386299 0 0) 0x18ef7080 con 0x6961600 -15 2014-09-16 13:27:50.008892 7f5e02eac700 1 -- 10.143.8.180:6833/53810 -- 10.141.8.181:0/37396 -- osd_ping(ping_reply e8912 stamp 2014-09-16 13:27:50.008514) v2 -- ?+0 0x7326900 con 0x6961600 -14 2014-09-16 13:27:50.009159 7f5e046af700 1 -- 10.141.8.180:6847/53810 == osd.30 10.141.8.181:0/37396 2524 osd_ping(ping e8912 stamp 2014-09-16 13:27:50.008514) v2 47+0+0 (386299 0 0) 0x2210a760 con 0xadd0420 -13 2014-09-16 13:27:50.009202 7f5e046af700 1 -- 10.141.8.180:6847/53810 -- 10.141.8.181:0/37396 -- osd_ping(ping_reply e8912 stamp 2014-09-16 13:27:50.008514) v2 -- ?+0 0x14e35a00 con 0xadd0420 -12 2014-09-16 13:27:50.034378 7f5dfeea4700 5 osd.7 pg_epoch: 8912 pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104 les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912 pi=104-8911/54 crt=8864'33359 inactive NOTIFY] exit Reset 0.127612 1 0.000123 -11 2014-09-16 13:27:50.034432 7f5dfeea4700 5 osd.7 pg_epoch: 8912 pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104 les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912 pi=104-8911/54 crt=8864'33359 inactive NOTIFY] enter Started -10 2014-09-16 13:27:50.034452 7f5dfeea4700 5 osd.7 pg_epoch: 8912 pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104 les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912 pi=104-8911/54 crt=8864'33359 inactive NOTIFY] enter Start -9 2014-09-16 13:27:50.034469 7f5dfeea4700 1 osd.7 pg_epoch: 8912 pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104 les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912 pi=104-8911/54 crt=8864'33359 inactive NOTIFY] stateStart: transitioning to Stray -8 2014-09-16 13:27:50.034491 7f5dfeea4700 5 osd.7 pg_epoch: 8912 pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104 les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912 pi=104-8911/54 crt=8864'33359 inactive NOTIFY] exit Start 0.38 0 0.00 -7 2014-09-16 13:27:50.034521 7f5dfeea4700 5 osd.7 pg_epoch: 8912 pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104 les/c 813/815 805/8912
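If you are not sure which OSDs are running on the experimental backend, a running daemon will tell you; e.g.:

ceph daemon osd.7 config get osd_objectstore     # reports the backend, e.g. filestore or the experimental key/value store
grep -i objectstore /etc/ceph/ceph.conf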
Re: [ceph-users] does CephFS still have no fsck utility?
http://tracker.ceph.com/issues/4137 contains links to all the tasks we have so far. You can also search any of the ceph-devel list archives for forward scrub. On Mon, Sep 15, 2014 at 10:16 PM, brandon li brandon.li@gmail.com wrote: Great to know you are working on it! I am new to the mailing list. Is there any reference of discussion last year, so I can look into. or any bug number I can watch to keep track of the development? Thanks, Brandon ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] what are these files for mon?
I don't really know; Joao has handled all these cases. I *think* they've been tied to a few bad versions of LevelDB, but I'm not certain. (There were a number of discussions about it on the public mailing lists.) -Greg On Tuesday, September 16, 2014, Florian Haas flor...@hastexo.com wrote: Hi Greg, just picked up this one from the archive while researching a different issue and thought I'd follow up. On Tue, Aug 19, 2014 at 6:24 PM, Gregory Farnum g...@inktank.com javascript:; wrote: The sst files are files used by leveldb to store its data; you cannot remove them. Are you running on a very small VM? How much space are the files taking up in aggregate? Speaking generally, I think you should see something less than a GB worth of data there, but some versions of leveldb under some scenarios are known to misbehave and grow pretty large. Can you elaborate on the scenarios where leveldb is misbehaving? I've also seen reports of this before, with .sst files growing to several GB in size. Is this a cause for concern (for example, would you expect mons to slow down) and if so, how would you recover? Would you essentially nuke the mon and replace it with another? Cheers, Florian -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
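For anyone hitting this, two quick things to look at are the size of the store and whether an explicit compaction helps; a sketch (mon id and paths are assumptions, and the compact command may not exist on very old releases):

du -sh /var/lib/ceph/mon/ceph-*/store.db      # how big has leveldb grown?
ceph tell mon.<id> compact                    # ask the monitor to compact its store online
# or in ceph.conf:
#   [mon]
#   mon compact on start = true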
Re: [ceph-users] Mount ceph block device over specific NIC
Assuming you're using the kernel? In any case, Ceph generally doesn't do anything to select between different NICs; it just asks for a connection to a given IP. So you should just be able to set up a route for that IP. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Sep 16, 2014 at 4:13 AM, Arne K. Haaje a...@drlinux.no wrote: Hello, We have a machine that mounts a rbd image as a block device, then rsync files from another server to this mount. As this rsync traffic will have to share bandwith with the writing to the RBD, I wonder if it is possible to specify which NIC to mount the RBD through? We are using 0.85.5 on Ubuntu 14.04. Regards, Arne ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
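So the practical approach is a host route that steers traffic for the Ceph public network out of the NIC you want; a sketch with made-up addresses:

ip route add 10.10.10.0/24 dev eth1 src 10.10.10.25   # send traffic for the mon/OSD network via eth1
ip route get 10.10.10.55                              # verify which interface a given OSD/mon IP will use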
Re: [ceph-users] Still seing scrub errors in .80.5
On Tue, Sep 16, 2014 at 12:03 AM, Marc m...@shoowin.de wrote: Hello fellow cephalopods, every deep scrub seems to dig up inconsistencies (i.e. scrub errors) that we could use some help with diagnosing. I understand there used to be a data corruption issue before .80.3 so we made sure that all the nodes were upgraded to .80.5 and all the daemons were restarted (they all report .80.5 when contacted via socket). *After* that we ran a deep scrub, which obviously found errors, which we then repaired. But unfortunately, it's now a week later, and the next deep scrub has dug up new errors, which shouldn't have happened I think...? ceph.log shows these errors in between the deep scrub messages: 2014-09-15 07:56:23.164818 osd.15 10.10.10.55:6804/23853 364 : [ERR] 3.335 shard 2: soid 6ba68735/rbd_data.59e3c2ae8944a.06b1/head//3 digest 3090820441 != known digest 3787996302 2014-09-15 07:56:23.164827 osd.15 10.10.10.55:6804/23853 365 : [ERR] 3.335 shard 6: soid 6ba68735/rbd_data.59e3c2ae8944a.06b1/head//3 digest 3259686791 != known digest 3787996302 2014-09-15 07:56:28.485713 osd.15 10.10.10.55:6804/23853 366 : [ERR] 3.335 deep-scrub 0 missing, 1 inconsistent objects 2014-09-15 07:56:28.485734 osd.15 10.10.10.55:6804/23853 367 : [ERR] 3.335 deep-scrub 2 errors Uh, I'm afraid those errors were never output as a result of bugs in Firefly. These are indicating actual data differences between the nodes, whereas the Firefly issue was a metadata flag that wasn't handled properly in mixed-version OSD clusters. I don't think Ceph has ever had a bug that would change the data payload between OSDs. Searching the tracker logs, the only entries with this error message are: 1) The local filesystem is not misbehaving under the workload we give it (and there are no known filesystem issues that are exposed by running firefly OSDs in default config that I can think of — certainly none with this error) 2) The disks themselves are bad. :/ -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
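Checking for option 2 (failing hardware) is cheap and worth doing first on the hosts holding the affected OSDs; the device names below are assumptions:

dmesg | grep -iE 'i/o error|ata.*error|xfs'           # kernel-level I/O or filesystem errors
smartctl -a /dev/sdb | grep -iE 'reallocated|pending|uncorrect|overall-health'
df -h /var/lib/ceph/osd/ceph-15                       # map an OSD id back to its mount and device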
Re: [ceph-users] Still seing scrub errors in .80.5
Ah, you're right — it wasn't popping up in the same searches and I'd forgotten that was so recent. In that case, did you actually deep scrub *everything* in the cluster, Marc? You'll need to run and fix every PG in the cluster, and the background deep scrubbing doesn't move through the data very quickly. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Sep 16, 2014 at 11:32 AM, Dan Van Der Ster daniel.vanders...@cern.ch wrote: Hi Greg, I believe Marc is referring to the corruption triggered by set_extsize on xfs. That option was disabled by default in 0.80.4... See the thread firefly scrub error. Cheers, Dan From: Gregory Farnum g...@inktank.com Sent: Sep 16, 2014 8:15 PM To: Marc Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Still seing scrub errors in .80.5 On Tue, Sep 16, 2014 at 12:03 AM, Marc m...@shoowin.de wrote: Hello fellow cephalopods, every deep scrub seems to dig up inconsistencies (i.e. scrub errors) that we could use some help with diagnosing. I understand there used to be a data corruption issue before .80.3 so we made sure that all the nodes were upgraded to .80.5 and all the daemons were restarted (they all report .80.5 when contacted via socket). *After* that we ran a deep scrub, which obviously found errors, which we then repaired. But unfortunately, it's now a week later, and the next deep scrub has dug up new errors, which shouldn't have happened I think...? ceph.log shows these errors in between the deep scrub messages: 2014-09-15 07:56:23.164818 osd.15 10.10.10.55:6804/23853 364 : [ERR] 3.335 shard 2: soid 6ba68735/rbd_data.59e3c2ae8944a.06b1/head//3 digest 3090820441 != known digest 3787996302 2014-09-15 07:56:23.164827 osd.15 10.10.10.55:6804/23853 365 : [ERR] 3.335 shard 6: soid 6ba68735/rbd_data.59e3c2ae8944a.06b1/head//3 digest 3259686791 != known digest 3787996302 2014-09-15 07:56:28.485713 osd.15 10.10.10.55:6804/23853 366 : [ERR] 3.335 deep-scrub 0 missing, 1 inconsistent objects 2014-09-15 07:56:28.485734 osd.15 10.10.10.55:6804/23853 367 : [ERR] 3.335 deep-scrub 2 errors Uh, I'm afraid those errors were never output as a result of bugs in Firefly. These are indicating actual data differences between the nodes, whereas the Firefly issue was a metadata flag that wasn't handled properly in mixed-version OSD clusters. I don't think Ceph has ever had a bug that would change the data payload between OSDs. Searching the tracker logs, the only entries with this error message are: 1) The local filesystem is not misbehaving under the workload we give it (and there are no known filesystem issues that are exposed by running firefly OSDs in default config that I can think of — certainly none with this error) 2) The disks themselves are bad. :/ -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
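To push the whole cluster through a deep scrub instead of waiting for the background schedule, one per-OSD loop is the following (expect a lot of extra read I/O while it runs):

for osd in $(ceph osd ls); do ceph osd deep-scrub $osd; done
ceph -w | grep -i deep-scrub      # watch completions and any new errors roll in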
Re: [ceph-users] Packages for 0.85?
Thanks for the poke; looks like something went wrong during the release build last week. We're investigating now. -Greg On Tue, Sep 16, 2014 at 11:08 AM, Daniel Swarbrick daniel.swarbr...@profitbricks.com wrote: Hi, I saw that the development snapshot 0.85 was released last week, and have been patiently waiting for packages to appear, so that I can upgrade a test cluster here. Can we still expect packages (wheezy, in my case) of 0.85 to be published? Thanks! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Replication factor of 50 on a 1000 OSD node cluster
On Tue, Sep 16, 2014 at 5:10 PM, JIten Shah jshah2...@me.com wrote: Hi Guys, We have a cluster with 1000 OSD nodes and 5 MON nodes and 1 MDS node. In order to be able to loose quite a few OSD’s and still survive the load, we were thinking of making the replication factor to 50. Is that too big of a number? what is the performance implications and any other issues that we should consider before setting it to that. Also, do we need the same number of metadata copies too or it can be less? Don't do that. Every write has to be synchronously copied to every replica, so 50x replication will give you very high latencies and very low write bandwidth to each object. If you're just worried about not losing data, there are a lot of people with big clusters running 3x replication and it's been fine. If you have some use case where you think you're going to be turning off a bunch of nodes simultaneously without planning, Ceph might not be the storage system for your needs. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Replication factor of 50 on a 1000 OSD node cluster
Yeah, so generally those will be correlated with some failure domain, and if you spread your replicas across failure domains you won't hit any issues. And if hosts are down for any length of time the OSDs will re-replicate data to keep it at proper redundancy. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Sep 16, 2014 at 5:53 PM, JIten Shah jshah2...@icloud.com wrote: Thanks Greg. We may not turn off the nodes randomly without planning but with a 1000 node cluster, there could be 5 to 10 hosts that might crash or go down in case of an event. —Jiten On Sep 16, 2014, at 5:35 PM, Gregory Farnum g...@inktank.com wrote: On Tue, Sep 16, 2014 at 5:10 PM, JIten Shah jshah2...@me.com wrote: Hi Guys, We have a cluster with 1000 OSD nodes and 5 MON nodes and 1 MDS node. In order to be able to loose quite a few OSD’s and still survive the load, we were thinking of making the replication factor to 50. Is that too big of a number? what is the performance implications and any other issues that we should consider before setting it to that. Also, do we need the same number of metadata copies too or it can be less? Don't do that. Every write has to be synchronously copied to every replica, so 50x replication will give you very high latencies and very low write bandwidth to each object. If you're just worried about not losing data, there are a lot of people with big clusters running 3x replication and it's been fine. If you have some use case where you think you're going to be turning off a bunch of nodes simultaneously without planning, Ceph might not be the storage system for your needs. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
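In practice that means leaving the pool at 3 copies and letting CRUSH put each copy in a different failure domain; a sketch:

ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2
# and in the crush rule, separate replicas by host (or rack) rather than by osd:
#   step chooseleaf firstn 0 type host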
Re: [ceph-users] [Ceph-community] Can't Start-up MDS
That looks like the beginning of an mds creation to me. What's your problem in more detail, and what's the output of ceph -s? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Sep 15, 2014 at 5:34 PM, Shun-Fa Yang shu...@gmail.com wrote: Hi all, I'm installed ceph v 0.80.5 on Ubuntu 14.04 server version by using apt-get... The log of mds shows as following: 2014-09-15 17:24:58.291305 7fd6f6d47800 0 ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6), process ceph-mds, pid 10487 2014-09-15 17:24:58.302164 7fd6f6d47800 -1 mds.-1.0 *** no OSDs are up as of epoch 8, waiting 2014-09-15 17:25:08.302930 7fd6f6d47800 -1 mds.-1.-1 *** no OSDs are up as of epoch 8, waiting 2014-09-15 17:25:19.322092 7fd6f1938700 1 mds.-1.0 handle_mds_map standby 2014-09-15 17:25:19.325024 7fd6f1938700 1 mds.0.3 handle_mds_map i am now mds.0.3 2014-09-15 17:25:19.325026 7fd6f1938700 1 mds.0.3 handle_mds_map state change up:standby -- up:creating 2014-09-15 17:25:19.325196 7fd6f1938700 0 mds.0.cache creating system inode with ino:1 2014-09-15 17:25:19.325377 7fd6f1938700 0 mds.0.cache creating system inode with ino:100 2014-09-15 17:25:19.325381 7fd6f1938700 0 mds.0.cache creating system inode with ino:600 2014-09-15 17:25:19.325449 7fd6f1938700 0 mds.0.cache creating system inode with ino:601 2014-09-15 17:25:19.325489 7fd6f1938700 0 mds.0.cache creating system inode with ino:602 2014-09-15 17:25:19.325538 7fd6f1938700 0 mds.0.cache creating system inode with ino:603 2014-09-15 17:25:19.325564 7fd6f1938700 0 mds.0.cache creating system inode with ino:604 2014-09-15 17:25:19.325603 7fd6f1938700 0 mds.0.cache creating system inode with ino:605 2014-09-15 17:25:19.325627 7fd6f1938700 0 mds.0.cache creating system inode with ino:606 2014-09-15 17:25:19.325655 7fd6f1938700 0 mds.0.cache creating system inode with ino:607 2014-09-15 17:25:19.325682 7fd6f1938700 0 mds.0.cache creating system inode with ino:608 2014-09-15 17:25:19.325714 7fd6f1938700 0 mds.0.cache creating system inode with ino:609 2014-09-15 17:25:19.325738 7fd6f1938700 0 mds.0.cache creating system inode with ino:200 Could someone tell me how to solve it? Thanks. -- 楊順發(yang shun-fa) ___ Ceph-community mailing list ceph-commun...@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph mds unable to start with 0.85
On Wed, Sep 17, 2014 at 9:59 PM, 廖建锋 de...@f-club.cn wrote: dear, my ceph cluster worked for about two weeks, mds crashed every 2-3 days, Now it stuck on replay , looks like replay crash and restart mds process again what can i do for this? 1015 = # ceph -s cluster 07df7765-c2e7-44de-9bb3-0b13f6517b18 health HEALTH_ERR 56 pgs inconsistent; 56 scrub errors; mds cluster is degraded; noscrub,nodeep-scrub flag(s) set monmap e1: 2 mons at {storage-1-213=10.1.0.213:6789/0,storage-1-214=10.1.0.214:6789/0}, election epoch 26, quorum 0,1 storage-1-213,storage-1-214 mdsmap e624: 1/1/1 up {0=storage-1-214=up:replay}, 1 up:standby osdmap e1932: 18 osds: 18 up, 18 in flags noscrub,nodeep-scrub pgmap v732381: 500 pgs, 3 pools, 2155 GB data, 39187 kobjects 4479 GB used, 32292 GB / 36772 GB avail 444 active+clean 56 active+clean+inconsistent client io 125 MB/s rd, 31 op/s MDS log here: 014-09-18 12:36:10.684841 7f8240512700 5 mds.-1.-1 handle_mds_map epoch 620 from mon.0 2014-09-18 12:36:10.684888 7f8240512700 1 mds.-1.0 handle_mds_map standby 2014-09-18 12:38:55.584370 7f8240512700 5 mds.-1.0 handle_mds_map epoch 621 from mon.0 2014-09-18 12:38:55.584432 7f8240512700 1 mds.0.272 handle_mds_map i am now mds.0.272 2014-09-18 12:38:55.584436 7f8240512700 1 mds.0.272 handle_mds_map state change up:standby -- up:replay 2014-09-18 12:38:55.584440 7f8240512700 1 mds.0.272 replay_start 2014-09-18 12:38:55.584456 7f8240512700 7 mds.0.cache set_recovery_set 2014-09-18 12:38:55.584460 7f8240512700 1 mds.0.272 recovery set is 2014-09-18 12:38:55.584464 7f8240512700 1 mds.0.272 need osdmap epoch 1929, have 1927 2014-09-18 12:38:55.584467 7f8240512700 1 mds.0.272 waiting for osdmap 1929 (which blacklists prior instance) 2014-09-18 12:38:55.584523 7f8240512700 5 mds.0.272 handle_mds_failure for myself; not doing anything 2014-09-18 12:38:55.585662 7f8240512700 2 mds.0.272 boot_start 0: opening inotable 2014-09-18 12:38:55.585864 7f8240512700 2 mds.0.272 boot_start 0: opening sessionmap 2014-09-18 12:38:55.586003 7f8240512700 2 mds.0.272 boot_start 0: opening mds log 2014-09-18 12:38:55.586049 7f8240512700 5 mds.0.log open discovering log bounds 2014-09-18 12:38:55.586136 7f8240512700 2 mds.0.272 boot_start 0: opening snap table 2014-09-18 12:38:55.586984 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.213:6806/6114 2014-09-18 12:38:55.587037 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.213:6811/6385 2014-09-18 12:38:55.587285 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.213:6801/6110 2014-09-18 12:38:55.591700 7f823ca08700 4 mds.0.log Waiting for journal 200 to recover... 2014-09-18 12:38:55.593297 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.214:6806/6238 2014-09-18 12:38:55.600952 7f823ca08700 4 mds.0.log Journal 200 recovered. 
2014-09-18 12:38:55.600967 7f823ca08700 4 mds.0.log Recovered journal 200 in format 1 2014-09-18 12:38:55.600973 7f823ca08700 2 mds.0.272 boot_start 1: loading/discovering base inodes 2014-09-18 12:38:55.600979 7f823ca08700 0 mds.0.cache creating system inode with ino:100 2014-09-18 12:38:55.601279 7f823ca08700 0 mds.0.cache creating system inode with ino:1 2014-09-18 12:38:55.602557 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.214:6811/6276 2014-09-18 12:38:55.607234 7f8240512700 2 mds.0.272 boot_start 2: replaying mds log 2014-09-18 12:38:55.675025 7f823ca08700 7 mds.0.cache adjust_subtree_auth -1,-2 - -2,-2 on [dir 1 / [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 0x5da] 2014-09-18 12:38:55.675055 7f823ca08700 7 mds.0.cache current root is [dir 1 / [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 | subtree=1 0x5da] 2014-09-18 12:38:55.675065 7f823ca08700 7 mds.0.cache adjust_subtree_auth -1,-2 - -2,-2 on [dir 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 0x5da03b8] 2014-09-18 12:38:55.675076 7f823ca08700 7 mds.0.cache current root is [dir 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 | subtree=1 0x5da03b8] 2014-09-18 12:38:55.675087 7f823ca08700 7 mds.0.cache adjust_bounded_subtree_auth -2,-2 - 0,-2 on [dir 1 / [2,head] auth v=1076158 cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09 17:49:20.00 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135 31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226 b1824464654503 31746894=31708437+38457) hs=0+0,ss=0+0 | subtree=1 0x5da] bound_dfs [] 2014-09-18 12:38:55.675116 7f823ca08700 7 mds.0.cache adjust_bounded_subtree_auth -2,-2 - 0,-2 on [dir 1 / [2,head] auth v=1076158 cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09 17:49:20.00 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135 31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226 b1824464654503 31746894=31708437+38457) hs=0+0,ss=0+0 | subtree=1 0x5da] bounds 2014-09-18 12:38:55.675129 7f823ca08700 7 mds.0.cache
Re: [ceph-users] Still seeing scrub errors in .80.5
On Thu, Sep 18, 2014 at 3:09 AM, Marc m...@shoowin.de wrote: Hi, we did run a deep scrub on everything yesterday, and a repair afterwards. Then a new deep scrub today, which brought new scrub errors. I did check the osd config, they report filestore_xfs_extsize: false, as it should be if I understood things correctly. FTR the deep scrub has been initiated like this: for pgnum in `ceph pg dump|grep active|awk '{print $1}'`; do ceph pg deep-scrub $pgnum; done How do we proceed from here? Did the deep scrubs all actually complete yesterday, so these are new errors and not just scrubs which weren't finished until now? If so, I'd start looking at the scrub errors and which OSDs are involved. Hopefully they'll have one or a few OSDs in common that you can examine more closely. But like I said before, my money's on faulty hardware or local filesystems. Depending on how you're set up it's probably a good idea to just start checking dmesg for any indications of trouble before you start tackling it from the RADOS side. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
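One way to follow that advice is to pull the inconsistent PGs together with their acting sets and see whether a particular OSD keeps turning up. A rough sketch (the health detail output format may differ slightly between releases):

# list every inconsistent PG with its acting OSD set
ceph health detail | grep inconsistent
# count how often each OSD appears in those acting sets; an OSD common to
# most of the bad PGs is the prime hardware/filesystem suspect
ceph health detail | grep 'active+clean+inconsistent' \
  | sed -e 's/.*acting \[//' -e 's/\]//' | tr ',' '\n' | sort -n | uniq -c | sort -rn
# then, on the suspect OSD's host, look for disk or controller trouble
dmesg | grep -iE 'error|fail|ata|xfs'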
Re: [ceph-users] CephFS : rm file does not remove object in rados
On Thu, Sep 18, 2014 at 10:39 AM, Florent B flor...@coppint.com wrote: On 09/12/2014 07:38 PM, Gregory Farnum wrote: On Fri, Sep 12, 2014 at 6:49 AM, Florent Bautista flor...@coppint.com wrote: Hi all, Today I have a problem using CephFS. I use firefly last release, with kernel 3.16 client (Debian experimental). I have a directory in CephFS, associated to a pool pool2 (with set_layout). All is working fine, I can add and remove files, objects are stored in the right pool. But when Ceph cluster is overloaded (or for another reason, I don't know), sometimes when I remove a file, objects are not deleted in rados ! CephFS file removal is asynchronous with you removing it from the filesystem. The files get moved into a stray directory and will get deleted once nobody holds references to them any more. My client is the only mounted and does not use files. does not use files...what? This problems occurs when I delete files with rm, but not when I use given rsync command. I explain : I want to remove a large directory, containing millions of files. For a moment, objects are really deleted in rados (I see it in rados df), but when I start to do some heavy operations (like moving volumes in rdb), objects are not deleted anymore, rados df returns a fixed number of objects. I can see that files are still deleting because I use rsync (rsync -avP --stats --delete /empty/dir/ /dir/to/delete/). What do you mean you're rsyncing and can see files deleting? I don't understand. When you run command I gave, syncing an empty dir with the dir you want deleted, rsync is telling you Deleting (file) for each file to unlink. Anyway, It's *possible* that the client is holding capabilities on the deleted files and isn't handing them back, in which case unmounting it would drop them (and then you could remount). I don't think we have any commands designed to hasten that, though. unmounting does not help. When I unlink() via rsync, objects are deleted in rados (it makes all cluster slow down, and have slow requests). When I use rm command, it is much faster but objects are not deleted in rados ! I think you're not doing what you think you're doing, then...those two actions should look the same to CephFS. When I re-mount root CephFS, there are no files, all empty. But still have 125 MB of objects in metadata pool and 21.57 GB in my data pool (and it does not decrease...)... Well, the metadata pool is never going to be emptied; that holds your MDS journals. The data pool might not get entirely empty either; how many objects does it say it has? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
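To see whether stray files are actually being purged, watching the object counts directly in RADOS is the simplest check. A sketch, assuming the default pool names (data and metadata):

# per-pool object counts and usage; re-run to see whether the data pool shrinks
rados df
ceph df detail
# count the remaining objects in the data pool directly
rados -p data ls | wc -l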
Re: [ceph-users] ceph mds unable to start with 0.85
On Thu, Sep 18, 2014 at 5:35 PM, 廖建锋 de...@f-club.cn wrote: if i turn on debug=20, the log will be more than 100G, looks no way to put, do you have any other good way to figure it out? It should compress well and you can use ceph-post-file if you don't have a place to host it yourself. -Greg would you like to log into the server to check? From: Gregory Farnum Date: 2014-09-19 02:33 To: 廖建锋 CC: ceph-users Subject: Re: [ceph-users] ceph mds unable to start with 0.85 On Wed, Sep 17, 2014 at 9:59 PM, 廖建锋 de...@f-club.cn wrote: dear, my ceph cluster worked for about two weeks, mds crashed every 2-3 days, Now it stuck on replay , looks like replay crash and restart mds process again what can i do for this? 1015 = # ceph -s cluster 07df7765-c2e7-44de-9bb3-0b13f6517b18 health HEALTH_ERR 56 pgs inconsistent; 56 scrub errors; mds cluster is degraded; noscrub,nodeep-scrub flag(s) set monmap e1: 2 mons at {storage-1-213=10.1.0.213:6789/0,storage-1-214=10.1.0.214:6789/0}, election epoch 26, quorum 0,1 storage-1-213,storage-1-214 mdsmap e624: 1/1/1 up {0=storage-1-214=up:replay}, 1 up:standby osdmap e1932: 18 osds: 18 up, 18 in flags noscrub,nodeep-scrub pgmap v732381: 500 pgs, 3 pools, 2155 GB data, 39187 kobjects 4479 GB used, 32292 GB / 36772 GB avail 444 active+clean 56 active+clean+inconsistent client io 125 MB/s rd, 31 op/s MDS log here: 014-09-18 12:36:10.684841 7f8240512700 5 mds.-1.-1 handle_mds_map epoch 620 from mon.0 2014-09-18 12:36:10.684888 7f8240512700 1 mds.-1.0 handle_mds_map standby 2014-09-18 12:38:55.584370 7f8240512700 5 mds.-1.0 handle_mds_map epoch 621 from mon.0 2014-09-18 12:38:55.584432 7f8240512700 1 mds.0.272 handle_mds_map i am now mds.0.272 2014-09-18 12:38:55.584436 7f8240512700 1 mds.0.272 handle_mds_map state change up:standby -- up:replay 2014-09-18 12:38:55.584440 7f8240512700 1 mds.0.272 replay_start 2014-09-18 12:38:55.584456 7f8240512700 7 mds.0.cache set_recovery_set 2014-09-18 12:38:55.584460 7f8240512700 1 mds.0.272 recovery set is 2014-09-18 12:38:55.584464 7f8240512700 1 mds.0.272 need osdmap epoch 1929, have 1927 2014-09-18 12:38:55.584467 7f8240512700 1 mds.0.272 waiting for osdmap 1929 (which blacklists prior instance) 2014-09-18 12:38:55.584523 7f8240512700 5 mds.0.272 handle_mds_failure for myself; not doing anything 2014-09-18 12:38:55.585662 7f8240512700 2 mds.0.272 boot_start 0: opening inotable 2014-09-18 12:38:55.585864 7f8240512700 2 mds.0.272 boot_start 0: opening sessionmap 2014-09-18 12:38:55.586003 7f8240512700 2 mds.0.272 boot_start 0: opening mds log 2014-09-18 12:38:55.586049 7f8240512700 5 mds.0.log open discovering log bounds 2014-09-18 12:38:55.586136 7f8240512700 2 mds.0.272 boot_start 0: opening snap table 2014-09-18 12:38:55.586984 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.213:6806/6114 2014-09-18 12:38:55.587037 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.213:6811/6385 2014-09-18 12:38:55.587285 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.213:6801/6110 2014-09-18 12:38:55.591700 7f823ca08700 4 mds.0.log Waiting for journal 200 to recover... 2014-09-18 12:38:55.593297 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.214:6806/6238 2014-09-18 12:38:55.600952 7f823ca08700 4 mds.0.log Journal 200 recovered. 
2014-09-18 12:38:55.600967 7f823ca08700 4 mds.0.log Recovered journal 200 in format 1 2014-09-18 12:38:55.600973 7f823ca08700 2 mds.0.272 boot_start 1: loading/discovering base inodes 2014-09-18 12:38:55.600979 7f823ca08700 0 mds.0.cache creating system inode with ino:100 2014-09-18 12:38:55.601279 7f823ca08700 0 mds.0.cache creating system inode with ino:1 2014-09-18 12:38:55.602557 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.214:6811/6276 2014-09-18 12:38:55.607234 7f8240512700 2 mds.0.272 boot_start 2: replaying mds log 2014-09-18 12:38:55.675025 7f823ca08700 7 mds.0.cache adjust_subtree_auth -1,-2 - -2,-2 on [dir 1 / [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 0x5da] 2014-09-18 12:38:55.675055 7f823ca08700 7 mds.0.cache current root is [dir 1 / [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 | subtree=1 0x5da] 2014-09-18 12:38:55.675065 7f823ca08700 7 mds.0.cache adjust_subtree_auth -1,-2 - -2,-2 on [dir 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 0x5da03b8] 2014-09-18 12:38:55.675076 7f823ca08700 7 mds.0.cache current root is [dir 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 | subtree=1 0x5da03b8] 2014-09-18 12:38:55.675087 7f823ca08700 7 mds.0.cache adjust_bounded_subtree_auth -2,-2 - 0,-2 on [dir 1 / [2,head] auth v=1076158 cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09 17:49:20.00 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135 31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226 b1824464654503 31746894=31708437+38457) hs=0+0,ss=0+0
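For reference, capturing and uploading such a log could look roughly like this (the mds name and log path are assumptions; the debug settings go into the [mds] section of ceph.conf before restarting the daemon):

# after reproducing the problem with "debug mds = 20" and "debug ms = 1" set,
# compress the log (it compresses very well) and upload it for the developers
gzip -c /var/log/ceph/ceph-mds.storage-1-214.log > /tmp/mds-replay.log.gz
ceph-post-file -d "mds stuck in up:replay on 0.85" /tmp/mds-replay.log.gz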
Re: [ceph-users] Renaming pools used by CephFS
On Fri, Sep 19, 2014 at 10:21 AM, Jeffrey Ollie j...@ocjtech.us wrote: I've got a Ceph system (running 0.80.5) at home that I've been messing around with, partly to learn Ceph, but also as reliable storage for all of my media. During the process I deleted the data and metadata pools used by CephFS and recreated them. However, when I recreated the filesystem, the pool called data got assigned as a metadata pool and the pool called metadata got assigned as a data pool. Is there a safe way to rename the pools? It's purely an aesthetic thing (I think), so if it's difficult/dangerous to do I'll leave it be. You can rename pools with ceph osd pool rename current_name new_name. Generally it's not a good idea to mess around with the CephFS pools, though — in Giant you'll be prevented from deleting them. ;) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
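Since CephFS refers to its pools by ID rather than by name, the rename should indeed be purely cosmetic. A sketch of swapping the two misnamed pools through a temporary name:

# swap the names; the filesystem keeps using the same pool IDs throughout
ceph osd pool rename data data_tmp
ceph osd pool rename metadata data
ceph osd pool rename data_tmp metadata
# confirm which pool IDs the filesystem actually uses
ceph mds dump | grep -E 'data_pools|metadata_pool'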
Re: [ceph-users] Reassigning admin server
On Mon, Sep 22, 2014 at 1:22 PM, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: If I have a machine/VM I am using as an Admin node for a ceph cluster, can I relocate that admin to another machine/VM after I’ve built a cluster? I would expect as the Admin isn’t an actual operating part of the cluster itself (other than Calamari, if it happens to be running) the rest of the cluster should be adequately served with a –update-conf. The admin node really just has the default ceph.conf and the keyrings for admin access to your cluster. You just need to copy that data to whatever other node(s) you want; there's no updating to do for the rest of the cluster. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
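In practice that copy is just a couple of files. A minimal sketch, assuming the usual /etc/ceph locations and a new host called new-admin:

# copy the cluster config and the admin keyring to the new admin node
scp /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring new-admin:/etc/ceph/
# with ceph-deploy the same push is: ceph-deploy admin new-admin
# then verify admin access from the new node
ssh new-admin ceph -s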
Re: [ceph-users] pgs stuck in active+clean+replay state
I imagine you aren't actually using the data/metadata pool that these PGs are in, but it's a previously-reported bug we haven't identified: http://tracker.ceph.com/issues/8758 They should go away if you restart the OSDs that host them (or just remove those pools), but it's not going to hurt anything as long as you aren't using them. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Thu, Sep 25, 2014 at 3:37 AM, Pavel V. Kaygorodov pa...@inasan.ru wrote: Hi! 16 pgs in our ceph cluster are in active+clean+replay state more then one day. All clients are working fine. Is this ok? root@bastet-mon1:/# ceph -w cluster fffeafa2-a664-48a7-979a-517e3ffa0da1 health HEALTH_OK monmap e3: 3 mons at {1=10.92.8.80:6789/0,2=10.92.8.81:6789/0,3=10.92.8.82:6789/0}, election epoch 2570, quorum 0,1,2 1,2,3 osdmap e3108: 16 osds: 16 up, 16 in pgmap v1419232: 8704 pgs, 6 pools, 513 GB data, 125 kobjects 2066 GB used, 10879 GB / 12945 GB avail 8688 active+clean 16 active+clean+replay client io 3237 kB/s wr, 68 op/s root@bastet-mon1:/# ceph pg dump | grep replay dumped all in format plain 0.fd0 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:29.902766 0'0 3108:2628 [0,7,14,8] [0,7,14,8] 0 0'0 2014-09-23 02:23:49.463704 0'0 2014-09-23 02:23:49.463704 0.e80 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:21.945082 0'0 3108:1823 [2,7,9,10] [2,7,9,10] 2 0'0 2014-09-22 14:37:32.910787 0'0 2014-09-22 14:37:32.910787 0.aa0 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:29.326607 0'0 3108:2451 [0,7,15,12][0,7,15,12] 0 0'0 2014-09-23 00:39:10.717363 0'0 2014-09-23 00:39:10.717363 0.9c0 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:29.325229 0'0 3108:1917 [0,7,9,12] [0,7,9,12] 0 0'0 2014-09-22 14:40:06.694479 0'0 2014-09-22 14:40:06.694479 0.9a0 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:29.325074 0'0 3108:2486 [0,7,14,11][0,7,14,11] 0 0'0 2014-09-23 01:14:55.825900 0'0 2014-09-23 01:14:55.825900 0.910 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:28.839148 0'0 3108:1962 [0,7,9,10] [0,7,9,10] 0 0'0 2014-09-22 14:37:44.652796 0'0 2014-09-22 14:37:44.652796 0.8c0 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:28.838683 0'0 3108:2635 [0,2,9,11] [0,2,9,11] 0 0'0 2014-09-23 01:52:52.390529 0'0 2014-09-23 01:52:52.390529 0.8b0 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:21.215964 0'0 3108:1636 [2,0,8,14] [2,0,8,14] 2 0'0 2014-09-23 01:31:38.134466 0'0 2014-09-23 01:31:38.134466 0.500 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:35.869160 0'0 3108:1801 [7,2,15,10][7,2,15,10] 7 0'0 2014-09-20 08:38:53.963779 0'0 2014-09-13 10:27:26.977929 0.440 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:35.871409 0'0 3108:1819 [7,2,15,10][7,2,15,10] 7 0'0 2014-09-20 11:59:05.208164 0'0 2014-09-20 11:59:05.208164 0.390 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:28.653190 0'0 3108:1827 [0,2,9,10] [0,2,9,10] 0 0'0 2014-09-22 14:40:50.697850 0'0 2014-09-22 14:40:50.697850 0.320 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:10.970515 0'0 3108:1719 [2,0,14,9] [2,0,14,9] 2 0'0 2014-09-20 12:06:23.716480 0'0 2014-09-20 12:06:23.716480 0.2c0 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:28.647268 0'0 3108:2540 [0,7,12,8] [0,7,12,8] 0 0'0 2014-09-22 23:44:53.387815 0'0 2014-09-22 23:44:53.387815 0.1f0 0 0 0 0 0 0 active+clean+replay 2014-09-24 02:38:28.651059 0'0 3108:2522 [0,2,14,11][0,2,14,11] 0 0'0 2014-09-22 23:38:16.315755 0'0 2014-09-22 23:38:16.315755 0.7 0 0 0 0 0 0 0
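A sketch of that workaround, using one of the PG IDs from the dump above and assuming those pools really are unused:

# find which OSDs host the stuck PG; the first OSD listed is the primary
ceph pg map 0.fd
# restart that primary OSD on its host, repeating for the other replay PGs
sudo /etc/init.d/ceph restart osd.0
# or, if the data/metadata pools are genuinely unused, drop them instead
ceph osd pool delete data data --yes-i-really-really-mean-it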
Re: [ceph-users] PG stuck creating
Yeah, the last acting set there is probably from prior to your lost data and forced pg creation, so it might not have any bearing on what's happening now. Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Sep 30, 2014 at 10:07 AM, Robert LeBlanc rob...@leblancnet.us wrote: I rebuilt the primary OSD (29) in the hopes it would unblock whatever it was, but no luck. I'll check the admin socket and see if there is anything I can find there. On Tue, Sep 30, 2014 at 10:36 AM, Gregory Farnum g...@inktank.com wrote: On Tuesday, September 30, 2014, Robert LeBlanc rob...@leblancnet.us wrote: On our dev cluster, I've got a PG that won't create. We had a host fail with 10 OSDs that needed to be rebuilt. A number of other OSDs were down for a few days (did I mention this was a dev cluster?). The other OSDs eventually came up once the OSD maps caught up on them. I rebuilt the OSDs on all the hosts because we were running into XFS lockups with bcache. There were a number of PGs that could not be found when all the hosts were rebuilt. I tried restarting all the OSDs, the MONs, and deep scrubbing the OSDs they were on as well as the PGs. I performed a repair on the OSDs as well without any luck. One of pools had a recommendation to increase the PGs, so I increased it thinking it might be able to help. Nothing was helping and I could not find any reference to them so I force created them. That cleared up all but one that is creating due to the new PG number. Now, there is nothing I can do to unstick this one PG, I can't force create it, I can't increase the pgp_num, nada. At one point when recreating the OSDs, some of the number got out of order and to calm my OCD, I fixed it requiring me to manually modify the CRUSH map as the OSD appeared in both hosts, this was before I increased the PGs. There is nothing critical on this cluster, but I'm using this as an opportunity to understand Ceph in case we run into something similar in our future production environment. HEALTH_WARN 1 pgs stuck inactive; 1 pgs stuck unclean; pool libvirt-pool pg_num 256 pgp_num 128 pg 4.bf is stuck inactive since forever, current state creating, last acting [29,15,32] pg 4.bf is stuck unclean since forever, current state creating, last acting [29,15,32] pool libvirt-pool pg_num 256 pgp_num 128 [root@nodea ~]# ceph-osd --version ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187) More output http://pastebin.com/ajgpU7Zx Thanks You should find out which OSD the PG maps to, and see if ceph pg query or the osd admin socket will expose anything useful about its state. -Greg -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
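Concretely, for the PG above that would look something like this (osd.29 is the current primary according to the acting set):

# confirm where pg 4.bf maps and which OSD is primary
ceph pg map 4.bf
# ask for the PG's full peering/recovery state
ceph pg 4.bf query
# on the primary's host, check the admin socket for anything stuck
ceph daemon osd.29 dump_ops_in_flight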
Re: [ceph-users] PG stuck creating
On Tuesday, September 30, 2014, Robert LeBlanc rob...@leblancnet.us wrote: On our dev cluster, I've got a PG that won't create. We had a host fail with 10 OSDs that needed to be rebuilt. A number of other OSDs were down for a few days (did I mention this was a dev cluster?). The other OSDs eventually came up once the OSD maps caught up on them. I rebuilt the OSDs on all the hosts because we were running into XFS lockups with bcache. There were a number of PGs that could not be found when all the hosts were rebuilt. I tried restarting all the OSDs, the MONs, and deep scrubbing the OSDs they were on as well as the PGs. I performed a repair on the OSDs as well without any luck. One of pools had a recommendation to increase the PGs, so I increased it thinking it might be able to help. Nothing was helping and I could not find any reference to them so I force created them. That cleared up all but one that is creating due to the new PG number. Now, there is nothing I can do to unstick this one PG, I can't force create it, I can't increase the pgp_num, nada. At one point when recreating the OSDs, some of the number got out of order and to calm my OCD, I fixed it requiring me to manually modify the CRUSH map as the OSD appeared in both hosts, this was before I increased the PGs. There is nothing critical on this cluster, but I'm using this as an opportunity to understand Ceph in case we run into something similar in our future production environment. HEALTH_WARN 1 pgs stuck inactive; 1 pgs stuck unclean; pool libvirt-pool pg_num 256 pgp_num 128 pg 4.bf is stuck inactive since forever, current state creating, last acting [29,15,32] pg 4.bf is stuck unclean since forever, current state creating, last acting [29,15,32] pool libvirt-pool pg_num 256 pgp_num 128 [root@nodea ~]# ceph-osd --version ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187) More output http://pastebin.com/ajgpU7Zx Thanks You should find out which OSD the PG maps to, and see if ceph pg query or the osd admin socket will expose anything useful about its state. -Greg -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Why performance of benchmarks with small blocks is extremely small?
On Wed, Oct 1, 2014 at 5:24 AM, Andrei Mikhailovsky and...@arhont.com wrote: Timur, As far as I know, the latest master has a number of improvements for ssd disks. If you check the mailing list discussion from a couple of weeks back, you can see that the latest stable firefly is not that well optimised for ssd drives and IO is limited. However changes are being made to address that. I am well surprised that you can get 10K IOps as in my tests I was not getting over 3K IOPs on the ssd disks which are capable of doing 90K IOps. P.S. does anyone know if the ssd optimisation code will be added to the next maintenance release of firefly? Not a chance. The changes enabling that improved throughput are very invasive and sprinkled all over the OSD; they aren't the sort of thing that one does backport or that one could put on top of a stable release for any meaningful definition of stable. :) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
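For numbers like these it helps to have a client-independent baseline; a small-block rados bench run against a scratch pool is one rough way to get it (pool name, PG count, and runtimes are just examples):

# 30-second 4 KiB write test with 16 concurrent ops, then a random-read pass
ceph osd pool create benchpool 128 128
rados bench -p benchpool 30 write -b 4096 -t 16 --no-cleanup
rados bench -p benchpool 30 rand -t 16
# clean up afterwards
ceph osd pool delete benchpool benchpool --yes-i-really-really-mean-it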
Re: [ceph-users] Why performance of benchmarks with small blocks is extremely small?
All the stuff I'm aware of is part of the testing we're doing for Giant. There is probably ongoing work in the pipeline, but the fast dispatch, sharded work queues, and sharded internal locking structures that Somnath has discussed all made it. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wed, Oct 1, 2014 at 7:07 AM, Andrei Mikhailovsky and...@arhont.com wrote: Greg, are they going to be a part of the next stable release? Cheers From: Gregory Farnum g...@inktank.com To: Andrei Mikhailovsky and...@arhont.com Cc: Timur Nurlygayanov tnurlygaya...@mirantis.com, ceph-users ceph-us...@ceph.com Sent: Wednesday, 1 October, 2014 3:04:51 PM Subject: Re: [ceph-users] Why performance of benchmarks with small blocks is extremely small? On Wed, Oct 1, 2014 at 5:24 AM, Andrei Mikhailovsky and...@arhont.com wrote: Timur, As far as I know, the latest master has a number of improvements for ssd disks. If you check the mailing list discussion from a couple of weeks back, you can see that the latest stable firefly is not that well optimised for ssd drives and IO is limited. However changes are being made to address that. I am well surprised that you can get 10K IOps as in my tests I was not getting over 3K IOPs on the ssd disks which are capable of doing 90K IOps. P.S. does anyone know if the ssd optimisation code will be added to the next maintenance release of firefly? Not a chance. The changes enabling that improved throughput are very invasive and sprinkled all over the OSD; they aren't the sort of thing that one does backport or that one could put on top of a stable release for any meaningful definition of stable. :) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Why performance of benchmarks with small blocks is extremely small?
On Wed, Oct 1, 2014 at 9:21 AM, Mark Nelson mark.nel...@inktank.com wrote: On 10/01/2014 11:18 AM, Gregory Farnum wrote: All the stuff I'm aware of is part of the testing we're doing for Giant. There is probably ongoing work in the pipeline, but the fast dispatch, sharded work queues, and sharded internal locking structures that Somnath has discussed all made it. I seem to recall there was a deadlock issue or something with fast dispatch. Were we able to get that solved for Giant? Fast dispatch is not enabled in librados, but I don't think most users should be able to tell on that end. If they can, it'll be switched on at some point in the Hammer development process. If it's small enough we may backport eventually (we know how to go about it, but the change will require more testing than we were comfortable with assigning at this stage in an LTS). -Greg Mark -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wed, Oct 1, 2014 at 7:07 AM, Andrei Mikhailovsky and...@arhont.com wrote: Greg, are they going to be a part of the next stable release? Cheers From: Gregory Farnum g...@inktank.com To: Andrei Mikhailovsky and...@arhont.com Cc: Timur Nurlygayanov tnurlygaya...@mirantis.com, ceph-users ceph-us...@ceph.com Sent: Wednesday, 1 October, 2014 3:04:51 PM Subject: Re: [ceph-users] Why performance of benchmarks with small blocks is extremely small? On Wed, Oct 1, 2014 at 5:24 AM, Andrei Mikhailovsky and...@arhont.com wrote: Timur, As far as I know, the latest master has a number of improvements for ssd disks. If you check the mailing list discussion from a couple of weeks back, you can see that the latest stable firefly is not that well optimised for ssd drives and IO is limited. However changes are being made to address that. I am well surprised that you can get 10K IOps as in my tests I was not getting over 3K IOPs on the ssd disks which are capable of doing 90K IOps. P.S. does anyone know if the ssd optimisation code will be added to the next maintenance release of firefly? Not a chance. The changes enabling that improved throughput are very invasive and sprinkled all over the OSD; they aren't the sort of thing that one does backport or that one could put on top of a stable release for any meaningful definition of stable. :) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] mds isn't working anymore after osd's running full
Sorry; I guess this fell off my radar. The issue here is not that it's waiting for an osdmap; it got the requested map and went into replay mode almost immediately. In fact the log looks good except that it seems to finish replaying the log and then simply fail to transition into active. Generate a new one, adding in debug journaled = 20 and debug filer = 20, and we can probably figure out how to fix it. (This diagnosis is much easier in the upcoming Giant!) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Oct 7, 2014 at 7:55 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hello Gregory, We still have the same problems with our test ceph cluster and didn't receive a reply from you after I send you the requested log files. Do you know if it's possible to get our cephfs filesystem working again or is it better to give up the files on cephfs and start over again? We restarted the cluster serveral times but it's still degraded: [root@th1-mon001 ~]# ceph -w cluster c78209f5-55ea-4c70-8968-2231d2b05560 health HEALTH_WARN mds cluster is degraded monmap e3: 3 mons at {th1-mon001=10.1.2.21:6789/0,th1-mon002=10.1.2.22:6789/0,th1-mon003=10.1.2.23:6789/0}, election epoch 432, quorum 0,1,2 th1-mon001,th1-mon002,th1-mon003 mdsmap e190: 1/1/1 up {0=th1-mon001=up:replay}, 1 up:standby osdmap e2248: 12 osds: 12 up, 12 in pgmap v197548: 492 pgs, 4 pools, 60297 MB data, 470 kobjects 124 GB used, 175 GB / 299 GB avail 491 active+clean 1 active+clean+scrubbing+deep One placement group stays in the deep scrubbing fase. Kind regards, Jasper Siero Van: Jasper Siero Verzonden: donderdag 21 augustus 2014 16:43 Aan: Gregory Farnum Onderwerp: RE: [ceph-users] mds isn't working anymore after osd's running full I did restart it but you are right about the epoch number which has changed but the situation looks the same. 2014-08-21 16:33:06.032366 7f9b5f3cd700 1 mds.0.27 need osdmap epoch 1994, have 1993 2014-08-21 16:33:06.032368 7f9b5f3cd700 1 mds.0.27 waiting for osdmap 1994 (which blacklists prior instance) I started the mds with the debug options and attached the log. Thanks, Jasper Van: Gregory Farnum [g...@inktank.com] Verzonden: woensdag 20 augustus 2014 18:38 Aan: Jasper Siero CC: ceph-users@lists.ceph.com Onderwerp: Re: [ceph-users] mds isn't working anymore after osd's running full After restarting your MDS, it still says it has epoch 1832 and needs epoch 1833? I think you didn't really restart it. If the epoch numbers have changed, can you restart it with debug mds = 20, debug objecter = 20, debug ms = 1 in the ceph.conf and post the resulting log file somewhere? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wed, Aug 20, 2014 at 12:49 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Unfortunately that doesn't help. I restarted both the active and standby mds but that doesn't change the state of the mds. Is there a way to force the mds to look at the 1832 epoch (or earlier) instead of 1833 (need osdmap epoch 1833, have 1832)? Thanks, Jasper Van: Gregory Farnum [g...@inktank.com] Verzonden: dinsdag 19 augustus 2014 19:49 Aan: Jasper Siero CC: ceph-users@lists.ceph.com Onderwerp: Re: [ceph-users] mds isn't working anymore after osd's running full On Mon, Aug 18, 2014 at 6:56 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hi all, We have a small ceph cluster running version 0.80.1 with cephfs on five nodes. Last week some osd's were full and shut itself down. 
To help de osd's start again I added some extra osd's and moved some placement group directories on the full osd's (which has a copy on another osd) to another place on the node (as mentioned in http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/) After clearing some space on the full osd's I started them again. After a lot of deep scrubbing and two pg inconsistencies which needed to be repaired everything looked fine except the mds which still is in the replay state and it stays that way. The log below says that mds need osdmap epoch 1833 and have 1832. 2014-08-18 12:29:22.268248 7fa786182700 1 mds.-1.0 handle_mds_map standby 2014-08-18 12:29:22.273995 7fa786182700 1 mds.0.25 handle_mds_map i am now mds.0.25 2014-08-18 12:29:22.273998 7fa786182700 1 mds.0.25 handle_mds_map state change up:standby -- up:replay 2014-08-18 12:29:22.274000 7fa786182700 1 mds.0.25 replay_start 2014-08-18 12:29:22.274014 7fa786182700 1 mds.0.25 recovery set is 2014-08-18 12:29:22.274016 7fa786182700 1 mds.0.25 need osdmap epoch 1833, have 1832 2014-08-18 12:29:22.274017 7fa786182700 1 mds.0.25 waiting for osdmap 1833 (which blacklists prior instance) # ceph status
Re: [ceph-users] Regarding Primary affinity configuration
On Thu, Oct 9, 2014 at 10:55 AM, Johnu George (johnugeo) johnu...@cisco.com wrote: Hi All, I have few questions regarding the Primary affinity. In the original blueprint (https://wiki.ceph.com/Planning/Blueprints/Firefly/osdmap%3A_primary_role_affinity ), one example has been given. For PG x, CRUSH returns [a, b, c] If a has primary_affinity of .5, b and c have 1 , with 50% probability, we will choose b or c instead of a. (25% for b, 25% for c) A) I was browsing through the code, but I could not find this logic of splitting the rest of configured primary affinity value between other osds. How is this handled? if (a CEPH_OSD_MAX_PRIMARY_AFFINITY (crush_hash32_2(CRUSH_HASH_RJENKINS1, seed, o) 16) = a) { // we chose not to use this primary. note it anyway as a // fallback in case we don't pick anyone else, but keep looking. if (pos 0) pos = i; } else { pos = i; break; } } It's a fallback mechanism — if the chosen primary for a PG has primary affinity less than the default (max), we (probabilistically) look for a different OSD to be the primary. We decide whether to offload by running a hash and discarding the OSD if the output value is greater than the OSDs affinity, and then we go through the list and run that calculation in order (obviously if the affinity is 1, then it passes without needing to run the hash). If no OSD in the list has a high enough hash value, we take the originally-chosen primary. B) Since, primary affinity value is configured independently, there can be a situation with [0.1,0.1,0.1] with total value that don’t add to 1. How is this taken care of? These primary affinity values are just compared against the hash output I mentioned, so the sum doesn't matter. In general we simply expect that OSDs which don't have the max weight value will be chosen as primary in proportion to their share of the total weight of their PG membership (ie, if they have a weight of .5 and everybody else has weight 1, they will be primary in half the normal number of PGs. If everybody has a weight of .5, they will be primary in the normal proportions. Etc). C) Slightly confused. What happens for a situation with [1,0.5,1] ? Is osd.0 always returned? If the first OSD in the PG list has primary affinity of 1 then it is always the primary for that OSD, yes. That's not osd.0, though; just the first OSD in the PG list. ;) D) After calculating primary based on the affinity values, I see a shift of osds so that primary comes to the front. Why is this needed?. I thought, primary affinity value affects only reads and hence, osd ordering need not be changed. Primary affinity impacts which OSD is chosen to be primary; the primary is the ordering point for *all* access to the PG. That includes writes as well as reads, plus coordination of the cluster on map changes. We move the primary to the front of the list...well, I think it's just because we were lazy and there are a bunch of places that assume the first OSD in a replicated pool is the primary. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
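For completeness, the weight being discussed is set per OSD from the CLI; a sketch (on Firefly-era clusters the monitors must first be told to accept non-default values):

# allow non-default primary affinity values, either via
# "mon osd allow primary affinity = true" in ceph.conf or at runtime:
ceph tell mon.\* injectargs '--mon_osd_allow_primary_affinity=true'
# give osd.3 half the normal chance of being chosen primary in its PGs
ceph osd primary-affinity osd.3 0.5
# check the stored value in the osdmap
ceph osd dump | grep '^osd.3 '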
Re: [ceph-users] Blueprints
On Thu, Oct 9, 2014 at 4:01 PM, Robert LeBlanc rob...@leblancnet.us wrote: I have a question regarding submitting blueprints. Should only people who intend to do the work of adding/changing features of Ceph submit blueprints? I'm not primarily a programmer (but can do programming if needed), but have a feature request for Ceph. Blueprints are documents *for* developers. If you as a user have enough information about the feature you want, and the things it needs to do in Ceph, to generate a reasonable description of the feature, its user interface, and a skeleton of how it could be implemented, we'd love a blueprint. Blueprints which are backed by developers are more likely to get time at CDS, I think (Patrick/Sage could confirm), but even just having them is helpful. If that sounds intimidating, we take less detailed feature requests in our Redmine at tracker.ceph.com too. ;) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] mds isn't working anymore after osd's running full
Ugh, debug journaler, not debug journaled. That said, the filer output tells me that you're missing an object out of the MDS log. (200.08f5) I think this issue should be resolved if you dump the journal to a file, reset it, and then undump it. (These are commands you can invoke from ceph-mds.) I haven't done this myself in a long time, so there may be some hard edges around it. In particular, I'm not sure if the dumped journal file will stop when the data stops, or if it will be a little too long. If so, we can fix that by truncating the dumped file to the proper length and resetting and undumping again. (And just to harp on it, this journal manipulation is a lot simpler in Giant... ;) ) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wed, Oct 8, 2014 at 7:11 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hello Greg, No problem thanks for looking into the log. I attached the log to this email. I'm looking forward for the new release because it would be nice to have more possibilities to diagnose problems. Kind regards, Jasper Siero Van: Gregory Farnum [g...@inktank.com] Verzonden: dinsdag 7 oktober 2014 19:45 Aan: Jasper Siero CC: ceph-users Onderwerp: Re: [ceph-users] mds isn't working anymore after osd's running full Sorry; I guess this fell off my radar. The issue here is not that it's waiting for an osdmap; it got the requested map and went into replay mode almost immediately. In fact the log looks good except that it seems to finish replaying the log and then simply fail to transition into active. Generate a new one, adding in debug journaled = 20 and debug filer = 20, and we can probably figure out how to fix it. (This diagnosis is much easier in the upcoming Giant!) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Oct 7, 2014 at 7:55 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hello Gregory, We still have the same problems with our test ceph cluster and didn't receive a reply from you after I send you the requested log files. Do you know if it's possible to get our cephfs filesystem working again or is it better to give up the files on cephfs and start over again? We restarted the cluster serveral times but it's still degraded: [root@th1-mon001 ~]# ceph -w cluster c78209f5-55ea-4c70-8968-2231d2b05560 health HEALTH_WARN mds cluster is degraded monmap e3: 3 mons at {th1-mon001=10.1.2.21:6789/0,th1-mon002=10.1.2.22:6789/0,th1-mon003=10.1.2.23:6789/0}, election epoch 432, quorum 0,1,2 th1-mon001,th1-mon002,th1-mon003 mdsmap e190: 1/1/1 up {0=th1-mon001=up:replay}, 1 up:standby osdmap e2248: 12 osds: 12 up, 12 in pgmap v197548: 492 pgs, 4 pools, 60297 MB data, 470 kobjects 124 GB used, 175 GB / 299 GB avail 491 active+clean 1 active+clean+scrubbing+deep One placement group stays in the deep scrubbing fase. Kind regards, Jasper Siero Van: Jasper Siero Verzonden: donderdag 21 augustus 2014 16:43 Aan: Gregory Farnum Onderwerp: RE: [ceph-users] mds isn't working anymore after osd's running full I did restart it but you are right about the epoch number which has changed but the situation looks the same. 2014-08-21 16:33:06.032366 7f9b5f3cd700 1 mds.0.27 need osdmap epoch 1994, have 1993 2014-08-21 16:33:06.032368 7f9b5f3cd700 1 mds.0.27 waiting for osdmap 1994 (which blacklists prior instance) I started the mds with the debug options and attached the log. 
Thanks, Jasper Van: Gregory Farnum [g...@inktank.com] Verzonden: woensdag 20 augustus 2014 18:38 Aan: Jasper Siero CC: ceph-users@lists.ceph.com Onderwerp: Re: [ceph-users] mds isn't working anymore after osd's running full After restarting your MDS, it still says it has epoch 1832 and needs epoch 1833? I think you didn't really restart it. If the epoch numbers have changed, can you restart it with debug mds = 20, debug objecter = 20, debug ms = 1 in the ceph.conf and post the resulting log file somewhere? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wed, Aug 20, 2014 at 12:49 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Unfortunately that doesn't help. I restarted both the active and standby mds but that doesn't change the state of the mds. Is there a way to force the mds to look at the 1832 epoch (or earlier) instead of 1833 (need osdmap epoch 1833, have 1832)? Thanks, Jasper Van: Gregory Farnum [g...@inktank.com] Verzonden: dinsdag 19 augustus 2014 19:49 Aan: Jasper Siero CC: ceph-users@lists.ceph.com Onderwerp: Re: [ceph-users] mds isn't working anymore after osd's running full On Mon, Aug 18, 2014 at 6:56 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hi all, We have
Re: [ceph-users] ceph tell osd.6 version : hang
On Sun, Oct 12, 2014 at 7:46 AM, Loic Dachary l...@dachary.org wrote: Hi, On a 0.80.6 cluster the command ceph tell osd.6 version hangs forever. I checked that it establishes a TCP connection to the OSD, raised the OSD debug level to 20 and I do not see https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L4991 in the logs. All other OSDs answer to the same version command as they should. And ceph daemon osd.6 version on the machine running OSD 6 responds as it should. There also are an ever growing number of slow requests on this OSD. But not error in the logs. In other words, except for taking forever to answer any kind of request the OSD looks fine. Another OSD running on the same machine is behaving well. Any idea what that behaviour relates to ? What commands have you run? The admin socket commands don't require nearly as many locks, nor do they go through the same event loops that messages do. You might have found a deadlock or something. (In which case just restarting the OSD would probably fix it, but you should grab a core dump first.) -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph tell osd.6 version : hang
On Sun, Oct 12, 2014 at 9:10 AM, Loic Dachary l...@dachary.org wrote: On 12/10/2014 17:48, Gregory Farnum wrote: On Sun, Oct 12, 2014 at 7:46 AM, Loic Dachary l...@dachary.org wrote: Hi, On a 0.80.6 cluster the command ceph tell osd.6 version hangs forever. I checked that it establishes a TCP connection to the OSD, raised the OSD debug level to 20 and I do not see https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L4991 in the logs. All other OSDs answer to the same version command as they should. And ceph daemon osd.6 version on the machine running OSD 6 responds as it should. There also are an ever growing number of slow requests on this OSD. But not error in the logs. In other words, except for taking forever to answer any kind of request the OSD looks fine. Another OSD running on the same machine is behaving well. Any idea what that behaviour relates to ? What commands have you run? The admin socket commands don't require nearly as many locks, nor do they go through the same event loops that messages do. You might have found a deadlock or something. (In which case just restarting the OSD would probably fix it, but you should grab a core dump first.) # /etc/init.d/ceph stop osd.6 === osd.6 === Stopping Ceph osd.6 on g3...kill 23690...kill 23690...done root@g3:/var/lib/ceph/osd/ceph-6/current# /etc/init.d/ceph start osd.6 === osd.6 === Starting Ceph osd.6 on g3... starting osd.6 at :/0 osd_data /var/lib/ceph/osd/ceph-6 /var/lib/ceph/osd/ceph-6/journal root@g3:/var/lib/ceph/osd/ceph-6/current# ceph tell osd.6 version { version: ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)} root@g3:/var/lib/ceph/osd/ceph-6/current# ceph tell osd.6 version and now it blocks. It looks like a deadlock happens shortly after it boots. Is this the same cluster you're reporting on in the tracker? Anyway, apparently it's a disk state issue. I have no idea what kind of bug in Ceph could cause this, so my guess is that a syscall is going out to lunch — although that should get caught up in the internal heartbeat checkin code. Like I said, grab a core dump and look for deadlocks or blocked sys calls in the filestore. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph tell osd.6 version : hang
On Sun, Oct 12, 2014 at 9:29 AM, Loic Dachary l...@dachary.org wrote: On 12/10/2014 18:22, Gregory Farnum wrote: On Sun, Oct 12, 2014 at 9:10 AM, Loic Dachary l...@dachary.org wrote: On 12/10/2014 17:48, Gregory Farnum wrote: On Sun, Oct 12, 2014 at 7:46 AM, Loic Dachary l...@dachary.org wrote: Hi, On a 0.80.6 cluster the command ceph tell osd.6 version hangs forever. I checked that it establishes a TCP connection to the OSD, raised the OSD debug level to 20 and I do not see https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L4991 in the logs. All other OSDs answer to the same version command as they should. And ceph daemon osd.6 version on the machine running OSD 6 responds as it should. There also are an ever growing number of slow requests on this OSD. But not error in the logs. In other words, except for taking forever to answer any kind of request the OSD looks fine. Another OSD running on the same machine is behaving well. Any idea what that behaviour relates to ? What commands have you run? The admin socket commands don't require nearly as many locks, nor do they go through the same event loops that messages do. You might have found a deadlock or something. (In which case just restarting the OSD would probably fix it, but you should grab a core dump first.) # /etc/init.d/ceph stop osd.6 === osd.6 === Stopping Ceph osd.6 on g3...kill 23690...kill 23690...done root@g3:/var/lib/ceph/osd/ceph-6/current# /etc/init.d/ceph start osd.6 === osd.6 === Starting Ceph osd.6 on g3... starting osd.6 at :/0 osd_data /var/lib/ceph/osd/ceph-6 /var/lib/ceph/osd/ceph-6/journal root@g3:/var/lib/ceph/osd/ceph-6/current# ceph tell osd.6 version { version: ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)} root@g3:/var/lib/ceph/osd/ceph-6/current# ceph tell osd.6 version and now it blocks. It looks like a deadlock happens shortly after it boots. Is this the same cluster you're reporting on in the tracker? Yes, it is the same cluster as http://tracker.ceph.com/issues/9750 although I can't imagine how the two could be related, they probably are. Anyway, apparently it's a disk state issue. I have no idea what kind of bug in Ceph could cause this, so my guess is that a syscall is going out to lunch — although that should get caught up in the internal heartbeat checkin code. Like I said, grab a core dump and look for deadlocks or blocked sys calls in the filestore. I created http://tracker.ceph.com/issues/9751 and attached the log with debug_filestore = 20. There are many slow requests but I can't relate them to any kind of error. It does not core dump, should I kill it to get a coredump and then examine it ? I've never tried that ;-) That's what I was thinking; you send it a SIGQUIT signal and it'll dump. Or apparently you can use gcore instead, which won't quit it. The log doesn't have anything glaringly obvious; was it already hung when you packaged that? If so, it must be some kind of deadlock and the backtraces from the core dump will probably tell us what happened. One way or the other the problem will be fixed soon (tonight). I'd like to take advantage of the broken state we have to figure it out. Resurecting the OSD that may unblock http://tracker.ceph.com/issues/9751 and may also unblock http://tracker.ceph.com/issues/9750 and we'll lose a chance to diagnose this rare condition. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
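A sketch of capturing that state without restarting the daemon first (the pid-file path is an assumption; having the ceph debug symbol packages installed makes the backtraces far more useful):

# find osd.6's pid and take a core image while it keeps running
pid=$(cat /var/run/ceph/osd.6.pid)
gcore -o /tmp/ceph-osd.6.core "$pid"
# or just dump all thread backtraces, usually enough to spot a deadlock
gdb -p "$pid" --batch -ex 'thread apply all bt' > /tmp/ceph-osd.6.backtraces.txt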
Re: [ceph-users] Handling of network failures in the cluster network
On Mon, Oct 13, 2014 at 11:32 AM, Martin Mailand mar...@tuxadero.com wrote: Hi List, I have a ceph cluster setup with two networks, one for public traffic and one for cluster traffic. Network failures in the public network are handled quite well, but network failures in the cluster network are handled very badly. I found several discussions on the ml about this topic and they stated that the problem should be fixed, but I still have problems. I use ceph v0.86 with a standard crushmap, 4 osds per host and 6 hosts in the root default therefore I have 24 osds overall. Each storage node has 2 10Gbit nics one for public and one for cluster traffic, if I take down ONE of the links in the cluster network the cluster stops working. I tested it several times and I could observe following different behaviors. 1. Cluster stops forever. 2. After a timeout of around 120 seconds all other osds gets marked down. The osds on the storage node with the link failure stays up. Then all other osds boot and come back and the osds on the node with the failure are marked down and the cluster starts to work again. 3. After a timeout of around 120 seconds the osds on the node with the link failure gets marked down and the cluster starts to work again. Therefore a link failure in the cluster network has a very severe impact on the cluster availability. Is this a configuration mistake on my side? Any help would be greatly appreciated. How did you test taking down the connection? What config options have you specified on the OSDs and in the monitor? None of the scenarios you're describing make much sense on a semi-recent (post-dumpling-release) version of Ceph. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Misconfigured caps on client.admin key, any way to recover from EACCES denied?
On Mon, Oct 13, 2014 at 4:04 PM, Wido den Hollander w...@42on.com wrote: On 14-10-14 00:53, Anthony Alba wrote: Following the manual starter guide, I set up a Ceph cluster with HEALTH_OK, (1 mon, 2 osd). In testing out auth commands I misconfigured the client.admin key by accidentally deleting mon 'allow *'. Now I'm getting EACESS denied for all ceph actions. Is there a way to recover or recreate a new client.admin key. You can disable cephx completely, fix the key and enable cephx again. auth_cluster_required, auth_service_required and auth_client_required Set it to 'none' and restart the monitors and OSDs. You can also inject it through the admin socket if you want to. Mmm, I don't think that will work — Ceph still refers to the stored client capabilities; it just doesn't validate them. I *believe* if you grab the monitor key you can use that to make the necessary changes, though. Otherwise hacking at the monitor stores is an option. -Greg Key was: client.admin key: ABCDEFG... caps: [mon] allow * caps: [osd] allow * Misconfigured key: ABCDEFG... caps: [osd] allow * ...now all ceph commands fail, so I'm not sure how to start fixing the key on the mons/osds. - anthony ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph OSD very slow startup
On Monday, October 13, 2014, Lionel Bouton lionel+c...@bouton.name wrote: Hi, # First a short description of our Ceph setup You can skip to the next section (Main questions) to save time and come back to this one if you need more context. We are currently moving away from DRBD-based storage backed by RAID arrays to Ceph for some of our VMs. Our focus is on resiliency and capacity (one VM was outgrowing the largest RAID10 we had) and not maximum performance (at least not yet). Our Ceph OSDs are fairly unbalanced because 2 are on 2 historic hosts each with 4 disks in a hardware RAID10 configuration and no place available for new disks in the chassis. 12 additional OSD are on 2 new systems with 6 disk drives dedicated to one OSD each (CPU and RAM configurations are nearly identical on the 4 hosts). All hosts are used for running VMs too, we took some precautions to avoid too much interference: each host has CPU and RAM to spare for the OSD. CPU usage exhibits some bursts on occasions but as we only have one or two VM on each host, they can't starve the OSD which have between 2 and 8 full fledge cores (4 to 16 hardware threads) for them depending on the current load. We have at least 4GB of free RAM per OSD on each host at all times (including room for at least a 4GB OS cache). To sum up we have a total of 14 OSDs, the 2 largest ones on RAID10 are clearly our current bottleneck. That said until we have additional hardware they allow us to maintain availability even if 2 servers are down (default crushmap with pool configured with 3 replicas on 3 different hosts) and performance is acceptable (backfilling/scrubing/... pgs required some tuning though and I'm eagerly waiting for 0.80.7 to begin tests of the new io priority tunables). Everything is based on SATA/SAS 7200t/min disk drives behind P410 Raid controllers (HP Proliant systems) with battery backed memory to help with write bursts. The OSDs are a mix of: - Btrfs on 3.17.0 kernels on individual disks, 450GB use on 2TB (3.17.0 fixes a filesystem lockup we had with earlier kernels manifesting itself with concurrent accesses to several Btrfs filesystems according to recent lkml posts), - Btrfs on 3.12.21 kernels on the 2 systems with RAID10, 1.5TB used on 3TB (no lockup on these yet but they will migrate to 3.17.0 when we'll have enough experience with it). - XFS for a minority of individual disks (with a dedicated partition for the journal). Most of them have the same history (all being created at the same time), only two of them have been created later (following Btrfs corruption and/or conversion to XFS) and are avoided when comparing behaviours. All Btrfs volumes use these mount options: rw,noatime,nodiratime,compress=lzo,space_cache,autodefrag,recovery All OSDs use a 5GB journal. We slowly add monitoring to the setup to see what are the benefits of Btrfs in our case (ceph osd perf, kernel io wait per devices, osd CPU usage, ...). One long term objective is to slowly raise the performance both by migrating to/adding more suitable hardware and tuning the software side. Detailed monitoring is supposed to help us study the behaviour of isolated OSDs with different settings and being warned early if they generate performance problems to take them out with next to no impact on the whole storage network (we are strong believers in slow, incremental and continuous change and distributed storage with redundancy makes it easy to implement). 
# Main questions The system works well but I just realised when restarting one of the 2 large Btrfs OSD that it was very slow to rejoin the network (ceph osd set noout was used for the restart). I stopped the OSD init after 5 minutes to investigate what was going on and didn't find any obvious problem (filesystem sane, no swapping, CPU hogs, concurrent IO not able to starve the system by itself, ...). Next restarts took between 43s (nearly no concurrent disk access and warm caches after an earlier restart without umounting the filesystem) and 3mn57s (one VM still on DRBD doing ~30 IO/s on the same volume and cold caches after a filesystem mount). It seems that the startup time is getting longer on the 2 large Btrfs filesystems (the other one gives similar results: 3mn48s on the first try for example). I noticed that it was a bit slow a week ago but not as much (there was ~half as much data on them at the time). OSDs on individual disks don't exhibit this problem (with warm caches init finishes in ~4s on the small Btrfs volumes, ~3s on the XFS volumes) but they are on dedicated disks with less data. With warm caches most of the time is spent between: osd.n osdmap load_pgs osd.n osdmap load_pgs opened m pgs log lines in /var/log/ceph/ceph-osd.n.log (m is ~650 for both OSD). So it seems most of the time is spent opening pgs. What could explain such long startup times? Is the OSD init doing a lot of random disk accesses? Is it dependant on
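As a rough way to see where that startup time goes, the number of PGs the OSD has to open and the load_pgs timing in its own log can be checked directly; a sketch (OSD id and paths are assumptions):

# how many PG directories does this OSD hold? each is opened during load_pgs
ls /var/lib/ceph/osd/ceph-2/current | grep -c '_head$'
# how long the last load_pgs pass took, according to the OSD log
grep 'load_pgs' /var/log/ceph/ceph-osd.2.log | tail -4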
Re: [ceph-users] Misconfigured caps on client.admin key, any way to recover from EACCES denied?
On Monday, October 13, 2014, Anthony Alba ascanio.al...@gmail.com wrote: You can disable cephx completely, fix the key and enable cephx again. auth_cluster_required, auth_service_required and auth_client_required That did not work: i.e disabling cephx in the cluster conf and restarting the cluster. The cluster still complained about failed authentication. I *believe* if you grab the monitor key you can use that to make the necessary changes, though. Otherwise hacking at the monitor stores is an option. You mean use the mon. key but as the client.admin user? It's been a while since I've done this, but once upon a time you could use the mon key and the ID mon. and then send mon commands from the ceph cli. I'd try that. -Greg - anthony -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
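A sketch of that recovery path, assuming the default monitor data directory layout; the monitor's own keyring authenticates the command and the admin entity's caps are simply rewritten:

# on a monitor host (mon id assumed to be 'a'), check what client.admin currently has
ceph -n mon. -k /var/lib/ceph/mon/ceph-a/keyring auth get client.admin
# restore the missing mon capability
ceph -n mon. -k /var/lib/ceph/mon/ceph-a/keyring \
    auth caps client.admin mon 'allow *' osd 'allow *'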
Re: [ceph-users] Handling of network failures in the cluster network
On Mon, Oct 13, 2014 at 1:37 PM, Martin Mailand mar...@tuxadero.com wrote: Hi Greg, I took down the interface with ifconfig p7p1 down. I attached the config of the first monitor and the first osd. I created the cluster with ceph-deploy. The version is ceph version 0.86 (97dcc0539dfa7dac3de74852305d51580b7b1f82). On 13.10.2014 21:45, Gregory Farnum wrote: How did you test taking down the connection? What config options have you specified on the OSDs and in the monitor? None of the scenarios you're describing make much sense on a semi-recent (post-dumpling-release) version of Ceph. Best Regards, martin Hmm, do you have any logs? 120 seconds is just way longer than the failure detection should normally take, unless you've been playing with it enough to stretch out the extra time the monitor waits to be certain. But I did realize that in your configuration you probably want to set one or both of mon_osd_min_down_reporters and mon_osd_min_down_reports to a number greater than the number of OSDs you have on a single host. (They default to 1 and 3, respectively.) That's probably how the disconnected node managed to fail all of the other nodes — its failure reports to the monitor arrived first. You can also run tests with the mon_osd_adjust_heartbeat_grace option set to false, to get more predictable results. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
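As a sketch of the settings Greg mentions, something like the following in ceph.conf would do it; the reporter/report numbers assume a hypothetical 6 OSDs per host, so size them to your own layout:

  [mon]
      # require failure reports from more OSDs than live on any single host
      mon osd min down reporters = 7
      mon osd min down reports = 9
      # make failure-detection timing more predictable during tests
      mon osd adjust heartbeat grace = false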
Re: [ceph-users] mds isn't working anymore after osd's running full
ceph-mds --undump-journal rank journal-file Looks like it accidentally (or on purpose? you can break things with it) got left out of the help text. On Tue, Oct 14, 2014 at 8:19 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hello Greg, I dumped the journal successful to a file: journal is 9483323613~134215459 read 134213311 bytes at offset 9483323613 wrote 134213311 bytes at offset 9483323613 to journaldumptgho NOTE: this is a _sparse_ file; you can $ tar cSzf journaldumptgho.tgz journaldumptgho to efficiently compress it while preserving sparseness. I see the option for resetting the mds journal but I can't find the option for undumping /importing the journal: usage: ceph-mds -i name [flags] [[--journal_check rank]|[--hot-standby][rank]] -m monitorip:port connect to monitor at given address --debug_mds n debug MDS level (e.g. 10) --dump-journal rank filename dump the MDS journal (binary) for rank. --dump-journal-entries rank filename dump the MDS journal (JSON) for rank. --journal-check rank replay the journal for rank, then exit --hot-standby rank start up as a hot standby for rank --reset-journal rank discard the MDS journal for rank, and replace it with a single event that updates/resets inotable and sessionmap on replay. Do you know how to undump the journal back into ceph? Jasper Van: Gregory Farnum [g...@inktank.com] Verzonden: vrijdag 10 oktober 2014 23:45 Aan: Jasper Siero CC: ceph-users Onderwerp: Re: [ceph-users] mds isn't working anymore after osd's running full Ugh, debug journaler, not debug journaled. That said, the filer output tells me that you're missing an object out of the MDS log. (200.08f5) I think this issue should be resolved if you dump the journal to a file, reset it, and then undump it. (These are commands you can invoke from ceph-mds.) I haven't done this myself in a long time, so there may be some hard edges around it. In particular, I'm not sure if the dumped journal file will stop when the data stops, or if it will be a little too long. If so, we can fix that by truncating the dumped file to the proper length and resetting and undumping again. (And just to harp on it, this journal manipulation is a lot simpler in Giant... ;) ) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wed, Oct 8, 2014 at 7:11 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hello Greg, No problem thanks for looking into the log. I attached the log to this email. I'm looking forward for the new release because it would be nice to have more possibilities to diagnose problems. Kind regards, Jasper Siero Van: Gregory Farnum [g...@inktank.com] Verzonden: dinsdag 7 oktober 2014 19:45 Aan: Jasper Siero CC: ceph-users Onderwerp: Re: [ceph-users] mds isn't working anymore after osd's running full Sorry; I guess this fell off my radar. The issue here is not that it's waiting for an osdmap; it got the requested map and went into replay mode almost immediately. In fact the log looks good except that it seems to finish replaying the log and then simply fail to transition into active. Generate a new one, adding in debug journaled = 20 and debug filer = 20, and we can probably figure out how to fix it. (This diagnosis is much easier in the upcoming Giant!) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Oct 7, 2014 at 7:55 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hello Gregory, We still have the same problems with our test ceph cluster and didn't receive a reply from you after I send you the requested log files. 
Do you know if it's possible to get our cephfs filesystem working again or is it better to give up the files on cephfs and start over again? We restarted the cluster several times but it's still degraded: [root@th1-mon001 ~]# ceph -w cluster c78209f5-55ea-4c70-8968-2231d2b05560 health HEALTH_WARN mds cluster is degraded monmap e3: 3 mons at {th1-mon001=10.1.2.21:6789/0,th1-mon002=10.1.2.22:6789/0,th1-mon003=10.1.2.23:6789/0}, election epoch 432, quorum 0,1,2 th1-mon001,th1-mon002,th1-mon003 mdsmap e190: 1/1/1 up {0=th1-mon001=up:replay}, 1 up:standby osdmap e2248: 12 osds: 12 up, 12 in pgmap v197548: 492 pgs, 4 pools, 60297 MB data, 470 kobjects 124 GB used, 175 GB / 299 GB avail 491 active+clean 1 active+clean+scrubbing+deep One placement group stays in the deep scrubbing phase. Kind regards, Jasper Siero From: Jasper Siero Sent: Thursday, 21 August 2014 16:43 To: Gregory Farnum Subject: RE: [ceph-users] mds isn't working anymore after osd's running full I did restart it but you are right about
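For reference, the dump/reset/undump cycle discussed in this thread boils down to something like the following (rank 0 and the file name are just examples, and the MDS should be stopped first):

  # back up the journal for rank 0, then reset and re-import it
  ceph-mds -i <name> -c /etc/ceph/ceph.conf --dump-journal 0 journal-backup.bin
  ceph-mds -i <name> -c /etc/ceph/ceph.conf --reset-journal 0
  ceph-mds -i <name> -c /etc/ceph/ceph.conf --undump-journal 0 journal-backup.bin
  service ceph start mds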
Re: [ceph-users] Firefly maintenance release schedule
On Wed, Oct 15, 2014 at 9:39 AM, Dmitry Borodaenko dborodae...@mirantis.com wrote: On Tue, Sep 30, 2014 at 6:49 PM, Dmitry Borodaenko dborodae...@mirantis.com wrote: Last stable Firefly release (v0.80.5) was tagged on July 29 (over 2 months ago). Since then, there were twice as many commits merged into the firefly branch than there existed on the branch before v0.80.5: $ git log --oneline --no-merges v0.80..v0.80.5|wc -l 122 $ git log --oneline --no-merges v0.80.5..firefly|wc -l 227 Is this a one time aberration in the process or should we expect the gap between maintenance updates for LTS releases of Ceph to keep growing? I didn't get a response to that nag other than the v0.80.6 release announcement on the day after, so I guess it wasn't completely ignored :) Except it turned out v0.80.6 was slightly less than useful as a maintenance release: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-October/043701.html Two weeks later we have v0.80.7 with 3 more commits that hopefully make it actually usable. There are many ways to look at that from release management perspective. Good: 2 weeks is much better than 2 months. Bad: that's 2.5 months since last *stable* Firefly release. Ugly: that's 2 weeks for 3 commits, and now we have 54 more waiting for the next release... Wait what?! Oh right, 54 more commits were merged from firefly-next as soon as v0.80.7 was tagged: $ git log --oneline --no-merges v0.80.7..firefly|wc -l 54 Some of these are fixes for Urgent priority bugs, crashes, and data loss: http://tracker.ceph.com/issues/9492 http://tracker.ceph.com/issues/9039 http://tracker.ceph.com/issues/9582 http://tracker.ceph.com/issues/9307 etc. So what a Ceph deployer supposed to do with this? Wait another couple of weeks (hopefully) for v0.80.8? Take v0.80.7 and hope not to encounter any of these bugs? Or label Firefly as not production ready yet and go back to Dumpling? My personal preference obviously would be the first option, but waiting for 2.5 more months is not going to fit my schedule :( Take .80.7. All of the bugs you've cited, you are supremely unlikely to run into. The Urgent tag is a measure of planning priority, not of impact to users; here it generally means we found a bug on a stable branch that we can reproduce. Taking them in order: http://tracker.ceph.com/issues/9492: only happens if you try and cheat with your CRUSH rules, and obviously nobody did that until Sage suggested it as a solution to the problem somebody had 29 days ago when this was discovered. http://tracker.ceph.com/issues/9039: The most serious here, but only happens if you're using RGW, and storing user data in multiple pools, and issue a COPY command to copy data between different pools. http://tracker.ceph.com/issues/9582: Only happens if you're using the op timeout feature of librados with the C bindings OR the op timeout feature *and* the user-provided buffers in the C++ interface. (To the best of my knowledge, the people who discovered this are the only ones using op timeouts.) http://tracker.ceph.com/issues/9307: I'm actually not sure what's going on here; looks like some kind of extremely rare race when authorizing requests? (ie, fixed by a retry) We messed up the v0.80.6 release in a very specific way (and if you were deploying a new cluster it wasn't a problem), but you're extrapolating too much from the presence of patches about what their impact is and what the system's stability is. 
These are largely cleaning up rough edges around user interfaces, and smoothing out issues in the new functionality that a standard deployment isn't going to experience. :) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Performance doesn't scale well on a full ssd cluster.
[Re-added the list.] I assume you added more clients and checked that it didn't scale past that? You might look through the list archives; there are a number of discussions about how and how far you can scale SSD-backed cluster performance. Just scanning through the config options you set, you might want to bump up all the filestore and journal queue values a lot farther. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Thu, Oct 16, 2014 at 9:51 AM, Mark Wu wud...@gmail.com wrote: Thanks for the reply. I am not using a single client. Writing to 5 rbd volumes on 3 hosts can reach the peak. The client is fio and it also runs on the osd nodes. But there are no bottlenecks on CPU or network. I also tried running the clients on two non-osd servers, but got the same result. On Oct 17, 2014, at 12:29 AM, Gregory Farnum g...@inktank.com wrote: If you're running a single client to drive these tests, that's your bottleneck. Try running multiple clients and aggregating their numbers. -Greg On Thursday, October 16, 2014, Mark Wu wud...@gmail.com wrote: Hi list, During my test, I found ceph doesn't scale as I expected on a 30-osd cluster. The following is the information about my setup: HW configuration: 15 Dell R720 servers, and each server has: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 20 cores and hyper-threading enabled. 128GB memory two Intel 3500 SSD disks, connected with MegaRAID SAS 2208 controller, each disk is configured as raid0 separately. bonding with two 10GbE nics, used for both the public network and cluster network. SW configuration: OS CentOS 6.5, Kernel 3.17, Ceph 0.86 XFS as file system for data. each SSD disk has two partitions, one is osd data and the other is osd journal. the pool has 2048 pgs. 2 replicas. 5 monitors running on 5 of the 15 servers. Ceph configuration (in memory debugging options are disabled) [osd] osd data = /var/lib/ceph/osd/$cluster-$id osd journal = /var/lib/ceph/osd/$cluster-$id/journal osd mkfs type = xfs osd mkfs options xfs = -f -i size=2048 osd mount options xfs = rw,noatime,logbsize=256k,delaylog osd journal size = 20480 osd mon heartbeat interval = 30 # Performance tuning filestore osd_max_backfills = 10 osd_recovery_max_active = 15 merge threshold = 40 filestore split multiple = 8 filestore fd cache size = 1024 osd op threads = 64 # Recovery tuning osd recovery max active = 1 osd max backfills = 1 osd recovery op priority = 1 throttler perf counter = false osd enable op tracker = false filestore_queue_max_ops = 5000 filestore_queue_committing_max_ops = 5000 journal_max_write_entries = 1000 journal_queue_max_ops = 5000 objecter_inflight_ops = 8192 When I test with 7 servers (14 osds), the maximum iops of 4k random write I saw is 17k on a single volume and 44k on the whole cluster. I expected that the 30-osd cluster could approach 90k. But unfortunately, I found that with 30 osds it provides almost the same performance as 14 osds, sometimes even worse. I checked the iostat output on all the nodes, which have similar numbers. It's well distributed but disk utilization is low. In the test with 14 osds, I can see higher utilization of the disks (80%~90%). So do you have any tuning suggestions to improve the performance with 30 osds? Any feedback is appreciated.
iostat output:
Device:  rrqm/s   wrqm/s    r/s       w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00     0.00   0.00      0.00     0.00       0.00      0.00      0.00   0.00   0.00   0.00
sdb        0.00    88.50   0.00   5188.00     0.00   93397.00     18.00      0.90   0.17   0.09  47.85
sdc        0.00   443.50   0.00   5561.50     0.00   97324.00     17.50      4.06   0.73   0.09  47.90
dm-0       0.00     0.00   0.00      0.00     0.00       0.00      0.00      0.00   0.00   0.00   0.00
dm-1       0.00     0.00   0.00      0.00     0.00       0.00      0.00      0.00   0.00   0.00   0.00
Device:  rrqm/s   wrqm/s    r/s       w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00    17.50   0.00     28.00     0.00    3948.00    141.00      0.01   0.29   0.05   0.15
sdb        0.00    69.50   0.00   4932.00     0.00   87067.50     17.65      2.27   0.46   0.09  43.45
sdc        0.00    69.00   0.00   4855.50     0.00  105771.50     21.78      0.95   0.20   0.10  46.40
dm-0       0.00     0.00   0.00      0.00     0.00       0.00      0.00      0.00   0.00   0.00   0.00
dm-1       0.00     0.00   0.00     42.50     0.00    3948.00     92.89      0.01   0.19   0.04   0.15
Device:  rrqm/s   wrqm/s    r/s       w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00    12.00   0.00      8.00     0.00     568.00     71.00      0.00   0.12   0.12   0.10
sdb        0.00    72.50   0.00   5046.50     0.00  113198.50     22.43      1.09   0.22   0.10  51.40
sdc        0.00    72.50
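To make Greg's "bump up the queue values" suggestion concrete, a hedged ceph.conf sketch might look like this; the numbers below are illustrative guesses, not recommendations from the thread, and should be sized to the RAM you can spare:

  [osd]
      filestore queue max ops = 20000
      filestore queue max bytes = 536870912
      filestore queue committing max ops = 20000
      filestore queue committing max bytes = 536870912
      journal max write entries = 10000
      journal max write bytes = 1073741824
      journal queue max ops = 20000
      journal queue max bytes = 536870912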
Re: [ceph-users] why the erasure code pool not support random write?
This is a common constraint in many erasure-coded storage systems. It arises because random writes turn into a read-modify-write cycle (in order to redo the parity calculations). So we simply disallow them in EC pools, which works fine for the target use cases right now. -Greg On Monday, October 20, 2014, 池信泽 xmdx...@gmail.com wrote: hi, cephers: When I looked into the ceph source code, I found that the erasure code pool does not support random writes; it only supports append writes. Why? Is it that random writes on erasure code are high cost and the performance of deep scrub would be very poor? Thanks. -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
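For anyone wanting to see the restriction in practice, a minimal sketch of creating an EC pool follows (the profile values are just examples); full-object writes and appends to it succeed, while a write at a non-zero offset into an existing object is rejected, which is exactly the read-modify-write case described above:

  ceph osd erasure-code-profile set ecdemo k=2 m=1 ruleset-failure-domain=host
  ceph osd pool create ecpool 128 128 erasure ecdemo
  # a full-object write works fine on the EC pool
  rados -p ecpool put demo-object /etc/hosts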
Re: [ceph-users] Ceph OSD very slow startup
On Mon, Oct 20, 2014 at 8:25 AM, Lionel Bouton lionel+c...@bouton.name wrote: Hi, More information on our Btrfs tests. On 14/10/2014 19:53, Lionel Bouton wrote: Current plan: wait at least a week to study 3.17.0 behavior and upgrade the 3.12.21 nodes to 3.17.0 if all goes well. 3.17.0 and 3.17.1 have a bug which remounts Btrfs filesystems read-only (no corruption but the OSD goes down) on some access patterns with snapshots: https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36483.html The bug may be present in earlier kernels (at least the 3.16.4 code in fs/btrfs/qgroup.c doesn't handle the case differently than 3.17.0 and 3.17.1) but seems at least less likely to show up (never saw it with 3.16.4 in several weeks but it happened with 3.17.1 three times in just a few hours). As far as I can tell from its Changelog, 3.17.1 didn't patch any vfs/btrfs path vs 3.17.0, so I assume 3.17.0 has the same behaviour. I switched all servers to 3.16.4, which I had previously tested without any problem. The performance problem is still there with 3.16.4. In fact one of the 2 large OSDs was so slow it was repeatedly marked out and generated lots of latencies when in. I just had to remove it: when this OSD is shut down with noout to avoid backfills slowing down the storage network, latencies are back to normal. I chose to reformat this one with XFS. The other big node has a nearly perfectly identical system (same hardware, same software configuration, same logical volume configuration, same weight in the crush map, comparable disk usage in the OSD fs, ...) but is behaving itself (maybe slower than our smaller XFS and Btrfs OSDs, but usable). The only notable difference is that it was formatted more recently. So the performance problem might be linked to the cumulative amount of data access to the OSD over time. Yeah; we've seen this before and it appears to be related to our aggressive use of btrfs snapshots; it seems that btrfs doesn't defrag well under our use case. The btrfs developers make sporadic concerted efforts to improve things (and succeed!), but it apparently still hasn't gotten enough better yet. :( -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CRUSH depends on host + OSD?
On Tuesday, October 21, 2014, Chad Seys cws...@physics.wisc.edu wrote: Hi Craig, It's part of the way the CRUSH hashing works. Any change to the CRUSH map causes the algorithm to change slightly. Dan@cern could not replicate my observations, so I plan to follow his procedure (fake create an OSD, wait for rebalance, remove fake OSD) in the near future to see if I can replicate his! :) BTW, it's safer to remove OSDs and hosts by first marking the OSDs UP and OUT (ceph osd out OSDID). That will trigger the remapping, while keeping the OSDs in the pool so you have all of your replicas. I am under the impression that the procedure I posted does leave the OSDs in the pool while an additional replication takes place: After ceph osd crush remove osd.osdnum I see that the used % on the removed OSD slowly decreases as the relocation of blocks takes place. If my ceph-fu were strong enough I would try to find some block replicated num_replicas+1 times so that my belief would be well-founded. :) Also ceph osd crush remove osd.osdnum still shows the OSD in ceph osd tree, but it is not attached to any server. I think it might even be marked UP and DOWN, but I cannot confirm. So I believe so far the approaches are equivalent. BUT, I think that to keep an OSD out after using ceph osd out OSDID one needs to turn off auto in or something. I don't want to turn that off b/c in the past I had some slow drives which would occasionally be marked out. If they stayed out that could increase load on other drives, making them unresponsive, getting them marked out as well, leading to a domino effect where too many drives get marked out and the cluster goes down. Now I have better hardware, but since the scenario exists, I'd rather avoid it! :) There are separate options for automatically marking new drives in versus marking in established ones. Should be in the docs! :) -Greg If you mark the OSDs OUT, wait for the remapping to finish, and remove the OSDs and host from the CRUSH map, there will still be some data migration. Yep, this is what I see. But I find it weird. Ceph is also really good at handling multiple changes in a row. For example, I had to reformat all of my OSDs because I chose my mkfs.xfs parameters poorly. I removed the OSDS, without draining them first, which caused a lot of remapping. I then quickly formatted the OSDs, and put them back in. The CRUSH map went back to what it started with, and the only remapping required was to re-populate the newly formatted OSDs. In this case you'd be living with num_replicas-1 for a while. Sounds exciting! :) Thanks, Chad. ___ ceph-users mailing list ceph-users@lists.ceph.com javascript:; http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
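To the best of my knowledge, the separate options Greg alludes to are the ones below (the defaults are quoted from memory, so verify them against your release):

  [mon]
      # mark any booting OSD in, even one an administrator marked out
      mon osd auto mark in = false
      # mark booting OSDs back in only if they were automatically marked out
      mon osd auto mark auto out in = true
      # mark freshly created OSDs in on their first boot
      mon osd auto mark new in = true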
Re: [ceph-users] Question/idea about performance problems with a few overloaded OSDs
On Tue, Oct 21, 2014 at 10:15 AM, Lionel Bouton lionel+c...@bouton.name wrote: Hi, I've yet to install 0.80.7 on one node to confirm its stability and use the new IO priority tuning parameters enabling prioritized access to data from client requests. In the meantime, faced with large slowdowns caused by resync or external IO load (although external IO load is not expected, it can happen in migrations from other storage solutions, as in our recent experience), I've got an idea related to the underlying problem (IO load concurrent with client requests or even concentrated client requests) that might already be implemented (or not be of much value), so I'll write it down to get feedback. When IO load is not balanced correctly across OSDs, the most loaded OSD becomes a bottleneck for both write and read requests, and for many (most?) workloads will become a bottleneck for the whole storage network as seen by the client. This happened to us on numerous occasions (low filesystem performance, OSD restarts triggering backfills or recoveries). For read requests, would it be beneficial for OSDs to communicate with their peers to find out their recent IO mean/median/... service time, and make OSDs able to proxy requests to less loaded nodes when they are substantially more loaded than their peers? If the additional network load generated by proxying requests proves detrimental to the overall performance, maybe an update to librados to accept a hint to redirect read requests for a given PG and a given period might be a solution. I understand that even if this is possible for read requests, it doesn't apply to write requests because they are synchronized across all replicas. That said, diminishing read load on one OSD without modifying write behavior will obviously help the OSD process write requests faster. If the general idea isn't bad or already obsoleted by another, it's obviously not trivial. For example it can create unstable feedback loops, so if I were to try and implement it I'd probably start with a selective proxy/redirect, with the probability of proxying/redirecting being computed from the respective loads of all OSDs storing a given PG, to avoid ping-pong situations where read requests overload one OSD before overloading another and coming round again. Any thoughts? Is it based on wrong assumptions? Would it prove to be a can of worms if someone tried to implement it? Yeah, there's one big thing you're missing: we strictly order reads and writes to an object, and the primary is the serialization point. If we were to proxy reads to another replica it would be easy enough for the primary to continue handling the ordering, but if it were just a redirect it wouldn't be able to do so (the primary doesn't know when the read is completed, allowing it to start a write). Setting up the proxy of course requires a lot more code, but more importantly it's more resource-intensive on the primary, so I'm not sure if it's worth it. :/ The primary affinity value we recently introduced is designed to help alleviate persistent balancing problems around this by letting you reduce how many PGs an OSD is primary for without changing the location of the actual data in the cluster. But dynamic updates to that aren't really feasible either (it's a map change and requires repeering). There are also relaxed consistency mechanisms that let clients read from a replica (randomly, or the one closest to them, etc), but with these there's no good way to get load data from the OSDs to the clients. 
So redirects of some kind sound like a good feature, but I'm not sure how one could go about implementing them reasonably. I think the actual proxy is probably the best bet, but that's an awful lot of code in critical places and with lots of dependencies whose performance/balancing benefits I'm a little dubious of. :/ -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
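A short sketch of the primary-affinity knob Greg refers to (the OSD id and weight here are made up); it shifts primary duties away from an OSD without moving any data:

  # the monitors have to allow the feature first
  ceph tell mon.\* injectargs '--mon_osd_allow_primary_affinity=true'
  # roughly halve the number of PGs for which osd.12 acts as primary
  ceph osd primary-affinity osd.12 0.5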
Re: [ceph-users] Extremely slow small files rewrite performance
Are these tests conducted using a local fs on RBD, or using CephFS? If CephFS, do you have multiple clients mounting the FS, and what are they doing? What client (kernel or ceph-fuse)? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Oct 21, 2014 at 9:05 AM, Sergey Nazarov nataraj...@gmail.com wrote: Hi I just built a new cluster using these quickstart instructions: http://ceph.com/docs/master/start/ And here is what I am seeing: # time for i in {1..10}; do echo $i > $i.txt ; done real 0m0.081s user 0m0.000s sys 0m0.004s And if I try to repeat the same command (when the files are already created): # time for i in {1..10}; do echo $i > $i.txt ; done real 0m48.894s user 0m0.000s sys 0m0.004s I was very surprised and then just tried to rewrite a single file: # time echo 1 > 1.txt real 0m3.133s user 0m0.000s sys 0m0.000s BTW, I don't think it is a problem with OSD speed or network: # time sysbench --num-threads=1 --test=fileio --file-total-size=1G --file-test-mode=rndrw prepare 1073741824 bytes written in 23.52 seconds (43.54 MB/sec). Here is my ceph cluster status and version: # ceph -w cluster d3dcacc3-89fb-4db0-9fa9-f1f6217280cb health HEALTH_OK monmap e4: 4 mons at {atl-fs10=10.44.101.70:6789/0,atl-fs11=10.44.101.91:6789/0,atl-fs12=10.44.101.92:6789/0,atl-fs9=10.44.101.69:6789/0}, election epoch 40, quorum 0,1,2,3 atl-fs9,atl-fs10,atl-fs11,atl-fs12 mdsmap e33: 1/1/1 up {0=atl-fs12=up:active}, 3 up:standby osdmap e92: 4 osds: 4 up, 4 in pgmap v8091: 192 pgs, 3 pools, 123 MB data, 1658 objects 881 GB used, 1683 GB / 2564 GB avail 192 active+clean client io 1820 B/s wr, 1 op/s # ceph -v ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) All nodes are connected with a gigabit network. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Extremely slow small files rewrite performance
Can you enable debugging on the client (debug ms = 1, debug client = 20) and mds (debug ms = 1, debug mds = 20), run this test again, and post them somewhere for me to look at? While you're at it, can you try rados bench and see what sort of results you get? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Oct 21, 2014 at 10:57 AM, Sergey Nazarov nataraj...@gmail.com wrote: It is CephFS mounted via ceph-fuse. I am getting the same results not depending on how many other clients are having this fs mounted and their activity. Cluster is working on Debian Wheezy, kernel 3.2.0-4-amd64. On Tue, Oct 21, 2014 at 1:44 PM, Gregory Farnum g...@inktank.com wrote: Are these tests conducted using a local fs on RBD, or using CephFS? If CephFS, do you have multiple clients mounting the FS, and what are they doing? What client (kernel or ceph-fuse)? -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Oct 21, 2014 at 9:05 AM, Sergey Nazarov nataraj...@gmail.com wrote: Hi I just built a new cluster using this quickstart instructions: http://ceph.com/docs/master/start/ And here is what I am seeing: # time for i in {1..10}; do echo $i $i.txt ; done real 0m0.081s user 0m0.000s sys 0m0.004s And if I try to repeat the same command (when files already created): # time for i in {1..10}; do echo $i $i.txt ; done real 0m48.894s user 0m0.000s sys 0m0.004s I was very surprised and then just tried to rewrite a single file: # time echo 1 1.txt real 0m3.133s user 0m0.000s sys 0m0.000s BTW, I dont think it is the problem with OSD speed or network: # time sysbench --num-threads=1 --test=fileio --file-total-size=1G --file-test-mode=rndrw prepare 1073741824 bytes written in 23.52 seconds (43.54 MB/sec). Here is my ceph cluster status and verion: # ceph -w cluster d3dcacc3-89fb-4db0-9fa9-f1f6217280cb health HEALTH_OK monmap e4: 4 mons at {atl-fs10=10.44.101.70:6789/0,atl-fs11=10.44.101.91:6789/0,atl-fs12=10.44.101.92:6789/0,atl-fs9=10.44.101.69:6789/0}, election epoch 40, quorum 0,1,2,3 atl-fs9,atl-fs10,atl-fs11,atl-fs12 mdsmap e33: 1/1/1 up {0=atl-fs12=up:active}, 3 up:standby osdmap e92: 4 osds: 4 up, 4 in pgmap v8091: 192 pgs, 3 pools, 123 MB data, 1658 objects 881 GB used, 1683 GB / 2564 GB avail 192 active+clean client io 1820 B/s wr, 1 op/s # ceph -v ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) All nodes connected with gigabit network. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
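A sketch of what that debugging setup could look like, assuming the settings go into the ceph.conf read by the MDS and the ceph-fuse client (which then need a restart/remount), plus a rados bench run against a hypothetical test pool:

  [mds]
      debug ms = 1
      debug mds = 20
  [client]
      debug ms = 1
      debug client = 20

  # 30-second write benchmark, 16 concurrent ops, keeping the objects for a follow-up read test
  rados bench -p testpool 30 write -t 16 --no-cleanup
  rados bench -p testpool 30 seq -t 16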
Re: [ceph-users] Fio rbd stalls during 4M reads
There's a temporary issue in the master branch that makes rbd reads greater than the cache size hang (if the cache is on). This might be that. (Jason is working on it: http://tracker.ceph.com/issues/9854) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Thu, Oct 23, 2014 at 5:09 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: I'm doing some fio tests on Giant using the fio rbd driver to measure performance on a new ceph cluster. However with block sizes of 1M and above (initially noticed with 4M) I am seeing absolutely no IOPS for *reads* - and the fio process becomes non-interruptible (needs kill -9): $ ceph -v ceph version 0.86-467-g317b83d (317b831a917f70838870b31931a79bdd4dd0) $ fio --version fio-2.1.11-20-g9a44 $ fio read-busted.fio env-read-4M: (g=0): rw=read, bs=4M-4M/4M-4M/4M-4M, ioengine=rbd, iodepth=32 fio-2.1.11-20-g9a44 Starting 1 process rbd engine: RBD version: 0.1.8 Jobs: 1 (f=1): [R(1)] [inf% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158050441d:06h:58m:03s] This appears to be a pure fio rbd driver issue, as I can attach the relevant rbd volume to a VM and dd from it using 4M blocks no problem. Any ideas? Cheers Mark ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
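One quick way to check whether the cache bug is what's biting (this is my assumption, not a workaround confirmed in the tracker) is to rerun the fio job with the client-side RBD cache disabled:

  [client]
      rbd cache = false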
Re: [ceph-users] error when executing ceph osd pool set foo-hot cache-mode writeback
On Tue, Oct 28, 2014 at 3:24 AM, Cristian Falcas cristi.fal...@gmail.com wrote: Hello, In the documentation about creating an cache pool, you find this: Cache mode The most important policy is the cache mode: ceph osd pool set foo-hot cache-mode writeback But when trying to run the above command, I get an error: ceph osd pool set ssd_cache cache-mode writeback Invalid command: cache-mode not in size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|debug_fake_ec_pool|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|auid osd pool set poolname size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|debug_fake_ec_pool|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|auid val {--yes-i-really-mean-it} : set pool parameter var to val Error EINVAL: invalid command Is this deprecated? I'm using version 0.80. Those are the commands I used to create the cache: # Set up a read/write cache pool ssd_cache for pool images: ceph osd tier add images ssd_cache ceph osd tier cache-mode ssd_cache writeback # Direct all traffic for images to ssd_cache: ceph osd tier set-overlay images ssd_cache ceph osd pool set ssd_cache cache-mode writeback # Set the target size and enable the tiering agent for ssd_cache: ceph osd pool set ssd_cache hit_set_type bloom ceph osd pool set ssd_cache hit_set_count 1 ceph osd pool set ssd_cache hit_set_period 3600 # 1 hour ceph osd pool set ssd_cache target_max_bytes 4000 # 500 GB # will begin flushing dirty objects when 40% of the pool is dirty and begin evicting clean objects when we reach 80% of the target size. ceph osd pool set ssd_cache cache_target_dirty_ratio .4 ceph osd pool set ssd_cache cache_target_full_ratio .8 Where are you seeing the ceph osd pool set ssd_cache cache-mode writeback from? You're setting that with the ceph osd tier cache-mode ssd_cache writeback command. :) -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Adding a monitor to
On Mon, Oct 27, 2014 at 11:37 AM, Patrick Darley patrick.dar...@codethink.co.uk wrote: Hi there Over the last week or so, I've been trying to connect a ceph monitor node running on a baserock system to connect to a simple 3-node ubuntu ceph cluster. The 3 node ubunutu cluster was created by following the documented Quick installation guide using 3 VMs running ubuntu Trusty. After the ubuntu cluster has been deployed I would then follow the directions below, which I derived from comparing the ceph-deploy debug information, the ceph documentation on adding monitor nodes to an existing system and the ceph documentation on bootstrapping monitor nodes. 1. scp the /etc/ceph/* from admin node 2. create the dir: mkdir /var/lib/ceph/mon/ceph-bcc08 3. generate mon keyring: sudo ceph auth get mon. -o /var/lib/ceph/tmp/ceph-bcc08.mon.keyring 4. generate monmap: sudo ceph mon getmap -o /var/lib/ceph/tmp/monmap Yeah, this is wrong. You're here giving the monitor its own keyring which it is going to expect anybody to talk to to be encrypting with. The docs have a section on adding monitors which should work verbatim; if not it's a doc bug: http://ceph.com/docs/master/rados/operations/add-or-rm-mons/#adding-monitors -Greg 5. That filesystem thingy: sudo ceph-mon --cluster ceph --mkfs -i bcc08 --keyring /var/lib/ceph/tmp/ceph-bcc08.mon.keyring --monmap /var/lib/ceph/tmp/monmap 6. Unlink keys and old monmap: rm /var/lib/ceph/tmp/* 7. touch things: touch /var/lib/ceph/mon/ceph-bcc08/done and touch /var/lib/ceph/mon/ceph-bcc08/sysvinit 8. Then start the mon: sudo /etc/init.d/ceph start mon.bcc08 When I carry out these steps in the attempt to add a baserock system to the ubuntu cluster, the monitor node has not been added to the cluster and the admin socket mon_status gives the following output. ~ # ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.bcc07.asok mon_status { name: bcc07, rank: -1, state: probing, election_epoch: 0, quorum: [], outside_quorum: [], extra_probe_peers: [], sync_provider: [], monmap: { epoch: 0, fsid: 4460079d-42f4-4e3a-8ce3-e2a7fa2685e6, modified: 2014-10-27 12:37:25.531542, created: 2014-10-27 12:37:25.531542, mons: [ { rank: 0, name: ucc01, addr: 192.168.122.95:6789\/0}]}} And the newly added monitor remains stuck in the probing state indefinitely. To try and resolve this issue I have looked at the problems monitor troubleshooting page of the ceph documentation, eg. ntp sychronisation and checking network connectivity (to the best of my ability :-s ). It is also worth mentioning that I have created a 3 node ceph cluster on baserock machines (1 mon, 2 osds) then successfully added monitor nodes running baserock and ubuntu systems using the same 8 step process given above. This leaves me confused as to why adding the monitor run on baserock to the all ubuntu cluster specifically is causing problems. Are there any reasons why this 'probing' problem could be occuring? Im feeling a little stuck of how to proceed and would welcome any suggestions. Thanks for your help, Patrick ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] mds isn't working anymore after osd's running full
You'll need to gather a log with the offsets visible; you can do this with debug ms = 1; debug mds = 20; debug journaler = 20. -Greg On Fri, Oct 24, 2014 at 7:03 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hello Greg and John, I used the patch on the ceph cluster and tried it again: /usr/bin/ceph-mds -i th1-mon001 -c /etc/ceph/ceph.conf --cluster ceph --undump-journal 0 journaldumptgho-mon001 undump journaldumptgho-mon001 start 9483323613 len 134213311 writing header 200. writing 9483323613~1048576 writing 9484372189~1048576 writing 9614395613~1048576 writing 9615444189~1048576 writing 9616492765~1044159 done. It went well without errors and after that I restarted the mds. The status went from up:replay to up:reconnect to up:rejoin(lagged or crashed) In the log there is an error about trim_to trimming_pos and its like Greg mentioned that maybe the dumpfile needs to be truncated to the proper length and resetting and undumping again. How can I truncate the dumped file to the correct length? The mds log during the undumping and starting the mds: http://pastebin.com/y14pSvM0 Kind Regards, Jasper Van: john.sp...@inktank.com [john.sp...@inktank.com] namens John Spray [john.sp...@redhat.com] Verzonden: donderdag 16 oktober 2014 12:23 Aan: Jasper Siero CC: Gregory Farnum; ceph-users Onderwerp: Re: [ceph-users] mds isn't working anymore after osd's running full Following up: firefly fix for undump is: https://github.com/ceph/ceph/pull/2734 Jasper: if you still need to try undumping on this existing firefly cluster, then you can download ceph-mds packages from this wip-firefly-undump branch from http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/ Cheers, John On Wed, Oct 15, 2014 at 8:15 PM, John Spray john.sp...@redhat.com wrote: Sadly undump has been broken for quite some time (it was fixed in giant as part of creating cephfs-journal-tool). If there's a one line fix for this then it's probably worth putting in firefly since it's a long term supported branch -- I'll do that now. John On Wed, Oct 15, 2014 at 8:23 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hello Greg, The dump and reset of the journal was succesful: [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 --pid-file /var/run/ceph/mds.th1-mon001.pid -c /etc/ceph/ceph.conf --cluster ceph --dump-journal 0 journaldumptgho-mon001 journal is 9483323613~134215459 read 134213311 bytes at offset 9483323613 wrote 134213311 bytes at offset 9483323613 to journaldumptgho-mon001 NOTE: this is a _sparse_ file; you can $ tar cSzf journaldumptgho-mon001.tgz journaldumptgho-mon001 to efficiently compress it while preserving sparseness. [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 --pid-file /var/run/ceph/mds.th1-mon001.pid -c /etc/ceph/ceph.conf --cluster ceph --reset-journal 0 old journal was 9483323613~134215459 new journal start will be 9621733376 (4194304 bytes past old end) writing journal head writing EResetJournal entry done Undumping the journal was not successful and looking into the error client_lock.is_locked() is showed several times. The mds is not running when I start the undumping so maybe have forgot something? [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 --pid-file /var/run/ceph/mds.th1-mon001.pid -c /etc/ceph/ceph.conf --cluster ceph --undump-journal 0 journaldumptgho-mon001 undump journaldumptgho-mon001 start 9483323613 len 134213311 writing header 200. 
osdc/Objecter.cc: In function 'ceph_tid_t Objecter::op_submit(Objecter::Op*)' thread 7fec3e5ad7a0 time 2014-10-15 09:09:32.020287 osdc/Objecter.cc: 1225: FAILED assert(client_lock.is_locked()) ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6) 1: /usr/bin/ceph-mds() [0x80f15e] 2: (Dumper::undump(char const*)+0x65d) [0x56c7ad] 3: (main()+0x1632) [0x569c62] 4: (__libc_start_main()+0xfd) [0x7fec3ca68d5d] 5: /usr/bin/ceph-mds() [0x567d99] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. 2014-10-15 09:09:32.021313 7fec3e5ad7a0 -1 osdc/Objecter.cc: In function 'ceph_tid_t Objecter::op_submit(Objecter::Op*)' thread 7fec3e5ad7a0 time 2014-10-15 09:09:32.020287 osdc/Objecter.cc: 1225: FAILED assert(client_lock.is_locked()) ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6) 1: /usr/bin/ceph-mds() [0x80f15e] 2: (Dumper::undump(char const*)+0x65d) [0x56c7ad] 3: (main()+0x1632) [0x569c62] 4: (__libc_start_main()+0xfd) [0x7fec3ca68d5d] 5: /usr/bin/ceph-mds() [0x567d99] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. 0 2014-10-15 09:09:32.021313 7fec3e5ad7a0 -1 osdc/Objecter.cc: In function 'ceph_tid_t Objecter::op_submit(Objecter::Op*)' thread 7fec3e5ad7a0 time 2014-10-15 09:09:32.020287 osdc/Objecter.cc: 1225: FAILED
Re: [ceph-users] Adding a monitor to
I'm sorry, you're right — I misread it. :( But indeed step 6 is the crucial one, which tells the existing monitors to accept the new one into the cluster. You'll need to run it with an admin client keyring that can connect to the existing cluster; that's probably the part that has gone wrong. You don't need to run it from the new monitor, so if you're having trouble getting the keys to behave I'd just run it from an existing system. :) -Greg On Tue, Oct 28, 2014 at 10:11 AM, Patrick Darley patrick.dar...@codethink.co.uk wrote: On 2014-10-28 16:08, Gregory Farnum wrote: On Mon, Oct 27, 2014 at 11:37 AM, Patrick Darley patrick.dar...@codethink.co.uk wrote: Hi there Over the last week or so, I've been trying to connect a ceph monitor node running on a baserock system to connect to a simple 3-node ubuntu ceph cluster. The 3 node ubunutu cluster was created by following the documented Quick installation guide using 3 VMs running ubuntu Trusty. After the ubuntu cluster has been deployed I would then follow the directions below, which I derived from comparing the ceph-deploy debug information, the ceph documentation on adding monitor nodes to an existing system and the ceph documentation on bootstrapping monitor nodes. 1. scp the /etc/ceph/* from admin node 2. create the dir: mkdir /var/lib/ceph/mon/ceph-bcc08 3. generate mon keyring: sudo ceph auth get mon. -o /var/lib/ceph/tmp/ceph-bcc08.mon.keyring 4. generate monmap: sudo ceph mon getmap -o /var/lib/ceph/tmp/monmap Yeah, this is wrong. You're here giving the monitor its own keyring which it is going to expect anybody to talk to to be encrypting with. If you are referring to steps 3 and 4 above, I believe these are synonymous with steps 3 and 4 of the documentation you recommended. The monitor keyring and the current monmap are retrieved from the initial monitor. they are then used in step 5 to prepare the monitor's data directory. The docs have a section on adding monitors which should work verbatim; if not it's a doc bug: http://ceph.com/docs/master/rados/operations/add-or-rm-mons/#adding-monitors Thanks for the recommendation. I have tried to use this procedure a couple of times but got stuck at step number 6 of this method. The command given hangs then times out, causing the rest of the cluster to fail. -Greg Thanks for the reply! Much appreciated, Patrick 5. That filesystem thingy: sudo ceph-mon --cluster ceph --mkfs -i bcc08 --keyring /var/lib/ceph/tmp/ceph-bcc08.mon.keyring --monmap /var/lib/ceph/tmp/monmap 6. Unlink keys and old monmap: rm /var/lib/ceph/tmp/* 7. touch things: touch /var/lib/ceph/mon/ceph-bcc08/done and touch /var/lib/ceph/mon/ceph-bcc08/sysvinit 8. Then start the mon: sudo /etc/init.d/ceph start mon.bcc08 When I carry out these steps in the attempt to add a baserock system to the ubuntu cluster, the monitor node has not been added to the cluster and the admin socket mon_status gives the following output. ~ # ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.bcc07.asok mon_status { name: bcc07, rank: -1, state: probing, election_epoch: 0, quorum: [], outside_quorum: [], extra_probe_peers: [], sync_provider: [], monmap: { epoch: 0, fsid: 4460079d-42f4-4e3a-8ce3-e2a7fa2685e6, modified: 2014-10-27 12:37:25.531542, created: 2014-10-27 12:37:25.531542, mons: [ { rank: 0, name: ucc01, addr: 192.168.122.95:6789\/0}]}} And the newly added monitor remains stuck in the probing state indefinitely. To try and resolve this issue I have looked at the problems monitor troubleshooting page of the ceph documentation, eg. 
ntp synchronisation and checking network connectivity (to the best of my ability :-s ). It is also worth mentioning that I have created a 3 node ceph cluster on baserock machines (1 mon, 2 osds) and then successfully added monitor nodes running baserock and ubuntu systems using the same 8 step process given above. This leaves me confused as to why adding the monitor running on baserock to the all-ubuntu cluster specifically is causing problems. Are there any reasons why this 'probing' problem could be occurring? I'm feeling a little stuck on how to proceed and would welcome any suggestions. Thanks for your help, Patrick ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
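For comparison, the documented add-a-monitor procedure condenses to roughly the commands below (the host name bcc08 follows the thread; the new monitor's address is a placeholder):

  # on a node whose client.admin key already works against the cluster
  ceph auth get mon. -o /tmp/mon.keyring
  ceph mon getmap -o /tmp/monmap
  # on the new monitor host
  sudo ceph-mon -i bcc08 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
  # the crucial step: tell the existing quorum about the new monitor (can be run from any working admin node)
  ceph mon add bcc08 <new-mon-ip>:6789
  # then start the daemon on the new host
  sudo /etc/init.d/ceph start mon.bcc08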
Re: [ceph-users] Troubleshooting Incomplete PGs
On Thu, Oct 23, 2014 at 6:41 AM, Chris Kitzmiller ckitzmil...@hampshire.edu wrote: On Oct 22, 2014, at 8:22 PM, Craig Lewis wrote: Shot in the dark: try manually deep-scrubbing the PG. You could also try marking various osd's OUT, in an attempt to get the acting set to include osd.25 again, then do the deep-scrub again. That probably won't help though, because the pg query says it probed osd.25 already... actually , it doesn't. osd.25 is in probing_osds not probed_osds. The deep-scrub might move things along. Re-reading your original post, if you marked the slow osds OUT, but left them running, you should not have lost data. That's true. I just marked them out. I did lose osd.10 (in addition to out'ting those other two OSDs) so I'm not out of the woods yet. If the scrubs don't help, it's probably time to hop on IRC. When I issue the deep-scrub command the cluster just doesn't scrub it. Same for regular scrub. :( This pool was offering an RBD which I've lost my connection to and it won't remount so my data is totally inaccessible at the moment. Thanks for your help so far! It looks like you are suffering from http://tracker.ceph.com/issues/9752, which we've not yet seen in-house but have had reported a few times. I suspect that Loic (CC'ed) would like to discuss your cluster's history with you to try and narrow it down. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Adding a monitor to
[Re-adding the list, so this is archived for future posterity.] On Wed, Oct 29, 2014 at 6:11 AM, Patrick Darley patrick.dar...@codethink.co.uk wrote: Thanks again for the reply Greg! On 2014-10-28 17:39, Gregory Farnum wrote: I'm sorry, you're right — I misread it. :( No worries, I had included some misleading words like generate in my rough description where retrive would have been more appropriate. Sorry! But indeed step 6 is the crucial one, which tells the existing monitors to accept the new one into the cluster. You'll need to run it with an admin client keyring that can connect to the existing cluster; that's probably the part that has gone wrong. You don't need to run it from the new monitor, I think, in order to carry out the 5th step you also need the client.admin keyring present, that'd be preparing the monitors data directory. I had scp-ed it across to the monitor along with the ceph.conf file and pu them in the expected location, /etc/ceph/, prior to running that command. so if you're having trouble getting the keys to behave I'd just run it from an existing system. :) I tried running this command, step 6, from the admin node of my ubuntu ceph cluster. As I had experienced before, the command hung. Then trying to run any ceph commands on the rest of the cluster I get a long hang then the following error: cc@ucc01:~$ ceph -s 2014-10-29 10:40:33.748334 7ffaec051700 0 monclient(hunting): authenticate timed out after 300 2014-10-29 10:40:33.748499 7ffaec051700 0 librados: client.admin authentication error (110) Connection timed out Error connecting to cluster: TimedOut The monitor that I was trying to add can be started ok after this (once I have touched the done and sysvinit files) but also gives the above error when attempting to run the ceph -s. Checking the log file I see the following lines repeated: 2014-10-29 10:01:01.721905 7ffd548ac700 0 mon.bcc07@-1(probing) e0 handle_probe ignoring fsid 5021163c-3c0b-4ec5-83fe-f0622c0e9447 != f2d609ef-2065-4862-a821-55c484d61dca 2014-10-29 10:01:01.809991 7ffd550ad700 1 mon.bcc07@-1(probing).paxos(paxos recovering c 0..0) is_readable now=2014-10-29 10:01:01.809996 lease_expire=0.00 has v0 lc 0 2014-10-29 10:01:03.721559 7ffd548ac700 0 mon.bcc07@-1(probing) e0 handle_probe ignoring fsid 5021163c-3c0b-4ec5-83fe-f0622c0e9447 != f2d609ef-2065-4862-a821-55c484d61dca 2014-10-29 10:01:03.810466 7ffd550ad700 1 mon.bcc07@-1(probing).paxos(paxos recovering c 0..0) is_readable now=2014-10-29 10:01:03.810467 lease_expire=0.00 has v0 lc 0 The initial monitor has the following log at around a similar time: 2014-10-29 10:01:02.169655 7f52e7408700 0 mon.ucc01@1(probing) e2 handle_probe ignoring fsid f2d609ef-2065-4862-a821-55c484d61dca != 5021163c-3c0b-4ec5-83fe-f0622c0e9447 2014-10-29 10:01:04.170153 7f52e7408700 0 mon.ucc01@1(probing) e2 handle_probe ignoring fsid f2d609ef-2065-4862-a821-55c484d61dca != 5021163c-3c0b-4ec5-83fe-f0622c0e9447 2014-10-29 10:01:06.169300 7f52e7408700 0 mon.ucc01@1(probing) e2 handle_probe ignoring fsid f2d609ef-2065-4862-a821-55c484d61dca != 5021163c-3c0b-4ec5-83fe-f0622c0e9447 It looks to me like there might be conflicting fsid values being compared somewhere, but checking the ceph.conf files on the nodes I found them to be declared as the same. The log files recorded a similar output on both monitors for some time. 
I then turned off the monitor I was attempting to add at approximately 12:39:30 and the log file of the initial monitor has the following output around this time: 2014-10-29 12:39:30.304639 7f52e7408700 0 mon.ucc01@1(probing) e2 handle_probe ignoring fsid f2d609ef-2065-4862-a821-55c484d61dca != 5021163c-3c0b-4ec5-83fe-f0622c0e9447 Okay, that's indeed not right. I suspect this is your issue but I'm not entirely certain because your other symptoms are a bit weird. I bet Joao can help though; he maintains the monitor and deals with these issues a lot more often than I do. :) -Greg 2014-10-29 12:39:32.023964 7f52e7c09700 0 mon.ucc01@1(probing).data_health(1) update_stats avail 68% total 14318640 used 3748076 avail 9820180 2014-10-29 12:39:32.303740 7f52e7408700 0 mon.ucc01@1(probing) e2 handle_probe ignoring fsid f2d609ef-2065-4862-a821-55c484d61dca != 5021163c-3c0b-4ec5-83fe-f0622c0e9447 2014-10-29 12:39:32.394606 7f52e53fd700 0 -- 192.168.122.95:6789/0 192.168.122.42:6789/0 pipe(0x55e5180 sd=24 :6789 s=2 pgs=1 cs=1 l=0 c=0x39bfde0).fault with nothing to send, going to standby 2014-10-29 12:39:33.862400 7f52e5902700 0 -- 192.168.122.95:6789/0 192.168.122.42:6789/0 pipe(0x55e5180 sd=13 :6789 s=1 pgs=1 cs=2 l=0 c=0x39bfde0).fault 2014-10-29 12:40:32.024807 7f52e7c09700 0 mon.ucc01@1(probing).data_health(1) update_stats avail 68% total 14318640 used 3748072 avail 9820184 2014-10-29 12:41:32.025632 7f52e7c09700 0 mon.ucc01@1
Re: [ceph-users] mds isn't working anymore after osd's running full
On Wed, Oct 29, 2014 at 7:51 AM, Jasper Siero jasper.si...@target-holding.nl wrote: Hello Greg, I added the debug options which you mentioned and started the process again: [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 --pid-file /var/run/ceph/mds.th1-mon001.pid -c /etc/ceph/ceph.conf --cluster ceph --reset-journal 0 old journal was 9483323613~134233517 new journal start will be 9621733376 (4176246 bytes past old end) writing journal head writing EResetJournal entry done [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 -c /etc/ceph/ceph.conf --cluster ceph --undump-journal 0 journaldumptgho-mon001 undump journaldumptgho-mon001 start 9483323613 len 134213311 writing header 200. writing 9483323613~1048576 writing 9484372189~1048576 writing 9485420765~1048576 writing 9486469341~1048576 writing 9487517917~1048576 writing 9488566493~1048576 writing 9489615069~1048576 writing 9490663645~1048576 writing 9491712221~1048576 writing 9492760797~1048576 writing 9493809373~1048576 writing 9494857949~1048576 writing 9495906525~1048576 writing 9496955101~1048576 writing 9498003677~1048576 writing 9499052253~1048576 writing 9500100829~1048576 writing 9501149405~1048576 writing 9502197981~1048576 writing 9503246557~1048576 writing 9504295133~1048576 writing 9505343709~1048576 writing 9506392285~1048576 writing 9507440861~1048576 writing 9508489437~1048576 writing 9509538013~1048576 writing 9510586589~1048576 writing 9511635165~1048576 writing 9512683741~1048576 writing 9513732317~1048576 writing 9514780893~1048576 writing 9515829469~1048576 writing 9516878045~1048576 writing 9517926621~1048576 writing 9518975197~1048576 writing 9520023773~1048576 writing 9521072349~1048576 writing 9522120925~1048576 writing 9523169501~1048576 writing 9524218077~1048576 writing 9525266653~1048576 writing 9526315229~1048576 writing 9527363805~1048576 writing 9528412381~1048576 writing 9529460957~1048576 writing 9530509533~1048576 writing 9531558109~1048576 writing 9532606685~1048576 writing 9533655261~1048576 writing 9534703837~1048576 writing 9535752413~1048576 writing 9536800989~1048576 writing 9537849565~1048576 writing 9538898141~1048576 writing 9539946717~1048576 writing 9540995293~1048576 writing 9542043869~1048576 writing 9543092445~1048576 writing 9544141021~1048576 writing 9545189597~1048576 writing 9546238173~1048576 writing 9547286749~1048576 writing 9548335325~1048576 writing 9549383901~1048576 writing 9550432477~1048576 writing 9551481053~1048576 writing 9552529629~1048576 writing 9553578205~1048576 writing 9554626781~1048576 writing 9555675357~1048576 writing 9556723933~1048576 writing 9557772509~1048576 writing 9558821085~1048576 writing 9559869661~1048576 writing 9560918237~1048576 writing 9561966813~1048576 writing 9563015389~1048576 writing 9564063965~1048576 writing 9565112541~1048576 writing 9566161117~1048576 writing 9567209693~1048576 writing 9568258269~1048576 writing 9569306845~1048576 writing 9570355421~1048576 writing 9571403997~1048576 writing 9572452573~1048576 writing 9573501149~1048576 writing 9574549725~1048576 writing 9575598301~1048576 writing 9576646877~1048576 writing 9577695453~1048576 writing 9578744029~1048576 writing 9579792605~1048576 writing 9580841181~1048576 writing 9581889757~1048576 writing 9582938333~1048576 writing 9583986909~1048576 writing 9585035485~1048576 writing 9586084061~1048576 writing 9587132637~1048576 writing 9588181213~1048576 writing 9589229789~1048576 writing 9590278365~1048576 writing 9591326941~1048576 writing 9592375517~1048576 writing 
9593424093~1048576 writing 9594472669~1048576 writing 9595521245~1048576 writing 9596569821~1048576 writing 9597618397~1048576 writing 9598666973~1048576 writing 9599715549~1048576 writing 9600764125~1048576 writing 9601812701~1048576 writing 9602861277~1048576 writing 9603909853~1048576 writing 9604958429~1048576 writing 9606007005~1048576 writing 9607055581~1048576 writing 9608104157~1048576 writing 9609152733~1048576 writing 9610201309~1048576 writing 9611249885~1048576 writing 9612298461~1048576 writing 9613347037~1048576 writing 9614395613~1048576 writing 9615444189~1048576 writing 9616492765~1044159 done. [root@th1-mon001 ~]# service ceph start mds === mds.th1-mon001 === Starting Ceph mds.th1-mon001 on th1-mon001... starting mds.th1-mon001 at :/0 The new logs: http://pastebin.com/wqqjuEpy These don't have the increased debugging levels set. :( I'm not sure where you could have put them that they didn't get picked up, but make sure it's in the ceph.conf that this mds daemon is referring to. (You can see the debug levels in use in the --- logging levels --- section; they appear to all be default.) -Greg
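To make sure the settings reach the daemon, they can go into the [mds] section of the ceph.conf that this ceph-mds reads, or be passed on the command line when starting it by hand (a sketch; the host name matches the thread):

  [mds]
      debug ms = 1
      debug mds = 20
      debug journaler = 20

  # or equivalently, as command-line overrides:
  /usr/bin/ceph-mds -i th1-mon001 -c /etc/ceph/ceph.conf --debug-ms 1 --debug-mds 20 --debug-journaler 20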
Re: [ceph-users] Delete pools with low priority?
Dan (who wrote that slide deck) is probably your best bet here, but I believe pool deletion is not very configurable and fairly expensive right now. I suspect that it will get better in Hammer or Infernalis, once we have a unified op work queue that we can independently prioritize all IO through (this was a blueprint in CDS today!). Similar problems with snap trimming and scrubbing were resolved by introducing sleeps between ops, but that's a bit of a hack itself and should be going away once proper IO prioritization is available. -Greg On Wed, Oct 29, 2014 at 8:19 AM, Daniel Schneller daniel.schnel...@centerdevice.com wrote: Bump :-) Any ideas on this? They would be much appreciated. Also: Sorry for a possible double post, client had forgotten its email config. On 2014-10-22 21:21:54 +, Daniel Schneller said: We have been running several rounds of benchmarks through the Rados Gateway. Each run creates several hundred thousand objects and similarly many containers. The cluster consists of 4 machines, 12 OSD disks (spinning, 4TB) — 48 OSDs total. After running a set of benchmarks we renamed the pools used by the gateway pools to get a clean baseline. In total we now have several million objects and containers in 3 pools. Redundancy for all pools is set to 3. Today we started deleting the benchmark data. Once the first renamed set of RGW pools was executed, cluster performance started to go down the drain. Using iotop we can see that the disks are all working furiously. As the command to delete the pools came back very quickly, the assumption is that we are now seeing the effects of the actual objects being removed, causing lots and lots of IO activity on the disks, negatively impacting regular operations. We are running OpenStack on top of Ceph, and we see drastic reduction in responsiveness of these machines as well as in CephFS. Fortunately this is still a test setup, so no production systems are affected. Nevertheless I would like to ask a few questions: 1) Is it possible to have the object deletion run in some low-prio mode? 2) If not, is there another way to delete lots and lots of objects without affecting the rest of the cluster so badly? 3) Can we somehow determine the progress of the deletion so far? We would like to estimate if this is going to take hours, days or weeks? 4) Even if not possible for the already running deletion, could be get a progress for the remaining pools we still want to delete? 5) Are there any parameters that we might tune — even if just temporarily - to speed this up? Slide 18 of http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern describes a very similar situation. Thanks, Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
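For reference, the "sleep between ops" knobs Greg mentions for snap trimming and scrubbing look like the following; note that these do not throttle pool deletion itself, they are only the existing precedent for that style of hack (the values are illustrative):

  [osd]
      osd snap trim sleep = 0.1
      osd scrub sleep = 0.1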
Re: [ceph-users] Crash with rados cppool and snapshots
On Wed, Oct 29, 2014 at 7:49 AM, Daniel Schneller daniel.schnel...@centerdevice.com wrote: Hi! We are exploring options to regularly preserve (i.e. backup) the contents of the pools backing our rados gateways. For that we create nightly snapshots of all the relevant pools when there is no activity on the system to get consistent states. In order to restore the whole pools back to a specific snapshot state, we tried to use the rados cppool command (see below) to copy a snapshot state into a new pool. Unfortunately this causes a segfault. Are we doing anything wrong? This command: rados cppool --snap snap-1 deleteme.lp deleteme.lp2 2> segfault.txt Produces this output: *** Caught signal (Segmentation fault) ** in thread 7f8f49a927c0 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) 1: rados() [0x43eedf] 2: (()+0x10340) [0x7f8f48738340] 3: (librados::IoCtxImpl::snap_lookup(char const*, unsigned long*)+0x17) [0x7f8f48aff127] 4: (main()+0x1385) [0x411e75] 5: (__libc_start_main()+0xf5) [0x7f8f4795fec5] 6: rados() [0x41c6f7] 2014-10-29 12:03:22.761653 7f8f49a927c0 -1 *** Caught signal (Segmentation fault) ** in thread 7f8f49a927c0 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) 1: rados() [0x43eedf] 2: (()+0x10340) [0x7f8f48738340] 3: (librados::IoCtxImpl::snap_lookup(char const*, unsigned long*)+0x17) [0x7f8f48aff127] 4: (main()+0x1385) [0x411e75] 5: (__libc_start_main()+0xf5) [0x7f8f4795fec5] 6: rados() [0x41c6f7] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. Full segfault file and the objdump output for the rados command can be found here: - https://public.centerdevice.de/53bddb80-423e-4213-ac62-59fe8dbb9bea - https://public.centerdevice.de/50b81566-41fb-439a-b58b-e1e32d75f32a We updated to the 0.80.7 release (saw the issue with 0.80.5 before and had hoped that the long list of bugfixes in the release notes would include a fix for this) but are still seeing it. Rados gateways, OSDs, MONs etc. have all been restarted after the update. Package versions as follows: daniel.schneller@node01 [~] $ ➜ dpkg -l | grep ceph ii ceph 0.80.7-1trusty ii ceph-common 0.80.7-1trusty ii ceph-fs-common 0.80.7-1trusty ii ceph-fuse 0.80.7-1trusty ii ceph-mds 0.80.7-1trusty ii libcephfs1 0.80.7-1trusty ii python-ceph 0.80.7-1trusty daniel.schneller@node01 [~] $ ➜ uname -a Linux node01 3.13.0-27-generic #50-Ubuntu SMP Thu May 15 18:06:16 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Copying without the snapshot works. Should this work at least in theory? Well, that's interesting. I'm not sure if this can be expected to work properly, but it certainly shouldn't crash there. Looking at it a bit, you can make it not crash by specifying -p deleteme.lp as well, but it simply copies the current state of the pool, not the snapped state. If you could generate a ticket or two at tracker.ceph.com, that would be helpful! -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
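In other words, based on the pools above, an invocation like this should avoid the segfault, but it copies the pool's current contents rather than the snap-1 state:

    rados -p deleteme.lp cppool --snap snap-1 deleteme.lp deleteme.lp2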
Re: [ceph-users] Swift + radosgw: How do I find accounts/containers/objects limitation?
On Fri, Oct 31, 2014 at 9:55 AM, Narendra Trivedi (natrived) natri...@cisco.com wrote: Hi All, I have been working with Openstack Swift + radosgw to stress the whole object storage from the Swift side (I have been creating containers and objects for days now) but can’t actually find the limitation when it comes to the number of accounts, containers, objects that can be created in the entire object storage. I have tried radosgw-admin but without any luck. Does anyone know how this can be found? There are no hard limits on any of these entities, except for a configurable one on the number of buckets per user. There is slow performance degradation as things like the number of objects in a bucket or number of buckets per user grows too large, but the thresholds will vary depending on your cluster. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Swift + radosgw: How do I find accounts/containers/objects limitation?
It defaults to 1000 and can be set per user via the radosgw-admin utility or the admin API, using the max-buckets param. On Fri, Oct 31, 2014 at 10:01 AM, Narendra Trivedi (natrived) natri...@cisco.com wrote: Thanks, Gregory. Do you know how I can find out where the number of buckets for a particular user has been configured? --Narendra -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Friday, October 31, 2014 11:58 AM To: Narendra Trivedi (natrived) Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Swift + radosgw: How do I find accounts/containers/objects limitation? On Fri, Oct 31, 2014 at 9:55 AM, Narendra Trivedi (natrived) natri...@cisco.com wrote: Hi All, I have been working with Openstack Swift + radosgw to stress the whole object storage from the Swift side (I have been creating containers and objects for days now) but can’t actually find the limitation when it comes to the number of accounts, containers, objects that can be created in the entire object storage. I have tried radosgw-admin but without any luck. Does anyone know how this can be found? There are no hard limits on any of these entities, except for a configurable one on the number of buckets per user. There is slow performance degradation as things like the number of objects in a bucket or number of buckets per user grows too large, but the thresholds will vary depending on your cluster. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
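For example (the uid is hypothetical), the per-user limit can be inspected and changed with radosgw-admin like this:

    radosgw-admin user info --uid=benchmark-user      # look for max_buckets in the output
    radosgw-admin user modify --uid=benchmark-user --max-buckets=5000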
Re: [ceph-users] giant release osd down
What happened when you did the OSD prepare and activate steps? Since your OSDs are either not running or can't communicate with the monitors, there should be some indication from those steps. -Greg On Sun, Nov 2, 2014 at 6:44 AM Shiv Raj Singh virk.s...@gmail.com wrote: Hi All I am new to ceph and I have been trying to configure a 3 node ceph cluster with 1 monitor and 2 osd nodes. I have reinstalled and recreated the cluster three times and I am stuck against the wall. My monitor is working as desired (I guess) but the status of the osds is down. I am following this link http://docs.ceph.com/docs/v0.80.5/install/manual-deployment/ for configuring the osds. The reason why I am not using ceph-deploy is because I want to understand the technology. Can someone please help me understand what I'm doing wrong !! :-) !! *Some useful diagnostic information* ceph2:~$ ceph osd tree # id weight type name up/down reweight -1 2 root default -3 1 host ceph2 0 1 osd.0 down 0 -2 1 host ceph3 1 1 osd.1 down 0 ceph health detail HEALTH_WARN 64 pgs stuck inactive; 64 pgs stuck unclean pg 0.22 is stuck inactive since forever, current state creating, last acting [] pg 0.21 is stuck inactive since forever, current state creating, last acting [] pg 0.20 is stuck inactive since forever, current state creating, last acting [] ceph -s cluster a04ee359-82f8-44c4-89b5-60811bef3f19 health HEALTH_WARN 64 pgs stuck inactive; 64 pgs stuck unclean monmap e1: 1 mons at {ceph1=192.168.101.41:6789/0}, election epoch 1, quorum 0 ceph1 osdmap e9: 2 osds: 0 up, 0 in pgmap v10: 64 pgs, 1 pools, 0 bytes data, 0 objects 0 kB used, 0 kB / 0 kB avail 64 creating My configurations are as below: sudo nano /etc/ceph/ceph.conf [global] fsid = a04ee359-82f8-44c4-89b5-60811bef3f19 mon initial members = ceph1 mon host = 192.168.101.41 public network = 192.168.101.0/24 auth cluster required = cephx auth service required = cephx auth client required = cephx [osd] osd journal size = 1024 filestore xattr use omap = true osd pool default size = 2 osd pool default min size = 1 osd pool default pg num = 333 osd pool default pgp num = 333 osd crush chooseleaf type = 1 [mon.ceph1] host = ceph1 mon addr = 192.168.101.41:6789 [osd.0] host = ceph2 #devs = {path-to-device} [osd.1] host = ceph3 #devs = {path-to-device} .. OSD mount location On ceph2 /dev/sdb1 5.0G 1.1G 4.0G 21% /var/lib/ceph/osd/ceph-0 On ceph3 /dev/sdb1 5.0G 1.1G 4.0G 21% /var/lib/ceph/osd/ceph-1 My Linux OS lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 14.04 LTS Release: 14.04 Codename: trusty Regards Shiv ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
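As a first check (assuming the sysvinit/upstart layout from the manual deployment guide; paths and service names may differ on your setup), confirm the ceph-osd daemons are actually running on ceph2/ceph3 and look at their logs:

    # on ceph2
    ps aux | grep ceph-osd                # is osd.0 running at all?
    sudo /etc/init.d/ceph start osd.0     # or on Ubuntu/upstart: sudo start ceph-osd id=0
    tail /var/log/ceph/ceph-osd.0.log     # look for authentication or network errors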
Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem
On Mon, Nov 3, 2014 at 7:46 AM, Chad Seys cws...@physics.wisc.edu wrote: Hi All, I upgraded from emperor to firefly. Initial upgrade went smoothly and all placement groups were active+clean. Next I executed 'ceph osd crush tunables optimal' to upgrade CRUSH mapping. Okay...you know that's a data movement command, right? So you should expect it to impact operations. (Although not the crashes you're witnessing.) Now I keep having OSDs go down or have requests blocked for long periods of time. I start back up the down OSDs and recovery eventually stops, but with 100s of incomplete and down+incomplete pgs remaining. The ceph web page says "If you see this state [incomplete], report a bug, and try to start any failed OSDs that may contain the needed information." Well, all the OSDs are up, though some have blocked requests. Also, the logs of the OSDs which go down have this message: 2014-11-02 21:46:33.615829 7ffcf0421700 0 -- 192.168.164.192:6810/31314 >> 192.168.164.186:6804/20934 pipe(0x2faa0280 sd=261 :6810 s=2 pgs=9 19 cs=25 l=0 c=0x2ed022c0).fault with nothing to send, going to standby 2014-11-02 21:49:11.440142 7ffce4cf3700 0 -- 192.168.164.192:6810/31314 >> 192.168.164.186:6804/20934 pipe(0xe512a00 sd=249 :6810 s=0 pgs=0 cs=0 l=0 c=0x2a308b00).accept connect_seq 26 vs existing 25 state standby 2014-11-02 21:51:20.085676 7ffcf6e3e700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7ffcf6e3e700 time 2014-11-02 21:51:20.052242 osd/PG.cc: 5424: FAILED assert(0 == "we got a bad state machine event") These failures are usually the result of adjusting tunables without having upgraded all the machines in the cluster — although they should also be fixed in v0.80.7. Are you still seeing crashes, or just the PG state issues? -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
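If it helps, a quick way to see which tunables the CRUSH map is currently using (command availability depends on your release):

    ceph osd crush show-tunables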
Re: [ceph-users] 0.87 rados df fault
On Mon, Nov 3, 2014 at 4:40 AM, Thomas Lemarchand thomas.lemarch...@cloud-solutions.fr wrote: Update : /var/log/kern.log.1:Oct 31 17:19:17 c-mon kernel: [17289149.746084] [21787] 0 21780 492110 185044 920 240143 0 ceph-mon /var/log/kern.log.1:Oct 31 17:19:17 c-mon kernel: [17289149.746115] [13136] 0 1313652172 1753 590 0 ceph /var/log/kern.log.1:Oct 31 17:19:17 c-mon kernel: [17289149.746126] Out of memory: Kill process 21787 (ceph-mon) score 827 or sacrifice child /var/log/kern.log.1:Oct 31 17:19:17 c-mon kernel: [17289149.746262] Killed process 21787 (ceph-mon) total-vm:1968440kB, anon-rss:740176kB, file-rss:0kB OOM kill. I have 1GB memory on my mons, and 1GB swap. It's the only mon that crashed. Is there a change in memory requirement from Firefly ? There generally shouldn't be, but I don't think it's something we monitored closely. More likely your monitor was running near its memory limit already and restarting all the OSDs (and servicing the resulting changes) pushed it over the edge. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem
[ Re-adding the list. ] On Mon, Nov 3, 2014 at 10:49 AM, Chad Seys cws...@physics.wisc.edu wrote: Next I executed 'ceph osd crush tunables optimal' to upgrade CRUSH mapping. Okay...you know that's a data movement command, right? Yes. So you should expect it to impact operations. These failures are usually the result of adjusting tunables without having upgraded all the machines in the cluster — although they should also be fixed in v0.80.7. Are you still seeing crashes, or just the PG state issues? Still getting crashes. I believe all nodes are running 0.80.7. Does ceph have a command to check this? (Otherwise I'll do an ssh-many to check.) There's a ceph osd metadata command, but I don't recall if it's in Firefly or only giant. :) Thanks! C. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
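Either of these avoids the ssh loop, with the caveat above about ceph osd metadata (the osd id is just an example):

    ceph tell osd.* version                    # ask every running OSD for its version
    ceph osd metadata 0 | grep ceph_version    # per-OSD, where supported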
Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem
Okay, assuming this is semi-predictable, can you start up one of the OSDs that is going to fail with debug osd = 20, debug filestore = 20, and debug ms = 1 in the config file and then put the OSD log somewhere accessible after it's crashed? Can you also verify that all of your monitors are running firefly, and then issue the command ceph scrub and report the output? -Greg On Mon, Nov 3, 2014 at 11:07 AM, Chad Seys cws...@physics.wisc.edu wrote: There's a ceph osd metadata command, but i don't recall if it's in Firefly or only giant. :) It's in firefly. Thanks, very handy. All the OSDs are running 0.80.7 at the moment. What next? Thanks again, Chad. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
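That is, roughly this in the ceph.conf on the affected OSD's host before restarting it (placement shown is illustrative):

    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1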
Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem
On Mon, Nov 3, 2014 at 11:41 AM, Chad Seys cws...@physics.wisc.edu wrote: On Monday, November 03, 2014 13:22:47 you wrote: Okay, assuming this is semi-predictable, can you start up one of the OSDs that is going to fail with debug osd = 20, debug filestore = 20, and debug ms = 1 in the config file and then put the OSD log somewhere accessible after it's crashed? Alas, I have not yet noticed a pattern. Only thing I think is true is that they go down when I first make CRUSH changes. Then after restarting, they run without going down again. All the OSDs are running at the moment. Oh, interesting. What CRUSH changes exactly are you making that are spawning errors? What I've been doing is marking OUT the OSDs on which a request is blocked, letting the PGs recover, (drain the OSD of PGs completely), then remove and readd the OSD. So far OSDs treated this way no longer have blocked requests. Also, seems as though that slowly decreases the number of incomplete and down+incomplete PGs . Can you also verify that all of your monitors are running firefly, and then issue the command ceph scrub and report the output? Sure, should I wait until the current rebalancing is finished? I don't think it should matter, although I confess I'm not sure how much monitor load the scrubbing adds. (It's a monitor check; doesn't hit the OSDs at all.) ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
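For the record, the drain-and-replace cycle described above usually looks roughly like this (the osd id is hypothetical; check each step against the docs for your release):

    ceph osd out 12                # let the PGs drain off the OSD
    # wait for recovery to finish, then remove it completely
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm 12
    # afterwards re-create the OSD and let it backfill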