Re: [ceph-users] Testing CephFS

2015-08-24 Thread Shinobu
I need to be more careful, but you're probably right -;

./net/ceph/messenger.c

Shinobu

On Mon, Aug 24, 2015 at 8:53 PM, Simon Hallam s...@pml.ac.uk wrote:

 The clients are:
 [root@gridnode50 ~]# uname -a
 Linux gridnode50 4.0.8-200.fc21.x86_64 #1 SMP Fri Jul 10 21:09:54 UTC 2015
 x86_64 x86_64 x86_64 GNU/Linux
 [root@gridnode50 ~]# ceph -v
 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)

 I don't think it is a reconnect timeout, as they don't even attempt to
 reconnect until I plug the Ethernet cable back into the original MDS?

 Cheers,

 Simon

  -Original Message-
  From: Yan, Zheng [mailto:z...@redhat.com]
  Sent: 24 August 2015 12:28
  To: Simon Hallam
  Cc: ceph-users@lists.ceph.com; Gregory Farnum
  Subject: Re: [ceph-users] Testing CephFS
 
 
   On Aug 24, 2015, at 18:38, Gregory Farnum gfar...@redhat.com wrote:
  
   On Mon, Aug 24, 2015 at 11:35 AM, Simon  Hallam s...@pml.ac.uk wrote:
   Hi Greg,
  
   The MDS' detect that the other one went down and started the replay.
  
   I did some further testing with 20 client machines. Of the 20 client
  machines, 5 hung with the following error:
  
   [Aug24 10:53] ceph: mds0 caps stale
   [Aug24 10:54] ceph: mds0 caps stale
   [Aug24 10:58] ceph: mds0 hung
   [Aug24 11:03] ceph: mds0 came back
   [  +8.803334] libceph: mon2 10.15.0.3:6789 socket closed (con state
 OPEN)
   [  +0.18] libceph: mon2 10.15.0.3:6789 session lost, hunting for
 new
  mon
   [Aug24 11:04] ceph: mds0 reconnect start
   [  +0.084938] libceph: mon2 10.15.0.3:6789 session established
   [  +0.008475] ceph: mds0 reconnect denied
  
   Oh, this might be a kernel bug, failing to ask for mdsmap updates when
   the connection goes away. Zheng, does that sound familiar?
   -Greg
 
   This seems like a reconnect timeout. You can try enlarging the
   mds_reconnect_timeout config option.
 
  Which version of kernel are you using?
 
  Yan, Zheng
 
  
  
   10.15.0.3 was the active MDS at the time I unplugged the Ethernet
 cable.
  
  
   This was the output of ceph -w as I ran the test (I've removed a lot
 of the
  pg remapping):
  
   2015-08-24 11:02:39.547529 mon.1 [INF] mon.ceph2 calling new monitor
  election
   2015-08-24 11:02:40.011995 mon.0 [INF] mon.ceph1 calling new monitor
  election
   2015-08-24 11:02:45.245869 mon.0 [INF] mon.ceph1@0 won leader
  election with quorum 0,1
   2015-08-24 11:02:45.257440 mon.0 [INF] HEALTH_WARN; 1 mons down,
  quorum 0,1 ceph1,ceph2
   2015-08-24 11:02:45.535369 mon.0 [INF] monmap e1: 3 mons at
  {ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0}
   2015-08-24 11:02:45.535444 mon.0 [INF] pgmap v15803: 8256 pgs: 8256
  active+clean; 1248 GB data, 2503 GB used, 193 TB / 196 TB avail; 47 B/s
 wr, 0
  op/s
   2015-08-24 11:02:45.535541 mon.0 [INF] mdsmap e38: 1/1/1 up
  {0=ceph3=up:active}, 2 up:standby
   2015-08-24 11:02:45.535629 mon.0 [INF] osdmap e197: 36 osds: 36 up, 36
  in
   2015-08-24 11:03:01.946397 mon.0 [INF] mdsmap e39: 1/1/1 up
  {0=ceph2=up:replay}, 1 up:standby
   2015-08-24 11:03:02.993880 mon.0 [INF] mds.0 10.15.0.2:6849/17644
  up:reconnect
   2015-08-24 11:03:02.993930 mon.0 [INF] mdsmap e40: 1/1/1 up
  {0=ceph2=up:reconnect}, 1 up:standby
   2015-08-24 11:03:51.461248 mon.0 [INF] mds.0 10.15.0.2:6849/17644
  up:rejoin
   2015-08-24 11:03:55.807131 mon.0 [INF] mds.0 10.15.0.2:6849/17644
  up:active
   2015-08-24 11:03:55.807195 mon.0 [INF] mdsmap e42: 1/1/1 up
  {0=ceph2=up:active}, 1 up:standby
   2015-08-24 11:06:48.036736 mon.0 [INF] mds.0 10.15.0.2:6849/17644
  up:active
   2015-08-24 11:06:48.036799 mon.0 [INF] mdsmap e43: 1/1/1 up
  {0=ceph2=up:active}, 1 up:standby
   *cable plugged back in*
   2015-08-24 11:13:13.230714 mon.0 [INF] osd.32 10.15.0.3:6832/11565
 boot
   2015-08-24 11:13:13.230765 mon.0 [INF] osdmap e212: 36 osds: 25 up, 25
  in
   2015-08-24 11:13:13.230809 mon.0 [INF] mds.? 10.15.0.3:6833/16993
  up:boot
   2015-08-24 11:13:13.230837 mon.0 [INF] mdsmap e47: 1/1/1 up
  {0=ceph2=up:active}, 2 up:standby
   2015-08-24 11:13:30.799429 mon.2 [INF] mon.ceph3 calling new monitor
  election
   2015-08-24 11:13:30.826158 mon.0 [INF] mon.ceph1 calling new monitor
  election
   2015-08-24 11:13:30.926331 mon.0 [INF] mon.ceph1@0 won leader
  election with quorum 0,1,2
   2015-08-24 11:13:30.968739 mon.0 [INF] mdsmap e47: 1/1/1 up
  {0=ceph2=up:active}, 2 up:standby
   2015-08-24 11:13:28.383203 mds.0 [INF] denied reconnect attempt (mds
 is
  up:active) from client.24155 10.10.10.95:0/3238635414 after 625.375507
  (allowed interval 45)
   2015-08-24 11:13:29.721653 mds.0 [INF] denied reconnect attempt (mds
 is
  up:active) from client.24146 10.10.10.99:0/3454703638 after 626.713952
  (allowed interval 45)
   2015-08-24 11:13:31.113004 mds.0 [INF] denied reconnect attempt (mds
 is
  up:active) from client.24140 10.10.10.60:0/359606080 after 628.105302
  (allowed interval 45)
   2015-08-24 11:13:50.933020 mds.0 [INF] denied reconnect attempt (mds
 is
  up:active) from 

[ceph-users] Ceph for multi-site operation

2015-08-24 Thread Julien Escario
Hello,
First, let me say up front that I'm really a noob with Ceph, since I have only
read some documentation.

I'm now trying to deploy a Ceph cluster for testing purposes. The cluster is
based on 3 (more if necessary) hypervisors running proxmox 3.4.

Before going further, I have an essential question: is Ceph usable for
multi-site storage?

Long story :
My goal is to run hypervisors on 2 datacenters separated by 4ms latency.
Bandwidth is currently 1Gbps but will be upgraded in the near future.

So is it possible to run an active/active Ceph cluster to get shared storage
between the two sites? Of course, I'll have to be sure that no machine is
running on both sites at the same time. The hypervisor will be in charge of this.

Is there a way to ask Ceph to keep at least one copy (or two) in each site, and
to serve all block reads from the nearest location?
I'm aware that writes would have to be replicated and there's only a synchronous
mode for this.

I've read a lot of documentation and use cases about Ceph, and it seems some say
it can be used for this kind of replication while others say it cannot. Whether
erasure coding is needed isn't clear either.

Just hoping my english is clear enough to explain my case ;-)

Thanks for your help,
Julien Escario





[ceph-users] Opensource plugin for pulling out cluster recovery and client IO metric

2015-08-24 Thread Vickey Singh
Hello Ceph Geeks

I am planning to develop a Python plugin that pulls out cluster *recovery
IO* and *client IO* operation metrics, which can then be used with
collectd.

*For example, I need to extract these values:*

*recovery io 814 MB/s, 101 objects/s*
*client io 85475 kB/s rd, 1430 kB/s wr, 32 op/s*


Could you please help me understand how the *ceph -s* and *ceph -w*
outputs *print cluster recovery IO and client IO information*?
Where does this information come from? *Is it coming from perf dump*? If
yes, which section of the perf dump output should I focus on? If not,
how can I get these values?

I tried *ceph --admin-daemon /var/run/ceph/ceph-osd.48.asok perf dump*,
but it generates a huge amount of information and I am confused about which
section of the output I should use.
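
For reference, this is roughly the kind of thing I have in mind - just a sketch
that assumes the rates shown by *ceph -s* come from the pgmap section of
*ceph status --format json*; the field names are a guess and only appear when
there is actual client or recovery activity, so they probably need adjusting:

#!/usr/bin/env python
# Rough sketch of the collector I have in mind. It assumes the cluster-wide
# rates printed by "ceph -s" are exposed in the "pgmap" section of
# "ceph status --format json"; the exact field names may differ per release.
import json
import subprocess

status = json.loads(subprocess.check_output(["ceph", "status", "--format", "json"]))
pgmap = status.get("pgmap", {})

client_io = (pgmap.get("read_bytes_sec", 0),
             pgmap.get("write_bytes_sec", 0),
             pgmap.get("op_per_sec", 0))
recovery_io = (pgmap.get("recovering_bytes_per_sec", 0),
               pgmap.get("recovering_objects_per_sec", 0))

print("client io: %d B/s rd, %d B/s wr, %d op/s" % client_io)
print("recovery io: %d B/s, %d objects/s" % recovery_io)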


Please help

Thanks in advance
Vickey


[ceph-users] radosgw secret_key

2015-08-24 Thread Luis Periquito
When I create a new user using radosgw-admin most of the time the secret
key gets escaped with a backslash, making it not work. Something like
secret_key: xx\/\/.

Why would the / need to be escaped? Why is it printing the \/ instead
of / that does work?

Usually I just remove the backslash and it works fine. I've seen this on
several different clusters.

Is it just me?

This may require opening a bug in the tracking tool, but just asking here
first.
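
(For what it's worth, \/ is a valid JSON escape for /, so letting a JSON parser
decode the output avoids the hand-editing entirely. A rough sketch, assuming the
usual user info layout with a "keys" list:)

#!/usr/bin/env python
# Rough sketch: read the secret key through a JSON parser instead of copying it
# by eye, so the \/ escape sequence is decoded back to a plain /.
# Assumes "radosgw-admin user info" returns the usual layout with a "keys" list.
import json
import subprocess
import sys

uid = sys.argv[1]  # e.g. "johndoe" (hypothetical user id)
info = json.loads(subprocess.check_output(["radosgw-admin", "user", "info", "--uid", uid]))
for key in info.get("keys", []):
    print("%s  access_key=%s  secret_key=%s"
          % (key.get("user"), key.get("access_key"), key.get("secret_key")))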


Re: [ceph-users] Ceph for multi-site operation

2015-08-24 Thread Lionel Bouton
On 24/08/2015 15:11, Julien Escario wrote:
 Hello,
 First, let me say up front that I'm really a noob with Ceph, since I have only
 read some documentation.

 I'm now trying to deploy a Ceph cluster for testing purposes. The cluster is
 based on 3 (more if necessary) hypervisors running proxmox 3.4.

 Before going further, I have an essential question: is Ceph usable for
 multi-site storage?

It depends on what you really need it to do (access patterns and
behaviour when a link goes down).


 Long story :
 My goal is to run hypervisors on 2 datacenters separated by 4ms latency.

Note : unless you are studying Ceph behaviour in this case this goal is
in fact a method to reach a goal. If you describe the actual goal you
might get different suggestions.

 Bandwidth is currently 1Gbps but will be upgraded in the near future.

 So is it possible to run an active/active Ceph cluster to get shared storage
 between the two sites?

It is but it probably won't behave correctly in your case. The latency
and the bandwidth will hurt a lot. Any application requiring that data
is confirmed stored on disk will be hit by the 4ms latency and 1Gbps
will have to be shared between inter-site replication traffic and
regular VM disk accesses. Your storage will most probably behave like a
very slow single hard drive shared between all your VMs.
Some workloads might work correctly (if you don't have any significant
writes and most of your data will fit in caches for example).

When the link between your 2 datacenters is severed, in the worst case
(no quorum reachable, or a crushmap that won't allow each PG to reach
min_size with only one datacenter) everything will freeze. In the best
case (giving priority to a single datacenter by running more monitors on
it and a crushmap storing at least min_size replicas on it), when the
link goes down everything will keep running on that datacenter.

You can get around a part of the performance problems by going with a
3-way replication, 2 replicas on your primary datacenter and 1 on the
secondary where all OSD are configured with primary affinity 0. All
reads will be served from the primary datacenter and only writes would
go to the secondary. You'll have to run all your VMs on the primary
datacenter and setup your monitors such that the elected master will be
in the primary datacenter (I believe it is chosen by the first name
according to alphabetical order). You'll have a copy of your data on the
secondary datacenter in case of a disaster on the primary but recovering
will be hard (you'll have to reach a quorum of monitors in the secondary
datacenter and I'm not sure how to proceed if you only have one out of 3
for example).
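
As a rough illustration only (the host names below are made up, and setting a
non-default primary affinity may first need "mon osd allow primary affinity =
true" on the monitors), zeroing the primary affinity of every OSD hosted in the
secondary datacenter could look like this:

#!/usr/bin/env python
# Illustration only: set primary affinity to 0 for every OSD that sits under
# the hosts of the secondary datacenter, so reads get served by the primary
# site. Host names are hypothetical; adapt to your own CRUSH tree.
import json
import subprocess

SECONDARY_HOSTS = {"dc2-host1", "dc2-host2"}  # hypothetical

tree = json.loads(subprocess.check_output(["ceph", "osd", "tree", "--format", "json"]))
nodes_by_id = {n["id"]: n for n in tree["nodes"]}

for node in tree["nodes"]:
    if node.get("type") == "host" and node.get("name") in SECONDARY_HOSTS:
        for child_id in node.get("children", []):
            child = nodes_by_id.get(child_id)
            if child and child.get("type") == "osd":
                subprocess.check_call(
                    ["ceph", "osd", "primary-affinity", child["name"], "0"])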


  Of course, I'll have to be sure that no machine is
 running on both sites at the same time.

With your bandwidth and latency, and without knowing more about your
workloads, it's probable that running VMs on both sites will get you very
slow IO. Multi-datacenter setups for simple object storage using RGW seem to
work, but RBD volume accesses are usually more demanding.

  Hypervisor will be in charge of this.

 Is there a way to ask Ceph to keep at least one copy (or two) in each site, and
 to serve all block reads from the nearest location?
 I'm aware that writes would have to be replicated and there's only a
 synchronous mode for this.

 I've read a lot of documentation and use cases about Ceph, and it seems some
 say it can be used for this kind of replication while others say it cannot.
 Whether erasure coding is needed isn't clear either.

Don't use erasure coding for RBD volumes. You'd need a cache tier in front,
which seems tricky to get right and might not be fully tested (I've seen a
snapshot bug discussed here last week).

Best regards,

Lionel


Re: [ceph-users] TRIM / DISCARD run at low priority by the OSDs?

2015-08-24 Thread Alexandre DERUMIER
Hi,

I'm not sure about krbd, but with librbd, using trim/discard on the client
doesn't issue trim/discard on the OSD's physical disk.

It simply writes zeroes into the RBD image.

Zero writes can be skipped since this commit (librbd-related):
https://github.com/xiaoxichen/ceph/commit/e7812b8416012141cf8faef577e7b27e1b29d5e3
+OPTION(rbd_skip_partial_discard, OPT_BOOL, false)


You can still run fstrim manually on the OSD servers.
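
For example, something like this run periodically (from cron) on each OSD host -
just a sketch, assuming the default /var/lib/ceph/osd/ceph-* data paths and
disks/filesystems that actually support discard:

#!/usr/bin/env python
# Sketch: trim each OSD data filesystem at idle I/O priority, locally on the
# OSD host. Assumes the default /var/lib/ceph/osd/ceph-* mount points.
import glob
import subprocess

for mount in sorted(glob.glob("/var/lib/ceph/osd/ceph-*")):
    # "-c 3" is the idle ionice class (the numeric form of "-c Idle")
    subprocess.call(["ionice", "-c", "3", "fstrim", "-v", mount])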

- Original Message -
From: Chad William Seys cws...@physics.wisc.edu
To: ceph-users ceph-us...@ceph.com
Sent: Saturday, 22 August 2015 04:26:38
Subject: [ceph-users] TRIM / DISCARD run at low priority by the OSDs?

Hi All, 

Is it possible to give TRIM / DISCARD initiated by krbd low priority on the 
OSDs? 

I know it is possible to run fstrim at Idle priority on the rbd mount point, 
e.g. ionice -c Idle fstrim -v $MOUNT . 

But this Idle priority (it appears) only applies within the context of the node
executing fstrim. Even if the node executing fstrim is idle, the OSDs are
very busy and performance suffers.

Is it possible to tell the OSD daemons (or whatever) to perform the TRIMs at 
low priority also? 

Thanks! 
Chad. 


Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs

2015-08-24 Thread Alex Gorbachev
 This can be tuned in the iSCSI initiation on VMware - look in advanced 
 settings on your ESX hosts (at least if you use the software initiator).

Thanks, Jan. I asked this question of VMware as well. I think the
problem is specific to a given iSCSI session, so I'm wondering if that's
strictly the job of the target?  Do you know of any specific SCSI
settings that mitigate this kind of issue?  Basically, give up on a
session, terminate it and start a new one should an RBD not
respond?

As I understand, RBD simply never gives up.  If an OSD does not
respond but is still technically up and in, Ceph will retry IOs
forever.  I think RBD and Ceph need a timeout mechanism for this.

Best regards,
Alex

 Jan


 On 23 Aug 2015, at 21:28, Nick Fisk n...@fisk.me.uk wrote:

 Hi Alex,

 Currently RBD+LIO+ESX is broken.

 The problem is caused by the RBD device not handling device aborts properly
 causing LIO and ESXi to enter a death spiral together.

 If something in the Ceph cluster causes an IO to take longer than 10
 seconds(I think!!!) ESXi submits an iSCSI abort message. Once this happens,
 as you have seen it never recovers.

 Mike Christie from Redhat is doing a lot of work on this currently, so
 hopefully in the future there will be a direct RBD interface into LIO and it
 will all work much better.

 Either tgt or SCST seem to be pretty stable in testing.

 Nick

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Alex Gorbachev
 Sent: 23 August 2015 02:17
 To: ceph-users ceph-users@lists.ceph.com
 Subject: [ceph-users] Slow responding OSDs are not OUTed and cause RBD
 client IO hangs

 Hello, this is an issue we have been suffering from and researching along
 with a good number of other Ceph users, as evidenced by the recent posts.
 In our specific case, these issues manifest themselves in an RBD -> iSCSI LIO ->
 ESXi configuration, but the problem is more general.

 When there is an issue on OSD nodes (examples: network hangs/blips, disk
 HBAs failing, driver issues, page cache/XFS issues), some OSDs respond
 slowly or with significant delays.  ceph osd perf does not show this,
 neither
 does ceph osd tree, ceph -s / ceph -w.  Instead, the RBD IO hangs to a
 point
 where the client times out, crashes or displays other unsavory behavior -
 operationally this crashes production processes.

 Today in our lab we had a disk controller issue, which brought an OSD node
 down.  Upon restart, the OSDs started up and rejoined into the cluster.
 However, immediately all IOs started hanging for a long time and aborts from
 ESXi -> LIO were not succeeding in canceling these IOs.  The only warning I
 could see was:

 root@lab2-mon1:/var/log/ceph# ceph health detail
 HEALTH_WARN 30 requests are blocked > 32 sec; 1 osds have slow requests
 30 ops are blocked > 2097.15 sec
 30 ops are blocked > 2097.15 sec on osd.4
 1 osds have slow requests

 However, ceph osd perf is not showing high latency on osd 4:

 root@lab2-mon1:/var/log/ceph# ceph osd perf
 osd fs_commit_latency(ms) fs_apply_latency(ms)
  0 0   13
  1 00
  2 00
  3   172  208
  4 00
  5 00
  6 01
  7 00
  8   174  819
  9 6   10
 10 01
 11 01
 12 35
 13 01
 14 7   23
 15 01
 16 00
 17 59
 18 01
 1910   18
 20 00
 21 00
 22 01
 23 5   10

 SMART state for the osd 4 disk is OK.  The OSD is up and in:

 root@lab2-mon1:/var/log/ceph# ceph osd tree
 ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -80 root ssd
 -7 14.71997 root platter
 -3  7.12000 host croc3
 22  0.89000 osd.22  up  1.0  1.0
 15  0.89000 osd.15  up  1.0  1.0
 16  0.89000 osd.16  up  1.0  1.0
 13  0.89000 osd.13  up  1.0  1.0
 18  0.89000 osd.18  up  1.0  1.0
 8  0.89000 osd.8   up  1.0  1.0
 11  0.89000 osd.11  up  1.0  1.0
 20  0.89000 osd.20  up  1.0  1.0
 -4  0.47998 host croc2
 10  0.06000 

Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs

2015-08-24 Thread Jan Schermer
I never actually set up iSCSI with VMware, I just had to research various 
VMware storage options when we had a SAN problem at a former job... But I can 
take a look at it again if you want me to.

Is it really deadlocked when this issue occurs?
What I think is partly responsible for this situation is that the iSCSI LUN 
queues fill up and that's what actually kills your IO - VMware lowers queue 
depth to 1 in that situation and it can take a really long time to recover 
(especially if one of the LUNs on the target constantly has problems, or when 
heavy IO hammers the adapter) - you should never fill this queue, ever.
iSCSI is likely an innocent victim in the chain, not the cause of the issues.

Ceph should gracefully handle all those situations; you just need to set the 
timeouts right. I have it set so that whatever happens, an OSD can only delay 
work for 40s before it is marked down - at that moment all IO starts flowing 
again.

You should take this to VMware support; they should be able to tell whether the 
problem is in the iSCSI target (then you can take a look at how that behaves) or in 
the initiator settings. Though in my experience, after two visits from their 
foremost experts I had to google everything myself because they were clueless 
- YMMV.

The root cause, however, is slow ops in Ceph, and I have no idea why you'd have 
them if the OSDs come back up - maybe one of them is really deadlocked or 
backlogged in some way? I found that when OSDs are dead but up they don't 
respond to ceph tell osd.xxx ..., so check whether they all respond in a timely 
manner; that should help pinpoint the bugger.
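
Something along these lines works as a quick check (just a sketch; it gives each
OSD a few seconds to answer a trivial ceph tell):

#!/usr/bin/env python
# Quick check (sketch): ask every OSD for its version and flag the ones that do
# not answer within a few seconds - a dead-but-up OSD will typically just hang.
import os
import subprocess

devnull = open(os.devnull, "w")
for osd in subprocess.check_output(["ceph", "osd", "ls"]).decode().split():
    rc = subprocess.call(["timeout", "10", "ceph", "tell", "osd.%s" % osd, "version"],
                         stdout=devnull, stderr=devnull)
    print("osd.%s %s" % (osd, "ok" if rc == 0 else "NOT responding (rc=%d)" % rc))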

Jan


 On 24 Aug 2015, at 18:26, Alex Gorbachev a...@iss-integration.com wrote:
 
 This can be tuned in the iSCSI initiation on VMware - look in advanced 
 settings on your ESX hosts (at least if you use the software initiator).
 
 Thanks, Jan. I asked this question of Vmware as well, I think the
 problem is specific to a given iSCSI session, so wondering if that's
 strictly the job of the target?  Do you know of any specific SCSI
 settings that mitigate this kind of issue?  Basically, give up on a
 session and terminate it and start a new one should an RBD not
 respond?
 
 As I understand, RBD simply never gives up.  If an OSD does not
 respond but is still technically up and in, Ceph will retry IOs
 forever.  I think RBD and Ceph need a timeout mechanism for this.
 
 Best regards,
 Alex
 
 Jan
 
 
 On 23 Aug 2015, at 21:28, Nick Fisk n...@fisk.me.uk wrote:
 
 Hi Alex,
 
 Currently RBD+LIO+ESX is broken.
 
 The problem is caused by the RBD device not handling device aborts properly
 causing LIO and ESXi to enter a death spiral together.
 
 If something in the Ceph cluster causes an IO to take longer than 10
 seconds(I think!!!) ESXi submits an iSCSI abort message. Once this happens,
 as you have seen it never recovers.
 
 Mike Christie from Redhat is doing a lot of work on this currently, so
 hopefully in the future there will be a direct RBD interface into LIO and it
 will all work much better.
 
 Either tgt or SCST seem to be pretty stable in testing.
 
 Nick
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Alex Gorbachev
 Sent: 23 August 2015 02:17
 To: ceph-users ceph-users@lists.ceph.com
 Subject: [ceph-users] Slow responding OSDs are not OUTed and cause RBD
 client IO hangs
 
 Hello, this is an issue we have been suffering from and researching along
 with a good number of other Ceph users, as evidenced by the recent posts.
 In our specific case, these issues manifest themselves in an RBD -> iSCSI LIO ->
 ESXi configuration, but the problem is more general.
 
 When there is an issue on OSD nodes (examples: network hangs/blips, disk
 HBAs failing, driver issues, page cache/XFS issues), some OSDs respond
 slowly or with significant delays.  ceph osd perf does not show this,
 neither
 does ceph osd tree, ceph -s / ceph -w.  Instead, the RBD IO hangs to a
 point
 where the client times out, crashes or displays other unsavory behavior -
 operationally this crashes production processes.
 
 Today in our lab we had a disk controller issue, which brought an OSD node
 down.  Upon restart, the OSDs started up and rejoined into the cluster.
 However, immediately all IOs started hanging for a long time and aborts from
 ESXi -> LIO were not succeeding in canceling these IOs.  The only warning I
 could see was:
 
 root@lab2-mon1:/var/log/ceph# ceph health detail
 HEALTH_WARN 30 requests are blocked > 32 sec; 1 osds have slow requests
 30 ops are blocked > 2097.15 sec
 30 ops are blocked > 2097.15 sec on osd.4
 1 osds have slow requests
 
 However, ceph osd perf is not showing high latency on osd 4:
 
 root@lab2-mon1:/var/log/ceph# ceph osd perf
 osd fs_commit_latency(ms) fs_apply_latency(ms)
 0 0   13
 1 00
 2 00
 3   172   

[ceph-users] rbd du

2015-08-24 Thread Allen Liao
Hi all,

The online manual (http://ceph.com/docs/master/man/8/rbd/) for rbd has
documentation for the 'du' command.  I'm running ceph 0.94.2 and that
command isn't recognized, nor is it in the man page.

Is there another command that will calculate the provisioned and actual
disk usage of all images and associated snapshots within the specified
pool?


Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs

2015-08-24 Thread Alex Gorbachev
Hi Jan,

On Mon, Aug 24, 2015 at 12:40 PM, Jan Schermer j...@schermer.cz wrote:
 I never actually set up iSCSI with VMware, I just had to research various 
 VMware storage options when we had a SAN problem at a former job... But I can 
 take a look at it again if you want me to.

Thank you, I don't want to waste your time, as I have asked VMware TAP
to research that - I will report back whatever they respond with.


 Is it really deadlocked when this issue occurs?
 What I think is partly responsible for this situation is that the iSCSI LUN 
 queues fill up and that's what actually kills your IO - VMware lowers queue 
 depth to 1 in that situation and it can take a really long time to recover 
 (especially if one of the LUNs  on the target constantly has problems, or 
 when heavy IO hammers the adapter) - you should never fill this queue, ever.
 iSCSI will likely be innocent victim in the chain, not the cause of the 
 issues.

Completely agreed, so iSCSI's job then is to properly communicate to
the initiator that it cannot do what it is asked to do and quit the
IO.


 Ceph should gracefully handle all those situations, you just need to set the 
 timeouts right. I have it set so that whatever happens the OSD can only delay 
 work for 40s and then it is marked down - at that moment all IO start flowing 
 again.

What setting in Ceph do you use to do that?  Is it
mon_osd_down_out_interval?  I think stopping slow OSDs is the answer
to the root of the problem - so far I only know to run ceph osd perf
and look at latencies.


 You should take this to VMware support, they should be able to tell whether 
 the problem is in iSCSI target (then you can take a look at how that behaves) 
 or in the initiator settings. Though in my experience after two visits from 
 their foremost experts I had to google everything myself because they were 
 clueless - YMMV.

I am hoping the TAP Elite team can do better...but we'll see...


 The root cause is however slow ops in Ceph, and I have no idea why you'd have 
 them if the OSDs come back up - maybe one of them is really deadlocked or 
 backlogged in some way? I found that when OSDs are dead but up they don't 
 respond to ceph tell osd.xxx ... so try if they all respond in a timely 
 manner, that should help pinpoint the bugger.

I think I know in this case - there are some PCIe AER/bus errors and
TLP header messages strewn across the console of one OSD machine, and
ceph osd perf shows latencies above a second per OSD, but only when
IO is done to those OSDs.  I am thankful this is not production
storage, but worried about this situation in production - the OSDs are
staying up and in, but their latencies are slowing cluster-wide IO to a
crawl.  I am trying to envision this situation in production and how
one would find out what is slowing everything down without guessing.
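
(One possible approach, sketched below: poll ceph osd perf in JSON form and flag
OSDs whose latencies stay above a threshold while IO is flowing. The field names
are from memory and may need adjusting against ceph osd perf -f json on your
version.)

#!/usr/bin/env python
# Sketch: flag OSDs whose reported commit/apply latencies exceed a threshold,
# instead of eyeballing "ceph osd perf". The JSON field names below may differ
# between releases - inspect "ceph osd perf -f json" output first.
import json
import subprocess

THRESHOLD_MS = 500  # arbitrary

perf = json.loads(subprocess.check_output(["ceph", "osd", "perf", "-f", "json"]))
for entry in perf.get("osd_perf_infos", []):
    stats = entry.get("perf_stats", {})
    commit_ms = stats.get("commit_latency_ms", 0)
    apply_ms = stats.get("apply_latency_ms", 0)
    if commit_ms > THRESHOLD_MS or apply_ms > THRESHOLD_MS:
        print("osd.%s: commit %s ms, apply %s ms" % (entry.get("id"), commit_ms, apply_ms))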

Regards,
Alex



 Jan


 On 24 Aug 2015, at 18:26, Alex Gorbachev a...@iss-integration.com wrote:

 This can be tuned in the iSCSI initiation on VMware - look in advanced 
 settings on your ESX hosts (at least if you use the software initiator).

 Thanks, Jan. I asked this question of Vmware as well, I think the
 problem is specific to a given iSCSI session, so wondering if that's
 strictly the job of the target?  Do you know of any specific SCSI
 settings that mitigate this kind of issue?  Basically, give up on a
 session and terminate it and start a new one should an RBD not
 respond?

 As I understand, RBD simply never gives up.  If an OSD does not
 respond but is still technically up and in, Ceph will retry IOs
 forever.  I think RBD and Ceph need a timeout mechanism for this.

 Best regards,
 Alex

 Jan


 On 23 Aug 2015, at 21:28, Nick Fisk n...@fisk.me.uk wrote:

 Hi Alex,

 Currently RBD+LIO+ESX is broken.

 The problem is caused by the RBD device not handling device aborts properly
 causing LIO and ESXi to enter a death spiral together.

 If something in the Ceph cluster causes an IO to take longer than 10
 seconds(I think!!!) ESXi submits an iSCSI abort message. Once this happens,
 as you have seen it never recovers.

 Mike Christie from Redhat is doing a lot of work on this currently, so
 hopefully in the future there will be a direct RBD interface into LIO and 
 it
 will all work much better.

 Either tgt or SCST seem to be pretty stable in testing.

 Nick

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Alex Gorbachev
 Sent: 23 August 2015 02:17
 To: ceph-users ceph-users@lists.ceph.com
 Subject: [ceph-users] Slow responding OSDs are not OUTed and cause RBD
 client IO hangs

 Hello, this is an issue we have been suffering from and researching along
 with a good number of other Ceph users, as evidenced by the recent posts.
 In our specific case, these issues manifest themselves in an RBD -> iSCSI LIO ->
 ESXi configuration, but the problem is more general.

 When there is an issue on OSD nodes (examples: network hangs/blips, disk
 HBAs 

[ceph-users] EXT4 for Production and Journal Question?

2015-08-24 Thread Robert LeBlanc
Building off a discussion earlier this month [1], how supported is
EXT4 for OSDs? It seems that some people are getting good results with
it and I'll be testing it in our environment.

The other question is if the EXT4 journal is even necessary if you are
using Ceph SSD journals. My thoughts are thus: Incoming I/O is written
to the SSD journal. The journal then flushes to the EXT4 partition.
Only after the write is completed (I understand that this is a direct
sync write) does Ceph free the SSD journal entry.

Doesn't this provide the same reliability as the EXT4 journal? If an
OSD crashed in the middle of the write with no EXT4 journal, the file
system would be repaired and then Ceph would rewrite the last
transaction that didn't complete? I'm sure I'm missing something
here...

Thanks,


[1] http://www.spinics.net/lists/ceph-users/msg20839.html

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


[ceph-users] v9.0.3 released

2015-08-24 Thread Sage Weil
This is the second to last batch of development work for the Infernalis 
cycle.  The most intrusive change is an internal (non user-visible) change 
to the OSD's ObjectStore interface.  Many fixes and improvements elsewhere 
across RGW, RBD, and another big pile of CephFS scrub/repair improvements.

Upgrading
-

* The return code for librbd's rbd_aio_read and Image::aio_read API methods no
  longer returns the number of bytes read upon success.  Instead, it 
  returns 0 upon success and a negative value upon failure.

* 'ceph scrub', 'ceph compact' and 'ceph sync force' are now DEPRECATED.  
  Users should instead use 'ceph mon scrub', 'ceph mon compact' and
  'ceph mon sync force'.

* 'ceph mon_metadata' should now be used as 'ceph mon metadata'. There is no
  need to deprecate this command (same major release since it was first
  introduced).

* The `--dump-json` option of osdmaptool is replaced by `--dump json`.

* The commands of pg ls-by-{pool,primary,osd} and pg ls now take 
  recovering instead of recovery, to include the recovering pgs in the 
  listed pgs.


Notable Changes
---

  * autotools: fix out of tree build (Krxysztof Kosinski)
  * autotools: improve make check output (Loic Dachary)
  * buffer: add invalidate_crc() (Piotr Dalek)
  * buffer: fix zero bug (#12252 Haomai Wang)
  * build: fix junit detection on Fedora 22 (Ira Cooper)
  * ceph-disk: install pip > 6.1 (#11952 Loic Dachary)
  * cephfs-data-scan: many additions, improvements (John Spray)
  * ceph: improve error output for 'tell' (#11101 Kefu Chai)
  * ceph-objectstore-tool: misc improvements (David Zafman)
  * ceph-objectstore-tool: refactoring and cleanup (John Spray)
  * ceph_test_rados: test pipelined reads (Zhiqiang Wang)
  * common: fix bit_vector extent calc (#12611 Jason Dillaman)
  * common: make work queue addition/removal thread safe (#12662 Jason 
Dillaman)
  * common: optracker improvements (Zhiqiang Wang, Jianpeng Ma)
  * crush: add --check to validate dangling names, max osd id (Kefu Chai)
  * crush: cleanup, sync with kernel (Ilya Dryomov)
  * crush: fix subtree base weight on adjust_subtree_weight (#11855 Sage 
Weil)
  * crypo: fix NSS leak (Jason Dillaman)
  * crypto: fix unbalanced init/shutdown (#12598 Zheng Yan)
  * doc: misc updates (Kefu Chai, Owen Synge, Gael Fenet-Garde, Loic 
Dachary, Yannick Atchy-Dalama, Jiaying Ren, Kevin Caradant, Robert 
Maxime, Nicolas Yong, Germain Chipaux, Arthur Gorjux, Gabriel Sentucq, 
Clement Lebrun, Jean-Remi Deveaux, Clair Massot, Robin Tang, Thomas 
Laumondais, Jordan Dorne, Yuan Zhou, Valentin Thomas, Pierre Chaumont, 
Benjamin Troquereau, Benjamin Sesia, Vikhyat Umrao)
  * erasure-code: cleanup (Kefu Chai)
  * erasure-code: improve tests (Loic Dachary)
  * erasure-code: shec: fix recovery bugs (Takanori Nakao, Shotaro 
Kawaguchi)
  * libcephfs: add pread, pwrite (Jevon Qiao)
  * libcephfs,ceph-fuse: cache cleanup (Zheng Yan)
  * librados: add src_fadvise_flags for copy-from (Jianpeng Ma)
  * librados: respect default_crush_ruleset on pool_create (#11640 Yuan 
Zhou)
  * librbd: fadvise for copy, export, import (Jianpeng Ma)
  * librbd: handle NOCACHE fadvise flag (Jinapeng Ma)
  * librbd: optionally disable allocation hint (Haomai Wang)
  * librbd: prevent race between resize requests (#12664 Jason Dillaman)
  * log: fix data corruption race resulting from log rotation (#12465 
Samuel Just)
  * mds: expose frags via asok (John Spray)
  * mds: fix setting entire file layout in one setxattr (John Spray)
  * mds: fix shutdown (John Spray)
  * mds: handle misc corruption issues (John Spray)
  * mds: misc fixes (Jianpeng Ma, Dan van der Ster, Zhang Zhi)
  * mds: misc snap fixes (Zheng Yan)
  * mds: store layout on header object (#4161 John Spray)
  * misc performance and cleanup (Nathan Cutler, Xinxin Shu)
  * mon: add NOFORWARD, OBSOLETE, DEPRECATE flags for mon commands (Joao 
Eduardo Luis)
  * mon: add PG count to 'ceph osd df' output (Michal Jarzabek)
  * mon: clean up, reorg some mon commands (Joao Eduardo Luis)
  * mon: disallow >2 tiers (#11840 Kefu Chai)
  * mon: fix log dump crash when debugging (Mykola Golub)
  * mon: fix metadata update race (Mykola Golub)
  * mon: fix refresh (#11470 Joao Eduardo Luis)
  * mon: make blocked op messages more readable (Jianpeng Ma)
  * mon: only send mon metadata to supporting peers (Sage Weil)
  * mon: periodic background scrub (Joao Eduardo Luis)
  * mon: prevent pgp_num > pg_num (#12025 Xinxin Shu)
  * mon: reject large max_mds values (#1 John Spray)
  * msgr: add ceph_perf_msgr tool (Hoamai Wang)
  * msgr: async: fix seq handling (Haomai Wang)
  * msgr: xio: fastpath improvements (Raju Kurunkad)
  * msgr: xio: sync with accellio v1.4 (Vu Pham)
  * osd: clean up temp object if promotion fails (Jianpeng Ma)
  * osd: constrain collections to meta and PGs (normal and temp) (Sage 
Weil)
  * osd: filestore: clone using splice (Jianpeng Ma)
  * osd: filestore: fix 

Re: [ceph-users] OSD GHz vs. Cores Question

2015-08-24 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Thanks to all the responses. There has been more to think about which
is what I was looking for.

We have MySQL running on this cluster so we will have some VMs with
fairly low queue depths. Our Ops teams are not excited about
unplugging cables and pulling servers to replace fixed disks, so we
are looking at hot swap options.

I'll try and do some testing in our lab, but I won't be able to get a
very good spread of data due to clock and core limitations in the
existing hardware.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sat, Aug 22, 2015 at 2:42 PM, Luis Periquito  wrote:
 I've been meaning to write an email about the experience we had at the
 company I work for. In lieu of a more complete write-up I'll just share some of
 the findings. Please note these are my experiences, and are correct for my
 environment. The clients are running on openstack, and all servers are
 trusty. Tests were made with Hammer (0.94.2).

 TLDR: if performance is your objective buy 1S boxes with high frequency,
 good journal SSDs, and not many SSDs. Also change the CPU governor to
 performance mode instead of the default ondemand. And don't forget 10Gig is a must.
 Replicated pools are also a must for performance.
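
 (For reference, switching the governor is just a write to sysfs - a minimal
 sketch, assuming the usual cpufreq layout and root privileges:)

 #!/usr/bin/env python
 # Minimal sketch: switch every core from the "ondemand" governor to
 # "performance". Assumes the usual sysfs cpufreq layout; needs root.
 import glob

 for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"):
     with open(path, "w") as f:
         f.write("performance")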

 We wanted to have a small cluster (30TB RAW), performance was important
 (IOPS and latency), network was designed to be 10G copper with BGP attached
 hosts. There was complete leeway in design and some in budget.

 Starting with the network that required us to only create a single network,
 but both links are usable - iperf between boxes is usually around
 17-19Gbits.

 We could choose the nodes, we evaluated dual cpu and single cpu nodes. The
 dual cpus would have 24 2.5'' drive bays on a 2U chassis whereas the single
 were 8 2.5'' drive bays on a 1U chassis. Long story short we chose the
 single cpu (E3 1241 v3). On the CPU all the tests we did with the scaling
 governors shown that performance would give us a 30-50% boost in IOPS.
 Latency also improved but not by much. The downside was that each system
 increased power usage by 5W (!?).

 For the difference in price (£80) we bought the boxes with 32G of ram.

 As for the disks, as we wanted fast IO we had to go with SSDs. Due to the
 budget we had we went with 4x Samsung 850 PRO + 1x Intel S3710 200G. We also
 tested the P3600, but one of the critical IO clients had far worse
 performance with it. From benchmarking the write performance is that of the
 Intel SSD. We made tests with Intel SSD with journal + different Intel SSD
 with data and performance was within margin for error the same that Intel
 SSD for journal + Samsung SSD for data. Single SSD performance was slightly
 lower with either one (around 10%).

 From what I've seen: on very big sequential read and write I can get up to
 700-800 MBps. On random IO (8k, random writes, reads or mixed workloads) we
 still haven't finished all the tests, but so far it indicates the SSDs are
 the bottleneck on the writes, and ceph latency on the reads. However we've
 been able to extract 400 MBps read IO with 4 clients, each doing 32 threads.
 I don't have the numbers here but that represents around 50k IOPS out of a
 smallish cluster.

 Stuff we still have to do revolves around jemalloc vs tcmalloc - trusty has
 the bug on the thread cache bytes variable. Also we still have to test
 various tunable options, like threads, caches, etc...

 Hope this helps.


 On Sat, Aug 22, 2015 at 4:45 PM, Nick Fisk  wrote:

 Another thing that is probably worth considering is the practical side as
 well. A lot of the Xeon E5 boards tend to have more SAS/SATA ports and
 onboard 10GB, this can make quite a difference to the overall cost of the
 solution if you need to buy extra PCI-E cards.

 Unless I've missed one, I've not spotted a Xeon-D board with a large
 amount
 of onboard sata/sas ports. Please let me know if such a system exists as I
 would be very interested.

 We settled on the Hadoop version of the Supermicro Fat Twin. 12 x 3.5
 disks
 + 2x 2.5 SSD's per U, onboard 10GB-T and the fact they share chassis and
 PSU's keeps the price down. For bulk storage one of these with a single 8
 core low clocked E5 Xeon is ideal in my mind. I did a spreadsheet working
 out U space, power and cost per GB for several different types of server,
 this solution came out ahead in nearly every category.

 If there is a requirement for a high perf SSD tier I would probably look
 at
 dedicated SSD nodes as I doubt you could cram enough CPU power into a
 single
 server to drive 12xSSD's.

 You mentioned low latency was a key requirement, is this always going to
 be
 at low queue depths? If you just need very low latency but won't actually
 be
 driving the SSD's very hard you will probably find a very highly clocked
 E3
 is the best bet with 2-4 SSD's per node. However if you drive the SSD's
 hard, a single one can easily max out several cores.

  

Re: [ceph-users] rbd du

2015-08-24 Thread Jason Dillaman
That rbd CLI command is a new feature that will be included with the upcoming 
infernalis release.  In the meantime, you can use this approach [1] to estimate 
your RBD image usage.

[1] http://ceph.com/planet/real-size-of-a-ceph-rbd-image/
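
In short, the approach there sums the allocated extents reported by rbd diff; a 
rough sketch (assuming the plain offset/length/type output format):

#!/usr/bin/env python
# Rough sketch of the approach from [1]: sum the allocated extents that
# "rbd diff" reports for an image. Run it per image (and per snapshot if
# needed); assumes the plain "offset length type" output format.
import subprocess
import sys

image = sys.argv[1]  # e.g. "rbd/myimage" (hypothetical)

used = 0
for line in subprocess.check_output(["rbd", "diff", image]).decode().splitlines():
    fields = line.split()
    if len(fields) >= 2 and fields[1].isdigit():
        used += int(fields[1])

print("%s: ~%.1f MB allocated" % (image, used / 1024.0 / 1024.0))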

-- 

Jason Dillaman 
Red Hat Ceph Storage Engineering 
dilla...@redhat.com 
http://www.redhat.com 


- Original Message - 

 From: Allen Liao aliao.svsga...@gmail.com
 To: ceph-users@lists.ceph.com
 Sent: Monday, August 24, 2015 1:03:03 PM
 Subject: [ceph-users] rbd du

 Hi all,

 The online manual ( http://ceph.com/docs/master/man/8/rbd/ ) for rbd has
 documentation for the 'du' command. I'm running ceph 0.94.2 and that command
 isn't recognized, nor is it in the man page.

 Is there another command that will calculate the provisioned and actual disk
 usage of all images and associated snapshots within the specified pool?



Re: [ceph-users] EXT4 for Production and Journal Question?

2015-08-24 Thread Lionel Bouton
On 24/08/2015 19:34, Robert LeBlanc wrote:
 Building off a discussion earlier this month [1], how supported is
 EXT4 for OSDs? It seems that some people are getting good results with
 it and I'll be testing it in our environment.

 The other question is if the EXT4 journal is even necessary if you are
 using Ceph SSD journals. My thoughts are thus: Incoming I/O is written
 to the SSD journal. The journal then flushes to the EXT4 partition.
 Only after the write is completed (I understand that this is a direct
 sync write) does Ceph free the SSD journal entry.

 Doesn't this provide the same reliability as the EXT4 journal? If an
 OSD crashed in the middle of the write with no EXT4 journal, the file
 system would be repaired and then Ceph would rewrite the last
 transaction that didn't complete? I'm sure I'm missing something
 here...

I didn't try this configuration, but what you're missing is probably:
- the file system recovery time when there's no journal available.
e2fsck on large filesystems can be long and may need user interaction.
You don't want that if you just had a cluster-wide (or even partial, but
involving tens of disks, some of which might be needed to reach min_size)
power failure.
- the less tested behaviour: I'm not sure there's even a guarantee from
ext4 without a journal that e2fsck can recover properly after a crash (i.e.
with data consistent with the Ceph journal).

Lionel


Re: [ceph-users] Testing CephFS

2015-08-24 Thread Simon Hallam
Hi Greg,

The MDS' detect that the other one went down and started the replay. 

I did some further testing with 20 client machines. Of the 20 client machines, 
5 hung with the following error:

[Aug24 10:53] ceph: mds0 caps stale
[Aug24 10:54] ceph: mds0 caps stale
[Aug24 10:58] ceph: mds0 hung
[Aug24 11:03] ceph: mds0 came back
[  +8.803334] libceph: mon2 10.15.0.3:6789 socket closed (con state OPEN)
[  +0.18] libceph: mon2 10.15.0.3:6789 session lost, hunting for new mon
[Aug24 11:04] ceph: mds0 reconnect start
[  +0.084938] libceph: mon2 10.15.0.3:6789 session established
[  +0.008475] ceph: mds0 reconnect denied

10.15.0.3 was the active MDS at the time I unplugged the Ethernet cable.


This was the output of ceph -w as I ran the test (I've removed a lot of the pg 
remapping):

2015-08-24 11:02:39.547529 mon.1 [INF] mon.ceph2 calling new monitor election
2015-08-24 11:02:40.011995 mon.0 [INF] mon.ceph1 calling new monitor election
2015-08-24 11:02:45.245869 mon.0 [INF] mon.ceph1@0 won leader election with 
quorum 0,1
2015-08-24 11:02:45.257440 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,1 
ceph1,ceph2
2015-08-24 11:02:45.535369 mon.0 [INF] monmap e1: 3 mons at 
{ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0}
2015-08-24 11:02:45.535444 mon.0 [INF] pgmap v15803: 8256 pgs: 8256 
active+clean; 1248 GB data, 2503 GB used, 193 TB / 196 TB avail; 47 B/s wr, 0 
op/s
2015-08-24 11:02:45.535541 mon.0 [INF] mdsmap e38: 1/1/1 up 
{0=ceph3=up:active}, 2 up:standby
2015-08-24 11:02:45.535629 mon.0 [INF] osdmap e197: 36 osds: 36 up, 36 in
2015-08-24 11:03:01.946397 mon.0 [INF] mdsmap e39: 1/1/1 up 
{0=ceph2=up:replay}, 1 up:standby
2015-08-24 11:03:02.993880 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:reconnect
2015-08-24 11:03:02.993930 mon.0 [INF] mdsmap e40: 1/1/1 up 
{0=ceph2=up:reconnect}, 1 up:standby
2015-08-24 11:03:51.461248 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:rejoin
2015-08-24 11:03:55.807131 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:active
2015-08-24 11:03:55.807195 mon.0 [INF] mdsmap e42: 1/1/1 up 
{0=ceph2=up:active}, 1 up:standby
2015-08-24 11:06:48.036736 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:active
2015-08-24 11:06:48.036799 mon.0 [INF] mdsmap e43: 1/1/1 up 
{0=ceph2=up:active}, 1 up:standby
*cable plugged back in*
2015-08-24 11:13:13.230714 mon.0 [INF] osd.32 10.15.0.3:6832/11565 boot
2015-08-24 11:13:13.230765 mon.0 [INF] osdmap e212: 36 osds: 25 up, 25 in
2015-08-24 11:13:13.230809 mon.0 [INF] mds.? 10.15.0.3:6833/16993 up:boot
2015-08-24 11:13:13.230837 mon.0 [INF] mdsmap e47: 1/1/1 up 
{0=ceph2=up:active}, 2 up:standby
2015-08-24 11:13:30.799429 mon.2 [INF] mon.ceph3 calling new monitor election
2015-08-24 11:13:30.826158 mon.0 [INF] mon.ceph1 calling new monitor election
2015-08-24 11:13:30.926331 mon.0 [INF] mon.ceph1@0 won leader election with 
quorum 0,1,2
2015-08-24 11:13:30.968739 mon.0 [INF] mdsmap e47: 1/1/1 up 
{0=ceph2=up:active}, 2 up:standby
2015-08-24 11:13:28.383203 mds.0 [INF] denied reconnect attempt (mds is 
up:active) from client.24155 10.10.10.95:0/3238635414 after 625.375507 (allowed 
interval 45)
2015-08-24 11:13:29.721653 mds.0 [INF] denied reconnect attempt (mds is 
up:active) from client.24146 10.10.10.99:0/3454703638 after 626.713952 (allowed 
interval 45)
2015-08-24 11:13:31.113004 mds.0 [INF] denied reconnect attempt (mds is 
up:active) from client.24140 10.10.10.60:0/359606080 after 628.105302 (allowed 
interval 45)
2015-08-24 11:13:50.933020 mds.0 [INF] denied reconnect attempt (mds is 
up:active) from client.24152 10.10.10.67:0/3475305031 after 647.925323 (allowed 
interval 45)
2015-08-24 11:13:51.037681 mds.0 [INF] denied reconnect attempt (mds is 
up:active) from client.24149 10.10.10.68:0/22416725 after 648.029988 (allowed 
interval 45)

I did just notice that none of the times match up, so I may try again once I fix 
ntp/chrony and see if that makes a difference.

Cheers,

Simon

 -Original Message-
 From: Gregory Farnum [mailto:gfar...@redhat.com]
 Sent: 21 August 2015 12:16
 To: Simon Hallam
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Testing CephFS
 
 On Thu, Aug 20, 2015 at 11:07 AM, Simon  Hallam s...@pml.ac.uk wrote:
  Hey all,
 
 
 
  We are currently testing CephFS on a small (3 node) cluster.
 
 
 
  The setup is currently:
 
 
 
  Each server has 12 OSDs, 1 Monitor and 1 MDS running on it:
 
  The servers are running: 0.94.2-0.el7
 
  The clients are running: Ceph: 0.80.10-1.fc21, Kernel: 4.0.6-200.fc21.x86_64
 
 
 
  ceph -s
 
  cluster 4ed5ecdd-0c5b-4422-9d99-c9e42c6bd4cd
 
   health HEALTH_OK
 
   monmap e1: 3 mons at
  {ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0}
 
  election epoch 20, quorum 0,1,2 ceph1,ceph2,ceph3
 
   mdsmap e12: 1/1/1 up {0=ceph3=up:active}, 2 up:standby
 
   osdmap e389: 36 osds: 36 up, 36 in
 
pgmap v19370: 8256 pgs, 3 pools, 51217 MB data, 14035 objects
 
  95526 MB used, 196 TB / 196 TB 

Re: [ceph-users] Testing CephFS

2015-08-24 Thread Gregory Farnum
On Mon, Aug 24, 2015 at 11:35 AM, Simon  Hallam s...@pml.ac.uk wrote:
 Hi Greg,

 The MDS' detect that the other one went down and started the replay.

 I did some further testing with 20 client machines. Of the 20 client 
 machines, 5 hung with the following error:

 [Aug24 10:53] ceph: mds0 caps stale
 [Aug24 10:54] ceph: mds0 caps stale
 [Aug24 10:58] ceph: mds0 hung
 [Aug24 11:03] ceph: mds0 came back
 [  +8.803334] libceph: mon2 10.15.0.3:6789 socket closed (con state OPEN)
 [  +0.18] libceph: mon2 10.15.0.3:6789 session lost, hunting for new mon
 [Aug24 11:04] ceph: mds0 reconnect start
 [  +0.084938] libceph: mon2 10.15.0.3:6789 session established
 [  +0.008475] ceph: mds0 reconnect denied

Oh, this might be a kernel bug, failing to ask for mdsmap updates when
the connection goes away. Zheng, does that sound familiar?
-Greg


 10.15.0.3 was the active MDS at the time I unplugged the Ethernet cable.


 This was the output of ceph -w as I ran the test (I've removed a lot of the 
 pg remapping):

 2015-08-24 11:02:39.547529 mon.1 [INF] mon.ceph2 calling new monitor election
 2015-08-24 11:02:40.011995 mon.0 [INF] mon.ceph1 calling new monitor election
 2015-08-24 11:02:45.245869 mon.0 [INF] mon.ceph1@0 won leader election with 
 quorum 0,1
 2015-08-24 11:02:45.257440 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,1 
 ceph1,ceph2
 2015-08-24 11:02:45.535369 mon.0 [INF] monmap e1: 3 mons at 
 {ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0}
 2015-08-24 11:02:45.535444 mon.0 [INF] pgmap v15803: 8256 pgs: 8256 
 active+clean; 1248 GB data, 2503 GB used, 193 TB / 196 TB avail; 47 B/s wr, 0 
 op/s
 2015-08-24 11:02:45.535541 mon.0 [INF] mdsmap e38: 1/1/1 up 
 {0=ceph3=up:active}, 2 up:standby
 2015-08-24 11:02:45.535629 mon.0 [INF] osdmap e197: 36 osds: 36 up, 36 in
 2015-08-24 11:03:01.946397 mon.0 [INF] mdsmap e39: 1/1/1 up 
 {0=ceph2=up:replay}, 1 up:standby
 2015-08-24 11:03:02.993880 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:reconnect
 2015-08-24 11:03:02.993930 mon.0 [INF] mdsmap e40: 1/1/1 up 
 {0=ceph2=up:reconnect}, 1 up:standby
 2015-08-24 11:03:51.461248 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:rejoin
 2015-08-24 11:03:55.807131 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:active
 2015-08-24 11:03:55.807195 mon.0 [INF] mdsmap e42: 1/1/1 up 
 {0=ceph2=up:active}, 1 up:standby
 2015-08-24 11:06:48.036736 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:active
 2015-08-24 11:06:48.036799 mon.0 [INF] mdsmap e43: 1/1/1 up 
 {0=ceph2=up:active}, 1 up:standby
 *cable plugged back in*
 2015-08-24 11:13:13.230714 mon.0 [INF] osd.32 10.15.0.3:6832/11565 boot
 2015-08-24 11:13:13.230765 mon.0 [INF] osdmap e212: 36 osds: 25 up, 25 in
 2015-08-24 11:13:13.230809 mon.0 [INF] mds.? 10.15.0.3:6833/16993 up:boot
 2015-08-24 11:13:13.230837 mon.0 [INF] mdsmap e47: 1/1/1 up 
 {0=ceph2=up:active}, 2 up:standby
 2015-08-24 11:13:30.799429 mon.2 [INF] mon.ceph3 calling new monitor election
 2015-08-24 11:13:30.826158 mon.0 [INF] mon.ceph1 calling new monitor election
 2015-08-24 11:13:30.926331 mon.0 [INF] mon.ceph1@0 won leader election with 
 quorum 0,1,2
 2015-08-24 11:13:30.968739 mon.0 [INF] mdsmap e47: 1/1/1 up 
 {0=ceph2=up:active}, 2 up:standby
 2015-08-24 11:13:28.383203 mds.0 [INF] denied reconnect attempt (mds is 
 up:active) from client.24155 10.10.10.95:0/3238635414 after 625.375507 
 (allowed interval 45)
 2015-08-24 11:13:29.721653 mds.0 [INF] denied reconnect attempt (mds is 
 up:active) from client.24146 10.10.10.99:0/3454703638 after 626.713952 
 (allowed interval 45)
 2015-08-24 11:13:31.113004 mds.0 [INF] denied reconnect attempt (mds is 
 up:active) from client.24140 10.10.10.60:0/359606080 after 628.105302 
 (allowed interval 45)
 2015-08-24 11:13:50.933020 mds.0 [INF] denied reconnect attempt (mds is 
 up:active) from client.24152 10.10.10.67:0/3475305031 after 647.925323 
 (allowed interval 45)
 2015-08-24 11:13:51.037681 mds.0 [INF] denied reconnect attempt (mds is 
 up:active) from client.24149 10.10.10.68:0/22416725 after 648.029988 (allowed 
 interval 45)

 I did just notice that none of the times match up. So may try again once I 
 fix ntp/chrony and see if that makes a difference.

 Cheers,

 Simon

 -Original Message-
 From: Gregory Farnum [mailto:gfar...@redhat.com]
 Sent: 21 August 2015 12:16
 To: Simon Hallam
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Testing CephFS

 On Thu, Aug 20, 2015 at 11:07 AM, Simon  Hallam s...@pml.ac.uk wrote:
  Hey all,
 
 
 
  We are currently testing CephFS on a small (3 node) cluster.
 
 
 
  The setup is currently:
 
 
 
  Each server has 12 OSDs, 1 Monitor and 1 MDS running on it:
 
  The servers are running: 0.94.2-0.el7
 
  The clients are running: Ceph: 0.80.10-1.fc21, Kernel: 
  4.0.6-200.fc21.x86_64
 
 
 
  ceph -s
 
  cluster 4ed5ecdd-0c5b-4422-9d99-c9e42c6bd4cd
 
   health HEALTH_OK
 
   monmap e1: 3 mons at
  {ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0}
 

Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs

2015-08-24 Thread Jan Schermer
This can be tuned in the iSCSI initiator on VMware - look in the advanced settings 
on your ESX hosts (at least if you use the software initiator).

Jan


 On 23 Aug 2015, at 21:28, Nick Fisk n...@fisk.me.uk wrote:
 
 Hi Alex,
 
 Currently RBD+LIO+ESX is broken.
 
 The problem is caused by the RBD device not handling device aborts properly
 causing LIO and ESXi to enter a death spiral together.
 
 If something in the Ceph cluster causes an IO to take longer than 10
 seconds(I think!!!) ESXi submits an iSCSI abort message. Once this happens,
 as you have seen it never recovers.
 
 Mike Christie from Redhat is doing a lot of work on this currently, so
 hopefully in the future there will be a direct RBD interface into LIO and it
 will all work much better.
 
 Either tgt or SCST seem to be pretty stable in testing.
 
 Nick
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Alex Gorbachev
 Sent: 23 August 2015 02:17
 To: ceph-users ceph-users@lists.ceph.com
 Subject: [ceph-users] Slow responding OSDs are not OUTed and cause RBD
 client IO hangs
 
 Hello, this is an issue we have been suffering from and researching along
 with a good number of other Ceph users, as evidenced by the recent posts.
 In our specific case, these issues manifest themselves in an RBD -> iSCSI LIO ->
 ESXi configuration, but the problem is more general.
 
 When there is an issue on OSD nodes (examples: network hangs/blips, disk
 HBAs failing, driver issues, page cache/XFS issues), some OSDs respond
 slowly or with significant delays.  ceph osd perf does not show this,
 neither
 does ceph osd tree, ceph -s / ceph -w.  Instead, the RBD IO hangs to a
 point
 where the client times out, crashes or displays other unsavory behavior -
 operationally this crashes production processes.
 
 Today in our lab we had a disk controller issue, which brought an OSD node
 down.  Upon restart, the OSDs started up and rejoined into the cluster.
 However, immediately all IOs started hanging for a long time and aborts from
 ESXi -> LIO were not succeeding in canceling these IOs.  The only warning I
 could see was:
 
 root@lab2-mon1:/var/log/ceph# ceph health detail
 HEALTH_WARN 30 requests are blocked > 32 sec; 1 osds have slow requests
 30 ops are blocked > 2097.15 sec
 30 ops are blocked > 2097.15 sec on osd.4
 1 osds have slow requests
 
 However, ceph osd perf is not showing high latency on osd 4:
 
 root@lab2-mon1:/var/log/ceph# ceph osd perf
 osd fs_commit_latency(ms) fs_apply_latency(ms)
  0 0   13
  1 00
  2 00
  3   172  208
  4 00
  5 00
  6 01
  7 00
  8   174  819
  9 6   10
 10 01
 11 01
 12 35
 13 01
 14 7   23
 15 01
 16 00
 17 59
 18 01
 1910   18
 20 00
 21 00
 22 01
 23 5   10
 
 SMART state for the osd 4 disk is OK.  The OSD is up and in:
 
 root@lab2-mon1:/var/log/ceph# ceph osd tree
 ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -80 root ssd
 -7 14.71997 root platter
 -3  7.12000 host croc3
 22  0.89000 osd.22  up  1.0  1.0
 15  0.89000 osd.15  up  1.0  1.0
 16  0.89000 osd.16  up  1.0  1.0
 13  0.89000 osd.13  up  1.0  1.0
 18  0.89000 osd.18  up  1.0  1.0
 8  0.89000 osd.8   up  1.0  1.0
 11  0.89000 osd.11  up  1.0  1.0
 20  0.89000 osd.20  up  1.0  1.0
 -4  0.47998 host croc2
 10  0.06000 osd.10  up  1.0  1.0
 12  0.06000 osd.12  up  1.0  1.0
 14  0.06000 osd.14  up  1.0  1.0
 17  0.06000 osd.17  up  1.0  1.0
 19  0.06000 osd.19  up  1.0  1.0
 21  0.06000 osd.21  up  1.0  1.0
 9  0.06000 osd.9   up  1.0  1.0
 23  0.06000 osd.23  up  1.0  1.0
 -2  7.12000 host croc1
 7  0.89000 osd.7   up  1.0

Re: [ceph-users] ceph osd debug question / proposal

2015-08-24 Thread Jan Schermer

I'm not talking about IO happening, I'm talking about file descriptors staying 
open. If they weren't open you could umount it without the -l.
Once you hit the OSD again all those open files will start working and if more 
need to be opened it will start looking for them...
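
(A quick way to see what an OSD still holds open under its data dir is to walk 
/proc/<pid>/fd - just a sketch:)

#!/usr/bin/env python
# Sketch: list the file descriptors a ceph-osd process still holds open under
# its data directory, via /proc/<pid>/fd. Usage: script.py <pid> <datadir>
import os
import sys

pid, datadir = sys.argv[1], sys.argv[2]  # e.g. from "pidof ceph-osd", "/var/lib/ceph/osd/ceph-4"

fd_dir = "/proc/%s/fd" % pid
for fd in os.listdir(fd_dir):
    try:
        target = os.readlink(os.path.join(fd_dir, fd))
    except OSError:
        continue
    if target.startswith(datadir):
        print("fd %s -> %s" % (fd, target))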

Jan


 On 24 Aug 2015, at 03:07, Goncalo Borges gonc...@physics.usyd.edu.au wrote:
 
 Hi Jan...
 
  Thanks for the reply.
 
 Yes, I did an 'umount -l' but I was sure that no I/O was happening at the 
  time. So, I was almost 100% sure that there was no real inconsistency in terms 
 of open files in the OS.
 
 
 On 08/20/2015 07:31 PM, Jan Schermer wrote:
  Just to clarify - you unmounted the filesystem with umount -l? That's almost 
  never a good idea, and it puts the OSD in a very unusual situation where IO 
 will actually work on the open files, but it can't open any new ones. I 
 think this would be enough to confuse just about any piece of software.
 
 Yes, I did an 'umount -l', but I was sure that no I/O was happening at the
 time. So, I was almost 100% sure that there was no real incoherence in terms
 of open files in the OS.
 

 Was journal on the filesystem or on a separate partition/device?
 
 The journal is on the same disk, but in a different partition.
 
 
 It's not the same as R/O filesystem (I hit that once and no such havoc 
 happened), in my experience the OSD traps and exits when something like that 
 happens.
 
 It would be interesting to know what would happen if you just did rm -rf 
 /var/lib/ceph/osd/ceph-4/current/* - that could be an equivalent to umount 
 -l, more or less :-)
 
 
 Will try that today and report back here.
 
 Cheers
 Goncalo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Testing CephFS

2015-08-24 Thread Yan, Zheng

 On Aug 24, 2015, at 18:38, Gregory Farnum gfar...@redhat.com wrote:
 
 On Mon, Aug 24, 2015 at 11:35 AM, Simon  Hallam s...@pml.ac.uk wrote:
 Hi Greg,
 
 The MDS' detect that the other one went down and started the replay.
 
 I did some further testing with 20 client machines. Of the 20 client 
 machines, 5 hung with the following error:
 
 [Aug24 10:53] ceph: mds0 caps stale
 [Aug24 10:54] ceph: mds0 caps stale
 [Aug24 10:58] ceph: mds0 hung
 [Aug24 11:03] ceph: mds0 came back
 [  +8.803334] libceph: mon2 10.15.0.3:6789 socket closed (con state OPEN)
 [  +0.18] libceph: mon2 10.15.0.3:6789 session lost, hunting for new mon
 [Aug24 11:04] ceph: mds0 reconnect start
 [  +0.084938] libceph: mon2 10.15.0.3:6789 session established
 [  +0.008475] ceph: mds0 reconnect denied
 
 Oh, this might be a kernel bug, failing to ask for mdsmap updates when
 the connection goes away. Zheng, does that sound familiar?
 -Greg

This seems like reconnect timeout. you can try enlarging mds_reconnect_timeout 
config option.

Which version of kernel are you using?

Yan, Zheng

 
 
 10.15.0.3 was the active MDS at the time I unplugged the Ethernet cable.
 
 
 This was the output of ceph -w as I ran the test (I've removed a lot of the 
 pg remapping):
 
 2015-08-24 11:02:39.547529 mon.1 [INF] mon.ceph2 calling new monitor election
 2015-08-24 11:02:40.011995 mon.0 [INF] mon.ceph1 calling new monitor election
 2015-08-24 11:02:45.245869 mon.0 [INF] mon.ceph1@0 won leader election with 
 quorum 0,1
 2015-08-24 11:02:45.257440 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,1 
 ceph1,ceph2
 2015-08-24 11:02:45.535369 mon.0 [INF] monmap e1: 3 mons at 
 {ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0}
 2015-08-24 11:02:45.535444 mon.0 [INF] pgmap v15803: 8256 pgs: 8256 
 active+clean; 1248 GB data, 2503 GB used, 193 TB / 196 TB avail; 47 B/s wr, 
 0 op/s
 2015-08-24 11:02:45.535541 mon.0 [INF] mdsmap e38: 1/1/1 up 
 {0=ceph3=up:active}, 2 up:standby
 2015-08-24 11:02:45.535629 mon.0 [INF] osdmap e197: 36 osds: 36 up, 36 in
 2015-08-24 11:03:01.946397 mon.0 [INF] mdsmap e39: 1/1/1 up 
 {0=ceph2=up:replay}, 1 up:standby
 2015-08-24 11:03:02.993880 mon.0 [INF] mds.0 10.15.0.2:6849/17644 
 up:reconnect
 2015-08-24 11:03:02.993930 mon.0 [INF] mdsmap e40: 1/1/1 up 
 {0=ceph2=up:reconnect}, 1 up:standby
 2015-08-24 11:03:51.461248 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:rejoin
 2015-08-24 11:03:55.807131 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:active
 2015-08-24 11:03:55.807195 mon.0 [INF] mdsmap e42: 1/1/1 up 
 {0=ceph2=up:active}, 1 up:standby
 2015-08-24 11:06:48.036736 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:active
 2015-08-24 11:06:48.036799 mon.0 [INF] mdsmap e43: 1/1/1 up 
 {0=ceph2=up:active}, 1 up:standby
 *cable plugged back in*
 2015-08-24 11:13:13.230714 mon.0 [INF] osd.32 10.15.0.3:6832/11565 boot
 2015-08-24 11:13:13.230765 mon.0 [INF] osdmap e212: 36 osds: 25 up, 25 in
 2015-08-24 11:13:13.230809 mon.0 [INF] mds.? 10.15.0.3:6833/16993 up:boot
 2015-08-24 11:13:13.230837 mon.0 [INF] mdsmap e47: 1/1/1 up 
 {0=ceph2=up:active}, 2 up:standby
 2015-08-24 11:13:30.799429 mon.2 [INF] mon.ceph3 calling new monitor election
 2015-08-24 11:13:30.826158 mon.0 [INF] mon.ceph1 calling new monitor election
 2015-08-24 11:13:30.926331 mon.0 [INF] mon.ceph1@0 won leader election with 
 quorum 0,1,2
 2015-08-24 11:13:30.968739 mon.0 [INF] mdsmap e47: 1/1/1 up 
 {0=ceph2=up:active}, 2 up:standby
 2015-08-24 11:13:28.383203 mds.0 [INF] denied reconnect attempt (mds is 
 up:active) from client.24155 10.10.10.95:0/3238635414 after 625.375507 
 (allowed interval 45)
 2015-08-24 11:13:29.721653 mds.0 [INF] denied reconnect attempt (mds is 
 up:active) from client.24146 10.10.10.99:0/3454703638 after 626.713952 
 (allowed interval 45)
 2015-08-24 11:13:31.113004 mds.0 [INF] denied reconnect attempt (mds is 
 up:active) from client.24140 10.10.10.60:0/359606080 after 628.105302 
 (allowed interval 45)
 2015-08-24 11:13:50.933020 mds.0 [INF] denied reconnect attempt (mds is 
 up:active) from client.24152 10.10.10.67:0/3475305031 after 647.925323 
 (allowed interval 45)
 2015-08-24 11:13:51.037681 mds.0 [INF] denied reconnect attempt (mds is 
 up:active) from client.24149 10.10.10.68:0/22416725 after 648.029988 
 (allowed interval 45)
 
 I did just notice that none of the times match up. So may try again once I 
 fix ntp/chrony and see if that makes a difference.
 
 Cheers,
 
 Simon
 
 -Original Message-
 From: Gregory Farnum [mailto:gfar...@redhat.com]
 Sent: 21 August 2015 12:16
 To: Simon Hallam
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Testing CephFS
 
 On Thu, Aug 20, 2015 at 11:07 AM, Simon  Hallam s...@pml.ac.uk wrote:
 Hey all,
 
 
 
 We are currently testing CephFS on a small (3 node) cluster.
 
 
 
 The setup is currently:
 
 
 
 Each server has 12 OSDs, 1 Monitor and 1 MDS running on it:
 
 The servers are running: 0.94.2-0.el7
 
 The clients are running: Ceph: 0.80.10-1.fc21, Kernel: 
 

Re: [ceph-users] Testing CephFS

2015-08-24 Thread Simon Hallam
The clients are:
[root@gridnode50 ~]# uname -a
Linux gridnode50 4.0.8-200.fc21.x86_64 #1 SMP Fri Jul 10 21:09:54 UTC 2015 
x86_64 x86_64 x86_64 GNU/Linux
[root@gridnode50 ~]# ceph -v
ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)

I don't think it is a reconnect timeout, as they don't even attempt to 
reconnect until I plug the Ethernet cable back into the original MDS?

Cheers,

Simon

 -Original Message-
 From: Yan, Zheng [mailto:z...@redhat.com]
 Sent: 24 August 2015 12:28
 To: Simon Hallam
 Cc: ceph-users@lists.ceph.com; Gregory Farnum
 Subject: Re: [ceph-users] Testing CephFS
 
 
  On Aug 24, 2015, at 18:38, Gregory Farnum gfar...@redhat.com wrote:
 
  On Mon, Aug 24, 2015 at 11:35 AM, Simon  Hallam s...@pml.ac.uk wrote:
  Hi Greg,
 
  The MDS' detect that the other one went down and started the replay.
 
  I did some further testing with 20 client machines. Of the 20 client
 machines, 5 hung with the following error:
 
  [Aug24 10:53] ceph: mds0 caps stale
  [Aug24 10:54] ceph: mds0 caps stale
  [Aug24 10:58] ceph: mds0 hung
  [Aug24 11:03] ceph: mds0 came back
  [  +8.803334] libceph: mon2 10.15.0.3:6789 socket closed (con state OPEN)
  [  +0.18] libceph: mon2 10.15.0.3:6789 session lost, hunting for new
 mon
  [Aug24 11:04] ceph: mds0 reconnect start
  [  +0.084938] libceph: mon2 10.15.0.3:6789 session established
  [  +0.008475] ceph: mds0 reconnect denied
 
  Oh, this might be a kernel bug, failing to ask for mdsmap updates when
  the connection goes away. Zheng, does that sound familiar?
  -Greg
 
 This seems like reconnect timeout. you can try enlarging
 mds_reconnect_timeout config option.
 
 Which version of kernel are you using?
 
 Yan, Zheng
 
 
 
  10.15.0.3 was the active MDS at the time I unplugged the Ethernet cable.
 
 
  This was the output of ceph -w as I ran the test (I've removed a lot of the
 pg remapping):
 
  2015-08-24 11:02:39.547529 mon.1 [INF] mon.ceph2 calling new monitor
 election
  2015-08-24 11:02:40.011995 mon.0 [INF] mon.ceph1 calling new monitor
 election
  2015-08-24 11:02:45.245869 mon.0 [INF] mon.ceph1@0 won leader
 election with quorum 0,1
  2015-08-24 11:02:45.257440 mon.0 [INF] HEALTH_WARN; 1 mons down,
 quorum 0,1 ceph1,ceph2
  2015-08-24 11:02:45.535369 mon.0 [INF] monmap e1: 3 mons at
 {ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0}
  2015-08-24 11:02:45.535444 mon.0 [INF] pgmap v15803: 8256 pgs: 8256
 active+clean; 1248 GB data, 2503 GB used, 193 TB / 196 TB avail; 47 B/s wr, 0
 op/s
  2015-08-24 11:02:45.535541 mon.0 [INF] mdsmap e38: 1/1/1 up
 {0=ceph3=up:active}, 2 up:standby
  2015-08-24 11:02:45.535629 mon.0 [INF] osdmap e197: 36 osds: 36 up, 36
 in
  2015-08-24 11:03:01.946397 mon.0 [INF] mdsmap e39: 1/1/1 up
 {0=ceph2=up:replay}, 1 up:standby
  2015-08-24 11:03:02.993880 mon.0 [INF] mds.0 10.15.0.2:6849/17644
 up:reconnect
  2015-08-24 11:03:02.993930 mon.0 [INF] mdsmap e40: 1/1/1 up
 {0=ceph2=up:reconnect}, 1 up:standby
  2015-08-24 11:03:51.461248 mon.0 [INF] mds.0 10.15.0.2:6849/17644
 up:rejoin
  2015-08-24 11:03:55.807131 mon.0 [INF] mds.0 10.15.0.2:6849/17644
 up:active
  2015-08-24 11:03:55.807195 mon.0 [INF] mdsmap e42: 1/1/1 up
 {0=ceph2=up:active}, 1 up:standby
  2015-08-24 11:06:48.036736 mon.0 [INF] mds.0 10.15.0.2:6849/17644
 up:active
  2015-08-24 11:06:48.036799 mon.0 [INF] mdsmap e43: 1/1/1 up
 {0=ceph2=up:active}, 1 up:standby
  *cable plugged back in*
  2015-08-24 11:13:13.230714 mon.0 [INF] osd.32 10.15.0.3:6832/11565 boot
  2015-08-24 11:13:13.230765 mon.0 [INF] osdmap e212: 36 osds: 25 up, 25
 in
  2015-08-24 11:13:13.230809 mon.0 [INF] mds.? 10.15.0.3:6833/16993
 up:boot
  2015-08-24 11:13:13.230837 mon.0 [INF] mdsmap e47: 1/1/1 up
 {0=ceph2=up:active}, 2 up:standby
  2015-08-24 11:13:30.799429 mon.2 [INF] mon.ceph3 calling new monitor
 election
  2015-08-24 11:13:30.826158 mon.0 [INF] mon.ceph1 calling new monitor
 election
  2015-08-24 11:13:30.926331 mon.0 [INF] mon.ceph1@0 won leader
 election with quorum 0,1,2
  2015-08-24 11:13:30.968739 mon.0 [INF] mdsmap e47: 1/1/1 up
 {0=ceph2=up:active}, 2 up:standby
  2015-08-24 11:13:28.383203 mds.0 [INF] denied reconnect attempt (mds is
 up:active) from client.24155 10.10.10.95:0/3238635414 after 625.375507
 (allowed interval 45)
  2015-08-24 11:13:29.721653 mds.0 [INF] denied reconnect attempt (mds is
 up:active) from client.24146 10.10.10.99:0/3454703638 after 626.713952
 (allowed interval 45)
  2015-08-24 11:13:31.113004 mds.0 [INF] denied reconnect attempt (mds is
 up:active) from client.24140 10.10.10.60:0/359606080 after 628.105302
 (allowed interval 45)
  2015-08-24 11:13:50.933020 mds.0 [INF] denied reconnect attempt (mds is
 up:active) from client.24152 10.10.10.67:0/3475305031 after 647.925323
 (allowed interval 45)
  2015-08-24 11:13:51.037681 mds.0 [INF] denied reconnect attempt (mds is
 up:active) from client.24149 10.10.10.68:0/22416725 after 648.029988
 (allowed interval 45)
 
  I did just notice that none of the 

Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs

2015-08-24 Thread Nick Fisk




 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Alex Gorbachev
 Sent: 24 August 2015 18:06
 To: Jan Schermer j...@schermer.cz
 Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk
 Subject: Re: [ceph-users] Slow responding OSDs are not OUTed and cause
 RBD client IO hangs
 
 HI Jan,
 
 On Mon, Aug 24, 2015 at 12:40 PM, Jan Schermer j...@schermer.cz wrote:
  I never actually set up iSCSI with VMware, I just had to research
various
 VMware storage options when we had a SAN problem at a former job... But I
 can take a look at it again if you want me to.
 
 Thank you, I don't want to waste your time as I have asked Vmware TAP to
 research that - I will communicate back anything with which they respond.
 
 
  Is it really deadlocked when this issue occurs?
  What I think is partly responsible for this situation is that the iSCSI
LUN
 queues fill up and that's what actually kills your IO - VMware lowers
queue
 depth to 1 in that situation and it can take a really long time to recover
 (especially if one of the LUNs  on the target constantly has problems, or
 when heavy IO hammers the adapter) - you should never fill this queue,
 ever.
  iSCSI will likely be innocent victim in the chain, not the cause of the
issues.
 
 Completely agreed, so iSCSI's job then is to properly communicate to the
 initiator that it cannot do what it is asked to do and quit the IO.

It's not a queue-full or queue-throttling issue. ESXi detects a slow IO
(I believe one that takes longer than 10 seconds) and then tries to send an
abort message to the target so it can retry. However, the RBD client doesn't
handle the abort message passed to it from LIO. I'm not sure exactly what
happens next, but neither LIO nor ESXi makes the decision to ignore the
abort, and so both enter a standoff with each other.

 
 
  Ceph should gracefully handle all those situations, you just need to set
the
 timeouts right. I have it set so that whatever happens the OSD can only
delay
 work for 40s and then it is marked down - at that moment all IO start
flowing
 again.
 
 What setting in ceph do you use to do that?  is that
 mon_osd_down_out_interval?  I think stopping slow OSDs is the answer to
 the root of the problem - so far I only know to do ceph osd perf
 and look at latencies.
 

You can maybe adjust some of the timeouts so that Ceph pauses for less time,
hopefully making sure all IO is processed in under 10s, but you increase the
risk of OSDs randomly dropping out, and there are probably still quite a few
cases where IO could take longer than 10s.
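
For reference (and as far as I understand it), the setting that controls how
quickly a laggy OSD gets marked down is the heartbeat grace, not
mon_osd_down_out_interval, which only governs when a down OSD is additionally
marked out. A rough sketch of lowering both on a test cluster follows; the
values are purely illustrative, lowering the grace increases the risk of OSDs
flapping, and some options may need a daemon restart to fully take effect:

# inject at runtime on all OSDs (persist the same value in ceph.conf too)
ceph tell 'osd.*' injectargs '--osd_heartbeat_grace 15'
# the monitors evaluate failure reports against the same grace, so set it on
# each mon as well (mon.a is just an example id)
ceph tell mon.a injectargs '--osd_heartbeat_grace 15'
# separate knob: how long a down OSD stays "in" before data rebalances away
ceph tell mon.a injectargs '--mon_osd_down_out_interval 300'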

 
  You should take this to VMware support, they should be able to tell
 whether the problem is in iSCSI target (then you can take a look at how
that
 behaves) or in the initiator settings. Though in my experience after two
visits
 from their foremost experts I had to google everything myself because
 they were clueless - YMMV.
 
 I am hoping the TAP Elite team can do better...but we'll see...
 
 
  The root cause is however slow ops in Ceph, and I have no idea why you'd
 have them if the OSDs come back up - maybe one of them is really
 deadlocked or backlogged in some way? I found that when OSDs are dead
 but up they don't respond to ceph tell osd.xxx ... so try if they all
respond
 in a timely manner, that should help pinpoint the bugger.
 
 I think I know in this case - there are some PCIe AER/Bus errors and TLP
 Header messages strewing across the console of one OSD machine - ceph
 osd perf showing latencies above a second per OSD, but only when IO is
 done to those OSDs.  I am thankful this is not production storage, but
worried
 of this situation in production - the OSDs are staying up and in, but
their
 latencies are slowing clusterwide IO to a crawl.  I am trying to envision
this
 situation in production and how would one find out what is slowing
 everything down without guessing.
 
 Regards,
 Alex
 
 
 
  Jan
 
 
  On 24 Aug 2015, at 18:26, Alex Gorbachev a...@iss-integration.com
 wrote:
 
  This can be tuned in the iSCSI initiation on VMware - look in advanced
 settings on your ESX hosts (at least if you use the software initiator).
 
  Thanks, Jan. I asked this question of Vmware as well, I think the
  problem is specific to a given iSCSI session, so wondering if that's
  strictly the job of the target?  Do you know of any specific SCSI
  settings that mitigate this kind of issue?  Basically, give up on a
  session and terminate it and start a new one should an RBD not
  respond?
 
  As I understand, RBD simply never gives up.  If an OSD does not
  respond but is still technically up and in, Ceph will retry IOs
  forever.  I think RBD and Ceph need a timeout mechanism for this.
 
  Best regards,
  Alex
 
  Jan
 
 
  On 23 Aug 2015, at 21:28, Nick Fisk n...@fisk.me.uk wrote:
 
  Hi Alex,
 
  Currently RBD+LIO+ESX is broken.
 
  The problem is caused by the RBD device not handling device aborts
  properly causing LIO 

Re: [ceph-users] TRIM / DISCARD run at low priority by the OSDs?

2015-08-24 Thread Chad William Seys
Hi Alexandre,

Thanks for the note.
I was not clear enough.  The fstrim I was running was only on the krbd 
mountpoints.  The backend OSDs only have standard hard disks, not SSDs, so 
they don't need to be trimmed.

Instead I was reclaiming free space as reported by Ceph.  Running fstrim on
the rbd mountpoints caused the OSDs to become very busy, affecting all
rbds, not just those being trimmed.

I was hoping someone had an idea of how to make the OSDs not become busy while 
running fstrim on the rbd mountpoints.  E.g. if Ceph made a distinction 
between trim operations on RBDs and other types, it could give those 
operations lower priority.
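
One possible mitigation in the meantime is to spread the trim out chunk by
chunk so the discards reach the OSDs in smaller bursts. A rough sketch
(mountpoint, chunk size and sleep are arbitrary examples; this only limits
burstiness, it does not change any priority on the OSD side):

#!/bin/bash
# trim an rbd-backed mountpoint in fixed-size chunks with a pause in between
MOUNT=/mnt/rbd0                        # example mountpoint
CHUNK=$((10 * 1024 * 1024 * 1024))     # 10 GiB per fstrim call
SIZE=$(df -B1 --output=size "$MOUNT" | tail -n1 | tr -d ' ')
off=0
while [ "$off" -lt "$SIZE" ]; do
    ionice -c 3 fstrim -v -o "$off" -l "$CHUNK" "$MOUNT"
    off=$((off + CHUNK))
    sleep 30                           # let the OSDs drain between chunks
done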

Thanks again!
Chad.


On Monday, August 24, 2015 18:26:30 you wrote:
 Hi,
 
 I'm not sure for krbd, but with librbd, using trim/discard on the client
 does not trim/discard the osd physical disk. It simply writes zeroes in the
 rbd image.
 
 Zero writes can be skipped since this commit (librbd related):
 https://github.com/xiaoxichen/ceph/commit/e7812b8416012141cf8faef577e7b27e1b29d5e3
 +OPTION(rbd_skip_partial_discard, OPT_BOOL, false)
 
 
 Then you can still manage fstrim manually on the osd servers
 
 - Mail original -
 De: Chad William Seys cws...@physics.wisc.edu
 À: ceph-users ceph-us...@ceph.com
 Envoyé: Samedi 22 Août 2015 04:26:38
 Objet: [ceph-users] TRIM / DISCARD run at low priority by the OSDs?
 
 Hi All,
 
 Is it possible to give TRIM / DISCARD initiated by krbd low priority on the
 OSDs?
 
 I know it is possible to run fstrim at Idle priority on the rbd mount point,
 e.g. ionice -c Idle fstrim -v $MOUNT .
 
 But this Idle priority (it appears) only is within the context of the node
 executing fstrim . If the node executing fstrim is Idle then the OSDs are
 very busy and performance suffers.
 
 Is it possible to tell the OSD daemons (or whatever) to perform the TRIMs at
 low priority also?
 
 Thanks!
 Chad.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Opensource plugin for pulling out cluster recovery and client IO metric

2015-08-24 Thread Vickey Singh
Hello Ceph Geeks

I am planning to develop a python plugin that pulls out cluster *recovery
IO* and *client IO* operation metrics, which can then be used with
collectd.

*For example, I need to pull out these values:*

*recovery io 814 MB/s, 101 objects/s*
*client io 85475 kB/s rd, 1430 kB/s wr, 32 op/s*


Could you please help me understand how the *ceph -s* and *ceph -w*
outputs *print the cluster recovery IO and client IO information*?
Where is this information coming from? *Is it coming from perf dump*? If
yes, which section of the perf dump output should I focus on? If not,
how can I get these values?

I tried *ceph --admin-daemon /var/run/ceph/ceph-osd.48.asok perf dump*,
but it generates a hell of a lot of information and I am confused about which
section of the output I should use.
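
For what it's worth, these rates appear to come from the cluster-wide pgmap
summary maintained by the monitors, which is what ceph -s / ceph -w print,
rather than from any single OSD's perf dump. One way to pull them is straight
from the status JSON; the key names below are from a Hammer-era cluster and
only show up while there is client or recovery activity, so check the output
of ceph status --format json on your own cluster first:

# client and recovery rates as printed by ceph -s, taken from the pgmap
# summary; the keys are absent when idle, hence the // 0 defaults
ceph status --format json | jq '{
  client_read_Bps:   (.pgmap.read_bytes_sec             // 0),
  client_write_Bps:  (.pgmap.write_bytes_sec            // 0),
  client_ops:        (.pgmap.op_per_sec                 // 0),
  recovery_Bps:      (.pgmap.recovering_bytes_per_sec   // 0),
  recovery_objects:  (.pgmap.recovering_objects_per_sec // 0)
}'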


Please help

Thanks in advance
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd debug question / proposal

2015-08-24 Thread Goncalo Borges

Hi Jan...

We were interested in the situation where an rm -Rf is done in the 
current directory of the OSD. Here are my findings:


   1. In this exercise, we simply deleted all the content of
   /var/lib/ceph/osd/ceph-23/current.

   # cd /var/lib/ceph/osd/ceph-23/current
   # rm -Rf *
   # df
   (...)
   /dev/sdj1  2918054776  434548  2917620228   1%
   /var/lib/ceph/osd/ceph-23



   2. After some time, ceph enters an error state because it thinks it
   has an inconsistent PG and several scrub errors

   # ceph -s
cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc
 health HEALTH_ERR
1 pgs inconsistent
1850 scrub errors
 monmap e1: 3 mons at
   {mon1=X.X.X.X:6789/0,mon2=X.X.X.X:6789/0,mon3=X.X.X.X:6789/0}
election epoch 24, quorum 0,1,2 mon1,mon3,mon2
 mdsmap e162: 1/1/1 up {0=mds=up:active}, 1 up:standby-replay
 osdmap e1903: 32 osds: 32 up, 32 in
  pgmap v1041261: 2176 pgs, 2 pools, 4930 GB data, 1843
   kobjects
14424 GB used, 74627 GB / 89051 GB avail
2175 active+clean
   1 active+clean+inconsistent
  client io 989 B/s rd, 1 op/s


   3. Looking at ceph.log on the mon, it is possible to check which PG is
   affected and which OSD is responsible for the error:

   # tail -f /var/log/ceph/ceph.log
   (...)
   2015-08-24 11:31:10.139239 osd.13 X.X.X.X:6804/20104 2384 :
   cluster [ERR] be_compare_scrubmaps: *5.336 shard 23* missing
   e300336/10001b0.2825/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   32600336/1000109.0754/head//5be_compare_scrubmaps:
   *5.336 shard 23* missing
   dd700336/10001ab.0b91/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   bc220336/10001bd.387c/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   f9320336/1000201.2e96/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   1a920336/1000228.d501/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   24a20336/10001bc.3e06/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   cd20336/1000227.4775/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   cef20336/10001b9.2260/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   ba240336/10001d8.0630/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   3e740336/10001b1.2089/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   e840336/10001ba.2618/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   17b40336/1e9.0287/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   b7950336/1e4.0800/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   94560336/10001b4.2834/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   71370336/151.0179/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   62370336/10001b5.3b5b/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   e9670336/1000120.03f8/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   1b480336/100019a.0d4b/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   11880336/10001e8.03e9/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   56c80336/183.0255/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   97790336/10001e7.0668/head//5be_compare_scrubmaps: 5.336
   shard 23 missing
   e4ca0336/10001b6.278c/head//5be_compare_scrubmaps: 5.336
   shard 23 missing 4eda0336/100019e.36ad/head//5
   (...)
   2015-08-24 11:31:14.336760 osd.13 X.X.X.X:6804/20104 2476 :
   cluster [ERR] 5.336 scrub 1850 missing, 0 inconsistent objects
   2015-08-24 11:31:14.336764 osd.13 X.X.X.X:6804/20104 2477 :
   cluster [ERR] 5.336 scrub 1850 errors

   4. We have tried to restart the problematic osd, but that fails.

   # /etc/init.d/ceph stop osd.23
   === osd.23 ===
   Stopping Ceph osd.23 on osd3...done
   [root@osd3 ~]# /etc/init.d/ceph start osd.23
   === osd.23 ===
   create-or-move updated item name 'osd.23' weight 2.72 at
   location {host=osd3,root=default} to crush map
   Starting Ceph osd.23 on osd3...
   starting osd.23 at :/0 osd_data /var/lib/ceph/osd/ceph-23
   /var/lib/ceph/osd/ceph-23/journal

   # tail -f /var/log/ceph/ceph-osd.23.log
   2015-08-24 11:48:12.189322 7fa24d85d800  0 ceph version 0.94.2
   (5fb85614ca8f354284c713a2f9c610860720bbf3), process ceph-osd,
   pid 7266
   2015-08-24 11:48:12.389747 7fa24d85d800  0
   filestore(/var/lib/ceph/osd/ceph-23) backend xfs (magic 0x58465342)
   2015-08-24 

Re: [ceph-users] ceph osd debug question / proposal

2015-08-24 Thread Shinobu
Hope nobody ever does that.
Anyway, that's good to know in case of disaster recovery.

Thank you!

 Shinobu

On Tue, Aug 25, 2015 at 12:10 PM, Goncalo Borges 
gonc...@physics.usyd.edu.au wrote:

 Hi Shinobu

 Human mistake, for example :-) Not very frequent, but it happens.

 Nevertheless, the idea is to test ceph against different DC scenarios,
 triggered by different problems.

 On this particular situation, the cluster recovered ok ONCE the
 problematic OSD daemon was tagged as 'down' and 'out'

 Cheers
 Goncalo


 On 08/25/2015 01:06 PM, Shinobu wrote:

 So what is the situation where you need to do:

 # cd /var/lib/ceph/osd/ceph-23/current
 # rm -Rf *
 # df
 (...)

 I'm quite sure that is not normal.

  Shinobu

 On Tue, Aug 25, 2015 at 9:41 AM, Goncalo Borges 
 gonc...@physics.usyd.edu.augonc...@physics.usyd.edu.au wrote:

 Hi Jan...

 We were interested in the situation where an rm -Rf is done in the
 current directory of the OSD. Here are my findings:

 1. In this exercise, we simply deleted all the content of
 /var/lib/ceph/osd/ceph-23/current.

 # cd /var/lib/ceph/osd/ceph-23/current
 # rm -Rf *
 # df
 (...)
 /dev/sdj1  2918054776434548 2917620228   1%
 /var/lib/ceph/osd/ceph-23



 2. After some time, ceph enters in error state because it thinks it has
 an inconsistent PG and several scrub errors

 # ceph -s
 cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc
  health HEALTH_ERR
 1 pgs inconsistent
 1850 scrub errors
  monmap e1: 3 mons at
 {mon1=X.X.X.X:6789/0,mon2=X.X.X.X:6789/0,mon3=X.X.X.X:6789/0}
 election epoch 24, quorum 0,1,2 mon1,mon3,mon2
  mdsmap e162: 1/1/1 up {0=mds=up:active}, 1 up:standby-replay
  osdmap e1903: 32 osds: 32 up, 32 in
   pgmap v1041261: 2176 pgs, 2 pools, 4930 GB data, 1843 kobjects
 14424 GB used, 74627 GB / 89051 GB avail
 2175 active+clean
1 active+clean+inconsistent
   client io 989 B/s rd, 1 op/s


 3. Looking to ceph.log in the mon, it is possible to check which is the
 PG affected and which OSD is responsible for the error:

 # tail -f /var/log/ceph/ceph.log
 (...)
 2015-08-24 11:31:10.139239 osd.13 X.X.X.X:6804/20104 2384 : cluster [ERR]
 be_compare_scrubmaps: *5.336 shard 23* missing
 e300336/10001b0.2825/head//5be_compare_scrubmaps: 5.336 shard 23
 missing 32600336/1000109.0754/head//5be_compare_scrubmaps: *5.336
 shard 23* missing
 dd700336/10001ab.0b91/head//5be_compare_scrubmaps: 5.336 shard 23
 missing bc220336/10001bd.387c/head//5be_compare_scrubmaps: 5.336
 shard 23 missing f9320336/1000201.2e96/head//5be_compare_scrubmaps:
 5.336 shard 23 missing
 1a920336/1000228.d501/head//5be_compare_scrubmaps: 5.336 shard 23
 missing 24a20336/10001bc.3e06/head//5be_compare_scrubmaps: 5.336
 shard 23 missing cd20336/1000227.4775/head//5be_compare_scrubmaps:
 5.336 shard 23 missing
 cef20336/10001b9.2260/head//5be_compare_scrubmaps: 5.336 shard 23
 missing ba240336/10001d8.0630/head//5be_compare_scrubmaps: 5.336
 shard 23 missing 3e740336/10001b1.2089/head//5be_compare_scrubmaps:
 5.336 shard 23 missing
 e840336/10001ba.2618/head//5be_compare_scrubmaps: 5.336 shard 23
 missing 17b40336/1e9.0287/head//5be_compare_scrubmaps: 5.336
 shard 23 missing b7950336/1e4.0800/head//5be_compare_scrubmaps:
 5.336 shard 23 missing
 94560336/10001b4.2834/head//5be_compare_scrubmaps: 5.336 shard 23
 missing 71370336/151.0179/head//5be_compare_scrubmaps: 5.336
 shard 23 missing 62370336/10001b5.3b5b/head//5be_compare_scrubmaps:
 5.336 shard 23 missing
 e9670336/1000120.03f8/head//5be_compare_scrubmaps: 5.336 shard 23
 missing 1b480336/100019a.0d4b/head//5be_compare_scrubmaps: 5.336
 shard 23 missing 11880336/10001e8.03e9/head//5be_compare_scrubmaps:
 5.336 shard 23 missing
 56c80336/183.0255/head//5be_compare_scrubmaps: 5.336 shard 23
 missing 97790336/10001e7.0668/head//5be_compare_scrubmaps: 5.336
 shard 23 missing e4ca0336/10001b6.278c/head//5be_compare_scrubmaps:
 5.336 shard 23 missing 4eda0336/100019e.36ad/head//5
 (...)
 2015-08-24 11:31:14.336760 osd.13 X.X.X.X:6804/20104 2476 : cluster [ERR]
 5.336 scrub 1850 missing, 0 inconsistent objects
 2015-08-24 11:31:14.336764 osd.13 X.X.X.X:6804/20104 2477 : cluster [ERR]
 5.336 scrub 1850 errors

 4. We have tried to restart the problematic osd, but that fails.

 # /etc/init.d/ceph stop osd.23
 === osd.23 ===
 Stopping Ceph osd.23 on osd3...done
 [root@osd3 ~]# /etc/init.d/ceph start osd.23
 === osd.23 ===
 create-or-move updated item name 'osd.23' weight 2.72 at location
 {host=osd3,root=default} to crush map
 Starting Ceph osd.23 on osd3...
 starting osd.23 at :/0 osd_data /var/lib/ceph/osd/ceph-23
 /var/lib/ceph/osd/ceph-23/journal

 # tail -f /var/log/ceph/ceph-osd.23.log
 2015-08-24 11:48:12.189322 7fa24d85d800  

Re: [ceph-users] ceph osd debug question / proposal

2015-08-24 Thread Goncalo Borges

Hi Shinobu

Human mistake, for example :-) Not very frequent, but it happens.

Nevertheless, the idea is to test ceph against different DC scenarios, 
triggered by different problems.


In this particular situation, the cluster recovered OK ONCE the
problematic OSD daemon was tagged as 'down' and 'out'.
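
For anyone replaying the test, tagging it manually is just the usual pair of
commands; 23 is the osd id from this exercise, and the daemon should be
stopped first:

# stop the broken daemon, then mark it down and out so recovery starts
/etc/init.d/ceph stop osd.23
ceph osd down 23
ceph osd out 23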


Cheers
Goncalo


On 08/25/2015 01:06 PM, Shinobu wrote:

So what is the situation where you need to do:

# cd /var/lib/ceph/osd/ceph-23/current
# rm -Rf *
# df
(...)

I'm quite sure that is not normal.

 Shinobu

On Tue, Aug 25, 2015 at 9:41 AM, Goncalo Borges 
gonc...@physics.usyd.edu.au mailto:gonc...@physics.usyd.edu.au wrote:


Hi Jan...

We were interested in the situation where an rm -Rf is done in the
current directory of the OSD. Here are my findings:

1. In this exercise, we simply deleted all the content of
/var/lib/ceph/osd/ceph-23/current.

# cd /var/lib/ceph/osd/ceph-23/current
# rm -Rf *
# df
(...)
/dev/sdj1  2918054776434548 2917620228   1%
/var/lib/ceph/osd/ceph-23



2. After some time, ceph enters in error state because it
thinks it has an inconsistent PG and several scrub errors

# ceph -s
cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc
 health HEALTH_ERR
1 pgs inconsistent
1850 scrub errors
 monmap e1: 3 mons at
{mon1=X.X.X.X:6789/0,mon2=X.X.X.X:6789/0,mon3=X.X.X.X:6789/0}
election epoch 24, quorum 0,1,2 mon1,mon3,mon2
 mdsmap e162: 1/1/1 up {0=mds=up:active}, 1
up:standby-replay
 osdmap e1903: 32 osds: 32 up, 32 in
  pgmap v1041261: 2176 pgs, 2 pools, 4930 GB data,
1843 kobjects
14424 GB used, 74627 GB / 89051 GB avail
2175 active+clean
   1 active+clean+inconsistent
  client io 989 B/s rd, 1 op/s


3. Looking to ceph.log in the mon, it is possible to check
which is the PG affected and which OSD is responsible for the
error:

# tail -f /var/log/ceph/ceph.log
(...)
2015-08-24 11:31:10.139239 osd.13 X.X.X.X:6804/20104 2384
: cluster [ERR] be_compare_scrubmaps: *5.336 shard 23*
missing
e300336/10001b0.2825/head//5be_compare_scrubmaps:
5.336 shard 23 missing
32600336/1000109.0754/head//5be_compare_scrubmaps:
*5.336 shard 23* missing
dd700336/10001ab.0b91/head//5be_compare_scrubmaps:
5.336 shard 23 missing
bc220336/10001bd.387c/head//5be_compare_scrubmaps:
5.336 shard 23 missing
f9320336/1000201.2e96/head//5be_compare_scrubmaps:
5.336 shard 23 missing
1a920336/1000228.d501/head//5be_compare_scrubmaps:
5.336 shard 23 missing
24a20336/10001bc.3e06/head//5be_compare_scrubmaps:
5.336 shard 23 missing
cd20336/1000227.4775/head//5be_compare_scrubmaps:
5.336 shard 23 missing
cef20336/10001b9.2260/head//5be_compare_scrubmaps:
5.336 shard 23 missing
ba240336/10001d8.0630/head//5be_compare_scrubmaps:
5.336 shard 23 missing
3e740336/10001b1.2089/head//5be_compare_scrubmaps:
5.336 shard 23 missing
e840336/10001ba.2618/head//5be_compare_scrubmaps:
5.336 shard 23 missing
17b40336/1e9.0287/head//5be_compare_scrubmaps:
5.336 shard 23 missing
b7950336/1e4.0800/head//5be_compare_scrubmaps:
5.336 shard 23 missing
94560336/10001b4.2834/head//5be_compare_scrubmaps:
5.336 shard 23 missing
71370336/151.0179/head//5be_compare_scrubmaps:
5.336 shard 23 missing
62370336/10001b5.3b5b/head//5be_compare_scrubmaps:
5.336 shard 23 missing
e9670336/1000120.03f8/head//5be_compare_scrubmaps:
5.336 shard 23 missing
1b480336/100019a.0d4b/head//5be_compare_scrubmaps:
5.336 shard 23 missing
11880336/10001e8.03e9/head//5be_compare_scrubmaps:
5.336 shard 23 missing
56c80336/183.0255/head//5be_compare_scrubmaps:
5.336 shard 23 missing
97790336/10001e7.0668/head//5be_compare_scrubmaps:
5.336 shard 23 missing
e4ca0336/10001b6.278c/head//5be_compare_scrubmaps:
5.336 shard 23 missing 4eda0336/100019e.36ad/head//5
(...)
2015-08-24 11:31:14.336760 osd.13 

Re: [ceph-users] ceph osd debug question / proposal

2015-08-24 Thread Shinobu
So what is the situation where you need to do:

# cd /var/lib/ceph/osd/ceph-23/current
# rm -Rf *
# df
(...)

I'm quite sure that is not normal.

 Shinobu

On Tue, Aug 25, 2015 at 9:41 AM, Goncalo Borges gonc...@physics.usyd.edu.au
 wrote:

 Hi Jan...

 We were interested in the situation where an rm -Rf is done in the current
 directory of the OSD. Here are my findings:

 1. In this exercise, we simply deleted all the content of
 /var/lib/ceph/osd/ceph-23/current.

 # cd /var/lib/ceph/osd/ceph-23/current
 # rm -Rf *
 # df
 (...)
 /dev/sdj1  2918054776434548 2917620228   1%
 /var/lib/ceph/osd/ceph-23



 2. After some time, ceph enters in error state because it thinks it has an
 inconsistent PG and several scrub errors

 # ceph -s
 cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc
  health HEALTH_ERR
 1 pgs inconsistent
 1850 scrub errors
  monmap e1: 3 mons at
 {mon1=X.X.X.X:6789/0,mon2=X.X.X.X:6789/0,mon3=X.X.X.X:6789/0}
 election epoch 24, quorum 0,1,2 mon1,mon3,mon2
  mdsmap e162: 1/1/1 up {0=mds=up:active}, 1 up:standby-replay
  osdmap e1903: 32 osds: 32 up, 32 in
   pgmap v1041261: 2176 pgs, 2 pools, 4930 GB data, 1843 kobjects
 14424 GB used, 74627 GB / 89051 GB avail
 2175 active+clean
1 active+clean+inconsistent
   client io 989 B/s rd, 1 op/s


 3. Looking to ceph.log in the mon, it is possible to check which is the PG
 affected and which OSD is responsible for the error:

 # tail -f /var/log/ceph/ceph.log
 (...)
 2015-08-24 11:31:10.139239 osd.13 X.X.X.X:6804/20104 2384 : cluster [ERR]
 be_compare_scrubmaps: *5.336 shard 23* missing
 e300336/10001b0.2825/head//5be_compare_scrubmaps: 5.336 shard 23
 missing 32600336/1000109.0754/head//5be_compare_scrubmaps: *5.336
 shard 23* missing
 dd700336/10001ab.0b91/head//5be_compare_scrubmaps: 5.336 shard 23
 missing bc220336/10001bd.387c/head//5be_compare_scrubmaps: 5.336
 shard 23 missing f9320336/1000201.2e96/head//5be_compare_scrubmaps:
 5.336 shard 23 missing
 1a920336/1000228.d501/head//5be_compare_scrubmaps: 5.336 shard 23
 missing 24a20336/10001bc.3e06/head//5be_compare_scrubmaps: 5.336
 shard 23 missing cd20336/1000227.4775/head//5be_compare_scrubmaps:
 5.336 shard 23 missing
 cef20336/10001b9.2260/head//5be_compare_scrubmaps: 5.336 shard 23
 missing ba240336/10001d8.0630/head//5be_compare_scrubmaps: 5.336
 shard 23 missing 3e740336/10001b1.2089/head//5be_compare_scrubmaps:
 5.336 shard 23 missing
 e840336/10001ba.2618/head//5be_compare_scrubmaps: 5.336 shard 23
 missing 17b40336/1e9.0287/head//5be_compare_scrubmaps: 5.336
 shard 23 missing b7950336/1e4.0800/head//5be_compare_scrubmaps:
 5.336 shard 23 missing
 94560336/10001b4.2834/head//5be_compare_scrubmaps: 5.336 shard 23
 missing 71370336/151.0179/head//5be_compare_scrubmaps: 5.336
 shard 23 missing 62370336/10001b5.3b5b/head//5be_compare_scrubmaps:
 5.336 shard 23 missing
 e9670336/1000120.03f8/head//5be_compare_scrubmaps: 5.336 shard 23
 missing 1b480336/100019a.0d4b/head//5be_compare_scrubmaps: 5.336
 shard 23 missing 11880336/10001e8.03e9/head//5be_compare_scrubmaps:
 5.336 shard 23 missing
 56c80336/183.0255/head//5be_compare_scrubmaps: 5.336 shard 23
 missing 97790336/10001e7.0668/head//5be_compare_scrubmaps: 5.336
 shard 23 missing e4ca0336/10001b6.278c/head//5be_compare_scrubmaps:
 5.336 shard 23 missing 4eda0336/100019e.36ad/head//5
 (...)
 2015-08-24 11:31:14.336760 osd.13 X.X.X.X:6804/20104 2476 : cluster [ERR]
 5.336 scrub 1850 missing, 0 inconsistent objects
 2015-08-24 11:31:14.336764 osd.13 X.X.X.X:6804/20104 2477 : cluster [ERR]
 5.336 scrub 1850 errors

 4. We have tried to restart the problematic osd, but that fails.

 # /etc/init.d/ceph stop osd.23
 === osd.23 ===
 Stopping Ceph osd.23 on osd3...done
 [root@osd3 ~]# /etc/init.d/ceph start osd.23
 === osd.23 ===
 create-or-move updated item name 'osd.23' weight 2.72 at location
 {host=osd3,root=default} to crush map
 Starting Ceph osd.23 on osd3...
 starting osd.23 at :/0 osd_data /var/lib/ceph/osd/ceph-23
 /var/lib/ceph/osd/ceph-23/journal

 # tail -f /var/log/ceph/ceph-osd.23.log
 2015-08-24 11:48:12.189322 7fa24d85d800  0 ceph version 0.94.2
 (5fb85614ca8f354284c713a2f9c610860720bbf3), process ceph-osd, pid 7266
 2015-08-24 11:48:12.389747 7fa24d85d800  0
 filestore(/var/lib/ceph/osd/ceph-23) backend xfs (magic 0x58465342)
 2015-08-24 11:48:12.391370 7fa24d85d800  0
 genericfilestorebackend(/var/lib/ceph/osd/ceph-23) detect_features: FIEMAP
 ioctl is supported and appears to work
 2015-08-24 11:48:12.391381 7fa24d85d800  0
 genericfilestorebackend(/var/lib/ceph/osd/ceph-23) detect_features: FIEMAP
 ioctl is disabled via 'filestore fiemap' config option
 2015-08-24 11:48:12.404785 7fa24d85d800  0