Re: [ceph-users] NFS interaction with RBD

2015-06-11 Thread Christian Schnidrig
Hi George

Well that’s strange. I wonder why our systems behave so differently.

We’ve got:

Hypervisors running on Ubuntu 14.04. 
VMs with 9 ceph volumes: 2TB each.
XFS instead of your ext4

Maybe the number of placement groups plays a major role as well. Jens-Christian 
may be able to give you the specifics of our ceph cluster. 
I’m about to leave on vacation and don’t have time to look that up anymore.

Best regards
Christian


On 29 May 2015, at 14:42, Georgios Dimitrakakis gior...@acmac.uoc.gr wrote:

 All,
 
 I've tried to recreate the issue without success!
 
 My configuration is the following:
 
 OS (Hypervisor + VM): CentOS 6.6 (2.6.32-504.1.3.el6.x86_64)
 QEMU: qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64
 Ceph: ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), 20x4TB 
 OSDs equally distributed on two disk nodes, 3xMonitors
 
 
 OpenStack Cinder has been configured to provide RBD Volumes from Ceph.
 
 I have created 10x 500GB Volumes which were then all attached to a single 
 Virtual Machine.
 
 All volumes were formatted two times for comparison reasons, one using 
 mkfs.xfs and one using mkfs.ext4.
 I did try to issue the commands all at the same time (or as close to that as possible).
 
 In both tests I didn't notice any interruption. It may have taken longer than just 
 doing one at a time, but the system was continuously up and everything was 
 responding without problems.
 
 At the time of these processes the open connections were 100 with one of the 
 OSD nodes and 111 with the other one.
 
 So I guess I am not experiencing the issue due to the low number of OSDs I 
 have. Is my assumption correct?
 
 
 Best regards,
 
 George
 
 
 
 Thanks a million for the feedback Christian!
 
 I've tried to recreate the issue with 10 RBD volumes mounted on a
 single server without success!
 
 I've issued the mkfs.xfs commands simultaneously (or at least as
 fast as I could in different terminals) without noticing any
 problems. Can you please tell me the size of each one of the
 RBD volumes, because I have a feeling that mine were too small, and if so
 I have to test it on our bigger cluster.
 
 I've also thought that besides the QEMU version, the underlying OS
 might also be important, so what was your testbed?
 
 
 All the best,
 
 George
 
 Hi George
 
 In order to experience the error it was enough to simply run mkfs.xfs
 on all the volumes.
 
 
 In the meantime it became clear what the problem was:
 
 ~ ; cat /proc/183016/limits
 ...
 Max open files            1024                 4096                 files
 ..
 
 This can be changed by setting a decent value in
 /etc/libvirt/qemu.conf for max_files.
 
 Regards
 Christian
 
 
 
 On 27 May 2015, at 16:23, Jens-Christian Fischer
 jens-christian.fisc...@switch.ch wrote:
 
 George,
 
 I will let Christian provide you the details. As far as I know, it was 
 enough to just do a ‘ls’ on all of the attached drives.
 
 we are using Qemu 2.0:
 
 $ dpkg -l | grep qemu
 ii  ipxe-qemu   
 1.0.0+git-2013.c3d1e78-2ubuntu1   all  PXE boot firmware - ROM 
 images for qemu
 ii  qemu-keymaps2.0.0+dfsg-2ubuntu1.11  all
   QEMU keyboard maps
 ii  qemu-system 2.0.0+dfsg-2ubuntu1.11  amd64  
   QEMU full system emulation binaries
 ii  qemu-system-arm 2.0.0+dfsg-2ubuntu1.11  amd64  
   QEMU full system emulation binaries (arm)
 ii  qemu-system-common  2.0.0+dfsg-2ubuntu1.11  amd64  
   QEMU full system emulation binaries (common files)
 ii  qemu-system-mips2.0.0+dfsg-2ubuntu1.11  amd64  
   QEMU full system emulation binaries (mips)
 ii  qemu-system-misc2.0.0+dfsg-2ubuntu1.11  amd64  
   QEMU full system emulation binaries (miscelaneous)
 ii  qemu-system-ppc 2.0.0+dfsg-2ubuntu1.11  amd64  
   QEMU full system emulation binaries (ppc)
 ii  qemu-system-sparc   2.0.0+dfsg-2ubuntu1.11  amd64  
   QEMU full system emulation binaries (sparc)
 ii  qemu-system-x86 2.0.0+dfsg-2ubuntu1.11  amd64  
   QEMU full system emulation binaries (x86)
 ii  qemu-utils  2.0.0+dfsg-2ubuntu1.11  amd64  
   QEMU utilities
 
 cheers
 jc
 
 --
 SWITCH
 Jens-Christian Fischer, Peta Solutions
 Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
 phone +41 44 268 15 15, direct +41 44 268 15 71
 jens-christian.fisc...@switch.ch
 http://www.switch.ch
 
 http://www.switch.ch/stories
 
 On 26.05.2015, at 19:12, Georgios Dimitrakakis gior...@acmac.uoc.gr 
 wrote:
 
 Jens-Christian,
 
 how did you test that? Did you just try to write to them 
 simultaneously? Any other tests that one can perform to verify that?
 
 In our installation we have a VM with 30 RBD volumes mounted which are 
 all exported via NFS to other VMs.
 No one has complained for the moment but the load/usage is 

Re: [ceph-users] NFS interaction with RBD

2015-06-11 Thread Christian Schnidrig
Hi George

In order to experience the error it was enough to simply run mkfs.xfs on all 
the volumes.


In the meantime it became clear what the problem was:

 ~ ; cat /proc/183016/limits
...
Max open files            1024                 4096                 files
..

This can be changed by setting a decent value in /etc/libvirt/qemu.conf for 
max_files.
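
For anyone who wants to check and fix this on their own hypervisors, a rough
sketch (the PID is just an example, and max_files needs a libvirt version
recent enough to support it; libvirtd must be restarted and the guests
power-cycled before the new limit applies):

  # check the limit of a running qemu process (183016 is an example PID)
  grep 'open files' /proc/183016/limits

  # /etc/libvirt/qemu.conf -- raise the per-guest file descriptor limit
  max_files = 32768

  # on Ubuntu 14.04, something like:
  service libvirt-bin restart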

Regards
Christian



On 27 May 2015, at 16:23, Jens-Christian Fischer 
jens-christian.fisc...@switch.ch wrote:

 George,
 
 I will let Christian provide you the details. As far as I know, it was enough 
 to just do a ‘ls’ on all of the attached drives.
 
 we are using Qemu 2.0:
 
 $ dpkg -l | grep qemu
 ii  ipxe-qemu   1.0.0+git-2013.c3d1e78-2ubuntu1   
 all  PXE boot firmware - ROM images for qemu
 ii  qemu-keymaps2.0.0+dfsg-2ubuntu1.11
 all  QEMU keyboard maps
 ii  qemu-system 2.0.0+dfsg-2ubuntu1.11
 amd64QEMU full system emulation binaries
 ii  qemu-system-arm 2.0.0+dfsg-2ubuntu1.11
 amd64QEMU full system emulation binaries (arm)
 ii  qemu-system-common  2.0.0+dfsg-2ubuntu1.11
 amd64QEMU full system emulation binaries (common files)
 ii  qemu-system-mips2.0.0+dfsg-2ubuntu1.11
 amd64QEMU full system emulation binaries (mips)
 ii  qemu-system-misc2.0.0+dfsg-2ubuntu1.11
 amd64QEMU full system emulation binaries (miscelaneous)
 ii  qemu-system-ppc 2.0.0+dfsg-2ubuntu1.11
 amd64QEMU full system emulation binaries (ppc)
 ii  qemu-system-sparc   2.0.0+dfsg-2ubuntu1.11
 amd64QEMU full system emulation binaries (sparc)
 ii  qemu-system-x86 2.0.0+dfsg-2ubuntu1.11
 amd64QEMU full system emulation binaries (x86)
 ii  qemu-utils  2.0.0+dfsg-2ubuntu1.11
 amd64QEMU utilities
 
 cheers
 jc
 
 -- 
 SWITCH
 Jens-Christian Fischer, Peta Solutions
 Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
 phone +41 44 268 15 15, direct +41 44 268 15 71
 jens-christian.fisc...@switch.ch
 http://www.switch.ch
 
 http://www.switch.ch/stories
 
 On 26.05.2015, at 19:12, Georgios Dimitrakakis gior...@acmac.uoc.gr wrote:
 
 Jens-Christian,
 
 how did you test that? Did you just try to write to them simultaneously? 
 Any other tests that one can perform to verify that?
 
 In our installation we have a VM with 30 RBD volumes mounted which are all 
 exported via NFS to other VMs.
 No one has complained for the moment but the load/usage is very minimal.
 If this problem really exists then very soon after the trial phase is 
 over we will have millions of complaints :-(
 
 What version of QEMU are you using? We are using the one provided by Ceph in 
 qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm
 
 Best regards,
 
 George
 
 I think we (i.e. Christian) found the problem:
 
 We created a test VM with 9 mounted RBD volumes (no NFS server). As
 soon as he hit all disks, we started to experience these 120 second
 timeouts. We realized that the QEMU process on the hypervisor is
 opening a TCP connection to every OSD for every mounted volume -
 exceeding the 1024 FD limit.
 
 So no deep scrubbing etc, but simply too many connections…
 
 cheers
 jc
 
 --
 SWITCH
 Jens-Christian Fischer, Peta Solutions
 Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
 phone +41 44 268 15 15, direct +41 44 268 15 71
 jens-christian.fisc...@switch.ch [3]
 http://www.switch.ch
 
 http://www.switch.ch/stories
 
 On 25.05.2015, at 06:02, Christian Balzer  wrote:
 
 Hello,
 
 let's compare your case with John-Paul's.
 
 Different OS and Ceph versions (thus we can assume different NFS
 versions
 as well).
 The only common thing is that both of you added OSDs and are likely
 suffering from delays stemming from Ceph re-balancing or
 deep-scrubbing.
 
 Ceph logs will only pipe up when things have been blocked for more
 than 30
 seconds, NFS might take offense to lower values (or the accumulation
 of
 several distributed delays).
 
 You added 23 OSDs, tell us more about your cluster, HW, network.
 Were these added to the existing 16 nodes, are these on new storage
 nodes
 (so could there be something different with those nodes?), how busy
 is your
 network, CPU.
 Running something like collectd to gather all ceph perf data and
 other
 data from the storage nodes and then feeding it to graphite (or
 similar)
 can be VERY helpful to identify if something is going wrong and what
 it is
 in particular.
 Otherwise run atop on your storage nodes to identify if CPU,
 network,
 specific HDDs/OSDs are bottlenecks.
 
 Deep scrubbing can be _very_ taxing, do your problems persist if you
 inject
 into your running cluster an osd_scrub_sleep value of 0.5 (lower
 

Re: [ceph-users] NFS interaction with RBD

2015-05-29 Thread Georgios Dimitrakakis

All,

I've tried to recreate the issue without success!

My configuration is the following:

OS (Hypervisor + VM): CentOS 6.6 (2.6.32-504.1.3.el6.x86_64)
QEMU: qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64
Ceph: ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047), 
20x4TB OSDs equally distributed on two disk nodes, 3xMonitors



OpenStack Cinder has been configured to provide RBD Volumes from Ceph.

I have created 10x 500GB Volumes which were then all attached to a 
single Virtual Machine.
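
Roughly, the volumes were created and attached with the usual CLI calls,
something like this (the volume names and the instance UUID below are
placeholders):

  # create ten 500GB Cinder volumes
  for i in $(seq 1 10); do cinder create 500 --display-name test-vol-$i; done

  # once they are 'available', attach each one to the test instance
  for id in $(cinder list | awk '/test-vol/ {print $2}'); do
      nova volume-attach <instance-uuid> $id
  done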


All volumes were formatted two times for comparison reasons, one using 
mkfs.xfs and one using mkfs.ext4.
I did try to issue the commands all at the same time (or as close to 
that as possible).
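
Something along these lines (device names are an assumption; on my VM the
ten volumes showed up as /dev/vdb through /dev/vdk):

  # first pass: XFS on all ten volumes at once
  for d in /dev/vd{b..k}; do mkfs.xfs -f $d & done; wait

  # second pass: ext4 on the same volumes
  for d in /dev/vd{b..k}; do mkfs.ext4 -F $d & done; wait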


In both tests I didn't notice any interruption. It may have taken longer 
than doing one at a time, but the system was continuously up and 
everything was responding without problems.


At the time of these processes the open connections were 100 with one 
of the OSD nodes and 111 with the other one.
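
(Counted on the hypervisor roughly like this -- the addresses are placeholders
for our two OSD nodes, and the process name may differ on other setups:)

  # established TCP connections from the qemu-kvm process to each OSD node
  sudo netstat -tnp | grep qemu-kvm | grep -c '<osd-node-1-ip>'
  sudo netstat -tnp | grep qemu-kvm | grep -c '<osd-node-2-ip>'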


So I guess I am not experiencing the issue due to the low number of 
OSDs I have. Is my assumption correct?



Best regards,

George




Thanks a million for the feedback Christian!

I've tried to recreate the issue with 10 RBD volumes mounted on a
single server without success!

I've issued the mkfs.xfs commands simultaneously (or at least as
fast as I could in different terminals) without noticing any
problems. Can you please tell me the size of each one of the
RBD volumes, because I have a feeling that mine were too small, and if so
I have to test it on our bigger cluster.

I've also thought that besides the QEMU version, the underlying OS
might also be important, so what was your testbed?


All the best,

George


Hi George

In order to experience the error it was enough to simply run 
mkfs.xfs

on all the volumes.


In the meantime it became clear what the problem was:

 ~ ; cat /proc/183016/limits
...
Max open files            1024                 4096                 files

..

This can be changed by setting a decent value in
/etc/libvirt/qemu.conf for max_files.

Regards
Christian



On 27 May 2015, at 16:23, Jens-Christian Fischer
jens-christian.fisc...@switch.ch wrote:


George,

I will let Christian provide you the details. As far as I know, it 
was enough to just do a ‘ls’ on all of the attached drives.


we are using Qemu 2.0:

$ dpkg -l | grep qemu
ii  ipxe-qemu   
1.0.0+git-2013.c3d1e78-2ubuntu1   all  PXE boot firmware 
- ROM images for qemu
ii  qemu-keymaps2.0.0+dfsg-2ubuntu1.11  
all  QEMU keyboard maps
ii  qemu-system 2.0.0+dfsg-2ubuntu1.11  
amd64QEMU full system emulation binaries
ii  qemu-system-arm 2.0.0+dfsg-2ubuntu1.11  
amd64QEMU full system emulation binaries (arm)
ii  qemu-system-common  2.0.0+dfsg-2ubuntu1.11  
amd64QEMU full system emulation binaries (common files)
ii  qemu-system-mips2.0.0+dfsg-2ubuntu1.11  
amd64QEMU full system emulation binaries (mips)
ii  qemu-system-misc2.0.0+dfsg-2ubuntu1.11  
amd64QEMU full system emulation binaries (miscelaneous)
ii  qemu-system-ppc 2.0.0+dfsg-2ubuntu1.11  
amd64QEMU full system emulation binaries (ppc)
ii  qemu-system-sparc   2.0.0+dfsg-2ubuntu1.11  
amd64QEMU full system emulation binaries (sparc)
ii  qemu-system-x86 2.0.0+dfsg-2ubuntu1.11  
amd64QEMU full system emulation binaries (x86)
ii  qemu-utils  2.0.0+dfsg-2ubuntu1.11  
amd64QEMU utilities


cheers
jc

--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/stories

On 26.05.2015, at 19:12, Georgios Dimitrakakis 
gior...@acmac.uoc.gr wrote:



Jens-Christian,

how did you test that? Did you just try to write to them 
simultaneously? Any other tests that one can perform to verify that?


In our installation we have a VM with 30 RBD volumes mounted which 
are all exported via NFS to other VMs.
No one has complained for the moment but the load/usage is very 
minimal.
If this problem really exists then very soon after the trial phase 
is over we will have millions of complaints :-(


What version of QEMU are you using? We are using the one provided 
by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm


Best regards,

George


I think we (i.e. Christian) found the problem:

We created a test VM with 9 mounted RBD volumes (no NFS server). 
As
soon as he hit all disks, we started to experience these 120 
second

timeouts. We realized that the QEMU process on the hypervisor is
opening a TCP connection to every OSD for every mounted volume -
exceeding the 1024 FD 

Re: [ceph-users] NFS interaction with RBD

2015-05-29 Thread John-Paul Robinson
In the end this came down to one slow OSD.  There were no hardware
issues so we have to just assume something gummed up during rebalancing and
peering.

I restarted the osd process after setting the cluster to noout.  After
the osd was restarted the rebalance completed and the cluster returned
to health ok.
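
For the record, the sequence was roughly the following (the OSD id is a
placeholder, and the restart command depends on your init setup; on our
Ubuntu 12.04 nodes the sysvinit script accepts the osd.N form):

  ceph osd set noout
  # on the host carrying the slow OSD:
  sudo service ceph restart osd.<id>
  # once it rejoins and the cluster settles:
  ceph osd unset noout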

As soon as the osd restarted all previously hanging operations returned
to normal.

I'm surprised by a single slow OSD impacting access to the entire
cluster.   I understand now that only the primary osd is used for reads,
and writes must go to the primary and then the secondaries, but I would have
expected the impact to be more contained.

We currently build XFS file systems directly on RBD images.  I'm
wondering if there would be any value in using an LVM abstraction on top
to spread access across other OSDs for read and failure scenarios.

Any thoughts on the above appreciated.

~jpr


On 05/28/2015 03:18 PM, John-Paul Robinson wrote:
 To follow up on the original post,

 Further digging indicates this is a problem with RBD image access and
 is not related to NFS-RBD interaction as initially suspected.  The
 nfsd is simply hanging as a result of a hung request to the XFS file
 system mounted on our RBD-NFS gateway.This hung XFS call is caused
 by a problem with the RBD module interacting with our Ceph pool.

 I've found a reliable way to trigger a hang directly on an rbd image
 mapped into our RBD-NFS gateway box.  The image contains an XFS file
 system.  When I try to list the contents of a particular directory,
 the request hangs indefinitely.

 Two weeks ago our ceph status was:

 jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
health HEALTH_WARN 1 near full osd(s)
monmap e1: 3 mons at
 
 {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0},
 election epoch 350, quorum 0,1,2
 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
osdmap e5978: 66 osds: 66 up, 66 in
 pgmap v26434260: 3072 pgs: 3062 active+clean, 6
 active+clean+scrubbing, 4 active+clean+scrubbing+deep; 45712 GB
 data, 91590 GB used, 51713 GB / 139 TB avail; 12234B/s wr, 1op/s
mdsmap e1: 0/0/1 up


 The near full osd was number 53 and we updated our crush map to
 reweight the osd.  All of the OSDs had a weight of 1 based on the
 assumption that all osds were 2.0TB.  Apparently one of our servers had
 the OSDs sized to 2.8TB and this caused the OSD imbalance even though
 we are only at 50% utilization.  We reweighted the near full osd to .8
 and that initiated a rebalance that has since relieved the 95% full
 condition on that OSD.

 However, since that time the re-peering has not completed and we
 suspect this is causing problems with our access to RBD images.   Our
 current ceph status is:

 jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
health HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs
 stuck unclean; recovery 9/23842120 degraded (0.000%)
monmap e1: 3 mons at
 
 {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0},
 election epoch 350, quorum 0,1,2
 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
osdmap e6036: 66 osds: 66 up, 66 in
 pgmap v27104371: 3072 pgs: 3 active, 3056 active+clean, 9
 active+clean+scrubbing, 1 remapped+peering, 3
 active+clean+scrubbing+deep; 45868 GB data, 92006 GB used, 51297
 GB / 139 TB avail; 3125B/s wr, 0op/s; 9/23842120 degraded (0.000%)
mdsmap e1: 0/0/1 up


 Here are further details on our stuck pgs:

 jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg
 dump_stuck inactive
 ok
 pg_stat objects mip degr unf bytes   log disklog
 state   state_stamp v   reported up  acting 
 last_scrub   scrub_stamp  last_deep_scrub deep_scrub_stamp
 3.3af   11600   0   0   0   47941791744 153812 
 153812  remapped+peering2015-05-15 12:47:17.223786 
 5979'293066  6000'1248735 [48,62] [53,48,62] 
 5979'293056 2015-05-15 07:40:36.275563  5979'293056
 2015-05-15 07:40:36.275563

 jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg
 dump_stuck unclean
 ok
 pg_stat objects mip degr unf bytes   log disklog
 state   state_stamp v   reported up  acting 
 last_scrub   scrub_stamp  last_deep_scrub deep_scrub_stamp
 3.106   11870   0   9   0   49010106368 163991 
 163991  active  2015-05-15 12:47:19.761469  6035'356332
 5968'1358516 [62,53]  [62,53] 5979'356242 2015-05-14
 22:22:12.966150  5979'351351 2015-05-12 18:04:41.838686
 5.104   0   0   0   0   0   0   0  
 active  2015-05-15 12:47:19.800676  0'0 5968'1615  
 

Re: [ceph-users] NFS interaction with RBD

2015-05-28 Thread John-Paul Robinson
To follow up on the original post,

Further digging indicates this is a problem with RBD image access and is
not related to NFS-RBD interaction as initially suspected.  The nfsd is
simply hanging as a result of a hung request to the XFS file system
mounted on our RBD-NFS gateway.This hung XFS call is caused by a
problem with the RBD module interacting with our Ceph pool.

I've found a reliable way to trigger a hang directly on an rbd image
mapped into our RBD-NFS gateway box.  The image contains an XFS file
system.  When I try to list the contents of a particular directory, the
request hangs indefinitely.

Two weeks ago our ceph status was:

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
   health HEALTH_WARN 1 near full osd(s)
   monmap e1: 3 mons at

{da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0},
election epoch 350, quorum 0,1,2
da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
   osdmap e5978: 66 osds: 66 up, 66 in
pgmap v26434260: 3072 pgs: 3062 active+clean, 6
active+clean+scrubbing, 4 active+clean+scrubbing+deep; 45712 GB
data, 91590 GB used, 51713 GB / 139 TB avail; 12234B/s wr, 1op/s
   mdsmap e1: 0/0/1 up


The near full osd was number 53 and we updated our crush map to reweight
the osd.  All of the OSDs had a weight of 1 based on the assumption that
all osds were 2.0TB.  Apparently one of our servers had the OSDs sized to
2.8TB and this caused the OSD imbalance even though we are only at 50%
utilization.  We reweighted the near full osd to .8 and that initiated a
rebalance that has since relieved the 95% full condition on that OSD.
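
The reweight itself is a one-liner; in our case it was along the lines of
(osd number and weight as described above):

  # lower the CRUSH weight of the near-full OSD so data moves off it
  ceph osd crush reweight osd.53 0.8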

However, since that time the re-peering has not completed and we suspect
this is causing problems with our access to RBD images.   Our current
ceph status is:

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
   health HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs
stuck unclean; recovery 9/23842120 degraded (0.000%)
   monmap e1: 3 mons at

{da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0},
election epoch 350, quorum 0,1,2
da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
   osdmap e6036: 66 osds: 66 up, 66 in
pgmap v27104371: 3072 pgs: 3 active, 3056 active+clean, 9
active+clean+scrubbing, 1 remapped+peering, 3
active+clean+scrubbing+deep; 45868 GB data, 92006 GB used, 51297 GB
/ 139 TB avail; 3125B/s wr, 0op/s; 9/23842120 degraded (0.000%)
   mdsmap e1: 0/0/1 up


Here are further details on our stuck pgs:

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg
dump_stuck inactive
ok
pg_stat objects mip degr unf bytes   log disklog
state   state_stamp v   reported up  acting 
last_scrub   scrub_stamp  last_deep_scrub deep_scrub_stamp
3.3af   11600   0   0   0   47941791744 153812 
153812  remapped+peering2015-05-15 12:47:17.223786 
5979'293066  6000'1248735 [48,62] [53,48,62] 
5979'293056 2015-05-15 07:40:36.275563  5979'293056
2015-05-15 07:40:36.275563

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg
dump_stuck unclean
ok
pg_stat objects mip degr unf bytes   log disklog
state   state_stamp v   reported up  acting 
last_scrub   scrub_stamp  last_deep_scrub deep_scrub_stamp
3.106   11870   0   9   0   49010106368 163991 
163991  active  2015-05-15 12:47:19.761469  6035'356332
5968'1358516 [62,53]  [62,53] 5979'356242 2015-05-14
22:22:12.966150  5979'351351 2015-05-12 18:04:41.838686
5.104   0   0   0   0   0   0   0  
active  2015-05-15 12:47:19.800676  0'0 5968'1615  
[62,53] [62,53]   0'0 2015-05-14 18:43:22.425105 
0'0 2015-05-08 10:19:54.938934
4.105   0   0   0   0   0   0   0  
active  2015-05-15 12:47:19.801028  0'0 5968'1615  
[62,53] [62,53]   0'0 2015-05-14 18:43:04.434826 
0'0 2015-05-14 18:43:04.434826
3.3af   11600   0   0   0   47941791744 153812 
153812  remapped+peering2015-05-15 12:47:17.223786 
5979'293066  6000'1248735 [48,62] [53,48,62] 
5979'293056 2015-05-15 07:40:36.275563  5979'293056
2015-05-15 07:40:36.275563


The servers in the pool are not overloaded.  On the ceph server that
originally had the nearly full osd (osd 53), I'm seeing entries like
this in the osd log:

2015-05-28 06:25:02.900129 7f2ea8a4f700  0 log [WRN] : 6 slow
requests, 6 included below; oldest blocked for > 1096430.805069 secs
2015-05-28 06:25:02.900145 7f2ea8a4f700  0 log [WRN] : slow request

Re: [ceph-users] NFS interaction with RBD

2015-05-28 Thread Georgios Dimitrakakis

Thanks a million for the feedback Christian!

I've tried to recreate the issue with 10 RBD volumes mounted on a 
single server without success!


I've issued the mkfs.xfs commands simultaneously (or at least as fast 
as I could in different terminals) without noticing any problems. Can 
you please tell me the size of each one of the RBD volumes, 
because I have a feeling that mine were too small, and if so I have to 
test it on our bigger cluster.


I've also thought that besides the QEMU version, the underlying OS might 
also be important, so what was your testbed?



All the best,

George


Hi George

In order to experience the error it was enough to simply run mkfs.xfs
on all the volumes.


In the meantime it became clear what the problem was:

 ~ ; cat /proc/183016/limits
...
Max open files            1024                 4096                 files

..

This can be changed by setting a decent value in
/etc/libvirt/qemu.conf for max_files.

Regards
Christian



On 27 May 2015, at 16:23, Jens-Christian Fischer
jens-christian.fisc...@switch.ch wrote:


George,

I will let Christian provide you the details. As far as I know, it 
was enough to just do a ‘ls’ on all of the attached drives.


we are using Qemu 2.0:

$ dpkg -l | grep qemu
ii  ipxe-qemu   
1.0.0+git-2013.c3d1e78-2ubuntu1   all  PXE boot firmware - 
ROM images for qemu
ii  qemu-keymaps2.0.0+dfsg-2ubuntu1.11   
all  QEMU keyboard maps
ii  qemu-system 2.0.0+dfsg-2ubuntu1.11   
amd64QEMU full system emulation binaries
ii  qemu-system-arm 2.0.0+dfsg-2ubuntu1.11   
amd64QEMU full system emulation binaries (arm)
ii  qemu-system-common  2.0.0+dfsg-2ubuntu1.11   
amd64QEMU full system emulation binaries (common 
files)
ii  qemu-system-mips2.0.0+dfsg-2ubuntu1.11   
amd64QEMU full system emulation binaries (mips)
ii  qemu-system-misc2.0.0+dfsg-2ubuntu1.11   
amd64QEMU full system emulation binaries 
(miscelaneous)
ii  qemu-system-ppc 2.0.0+dfsg-2ubuntu1.11   
amd64QEMU full system emulation binaries (ppc)
ii  qemu-system-sparc   2.0.0+dfsg-2ubuntu1.11   
amd64QEMU full system emulation binaries (sparc)
ii  qemu-system-x86 2.0.0+dfsg-2ubuntu1.11   
amd64QEMU full system emulation binaries (x86)
ii  qemu-utils  2.0.0+dfsg-2ubuntu1.11   
amd64QEMU utilities


cheers
jc

--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/stories

On 26.05.2015, at 19:12, Georgios Dimitrakakis 
gior...@acmac.uoc.gr wrote:



Jens-Christian,

how did you test that? Did you just try to write to them 
simultaneously? Any other tests that one can perform to verify that?


In our installation we have a VM with 30 RBD volumes mounted which 
are all exported via NFS to other VMs.
No one has complained for the moment but the load/usage is very 
minimal.
If this problem really exists then very soon after the trial phase 
is over we will have millions of complaints :-(


What version of QEMU are you using? We are using the one provided 
by Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm


Best regards,

George


I think we (i.e. Christian) found the problem:

We created a test VM with 9 mounted RBD volumes (no NFS server). 
As
soon as he hit all disks, we started to experience these 120 
second

timeouts. We realized that the QEMU process on the hypervisor is
opening a TCP connection to every OSD for every mounted volume -
exceeding the 1024 FD limit.

So no deep scrubbing etc, but simply too many connections…

cheers
jc

--
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch [3]
http://www.switch.ch

http://www.switch.ch/stories

On 25.05.2015, at 06:02, Christian Balzer  wrote:


Hello,

let's compare your case with John-Paul's.

Different OS and Ceph versions (thus we can assume different NFS
versions
as well).
The only common thing is that both of you added OSDs and are 
likely

suffering from delays stemming from Ceph re-balancing or
deep-scrubbing.

Ceph logs will only pipe up when things have been blocked for 
more

than 30
seconds, NFS might take offense to lower values (or the 
accumulation

of
several distributed delays).

You added 23 OSDs, tell us more about your cluster, HW, network.
Were these added to the existing 16 nodes, are these on new 
storage

nodes
(so could there be something different with those nodes?), how 
busy

is 

Re: [ceph-users] NFS interaction with RBD

2015-05-28 Thread Trent Lloyd
Jens-Christian Fischer jens-christian.fischer@... writes:

 
 I think we (i.e. Christian) found the problem:
 We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as 
he hit all disks, we started to experience these 120 second timeouts. We 
realized that the QEMU process on the hypervisor is opening a TCP connection 
to every OSD for every mounted volume - exceeding the 1024 FD limit.
 
So no deep scrubbing etc, but simply too many connections…

Have seen mention of similar from CERN in their presentations, found this 
post on a quick google.. might help?

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-December/026187.html

Cheers,
Trent


Re: [ceph-users] NFS interaction with RBD

2015-05-27 Thread Jens-Christian Fischer
George,

I will let Christian provide you the details. As far as I know, it was enough 
to just do a ‘ls’ on all of the attached drives.

we are using Qemu 2.0:

$ dpkg -l | grep qemu
ii  ipxe-qemu   1.0.0+git-2013.c3d1e78-2ubuntu1   
all  PXE boot firmware - ROM images for qemu
ii  qemu-keymaps2.0.0+dfsg-2ubuntu1.11
all  QEMU keyboard maps
ii  qemu-system 2.0.0+dfsg-2ubuntu1.11
amd64QEMU full system emulation binaries
ii  qemu-system-arm 2.0.0+dfsg-2ubuntu1.11
amd64QEMU full system emulation binaries (arm)
ii  qemu-system-common  2.0.0+dfsg-2ubuntu1.11
amd64QEMU full system emulation binaries (common files)
ii  qemu-system-mips2.0.0+dfsg-2ubuntu1.11
amd64QEMU full system emulation binaries (mips)
ii  qemu-system-misc2.0.0+dfsg-2ubuntu1.11
amd64QEMU full system emulation binaries (miscelaneous)
ii  qemu-system-ppc 2.0.0+dfsg-2ubuntu1.11
amd64QEMU full system emulation binaries (ppc)
ii  qemu-system-sparc   2.0.0+dfsg-2ubuntu1.11
amd64QEMU full system emulation binaries (sparc)
ii  qemu-system-x86 2.0.0+dfsg-2ubuntu1.11
amd64QEMU full system emulation binaries (x86)
ii  qemu-utils  2.0.0+dfsg-2ubuntu1.11
amd64QEMU utilities

cheers
jc

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/stories

On 26.05.2015, at 19:12, Georgios Dimitrakakis gior...@acmac.uoc.gr wrote:

 Jens-Christian,
 
 how did you test that? Did you just try to write to them simultaneously? 
 Any other tests that one can perform to verify that?
 
 In our installation we have a VM with 30 RBD volumes mounted which are all 
 exported via NFS to other VMs.
 No one has complained for the moment but the load/usage is very minimal.
 If this problem really exists then very soon after the trial phase is 
 over we will have millions of complaints :-(
 
 What version of QEMU are you using? We are using the one provided by Ceph in 
 qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm
 
 Best regards,
 
 George
 
 I think we (i.e. Christian) found the problem:
 
 We created a test VM with 9 mounted RBD volumes (no NFS server). As
 soon as he hit all disks, we started to experience these 120 second
 timeouts. We realized that the QEMU process on the hypervisor is
 opening a TCP connection to every OSD for every mounted volume -
 exceeding the 1024 FD limit.
 
 So no deep scrubbing etc, but simply too many connections…
 
 cheers
 jc
 
 --
 SWITCH
 Jens-Christian Fischer, Peta Solutions
 Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
 phone +41 44 268 15 15, direct +41 44 268 15 71
 jens-christian.fisc...@switch.ch [3]
 http://www.switch.ch
 
 http://www.switch.ch/stories
 
 On 25.05.2015, at 06:02, Christian Balzer  wrote:
 
 Hello,
 
 lets compare your case with John-Paul's.
 
 Different OS and Ceph versions (thus we can assume different NFS
 versions
 as well).
 The only common thing is that both of you added OSDs and are likely
 suffering from delays stemming from Ceph re-balancing or
 deep-scrubbing.
 
 Ceph logs will only pipe up when things have been blocked for more
 than 30
 seconds, NFS might take offense to lower values (or the accumulation
 of
 several distributed delays).
 
 You added 23 OSDs, tell us more about your cluster, HW, network.
 Were these added to the existing 16 nodes, are these on new storage
 nodes
 (so could there be something different with those nodes?), how busy
 is your
 network, CPU.
 Running something like collectd to gather all ceph perf data and
 other
 data from the storage nodes and then feeding it to graphite (or
 similar)
 can be VERY helpful to identify if something is going wrong and what
 it is
 in particular.
 Otherwise run atop on your storage nodes to identify if CPU,
 network,
 specific HDDs/OSDs are bottlenecks.
 
 Deep scrubbing can be _very_ taxing, do your problems persist if you
 inject into your running cluster an osd_scrub_sleep value of 0.5 (lower
 that until it hurts again) or if you turn off deep scrubs altogether for
 the moment?
 
 Christian
 
 On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:
 
 We see something very similar on our Ceph cluster, starting as of
 today.
 
 We use a 16 node, 102 OSD Ceph installation as the basis for an
 Icehouse
 OpenStack cluster (we applied the RBD patches for live migration
 etc)
 
 On this cluster we have a big ownCloud installation (Sync & Share)
 that
 stores its files on three NFS servers, each 

Re: [ceph-users] NFS interaction with RBD

2015-05-26 Thread Jens-Christian Fischer
I think we (i.e. Christian) found the problem:

We created a test VM with 9 mounted RBD volumes (no NFS server). As soon as he 
hit all disks, we started to experience these 120 second timeouts. We realized 
that the QEMU process on the hypervisor is opening a TCP connection to every 
OSD for every mounted volume - exceeding the 1024 FD limit.

So no deep scrubbing etc, but simply too many connections…
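
A quick way to confirm this on the hypervisor is to look at the qemu process
itself (sketch only -- the instance name is a placeholder and the PID lookup
assumes a single matching process):

  pid=$(pgrep -f 'qemu.*instance-00000042')
  ls /proc/$pid/fd | wc -l              # descriptors currently open
  grep 'open files' /proc/$pid/limits   # the 1024/4096 soft/hard limit

With 9 volumes attached and 125 OSDs in the cluster, that is on the order of
9 x 125 = 1125 potential sockets, already past the default soft limit of 1024.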

cheers
jc

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/stories

On 25.05.2015, at 06:02, Christian Balzer ch...@gol.com wrote:

 
 Hello,
 
 lets compare your case with John-Paul's.
 
 Different OS and Ceph versions (thus we can assume different NFS versions
 as well).
 The only common thing is that both of you added OSDs and are likely
 suffering from delays stemming from Ceph re-balancing or deep-scrubbing.
 
 Ceph logs will only pipe up when things have been blocked for more than 30
 seconds, NFS might take offense to lower values (or the accumulation of
 several distributed delays).
 
 You added 23 OSDs, tell us more about your cluster, HW, network.
 Were these added to the existing 16 nodes, are these on new storage nodes
 (so could there be something different with those nodes?), how busy is your
 network, CPU.
 Running something like collectd to gather all ceph perf data and other
 data from the storage nodes and then feeding it to graphite (or similar)
 can be VERY helpful to identify if something is going wrong and what it is
 in particular.
 Otherwise run atop on your storage nodes to identify if CPU, network,
 specific HDDs/OSDs are bottlenecks. 
 
 Deep scrubbing can be _very_ taxing, do your problems persist if you inject
 into your running cluster an osd_scrub_sleep value of 0.5 (lower that
 until it hurts again) or if you turn off deep scrubs altogether for the
 moment?
 
 Christian
 
 On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:
 
 We see something very similar on our Ceph cluster, starting as of today.
 
 We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse
 OpenStack cluster (we applied the RBD patches for live migration etc)
 
 On this cluster we have a big ownCloud installation (Sync & Share) that
 stores its files on three NFS servers, each mounting 6 2TB RBD volumes
 and exposing them to around 10 web server VMs (we originally started
 with one NFS server with a 100TB volume, but that has become unwieldy).
 All of the servers (hypervisors, ceph storage nodes and VMs) are using
 Ubuntu 14.04
 
 Yesterday evening we added 23 OSDs to the cluster bringing it up to 125
 OSDs (because we had 4 OSDs that were nearing the 90% full mark). The
 rebalancing process ended this morning (after around 12 hours) The
 cluster has been clean since then:
 
cluster b1f3f4c8-x
 health HEALTH_OK
 monmap e2: 3 mons at
 {zhdk0009=[:::1009]:6789/0,zhdk0013=[:::1013]:6789/0,zhdk0025=[:::1025]:6789/0},
 election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025 osdmap
 e43476: 125 osds: 125 up, 125 in pgmap v18928606: 3336 pgs, 17 pools,
 82447 GB data, 22585 kobjects 266 TB used, 187 TB / 454 TB avail 3319
 active+clean 17 active+clean+scrubbing+deep
  client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s
 
 At midnight, we run a script that creates an RBD snapshot of all RBD
 volumes that are attached to the NFS servers (for backup purposes).
 Looking at our monitoring, around that time, one of the NFS servers
 became unresponsive and took down the complete ownCloud installation
 (load on the web server was > 200 and they had lost some of the NFS
 mounts)
 
 Rebooting the NFS server solved that problem, but the NFS kernel server
 kept crashing all day long after having run between 10 to 90 minutes.
 
 We initially suspected a corrupt rbd volume (as it seemed that we could
 trigger the kernel crash by just running “ls -l” on one of the volumes), but
 subsequent “xfs_repair -n” checks on those RBD volumes showed no
 problems.
 
 We migrated the NFS server off of its hypervisor, suspecting a problem
 with RBD kernel modules, rebooted the hypervisor but the problem
 persisted (both on the new hypervisor, and on the old one when we
 migrated it back)
 
 We changed the /etc/default/nfs-kernel-server to start up 256 servers
 (even though the defaults had been working fine for over a year)
 
 Only one of our 3 NFS servers crashes (see below for syslog information)
 - the other 2 have been fine
 
 May 23 21:44:10 drive-nfs1 kernel: [  165.264648] NFSD:
 Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory May
 23 21:44:19 drive-nfs1 kernel: [  173.880092] NFSD: starting 90-second
 grace period (net 81cdab00) May 23 21:44:23 drive-nfs1
 rpc.mountd[1724]: Version 1.2.8 starting May 23 21:44:28 drive-nfs1
 kernel: [  182.917775] ip_tables: (C) 2000-2006 Netfilter Core 

Re: [ceph-users] NFS interaction with RBD

2015-05-26 Thread Georgios Dimitrakakis

Jens-Christian,

how did you test that? Did you just try to write to them 
simultaneously? Any other tests that one can perform to verify that?


In our installation we have a VM with 30 RBD volumes mounted which are 
all exported via NFS to other VMs.

No one has complained for the moment but the load/usage is very minimal.
If this problem really exists then very soon after the trial phase is 
over we will have millions of complaints :-(


What version of QEMU are you using? We are using the one provided by 
Ceph in qemu-kvm-0.12.1.2-2.415.el6.3ceph.x86_64.rpm


Best regards,

George


I think we (i.e. Christian) found the problem:

We created a test VM with 9 mounted RBD volumes (no NFS server). As
soon as he hit all disks, we started to experience these 120 second
timeouts. We realized that the QEMU process on the hypervisor is
opening a TCP connection to every OSD for every mounted volume -
exceeding the 1024 FD limit.

So no deep scrubbing etc, but simply too many connections…

cheers
jc

 --
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch [3]
http://www.switch.ch

http://www.switch.ch/stories

On 25.05.2015, at 06:02, Christian Balzer  wrote:


Hello,

lets compare your case with John-Paul's.

Different OS and Ceph versions (thus we can assume different NFS
versions
as well).
The only common thing is that both of you added OSDs and are likely
suffering from delays stemming from Ceph re-balancing or
deep-scrubbing.

Ceph logs will only pipe up when things have been blocked for more
than 30
seconds, NFS might take offense to lower values (or the accumulation
of
several distributed delays).

You added 23 OSDs, tell us more about your cluster, HW, network.
Were these added to the existing 16 nodes, are these on new storage
nodes
(so could there be something different with those nodes?), how busy
is your
network, CPU.
Running something like collectd to gather all ceph perf data and
other
data from the storage nodes and then feeding it to graphite (or
similar)
can be VERY helpful to identify if something is going wrong and what
it is
in particular.
Otherwise run atop on your storage nodes to identify if CPU,
network,
specific HDDs/OSDs are bottlenecks.

Deep scrubbing can be _very_ taxing, do your problems persist if you
inject into your running cluster an osd_scrub_sleep value of 0.5 (lower
that until it hurts again) or if you turn off deep scrubs altogether for
the moment?

Christian

On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:


We see something very similar on our Ceph cluster, starting as of
today.

We use a 16 node, 102 OSD Ceph installation as the basis for an
Icehouse
OpenStack cluster (we applied the RBD patches for live migration
etc)

On this cluster we have a big ownCloud installation (Sync & Share)
that
stores its files on three NFS servers, each mounting 6 2TB RBD
volumes
and exposing them to around 10 web server VMs (we originally
started
with one NFS server with a 100TB volume, but that has become
unwieldy).
All of the servers (hypervisors, ceph storage nodes and VMs) are
using
Ubuntu 14.04

Yesterday evening we added 23 OSDs to the cluster bringing it up
to 125
OSDs (because we had 4 OSDs that were nearing the 90% full mark).
The
rebalancing process ended this morning (after around 12 hours) The
cluster has been clean since then:

cluster b1f3f4c8-x
health HEALTH_OK
monmap e2: 3 mons at





{zhdk0009=[:::1009]:6789/0,zhdk0013=[:::1013]:6789/0,zhdk0025=[:::1025]:6789/0},

election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025 osdmap
e43476: 125 osds: 125 up, 125 in pgmap v18928606: 3336 pgs, 17
pools,
82447 GB data, 22585 kobjects 266 TB used, 187 TB / 454 TB avail
3319
active+clean 17 active+clean+scrubbing+deep
client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s

At midnight, we run a script that creates an RBD snapshot of all
RBD
volumes that are attached to the NFS servers (for backup
purposes).
Looking at our monitoring, around that time, one of the NFS
servers
became unresponsive and took down the complete ownCloud
installation
(load on the web server was > 200 and they had lost some of the
NFS
mounts)

Rebooting the NFS server solved that problem, but the NFS kernel
server
kept crashing all day long after having run between 10 to 90
minutes.

We initially suspected a corrupt rbd volume (as it seemed that we
could
trigger the kernel crash by just running “ls -l” on one of the volumes),
but
subsequent “xfs_repair -n” checks on those RBD volumes showed
no
problems.

We migrated the NFS server off of its hypervisor, suspecting a
problem
with RBD kernel modules, rebooted the hypervisor but the problem
persisted (both on the new hypervisor, and on the old one when we
migrated it back)

We changed the /etc/default/nfs-kernel-server to start up 256
servers
(even though the defaults had been working fine for over a year)

Only 

Re: [ceph-users] NFS interaction with RBD

2015-05-24 Thread Christian Balzer

Hello,

let's compare your case with John-Paul's.

Different OS and Ceph versions (thus we can assume different NFS versions
as well).
The only common thing is that both of you added OSDs and are likely
suffering from delays stemming from Ceph re-balancing or deep-scrubbing.

Ceph logs will only pipe up when things have been blocked for more than 30
seconds, NFS might take offense to lower values (or the accumulation of
several distributed delays).

You added 23 OSDs, tell us more about your cluster, HW, network.
Were these added to the existing 16 nodes, are these on new storage nodes
(so could there be something different with those nodes?), how busy is your
network, CPU.
Running something like collectd to gather all ceph perf data and other
data from the storage nodes and then feeding it to graphite (or similar)
can be VERY helpful to identify if something is going wrong and what it is
in particular.
Otherwise run atop on your storage nodes to identify if CPU, network,
specific HDDs/OSDs are bottlenecks. 

Deep scrubbing can be _very_ taxing, do your problems persist if you inject
into your running cluster an osd_scrub_sleep value of 0.5 (lower that
until it hurts again) or if you turn off deep scrubs altogether for the
moment?
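
Something like this (flag names assume a reasonably recent release; adjust as
needed):

  # throttle scrubbing on all OSDs without restarting them
  ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'

  # or suspend deep scrubs entirely for now
  ceph osd set nodeep-scrub    # later: ceph osd unset nodeep-scrub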

Christian

On Sat, 23 May 2015 23:28:32 +0200 Jens-Christian Fischer wrote:

 We see something very similar on our Ceph cluster, starting as of today.
 
 We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse
 OpenStack cluster (we applied the RBD patches for live migration etc)
 
 On this cluster we have a big ownCloud installation (Sync & Share) that
 stores its files on three NFS servers, each mounting 6 2TB RBD volumes
 and exposing them to around 10 web server VMs (we originally started
 with one NFS server with a 100TB volume, but that has become unwieldy).
 All of the servers (hypervisors, ceph storage nodes and VMs) are using
 Ubuntu 14.04
 
 Yesterday evening we added 23 OSDs to the cluster bringing it up to 125
 OSDs (because we had 4 OSDs that were nearing the 90% full mark). The
 rebalancing process ended this morning (after around 12 hours) The
 cluster has been clean since then:
 
 cluster b1f3f4c8-x
  health HEALTH_OK
  monmap e2: 3 mons at
 {zhdk0009=[:::1009]:6789/0,zhdk0013=[:::1013]:6789/0,zhdk0025=[:::1025]:6789/0},
 election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025 osdmap
 e43476: 125 osds: 125 up, 125 in pgmap v18928606: 3336 pgs, 17 pools,
 82447 GB data, 22585 kobjects 266 TB used, 187 TB / 454 TB avail 3319
 active+clean 17 active+clean+scrubbing+deep
   client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s
 
 At midnight, we run a script that creates an RBD snapshot of all RBD
 volumes that are attached to the NFS servers (for backup purposes).
 Looking at our monitoring, around that time, one of the NFS servers
 became unresponsive and took down the complete ownCloud installation
 (load on the web server was > 200 and they had lost some of the NFS
 mounts)
 
 Rebooting the NFS server solved that problem, but the NFS kernel server
 kept crashing all day long after having run between 10 to 90 minutes.
 
 We initially suspected a corrupt rbd volume (as it seemed that we could
 trigger the kernel crash by just running “ls -l” on one of the volumes), but
 subsequent “xfs_repair -n” checks on those RBD volumes showed no
 problems.
 
 We migrated the NFS server off of its hypervisor, suspecting a problem
 with RBD kernel modules, rebooted the hypervisor but the problem
 persisted (both on the new hypervisor, and on the old one when we
 migrated it back)
 
 We changed the /etc/default/nfs-kernel-server to start up 256 servers
 (even though the defaults had been working fine for over a year)
 
 Only one of our 3 NFS servers crashes (see below for syslog information)
 - the other 2 have been fine
 
 May 23 21:44:10 drive-nfs1 kernel: [  165.264648] NFSD:
 Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory May
 23 21:44:19 drive-nfs1 kernel: [  173.880092] NFSD: starting 90-second
 grace period (net 81cdab00) May 23 21:44:23 drive-nfs1
 rpc.mountd[1724]: Version 1.2.8 starting May 23 21:44:28 drive-nfs1
 kernel: [  182.917775] ip_tables: (C) 2000-2006 Netfilter Core Team May
 23 21:44:28 drive-nfs1 kernel: [  182.958465] nf_conntrack version 0.5.0
 (16384 buckets, 65536 max) May 23 21:44:28 drive-nfs1 kernel:
 [  183.044091] ip6_tables: (C) 2000-2006 Netfilter Core Team May 23
 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1
  /dev/null  debian-sa1 1 1) May 23 21:45:17 drive-nfs1
  collectd[1872]: python: Plugin loaded but not configured. May 23
  21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering
  read-loop. May 23 21:47:11 drive-nfs1 kernel: [  346.392283] init:
  plymouth-upstart-bridge main process ended, respawning May 23 21:51:26
  drive-nfs1 kernel: [  600.776177] INFO: task nfsd:1696 blocked for
  more than 120 seconds.
 May 23 21:51:26 drive-nfs1 

Re: [ceph-users] NFS interaction with RBD

2015-05-23 Thread Jens-Christian Fischer
We see something very similar on our Ceph cluster, starting as of today.

We use a 16 node, 102 OSD Ceph installation as the basis for an Icehouse 
OpenStack cluster (we applied the RBD patches for live migration etc)

On this cluster we have a big ownCloud installation (Sync & Share) that stores 
its files on three NFS servers, each mounting 6 2TB RBD volumes and exposing 
them to around 10 web server VMs (we originally started with one NFS server 
with a 100TB volume, but that has become unwieldy). All of the servers 
(hypervisors, ceph storage nodes and VMs) are using Ubuntu 14.04

Yesterday evening we added 23 OSDs to the cluster bringing it up to 125 OSDs 
(because we had 4 OSDs that were nearing the 90% full mark). The rebalancing 
process ended this morning (after around 12 hours)
The cluster has been clean since then:

cluster b1f3f4c8-x
 health HEALTH_OK
 monmap e2: 3 mons at 
{zhdk0009=[:::1009]:6789/0,zhdk0013=[:::1013]:6789/0,zhdk0025=[:::1025]:6789/0},
 election epoch 612, quorum 0,1,2 zhdk0009,zhdk0013,zhdk0025
 osdmap e43476: 125 osds: 125 up, 125 in
  pgmap v18928606: 3336 pgs, 17 pools, 82447 GB data, 22585 kobjects
266 TB used, 187 TB / 454 TB avail
3319 active+clean
  17 active+clean+scrubbing+deep
  client io 8186 kB/s rd, 7747 kB/s wr, 2288 op/s

At midnight, we run a script that creates an RBD snapshot of all RBD volumes 
that are attached to the NFS servers (for backup purposes). Looking at our 
monitoring, around that time, one of the NFS servers became unresponsive and 
took down the complete ownCloud installation (load on the web server was > 200 
and they had lost some of the NFS mounts)
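
The nightly job is essentially a loop over the images backing the NFS servers;
a minimal sketch, with the pool name and snapshot naming scheme as assumptions:

  for img in $(rbd ls nfs-volumes); do
      rbd snap create nfs-volumes/$img@backup-$(date +%Y%m%d)
  done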

Rebooting the NFS server solved that problem, but the NFS kernel server kept 
crashing all day long after having run between 10 and 90 minutes.

We initially suspected a corrupt rbd volume (as it seemed that we could trigger 
the kernel crash by just running “ls -l” on one of the volumes), but subsequent 
“xfs_repair -n” checks on those RBD volumes showed no problems.

We migrated the NFS server off of its hypervisor, suspecting a problem with RBD 
kernel modules, rebooted the hypervisor but the problem persisted (both on the 
new hypervisor, and on the old one when we migrated it back)

We changed the /etc/default/nfs-kernel-server to start up 256 servers (even 
though the defaults had been working fine for over a year)
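
(That is the thread count setting on Ubuntu, i.e. roughly:)

  # /etc/default/nfs-kernel-server
  RPCNFSDCOUNT=256
  # then: service nfs-kernel-server restart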

Only one of our 3 NFS servers crashes (see below for syslog information) - the 
other 2 have been fine

May 23 21:44:10 drive-nfs1 kernel: [  165.264648] NFSD: Using 
/var/lib/nfs/v4recovery as the NFSv4 state recovery directory
May 23 21:44:19 drive-nfs1 kernel: [  173.880092] NFSD: starting 90-second 
grace period (net 81cdab00)
May 23 21:44:23 drive-nfs1 rpc.mountd[1724]: Version 1.2.8 starting
May 23 21:44:28 drive-nfs1 kernel: [  182.917775] ip_tables: (C) 2000-2006 
Netfilter Core Team
May 23 21:44:28 drive-nfs1 kernel: [  182.958465] nf_conntrack version 0.5.0 
(16384 buckets, 65536 max)
May 23 21:44:28 drive-nfs1 kernel: [  183.044091] ip6_tables: (C) 2000-2006 
Netfilter Core Team
May 23 21:45:10 drive-nfs1 CRON[1867]: (root) CMD (command -v debian-sa1 > 
/dev/null && debian-sa1 1 1)
May 23 21:45:17 drive-nfs1 collectd[1872]: python: Plugin loaded but not 
configured.
May 23 21:45:17 drive-nfs1 collectd[1872]: Initialization complete, entering 
read-loop.
May 23 21:47:11 drive-nfs1 kernel: [  346.392283] init: plymouth-upstart-bridge 
main process ended, respawning
May 23 21:51:26 drive-nfs1 kernel: [  600.776177] INFO: task nfsd:1696 blocked 
for more than 120 seconds.
May 23 21:51:26 drive-nfs1 kernel: [  600.778090]   Not tainted 
3.13.0-53-generic #89-Ubuntu
May 23 21:51:26 drive-nfs1 kernel: [  600.779507] echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs disables this message.
May 23 21:51:26 drive-nfs1 kernel: [  600.781504] nfsdD 
88013fd93180 0  1696  2 0x
May 23 21:51:26 drive-nfs1 kernel: [  600.781508]  8800b2391c50 
0046 8800b22f9800 8800b2391fd8
May 23 21:51:26 drive-nfs1 kernel: [  600.781511]  00013180 
00013180 8800b22f9800 880035f48240
May 23 21:51:26 drive-nfs1 kernel: [  600.781513]  880035f48244 
8800b22f9800  880035f48248
May 23 21:51:26 drive-nfs1 kernel: [  600.781515] Call Trace:
May 23 21:51:26 drive-nfs1 kernel: [  600.781523]  [81727749] 
schedule_preempt_disabled+0x29/0x70
May 23 21:51:26 drive-nfs1 kernel: [  600.781526]  [817295b5] 
__mutex_lock_slowpath+0x135/0x1b0
May 23 21:51:26 drive-nfs1 kernel: [  600.781528]  [8172964f] 
mutex_lock+0x1f/0x2f
May 23 21:51:26 drive-nfs1 kernel: [  600.781557]  [a03b1761] 
nfsd_lookup_dentry+0xa1/0x490 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781568]  [a03b044b] ? 
fh_verify+0x14b/0x5e0 [nfsd]
May 23 21:51:26 drive-nfs1 kernel: [  600.781591]  [a03b1bb9] 

[ceph-users] NFS interaction with RBD

2015-05-23 Thread John-Paul Robinson (Campus)
We've had a an NFS gateway serving up RBD images successfully for over a year. 
Ubuntu 12.04 and ceph .73 iirc. 

In the past couple of weeks we have developed a problem where the nfs clients 
hang while accessing exported rbd containers. 

We see errors on the server about nfsd hanging for 120sec etc. 

The nfs server is still able to successfully interact with the images it is 
serving. We can export non rbd shares from the local file system and nfs 
clients can use them just fine. 

There seems to be something weird going on with rbd and nfs kernel modules. 

Our ceph pool is in a warn state due to an osd rebalance that is continuing 
slowly. But the fact that we continue to have good rbd image access directly on 
the server makes me think this is not related.  Also the nfs server is only a 
client of the pool, it doesn't participate in it. 

Has anyone experienced similar issues?  

We do have a lot of images attached to the server but the issue is there even 
when we map only a few. 

Thanks for any pointers. 

~jpr