[ceph-users] Source Package radosgw file has authentication issues

2016-10-19 Thread 于 姜
ceph_10.2.3.orig.tar.gz Source package

Compile completed:

/root/neunn_gitlab/ceph-Jewel10.2.3/src/radosgw

The following issues occur when script execution:

2016-10-20 11:36:30.102266 7f8b4b93f900 -1 auth: unable to find a keyring on 
/var/lib/ceph/radosgw/-admin/keyring: (2) No such file or directory
2016-10-20 11:36:30.102305 7f8b4b93f900 -1 monclient(hunting): ERROR: missing 
keyring, cannot use cephx for authentication
2016-10-20 11:36:30.121775 7f8b4b93f900 -1 Couldn't init storage provider 
(RADOS)

The same compiler environment, everything is normal use 10.2.0 Source Package



Sender : YuJiang
Country : China
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-rbd and ceph striping

2016-10-19 Thread Ahmed Mostafa
Does this also mean that strip count can be thought of as the number of
parrallel writes to different objects at different OSDs ?

Thank you

On Thursday, 20 October 2016, Jason Dillaman  wrote:

> librbd (used by QEMU to provide RBD-backed disks) uses librados and
> provides the necessary handling for striping across multiple backing
> objects. When you don't specify "fancy" striping options via
> "--stripe-count" and "--stripe-unit", it essentially defaults to
> stripe count of 1 and stripe unit of the object size (defaults to
> 4MB).
>
> The use-case for fancy striping settings for an RBD image are images
> that have lots of small, sequential IO. The rationale for that is that
> normally these small, sequential IOs will continue to hit the same PG
> until the object boundary is crossed. However, if you were to use a
> small stripe unit that matched your normal IO size (or a small
> multiple thereof), your small, sequential IO requests would be sent to
>  PGs -- spreading the load.
>
> On Wed, Oct 19, 2016 at 12:32 PM, Ahmed Mostafa
> > wrote:
> > Hello
> >
> > From the documentation i understand that clients that uses librados must
> > perform striping for themselves, but i do not understand how could this
> be
> > if we have striping options in ceph ? i mean i can create rbd images that
> > has configuration for striping, count and unite size.
> >
> > So my question is, if i created an RBD image that have striping enabled
> and
> > configured, will that make a difference with qemu-rbd ? by difference i
> mean
> > enhancing performance of my virtual machines i/o and allowing utilizing
> the
> > cluster resources
> >
> > Thank you
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Jason
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Surviving a ceph cluster outage: the hard way

2016-10-19 Thread Goncalo Borges
Hi Kostis...
That is a tale from the dark side. Glad you recover it and that you were 
willing to doc it all up, and share it. Thank you for that,
Can I also ask which tool did you use to recover the leveldb?
Cheers
Goncalo

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Kostis 
Fardelas [dante1...@gmail.com]
Sent: 20 October 2016 09:09
To: ceph-users
Subject: [ceph-users] Surviving a ceph cluster outage: the hard way

Hello cephers,
this is the blog post on our Ceph cluster's outage we experienced some
weeks ago and about how we managed to revive the cluster and our
clients's data.

I hope it will prove useful for anyone who will find himself/herself
in a similar position. Thanks for everyone on the ceph-users and
ceph-devel lists who contributed to our inquiries during
troubleshooting.

https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/

Regards,
Kostis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] When the kernel support JEWEL tunables?

2016-10-19 Thread Alexandre DERUMIER
works fine with kernel 4.6 for me.

from doc:
http://docs.ceph.com/docs/master/rados/operations/crush-map/#crush-tunables

it should works with kernel 4.5 too.



I don't known if they are any plan to backport last krbd module version to 
kernel 4.4 ?


- Mail original - 
De: "한승진" 
À: "ceph-users" 
Envoyé: Jeudi 20 Octobre 2016 05:31:49
Objet: [ceph-users]  When the kernel support JEWEL tunables?

Hi all, 
When I try to mount rbd through KRBD, it failed because of mismatch features. 

The Client's OS is Ubuntu 16.04 and kernel is 4.4.0-38 

My original CRUSH tunables is below. 

root@Fx2x1ctrlserv01:~# ceph osd crush show-tunables 
{ 
"choose_local_tries": 0, 
"choose_local_fallback_tries": 0, 
"choose_total_tries": 50, 
"chooseleaf_descend_once": 1, 
"chooseleaf_vary_r": 1, 
"chooseleaf_stable": 1, 
"straw_calc_version": 1, 
"allowed_bucket_algs": 54, 
"profile": "jewel", 
"optimal_tunables": 1, 
"legacy_tunables": 0, 
"minimum_required_version": "jewel", 
"require_feature_tunables": 1, 
"require_feature_tunables2": 1, 
"has_v2_rules": 0, 
"require_feature_tunables3": 1, 
"has_v3_rules": 0, 
"has_v4_buckets": 1, 
"require_feature_tunables5": 1, 
"has_v5_rules": 0 
} 

I disabled tunables5 feature in CRUSH MAP. 

# begin crush map 
tunable choose_local_tries 0 
tunable choose_local_fallback_tries 0 
tunable choose_total_tries 50 
tunable chooseleaf_descend_once 1 
tunable chooseleaf_vary_r 1 
tunable chooseleaf_stable 0 
tunable straw_calc_version 1 
tunable allowed_bucket_algs 54 

Finally, the crush tunables became like hammer 

root@Fx2x1ctrlserv01:~/crushmap# ceph osd crush show-tunables 
{ 
"choose_local_tries": 0, 
"choose_local_fallback_tries": 0, 
"choose_total_tries": 50, 
"chooseleaf_descend_once": 1, 
"chooseleaf_vary_r": 1, 
"chooseleaf_stable": 0, 
"straw_calc_version": 1, 
"allowed_bucket_algs": 54, 
"profile": "hammer", 
"optimal_tunables": 0, 
"legacy_tunables": 0, 
"minimum_required_version": " hammer ", 
"require_feature_tunables": 1, 
"require_feature_tunables2": 1, 
"has_v2_rules": 0, 
"require_feature_tunables3": 1, 
"has_v3_rules": 0, 
"has_v4_buckets": 1, 
"require_feature_tunables5": 0, 
"has_v5_rules": 0 
} 

After that, I could mount a rbd using KRBD 

When can I expect the KRBD support feature_tunables5 ? 

Thanks for your help. 

John Haan 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] When the kernel support JEWEL tunables?

2016-10-19 Thread 한승진
Hi all,

When I try to mount rbd through KRBD, it failed because of mismatch
features.

The Client's OS is Ubuntu 16.04 and kernel is 4.4.0-38

My original CRUSH tunables is below.

root@Fx2x1ctrlserv01:~# ceph osd crush show-tunables
{
"choose_local_tries": 0,
"choose_local_fallback_tries": 0,
"choose_total_tries": 50,
"chooseleaf_descend_once": 1,
"chooseleaf_vary_r": 1,
"chooseleaf_stable": 1,
"straw_calc_version": 1,
"allowed_bucket_algs": 54,
"profile": "jewel",
"optimal_tunables": 1,
"legacy_tunables": 0,
"minimum_required_version": "jewel",
"require_feature_tunables": 1,
"require_feature_tunables2": 1,
"has_v2_rules": 0,
"require_feature_tunables3": 1,
"has_v3_rules": 0,
"has_v4_buckets": 1,
*"require_feature_tunables5": 1,*
"has_v5_rules": 0
}

I disabled tunables5 feature in CRUSH MAP.

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
*tunable chooseleaf_stable 0*
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

Finally, the crush tunables became like hammer

root@Fx2x1ctrlserv01:~/crushmap# ceph osd crush show-tunables
{
"choose_local_tries": 0,
"choose_local_fallback_tries": 0,
"choose_total_tries": 50,
"chooseleaf_descend_once": 1,
"chooseleaf_vary_r": 1,
"chooseleaf_stable": 0,
"straw_calc_version": 1,
"allowed_bucket_algs": 54,
"profile": "hammer",
"optimal_tunables": 0,
"legacy_tunables": 0,
"minimum_required_version": "*hammer*",
"require_feature_tunables": 1,
"require_feature_tunables2": 1,
"has_v2_rules": 0,
"require_feature_tunables3": 1,
"has_v3_rules": 0,
"has_v4_buckets": 1,
*"require_feature_tunables5": 0,*
"has_v5_rules": 0
}

After that, I could mount a rbd using KRBD

When can I expect the KRBD support feature_tunables5 ?

Thanks for your help.

John Haan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-19 Thread Christian Balzer

Hello,

On Wed, 19 Oct 2016 12:28:28 + Jim Kilborn wrote:

> I have setup a new linux cluster to allow migration from our old SAN based 
> cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
As others mentioned, not a good choice, but also not the (main) cause of
your problems.

> I am basically running stock ceph settings, with just turning the write cache 
> off via hdparm on the drives, and temporarily turning of scrubbing.
> 
The former is bound to kill performance, if you care that much for your
data but can't guarantee constant power (UPS, dual PSUs, etc), consider
using a BBU caching controller.

The later I venture you did because performance was abysmal with scrubbing
enabled.
Which is always a good indicator that your cluster needs tuning, improving.

> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
> Server performance should be good.  
Memory is fine, CPU I can't tell from the model number and I'm not
inclined to look up or guess, but that usually only becomes a bottleneck
when dealing with all SSD setup and things requiring the lowest latency
possible.


> Since I am running cephfs, I have tiering setup.
That should read "on top of EC pools", and as John said, not a good idea
at all, both EC pools and cache-tiering.

> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
> So the idea is to ensure a single host failure.
> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
> replicated set with size=2

This isn't a Seagate, you mean Samsung. And that's a consumer model,
ill suited for this task, even with the DC level SSDs below as journals.

And as such a replication of 2 is also ill advised, I've seen these SSDs
die w/o ANY warning whatsoever and long before their (abysmal) endurance
was exhausted.

> The cache tier also has a 128GB SM863 SSD that is being used as a journal for 
> the cache SSD. It has power loss protection

Those are fine. If you re-do you cluster, don't put more than 4-5 journals
on them.

> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
> the erasure code uses only the 16 spinning 4TB drives.
> 
> The problems that I am seeing is that I start copying data from our old san 
> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
> 1.4 TB, I start seeing:
> 
> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> 26 ops are blocked > 65.536 sec on osd.0
> 37 ops are blocked > 32.768 sec on osd.0
> 1 osds have slow requests
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> 
> osd.0 is the cache ssd
> 
> If I watch iostat on the cache ssd, I see the queue lengths are high and the 
> await are high
> Below is the iostat on the cache drive (osd.0) on the first host. The 
> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
> 
> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz 
>   await r_await w_await  svctm  %util
> sdb
>   0.00 0.339.00   84.33 0.9620.11   462.40
> 75.92  397.56  125.67  426.58  10.70  99.90
>   0.00 0.67   30.00   87.33 5.9621.03   471.20
> 67.86  910.95   87.00 1193.99   8.27  97.07
>   0.0016.67   33.00  289.33 4.2118.80   146.20
> 29.83   88.99   93.91   88.43   3.10  99.83
>   0.00 7.337.67  261.67 1.9219.63   163.81   
> 117.42  331.97  182.04  336.36   3.71 100.00
> 
> 
> If I look at the iostat for all the drives, only the cache ssd drive is 
> backed up
> 
Yes, consumer SSDs on top of a design that channels everything through
them.

Rebuild your cluster along more conventional and conservative lines, don't
use the 850 PROs. 
Feel free to run any new design by us.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing image of rbd mirroring

2016-10-19 Thread Jason Dillaman
On Wed, Oct 19, 2016 at 6:52 PM, yan cui  wrote:
> 2016-10-19 15:46:44.843053 7f35c9925d80 -1 librbd: cannot obtain exclusive
> lock - not removing

Are you attempting to delete the primary or non-primary image? I would
expect any attempts to delete the non-primary image to fail since the
non-primary image will automatically be deleted when mirroring is
disabled on the primary side (or the primary image is deleted).

There was an issue where the rbd-mirror daemon would not release the
exclusive lock on the image after a forced promotion. The fix for that
will be included in the forthcoming 10.2.4 release.

If it is neither of these scenarios, can you re-run the "rbd rm"
command with "--debug-rbd=20" option appended?

Thanks,

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-rbd and ceph striping

2016-10-19 Thread Jason Dillaman
librbd (used by QEMU to provide RBD-backed disks) uses librados and
provides the necessary handling for striping across multiple backing
objects. When you don't specify "fancy" striping options via
"--stripe-count" and "--stripe-unit", it essentially defaults to
stripe count of 1 and stripe unit of the object size (defaults to
4MB).

The use-case for fancy striping settings for an RBD image are images
that have lots of small, sequential IO. The rationale for that is that
normally these small, sequential IOs will continue to hit the same PG
until the object boundary is crossed. However, if you were to use a
small stripe unit that matched your normal IO size (or a small
multiple thereof), your small, sequential IO requests would be sent to
 PGs -- spreading the load.

On Wed, Oct 19, 2016 at 12:32 PM, Ahmed Mostafa
 wrote:
> Hello
>
> From the documentation i understand that clients that uses librados must
> perform striping for themselves, but i do not understand how could this be
> if we have striping options in ceph ? i mean i can create rbd images that
> has configuration for striping, count and unite size.
>
> So my question is, if i created an RBD image that have striping enabled and
> configured, will that make a difference with qemu-rbd ? by difference i mean
> enhancing performance of my virtual machines i/o and allowing utilizing the
> cluster resources
>
> Thank you
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] removing image of rbd mirroring

2016-10-19 Thread yan cui
Hi all,

   We setup rbd mirroring between 2 clusters, but have issues when we want
to delete one image. Following is the  detailed info.
It reports that some other instance is still using it, which kind of makes
sense because we set up the mirror between 2 clusters.
What's the best practice to remove a image which is under rbd mirroring?
By the way, we did a pool based rbd mirroring.

rbd --cluster iad rm  -p mirror

2016-10-19 15:46:44.843053 7f35c9925d80 -1 librbd: cannot obtain exclusive
lock - not removing

Removing image: 0% complete...failed.

rbd: error: image still has watchers

This means the image is still open or the client using it crashed. Try
again after closing/unmapping it or waiting 30s for the crashed client to
timeout.
-- 
Think big; Dream impossible; Make it happen.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Surviving a ceph cluster outage: the hard way

2016-10-19 Thread Kostis Fardelas
Hello cephers,
this is the blog post on our Ceph cluster's outage we experienced some
weeks ago and about how we managed to revive the cluster and our
clients's data.

I hope it will prove useful for anyone who will find himself/herself
in a similar position. Thanks for everyone on the ceph-users and
ceph-devel lists who contributed to our inquiries during
troubleshooting.

https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/

Regards,
Kostis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-19 Thread Jim Kilborn
John,



Updating to the latest mainline kernel from elrepo (4.8.2-1) on all 4 ceph 
servers, and the ceph client that I am testing with, still didn’t fix the 
issues.

Still getting “Failing to respond to Cache Pressure”. And ops block currently 
hovering between 100-300 requests  > 32 sec



This is just from doing an rsync from one ceph client (reading data from an old 
san over nfs, and writing to ceph cluster on infiniband



I guess I’ll try getting rid of the EC pool and the cache tier, and just using 
replication with size 3 and see if it works better



Sent from Mail for Windows 10



From: John Spray
Sent: Wednesday, October 19, 2016 12:16 PM
To: Jim Kilborn
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache 
pressure, capability release, poor iostat await avg queue size



On Wed, Oct 19, 2016 at 5:17 PM, Jim Kilborn  wrote:
> John,
>
>
>
> Thanks for the tips….
>
> Unfortunately, I was looking at this page 
> http://docs.ceph.com/docs/jewel/start/os-recommendations/

OK, thanks - I've pushed an update to clarify that
(https://github.com/ceph/ceph/pull/11564).

> I’ll consider either upgrading the kernels or using the fuse client, but will 
> likely go the kernel 4.4 route
>
>
>
> As for moving to just a replicated pool, I take it that a replication size of 
> 3 is minimum recommended.
>
> If I move to no EC, I will have to have have 9 4TB spinners on of the 4 
> servers. Can I put the 9 journals on the one 128GB ssd with 10GB per journal, 
> or is that two many osds per journal, creating a hot spot for writes?

That sounds like a lot of journals on one SSD, but people other than
me have more empirical experience in hardware selection.

John

>
>
>
> Thanks!!
>
>
>
>
>
>
>
> Sent from Mail for Windows 10
>
>
>
> From: John Spray
> Sent: Wednesday, October 19, 2016 9:10 AM
> To: Jim Kilborn
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - 
> cache pressure, capability release, poor iostat await avg queue size
>
>
>
> On Wed, Oct 19, 2016 at 1:28 PM, Jim Kilborn  wrote:
>> I have setup a new linux cluster to allow migration from our old SAN based 
>> cluster to a new cluster with ceph.
>> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
>> I am basically running stock ceph settings, with just turning the write 
>> cache off via hdparm on the drives, and temporarily turning of scrubbing.
>>
>> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
>> Server performance should be good.  Since I am running cephfs, I have 
>> tiering setup.
>> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
>> So the idea is to ensure a single host failure.
>> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
>> replicated set with size=2
>> The cache tier also has a 128GB SM863 SSD that is being used as a journal 
>> for the cache SSD. It has power loss protection
>> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
>> the erasure code uses only the 16 spinning 4TB drives.
>>
>> The problems that I am seeing is that I start copying data from our old san 
>> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
>> 1.4 TB, I start seeing:
>>
>> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>> 26 ops are blocked > 65.536 sec on osd.0
>> 37 ops are blocked > 32.768 sec on osd.0
>> 1 osds have slow requests
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>>
>> osd.0 is the cache ssd
>>
>> If I watch iostat on the cache ssd, I see the queue lengths are high and the 
>> await are high
>> Below is the iostat on the cache drive (osd.0) on the first host. The 
>> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>>
>> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
>> avgqu-sz   await r_await w_await  svctm  %util
>> sdb
>>   0.00 0.339.00   84.33 0.9620.11   462.40   
>>  75.92  397.56  125.67  426.58  10.70  99.90
>>   0.00 0.67   30.00   87.33 5.9621.03   471.20   
>>  67.86  910.95   87.00 1193.99   8.27  97.07
>>   0.0016.67   33.00  289.33 4.2118.80   146.20   
>>  29.83   88.99   93.91   88.43   3.10  99.83
>>   0.00 7.337.67  261.67 1.9219.63   163.81   
>> 117.42  331.97  182.04  336.36   3.71 100.00
>>
>>
>> If I look at the iostat for all the drives, only the cache ssd drive is 
>> backed up
>>
>> Device:   rrqm/s   

Re: [ceph-users] New cephfs cluster performance issues- Jewel - cachepressure, capability release, poor iostat await avg queue size

2016-10-19 Thread mykola.dvornik
Not sure if related, but I see the same issue on the very different 
hardware/configuration. In particular on large data transfers OSDs become slow 
and blocking. Iostat await on spinners can go up to 6(!) s ( journal is on the 
ssd). Looking closer on those spinners with blktrace suggest that most of those 
6 s the io requests spend in the que, before get committed to the driver and 
eventually written to disk. Tried different io schedulers, played with their 
parameters, but nothing helps. Unfortunately, blktrace is a very nasty thing 
that fails to start at some point until machine is rebooted. So I am still 
waiting for the appropriate time slot to reboot the OSD nodes and record io 
with blktrace again. 

-Mykola

From: John Spray
Sent: Wednesday, 19 October 2016 19:17
To: Jim Kilborn
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - 
cachepressure, capability release, poor iostat await avg queue size

On Wed, Oct 19, 2016 at 5:17 PM, Jim Kilborn  wrote:
> John,
>
>
>
> Thanks for the tips….
>
> Unfortunately, I was looking at this page 
> http://docs.ceph.com/docs/jewel/start/os-recommendations/

OK, thanks - I've pushed an update to clarify that
(https://github.com/ceph/ceph/pull/11564).

> I’ll consider either upgrading the kernels or using the fuse client, but will 
> likely go the kernel 4.4 route
>
>
>
> As for moving to just a replicated pool, I take it that a replication size of 
> 3 is minimum recommended.
>
> If I move to no EC, I will have to have have 9 4TB spinners on of the 4 
> servers. Can I put the 9 journals on the one 128GB ssd with 10GB per journal, 
> or is that two many osds per journal, creating a hot spot for writes?

That sounds like a lot of journals on one SSD, but people other than
me have more empirical experience in hardware selection.

John

>
>
>
> Thanks!!
>
>
>
>
>
>
>
> Sent from Mail for Windows 10
>
>
>
> From: John Spray
> Sent: Wednesday, October 19, 2016 9:10 AM
> To: Jim Kilborn
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - 
> cache pressure, capability release, poor iostat await avg queue size
>
>
>
> On Wed, Oct 19, 2016 at 1:28 PM, Jim Kilborn  wrote:
>> I have setup a new linux cluster to allow migration from our old SAN based 
>> cluster to a new cluster with ceph.
>> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
>> I am basically running stock ceph settings, with just turning the write 
>> cache off via hdparm on the drives, and temporarily turning of scrubbing.
>>
>> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
>> Server performance should be good.  Since I am running cephfs, I have 
>> tiering setup.
>> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
>> So the idea is to ensure a single host failure.
>> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
>> replicated set with size=2
>> The cache tier also has a 128GB SM863 SSD that is being used as a journal 
>> for the cache SSD. It has power loss protection
>> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
>> the erasure code uses only the 16 spinning 4TB drives.
>>
>> The problems that I am seeing is that I start copying data from our old san 
>> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
>> 1.4 TB, I start seeing:
>>
>> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>> 26 ops are blocked > 65.536 sec on osd.0
>> 37 ops are blocked > 32.768 sec on osd.0
>> 1 osds have slow requests
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>>
>> osd.0 is the cache ssd
>>
>> If I watch iostat on the cache ssd, I see the queue lengths are high and the 
>> await are high
>> Below is the iostat on the cache drive (osd.0) on the first host. The 
>> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>>
>> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
>> avgqu-sz   await r_await w_await  svctm  %util
>> sdb
>>   0.00 0.339.00   84.33 0.9620.11   462.40   
>>  75.92  397.56  125.67  426.58  10.70  99.90
>>   0.00 0.67   30.00   87.33 5.9621.03   471.20   
>>  67.86  910.95   87.00 1193.99   8.27  97.07
>>   0.0016.67   33.00  289.33 4.2118.80   146.20   
>>  29.83   88.99   93.91   88.43   3.10  99.83
>>   0.00 7.337.67  261.67 1.9219.63   163.81   
>> 117.42  331.97  182.04  336.36   3.71 100.00
>>
>>
>> If I look at the iostat for all the drives, only the cache ssd drive is 
>> backed up
>>
>> Device:   rrqm/s   wrqm/s r/s  

Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-19 Thread John Spray
On Wed, Oct 19, 2016 at 5:17 PM, Jim Kilborn  wrote:
> John,
>
>
>
> Thanks for the tips….
>
> Unfortunately, I was looking at this page 
> http://docs.ceph.com/docs/jewel/start/os-recommendations/

OK, thanks - I've pushed an update to clarify that
(https://github.com/ceph/ceph/pull/11564).

> I’ll consider either upgrading the kernels or using the fuse client, but will 
> likely go the kernel 4.4 route
>
>
>
> As for moving to just a replicated pool, I take it that a replication size of 
> 3 is minimum recommended.
>
> If I move to no EC, I will have to have have 9 4TB spinners on of the 4 
> servers. Can I put the 9 journals on the one 128GB ssd with 10GB per journal, 
> or is that two many osds per journal, creating a hot spot for writes?

That sounds like a lot of journals on one SSD, but people other than
me have more empirical experience in hardware selection.

John

>
>
>
> Thanks!!
>
>
>
>
>
>
>
> Sent from Mail for Windows 10
>
>
>
> From: John Spray
> Sent: Wednesday, October 19, 2016 9:10 AM
> To: Jim Kilborn
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - 
> cache pressure, capability release, poor iostat await avg queue size
>
>
>
> On Wed, Oct 19, 2016 at 1:28 PM, Jim Kilborn  wrote:
>> I have setup a new linux cluster to allow migration from our old SAN based 
>> cluster to a new cluster with ceph.
>> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
>> I am basically running stock ceph settings, with just turning the write 
>> cache off via hdparm on the drives, and temporarily turning of scrubbing.
>>
>> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
>> Server performance should be good.  Since I am running cephfs, I have 
>> tiering setup.
>> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
>> So the idea is to ensure a single host failure.
>> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
>> replicated set with size=2
>> The cache tier also has a 128GB SM863 SSD that is being used as a journal 
>> for the cache SSD. It has power loss protection
>> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
>> the erasure code uses only the 16 spinning 4TB drives.
>>
>> The problems that I am seeing is that I start copying data from our old san 
>> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
>> 1.4 TB, I start seeing:
>>
>> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>> 26 ops are blocked > 65.536 sec on osd.0
>> 37 ops are blocked > 32.768 sec on osd.0
>> 1 osds have slow requests
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>>
>> osd.0 is the cache ssd
>>
>> If I watch iostat on the cache ssd, I see the queue lengths are high and the 
>> await are high
>> Below is the iostat on the cache drive (osd.0) on the first host. The 
>> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>>
>> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
>> avgqu-sz   await r_await w_await  svctm  %util
>> sdb
>>   0.00 0.339.00   84.33 0.9620.11   462.40   
>>  75.92  397.56  125.67  426.58  10.70  99.90
>>   0.00 0.67   30.00   87.33 5.9621.03   471.20   
>>  67.86  910.95   87.00 1193.99   8.27  97.07
>>   0.0016.67   33.00  289.33 4.2118.80   146.20   
>>  29.83   88.99   93.91   88.43   3.10  99.83
>>   0.00 7.337.67  261.67 1.9219.63   163.81   
>> 117.42  331.97  182.04  336.36   3.71 100.00
>>
>>
>> If I look at the iostat for all the drives, only the cache ssd drive is 
>> backed up
>>
>> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
>> avgqu-sz   await r_await w_await  svctm  %util
>> Sdg (journal for cache drive)
>>   0.00 6.330.008.00 0.00 0.0719.04   
>>   0.000.330.000.33   0.33   0.27
>> Sdb (cache drive)
>>   0.00 0.333.33   82.00 0.8320.07   501.68   
>> 106.75 1057.81  269.40 1089.86  11.72 100.00
>> Sda (4TB EC)
>>   0.00 0.000.004.00 0.00 0.02 9.33   
>>   0.000.000.000.00   0.00   0.00
>> Sdd (4TB EC)
>>   0.00 0.000.002.33 0.00 0.45   392.00   
>>   0.08   34.000.00   34.00   6.86   1.60
>> Sdf (4TB EC)
>>   0.0014.000.00   26.00 0.00 0.2217.71   
>>   1.00   38.550.00   38.55   0.68   1.77
>> Sdc (4TB EC)
>>   0.00 0.000.001.33 0.00 0.01 8.75   
>>   0.02   12.250.00   12.25  12.25   

[ceph-users] qemu-rbd and ceph striping

2016-10-19 Thread Ahmed Mostafa
Hello

>From the documentation i understand that clients that uses librados must
perform striping for themselves, but i do not understand how could this be
if we have striping options in ceph ? i mean i can create rbd images that
has configuration for striping, count and unite size.

So my question is, if i created an RBD image that have striping enabled and
configured, will that make a difference with qemu-rbd ? by difference i
mean enhancing performance of my virtual machines i/o and allowing
utilizing the cluster resources

Thank you
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-19 Thread Jim Kilborn
John,



Thanks for the tips….

Unfortunately, I was looking at this page 
http://docs.ceph.com/docs/jewel/start/os-recommendations/

I’ll consider either upgrading the kernels or using the fuse client, but will 
likely go the kernel 4.4 route



As for moving to just a replicated pool, I take it that a replication size of 3 
is minimum recommended.

If I move to no EC, I will have to have have 9 4TB spinners on of the 4 
servers. Can I put the 9 journals on the one 128GB ssd with 10GB per journal, 
or is that two many osds per journal, creating a hot spot for writes?



Thanks!!







Sent from Mail for Windows 10



From: John Spray
Sent: Wednesday, October 19, 2016 9:10 AM
To: Jim Kilborn
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache 
pressure, capability release, poor iostat await avg queue size



On Wed, Oct 19, 2016 at 1:28 PM, Jim Kilborn  wrote:
> I have setup a new linux cluster to allow migration from our old SAN based 
> cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
> I am basically running stock ceph settings, with just turning the write cache 
> off via hdparm on the drives, and temporarily turning of scrubbing.
>
> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
> Server performance should be good.  Since I am running cephfs, I have tiering 
> setup.
> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
> So the idea is to ensure a single host failure.
> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
> replicated set with size=2
> The cache tier also has a 128GB SM863 SSD that is being used as a journal for 
> the cache SSD. It has power loss protection
> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
> the erasure code uses only the 16 spinning 4TB drives.
>
> The problems that I am seeing is that I start copying data from our old san 
> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
> 1.4 TB, I start seeing:
>
> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> 26 ops are blocked > 65.536 sec on osd.0
> 37 ops are blocked > 32.768 sec on osd.0
> 1 osds have slow requests
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>
> osd.0 is the cache ssd
>
> If I watch iostat on the cache ssd, I see the queue lengths are high and the 
> await are high
> Below is the iostat on the cache drive (osd.0) on the first host. The 
> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>
> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz 
>   await r_await w_await  svctm  %util
> sdb
>   0.00 0.339.00   84.33 0.9620.11   462.40
> 75.92  397.56  125.67  426.58  10.70  99.90
>   0.00 0.67   30.00   87.33 5.9621.03   471.20
> 67.86  910.95   87.00 1193.99   8.27  97.07
>   0.0016.67   33.00  289.33 4.2118.80   146.20
> 29.83   88.99   93.91   88.43   3.10  99.83
>   0.00 7.337.67  261.67 1.9219.63   163.81   
> 117.42  331.97  182.04  336.36   3.71 100.00
>
>
> If I look at the iostat for all the drives, only the cache ssd drive is 
> backed up
>
> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz 
>   await r_await w_await  svctm  %util
> Sdg (journal for cache drive)
>   0.00 6.330.008.00 0.00 0.0719.04
>  0.000.330.000.33   0.33   0.27
> Sdb (cache drive)
>   0.00 0.333.33   82.00 0.8320.07   501.68   
> 106.75 1057.81  269.40 1089.86  11.72 100.00
> Sda (4TB EC)
>   0.00 0.000.004.00 0.00 0.02 9.33
>  0.000.000.000.00   0.00   0.00
> Sdd (4TB EC)
>   0.00 0.000.002.33 0.00 0.45   392.00
>  0.08   34.000.00   34.00   6.86   1.60
> Sdf (4TB EC)
>   0.0014.000.00   26.00 0.00 0.2217.71
>  1.00   38.550.00   38.55   0.68   1.77
> Sdc (4TB EC)
>   0.00 0.000.001.33 0.00 0.01 8.75
>  0.02   12.250.00   12.25  12.25   1.63
>
> While at this time is just complaining about slow osd.0, sometimes the other 
> cache tier ssds show some slow response, but not as frequently.
>
>
> I occasionally see complaints about a client not responding to cache 
> pressure, and yesterday while copying serveral terabytes, the client doing 
> the copy was noted for failing to respond to capability release, and I ended 
> up rebooting it.
>
> I just seems the cluster 

Re: [ceph-users] Feedback wanted: health warning when standby MDS dies?

2016-10-19 Thread Sean Redmond
Hi,

I would be interested in this case when a mds in standby-replay fails.

Thanks

On Wed, Oct 19, 2016 at 4:06 PM, Scottix  wrote:

> I would take the analogy of a Raid scenario. Basically a standby is
> considered like a spare drive. If that spare drive goes down. It is good to
> know about the event, but it does in no way indicate a degraded system,
> everything keeps running at top speed.
>
> If you had multi active MDS and one goes down then I would say that is a
> degraded system, but still waiting for that feature.
>
>
> On Tue, Oct 18, 2016 at 10:18 AM Goncalo Borges <
> goncalo.bor...@sydney.edu.au> wrote:
>
>> Hi John.
>>
>> That would be good.
>>
>> In our case we are just picking that up simply through nagios and some
>> fancy scripts parsing the dump of the MDS maps.
>>
>> Cheers
>> Goncalo
>> 
>> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of John
>> Spray [jsp...@redhat.com]
>> Sent: 18 October 2016 22:46
>> To: ceph-users
>> Subject: [ceph-users] Feedback wanted: health warning when standby MDS
>> dies?
>>
>> Hi all,
>>
>> Someone asked me today how to get a list of down MDS daemons, and I
>> explained that currently the MDS simply forgets about any standby that
>> stops sending beacons.  That got me thinking about the case where a
>> standby dies while the active MDS remains up -- the cluster has gone
>> into a non-highly-available state, but we are not giving the admin any
>> indication.
>>
>> I've suggested a solution here:
>> http://tracker.ceph.com/issues/17604
>>
>> This is probably going to be a bit of a subjective thing in terms of
>> whether people find it useful or find it to be annoying noise, so I'd
>> be interested in feedback from people currently running cephfs.
>>
>> Cheers,
>> John
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Feedback wanted: health warning when standby MDS dies?

2016-10-19 Thread Scottix
I would take the analogy of a Raid scenario. Basically a standby is
considered like a spare drive. If that spare drive goes down. It is good to
know about the event, but it does in no way indicate a degraded system,
everything keeps running at top speed.

If you had multi active MDS and one goes down then I would say that is a
degraded system, but still waiting for that feature.

On Tue, Oct 18, 2016 at 10:18 AM Goncalo Borges <
goncalo.bor...@sydney.edu.au> wrote:

> Hi John.
>
> That would be good.
>
> In our case we are just picking that up simply through nagios and some
> fancy scripts parsing the dump of the MDS maps.
>
> Cheers
> Goncalo
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of John
> Spray [jsp...@redhat.com]
> Sent: 18 October 2016 22:46
> To: ceph-users
> Subject: [ceph-users] Feedback wanted: health warning when standby MDS
> dies?
>
> Hi all,
>
> Someone asked me today how to get a list of down MDS daemons, and I
> explained that currently the MDS simply forgets about any standby that
> stops sending beacons.  That got me thinking about the case where a
> standby dies while the active MDS remains up -- the cluster has gone
> into a non-highly-available state, but we are not giving the admin any
> indication.
>
> I've suggested a solution here:
> http://tracker.ceph.com/issues/17604
>
> This is probably going to be a bit of a subjective thing in terms of
> whether people find it useful or find it to be annoying noise, so I'd
> be interested in feedback from people currently running cephfs.
>
> Cheers,
> John
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-19 Thread Tyler Bishop
This is a cool project, keep up the good work!


_ 

Tyler Bishop 
Founder 


O: 513-299-7108 x10 
M: 513-646-5809 
http://BeyondHosting.net 


This email is intended only for the recipient(s) above and/or otherwise 
authorized personnel. The information contained herein and attached is 
confidential and the property of Beyond Hosting. Any unauthorized copying, 
forwarding, printing, and/or disclosing any information related to this email 
is prohibited. If you received this message in error, please contact the sender 
and destroy all copies of this email and any attachment(s).

- Original Message -
From: "Maged Mokhtar" 
To: "ceph users" 
Sent: Sunday, October 16, 2016 12:57:14 PM
Subject: [ceph-users] new Open Source Ceph based iSCSI SAN project

Hello,

I am happy to announce PetaSAN, an open source scale-out SAN that uses Ceph 
storage and LIO iSCSI Target.
visit us at:
www.petasan.org

your feedback will be much appreciated.
maged mokhtar 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-19 Thread John Spray
On Wed, Oct 19, 2016 at 1:28 PM, Jim Kilborn  wrote:
> I have setup a new linux cluster to allow migration from our old SAN based 
> cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
> I am basically running stock ceph settings, with just turning the write cache 
> off via hdparm on the drives, and temporarily turning of scrubbing.
>
> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
> Server performance should be good.  Since I am running cephfs, I have tiering 
> setup.
> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
> So the idea is to ensure a single host failure.
> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
> replicated set with size=2
> The cache tier also has a 128GB SM863 SSD that is being used as a journal for 
> the cache SSD. It has power loss protection
> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
> the erasure code uses only the 16 spinning 4TB drives.
>
> The problems that I am seeing is that I start copying data from our old san 
> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
> 1.4 TB, I start seeing:
>
> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> 26 ops are blocked > 65.536 sec on osd.0
> 37 ops are blocked > 32.768 sec on osd.0
> 1 osds have slow requests
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>
> osd.0 is the cache ssd
>
> If I watch iostat on the cache ssd, I see the queue lengths are high and the 
> await are high
> Below is the iostat on the cache drive (osd.0) on the first host. The 
> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>
> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz 
>   await r_await w_await  svctm  %util
> sdb
>   0.00 0.339.00   84.33 0.9620.11   462.40
> 75.92  397.56  125.67  426.58  10.70  99.90
>   0.00 0.67   30.00   87.33 5.9621.03   471.20
> 67.86  910.95   87.00 1193.99   8.27  97.07
>   0.0016.67   33.00  289.33 4.2118.80   146.20
> 29.83   88.99   93.91   88.43   3.10  99.83
>   0.00 7.337.67  261.67 1.9219.63   163.81   
> 117.42  331.97  182.04  336.36   3.71 100.00
>
>
> If I look at the iostat for all the drives, only the cache ssd drive is 
> backed up
>
> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz 
>   await r_await w_await  svctm  %util
> Sdg (journal for cache drive)
>   0.00 6.330.008.00 0.00 0.0719.04
>  0.000.330.000.33   0.33   0.27
> Sdb (cache drive)
>   0.00 0.333.33   82.00 0.8320.07   501.68   
> 106.75 1057.81  269.40 1089.86  11.72 100.00
> Sda (4TB EC)
>   0.00 0.000.004.00 0.00 0.02 9.33
>  0.000.000.000.00   0.00   0.00
> Sdd (4TB EC)
>   0.00 0.000.002.33 0.00 0.45   392.00
>  0.08   34.000.00   34.00   6.86   1.60
> Sdf (4TB EC)
>   0.0014.000.00   26.00 0.00 0.2217.71
>  1.00   38.550.00   38.55   0.68   1.77
> Sdc (4TB EC)
>   0.00 0.000.001.33 0.00 0.01 8.75
>  0.02   12.250.00   12.25  12.25   1.63
>
> While at this time is just complaining about slow osd.0, sometimes the other 
> cache tier ssds show some slow response, but not as frequently.
>
>
> I occasionally see complaints about a client not responding to cache 
> pressure, and yesterday while copying serveral terabytes, the client doing 
> the copy was noted for failing to respond to capability release, and I ended 
> up rebooting it.
>
> I just seems the cluster isn’t handling large amounts of data copies, like 
> and nfs or san based volume would, and I am worried about moving our users to 
> a cluster that already is showing signs of performance issues, even when I am 
> just doing a copy with no other users. I am doing only one rsync at a time.
>
> Is the problem that I need to user a later kernel for the clients mounting 
> the volume ? I have read some posts about that, but the docs say centos 7 
> with 3.10 is ok.

Which docs say to use the stock centos kernel?  >4.4 is recommended
here: http://docs.ceph.com/docs/master/cephfs/best-practices/

> Do I need more drives in my cache pool? I only have 4 ssd drive in the cache 
> pool (one on each host), with each having a separate journal drive.
> But is that too much of a hot spot since all i/o has to go to the cache layer?
> It seems like my ssds should be able to keep up with a single rsync copy.
> Is there something set wrong on my ssds that they cant keep up?

Cache tiering is pretty sensitive to 

Re: [ceph-users] HELP ! Cluster unusable with lots of "hitsuicidetimeout"

2016-10-19 Thread Yoann Moulin
Hello,

>>> We have a cluster in Jewel 10.2.2 under ubuntu 16.04. The cluster is 
>>> compose by 12 nodes, each nodes have 10 OSD with journal on disk.
>>>
>>> We have one rbd partition and a radosGW with 2 data pool, one replicated, 
>>> one EC (8+2)
>>>
>>> in attachment few details on our cluster.
>>>
>>> Currently, our cluster is not usable at all due to too much OSD 
>>> instability. OSDs daemon die randomly with "hit suicide timeout". 
>>> Yesterday, all
>>> of 120 OSDs died at least 12 time (max 74 time) with an average around 40 
>>> time
>>>
>>> here logs from ceph mon and from one OSD :
>>>
>>> http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)
>>> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)
>>>
>>> We have stopped all clients i/o to see if the cluster get stable without 
>>> success, to avoid  endless rebalancing with OSD flapping, we had to
>>> "set noout" the cluster. For now we have no idea what's going on.
>>>
>>> Anyone can help us to understand what's happening ?
>>>
>>> thanks for your help
>>>
>> no specific ideas, but this somewhat sounds familiar.
>>
>> One thing first, you already stopped client traffic but to make sure your
>> cluster really becomes quiescent, stop all scrubs as well.
>> That's always a good idea in any recovery, overload situation.

this is what we did.

>> Have you verified CPU load (are those OSD processes busy), memory status,
>> etc?
>> How busy are the actual disks?

The CPU and memory seem to not be overloaded, with journal on disk maybe a 
little bit busy.

>> Sudden deaths like this often are the results of network changes,  like a
>> switch rebooting and loosing jumbo frame configuration or whatnot.

We manage all equipments of the cluster, none of them have reboot. We decided 
to reboot node by node yesterday but the switch is healthy.

In the log I found that the problem has started after I start to copy data on 
the RadosGW EC pool (8+2).

At the same time, we had 6 process reading on the rbd partition, three of those 
process was writing on a replicated pool through the RadosGW s3
of the cluster itself and one was writing on a EC pool through the RadosGW s3 
too, 2 other was not writing on the cluster.
Maybe that pressure may slow down enough the disk to create the suicide timeout 
of the OSD ?

but now, we have no more I/O on the cluster and as soon as I re enable scrub 
and  rebalancing, OSDs start to fail again...

> just an additional comment:
> 
> you can disable backfilling and recovery temporarily by setting the 
> 'nobackfill' and 'norecover' flags. It will reduce the backfilling traffic
> and may help the cluster and its OSD to recover. Afterwards you should set 
> the backfill traffic settings to the minimum (e.g. max_backfills = 1)
> and unset the flags to allow the cluster to perform the outstanding recovery 
> operation.
>
> As the others already pointed out, these actions might help to get the 
> cluster up and running again, but you need to find the actual reason for
> the problems.

This is exactly what I want

Thanks for the help !

-- 
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HELP ! Cluster unusable with lots of "hit suicide timeout"

2016-10-19 Thread Dan van der Ster
On Wed, Oct 19, 2016 at 3:22 PM, Yoann Moulin  wrote:
> Hello,
>
>>> We have a cluster in Jewel 10.2.2 under ubuntu 16.04. The cluster is 
>>> compose by 12 nodes, each nodes have 10 OSD with journal on disk.
>>>
>>> We have one rbd partition and a radosGW with 2 data pool, one replicated, 
>>> one EC (8+2)
>>>
>>> in attachment few details on our cluster.
>>>
>>> Currently, our cluster is not usable at all due to too much OSD 
>>> instability. OSDs daemon die randomly with "hit suicide timeout". 
>>> Yesterday, all
>>> of 120 OSDs died at least 12 time (max 74 time) with an average around 40 
>>> time
>>>
>>> here logs from ceph mon and from one OSD :
>>>
>>> http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)
>>
>> Do you have an older log showing the start of the incident? The
>> cluster was already down when this log started.
>
> Here the log from Saturday, OSD 134 is the first which had error :
>
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.134.log.4.bz2
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.4.bz2
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.4.bz2


Do you have osd.86's log? I think it was the first to fail:

2016-10-15 14:42:32.109025 mon.0 10.90.37.3:6789/0 5240160 : cluster
[INF] osd.86 10.90.37.15:6823/11625 failed (2 reporters from different
host after 20.000215 >= grace 20.00)

Then these osds a couple seconds later:

2016-10-15 14:42:34.900989 mon.0 10.90.37.3:6789/0 5240180 : cluster
[INF] osd.27 10.90.37.5:6802/5426 failed (2 reporters from different
host after 20.000417 >= grace 20.00)
2016-10-15 14:42:34.902105 mon.0 10.90.37.3:6789/0 5240183 : cluster
[INF] osd.95 10.90.37.12:6822/12403 failed (2 reporters from different
host after 20.001862 >= grace 20.00)
2016-10-15 14:42:34.902653 mon.0 10.90.37.3:6789/0 5240185 : cluster
[INF] osd.131 10.90.37.25:6820/195317 failed (2 reporters from
different host after 20.002387 >= grace 20.00)
2016-10-15 14:42:34.903205 mon.0 10.90.37.3:6789/0 5240187 : cluster
[INF] osd.136 10.90.37.23:6803/5148 failed (2 reporters from different
host after 20.002898 >= grace 20.00)
2016-10-15 14:42:35.576139 mon.0 10.90.37.3:6789/0 5240191 : cluster
[INF] osd.24 10.90.37.3:6800/4587 failed (2 reporters from different
host after 21.384669 >= grace 20.094412)
2016-10-15 14:42:35.580217 mon.0 10.90.37.3:6789/0 5240193 : cluster
[INF] osd.37 10.90.37.11:6838/179566 failed (3 reporters from
different host after 20.680190 >= grace 20.243928)
2016-10-15 14:42:35.581550 mon.0 10.90.37.3:6789/0 5240195 : cluster
[INF] osd.46 10.90.37.9:6800/4811 failed (2 reporters from different
host after 21.389655 >= grace 20.00)
2016-10-15 14:42:35.582286 mon.0 10.90.37.3:6789/0 5240197 : cluster
[INF] osd.64 10.90.37.21:6810/7658 failed (2 reporters from different
host after 21.390167 >= grace 20.409388)
2016-10-15 14:42:35.582823 mon.0 10.90.37.3:6789/0 5240199 : cluster
[INF] osd.107 10.90.37.19:6820/10260 failed (2 reporters from
different host after 21.390516 >= grace 20.074818)


Just a hunch, but do osds 86, 27, 95, etc... all share the same PG?
Use 'ceph pg dump' to check.

>
>>> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)
>>
>> In this log the thread which is hanging is doing deep-scrub:
>>
>> 2016-10-18 22:16:23.985462 7f12da4af700  0 log_channel(cluster) log
>> [INF] : 39.54 deep-scrub starts
>> 2016-10-18 22:16:39.008961 7f12e4cc4700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7f12da4af700' had timed out after 15
>> 2016-10-18 22:18:54.175912 7f12e34c1700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7f12da4af700' had suicide timed out after 150
>>
>> So you can disable scrubbing completely with
>>
>>   ceph osd set noscrub
>>   ceph osd set nodeep-scrub
>>
>> in case you are hitting some corner case with the scrubbing code.
>
> Now the cluster seem to be healthy. but as soon as I re enable scrubbing and 
> rebalancing OSD start to flap and the cluster switch to HEATH_ERR
>

Looks like recover/backfill are enabled and you have otherwise all
clean PGs. Don't be afraid to leave scrubbing disabled until you
understand exactly what is going wrong.

Do you see any SCSI / IO errors on the disks failing to scrub?
Though, it seems unlikely that so many disks are all failing at the
same time. More likely there's at least one object that's giving the
scrubber problems and hanging the related OSDs.


> cluster f9dfd27f-c704-4d53-9aa0-4a23d655c7c4
>   health HEALTH_WARN
>  noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>   monmap e1: 3 mons at
> {iccluster002.iccluster.epfl.ch=10.90.37.3:6789/0,iccluster010.iccluster.epfl.ch=10.90.37.11:6789/0,iccluster018.iccluster.epfl.ch=10.90.37.19:6789/0}
>  election epoch 64, quorum 0,1,2 
> iccluster002.iccluster.epfl.ch,iccluster010.iccluster.epfl.ch,iccluster018.iccluster.epfl.ch
>fsmap e131: 1/1/1 up {0=iccluster022.iccluster.epfl.ch=up:active}, 2 
> up:standby
>   

Re: [ceph-users] HELP ! Cluster unusable with lots of "hit suicide timeout"

2016-10-19 Thread Yoann Moulin
Hello,

>> We have a cluster in Jewel 10.2.2 under ubuntu 16.04. The cluster is compose 
>> by 12 nodes, each nodes have 10 OSD with journal on disk.
>>
>> We have one rbd partition and a radosGW with 2 data pool, one replicated, 
>> one EC (8+2)
>>
>> in attachment few details on our cluster.
>>
>> Currently, our cluster is not usable at all due to too much OSD instability. 
>> OSDs daemon die randomly with "hit suicide timeout". Yesterday, all
>> of 120 OSDs died at least 12 time (max 74 time) with an average around 40 
>> time
>>
>> here logs from ceph mon and from one OSD :
>>
>> http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)
> 
> Do you have an older log showing the start of the incident? The
> cluster was already down when this log started.

Here the log from Saturday, OSD 134 is the first which had error :

http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.134.log.4.bz2
http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.4.bz2
http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.4.bz2

>> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)
> 
> In this log the thread which is hanging is doing deep-scrub:
> 
> 2016-10-18 22:16:23.985462 7f12da4af700  0 log_channel(cluster) log
> [INF] : 39.54 deep-scrub starts
> 2016-10-18 22:16:39.008961 7f12e4cc4700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f12da4af700' had timed out after 15
> 2016-10-18 22:18:54.175912 7f12e34c1700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f12da4af700' had suicide timed out after 150
> 
> So you can disable scrubbing completely with
> 
>   ceph osd set noscrub
>   ceph osd set nodeep-scrub
> 
> in case you are hitting some corner case with the scrubbing code.

Now the cluster seem to be healthy. but as soon as I re enable scrubbing and 
rebalancing OSD start to flap and the cluster switch to HEATH_ERR

cluster f9dfd27f-c704-4d53-9aa0-4a23d655c7c4
  health HEALTH_WARN
 noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
  monmap e1: 3 mons at
{iccluster002.iccluster.epfl.ch=10.90.37.3:6789/0,iccluster010.iccluster.epfl.ch=10.90.37.11:6789/0,iccluster018.iccluster.epfl.ch=10.90.37.19:6789/0}
 election epoch 64, quorum 0,1,2 
iccluster002.iccluster.epfl.ch,iccluster010.iccluster.epfl.ch,iccluster018.iccluster.epfl.ch
   fsmap e131: 1/1/1 up {0=iccluster022.iccluster.epfl.ch=up:active}, 2 
up:standby
  osdmap e72932: 144 osds: 144 up, 120 in
 flags noout,noscrub,nodeep-scrub,sortbitwise
   pgmap v4834810: 9408 pgs, 28 pools, 153 TB data, 75849 kobjects
 449 TB used, 203 TB / 653 TB avail
 9408 active+clean


>> We have stopped all clients i/o to see if the cluster get stable without 
>> success, to avoid  endless rebalancing with OSD flapping, we had to
>> "set noout" the cluster. For now we have no idea what's going on.
>>
>> Anyone can help us to understand what's happening ?
> 
> Is your network OK?

We have one 10G nic for the private network and one 10G nic for the public 
network. The network is far under loaded right now and there is no
error. We don't use jumbo frame.

> It will be useful to see the start of the incident to better
> understand what caused this situation.
>
> Also, maybe useful for you... you can increase the suicide timeout, e.g.:
> 
>osd op thread suicide timeout: 
> 
> If the cluster is just *slow* somehow, then increasing that might
> help. If there is something systematically broken, increasing would
> just postpone the inevitable.

Ok, I'm going to study this option with my colleagues

thanks

-- 
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-19 Thread Jim Kilborn
I have setup a new linux cluster to allow migration from our old SAN based 
cluster to a new cluster with ceph.
All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
I am basically running stock ceph settings, with just turning the write cache 
off via hdparm on the drives, and temporarily turning of scrubbing.

The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
Server performance should be good.  Since I am running cephfs, I have tiering 
setup.
Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. So 
the idea is to ensure a single host failure.
Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
replicated set with size=2
The cache tier also has a 128GB SM863 SSD that is being used as a journal for 
the cache SSD. It has power loss protection
My crush map is setup to ensure the cache pool uses only the 4 850 pro and the 
erasure code uses only the 16 spinning 4TB drives.

The problems that I am seeing is that I start copying data from our old san to 
the ceph volume, and once the cache tier gets to my  target_max_bytes of 1.4 
TB, I start seeing:

HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
26 ops are blocked > 65.536 sec on osd.0
37 ops are blocked > 32.768 sec on osd.0
1 osds have slow requests
noout,noscrub,nodeep-scrub,sortbitwise flag(s) set

osd.0 is the cache ssd

If I watch iostat on the cache ssd, I see the queue lengths are high and the 
await are high
Below is the iostat on the cache drive (osd.0) on the first host. The avgqu-sz 
is between 87 and 182 and the await is between 88ms and 1193ms

Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz   
await r_await w_await  svctm  %util
sdb
  0.00 0.339.00   84.33 0.9620.11   462.40
75.92  397.56  125.67  426.58  10.70  99.90
  0.00 0.67   30.00   87.33 5.9621.03   471.20
67.86  910.95   87.00 1193.99   8.27  97.07
  0.0016.67   33.00  289.33 4.2118.80   146.20
29.83   88.99   93.91   88.43   3.10  99.83
  0.00 7.337.67  261.67 1.9219.63   163.81   
117.42  331.97  182.04  336.36   3.71 100.00


If I look at the iostat for all the drives, only the cache ssd drive is backed 
up

Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz   
await r_await w_await  svctm  %util
Sdg (journal for cache drive)
  0.00 6.330.008.00 0.00 0.0719.04 
0.000.330.000.33   0.33   0.27
Sdb (cache drive)
  0.00 0.333.33   82.00 0.8320.07   501.68   
106.75 1057.81  269.40 1089.86  11.72 100.00
Sda (4TB EC)
  0.00 0.000.004.00 0.00 0.02 9.33 
0.000.000.000.00   0.00   0.00
Sdd (4TB EC)
  0.00 0.000.002.33 0.00 0.45   392.00 
0.08   34.000.00   34.00   6.86   1.60
Sdf (4TB EC)
  0.0014.000.00   26.00 0.00 0.2217.71 
1.00   38.550.00   38.55   0.68   1.77
Sdc (4TB EC)
  0.00 0.000.001.33 0.00 0.01 8.75 
0.02   12.250.00   12.25  12.25   1.63

While at this time is just complaining about slow osd.0, sometimes the other 
cache tier ssds show some slow response, but not as frequently.


I occasionally see complaints about a client not responding to cache pressure, 
and yesterday while copying serveral terabytes, the client doing the copy was 
noted for failing to respond to capability release, and I ended up rebooting it.

I just seems the cluster isn’t handling large amounts of data copies, like and 
nfs or san based volume would, and I am worried about moving our users to a 
cluster that already is showing signs of performance issues, even when I am 
just doing a copy with no other users. I am doing only one rsync at a time.

Is the problem that I need to user a later kernel for the clients mounting the 
volume ? I have read some posts about that, but the docs say centos 7 with 3.10 
is ok.
Do I need more drives in my cache pool? I only have 4 ssd drive in the cache 
pool (one on each host), with each having a separate journal drive.
But is that too much of a hot spot since all i/o has to go to the cache layer?
It seems like my ssds should be able to keep up with a single rsync copy.
Is there something set wrong on my ssds that they cant keep up?
I put the metadata pool on the ssd cache tier drives as well.

Any ideas where the problem is or what I need to change to make this stable?


Thanks. Additional details below

The ceph osd tier drives are osd 0,  5,  10, 15

ceph df
GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
63155G 50960G   12195G 19.31
POOLS:
NAMEID USED   %USED MAX AVAIL OBJECTS

Re: [ceph-users] offending shards are crashing osd's

2016-10-19 Thread Ronny Aasen

On 06. okt. 2016 13:41, Ronny Aasen wrote:

hello

I have a few osd's in my cluster that are regularly crashing.


[snip]



ofcourse having 3 osd's dying regularly is not good for my health. so i
have set noout, to avoid heavy recoveries.

googeling this error messages gives exactly 1 hit:
https://github.com/ceph/ceph/pull/6946

where it saies:  "the shard must be removed so it can be reconstructed"
but with my 3 osd's failing, i am not certain witch of them contain the
broken shard. (or perhaps all 3 of them?)

a bit reluctant to delete on all 3. I have 4+2 erasure coding.
( erasure size 6 min_size 4 ) so finding out witch one is bad would be
nice.

hope someone have an idea how to progress.

kind regards
Ronny Aasen


i again have this problem with crashing osd's. a more detailed log is on 
the tail of this mail.


Does anyone have any suggestions on how i can identify what shard that 
needs to be removed to allow the EC to recover. ?


and more importantly how i can stop the osd's from crashing?


kind regards
Ronny Aasen





-- query of pg in question --
# ceph pg 5.26 query
{
 "state": "active+undersized+degraded+remapped+wait_backfill",
 "snap_trimq": "[]",
 "epoch": 138744,
 "up": [
 27,
 109,
 2147483647,
 2147483647,
 62,
 75
 ],
 "acting": [
 2147483647,
 2147483647,
 32,
 107,
 62,
 38
 ],
 "backfill_targets": [
 "27(0)",
 "75(5)",
 "109(1)"
 ],
 "actingbackfill": [
 "27(0)",
 "32(2)",
 "38(5)",
 "62(4)",
 "75(5)",
 "107(3)",
 "109(1)"
 ],
 "info": {
 "pgid": "5.26s2",
 "last_update": "84093'35622",
 "last_complete": "84093'35622",
 "log_tail": "82361'32622",
 "last_user_version": 0,
 "last_backfill": "MAX",
 "purged_snaps": "[1~7]",
 "history": {
 "epoch_created": 61149,
 "last_epoch_started": 138692,
 "last_epoch_clean": 136567,
 "last_epoch_split": 0,
 "same_up_since": 138691,
 "same_interval_since": 138691,
 "same_primary_since": 138691,
 "last_scrub": "84093'35622",
 "last_scrub_stamp": "2016-10-18 06:18:28.253508",
 "last_deep_scrub": "84093'35622",
 "last_deep_scrub_stamp": "2016-10-14 05:33:56.701167",
 "last_clean_scrub_stamp": "2016-10-14 05:33:56.701167"
 },
 "stats": {
 "version": "84093'35622",
 "reported_seq": "210475",
 "reported_epoch": "138730",
 "state": "active+undersized+degraded+remapped+wait_backfill",
 "last_fresh": "2016-10-19 12:40:32.982617",
 "last_change": "2016-10-19 12:03:29.377914",
 "last_active": "2016-10-19 12:40:32.982617",
 "last_peered": "2016-10-19 12:40:32.982617",
 "last_clean": "2016-07-19 12:03:54.814292",
 "last_became_active": "0.00",
 "last_became_peered": "0.00",
 "last_unstale": "2016-10-19 12:40:32.982617",
 "last_undegraded": "2016-10-19 12:02:03.030755",
 "last_fullsized": "2016-10-19 12:02:03.030755",
 "mapping_epoch": 138627,
 "log_start": "82361'32622",
 "ondisk_log_start": "82361'32622",
 "created": 61149,
 "last_epoch_clean": 136567,
 "parent": "0.0",
 "parent_split_bits": 0,
 "last_scrub": "84093'35622",
 "last_scrub_stamp": "2016-10-18 06:18:28.253508",
 "last_deep_scrub": "84093'35622",
 "last_deep_scrub_stamp": "2016-10-14 05:33:56.701167",
 "last_clean_scrub_stamp": "2016-10-14 05:33:56.701167",
 "log_size": 3000,
 "ondisk_log_size": 3000,
 "stats_invalid": "0",
 "stat_sum": {
 "num_bytes": 99736657920,
 "num_objects": 12026,
 "num_object_clones": 0,
 "num_object_copies": 84182,
 "num_objects_missing_on_primary": 0,
 "num_objects_degraded": 24052,
 "num_objects_misplaced": 90583,
 "num_objects_unfound": 0,
 "num_objects_dirty": 12026,
 "num_whiteouts": 0,
 "num_read": 86122,
 "num_read_kb": 9446184,
 "num_write": 35622,
 "num_write_kb": 182277312,
 "num_scrub_errors": 0,
 "num_shallow_scrub_errors": 0,
 "num_deep_scrub_errors": 0,
 "num_objects_recovered": 0,
 "num_bytes_recovered": 0,
 "num_keys_recovered": 0,
 "num_objects_omap": 0,
 "num_objects_hit_set_archive": 0,
 

Re: [ceph-users] HELP ! Cluster unusable with lots of "hitsuicidetimeout"

2016-10-19 Thread Burkhard Linke

Hi,

just an additional comment:

you can disable backfilling and recovery temporarily by setting the 
'nobackfill' and 'norecover' flags. It will reduce the backfilling 
traffic and may help the cluster and its OSD to recover. Afterwards you 
should set the backfill traffic settings to the minimum (e.g. 
max_backfills = 1) and unset the flags to allow the cluster to perform 
the outstanding recovery operation.


As the others already pointed out, these actions might help to get the 
cluster up and running again, but you need to find the actual reason for 
the problems.


Regards,
Burkhard

On 19.10.2016 10:04, Christian Balzer wrote:

Hello,

no specific ideas, but this somewhat sounds familiar.

One thing first, you already stopped client traffic but to make sure your
cluster really becomes quiescent, stop all scrubs as well.
That's always a good idea in any recovery, overload situation.

Have you verified CPU load (are those OSD processes busy), memory status,
etc?
How busy are the actual disks?

Sudden deaths like this often are the results of network changes,  like a
switch rebooting and loosing jumbo frame configuration or whatnot.

Christian
  
On Wed, 19 Oct 2016 09:44:01 +0200 Yoann Moulin wrote:



Dear List,

We have a cluster in Jewel 10.2.2 under ubuntu 16.04. The cluster is compose by 
12 nodes, each nodes have 10 OSD with journal on disk.

We have one rbd partition and a radosGW with 2 data pool, one replicated, one 
EC (8+2)

in attachment few details on our cluster.

Currently, our cluster is not usable at all due to too much OSD instability. OSDs daemon 
die randomly with "hit suicide timeout". Yesterday, all
of 120 OSDs died at least 12 time (max 74 time) with an average around 40 time

here logs from ceph mon and from one OSD :

http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)
http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)

We have stopped all clients i/o to see if the cluster get stable without 
success, to avoid  endless rebalancing with OSD flapping, we had to
"set noout" the cluster. For now we have no idea what's going on.

Anyone can help us to understand what's happening ?

thanks for your help





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HELP ! Cluster unusable with lots of "hit suicide timeout"

2016-10-19 Thread Christian Balzer

Hello,

no specific ideas, but this somewhat sounds familiar.

One thing first, you already stopped client traffic but to make sure your
cluster really becomes quiescent, stop all scrubs as well.
That's always a good idea in any recovery, overload situation.

Have you verified CPU load (are those OSD processes busy), memory status,
etc?
How busy are the actual disks?

Sudden deaths like this often are the results of network changes,  like a
switch rebooting and loosing jumbo frame configuration or whatnot.

Christian
 
On Wed, 19 Oct 2016 09:44:01 +0200 Yoann Moulin wrote:

> Dear List,
> 
> We have a cluster in Jewel 10.2.2 under ubuntu 16.04. The cluster is compose 
> by 12 nodes, each nodes have 10 OSD with journal on disk.
> 
> We have one rbd partition and a radosGW with 2 data pool, one replicated, one 
> EC (8+2)
> 
> in attachment few details on our cluster.
> 
> Currently, our cluster is not usable at all due to too much OSD instability. 
> OSDs daemon die randomly with "hit suicide timeout". Yesterday, all
> of 120 OSDs died at least 12 time (max 74 time) with an average around 40 time
> 
> here logs from ceph mon and from one OSD :
> 
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)
> 
> We have stopped all clients i/o to see if the cluster get stable without 
> success, to avoid  endless rebalancing with OSD flapping, we had to
> "set noout" the cluster. For now we have no idea what's going on.
> 
> Anyone can help us to understand what's happening ?
> 
> thanks for your help
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HELP ! Cluster unusable with lots of "hit suicide timeout"

2016-10-19 Thread Dan van der Ster
Hi Yoann,


On Wed, Oct 19, 2016 at 9:44 AM, Yoann Moulin  wrote:
> Dear List,
>
> We have a cluster in Jewel 10.2.2 under ubuntu 16.04. The cluster is compose 
> by 12 nodes, each nodes have 10 OSD with journal on disk.
>
> We have one rbd partition and a radosGW with 2 data pool, one replicated, one 
> EC (8+2)
>
> in attachment few details on our cluster.
>
> Currently, our cluster is not usable at all due to too much OSD instability. 
> OSDs daemon die randomly with "hit suicide timeout". Yesterday, all
> of 120 OSDs died at least 12 time (max 74 time) with an average around 40 time
>
> here logs from ceph mon and from one OSD :
>
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)

Do you have an older log showing the start of the incident? The
cluster was already down when this log started.

> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)

In this log the thread which is hanging is doing deep-scrub:

2016-10-18 22:16:23.985462 7f12da4af700  0 log_channel(cluster) log
[INF] : 39.54 deep-scrub starts
2016-10-18 22:16:39.008961 7f12e4cc4700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f12da4af700' had timed out after 15
2016-10-18 22:18:54.175912 7f12e34c1700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f12da4af700' had suicide timed out after 150

So you can disable scrubbing completely with

  ceph osd set noscrub
  ceph osd set nodeep-scrub

in case you are hitting some corner case with the scrubbing code.

> We have stopped all clients i/o to see if the cluster get stable without 
> success, to avoid  endless rebalancing with OSD flapping, we had to
> "set noout" the cluster. For now we have no idea what's going on.
>
> Anyone can help us to understand what's happening ?

Is your network OK?

It will be useful to see the start of the incident to better
understand what caused this situation.

Also, maybe useful for you... you can increase the suicide timeout, e.g.:

   osd op thread suicide timeout: 

If the cluster is just *slow* somehow, then increasing that might
help. If there is something systematically broken, increasing would
just postpone the inevitable.

-- Dan




>
> thanks for your help
>
> --
> Yoann Moulin
> EPFL IC-IT
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] HELP ! Cluster unusable with lots of "hit suicide timeout"

2016-10-19 Thread Yoann Moulin
Dear List,

We have a cluster in Jewel 10.2.2 under ubuntu 16.04. The cluster is compose by 
12 nodes, each nodes have 10 OSD with journal on disk.

We have one rbd partition and a radosGW with 2 data pool, one replicated, one 
EC (8+2)

in attachment few details on our cluster.

Currently, our cluster is not usable at all due to too much OSD instability. 
OSDs daemon die randomly with "hit suicide timeout". Yesterday, all
of 120 OSDs died at least 12 time (max 74 time) with an average around 40 time

here logs from ceph mon and from one OSD :

http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)
http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)

We have stopped all clients i/o to see if the cluster get stable without 
success, to avoid  endless rebalancing with OSD flapping, we had to
"set noout" the cluster. For now we have no idea what's going on.

Anyone can help us to understand what's happening ?

thanks for your help

-- 
Yoann Moulin
EPFL IC-IT
$ ceph --version
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)

$ uname -a
Linux icadmin004 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ ceph osd pool ls detail
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 4927 flags hashpspool stripe_width 0
	removed_snaps [1~3]
pool 3 '.rgw.root' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 258 flags hashpspool stripe_width 0
pool 4 'default.rgw.control' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 259 flags hashpspool stripe_width 0
pool 5 'default.rgw.data.root' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 260 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 6 'default.rgw.gc' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 261 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 7 'default.rgw.log' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 262 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 8 'erasure.rgw.buckets.index' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 271 flags hashpspool stripe_width 0
pool 9 'erasure.rgw.buckets.extra' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 272 flags hashpspool stripe_width 0
pool 11 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 276 flags hashpspool stripe_width 0
pool 12 'default.rgw.buckets.extra' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 277 flags hashpspool stripe_width 0
pool 14 'default.rgw.users.uid' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 311 flags hashpspool stripe_width 0
pool 15 'default.rgw.users.keys' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 313 flags hashpspool stripe_width 0
pool 16 'default.rgw.meta' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 315 flags hashpspool stripe_width 0
pool 17 'default.rgw.users.swift' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 320 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 18 'default.rgw.users.email' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 322 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 19 'default.rgw.usage' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 353 flags hashpspool stripe_width 0
pool 20 'default.rgw.buckets.data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 4918 flags hashpspool stripe_width 0
pool 26 '.rgw.control' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 3549 flags hashpspool stripe_width 0
pool 27 '.rgw' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 3551 flags hashpspool stripe_width 0
pool 28 '.rgw.gc' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 3552 flags hashpspool stripe_width 0
pool 29 '.log' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 3553 flags hashpspool stripe_width 0
pool 30 'test' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 4910 flags hashpspool stripe_width 0
pool 31 'data' replicated size 3 min_size 2 crush_ruleset 0