Re: [ceph-users] scrub error with ceph

2015-12-07 Thread GuangYang
Before issuing a scrub, you may check whether those scrub errors point to one (or a small subset of) disks/OSDs, and if so, whether those objects were written within a specific time interval. That is a large number of scrub errors for a small cluster, which might be caused by a hardware issue.
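
For reference, a minimal sketch of how to check whether the inconsistent PGs cluster on one disk/OSD (the PG id 2.37 and /dev/sdX are placeholders; output formats vary a bit across releases):

    # list the inconsistent PGs
    ceph health detail | grep inconsistent
    # for each reported PG, show which OSDs hold it (up/acting set)
    ceph pg map 2.37
    # if one OSD id keeps showing up, check its host for disk errors
    dmesg | grep -iE 'i/o error|sector'
    smartctl -a /dev/sdX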

Re: [ceph-users] osd wasn't marked as down/out when it's storage folder was deleted

2015-12-07 Thread GuangYang
It is actually not part of Ceph. Some files under the folder are only accessed while the OSD boots up, so removing them would not cause an immediate problem. For other files, the OSD keeps an open handle, in which case, even if you remove those files from the filesystem, they are not erased
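
A quick way to see which of the removed files the ceph-osd process still holds open (the OSD pid is a placeholder; this is a generic Linux check, not anything Ceph-specific):

    # find the PID of the OSD in question, then list deleted-but-still-open files
    pgrep -a ceph-osd
    ls -l /proc/<osd-pid>/fd | grep '(deleted)'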

Re: [ceph-users] pgmap question

2015-09-17 Thread GuangYang
IIRC, the version is incremented whenever the stats of the PG change, which is probably why you saw it changing with client I/O. Thanks, Guang > Date: Thu, 17 Sep 2015 16:55:41 -0600 > From: rob...@leblancnet.us > To: ceph-users@lists.ceph.com >
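
A quick way to watch the pgmap version tick along with client I/O (a sketch; the exact output format differs across releases):

    # prints something like: v123456: 1024 pgs: 1024 active+clean; ...
    ceph pg stat
    watch -n1 'ceph pg stat'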

Re: [ceph-users] Lot of blocked operations

2015-09-17 Thread GuangYang
Which version are you using? My guess is that the request (op) is waiting for a lock (it might be the ondisk_read_lock of the object, but debug_osd=20 should be enough to tell what happened to the op). How do you tell that the I/O wait is near 0 (via top?)? Thanks, Guang
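
A sketch of how that information might be gathered (osd.12 is a placeholder; run the admin socket commands on the OSD's host):

    # raise the OSD debug level on the suspect OSD
    ceph tell osd.12 injectargs '--debug_osd 20'
    # look at in-flight and recent slow ops via the admin socket
    ceph daemon osd.12 dump_ops_in_flight
    ceph daemon osd.12 dump_historic_ops
    # check I/O wait on the host (%iowait in the avg-cpu section, or %wa in top)
    iostat -x 1 5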

Re: [ceph-users] Hammer reduce recovery impact

2015-09-11 Thread GuangYang
If we are talking about requests being blocked for 60+ seconds, those tunings might not help (they help a lot with average latency during recovery/backfill). It would be interesting to see the logs for those blocked requests on the OSD side (they are logged at level 0); the pattern to search for might be "slow
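
A sketch of pulling those entries out of the OSD logs (default log path assumed; adjust to your setup):

    grep 'slow request' /var/log/ceph/ceph-osd.*.log | less
    # count slow requests per OSD log to see whether they cluster on a few disks
    grep -c 'slow request' /var/log/ceph/ceph-osd.*.log | sort -t: -k2 -rn | head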

Re: [ceph-users] Ceph performance, empty vs part full

2015-09-04 Thread GuangYang
t; > Shinobu > > - Original Message - > From: "GuangYang" <yguan...@outlook.com> > To: "Ben Hines" <bhi...@gmail.com>, "Nick Fisk" <n...@fisk.me.uk> > Cc: "ceph-users" <ceph-users@lists.ceph.com> > Sent: Sa

Re: [ceph-users] Ceph performance, empty vs part full

2015-09-04 Thread GuangYang
IIRC, the move (merge or split) is only triggered when that folder is hit by a request, so most likely it happens gradually. Another thing that might be helpful (and that we have had good experience with) is doing the folder splitting at pool creation time, so that we avoid the performance
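
A sketch of pre-splitting at pool creation time (pool name, PG count and object count are placeholders, and the exact 'ceph osd pool create' syntax with an expected object count should be verified against your release):

    # ceph.conf ([osd]): a negative merge threshold keeps the pre-split folders from merging back
    #   filestore merge threshold = -10
    # create the pool with an expected object count so folders are split up front
    ceph osd pool create mypool 2048 2048 replicated replicated_ruleset 100000000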

Re: [ceph-users] Opensource plugin for pulling out cluster recovery and client IO metric

2015-08-29 Thread GuangYang
Date: Fri, 28 Aug 2015 12:07:39 +0100 From: gfar...@redhat.com To: vickey.singh22...@gmail.com CC: ceph-users@lists.ceph.com; ceph-us...@ceph.com; ceph-de...@vger.kernel.org Subject: Re: [ceph-users] Opensource plugin for pulling out cluster recovery

Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 active+remapped

2015-08-13 Thread GuangYang
/DBd9k56m I realize these nodes are quite large; I have plans to break them out into 12 OSDs/node. On Thu, Aug 13, 2015 at 9:02 AM, GuangYang yguan...@outlook.com wrote: Could you share the 'ceph osd tree dump' and CRUSH map dump ? Thanks, Guang Date

Re: [ceph-users] OSD space imbalance

2015-08-13 Thread GuangYang
it could incur lots of data movement... To: ceph-users@lists.ceph.com From: vedran.fu...@gmail.com Date: Fri, 14 Aug 2015 00:15:17 +0200 Subject: Re: [ceph-users] OSD space imbalance On 13.08.2015 18:01, GuangYang wrote: Try 'ceph osd reweight-by-pg <int>' right

Re: [ceph-users] OSD space imbalance

2015-08-13 Thread GuangYang
Try 'ceph osd reweight-by-pg <int>' right after creating the pools? What is the typical object size in the cluster? Thanks, Guang To: ceph-users@lists.ceph.com From: vedran.fu...@gmail.com Date: Thu, 13 Aug 2015 14:58:11 +0200 Subject: [ceph-users]
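
For reference, a sketch of the command (the overload percentage 110 and the pool name are placeholders; check the help output of your release for the exact arguments):

    # reweight OSDs by PG count, treating anything above 110% of the mean as overloaded
    ceph osd reweight-by-pg 110
    # some releases also accept pool names to restrict the calculation, e.g.
    # ceph osd reweight-by-pg 110 .rgw.buckets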

Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 active+remapped

2015-08-13 Thread GuangYang
Could you share the 'ceph osd tree dump' and CRUSH map dump ? Thanks, Guang Date: Thu, 13 Aug 2015 08:16:09 -0700 From: sdain...@spd1.com To: yangyongp...@bwstor.com.cn; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Cluster health_warn 1
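
For anyone following along, a sketch of how to produce those dumps:

    ceph osd tree
    # decompile the CRUSH map into readable text
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt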

Re: [ceph-users] osd out

2015-08-12 Thread GuangYang
If you are using the default configuration to create the pool (3 replicas), then after losing 1 OSD and having 2 left, CRUSH would not be able to find enough OSDs (at least 3) to map the PG, so it would be stuck unclean. Thanks, Guang From:
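
A sketch of checking (and, only if you really intend to run with 2 OSDs, lowering) the replication size so the PGs can go clean; the pool name 'rbd' is a placeholder:

    ceph osd pool get rbd size
    ceph osd pool get rbd min_size
    # with only 2 OSDs available, size=3 can never be satisfied
    ceph osd pool set rbd size 2
    ceph osd pool set rbd min_size 1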

Re: [ceph-users] radosgw crash within libfcgi

2015-06-26 Thread GuangYang
.. Thanks, Guang Date: Wed, 24 Jun 2015 17:21:04 -0400 From: yeh...@redhat.com To: yguan...@outlook.com CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com Subject: Re: radosgw crash within libfcgi - Original Message - From: GuangYang To: Yehuda Sadeh-Weinraub Cc: ceph-de

Re: [ceph-users] radosgw crash within libfcgi

2015-06-24 Thread GuangYang
; ceph-users@lists.ceph.com Subject: Re: radosgw crash within libfcgi - Original Message - From: GuangYang yguan...@outlook.com To: ceph-de...@vger.kernel.org, ceph-users@lists.ceph.com, yeh...@redhat.com Sent: Wednesday, June 24, 2015 10:09:58 AM Subject: radosgw crash within

[ceph-users] radosgw crash within libfcgi

2015-06-24 Thread GuangYang
Hello Cephers, Recently we have had several radosgw daemon crashes with the same kernel log: Jun 23 14:17:38 xxx kernel: radosgw[68180]: segfault at f0 ip 7ffa069996f2 sp 7ff55c432710 error 6 in libfcgi.so.0.0.0[7ffa06995000+a000] Looking

[ceph-users] xattrs vs. omap with radosgw

2015-06-16 Thread GuangYang
Hi Cephers, While looking at disk utilization on an OSD, I noticed the disk was constantly busy with a large number of small writes. Further investigation showed that, because radosgw uses xattrs to store metadata (e.g. etag, content-type, etc.), the xattrs spilled from the inode's local (inline) format out into extents, which
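
A sketch of what one could inspect here; the object file path is a placeholder, and the XFS-specific inline-xattr settings named below are my assumption about the relevant Firefly-era options, so verify them against your release:

    # dump the xattrs radosgw leaves on the backing file of an object (run on the OSD host)
    getfattr -d -m - -e hex /var/lib/ceph/osd/ceph-12/current/<pgid>_head/<object-file>
    # filestore limits that decide when an xattr spills out of the inode (ceph.conf, [osd]):
    #   filestore_max_inline_xattr_size_xfs = 65536
    #   filestore_max_inline_xattrs_xfs = 10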

Re: [ceph-users] xattrs vs. omap with radosgw

2015-06-16 Thread GuangYang
...@newdream.net CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com Subject: Re: xattrs vs. omap with radosgw On 06/16/2015 03:48 PM, GuangYang wrote: Thanks Sage for the quick response. It is on Firefly v0.80.4. While trying to put with *rados* directly, the xattrs can be inline

Re: [ceph-users] xattrs vs. omap with radosgw

2015-06-16 Thread GuangYang
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Wednesday, June 17, 2015 3:43 AM To: GuangYang Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com Subject: Re: xattrs vs. omap with radosgw On Tue, 16 Jun 2015, GuangYang wrote: Hi Cephers, While looking at disk

Re: [ceph-users] unfound object(s)

2015-06-15 Thread GuangYang
Thanks to Sam, we can use 'ceph pg <pg_id> list_missing' to get the list of unfound objects. Thanks, Guang From: yguan...@outlook.com To: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com Date: Mon, 15 Jun 2015 16:46:53 + Subject: [ceph-users]
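
For completeness, a sketch of the sequence (the PG id 3.1f is a placeholder; mark_unfound_lost is destructive, so only reach for it once recovery has genuinely exhausted every location):

    ceph health detail | grep unfound
    ceph pg 3.1f list_missing
    # last resort only:
    # ceph pg 3.1f mark_unfound_lost revert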

[ceph-users] unfound object(s)

2015-06-15 Thread GuangYang
Hello Cephers, On one of our production clusters, there is one *unfound* object reported, which makes the PG stuck in recovering. While trying to recover the object, I failed to find a way to tell which object is unfound. I tried: 1) PG query, 2) grepping the monitor log. Did I miss anything?

[ceph-users] rgw geo-replication

2015-04-24 Thread GuangYang
Hi cephers, I have recently been investigating rgw geo-replication. From the example at [1], it looks like, if we want to do data geo-replication between US East and US West, we will need to build *one* (super) RADOS cluster which crosses US East and US West, and only deploy two different radosgw

Re: [ceph-users] rgw geo-replication

2015-04-24 Thread GuangYang
___ Date: Fri, 24 Apr 2015 17:29:40 +0530 From: vum...@redhat.com To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] rgw geo-replication On 04/24/2015 05:17 PM, GuangYang wrote: Hi cephers, Recently I am investigating the geo-replication of rgw

Re: [ceph-users] ceph data not well distributed.

2015-04-14 Thread GuangYang
We have a tiny script which does the CRUSH re-weight based on PGs per OSD to achieve balance across OSDs, and we run the script right after setting up the cluster, to avoid data migration once the cluster has filled up. A couple of experiences to share: 1) As suggested, it is helpful to choose a
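
The script itself isn't included in the thread; below is a hypothetical sketch of the idea (count PGs per OSD from the acting sets, then nudge CRUSH weights), not the script referenced above. Column positions in 'ceph pg dump pgs_brief' differ between releases, so adjust the awk field accordingly:

    # count how many PGs land on each OSD (acting sets look like [3,7,12])
    ceph pg dump pgs_brief 2>/dev/null \
      | awk '{print $5}' | tr -d '[]' | tr ',' '\n' \
      | grep -E '^[0-9]+$' | sort -n | uniq -c | sort -rn | head -20
    # then nudge the CRUSH weight of the most loaded OSDs, e.g.
    # ceph osd crush reweight osd.12 1.75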

Re: [ceph-users] Some long running ops may lock osd

2015-03-02 Thread GuangYang
We have had good experience so far keeping each bucket below 0.5 million objects via client-side sharding. But I think it would be good for you to test at your scale, with your hardware configuration and your expectations for tail latency. Generally the bucket sharding should
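
As an illustration only (not the scheme used in production here), client-side sharding can be as simple as hashing the object name into one of N pre-created buckets; all names below are hypothetical:

    # route object "$key" into one of 16 buckets: images-0 .. images-15
    key="some/object/name.jpg"
    shard=$(( $(printf '%s' "$key" | cksum | cut -d' ' -f1) % 16 ))
    bucket="images-${shard}"
    echo "PUT s3://${bucket}/${key}"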

Re: [ceph-users] who is using radosgw with civetweb?

2015-02-26 Thread GuangYang
Hi Sage, Is there any timeline for the switch, so that we can plan ahead for testing? We are running apache + mod_fastcgi in production at scale (540 OSDs, 9 RGW hosts) and it looks good so far, although at the beginning we came across a problem with a large volume of 500 errors, which
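
For anyone planning the same test, a sketch of the civetweb switch (ceph.conf fragment; the client section name and port are placeholders, and the frontends syntax should be checked against your release):

    [client.radosgw.gateway]
        rgw frontends = civetweb port=7480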

Re: [ceph-users] PG inconsistency

2014-11-09 Thread GuangYang
Thanks Sage! Date: Fri, 7 Nov 2014 02:19:06 -0800 From: s...@newdream.net To: yguan...@outlook.com CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com Subject: Re: PG inconsistency On Thu, 6 Nov 2014, GuangYang wrote: Hello Cephers, Recently

[ceph-users] PG inconsistency

2014-11-06 Thread GuangYang
Hello Cephers, Recently we observed a couple of inconsistencies in our Ceph cluster; there were two major patterns leading to inconsistency, as far as I observed: 1) EIO when reading the file, 2) the digest is inconsistent (for EC) even when there is no read error. While Ceph has built-in tool sets to repair
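
For reference, a sketch of the built-in tooling being referred to (the PG id 11.3c is a placeholder; note that on these releases repair generally favors the primary's copy, so inspect the objects before repairing):

    ceph health detail | grep inconsistent
    # re-confirm with a deep scrub, then repair
    ceph pg deep-scrub 11.3c
    ceph pg repair 11.3c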

Re: [ceph-users] PG inconsistency

2014-11-06 Thread GuangYang
know if this is a best or even good practise, but it works for us. Cheers, Dan On Thu Nov 06 2014 at 2:24:32 PM GuangYang yguan...@outlook.commailto:yguan...@outlook.com wrote: Hello Cephers, Recently we observed a couple of inconsistencies in our Ceph cluster, there were two major

Re: [ceph-users] PG inconsistency

2014-11-06 Thread GuangYang
-users@lists.ceph.com What is your version of the ceph? 0.80.0 - 0.80.3 https://github.com/ceph/ceph/commit/7557a8139425d1705b481d7f010683169fd5e49b Thu Nov 06 2014 at 16:24:21, GuangYang yguan...@outlook.commailto:yguan...@outlook.com: Hello Cephers, Recently we observed a couple

Re: [ceph-users] Filestore throttling

2014-10-28 Thread GuangYang
Date: Thu, 23 Oct 2014 21:26:07 -0700 From: s...@newdream.net To: yguan...@outlook.com CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com Subject: RE: Filestore throttling On Fri, 24 Oct 2014, GuangYang wrote: commit

Re: [ceph-users] Filestore throttling

2014-10-23 Thread GuangYang
--- Date: Thu, 23 Oct 2014 06:58:58 -0700 From: s...@newdream.net To: yguan...@outlook.com CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com Subject: RE: Filestore throttling On Thu, 23 Oct 2014, GuangYang wrote: Thanks Sage for the quick

[ceph-users] Filestore throttling

2014-10-22 Thread GuangYang
Hello Cephers, During our testing, I found that the filestore throttling became a limiting factor for performance. The four settings (with default values) are: filestore queue max ops = 50, filestore queue max bytes = 100 << 20, filestore queue committing max ops = 500, filestore queue committing
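
For reference, a ceph.conf sketch of raising those limits (the values below are purely illustrative, not recommendations; '100 << 20' above means 100 MB):

    [osd]
        filestore queue max ops = 500
        filestore queue max bytes = 1073741824            # 1 GB
        filestore queue committing max ops = 5000
        filestore queue committing max bytes = 1073741824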

Re: [ceph-users] Filestore throttling

2014-10-22 Thread GuangYang
CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com Subject: Re: Filestore throttling On Thu, 23 Oct 2014, GuangYang wrote: Hello Cephers, During our testing, I found that the filestore throttling became a limiting factor for performance, the four settings (with default value

Re: [ceph-users] Ceph GET latency

2014-02-21 Thread GuangYang
Thanks Greg for the response, my comments inline… Thanks, Guang On Feb 20, 2014, at 11:16 PM, Gregory Farnum g...@inktank.com wrote: On Tue, Feb 18, 2014 at 7:24 AM, Guang Yang yguan...@yahoo.com wrote: Hi ceph-users, We are using Ceph (radosgw) to store user generated images, as GET latency

Re: [ceph-users] Ceph GET latency

2014-02-19 Thread GuangYang
Thanks Yehuda. Try looking at the perfcounters, see if there's any other throttling happening. Also, make sure you have enough pgs for your data pool. One other thing to try is disabling leveldb xattrs and see if it affects your latency. 1. There is no throttling happening. 2. According to
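
A sketch of how each of those checks might be done on a cluster of that era (socket path and pool name are placeholders; 'filestore xattr use omap' only exists on older releases, so treat it as an assumption to verify):

    # perf counters (look at the throttle-* sections) via the admin socket
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump | less
    # PG count of the rgw data pool
    ceph osd pool get .rgw.buckets pg_num
    # leveldb/omap xattr behaviour (ceph.conf, [osd]):
    #   filestore xattr use omap = false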

[ceph-users] Ceph cluster is unreachable because of authentication failure

2014-01-14 Thread GuangYang
Hi ceph-users and ceph-devel, I came across an issue after restarting the monitors of the cluster: authentication fails, which prevents running any ceph command. After we did some maintenance work, I restarted an OSD; however, I found that the OSD would not join the cluster automatically after being