puzzled with the design pattern of ceph journal, really ruining performance

2014-09-17 Thread 姚宁
Hi, guys. I analyzed the architecture of the ceph source code. I know that, in order to keep the journal atomic and consistent, the journal write mode should be set to O_DSYNC, or the fdatasync() system call should be made after every write operation. However, this kind of operation is really killing the

[PATCH] ceph: remove redundant io_iter_advance()

2014-09-17 Thread Yan, Zheng
ceph_sync_read and generic_file_read_iter() have already advanced the IO iterator. Signed-off-by: Yan, Zheng z...@redhat.com --- fs/ceph/file.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/ceph/file.c b/fs/ceph/file.c index 1c1df08..d7e0da8 100644 --- a/fs/ceph/file.c +++

RE: puzzled with the design pattern of ceph journal, really ruining performance

2014-09-17 Thread Somnath Roy
Hi Nicheal, Not only recovery; IMHO the main purpose of the ceph journal is to support transaction semantics, since XFS doesn't have that. I guess it can't be achieved with pg_log/pg_info. Thanks Regards Somnath -Original Message- From: ceph-devel-ow...@vger.kernel.org

RE: puzzled with the design pattern of ceph journal, really ruining performance

2014-09-17 Thread Chen, Xiaoxi
Hi Nicheal, 1. The main purpose of the journal is to provide transaction semantics (prevent partial updates). Peering is not enough for this need because ceph writes all replicas at the same time, so after a crash you have no idea which replica has the right data. For example, say we have 2 replicas,

[PATCH] ceph: remove redundant code for max file size verification

2014-09-17 Thread Chao Yu
Both ceph_update_writeable_page and ceph_setattr verify the file size against the max size ceph supports. There are two callers of ceph_update_writeable_page: ceph_write_begin and ceph_page_mkwrite. For ceph_write_begin, we have already verified the size in generic_write_checks of ceph_write_iter; for

Re: puzzled with the design pattern of ceph journal, really ruining performance

2014-09-17 Thread Alexandre DERUMIER
2. Have you got any data to prove that O_DSYNC or fdatasync kills the performance of the journal? In our previous test, the journal SSD (using a partition of an SSD as the journal for a particular OSD, with 4 OSDs sharing the same SSD) could reach its peak performance (300-400MB/s) Hi, I have done some

RadosGW objects to Rados object mapping

2014-09-17 Thread Abhishek L
Hi, I'm trying to understand the internals of RadosGW: how buckets/containers and objects are mapped back to rados objects. I couldn't find any docs; however, a previous mailing list discussion[1] explained how S3/Swift objects are cut into rados objects and about manifests. I was able to

Re: [ceph-users] Crushmap ruleset for rack aware PG placement

2014-09-17 Thread Johnu George (johnugeo)
Adding ceph-devel On 9/17/14, 1:27 AM, Loic Dachary l...@dachary.org wrote: Could you resend with ceph-devel in cc ? It's better for archive purposes ;-) On 17/09/2014 09:37, Johnu George (johnugeo) wrote: Hi Sage, I was looking at the crash that was reported in this mail chain. I

Re: puzzled with the design pattern of ceph journal, really ruining performance

2014-09-17 Thread Mark Nelson
On 09/17/2014 09:20 AM, Alexandre DERUMIER wrote: 2. Have you got any data to prove that O_DSYNC or fdatasync kills the performance of the journal? In our previous test, the journal SSD (using a partition of an SSD as the journal for a particular OSD, with 4 OSDs sharing the same SSD) could reach its peak

Re: RadosGW objects to Rados object mapping

2014-09-17 Thread Yehuda Sadeh
On Wed, Sep 17, 2014 at 7:39 AM, Abhishek L abhishek.lekshma...@gmail.com wrote: Hi, I'm trying to understand the internals of RadosGW: how buckets/containers and objects are mapped back to rados objects. I couldn't find any docs; however, a previous mailing list discussion[1] explained how

Re: [ceph-users] Crushmap ruleset for rack aware PG placement

2014-09-17 Thread Loic Dachary
Hi, If the number of replicas desired is 1, then https://github.com/ceph/ceph/blob/firefly/src/crush/CrushWrapper.h#L915 will be called with maxout = 1 and scratch will be maxout * 3. But if the rule always selects 4 items, then it overflows. Is that what you also read? Cheers On 17/09/2014
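The overflow scenario Loic describes can be sketched with a simplified model (this is a hypothetical illustration of the scratch-buffer sizing, not the real CrushWrapper code, which is C++ and more involved):

```c
#include <stdlib.h>

/* Sketch of the sizing bug under discussion: the scratch buffer is sized
 * from the requested number of results (maxout), but the rule may emit
 * more items than that.  Returns 0 on success and reports via *overflow
 * whether the rule's output would run past the end of the buffer. */
int do_rule_sketch(int maxout, int items_emitted_by_rule, int *overflow) {
    int scratch_len = maxout * 3;   /* mirrors the maxout * 3 sizing above */
    int *scratch = malloc(sizeof(int) * scratch_len);
    if (!scratch)
        return -1;

    /* Writes beyond scratch[scratch_len - 1] would be out of bounds. */
    *overflow = items_emitted_by_rule > scratch_len;

    free(scratch);
    return 0;
}
```

With maxout = 1 (one replica requested) and a rule that always selects 4 items, the scratch buffer holds 3 entries and the 4th write lands out of bounds, matching the crash described in the thread.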

Next Week: Ceph Day San Jose

2014-09-17 Thread Ross Turk
Hey everyone!  We just posted the agenda for next week’s Ceph Day in San Jose: http://ceph.com/cephdays/san-jose/ This Ceph Day will be held in a beautiful facility provided by our friends at Brocade.  We have a lot of great speakers from Brocade, Red Hat, Dell, Fujitsu, HGST, and Supermicro,

Re: RadosGW objects to Rados object mapping

2014-09-17 Thread Abhishek L
Yehuda Sadeh writes: On Wed, Sep 17, 2014 at 7:39 AM, Abhishek L abhishek.lekshma...@gmail.com wrote: Hi, I'm trying to understand the internals of RadosGW, on how buckets/containers, objects are mapped back to rados objects. I couldn't find any docs, however a previous mailing list

Re: [ceph-users] Crushmap ruleset for rack aware PG placement

2014-09-17 Thread Johnu George (johnugeo)
Loic, You are right. Are we planning to support configurations where the replica number is different from the number of osds selected by a rule? If not, one solution is to add a validation check when a rule is activated for a pool with a specific replica count. Johnu On 9/17/14, 9:10 AM, Loic

Re: [ceph-users] Crushmap ruleset for rack aware PG placement

2014-09-17 Thread Loic Dachary
On 17/09/2014 22:03, Johnu George (johnugeo) wrote: Loic, You are right. Are we planning to support configurations where replica number is different from the number of osds selected from a rule? I think crush should support it, yes. If a rule can provide 10 OSDs there is no reason

Re: severe librbd performance degradation in Giant

2014-09-17 Thread Mark Nelson
On 09/17/2014 03:55 PM, Somnath Roy wrote: Hi Sage, We are experiencing severe librbd performance degradation in Giant over firefly release. Here is the experiment we did to isolate it as a librbd problem. 1. Single OSD is running latest Giant and client is running fio rbd on top of firefly

RE: severe librbd performance degradation in Giant

2014-09-17 Thread Somnath Roy
Mark, All are running with concurrency 32. Thanks Regards Somnath -Original Message- From: Mark Nelson [mailto:mark.nel...@inktank.com] Sent: Wednesday, September 17, 2014 1:59 PM To: Somnath Roy; ceph-devel@vger.kernel.org Subject: Re: severe librbd performance degradation in Giant

RE: severe librbd performance degradation in Giant

2014-09-17 Thread Somnath Roy
But, this time is ~10X degradation :-( -- From: Stefan Priebe - Profihost AG [mailto:s.pri...@profihost.ag]

Re: puzzled with the design pattern of ceph journal, really ruining performance

2014-09-17 Thread Alexandre DERUMIER
FWIW, the journal will coalesce writes quickly when there are many concurrent 4k client writes. Once you hit around 8 4k IOs per OSD, the journal will start coalescing. For say 100-150 IOPs (what a spinning disk can handle), expect around 9ish 100KB journal writes (with padding and
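Mark's coalescing arithmetic can be checked with a quick back-of-the-envelope calculation. The per-entry overhead figure below is an assumption for illustration (header plus alignment padding), not a number from the thread:

```c
/* Estimate how many coalesced journal writes per second result from a
 * given client IOPS rate.  Overhead per entry is an assumed figure. */
int journal_writes_per_sec(int client_iops, int io_bytes,
                           int per_entry_overhead, int journal_write_bytes) {
    long bytes_per_sec = (long)client_iops * (io_bytes + per_entry_overhead);
    /* round up to whole journal writes */
    return (int)((bytes_per_sec + journal_write_bytes - 1) / journal_write_bytes);
}
```

With 150 client IOPS of 4 KB writes and an assumed ~2 KB of padding and header per entry, that is about 900 KB/s, or roughly nine 100 KB journal writes per second, consistent with the "9ish" figure quoted above.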

Re: severe librbd performance degradation in Giant

2014-09-17 Thread Josh Durgin
On 09/17/2014 01:55 PM, Somnath Roy wrote: Hi Sage, We are experiencing severe librbd performance degradation in Giant over firefly release. Here is the experiment we did to isolate it as a librbd problem. 1. Single OSD is running latest Giant and client is running fio rbd on top of firefly

RE: severe librbd performance degradation in Giant

2014-09-17 Thread Somnath Roy
I set the following in the client-side /etc/ceph/ceph.conf where I am running fio rbd. rbd_cache_writethrough_until_flush = false But, no difference. BTW, I am doing random reads, not writes. Does this setting still apply? Next, I tried setting rbd_cache to false and I *got back* the
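The client-side settings being toggled in this thread live in the [client] section of ceph.conf. A minimal fragment (the values shown are the ones under test in this thread, not recommendations):

```ini
[client]
# Disable the librbd in-memory cache entirely (the setting that
# recovered Firefly-level random-read performance in this thread).
rbd cache = false

# Only relevant when rbd cache = true: start in writeback mode
# immediately instead of waiting for the first flush from the guest.
rbd cache writethrough until flush = false
```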

Re: severe librbd performance degradation in Giant

2014-09-17 Thread Mark Nelson
Any chance read ahead could be causing issues? On 09/17/2014 04:29 PM, Somnath Roy wrote: I set the following in the client side /etc/ceph/ceph.conf where I am running fio rbd. rbd_cache_writethrough_until_flush = false But, no difference. BTW, I am doing Random read, not write. Still this

RE: severe librbd performance degradation in Giant

2014-09-17 Thread Sage Weil
What was the io pattern? Sequential or random? For random a slowdown makes sense (tho maybe not 10x!) but not for sequential. s On Wed, 17 Sep 2014, Somnath Roy wrote: I set the following in the client side /etc/ceph/ceph.conf where I am running fio rbd.

RE: severe librbd performance degradation in Giant

2014-09-17 Thread Somnath Roy
It's the default read-ahead setting. I am doing random reads, so I don't think read-ahead is the issue. Also, on the cluster side, ceph -s is reporting the same iops, so ios are hitting the cluster. -Original Message- From: Mark Nelson [mailto:mark.nel...@inktank.com] Sent: Wednesday,

Re: severe librbd performance degradation in Giant

2014-09-17 Thread Josh Durgin
No, it's not merged yet. The ObjectCacher (which implements rbd and ceph-fuse caching) has a global lock, which could be a bottleneck in this case. On 09/17/2014 02:34 PM, Mark Nelson wrote: Any chance read ahead could be causing issues? On 09/17/2014 04:29 PM, Somnath Roy wrote: I set the
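The pattern Josh describes, one global lock protecting the whole cache, can be sketched as follows (a hypothetical simplification, not Ceph's ObjectCacher; names and the array-backed store are placeholders):

```c
#include <pthread.h>

/* A single global mutex serializes every cache operation from every
 * client thread -- including read-only lookups -- which is why a
 * high-concurrency random-read workload can bottleneck on it. */
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

#define CACHE_SLOTS 64
static int cache_key[CACHE_SLOTS];
static int cache_val[CACHE_SLOTS];
static int cache_used;

int cache_put(int key, int val) {
    pthread_mutex_lock(&cache_lock);        /* global contention point */
    if (cache_used >= CACHE_SLOTS) {
        pthread_mutex_unlock(&cache_lock);
        return -1;
    }
    cache_key[cache_used] = key;
    cache_val[cache_used] = val;
    cache_used++;
    pthread_mutex_unlock(&cache_lock);
    return 0;
}

int cache_get(int key, int *val) {
    pthread_mutex_lock(&cache_lock);        /* even lookups serialize */
    for (int i = 0; i < cache_used; i++) {
        if (cache_key[i] == key) {
            *val = cache_val[i];
            pthread_mutex_unlock(&cache_lock);
            return 0;
        }
    }
    pthread_mutex_unlock(&cache_lock);
    return -1;
}
```

With 32 fio threads all doing 4K random reads through such a cache, every hit and miss takes the same mutex, so added parallelism buys little, one plausible mechanism for the degradation seen with rbd cache enabled.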

RE: severe librbd performance degradation in Giant

2014-09-17 Thread Somnath Roy
Sage, It's a 4K random read. Thanks Regards Somnath -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Wednesday, September 17, 2014 2:36 PM To: Somnath Roy Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE: severe librbd performance degradation in Giant What was

RE: severe librbd performance degradation in Giant

2014-09-17 Thread Somnath Roy
Created a tracker for this. http://tracker.ceph.com/issues/9513 Thanks Regards Somnath -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy Sent: Wednesday, September 17, 2014 2:39 PM To: Sage Weil Cc: Josh

Re: [ceph-users] Crushmap ruleset for rack aware PG placement

2014-09-17 Thread Johnu George (johnugeo)
In such a case, we can initialize the scratch array in crush/CrushWrapper.h#L919 with the maximum number of osds that can be selected. Since we know the rule number, it should be possible to calculate the maximum number of osds that can be selected. Johnu On 9/17/14, 1:11 PM, Loic Dachary l...@dachary.org wrote:

RE: severe librbd performance degradation in Giant

2014-09-17 Thread Somnath Roy
Josh/Sage, I should mention that even after turning off rbd cache I am getting ~20% degradation over Firefly. Thanks Regards Somnath -Original Message- From: Somnath Roy Sent: Wednesday, September 17, 2014 2:44 PM To: Sage Weil Cc: Josh Durgin; ceph-devel@vger.kernel.org Subject: RE:

RE: puzzled with the design pattern of ceph journal, really ruining performance

2014-09-17 Thread Chen, Xiaoxi
When benching the Crucial M550, I only see from time to time (maybe every 30s, I don't remember exactly) ios slowing down to 200 for 1 or 2 seconds, then going back up to normal speed around 4000 iops. Wow, that indicates the M550 is busy with garbage collection; maybe just try to overprovision a bit (say if

Re: [PATCH] ceph: remove redundant code for max file size verification

2014-09-17 Thread Yan, Zheng
On Wed, Sep 17, 2014 at 5:26 PM, Chao Yu chao2...@samsung.com wrote: Both ceph_update_writeable_page and ceph_setattr verify the file size against the max size ceph supports. There are two callers of ceph_update_writeable_page: ceph_write_begin and ceph_page_mkwrite. For ceph_write_begin, we have

Re: puzzled with the design pattern of ceph journal, really ruining performance

2014-09-17 Thread Mark Nelson
On 09/17/2014 08:05 PM, Chen, Xiaoxi wrote: When benching the Crucial M550, I only see from time to time (maybe every 30s, I don't remember exactly) ios slowing down to 200 for 1 or 2 seconds, then going back up to normal speed around 4000 iops. Wow, that indicates the M550 is busy with garbage collection,

RE: [ceph-users] Crushmap ruleset for rack aware PG placement

2014-09-17 Thread Chen, Xiaoxi
The rule has max_size, can we just use that value? -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Johnu George (johnugeo) Sent: Thursday, September 18, 2014 6:41 AM To: Loic Dachary; ceph-devel Subject: Re: [ceph-users]

Re: severe librbd performance degradation in Giant

2014-09-17 Thread Haomai Wang
According to http://tracker.ceph.com/issues/9513, do you mean that rbd cache causes a 10x performance degradation for random reads? On Thu, Sep 18, 2014 at 7:44 AM, Somnath Roy somnath@sandisk.com wrote: Josh/Sage, I should mention that even after turning off rbd cache I am getting ~20%

RE: severe librbd performance degradation in Giant

2014-09-17 Thread Somnath Roy
Yes Haomai... -Original Message- From: Haomai Wang [mailto:haomaiw...@gmail.com] Sent: Wednesday, September 17, 2014 7:28 PM To: Somnath Roy Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org Subject: Re: severe librbd performance degradation in Giant According

RE: severe librbd performance degradation in Giant

2014-09-17 Thread Sage Weil
On Thu, 18 Sep 2014, Somnath Roy wrote: Yes Haomai... I would love to see what a profiler says about the matter. There is going to be some overhead on the client associated with the cache for a random io workload, but 10x is a problem! sage -Original Message- From: Haomai Wang