Re: ceph status reporting non-existing osd

2012-07-19 Thread Andrey Korolyov
On Thu, Jul 19, 2012 at 1:28 AM, Gregory Farnum g...@inktank.com wrote: On Wed, Jul 18, 2012 at 12:07 PM, Andrey Korolyov and...@xdel.ru wrote: On Wed, Jul 18, 2012 at 10:30 PM, Gregory Farnum g...@inktank.com wrote: On Wed, Jul 18, 2012 at 12:47 AM, Andrey Korolyov and...@xdel.ru wrote: On

Re: Poor read performance in KVM

2012-07-19 Thread Vladimir Bashkirtsev
It's actually the sum of the latencies of all 3971 asynchronous reads, in seconds, so the average latency was ~200ms, which is still pretty high. OK. I realized it later that day when I noticed that the sum only goes up. So sum is the number of seconds spent, and dividing it by avgcount gives the average.
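For reference, the counters discussed above come from the OSD admin socket. A
minimal sketch of pulling them and doing the division by hand (socket path,
counter name and values are illustrative, not taken from this thread):

  # dump perf counters from a running OSD; the .asok path depends on your config
  ceph --admin-daemon /var/run/ceph/osd.0.asok perf dump   # older releases call this 'perfcounters_dump'
  # suppose the dump contains (hypothetical numbers):
  #   "op_r_latency": { "avgcount": 3971, "sum": 794.2 }
  # average read latency in seconds is sum / avgcount:
  echo "scale=4; 794.2 / 3971" | bc
  # -> .2000  (about 200 ms)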

Re: Slow request warnings on 0.48

2012-07-19 Thread Matthew Richardson
I'd just like to report the same behaviour on my test cluster with 0.48. I've set up a single box (SL6.1 - 2.6.32-220.23.1 kernel) with 1 mds, mon and osd, and replication set to '1' for both data and metadata. Having mounted using ceph-fuse, I'm running a simple fio job to create load:
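The job file itself is cut off in this preview; a comparable load can be
generated with a one-liner like the sketch below (mount point, job name and
sizes are assumptions, not the poster's actual job):

  # small random writes against a ceph-fuse mount, run for 5 minutes
  fio --name=cephfs-load --directory=/mnt/ceph \
      --rw=randwrite --bs=4k --size=1g \
      --numjobs=4 --time_based --runtime=300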

Re: Poor read performance in KVM

2012-07-19 Thread Vladimir Bashkirtsev
Try to determine how much of the 200ms avg latency comes from osds vs the qemu block driver. Looks like osd.0 performs with low latency but osd.1 latency is way too high, and on average it appears as 200ms. The osds are backed by btrfs over LVM2. Maybe the issue lies in backing fs selection? All

[PATCH 0/6] rbd: old patches from Josh

2012-07-19 Thread Alex Elder
Late last year Josh Durgin had put together a series of fixes for rbd that never got committed. I told him I would get them in, and this series represents the last six that remain. Here's a summary: [PATCH 1/6] rbd: return errors for mapped but deleted snapshot This adds code to distinguish

[PATCH 1/6] rbd: return errors for mapped but deleted snapshot

2012-07-19 Thread Alex Elder
When a snapshot is deleted, the OSD will return ENOENT when reading from it. This is normally interpreted as a hole by rbd, which will return zeroes. To minimize the time in which this can happen, stop requests early when we are notified that our snapshot no longer exists. [el...@inktank.com:

[PATCH 2/6] rbd: only reset capacity when pointing to head

2012-07-19 Thread Alex Elder
Snapshots cannot be resized, and the new capacity of head should not be reflected by the snapshot. Signed-off-by: Josh Durgin josh.dur...@inktank.com Reviewed-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c |7 ++- 1 files changed, 6 insertions(+), 1 deletions(-) diff --git

[PATCH 3/6] rbd: expose the correct size of the device in sysfs

2012-07-19 Thread Alex Elder
If an image was mapped to a snapshot, the size of the head version would be shown. Protect capacity with header_rwsem, since it may change. Signed-off-by: Josh Durgin josh.dur...@dreamhost.com Reviewed-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 11 --- 1 files changed,

[PATCH 4/6] rbd: set image size when header is updated

2012-07-19 Thread Alex Elder
The image may have been resized. Signed-off-by: Josh Durgin josh.dur...@dreamhost.com Reviewed-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 9c3a1db..a6bbda2 100644

[PATCH 5/6] rbd: use reference counting for the snap context

2012-07-19 Thread Alex Elder
This prevents a race between requests with a given snap context and header updates that free it. The osd client was already expecting the snap context to be reference counted, since it get()s it in ceph_osdc_build_request and put()s it when the request completes. Also remove the second

[PATCH 6/6] rbd: send header version when notifying

2012-07-19 Thread Alex Elder
Previously the original header version was sent. Now, we update it when the header changes. Signed-off-by: Josh Durgin josh.dur...@dreamhost.com Reviewed-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c |7 +-- 1 files changed, 5 insertions(+), 2 deletions(-) diff --git

Re: [PATCH] rbd: fix the memory leak of bio_chain_clone

2012-07-19 Thread Guangliang Zhao
On Tue, Jul 17, 2012 at 01:18:50PM -0700, Yehuda Sadeh wrote: On Wed, Jul 11, 2012 at 5:34 AM, Guangliang Zhao gz...@suse.com wrote: The bio_pair allocated in bio_chain_clone would not be freed; this will cause a memory leak. It could actually be freed only after three releases, because

[PATCH] rbd: fix the repeat initialization of semaphore

2012-07-19 Thread Guangliang Zhao
The header_rwsem of rbd_dev is initialized twice in rbd_add(). Signed-off-by: Guangliang Zhao gz...@suse.com --- drivers/block/rbd.c |2 -- 1 files changed, 0 insertions(+), 2 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 013c7a5..50117dd 100644 ---

Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Sébastien Han
Hi Cephers! I'm working with rbd mapping. I figured out that the block device size of the rbd device is not updated while the device is mounted. Here are my tests: 1. Pick up a device and check its size # rbd ls size # rbd info test rbd image 'test': size 1 MB in 2500 objects order 22 (4096 KB
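A rough sketch of the check being described, for anyone reproducing it (image
name, new size and device node are assumptions):

  rbd resize --size 20000 test      # grow image 'test' on the rados side
  rbd info test                     # rados-side size reflects the new value
  blockdev --getsize64 /dev/rbd1    # kernel-side size of the mapped device
  # while a filesystem on /dev/rbd1 is mounted, the last value typically
  # keeps reporting the old size until the device is re-read or remapped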

Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Wido den Hollander
Hi, On 19-07-12 16:55, Sébastien Han wrote: Hi Cephers! I'm working with rbd mapping. I figured out that the block device size of the rbd device is not update while the device is mounted. Here my tests: iirc this is not something RBD specific, but since the device is in use it can't be

Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Sébastien Han
Ok, I got your point, seems logical, but why is this possible with LVM, for example? You can easily do this with LVM without un-mounting the device. Cheers. On Thu, Jul 19, 2012 at 5:15 PM, Wido den Hollander w...@widodh.nl wrote: Hi, On 19-07-12 16:55, Sébastien Han wrote: Hi Cephers! I'm

Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Tommi Virtanen
On Thu, Jul 19, 2012 at 8:26 AM, Sébastien Han han.sebast...@gmail.com wrote: Ok I got your point seems logic, but why is this possible with LVM for example? You can easily do this with LVM without un-mounting the device. Do your LVM volumes have partition tables inside them? That might be

Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Tommi Virtanen
On Thu, Jul 19, 2012 at 8:38 AM, Tommi Virtanen t...@inktank.com wrote: Do your LVM volumes have partition tables inside them? That might be the difference. Of course, you can put your filesystem straight on the RBD; that would be a good experiment to run. Oops, I see you did put your fs

Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Wido den Hollander
On 19-07-12 17:26, Sébastien Han wrote: Ok I got your point seems logic, but why is this possible with LVM for example? You can easily do this with LVM without un-mounting the device. LVM volumes run through the device mapper and are not regular block devices. If you resize the disk underneath
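For comparison, the LVM path mentioned here -- growing a logical volume and
its filesystem while it stays mounted -- looks roughly like this (device, VG
and LV names are made up, ext4 assumed):

  pvresize /dev/sdb                 # tell LVM the underlying PV grew
  lvextend -L +10G /dev/vg0/data    # grow the logical volume
  resize2fs /dev/vg0/data           # grow the mounted ext4 filesystem online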

Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Sébastien Han
Hum ok, I see. Thanks! But if you have any clue to force the kernel to re-read without unmounting/mounting :) On Thu, Jul 19, 2012 at 5:47 PM, Wido den Hollander w...@widodh.nl wrote: On 19-07-12 17:26, Sébastien Han wrote: Ok I got your point seems logic, but why is this possible with LVM for

Re: Poor read performance in KVM

2012-07-19 Thread Tommi Virtanen
On Thu, Jul 19, 2012 at 5:19 AM, Vladimir Bashkirtsev vladi...@bashkirtsev.com wrote: Look like that osd.0 performs with low latency but osd.1 latency is way too high and on average it appears as 200ms. osd is backed by btrfs over LVM2. May be issue lie in backing fs selection? All four osds

[PATCH 0/4] rbd: use snapc->seq the way server does

2012-07-19 Thread Alex Elder
This series of patches changes the way the snap context seq field is used. Currently it is used in a way that isn't really useful, and as such is a bit confusing. This behavior seems to be a holdover from a time when there was no snap_id field maintained for an rbd_dev. Summary: [PATCH 1/4]

[PATCH 1/4] rbd: don't use snapc->seq that way

2012-07-19 Thread Alex Elder
In what appears to be an artifact of a different way of encoding whether an rbd image maps a snapshot, __rbd_refresh_header() has code that arranges to update the seq value in an rbd image's snapshot context to point to the first entry in its snapshot array if that's where it was pointing

[PATCH 3/4] rbd: set snapc->seq only when refreshing header

2012-07-19 Thread Alex Elder
In rbd_header_add_snap() there is code to set snapc->seq to the just-added snapshot id. This is the only remnant left of the use of that field for recording which snapshot an rbd_dev was associated with. That functionality is no longer supported, so get rid of that final bit of code. Doing so

[PATCH 4/4] rbd: kill rbd_image_header->snap_seq

2012-07-19 Thread Alex Elder
The snap_seq field in an rbd_image_header structure held the value from the rbd image header when it was last refreshed. We now maintain this value in the snapc->seq field. So get rid of the other one. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c |2 -- 1 files

Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Calvin Morrow
I've had a little more luck using cfdisk than vanilla fdisk when it comes to detecting changes. You might try running partprobe and then cfdisk and seeing if you get anything different. Calvin On Thu, Jul 19, 2012 at 9:50 AM, Sébastien Han han.sebast...@gmail.com wrote: Hum ok, I see. Thanks!
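For what it's worth, the suggested sequence on the device from earlier in the
thread would be (device node assumed):

  partprobe /dev/rbd1    # ask the kernel to re-read the partition table
  cfdisk /dev/rbd1       # cfdisk re-reads the table when it starts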

Re: Poor read performance in KVM

2012-07-19 Thread Calvin Morrow
On Thu, Jul 19, 2012 at 9:52 AM, Tommi Virtanen t...@inktank.com wrote: On Thu, Jul 19, 2012 at 5:19 AM, Vladimir Bashkirtsev vladi...@bashkirtsev.com wrote: Look like that osd.0 performs with low latency but osd.1 latency is way too high and on average it appears as 200ms. osd is backed

Re: Poor read performance in KVM

2012-07-19 Thread Mark Nelson
On 07/19/2012 01:06 PM, Calvin Morrow wrote: On Thu, Jul 19, 2012 at 9:52 AM, Tommi Virtanen t...@inktank.com wrote: On Thu, Jul 19, 2012 at 5:19 AM, Vladimir Bashkirtsev vladi...@bashkirtsev.com wrote: Look like that osd.0 performs with low latency but osd.1 latency is way too high and on

Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Sébastien Han
With LVM, you can re-scan the scsi bus to extend a physical drive and then run a pvextend. @Calvin: I tried your solution # partprobe /dev/rbd1 Unfortunately nothing changed. Did you get it working? Cheers! On Thu, Jul 19, 2012 at 5:50 PM, Sébastien Han han.sebast...@gmail.com wrote: Hum
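As an aside, the SCSI rescan-then-grow sequence referred to above is roughly
the following; the sysfs path and device are illustrative, and the LVM step is
pvresize (assuming that is what was meant by "pvextend"):

  # ask the SCSI layer to re-read the capacity of an existing disk
  echo 1 > /sys/class/scsi_device/0:0:0:0/device/rescan
  # then let LVM pick up the new size of the physical volume
  pvresize /dev/sdb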

Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Calvin Morrow
I haven't tried resizing an rbd yet, but I was changing partitions on a non-ceph two-node cluster with shared storage yesterday while certain partitions were in use (partitions 1,2,5 were mounted, deleting partition ids 6+, adding new ones) and fdisk wasn't re-reading disk changes. Partprobe

Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Andreas Kurz
On 07/19/2012 09:44 PM, Sébastien Han wrote: With LVM, you can re-scan the scsi bus to extend a physical drive and then run a pvextend. @Calvin: I tried your solution # partprobe /dev/rbd1 Did you try blockdev? # blockdev --rereadpt /dev/rbd1 Regards, Andreas Unfortunatly nothing

[PATCH 00/12] rbd: cleanup series

2012-07-19 Thread Alex Elder
This series includes a bunch of relatively small cleanups. They're grouped a bit below, but they apply together in this sequence and the later ones may have dependencies on those earlier in the series. Summaries: [PATCH 01/12] rbd: drop extra header_rwsem init [PATCH 02/12] rbd: simplify

Re: [PATCH 1/4] rbd: don't use snapc->seq that way

2012-07-19 Thread Josh Durgin
On 07/19/2012 10:11 AM, Alex Elder wrote: We now use rbd_dev->snap_id to record the snapshot id--using the special value SNAP_NONE to indicate the rbd_dev is not mapping a snapshot at all. That's CEPH_NOSNAP, not SNAP_NONE, right? In any case, Reviewed-by: Josh Durgin josh.dur...@inktank.com

[PATCH 01/12] rbd: drop extra header_rwsem init

2012-07-19 Thread Alex Elder
In commit c01a an extra initialization of rbd_dev->header_rwsem was inadvertently added. This gets rid of the duplicate. (Guangliang Zhao also offered up the same fix.) Reported-by: Guangliang Zhao gz...@suse.com Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c |

[PATCH 02/12] rbd: simplify __rbd_remove_all_snaps()

2012-07-19 Thread Alex Elder
This just replaces a while loop with list_for_each_entry_safe() in __rbd_remove_all_snaps(). Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c |5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index

[PATCH 03/12] rbd: clean up a few dout() calls

2012-07-19 Thread Alex Elder
There was a dout() call in rbd_do_request() that was reporting the offset as the length and vice versa. While fixing that I did a quick scan of other dout() calls and fixed a couple of other minor things. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c |7

[PATCH 04/12] ceph: define snap counts as u32 everywhere

2012-07-19 Thread Alex Elder
There are two structures in which a count of snapshots is maintained: struct ceph_snap_context { ... u32 num_snaps; ... } and struct ceph_snap_realm { ... u32 num_prior_parent_snaps; /* had prior to parent_since */ ... u32

[PATCH 05/12] rbd: snapc is unused in rbd_req_sync_read()

2012-07-19 Thread Alex Elder
The snapc parameter to rbd_req_sync_read() is not used, so get rid of it. Reported-by: Josh Durgin josh.dur...@inktank.com Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c |3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/drivers/block/rbd.c

[PATCH 06/12] rbd: drop rbd_header_from_disk() gfp_flags parameter

2012-07-19 Thread Alex Elder
The function rbd_header_from_disk() is only called in one spot, and it passes GFP_KERNEL as its value for the gfp_flags parameter. Just drop that parameter and substitute GFP_KERNEL everywhere within that function it had been used. (If we find we need the parameter again in the future it's easy

[PATCH 07/12] rbd: drop rbd_dev parameter in snap functions

2012-07-19 Thread Alex Elder
Both rbd_register_snap_dev() and __rbd_remove_snap_dev() have rbd_dev parameters that are unused. Remove them. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 19 +++ 1 files changed, 7 insertions(+), 12 deletions(-) diff --git a/drivers/block/rbd.c

[PATCH 08/12] rbd: drop rbd_req_sync_exec() ver parameter

2012-07-19 Thread Alex Elder
The only place that passes a version pointer to rbd_req_sync_exec() is in rbd_header_add_snap(), and that spot ignores the result. The only thing rbd_req_sync_exec() does with its ver parameter is pass it directly to rbd_req_sync_op(). So we can just use a null pointer there, and drop the ver

[PATCH 09/12] rbd: have __rbd_add_snap_dev() return a pointer

2012-07-19 Thread Alex Elder
It's not obvious whether the snapshot pointer whose address is provided to __rbd_add_snap_dev() will be assigned by that function. Change it to return the snapshot, or a pointer-coded errno in the event of a failure. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 37

[PATCH 10/12] rbd: make rbd_create_rw_ops() return a pointer

2012-07-19 Thread Alex Elder
Either rbd_create_rw_ops() will succeed, or it will fail because a memory allocation failed. Have it just return a valid pointer or null rather than stuffing a pointer into a provided address and returning an errno. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 68

[PATCH 11/12] rbd: always pass ops array to rbd_req_sync_op()

2012-07-19 Thread Alex Elder
All of the callers of rbd_req_sync_op() except one pass a non-null ops pointer. The only one that does not is rbd_req_sync_read(), which passes CEPH_OSD_OP_READ as its opcode and CEPH_OSD_FLAG_READ for flags. By allocating the ops array in rbd_req_sync_read() and moving the special case code

[PATCH 12/12] rbd: fixes in rbd_header_from_disk()

2012-07-19 Thread Alex Elder
This fixes a few issues in rbd_header_from_disk(): - The memcmp() call at the beginning of the function is really looking at the text field of struct rbd_image_header_ondisk. While it does lie at the beginning of the structure, the comparison should be done against the field,

Re: [PATCH 1/4] rbd: don't use snapc->seq that way

2012-07-19 Thread Alex Elder
On 07/19/2012 04:02 PM, Josh Durgin wrote: On 07/19/2012 10:11 AM, Alex Elder wrote: We now use rbd_dev->snap_id to record the snapshot id--using the special value SNAP_NONE to indicate the rbd_dev is not mapping a snapshot at all. That's CEPH_NOSNAP, not SNAP_NONE, right? In any case, Yes.

Re: [PATCH 0/4] rbd: use snapc->seq the way server does

2012-07-19 Thread Josh Durgin
On 07/19/2012 10:09 AM, Alex Elder wrote: This series of patches changes the way the snap context seq field is used. Currently it is used in a way that isn't really useful, and as such is a bit confusing. This behavior seems to be a hold over from a time when there was no snap_id field

[GIT PULL] Ceph fixes for 3.5

2012-07-19 Thread Sage Weil
Hi Linus, Please pull these last minute fixes for Ceph from: git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus The important one fixes a bug in the socket failure handling behavior that was turned up in some recent failure injection testing. The other two are
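For completeness, the corresponding pull command (repository URL and branch as
given above):

  git pull git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus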

making objdump useful

2012-07-19 Thread Sage Weil
I finally figured out how to make objdump interleave the source code in the .ko file dumps on our qa machines. The problem is that the debug info references the path where the kernel was compiled (which is non-obvious since the info is compressed). For our environment, this is a quick
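The usual incantation, for reference -- a sketch assuming the module was built
with debug info and that the source tree no longer sits at the recorded build
path (all paths below are made up):

  # interleave source with the disassembly (needs CONFIG_DEBUG_INFO in the .ko)
  objdump -d -S rbd.ko > rbd.dis
  # if the DWARF paths point at the original build directory, either recreate
  # it with a symlink...
  ln -s /home/me/linux /srv/autobuild-ceph/linux
  # ...or remap it with objdump's prefix options (newer binutils):
  objdump -d -S --prefix=/home/me/linux --prefix-strip=3 rbd.ko > rbd.dis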

mkcephfs problem

2012-07-19 Thread Tim Flavin
I am trying to get Ceph running on an ARM system, currently one quad core node, running Ubuntu 12.04. It compiles fine, currently without tcmalloc and google perf tools, but I am running into a problem with mkcephfs. 'mkcephfs -a -c ceph.conf' didn't work so I did it piece by piece until I got

Re: Poor read performance in KVM

2012-07-19 Thread Vladimir Bashkirtsev
On 20/07/2012 1:22 AM, Tommi Virtanen wrote: On Thu, Jul 19, 2012 at 5:19 AM, Vladimir Bashkirtsev vladi...@bashkirtsev.com wrote: Look like that osd.0 performs with low latency but osd.1 latency is way too high and on average it appears as 200ms. osd is backed by btrfs over LVM2. May be issue

Re: How to compile Java-Rados.

2012-07-19 Thread ramu
Hi Noah, Thank you for the fixes and suggestions on compiling java-rados, it's working fine now. Thanks, Ramu.

Re: Poor read performance in KVM

2012-07-19 Thread Vladimir Bashkirtsev
We are seeing degradation at 64k node/leaf sizes as well. So far the degradation is most obvious with small writes. It affects XFS as well, though not as severely. We are vigorously looking into it. :) Just confirming that one of our clients has run a fair amount (on a gigabyte scale) of

Re: Poor read performance in KVM

2012-07-19 Thread Vladimir Bashkirtsev
What node/leaf size are you using on your btrfs volume? Default 4K.

Re: Poor read performance in KVM

2012-07-19 Thread Vladimir Bashkirtsev
Yes, they can hold up reads to the same object. Depending on where they're stuck, they may be blocking other requests as well if they're e.g. taking up all the filestore threads. Waiting for subops means they're waiting for replicas to acknowledge the write and commit it to disk. The real cause