Re: Reply: Reboot blocked when undoing unmap op.

2016-01-04 Thread Ilya Dryomov
On Mon, Jan 4, 2016 at 10:51 AM, Wukongming  wrote:
> Hi, Ilya,
>
> It is an old problem.
> When you say "when you issue a reboot, daemons get killed and the kernel 
> client ends up waiting for them to come back, because of outstanding 
> writes issued by umount called by systemd (or whatever)."
>
> Do you mean that if the rbd is umounted successfully, the kernel client will 
> stop waiting? What kind of communication mechanism is there between libceph 
> and the daemons (or ceph userspace)?

If you umount the filesystem on top of rbd and unmap the rbd image, there
won't be anything to wait for.  In fact, if there aren't any other rbd
images mapped, libceph will clean up after itself and exit.

If you umount the filesystem on top of rbd but don't unmap the image,
libceph will remain there, along with some amount of communication
(keepalive messages, watch requests, etc).  However, all of that is
internal and is unlikely to block reboot.

If you don't umount the filesystem, your init system will try to umount
it, issuing FS requests to the rbd device.  We don't want to drop those
requests, so, if daemons are gone by then, libceph ends up blocking.
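For reference, a minimal sketch of the clean sequence and of how to confirm
libceph has nothing left to wait for (the mount point is hypothetical):

$ sudo umount /mnt/rbd0            # hypothetical mount point on top of /dev/rbd0
$ sudo rbd unmap /dev/rbd0
$ rbd showmapped                   # should print nothing once all images are unmapped
$ ls /sys/kernel/debug/ceph/       # the client directory goes away when libceph exits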

Thanks,

Ilya


Re: puzzling disappearance of /dev/sdc1

2015-12-18 Thread Ilya Dryomov
On Fri, Dec 18, 2015 at 1:38 PM, Loic Dachary  wrote:
> Hi Ilya,
>
> It turns out that sgdisk 0.8.6 -i 2 /dev/vdb removes partitions and re-adds 
> them on CentOS 7 with a 3.10.0-229.11.1.el7 kernel, in the same way partprobe 
> does. It is used intensively by ceph-disk and inevitably leads to races where 
> a device temporarily disappears. The same command (sgdisk 0.8.8) on Ubuntu 
> 14.04 with a 3.13.0-62-generic kernel only generates two udev change events 
> and does not remove / add partitions. The source code between sgdisk 0.8.6 
> and sgdisk 0.8.8 did not change in a significant way and the output of strace 
> -e ioctl sgdisk -i 2 /dev/vdb is identical in both environments.
>
> ioctl(3, BLKGETSIZE, 20971520)  = 0
> ioctl(3, BLKGETSIZE64, 10737418240) = 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
> ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
> ioctl(3, BLKGETSIZE, 20971520)  = 0
> ioctl(3, BLKGETSIZE64, 10737418240) = 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKGETSIZE, 20971520)  = 0
> ioctl(3, BLKGETSIZE64, 10737418240) = 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
>
> This leads me to the conclusion that the difference is in how the kernel 
> reacts to these ioctls.

I'm pretty sure it's not the kernel versions that matter here, but
systemd versions.  Those are all get-property ioctls, and I don't think
sgdisk -i does anything with the partition table.

What it probably does, though, is open the disk for write for some
reason.  When it closes it, udevd (systemd-udevd process) picks that
close up via inotify and issues the BLKRRPART ioctl, instructing the
kernel to re-read the partition table.  Technically, that's different
from what partprobe does, but it still generates those udev events you
are seeing in the monitor.

AFAICT udevd started doing this in v214.
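If you want to see it happen, a hedged sketch (flag names as in recent udev;
adjust as needed): watch udev events in one shell while re-running sgdisk in
another.

$ sudo udevadm monitor --kernel --udev &
$ sudo sgdisk -i 2 /dev/vdb
# with udevd >= v214 reacting to the close-after-write, you should see
# "change" (and possibly remove/add) events for /dev/vdb and its partitions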

Thanks,

Ilya


Re: puzzling disappearance of /dev/sdc1

2015-12-17 Thread Ilya Dryomov
On Thu, Dec 17, 2015 at 3:10 PM, Loic Dachary  wrote:
> Hi Sage,
>
> On 17/12/2015 14:31, Sage Weil wrote:
>> On Thu, 17 Dec 2015, Loic Dachary wrote:
>>> Hi Ilya,
>>>
>>> This is another puzzling behavior (the log of all commands is at
>>> http://tracker.ceph.com/issues/14094#note-4). In a nutshell, after a
>>> series of sgdisk -i commands to examine various devices including
>>> /dev/sdc1, the /dev/sdc1 file disappears (and I think it will show up
>>> again, although I don't have definitive proof of this).
>>>
>>> It looks like a side effect of a previous partprobe command, the only
>>> command I can think of that removes / re-adds devices. I thought calling
>>> udevadm settle after running partprobe would be enough to ensure
>>> partprobe completed (and since it takes as much as 2m30s to return, I
>>> would be shocked if it does not ;-).

Yeah, IIRC partprobe goes through every slot in the partition table,
trying to first remove and then add the partition back.  But, I don't
see any mention of partprobe in the log you referred to.

Should udevadm settle for a few vd* devices be taking that much time?
I'd investigate that regardless of the issue at hand.
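A hedged sketch for checking both points (assuming strace and parted are
available; most parted versions drive this through BLKPG del/add ioctls):

$ sudo strace -f -e ioctl partprobe /dev/vdb     # watch the per-slot remove/add cycle
$ time sudo udevadm settle --timeout=600         # see how long settle actually takes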

>>>
>>> Any idea ? I desperately try to find a consistent behavior, something
>>> reliable that we could use to say : "wait for the partition table to be
>>> up to date in the kernel and all udev events generated by the partition
>>> table update to complete".
>>
>> I wonder if the underlying issue is that we shouldn't be calling udevadm
>> settle from something running from udev.  Instead, if a udev-triggered
>> run of ceph-disk does something that changes the partitions, it
>> should just exit and let udevadm run ceph-disk again on the new
>> devices...?

>
> Unless I missed something this is on CentOS 7 and ceph-disk is only called 
> from udev as ceph-disk trigger which does nothing else but asynchronously 
> delegate the work to systemd. Therefore there is no udevadm settle from 
> within udev (which would deadlock and timeout every time... I hope ;-).

That's a sure lockup, until one of them times out.

How are you delegating to systemd?  Is it to avoid long-running udev
events?  I'm probably missing something - udevadm settle wouldn't block
on anything other than udev, so if you are shipping work off to
somewhere else, udev can't be relied upon for waiting.
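A hedged illustration of that limitation (commands assume a systemd host):

$ sudo udevadm settle --timeout=600   # returns once the udev event queue is empty
$ systemctl list-jobs                 # work delegated to systemd may still be queued or running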

Thanks,

Ilya


Re: understanding partprobe failure

2015-12-17 Thread Ilya Dryomov
On Thu, Dec 17, 2015 at 1:19 PM, Loic Dachary  wrote:
> Hi Ilya,
>
> I'm seeing a partprobe failure right after a disk was zapped with sgdisk 
> --clear --mbrtogpt -- /dev/vdb:
>
> partprobe /dev/vdb failed : Error: Partition(s) 1 on /dev/vdb have been 
> written, but we have been unable to inform the kernel of the change, probably 
> because it/they are in use. As a result, the old partition(s) will remain in 
> use. You should reboot now before making further changes.
>
> Waiting 60 seconds (see the log below) and trying again succeeds. The 
> partprobe call is guarded by udevadm settle to prevent udev actions from 
> racing and nothing else goes on in the machine.
>
> Any idea how that could happen ?
>
> Cheers
>
> 2015-12-17 11:46:10,356.356 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:DEBUG:ceph-disk:get_dm_uuid
>  /dev/vdb uuid path is /sys/dev/block/253:16/dm/uuid
> 2015-12-17 11:46:10,357.357 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:DEBUG:ceph-disk:Zapping
>  partition table on /dev/vdb
> 2015-12-17 11:46:10,358.358 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:INFO:ceph-disk:Running
>  command: /usr/sbin/sgdisk --zap-all -- /dev/vdb
> 2015-12-17 11:46:10,365.365 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Caution:
>  invalid backup GPT header, but valid main header; regenerating
> 2015-12-17 11:46:10,366.366 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:backup 
> header from main header.
> 2015-12-17 11:46:10,366.366 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
> 2015-12-17 11:46:10,366.366 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Warning!
>  Main and backup partition tables differ! Use the 'c' and 'e' options
> 2015-12-17 11:46:10,367.367 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:on the 
> recovery & transformation menu to examine the two tables.
> 2015-12-17 11:46:10,367.367 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
> 2015-12-17 11:46:10,367.367 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Warning!
>  One or more CRCs don't match. You should repair the disk!
> 2015-12-17 11:46:10,368.368 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
> 2015-12-17 11:46:11,413.413 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
> 2015-12-17 11:46:11,414.414 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Caution:
>  Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
> 2015-12-17 11:46:11,414.414 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:verification
>  and recovery are STRONGLY recommended.
> 2015-12-17 11:46:11,414.414 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
> 2015-12-17 11:46:11,415.415 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Warning:
>  The kernel is still using the old partition table.
> 2015-12-17 11:46:11,415.415 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:The new 
> table will be used at the next reboot.
> 2015-12-17 11:46:11,416.416 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:GPT 
> data structures destroyed! You may now partition the disk using fdisk or
> 2015-12-17 11:46:11,416.416 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:other 
> utilities.
> 2015-12-17 11:46:11,416.416 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:INFO:ceph-disk:Running
>  command: /usr/sbin/sgdisk --clear --mbrtogpt -- /dev/vdb
> 2015-12-17 11:46:12,504.504 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Creating
>  new GPT entries.
> 2015-12-17 11:46:12,505.505 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Warning:
>  The kernel is still using the old partition table.
> 2015-12-17 11:46:12,505.505 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:The new 
> table will be used at the next reboot.
> 2015-12-17 11:46:12,505.505 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:The 
> operation has completed successfully.
> 2015-12-17 11:46:12,506.506 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:DEBUG:ceph-disk:Calling
>  partprobe on zapped device /dev/vdb
> 2015-12-17 11:46:12,507.507 
> INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:INFO:ceph-disk:Running
>  command: /usr/bin/udevadm settle --timeout=600
> 2015-12-17 11:46:15,427.427 
> 

Re: Rbd map failure in 3.16.0-55

2015-12-12 Thread Ilya Dryomov
On Sat, Dec 12, 2015 at 7:56 AM, Varada Kari  wrote:
> Hi all,
>
> We are working on the jewel branch on a test cluster to validate some of the 
> fixes, but ended up with the following error when mapping an image using krbd 
> on Ubuntu 14.04.2 with the 3.16.0-55 kernel.
>
> $ sudo rbd map -p pool1 rbd1
> rbd: sysfs write failed
> rbd: map failed: (5) Input/output error
>
>
> $ uname -a
> Linux 3.16.0-55-generic #74~14.04.1-Ubuntu SMP Tue Nov 17 10:15:59 UTC 2015 
> x86_64 x86_64 x86_64 GNU/Linux
>
> $dmesg
> 
> [11082.199006] libceph: read_partial_message bad hdr  crc 2112154322 != 
> expected 0
> [11082.209414] libceph: mon0 x.x.x.x:6789 socket error on read
> [11092.238317] libceph: read_partial_message bad hdr  crc 2112154322 != 
> expected 0
> [11092.248982] libceph: mon0 x.x.x.x:6789 socket error on read
> .
>
> When I looked at the ceph_msg_header on 3.16 kernel 
> (http://lxr.free-electrons.com/source/include/linux/ceph/msgr.h?v=3.16)
>
> There is one field changed from the jewel branch:
>
> 145 struct ceph_msg_header {
> ...
> 159
> 160 /* oldest code we think can decode this.  unknown if zero. */
> 161 __le16 compat_version; <<< New one, which is present from 3.19 
> onwards
> 162 __le16 reserved;
> 163 __le32 crc;   /* header crc32c */
> 164 } __attribute__ ((packed));

That's not a problem - those 16 bits used to be reserved.

>
> How can I map the image with the 3.16 kernel, apart from upgrading the kernel?
> Do we have any branch where I can build the modules with the latest changes?

I'm guessing you have ms_crc_header set to false - the kernel client
always checks header (and middle) checksums, hence the mismatch.
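A hedged sketch for confirming and fixing that (option names assumed from the
jewel-era config; adjust for your build):

$ ceph daemon osd.0 config get ms_crc_header     # run on the OSD host
$ grep -ri 'ms.*crc' /etc/ceph/ceph.conf
# set "ms crc header = true" (or drop the override) and restart the daemons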

Thanks,

Ilya


Re: Rbd map failure in 3.16.0-55

2015-12-12 Thread Ilya Dryomov
On Sat, Dec 12, 2015 at 6:42 PM, Somnath Roy  wrote:
> Ilya,
> If we map with 'nocrc', would that help?

No, that will only disable data crcs; header and middle crcs will still be
checked.  The header/data crc separation in userspace is fairly new - if
that's something you care about, I can port it for 4.5; it should be
trivial enough.
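For reference, a hedged sketch of what 'nocrc' toggles today (assuming your
rbd CLI passes kernel map options via -o):

$ sudo rbd map -o nocrc pool1/rbd1    # data crcs off; header/middle crcs still checked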

Thanks,

Ilya


Re: CEPH_MAX_OID_NAME_LEN in rbd kernel driver

2015-12-11 Thread Ilya Dryomov
On Fri, Dec 11, 2015 at 3:48 PM, Jean-Tiare Le Bigot
 wrote:
> Hi,
>
> I hit a use case where rbd was failing to map an image because of the
> name length.
>
> dmesg output:
>
> WARNING: CPU: 0 PID: 20851 at include/linux/ceph/osdmap.h:97
> rbd_osd_req_create.isra.29+0xdd/0x140()
>
> ceph_oid_set_name
> 'rbd_id.docker:devel:ib8k2cxvci9sh7b4ly8wwvpyjf6kjoyuero1k25zal3ypc916i04x78tbmn60gjm6pyp38lmjq:latest'
> len 101 vs 100, truncating
>
>
> It turns out the maximum ceph object name length is controlled by the
> constant CEPH_MAX_OID_NAME_LEN. I naively tried to increase it to 500
> which appears to work around (fix?) the issue.
>
>
> diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
> index 49ff69f..cf34bb0 100644
> --- a/include/linux/ceph/osdmap.h
> +++ b/include/linux/ceph/osdmap.h
> @@ -50,7 +50,7 @@ struct ceph_object_locator {
>   *
>   * (probably outdated: must be >= RBD_MAX_MD_NAME_LEN -- currently 100)
>   */
> -#define CEPH_MAX_OID_NAME_LEN 100
> +#define CEPH_MAX_OID_NAME_LEN 500
>
>  struct ceph_object_id {
> char name[CEPH_MAX_OID_NAME_LEN];
>
>
> What is the impact of such a change? Should I submit a proper patch?
> And, last but not least, shouldn't it return an error instead of truncating
> the name and failing because, obviously, the object does not exist? Worse, it
> could exist with different data, which would be an interesting vector
> for injecting data into a third-party image...

The impact is heavier allocations - each request has two of those and
the encoding buffer is also CEPH_MAX_OID_NAME_LEN, not the actual name
length.  I was actually considering lowering it, but I didn't consider
image names from hell like that.

librbd has #define RBD_MAX_OBJ_NAME_SIZE 96, but it's not enforced.

Is that ID between devel: and :latest fixed-size in docker world?  I.e.
would bumping it to something less than 500 do?

I don't think injecting data is a valid concern - if you can create an
image in a pool, you can write to it and if you can write to it...  It's
better to return an error, of course.

Thanks,

Ilya


Re: Kernel RBD hang on OSD Failure

2015-12-11 Thread Ilya Dryomov
On Fri, Dec 11, 2015 at 1:37 AM, Matt Conner  wrote:
> Hi Ilya,
>
> I had already recovered but I managed to recreate the problem again. I ran

How did you recover?

> the commands against rbd_data.f54f9422698a8. which was one
> of those listed in osdc this time. We have 2048 PGs in the pool so the list
> is long.

That's not going to work - I need it in a consistent "stuck" state so
I can match the outputs.  I understand you can't keep it that way for
a long time, so can you please reproduce it and email me off list with
osdmap, osdc and ceph -s right away?

You have a lot of OSDs but it also doesn't seem to take a long time to
recreate.  Bump debug ms to 1 on all OSDs.
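A hedged sketch of the whole capture (injectargs syntax may vary by release):

$ ceph tell 'osd.*' injectargs '--debug-ms 1'
# reproduce the hang, then immediately grab:
$ sudo cat /sys/kernel/debug/ceph/*/osdmap
$ sudo cat /sys/kernel/debug/ceph/*/osdc
$ ceph -s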

>
> As for when I fetched the object using rados, it grabbed it without issue.
>
> osd map:
> osdmap e6247 pool 'NAS-ic2gw01' (3) object
> 'rbd_data.f54f9422698a8.' -> pg 3.cac46c43 (3.443) -> up
> ([33,56], p33) acting ([33,56], p33)
>
> osdmaptool:
> pool 3 pg_num 2048
> 3.0 [23,138]23

If it's long, better to compress and attach.

Thanks,

Ilya


Re: Kernel RBD hang on OSD Failure

2015-12-10 Thread Ilya Dryomov
On Fri, Dec 11, 2015 at 12:16 AM, Matt Conner
 wrote:
> Hi Ilya,
>
> It took me a little time to reproduce, but I once again got into that
> state. In the past I had reproduced the issue with a single OSD
> failure but in this case I failed an entire server. Ignore the "noout"
> flag as I set it to prevent rebalance during my attempts to reproduce
> the issue.
>
> osdmap:
>
> epoch 6191
> flags
> pool 0 pg_num 64 (63) read_tier -1 write_tier -1
> pool 1 pg_num 64 (63) read_tier -1 write_tier -1
> pool 2 pg_num 64 (63) read_tier -1 write_tier -1
> pool 3 pg_num 2048 (2047) read_tier -1 write_tier -1
> pool 5 pg_num 2048 (2047) read_tier -1 write_tier -1
> pool 6 pg_num 2048 (2047) read_tier -1 write_tier -1
> pool 7 pg_num 2048 (2047) read_tier -1 write_tier -1
> osd0 10.10.30.55:6830 100% (exists, up) 100%
> osd1 10.10.30.53:6809 100% (exists, up) 100%
> osd2 10.10.30.54:6839 100% (exists, up) 100%
> osd3 10.10.30.56:6833 100% (exists, up) 100%
> osd4 10.10.30.55:6848 100% (exists, up) 100%
> osd5 10.10.30.53:6854 100% (exists, up) 100%
> osd6 10.10.30.56:6818 100% (exists, up) 100%
> osd7 10.10.30.54:6830 100% (exists, up) 100%
> osd8 10.10.30.55:6839 100% (exists, up) 100%
> osd9 10.10.30.56:6845 100% (exists, up) 100%
> osd10 10.10.30.53:6860 100% (exists, up) 100%
> osd11 10.10.30.54:6842 100% (exists, up) 100%
> osd12 10.10.30.55:6836 100% (exists, up) 100%
> osd13 10.10.30.56:6863 100% (exists, up) 100%
> osd14 10.10.30.53:6839 100% (exists, up) 100%
> osd15 10.10.30.54:6866 100% (exists, up) 100%
> osd16 10.10.30.55:6869 100% (exists, up) 100%
> osd17 10.10.30.56:6812 100% (exists, up) 100%
> osd18 10.10.30.53:6824 100% (exists, up) 100%
> osd19 10.10.30.54:6818 100% (exists, up) 100%
> osd20 10.10.30.55:6833 100% (exists, up) 100%
> osd21 10.10.30.56:6869 100% (exists, up) 100%
> osd22 10.10.30.53:6857 100% (exists, up) 100%
> osd23 10.10.30.54:6833 100% (exists, up) 100%
> osd24 10.10.30.55:6803 100% (exists, up) 100%
> osd25 10.10.30.56:6821 100% (exists, up) 100%
> osd26 10.10.30.54:6848 100% (exists, up) 100%
> osd27 10.10.30.53:6848 100% (exists, up) 100%
> osd28 10.10.30.55:6863 100% (exists, up) 100%
> osd29 10.10.30.56:6827 100% (exists, up) 100%
> osd30 10.10.30.54:6836 100% (exists, up) 100%
> osd31 10.10.30.53:6863 100% (exists, up) 100%
> osd32 10.10.30.55:6845 100% (exists, up) 100%
> osd33 10.10.30.56:6839 100% (exists, up) 100%
> osd34 10.10.30.54:6851 100% (exists, up) 100%
> osd35 10.10.30.53:6845 100% (exists, up) 100%
> osd36 10.10.30.55:6854 100% (exists, up) 100%
> osd37 10.10.30.56:6851 100% (exists, up) 100%
> osd38 10.10.30.54:6806 100% (exists, up) 100%
> osd39 10.10.30.53:6824  0% (doesn't exist) 100%
> osd40 10.10.30.55:6815 100% (exists, up) 100%
> osd41 10.10.30.56:6842 100% (exists, up) 100%
> osd42 10.10.30.54:6863 100% (exists, up) 100%
> osd43 10.10.30.53:6806 100% (exists, up) 100%
> osd44 10.10.30.55:6821 100% (exists, up) 100%
> osd45 10.10.30.56:6803 100% (exists, up) 100%
> osd46 10.10.30.54:6812 100% (exists, up) 100%
> osd47 10.10.30.53:6866 100% (exists, up) 100%
> osd48 10.10.30.55:6812 100% (exists, up) 100%
> osd49 10.10.30.56:6815 100% (exists, up) 100%
> osd50 10.10.30.53:6842 100% (exists, up) 100%
> osd51 10.10.30.54:6854 100% (exists, up) 100%
> osd52 10.10.30.55:6842 100% (exists, up) 100%
> osd53 10.10.30.56:6836 100% (exists, up) 100%
> osd54 10.10.30.53:6803 100% (exists, up) 100%
> osd55 10.10.30.54:6869 100% (exists, up) 100%
> osd56 10.10.30.55:6866 100% (exists, up) 100%
> osd57 10.10.30.56:6860 100% (exists, up) 100%
> osd58 10.10.30.53:6800 100% (exists, up) 100%
> osd59 10.10.30.54:6803 100% (exists, up) 100%
> osd60 10.10.30.55:6824 100% (exists, up) 100%
> osd61 10.10.30.56:6804 100% (exists, up) 100%
> osd62 10.10.30.53:6815 100% (exists, up) 100%
> osd63 10.10.30.54:6857 100% (exists, up) 100%
> osd64 10.10.30.55:6800 100% (exists, up) 100%
> osd65 10.10.30.56:6866 100% (exists, up) 100%
> osd66 10.10.30.53:6836 100% (exists, up) 100%
> osd67 10.10.30.54:6845 100% (exists, up) 100%
> osd68 10.10.30.55:6860 100% (exists, up) 100%
> osd69 10.10.30.56:6824 100% (exists, up) 100%
> osd70 10.10.30.53:6821 100% (exists, up) 100%
> osd71 10.10.30.54:6809 100% (exists, up) 100%
> osd72 10.10.30.55:6806 100% (exists, up) 100%
> osd73 10.10.30.56:6809 100% (exists, up) 100%
> osd74 10.10.30.53:6818 100% (exists, up) 100%
> osd75 10.10.30.54:6815 100% (exists, up) 100%
> osd76 10.10.30.55:6827 100% (exists, up) 100%
> osd77 10.10.30.56:6830 100% (exists, up) 100%
> osd78 10.10.30.53:6833 100% (exists, up) 100%
> osd79 10.10.30.54:6800 100% (exists, up) 100%
> osd80 10.10.30.55:6851 100% (exists, up) 100%
> osd81 10.10.30.56:6800 100% (exists, up) 100%
> osd82 10.10.30.53:6812 100% (exists, up) 100%
> osd83 10.10.30.54:6821 100% (exists, up) 100%
> osd84 10.10.30.55:6857 100% (exists, up) 100%
> osd85 10.10.30.56:6848 100% (exists, up) 100%
> osd86 10.10.30.53:6830 100% (exists, up) 100%
> osd87 10.10.30.54:6827 100% 

Re: [ceph-users] Kernel RBD hang on OSD Failure

2015-12-08 Thread Ilya Dryomov
On Tue, Dec 8, 2015 at 10:57 AM, Tom Christensen  wrote:
> We aren't running NFS, but regularly use the kernel driver to map RBDs and
> mount filesystems in same.  We see very similar behavior across nearly all
> kernel versions we've tried.  In my experience only very few versions of the
> kernel driver survive any sort of crush map change/update while something is
> mapped.  In fact in the last 2 years I think I've only seen this work on 1
> kernel version unfortunately its badly out of date and we can't run it in
> our environment anymore, I think it was a 3.0 kernel version running on
> ubuntu 12.04.  We have just recently started trying to find a kernel that
> will survive OSD outages or changes to the cluster.  We're on ubuntu 14.04,
> and have tried 3.16, 3.19.0-25, 4.3, and 4.2 without success in the last
> week.  We only map 1-3 RBDs per client machine at a time but we regularly
> will get processes stuck in D state which are accessing the filesystem
> inside the RBD and will have to hard reboot the RBD client machine.  This is
> always associated with a cluster change in some way, reweighting OSDs,
> rebooting an OSD host, restarting an individual OSD, adding OSDs, and
> removing OSDs all cause the kernel client to hang.  If no change is made to
> the cluster, the kernel client will be happy for weeks.

There are a couple of known bugs in the remap/resubmit area, but those
are supposedly corner cases (like *all* the OSDs going down and then
back up, etc).  I had no idea it was that severe and goes back that far.
Apparently triggering it requires a heavier load, as we've never seen
anything like that in our tests.

For unrelated reasons, remap/resubmit code is getting entirely
rewritten for kernel 4.5, so, if you've been dealing with this issue
for the last two years (I don't remember seeing any tickets listing
that many kernel versions and not mentioning NFS), I'm afraid the best
course of action for you would be to wait for 4.5 to come out and try
it.  If you'd be willing to test out an early version on one or more of
your client boxes, I can ping you when it's ready.

I'll take a look at 3.0 vs 3.16 with an eye on remap code.  Did you
happen to try 3.10?

It sounds like you can reproduce this pretty easily.  Can you get it to
lock up and do:

# cat /sys/kernel/debug/ceph/*/osdmap
# cat /sys/kernel/debug/ceph/*/osdc
$ ceph status

a bunch of times?  I have a hunch that the kernel client simply fails to
request enough new osdmaps after the cluster topology changes under
load.
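One way to do that, as a hedged sketch (interval and iteration count are
arbitrary; run as root so the debugfs files are readable):

for i in $(seq 1 10); do
    date
    cat /sys/kernel/debug/ceph/*/osdmap
    cat /sys/kernel/debug/ceph/*/osdc
    ceph status
    sleep 5
done > rbd-hang-capture.log 2>&1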

Thanks,

Ilya


Re: Kernel RBD hang on OSD Failure

2015-12-08 Thread Ilya Dryomov
On Mon, Dec 7, 2015 at 9:56 PM, Matt Conner  wrote:
> Hi,
>
> We have a Ceph cluster in which we have been having issues with RBD
> clients hanging when an OSD failure occurs. We are using a NAS gateway
> server which maps RBD images to filesystems and serves the filesystems
> out via NFS. The gateway server has close to 180 NFS clients and
> almost every time even 1 OSD goes down during heavy load, the NFS
> exports lock up and the clients are unable to access the NAS share via
> NFS. When the OSD fails, Ceph recovers without issue, but the gateway
> kernel RBD module appears to get stuck waiting on the now failed OSD.
> Note that this works correctly when under lighter loads.
>
> From what we have been able to determine, the NFS server daemon hangs
> waiting for I/O from the OSD that went out and never recovers.
> Similarly, attempting to access files from the exported FS locally on
> the gateway server will result in a similar hang. We also noticed that
> Ceph health details will continue to report blocked I/O on the now
> down OSD until either the OSD is recovered or the gateway server is
> rebooted.  Based on a few kernel logs from NFS and PVS, we were able
> to trace the problem to the RBD kernel module.
>
> Unfortunately, the only way we have been able to recover our gateway
> is by hard rebooting the server.
>
> Has anyone else encountered this issue and/or have a possible solution?
> Are there suggestions for getting more detailed debugging information
> from the RBD kernel module?

Dumping osdc, osdmap and ceph status when it gets stuck would be
a start:

# cat /sys/kernel/debug/ceph/*/osdmap
# cat /sys/kernel/debug/ceph/*/osdc
$ ceph status

Thanks,

Ilya


Re: testing the /dev/cciss/c0d0 device names

2015-12-06 Thread Ilya Dryomov
On Sat, Dec 5, 2015 at 7:36 PM, Loic Dachary  wrote:
> Hi Ilya,
>
> ceph-disk has special handling for device names like /dev/cciss/c0d1 [1] and 
> it was partially broken when support for device mapper was introduced. 
> Ideally there would be a way to test that support when running the ceph-disk 
> suite [2]. Do you know of a way to do that without having the hardware for 
> which this driver is designed?
>
> Maybe this convention (/dev/cciss/c0d0 being mapped to /sys/block/cciss!c0d0)
> is not unique to this driver, and I could use another one to validate that the
> name conversion from X/Y to X!Y and vice versa is handled as it should?

No, it's not unique.  driver core does strreplace(s, '/', '!') at
register time to work around such block devices.  The list includes
DAC960, aoeblk, cciss, cpqarray, sx8 and probably more, but I don't
think anything widespread uses this naming scheme.  IIRC dm actually
won't let you name a device with anything that contains a slash.

If you really wanted you could set up aoeblk I guess, but a simple unit
test should be more than enough ;)
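For what it's worth, the conversion a unit test would exercise is just
a character swap; a hedged sketch:

$ echo cciss/c0d0 | tr '/' '!'
cciss!c0d0
$ echo cciss!c0d0 | tr '!' '/'
cciss/c0d0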

Thanks,

Ilya


[PATCH] rbd: don't put snap_context twice in rbd_queue_workfn()

2015-12-01 Thread Ilya Dryomov
Commit 4e752f0ab0e8 ("rbd: access snapshot context and mapping size
safely") moved ceph_get_snap_context() out of rbd_img_request_create()
and into rbd_queue_workfn(), adding a ceph_put_snap_context() to the
error path in rbd_queue_workfn().  However, rbd_img_request_create()
consumes a ref on snapc, so calling ceph_put_snap_context() after
a successful rbd_img_request_create() leads to an extra put.  Fix it.

Cc: sta...@vger.kernel.org # 3.18+
Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 drivers/block/rbd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 24a757e97d56..4a876785b68c 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3442,6 +3442,7 @@ static void rbd_queue_workfn(struct work_struct *work)
goto err_rq;
}
img_request->rq = rq;
+   snapc = NULL; /* img_request consumes a ref */
 
if (op_type == OBJ_OP_DISCARD)
result = rbd_img_request_fill(img_request, OBJ_REQUEST_NODATA,
-- 
2.4.3



Re: block-rbd: One function call less in rbd_dev_probe_parent() after error detection

2015-11-26 Thread Ilya Dryomov
On Thu, Nov 26, 2015 at 8:54 AM, SF Markus Elfring
 wrote:
>>> I interpreted the eventual passing of a null pointer to the 
>>> rbd_dev_destroy()
>>> function as an indication for further source code adjustments.
>>
>> If all error paths could be adjusted so that NULL pointers are never passed 
>> in,
>> destroy functions wouldn't need to have a NULL check, would they?
>
> What do you think about clarifying the corresponding implementation details
> a bit more?
>
> * Why was the function "rbd_dev_probe_parent" implemented in the way
>   that it relies on a sanity check in the function "rbd_dev_destroy" then?

Because it's not a bad thing?  What's wrong with an init to NULL,
a possible assignment, in this case from rbd_dev_create(), and an
unconditional rbd_dev_destroy()?

The NULL check in rbd_dev_destroy() is not a sanity check, it's
a feature.  It's not there to "fixup" callers that pass NULL - it's
there because it is _expected_ that some callers will pass NULL.

> * What are the chances of restructuring the source code a bit (like changing
>   a few jump labels) so that it also works without an extra function call
>   during error handling there?

As I said in my reply to Dan, the problem with rbd_dev_probe_parent()
is the calling code which expects it to call unparent if ->parent_spec.
This makes it stand out and confuses people, but can't be fixed without
refactoring a bunch of other code.

The extra function call is *not* a problem.

Thanks,

Ilya


Re: [PATCH 2/2] block-rbd: One function call less in rbd_dev_probe_parent() after error detection

2015-11-25 Thread Ilya Dryomov
On Wed, Nov 25, 2015 at 12:55 PM, Dan Carpenter
<dan.carpen...@oracle.com> wrote:
> On Tue, Nov 24, 2015 at 09:21:06PM +0100, Ilya Dryomov wrote:
>> >> Cleanup here is (and should be) done in reverse order.
>> >
>
> Yes.  This is true.
>
>> > I have a different impression about the appropriate order of the
>> > corresponding clean-up function calls.
>> >
>> >
>> >> We allocate parent rbd_device and then link it with what we already have,
>> >
>> > I guess that we have got a different understanding about the relevant 
>> > "linking".
>>
>> Well, there isn't any _literal_ linking (e.g. adding to a link list,
>> etc) in this case.  We just bump some refs and do probe to fill in the
>> newly allocated parent.  If probe fails, we put refs and free parent,
>> reversing the "alloc parent, bump refs" order.
>>
>> The actual linking (rbd_dev->parent = parent) is done right before
>> returning so we never have to undo it in rbd_dev_probe_parent() and
>> that's the only reason your patch probably doesn't break anything.
>> Think about what happens if, after your patch is applied, someone moves
>> that assignment up or adds an extra step that can fail after it...
>>
>
> The problem is that the unwind code should be a mirror of the allocate
> code but rbd_dev_unparent() doesn't mirror anything.  Generally, writing
> future-proof stubs like this is the wrong thing because predicting the
> future is hard, and in the meantime we are left with stubs which confuse
> everyone.

It's not a future proof stub.  It's just some crufty code that was
fixed over time to not leak things.  I won't defend it - it is
confusing and could definitely be improved - but that can't be done
without refactoring a fair bunch of calling code.  A patch changing
rbd_dev_probe_parent() alone just won't do it.

>
>> If all error paths could be adjusted so that NULL pointers are never
>> passed in, destroy functions wouldn't need to have a NULL check, would
>> they?
>
> Yep.  We agree on the right way to do it.  I am probably the number one
> kernel developer for removing the most sanity checks.  :P  (As opposed
> to patch 1/1 where we now rely on the sanity check inside
> rbd_dev_destroy().)
>
> drivers/block/rbd.c
>   5149  static int rbd_dev_probe_parent(struct rbd_device *rbd_dev, int depth)
>   5150  {
>   5151  struct rbd_device *parent = NULL;
>   5152  int ret;
>   5153
>   5154  if (!rbd_dev->parent_spec)
>   5155  return 0;
>   5156
>   5157  if (++depth > RBD_MAX_PARENT_CHAIN_LEN) {
>   5158  pr_info("parent chain is too long (%d)\n", depth);
>   5159  ret = -EINVAL;
>   5160  goto out_err;
>
> We haven't allocated anything so this should just be return -EINVAL;
> In the original code, we decrement the kref count on ->parent_spec on
> this error path so that is a classic One Err Bug.

The caller expects rbd_dev->parent_spec to be put on any error.  Notice
that we return right away if !rbd_dev->parent_spec.

>
>   5161  }
>   5162
>   5163  parent = rbd_dev_create(rbd_dev->rbd_client, 
> rbd_dev->parent_spec,
>   5164  NULL);
>   5165  if (!parent) {
>   5166  ret = -ENOMEM;
>   5167  goto out_err;
>
> Still haven't allocated anything so return -ENOMEM, but if we fail after
> this point we will need to call rbd_dev_destroy().
>
>   5168  }
>   5169
>   5170  /*
>   5171   * Images related by parent/child relationships always share
>   5172   * rbd_client and spec/parent_spec, so bump their refcounts.
>   5173   */
>   5174  __rbd_get_client(rbd_dev->rbd_client);
>   5175  rbd_spec_get(rbd_dev->parent_spec);
>
> We will need to put these on any later error paths.

And we do, in rbd_dev_destroy(parent), since these are references for
the parent.

>
>   5176
>   5177  ret = rbd_dev_image_probe(parent, depth);
>   5178  if (ret < 0)
>   5179  goto out_err;
>
> Ok.  We need to put the ->parent_spec, ->rbd_client and free the parent.
>
>   5180
>   5181  rbd_dev->parent = parent;
>   5182  atomic_set(&rbd_dev->parent_ref, 1);
>   5183  return 0;
>   5184
>   5185  out_err:
>   5186  rbd_dev_unparent(rbd_dev);
>
> This is a complicated way to say rbd_spec_put(rbd_dev->parent_spec);
>
> Also, is it really necessary to set ->parent_spec to NULL?  If 

Re: block-rbd: One function call less in rbd_dev_probe_parent() after error detection

2015-11-24 Thread Ilya Dryomov
On Tue, Nov 24, 2015 at 9:34 PM, SF Markus Elfring
 wrote:
>> Well, there isn't any _literal_ linking (e.g. adding to a link list,
>> etc) in this case.  We just bump some refs and do probe to fill in the
>> newly allocated parent.
>
> Thanks for your clarification.
>
>
>> The actual linking (rbd_dev->parent = parent) is done right before
>> returning so we never have to undo it in rbd_dev_probe_parent() and
>> that's the only reason your patch probably doesn't break anything.
>
> Is this function implementation just also affected by an issue
> which is mentioned in the Linux document "CodingStyle" as "one err bugs"?

No, why?  "one err bug" as per CodingStyle is a NULL deref on line 2 if
foo is NULL.  If it was just "err: kfree(foo); return ret;", a NULL foo
would be perfectly OK.

1   err:
2   kfree(foo->bar);
3   kfree(foo);
4   return ret;

If you can spot such a NULL deref in rbd_dev_probe_parent(), I'd gladly
take a patch.

>
>
>> Think about what happens if, after your patch is applied, someone moves
>> that assignment up or adds an extra step that can fail after it...
>
> Is such a software maintenance concern really enough to delay (or reject)
> my second update suggestion in this small patch series?

Yes - it's rejected because it messes up the order of cleanup for no
good reason.  I realize why you think the patch is correct and it's not
without merit, but it just doesn't fit the weird rbd_dev_probe_parent()
contract.

Thanks,

Ilya


Re: [CEPH-DEVEL] [ceph-users] occasional failure to unmap rbd

2015-11-24 Thread Ilya Dryomov
On Tue, Nov 24, 2015 at 12:12 AM, Markus Kienast <elias1...@gmail.com> wrote:
> Kernel Version
> elias@paris3:~$ uname -a
> Linux paris3.sfe.tv 3.16.0-28-generic #38-Ubuntu SMP Sat Dec 13
> 16:13:28 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>
> Output of dmesg and /var/log/dmesg attached.
> But does not show much except for one mon being down.
> The mon is down for hardware reasons.
>
>
>
> On Mon, Nov 23, 2015 at 11:26 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
>>
>> On Mon, Nov 23, 2015 at 11:03 PM, Markus Kienast <m...@trickkiste.at> wrote:
>> > I am having the same issue here.
>>
>> Which kernel are you running?  Could you attach your dmesg?
>>
>> >
>> > root@paris3:/etc/neutron# rbd unmap /dev/rbd0
>> > rbd: failed to remove rbd device: (16) Device or resource busy
>> > rbd: remove failed: (16) Device or resource busy
>> >
>> > root@paris3:/etc/neutron# rbd info -p volumes
>> > volume-f3ab6892-f35e-4b98-8832-efbaaa2f4ca2
>> > 2015-11-23 22:42:06.842697 7f2d57e49700  0 -- :/2760503703 >>
>> > 10.90.90.4:6789/0 pipe(0x1773250 sd=3 :0 s=1 pgs=0 cs=0 l=1
>> > c=0x17734e0).fault
>> > rbd image 'volume-f3ab6892-f35e-4b98-8832-efbaaa2f4ca2':
>> > size 500 GB in 128000 objects
>> > order 22 (4096 kB objects)
>> > block_name_prefix: rbd_data.1b6d9e2aaa998b
>> > format: 2
>> > features: layering
>> > root@paris3:/etc/neutron# rados -p volumes listwatchers
>> > rbd_header.1b6d9e2aaa998b
>> > 2015-11-23 22:42:58.546723 7fec94fec700  0 -- :/2519796249 >>
>> > 10.90.90.4:6789/0 pipe(0x9cf260 sd=3 :0 s=1 pgs=0 cs=0 l=1 
>> > c=0x9cf4f0).fault
>>
>> Did you root cause these faults?
>
> Hardware failure caused these faults.
>
>>
>> > watcher=10.90.90.3:0/3293327848 client.8471177 cookie=1
>> >
>> > root@paris3:/etc/neutron# ps ax | grep rbd
>> >  7814 ?S  0:00 [jbd2/rbd0-8]
>>
>> Was there an ext filesystem involved?  How was it umounted - do you
>> have a "umount " process stuck in D state?
>
> Yes, all these RBDs are formatted with ext4. I am regularly using them
> with openstack and have never had any problems.
> I did "unmount " and the unmount process did actually
> finish just fine.
> Where can I check whether it is stuck in "D" state?
>
>>
>> > 11003 ?S  0:00 [jbd2/rbd1-8]
>> > 14042 ?S  0:00 [jbd2/rbd2p1-8]
>> > 24228 ?S  0:00 [jbd2/rbd3-8]
>> >
>> > root@paris3:/etc/neutron# ceph --version
>> > ceph version 0.80.11 (8424145d49264624a3b0a204aedb127835161070)
>> >
>> > root@paris3:/etc/neutron# ls /sys/block/rbd0/holders/
>> > returns nothing
>> >
>> > root@paris3:/etc/neutron# fuser -amv /dev/rbd0
>> >  USERPID ACCESS COMMAND
>> > /dev/rbd0:
>>
>> What's the output of "cat /sys/bus/rbd/devices/0/client_id"?
>
> root@paris3:~# cat /sys/bus/rbd/devices/0/client_id
> client8471177
>
>>
>> What's the output of "sudo cat /sys/kernel/debug/ceph/*/osdc"?
>
> root@paris3:~# ls -l /sys/kernel/debug/ceph/
> total 0
> drwxr-xr-x 2 root root 0 Feb  4  2015
> 32ba3117-e320-49fc-aabd-f100d5a7e94b.client7663711
> drwxr-xr-x 2 root root 0 Nov 23 11:41
> 32ba3117-e320-49fc-aabd-f100d5a7e94b.client8471177
>
> root@paris3:~# cat
> /sys/kernel/debug/ceph/32ba3117-e320-49fc-aabd-f100d5a7e94b.client8471177/osdc
> has no output

This means there are no outstanding/hung rbd I/Os.  According to you,
umount completed successfully, and yet there is a jbd2/rbd0-8 kthread
hanging around, keeping /dev/rbd0 open and holding a ref to it.
A quick search produced two similar reports:

[1] 
https://ask.fedoraproject.org/en/question/7572/how-to-stop-kernel-ext4-journaling-thread/
[2] http://lists.openwall.net/linux-ext4/2015/10/24/11

The only difference as far as I can tell is that those people noticed the jbd2
thread because they wanted to run fsck, while you ran into it because
you tried to do "rbd unmap".  Neither mentions rbd.

Look at [2], did you at any point see any similar errors in dmesg?
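To answer the earlier question about D state, a hedged sketch (the jbd2 thread
name is taken from your ps output above):

$ ps axo pid,stat,comm | awk '$2 ~ /^D/'         # tasks in uninterruptible sleep
$ sudo cat /proc/$(pgrep 'jbd2/rbd0')/stack      # what the journal kthread is doing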

>
> root@paris3:~# cat
> /sys/kernel/debug/ceph/32ba3117-e320-49fc-aabd-f100d5a7e94b.client7663711/osdc
> hangs with no output

It shouldn't hang, so it could be unrelated.  Given the "Feb  4  2015"
timestamp, I'm going to assume you haven't rebooted this box in a long
time?  If so, do you remember what happened around that date?

Do you keep syslog archives?  I'd be interested in seeing everything
you have for Feb 3 - Feb 5.

To try to figure out where it's hanging, can you do

# cat /sys/kernel/debug/ceph/32ba3117-e320-49fc-aabd-f100d5a7e94b.client7663711/osdc
< it'll hang, grab its PID from ps output >
# cat /proc/$PID/stack

Thanks,

Ilya


Re: Reboot blocked when undoing unmap op.

2015-11-20 Thread Ilya Dryomov
On Fri, Nov 20, 2015 at 3:19 AM, Wukongming  wrote:
> Hi Sage,
>
> I created an rbd image and mapped it locally, which means I can find
> /dev/rbd0.  When I then reboot the system, in the last step of shutting down,
> it blocks with an error:
>
> [235618.0202207] libceph: connect 172.16.57.252:6789 error -101.
>
> My work environment:
>
> Ubuntu kernel 3.19.0
> Ceph 0.94.5
> A cluster of 2 servers with iscsitgt and open-iscsi, each acting as both
> server and client.  The multipath process is running but does not affect this
> issue; I've tried stopping multipath, but the issue is still there.
> I map an rbd image locally - why does it show me a connect error?
>
> I saw your reply on 
> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/13077, but only
> a part of it. Has this issue been resolved, and how?

Yeah, this has been a long standing problem with libceph/rbd.  The
issue is that you *have* to umount (and ideally also unmap, but unmap
isn't strictly necessary) before you reboot.  Otherwise (and I assume
by mapped to a local you mean you've got MONs and OSDs on the same node
as you do rbd map), when you issue a reboot, daemons get killed and the
kernel client ends up waiting for them to come back, because of
outstanding writes issued by umount called by systemd (or whatever).
There are other variations of this, but it all comes down to you having
to cold reboot.

The right fix is to have all init systems sequence the killing of ceph
daemons after the umount/unmap.  I also toyed with adding a reboot
notifier for libceph to save a cold reboot, but the problem with that
in the general case is data integrity.  However, in cases like the one
I described above, there is no going back so we might as well kill
libceph through a notifier.  I have an incomplete patch somewhere, but
it really shouldn't be necessary...
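Until then, a hedged sketch of the manual sequence on such a node (the mount
point is hypothetical):

$ sudo umount /mnt/rbd0      # flush and release the filesystem first
$ sudo rbd unmap /dev/rbd0   # optional, but avoids leaving libceph around
$ sudo reboot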

Thanks,

Ilya


Re: Kernel client OSD message version

2015-11-19 Thread Ilya Dryomov
On Thu, Nov 19, 2015 at 11:27 AM, Lakis, Jacek  wrote:
> Ilya, thank you for the quick reply.
>
> For example it's about split decoding. I'm asking not because of specific 
> changes, I'm rather curious about when we should sync the kernel client 
> encoding to the master (or stable versions).

It's usually done when the need arises, like if the kernel client
starts using a field that isn't in the old encoding.  We don't update
for each new ceph release.

Thanks,

Ilya


Re: [PATCH 2/3] net/ceph: do not define list_entry_next

2015-11-18 Thread Ilya Dryomov
On Wed, Nov 18, 2015 at 1:13 PM, Sergey Senozhatsky
 wrote:
> Cosmetic.
>
> Do not define list_entry_next() and use list_next_entry()
> from list.h.
>
> Signed-off-by: Sergey Senozhatsky 
> ---
>  net/ceph/messenger.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index 9981039..77ccca9 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #ifdef CONFIG_BLOCK
>  #include 
>  #endif /* CONFIG_BLOCK */
> @@ -23,9 +24,6 @@
>  #include 
>  #include 
>
> -#define list_entry_next(pos, member)   \
> -   list_entry(pos->member.next, typeof(*pos), member)
> -
>  /*
>   * Ceph uses the messenger to exchange ceph_msg messages with other
>   * hosts in the system.  The messenger provides ordered and reliable
> @@ -1042,7 +1040,7 @@ static bool ceph_msg_data_pagelist_advance(struct 
> ceph_msg_data_cursor *cursor,
> /* Move on to the next page */
>
> BUG_ON(list_is_last(&cursor->page->lru, &pagelist->head));
> -   cursor->page = list_entry_next(cursor->page, lru);
> +   cursor->page = list_next_entry(cursor->page, lru);
> cursor->last_piece = cursor->resid <= PAGE_SIZE;
>
> return true;
> @@ -1166,7 +1164,7 @@ static bool ceph_msg_data_advance(struct 
> ceph_msg_data_cursor *cursor,
> if (!cursor->resid && cursor->total_resid) {
> WARN_ON(!cursor->last_piece);
> BUG_ON(list_is_last(&cursor->data->links, cursor->data_head));
> -   cursor->data = list_entry_next(cursor->data, links);
> +   cursor->data = list_next_entry(cursor->data, links);
> __ceph_msg_data_cursor_init(cursor);
> new_piece = true;
> }

Someone beat you to it ;)

https://github.com/ceph/ceph-client/commit/76b4a27faebb369c1c50df01ef08b614a2854fc5

Thanks,

Ilya


Re: [PATCH 2/3] libceph: use list_next_entry instead of list_entry_next

2015-11-17 Thread Ilya Dryomov
On Mon, Nov 16, 2015 at 2:46 PM, Geliang Tang  wrote:
> list_next_entry has been defined in list.h, so I replace list_entry_next
> with it.
>
> Signed-off-by: Geliang Tang 
> ---
>  net/ceph/messenger.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index 9981039..b1d1489 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -23,9 +23,6 @@
>  #include 
>  #include 
>
> -#define list_entry_next(pos, member)   \
> -   list_entry(pos->member.next, typeof(*pos), member)
> -
>  /*
>   * Ceph uses the messenger to exchange ceph_msg messages with other
>   * hosts in the system.  The messenger provides ordered and reliable
> @@ -1042,7 +1039,7 @@ static bool ceph_msg_data_pagelist_advance(struct 
> ceph_msg_data_cursor *cursor,
> /* Move on to the next page */
>
> BUG_ON(list_is_last(&cursor->page->lru, &pagelist->head));
> -   cursor->page = list_entry_next(cursor->page, lru);
> +   cursor->page = list_next_entry(cursor->page, lru);
> cursor->last_piece = cursor->resid <= PAGE_SIZE;
>
> return true;
> @@ -1166,7 +1163,7 @@ static bool ceph_msg_data_advance(struct 
> ceph_msg_data_cursor *cursor,
> if (!cursor->resid && cursor->total_resid) {
> WARN_ON(!cursor->last_piece);
> BUG_ON(list_is_last(&cursor->data->links, cursor->data_head));
> -   cursor->data = list_entry_next(cursor->data, links);
> +   cursor->data = list_next_entry(cursor->data, links);
> __ceph_msg_data_cursor_init(cursor);
> new_piece = true;
> }

Applied.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


request_queue use-after-free - inode_detach_wb()

2015-11-16 Thread Ilya Dryomov
Hello,

Last week, while running an rbd test which does a lot of maps and
unmaps (read losetup / losetup -d) with slab debugging enabled, I hit
the attached splat.  That 6a byte corresponds to the atomic_long_t
count of the percpu_ref refcnt in request_queue::backing_dev_info::wb,
pointing to a percpu_ref_put() on a freed memory.  It hasn't reproduced
since.

After a prolonged stare at rbd (we've just fixed an rbd vs sysfs
lifecycle issue, so I naturally assumed we either missed something or
it had something to do with that patch) I looked wider and concluded
that the most likely place a stray percpu_ref_put() could have come
from was inode_detach_wb().  It's called from __destroy_inode(), which
means iput(), which means bdput().

Looking at __blkdev_put(), the issue becomes clear: we are taking
precautions to flush before calling out to ->release() because, at
least according to the comment, ->release() can free queue; we are
recording owner pointer because put_disk() may free both gendisk and
queue, and then, after all that, we are calling bdput() which may
touch the queue through wb_put() in inode_detach_wb().  (The fun part
is wb_put() is supposed to be a noop for root wbs, but slab debugging
interferes with that by poisoning wb->bdi pointer.)

1514  * dirty data before.
1515  */
1516 bdev_write_inode(bdev);
1517 }
1518 if (bdev->bd_contains == bdev) {
1519 if (disk->fops->release)
1520 disk->fops->release(disk, mode);
1521 }
1522 if (!bdev->bd_openers) {
1523 struct module *owner = disk->fops->owner;
1524
1525 disk_put_part(bdev->bd_part);
1526 bdev->bd_part = NULL;
1527 bdev->bd_disk = NULL;
1528 if (bdev != bdev->bd_contains)
1529 victim = bdev->bd_contains;
1530 bdev->bd_contains = NULL;
1531
1532 put_disk(disk); <-- may free q
1533 module_put(owner);
1534 }
1535 mutex_unlock(&bdev->bd_mutex);
1536 bdput(bdev); <-- may touch q.backing_dev_info.wb

To reproduce, apply the attached patch (systemd-udevd condition is just
a convenience: udev reacts to change events by getting the bdev which
it then has to put), boot with slub_debug=,blkdev_queue and do:

$ sudo modprobe loop
$ sudo losetup /dev/loop0 foo.img
$ sudo dd if=/dev/urandom of=/dev/loop0 bs=1M count=1
$ sudo losetup -d /dev/loop0
$ sudo rmmod loop

(rmmod is key - it's the only way to get loop to do put_disk().  For
rbd, it's just rbd map - dd - rbd unmap.)

In the past we used to reassign to default_backing_dev_info here, but
it was nuked in b83ae6d42143 ("fs: remove mapping->backing_dev_info").
Shortly after that cgroup-specific writebacks patches from Tejun got
merged, adding inode::i_wb and inode_detach_wb() call.  The fix seems
to be to detach the inode earlier, but I'm not familiar enough with
cgroups code, so sending my findings instead of a patch.  Christoph,
Tejun?

Thanks,

Ilya
[18513.199040] 
=
[18513.199459] BUG blkdev_queue (Not tainted): Poison overwritten
[18513.199459] 
-
[18513.199459] 
[18513.205765] Disabling lock debugging due to kernel taint
[18513.205765] INFO: 0x8800659a05d8-0x8800659a05d8. First byte 0x6a 
instead of 0x6b
[18513.205765] INFO: Allocated in blk_alloc_queue_node+0x28/0x2c0 age=10215 
cpu=1 pid=1920
[18513.205765]  __slab_alloc.constprop.50+0x4d5/0x540
[18513.205765]  kmem_cache_alloc+0x2ba/0x320
[18513.205765]  blk_alloc_queue_node+0x28/0x2c0
[18513.205765]  blk_mq_init_queue+0x20/0x60
[18513.205765]  do_rbd_add.isra.23+0x833/0xd70
[18513.205765]  rbd_add+0x1d/0x30
[18513.205765]  bus_attr_store+0x25/0x30
[18513.205765]  sysfs_kf_write+0x45/0x60
[18513.205765]  kernfs_fop_write+0x141/0x190
[18513.205765]  __vfs_write+0x28/0xe0
[18513.205765]  vfs_write+0xa2/0x180
[18513.205765]  SyS_write+0x49/0xa0
[18513.219265]  entry_SYSCALL_64_fastpath+0x12/0x6f
[18513.219265] INFO: Freed in blk_free_queue_rcu+0x1c/0x20 age=122 cpu=1 
pid=1959
[18513.219265]  __slab_free+0x148/0x290
[18513.219265]  kmem_cache_free+0x2b7/0x340
[18513.219265]  blk_free_queue_rcu+0x1c/0x20
[18513.219265]  rcu_process_callbacks+0x2fb/0x820
[18513.219265]  __do_softirq+0xd4/0x460
[18513.219265]  irq_exit+0x95/0xa0
[18513.219265]  smp_apic_timer_interrupt+0x42/0x50
[18513.219265]  apic_timer_interrupt+0x81/0x90
[18513.219265]  __slab_free+0xb5/0x290
[18513.219265]  kmem_cache_free+0x2b7/0x340
[18513.219265]  ptlock_free+0x19/0x20
[18513.219265]  ___pte_free_tlb+0x22/0x50
[18513.219265]  free_pgd_range+0x258/0x440
[18513.219265]  free_pgtables+0xc4/0x120
[18513.219265]  exit_mmap+0xc3/0x130
[18513.219265]  mmput+0x3d/0xf0
[18513.219265] INFO: Slab 0xea0001966800 objects=9 used=9 

Re: [PATCH] ceph:Fix error handling in the function down_reply

2015-11-09 Thread Ilya Dryomov
On Mon, Nov 9, 2015 at 11:15 AM, Yan, Zheng  wrote:
>
>> On Nov 9, 2015, at 11:11, Nicholas Krause  wrote:
>>
>> This fixes error handling in the function down_reply in order to
>> properly check and jump to the goto label, out_err for this
>> particular function if a error code is returned by any function
>> called in down_reply and therefore make checking be included
>> for the call to ceph_update_snap_trace in order to comply with
>> these error handling checks/paths.
>>
>> Signed-off-by: Nicholas Krause 
>> ---
>> fs/ceph/mds_client.c | 11 +++
>> 1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>> index 51cb02d..0b01f94 100644
>> --- a/fs/ceph/mds_client.c
>> +++ b/fs/ceph/mds_client.c
>> @@ -2495,14 +2495,17 @@ static void handle_reply(struct ceph_mds_session 
>> *session, struct ceph_msg *msg)
>>   realm = NULL;
>>   if (rinfo->snapblob_len) {
>>   down_write(&mdsc->snap_rwsem);
>> - ceph_update_snap_trace(mdsc, rinfo->snapblob,
>> - rinfo->snapblob + rinfo->snapblob_len,
>> - le32_to_cpu(head->op) == CEPH_MDS_OP_RMSNAP,
>> - &realm);
>> + err = ceph_update_snap_trace(mdsc, rinfo->snapblob,
>> +  rinfo->snapblob + 
>> rinfo->snapblob_len,
>> +  le32_to_cpu(head->op) == 
>> CEPH_MDS_OP_RMSNAP,
>> +  &realm);
>>   downgrade_write(&mdsc->snap_rwsem);
>>   } else {
>>   down_read(&mdsc->snap_rwsem);
>>   }
>> +
>> + if (err)
>> + goto out_err;
>>
>>   /* insert trace into our cache */
>>   mutex_lock(&req->r_fill_mutex);
>
> Applied, thanks

This looks to me like it'd leave snap_rwsem locked for read?  Also, the
name of the function in question is handle_reply(), not down_reply().

I'll revert testing.

Thanks,

Ilya


[PATCH 2/4] libceph: drop authorizer check from cephx msg signing routines

2015-11-02 Thread Ilya Dryomov
I don't see a way for auth->authorizer to be NULL in
ceph_x_sign_message() or ceph_x_check_message_signature().

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 net/ceph/auth_x.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/ceph/auth_x.c b/net/ceph/auth_x.c
index 65054fd31b97..3a544ca6b5ce 100644
--- a/net/ceph/auth_x.c
+++ b/net/ceph/auth_x.c
@@ -697,8 +697,7 @@ static int ceph_x_sign_message(struct ceph_auth_handshake 
*auth,
   struct ceph_msg *msg)
 {
int ret;
-   if (!auth->authorizer)
-   return 0;
+
ret = calcu_signature((struct ceph_x_authorizer *)auth->authorizer,
  msg, &msg->footer.sig);
if (ret < 0)
@@ -713,8 +712,6 @@ static int ceph_x_check_message_signature(struct 
ceph_auth_handshake *auth,
__le64 sig_check;
int ret;
 
-   if (!auth->authorizer)
-   return 0;
ret = calcu_signature((struct ceph_x_authorizer *)auth->authorizer,
  msg, &sig_check);
if (ret < 0)
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/4] libceph: msg signing callouts don't need con argument

2015-11-02 Thread Ilya Dryomov
We can use msg->con instead - at the point we sign an outgoing message
or check the signature on the incoming one, msg->con is always set.  We
wouldn't know how to sign a message without an associated session (i.e.
msg->con == NULL) and being able to sign a message using an explicitly
provided authorizer is of no use.

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 fs/ceph/mds_client.c   | 14 --
 include/linux/ceph/messenger.h |  5 ++---
 net/ceph/messenger.c   |  4 ++--
 net/ceph/osd_client.c  | 14 --
 4 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 89838a226fe9..e7b130a637f9 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3942,17 +3942,19 @@ static struct ceph_msg *mds_alloc_msg(struct 
ceph_connection *con,
return msg;
 }
 
-static int sign_message(struct ceph_connection *con, struct ceph_msg *msg)
+static int mds_sign_message(struct ceph_msg *msg)
 {
-   struct ceph_mds_session *s = con->private;
+   struct ceph_mds_session *s = msg->con->private;
struct ceph_auth_handshake *auth = &s->s_auth;
+
return ceph_auth_sign_message(auth, msg);
 }
 
-static int check_message_signature(struct ceph_connection *con, struct 
ceph_msg *msg)
+static int mds_check_message_signature(struct ceph_msg *msg)
 {
-   struct ceph_mds_session *s = con->private;
+   struct ceph_mds_session *s = msg->con->private;
struct ceph_auth_handshake *auth = &s->s_auth;
+
return ceph_auth_check_message_signature(auth, msg);
 }
 
@@ -3965,8 +3967,8 @@ static const struct ceph_connection_operations 
mds_con_ops = {
.invalidate_authorizer = invalidate_authorizer,
.peer_reset = peer_reset,
.alloc_msg = mds_alloc_msg,
-   .sign_message = sign_message,
-   .check_message_signature = check_message_signature,
+   .sign_message = mds_sign_message,
+   .check_message_signature = mds_check_message_signature,
 };
 
 /* eof */
diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index b2371d9b51fa..3687ff0f0133 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -43,10 +43,9 @@ struct ceph_connection_operations {
struct ceph_msg * (*alloc_msg) (struct ceph_connection *con,
struct ceph_msg_header *hdr,
int *skip);
-   int (*sign_message) (struct ceph_connection *con, struct ceph_msg *msg);
 
-   int (*check_message_signature) (struct ceph_connection *con,
-   struct ceph_msg *msg);
+   int (*sign_message) (struct ceph_msg *msg);
+   int (*check_message_signature) (struct ceph_msg *msg);
 };
 
 /* use format string %s%d */
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index fce6ad636613..805f6f82139f 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1205,7 +1205,7 @@ static void prepare_write_message_footer(struct 
ceph_connection *con)
con->out_kvec[v].iov_base = &m->footer;
if (con->peer_features & CEPH_FEATURE_MSG_AUTH) {
if (con->ops->sign_message)
-   con->ops->sign_message(con, m);
+   con->ops->sign_message(m);
else
m->footer.sig = 0;
con->out_kvec[v].iov_len = sizeof(m->footer);
@@ -2422,7 +2422,7 @@ static int read_partial_message(struct ceph_connection 
*con)
}
 
if (need_sign && con->ops->check_message_signature &&
-   con->ops->check_message_signature(con, m)) {
+   con->ops->check_message_signature(m)) {
pr_err("read_partial_message %p signature check failed\n", m);
return -EBADMSG;
}
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 191bc21cecea..118e4ce37ecc 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2979,17 +2979,19 @@ static int invalidate_authorizer(struct ceph_connection 
*con)
return ceph_monc_validate_auth(&osdc->client->monc);
 }
 
-static int sign_message(struct ceph_connection *con, struct ceph_msg *msg)
+static int osd_sign_message(struct ceph_msg *msg)
 {
-   struct ceph_osd *o = con->private;
+   struct ceph_osd *o = msg->con->private;
struct ceph_auth_handshake *auth = &o->o_auth;
+
return ceph_auth_sign_message(auth, msg);
 }
 
-static int check_message_signature(struct ceph_connection *con, struct 
ceph_msg *msg)
+static int osd_check_message_signature(struct ceph_msg *msg)
 {
-   struct ceph_osd *o = con->private;
+   struct ceph_osd *o = msg->con->private;
struct ceph_auth_handshake *auth = &o->o_auth;
+
return ceph_auth_check_message_signature(auth, ms

[PATCH 0/4] libceph: nocephx_sign_messages option + misc

2015-11-02 Thread Ilya Dryomov
Hello,

This adds nocephx_sign_messages libceph option (a lack of which is
something people are running into, see [1]), plus a couple of related
cleanups.

[1] https://forum.proxmox.com/threads/24116-new-krbd-option-on-pve4-don-t-work

Thanks,

Ilya


Ilya Dryomov (4):
  libceph: msg signing callouts don't need con argument
  libceph: drop authorizer check from cephx msg signing routines
  libceph: stop duplicating client fields in messenger
  libceph: add nocephx_sign_messages option

 fs/ceph/mds_client.c   | 14 --
 include/linux/ceph/libceph.h   |  4 +++-
 include/linux/ceph/messenger.h | 16 +++-
 net/ceph/auth_x.c  |  8 ++--
 net/ceph/ceph_common.c | 18 +-
 net/ceph/messenger.c   | 32 
 net/ceph/osd_client.c  | 14 --
 7 files changed, 53 insertions(+), 53 deletions(-)

-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: [PATCH] mark rbd requiring stable pages

2015-10-30 Thread Ilya Dryomov
On Fri, Oct 23, 2015 at 9:06 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
> On Fri, Oct 23, 2015 at 9:00 PM, ronny.hegew...@online.de
> <ronny.hegew...@online.de> wrote:
>>> Could you share the entire log snippet for those 10 minutes?
>>
>> That's all in the logs.  But if more information would be useful, tell me 
>> which logs
>> to activate and I will give it another run. At least this part is easy to 
>> reproduce.
>>
>>> Which kernel was this on?
>>
>> The latest kernel I used which produced the corruption was 3.19.8.
>> The earliest one was 3.11.
>
> No need for now, I'll poke around and report back.

So the "bad crc" errors are of course easily reproducible, but
I haven't managed to reproduce ext4 corruptions.  I amended your patch
to only require stable pages in case we actually compute checksums, see
https://github.com/ceph/ceph-client/commit/4febcceb866822c1a1aee2836c9c810e3ef29bbf.
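
(The gist of that amendment, as a sketch - the option test below is meant to
illustrate the idea, not to quote the commit:

    /* rbd_init_disk(): only force stable pages when we actually CRC the data */
    if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
            q->backing_dev_info.capabilities |= BDI_CAP_STABLE_WRITES;

so the flag the original patch set unconditionally is only set when data
checksums are enabled.)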

Any other data points you can share?  Can you describe your cluster
(boxes, OSDs, clients, rbds mapped - where, how many, ext4 mkfs and
mount options, etc) in more detail?  Is there anything special about
your setup that you can think of?

You've mentioned that the best test case in your experience is kernel
compilation.  What .config are you using, how many threads (make -jX)
and how long does it take to build a kernel with that .config and that
number of threads?  You have more than one rbd device mapped on the
same box - how many exactly, do you put any load on the rest while the
kernel is compiling on one of them?  What about rbd devices mapped on
other boxes?

You get the idea - every bit counts.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net: ceph: osd_client: change osd_req_op_data() macro

2015-10-29 Thread Ilya Dryomov
On Thu, Oct 22, 2015 at 5:06 PM, Ioana Ciornei  wrote:
> This patch changes the osd_req_op_data() macro to not evaluate
> parameters more than once in order to follow the kernel coding style.
>
> Signed-off-by: Ioana Ciornei 
> Reviewed-by: Alex Elder 
> ---
>  net/ceph/osd_client.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index a362d7e..856e8f8 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -120,10 +120,12 @@ static void ceph_osd_data_bio_init(struct ceph_osd_data 
> *osd_data,
>  }
>  #endif /* CONFIG_BLOCK */
>
> -#define osd_req_op_data(oreq, whch, typ, fld)  \
> -   ({  \
> -   BUG_ON(whch >= (oreq)->r_num_ops);  \
> -   &(oreq)->r_ops[whch].typ.fld;   \
> +#define osd_req_op_data(oreq, whch, typ, fld)\
> +   ({\
> +   struct ceph_osd_request *__oreq = (oreq); \
> +   unsigned int __whch = (whch);   \
> +   BUG_ON(__whch >= __oreq->r_num_ops);  \
> +   &__oreq->r_ops[__whch].typ.fld;   \
> })
>
>  static struct ceph_osd_data *

For some reason this ended up in Spam - you should CC subsystem
maintainer(s) on any patch, as noted in Documentation/SubmittingPatches.

Applied with minor changes, see
https://github.com/ceph/ceph-client/commit/51dcb83f3d1143819fe0791c112e1b1f830e457d.
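
(For anyone wondering why multiple evaluation matters here: with the old
macro an argument with side effects is evaluated twice - once in the
BUG_ON() and once in the array index.  A standalone toy model, purely for
illustration:

    #include <stdio.h>

    /* old style: 'i' expands twice, so i++ runs twice */
    #define PICK_TWICE(arr, i)  ({ (void)(i); (arr)[(i)]; })
    /* new style: evaluate 'i' once into a local */
    #define PICK_ONCE(arr, i)   ({ int __i = (i); (arr)[__i]; })

    int main(void)
    {
            int a[4] = { 10, 20, 30, 40 };
            int i = 0, j = 0;
            int x = PICK_TWICE(a, i++);     /* i ends up 2, x == a[1] */
            int y = PICK_ONCE(a, j++);      /* j ends up 1, y == a[0] */

            printf("i=%d x=%d j=%d y=%d\n", i, x, j, y);
            return 0;
    }

The reworked osd_req_op_data() avoids exactly this class of surprise.)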

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] libceph: introduce ceph_x_authorizer_cleanup()

2015-10-26 Thread Ilya Dryomov
Commit ae385eaf24dc ("libceph: store session key in cephx authorizer")
introduced ceph_x_authorizer::session_key, but didn't update all the
exit/error paths.  Introduce ceph_x_authorizer_cleanup() to encapsulate
ceph_x_authorizer cleanup and switch to it.  This fixes ceph_x_destroy(),
which currently always leaks key and ceph_x_build_authorizer() error
paths.

Cc: Yan, Zheng <z...@redhat.com>
Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 net/ceph/auth_x.c | 28 +---
 net/ceph/crypto.h |  4 +++-
 2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/net/ceph/auth_x.c b/net/ceph/auth_x.c
index ba6eb17226da..65054fd31b97 100644
--- a/net/ceph/auth_x.c
+++ b/net/ceph/auth_x.c
@@ -279,6 +279,15 @@ bad:
return -EINVAL;
 }
 
+static void ceph_x_authorizer_cleanup(struct ceph_x_authorizer *au)
+{
+   ceph_crypto_key_destroy(&au->session_key);
+   if (au->buf) {
+   ceph_buffer_put(au->buf);
+   au->buf = NULL;
+   }
+}
+
 static int ceph_x_build_authorizer(struct ceph_auth_client *ac,
   struct ceph_x_ticket_handler *th,
   struct ceph_x_authorizer *au)
@@ -297,7 +306,7 @@ static int ceph_x_build_authorizer(struct ceph_auth_client 
*ac,
ceph_crypto_key_destroy(&au->session_key);
ret = ceph_crypto_key_clone(&au->session_key, &th->session_key);
if (ret)
-   return ret;
+   goto out_au;
 
maxlen = sizeof(*msg_a) + sizeof(msg_b) +
ceph_x_encrypt_buflen(ticket_blob_len);
@@ -309,8 +318,8 @@ static int ceph_x_build_authorizer(struct ceph_auth_client 
*ac,
if (!au->buf) {
au->buf = ceph_buffer_new(maxlen, GFP_NOFS);
if (!au->buf) {
-   ceph_crypto_key_destroy(&au->session_key);
-   return -ENOMEM;
+   ret = -ENOMEM;
+   goto out_au;
}
}
au->service = th->service;
@@ -340,7 +349,7 @@ static int ceph_x_build_authorizer(struct ceph_auth_client 
*ac,
ret = ceph_x_encrypt(&au->session_key, &msg_b, sizeof(msg_b),
 p, end - p);
if (ret < 0)
-   goto out_buf;
+   goto out_au;
p += ret;
au->buf->vec.iov_len = p - au->buf->vec.iov_base;
dout(" built authorizer nonce %llx len %d\n", au->nonce,
@@ -348,9 +357,8 @@ static int ceph_x_build_authorizer(struct ceph_auth_client 
*ac,
BUG_ON(au->buf->vec.iov_len > maxlen);
return 0;
 
-out_buf:
-   ceph_buffer_put(au->buf);
-   au->buf = NULL;
+out_au:
+   ceph_x_authorizer_cleanup(au);
return ret;
 }
 
@@ -624,8 +632,7 @@ static void ceph_x_destroy_authorizer(struct 
ceph_auth_client *ac,
 {
struct ceph_x_authorizer *au = (void *)a;
 
-   ceph_crypto_key_destroy(&au->session_key);
-   ceph_buffer_put(au->buf);
+   ceph_x_authorizer_cleanup(au);
kfree(au);
 }
 
@@ -653,8 +660,7 @@ static void ceph_x_destroy(struct ceph_auth_client *ac)
remove_ticket_handler(ac, th);
}
 
-   if (xi->auth_authorizer.buf)
-   ceph_buffer_put(xi->auth_authorizer.buf);
+   ceph_x_authorizer_cleanup(&xi->auth_authorizer);
 
kfree(ac->private);
ac->private = NULL;
diff --git a/net/ceph/crypto.h b/net/ceph/crypto.h
index d1498224c49d..2e9cab09f37b 100644
--- a/net/ceph/crypto.h
+++ b/net/ceph/crypto.h
@@ -16,8 +16,10 @@ struct ceph_crypto_key {
 
 static inline void ceph_crypto_key_destroy(struct ceph_crypto_key *key)
 {
-   if (key)
+   if (key) {
kfree(key->key);
+   key->key = NULL;
+   }
 }
 
 int ceph_crypto_key_clone(struct ceph_crypto_key *dst,
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: [PATCH] mark rbd requiring stable pages

2015-10-23 Thread Ilya Dryomov
On Fri, Oct 23, 2015 at 9:00 PM, ronny.hegew...@online.de
 wrote:
>> Could you share the entire log snippet for those 10 minutes?
>
> That's all in the logs.  But if more information would be useful, tell me which 
> logs
> to activate and I will give it another run. At least this part is easy to 
> reproduce.
>
>> Which kernel was this on?
>
> The latest kernel I used which produced the corruption was 3.19.8.
> The earliest one was 3.11.

No need for now, I'll poke around and report back.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] rbd: remove duplicate calls to rbd_dev_mapping_clear()

2015-10-23 Thread Ilya Dryomov
Commit d1cf5788450e ("rbd: set mapping info earlier") defined
rbd_dev_mapping_clear(), but, just a few days after, commit
f35a4dee14c3 ("rbd: set the mapping size and features later") moved
rbd_dev_mapping_set() calls and added another rbd_dev_mapping_clear()
call instead of moving the old one.  Around the same time, another
duplicate was introduced in rbd_dev_device_release() - kill both.

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 drivers/block/rbd.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index cf087c5a6aa5..f8f4beb72b84 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5251,8 +5251,6 @@ err_out_blkdev:
unregister_blkdev(rbd_dev->major, rbd_dev->name);
 err_out_id:
rbd_dev_id_put(rbd_dev);
-   rbd_dev_mapping_clear(rbd_dev);
-
return ret;
 }
 
@@ -5507,7 +5505,6 @@ static void rbd_dev_device_release(struct rbd_device 
*rbd_dev)
if (!single_major)
unregister_blkdev(rbd_dev->major, rbd_dev->name);
rbd_dev_id_put(rbd_dev);
-   rbd_dev_mapping_clear(rbd_dev);
 }
 
 static void rbd_dev_remove_parent(struct rbd_device *rbd_dev)
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mark rbd requiring stable pages

2015-10-23 Thread Ilya Dryomov
On Fri, Oct 23, 2015 at 1:56 AM, Ronny Hegewald
<ronny.hegew...@online.de> wrote:
> On Thursday 22 October 2015, Ilya Dryomov wrote:
>> Well, checksum mismatches are to be expected given what we are doing
>> now, but I wouldn't expect any data corruptions.  Ronny writes that he
>> saw frequent ext4 corruptions on krbd devices before he enabled stable
>> pages, which leads me to believe that the !crc case, for which we won't
>> be setting BDI_CAP_STABLE_WRITES, is going to be/remain broken.  Ronny,
>> could you describe it in more detail and maybe share some of those osd
>> logs with bad crc messages?
>>
> This is from a 10 minute period from one of the OSDs.
>
> 23:11:02.423728 ce5dfb70  0 bad crc in data 1657725429 != exp 496797267
> 23:11:37.586411 ce5dfb70  0 bad crc in data 1216602498 != exp 111161
> 23:12:07.805675 cc3ffb70  0 bad crc in data 3140625666 != exp 2614069504
> 23:12:44.485713 c96ffb70  0 bad crc in data 1712148977 != exp 3239079328
> 23:13:24.746217 ce5dfb70  0 bad crc in data 144620426 != exp 3156694286
> 23:13:52.792367 ce5dfb70  0 bad crc in data 4033880920 != exp 4159672481
> 23:14:22.958999 c96ffb70  0 bad crc in data 847688321 != exp 1551499144
> 23:16:35.015629 ce5dfb70  0 bad crc in data 2790209714 != exp 3779604715
> 23:17:48.482049 c96ffb70  0 bad crc in data 1563466764 != exp 528198494
> 23:19:28.925357 cc3ffb70  0 bad crc in data 1764275395 != exp 2075504274
> 23:19:59.039843 cc3ffb70  0 bad crc in data 2960172683 != exp 1215950691

Could you share the entire log snippet for those 10 minutes?

>
> The filesystem corruptions are usually ones with messages of
>
> EXT4-fs error (device rbd4): ext4_mb_generate_buddy:757: group 155, block
> bitmap and bg descriptor inconsistent: 23625 vs 23660 free clusters
>
> These were pretty common, at least every other day, often multiple times a
> day.
>
> Sometimes there was an additional
>
> JBD2: Spotted dirty metadata buffer (dev = rbd4, blocknr = 0). There's a risk
> of filesystem corruption in case of system crash.
>
> Another type of filesystem corruption I experienced through kernel
> compilations, which led to the following messages.
>
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #282221) -
> no `.' or `..'
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #273062) -
> no `.' or `..'
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #272270) -
> no `.' or `..'
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #282254) -
> no `.' or `..'
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #273070) -
> no `.' or `..'
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #272308) -
> no `.' or `..'
> EXT4-fs error (device rbd3): ext4_lookup:1417: inode #270033: comm rm: deleted
> inode referenced: 270039
> last message repeated 2 times
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #271534) -
> no `.' or `..'
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #271275) -
> no `.' or `..'
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #282290) -
> no `.' or `..'
> EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #281914) -
> no `.' or `..'
> EXT4-fs error (device rbd3): ext4_lookup:1417: inode #270033: comm rm: deleted
> inode referenced: 270039
> last message repeated 2 times
> kernel: EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm:
> deleted inode referenced: 282221
> EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted
> inode referenced: 282221
> EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted
> inode referenced: 281914
> EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted
> inode referenced: 281914
> EXT4-fs error: 243 callbacks suppressed
> EXT4-fs error (device rbd3): ext4_lookup:1417: inode #282002: comm cp: deleted
> inode referenced: 45375
> kernel: EXT4-fs error (device rbd3): ext4_lookup:1417: inode #282002: comm cp:
> deleted inode referenced: 45371
>
> The result was that various files and directories in the kernel source dir
> couldn't be accessed anymore and even fsck couldn't repair it, so I had to
> finally delete it. But these cases were pretty rare.
>
> Another issue was the data corruption in the files themselves, which happened
> independently of the filesystem corruptions.  These happened on most days,
> sometimes only once, sometimes multiple times a day.
>
> Newly written files that contained corrupted data seemed to always have it in only
> one place. The corrupt data replaced the original data of the file, but
> never changed the 

Re: [PATCH] mark rbd requiring stable pages

2015-10-22 Thread Ilya Dryomov
On Thu, Oct 22, 2015 at 7:22 PM, Mike Christie <micha...@cs.wisc.edu> wrote:
> On 10/22/15, 11:52 AM, Ilya Dryomov wrote:
>>
>> On Thu, Oct 22, 2015 at 5:37 PM, Mike Christie <micha...@cs.wisc.edu>
>> wrote:
>>>
>>> On 10/22/2015 06:20 AM, Ilya Dryomov wrote:
>>>>
>>>>
>>>>>>
>>>>>> If we are just talking about if stable pages are not used, and someone
>>>>>> is re-writing data to a page after the page has already been submitted
>>>>>> to the block layer (I mean the page is on some bio which is on a
>>>>>> request
>>>>>> which is on some request_queue scheduler list or basically anywhere in
>>>>>> the block layer), then I was saying this can occur with any block
>>>>>> driver. There is nothing that is preventing this from happening with a
>>>>>> FC driver or nvme or cciss or in dm or whatever. The app/user can
>>>>>> rewrite as late as when we are in the make_request_fn/request_fn.
>>>>>>
>>>>>> I think I am misunderstanding your question because I thought this is
>>>>>> expected behavior, and there is nothing drivers can do if the app is
>>>>>> not
>>>>>> doing a flush/sync between these types of write sequences.
>>>>
>>>> I don't see a problem with rewriting as late as when we are in
>>>> request_fn() (or in a wq after being put there by request_fn()).  Where
>>>> I thought there *might* be an issue is rewriting after sendpage(), if
>>>> sendpage() is used - perhaps some sneaky sequence similar to that
>>>> retransmit bug that would cause us to *transmit* incorrect bytes (as
>>>> opposed to *re*transmit) or something of that nature?
>>>
>>>
>>>
>>> Just to make sure we are on the same page.
>>>
>>> Are you concerned about the tcp/net layer retransmitting due to it
>>> detecting a issue as part of the tcp protocol, or are you concerned
>>> about rbd/libceph initiating a retry like with the nfs issue?
>>
>>
>> The former, tcp/net layer.  I'm just conjecturing though.
>>
>
> For iscsi, we normally use the sendpage path. Data digests are off by
> default and some distros do not even allow you to turn them on, so our
> sendpage path has got a lot of testing and we have not seen any corruptions.
> Not saying it is not possible, but just saying we have not seen any.

Great, that's reassuring.

>
> It could be due to a recent change. Ronny, tell us about the workload and I
> will check iscsi.
>
> Oh yeah, for the tcp/net retransmission case, I had said offlist, I thought
> there might be a issue with iscsi but I guess I was wrong, so I have not
> seen any issues with that either.

I'll drop my concerns then.  Those corruptions could be a bug in ceph
reconnect code or something else - regardless, that's separate from the
issue at hand.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mark rbd requiring stable pages

2015-10-22 Thread Ilya Dryomov
On Thu, Oct 22, 2015 at 6:07 AM, Mike Christie <micha...@cs.wisc.edu> wrote:
> On 10/21/2015 03:57 PM, Ilya Dryomov wrote:
>> On Wed, Oct 21, 2015 at 10:51 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
>>> On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
>>>> Hmm...  On the one hand, yes, we do compute CRCs, but that's optional,
>>>> so enabling this unconditionally is probably too harsh.  OTOH we are
>>>> talking to the network, which means all sorts of delays, retransmission
>>>> issues, etc, so I wonder how exactly "unstable" pages behave when, say,
>>>> added to an skb - you can't write anything to a page until networking
>>>> is fully done with it and expect it to work.  It's particularly
>>>> alarming that you've seen corruptions.
>>>>
>>>> Currently the only users of this flag are block integrity stuff and
>>>> md-raid5, which makes me wonder what iscsi, nfs and others do in this
>>>> area.  There's an old ticket on this topic somewhere on the tracker, so
>>>> I'll need to research this.  Thanks for bringing this up!
>>>
>>> Hi Mike,
>>>
>>> I was hoping to grab you for a few minutes, but you weren't there...
>>>
>>> I spent a better part of today reading code and mailing lists on this
>>> topic.  It is of course a bug that we use sendpage() which inlines
>>> pages into an skb and do nothing to keep those pages stable.  We have
>>> csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc
>>> case is an obvious fix.
>>>
>>> I looked at drbd and iscsi and I think iscsi could do the same - ditch
>>> the fallback to sock_no_sendpage() in the datadgst_en case (and get rid
>>> of iscsi_sw_tcp_conn::sendpage member while at it).  Using stable pages
>>> rather than having a roll-your-own implementation which doesn't close
>>> the race but only narrows it sounds like a win, unless copying through
>>> sendmsg() is for some reason cheaper than stable-waiting?
>
> Yeah, that is what I was saying on the call the other day, but the
> reception was bad. We only have the sendmsg code path when digests are on
> because that code came before stable pages. When stable pages were
> created, it was on by default but did not cover all the cases, so we
> left the code. It then handled most scenarios, but I just never got
> around to removing old the code. However, it was set to off by default
> so I left it and made this patch for iscsi to turn on stable pages:
>
> [this patch only enabled stable pages when digests/crcs are on and did
> not remove the code yet]
> https://groups.google.com/forum/#!topic/open-iscsi/n4jvWK7BPYM
>
> I did not really like the layering so I have not posted it for inclusion.

Good to know I got it right ;)

>
>
>
>>>
>>> drbd still needs the non-zero-copy version for its async protocol for
>>> when they free the pages before the NIC has chance to put them on the
>>> wire.  md-raid5 it turns out has an option to essentially disable most
>>> of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate
>>> if that option is enabled.
>>>
>>> What I'm worried about is the !crc (!datadgst_en) case.  I'm failing to
>>> convince myself that mucking with sendpage()ed pages while they sit in
>>> the TCP queue (or anywhere in the networking stack, really), is safe -
>>> there is nothing to prevent pages from being modified after sendpage()
>>> returned and Ronny reports data corruptions that pretty much went away
>>> with BDI_CAP_STABLE_WRITES set.  I may be, after prolonged staring at
>>> this, starting to confuse fs with block, though.  How does that work in
>>> iscsi land?
>
> This is what I was trying to ask about in the call the other day. Where
> is the corruption that Ronny was seeing. Was it checksum mismatches on
> data being written, or is incorrect meta data being written, etc?

Well, checksum mismatches are to be expected given what we are doing
now, but I wouldn't expect any data corruptions.  Ronny writes that he
saw frequent ext4 corruptions on krbd devices before he enabled stable
pages, which leads me to believe that the !crc case, for which we won't
be setting BDI_CAP_STABLE_WRITES, is going to be/remain broken.  Ronny,
could you describe it in more detail and maybe share some of those osd
logs with bad crc messages?

>
> If we are just talking about if stable pages are not used, and someone
> is re-writing data to a page after the page has already been submitted
> to the block layer (I mean the pag

Re: [PATCH] mark rbd requiring stable pages

2015-10-22 Thread Ilya Dryomov
On Thu, Oct 22, 2015 at 5:37 PM, Mike Christie <micha...@cs.wisc.edu> wrote:
> On 10/22/2015 06:20 AM, Ilya Dryomov wrote:
>>
>>> >
>>> > If we are just talking about if stable pages are not used, and someone
>>> > is re-writing data to a page after the page has already been submitted
>>> > to the block layer (I mean the page is on some bio which is on a request
>>> > which is on some request_queue scheduler list or basically anywhere in
>>> > the block layer), then I was saying this can occur with any block
>>> > driver. There is nothing that is preventing this from happening with a
>>> > FC driver or nvme or cciss or in dm or whatever. The app/user can
>>> > rewrite as late as when we are in the make_request_fn/request_fn.
>>> >
>>> > I think I am misunderstanding your question because I thought this is
>>> > expected behavior, and there is nothing drivers can do if the app is not
>>> > doing a flush/sync between these types of write sequences.
>> I don't see a problem with rewriting as late as when we are in
>> request_fn() (or in a wq after being put there by request_fn()).  Where
>> I thought there *might* be an issue is rewriting after sendpage(), if
>> sendpage() is used - perhaps some sneaky sequence similar to that
>> retransmit bug that would cause us to *transmit* incorrect bytes (as
>> opposed to *re*transmit) or something of that nature?
>
>
> Just to make sure we are on the same page.
>
> Are you concerned about the tcp/net layer retransmitting due to it
> detecting a issue as part of the tcp protocol, or are you concerned
> about rbd/libceph initiating a retry like with the nfs issue?

The former, tcp/net layer.  I'm just conjecturing though.

(We don't have the nfs issue, because even if the client sends such
a retransmit (which it won't), the primary OSD will reject it as
a dup.)

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mark rbd requiring stable pages

2015-10-21 Thread Ilya Dryomov
On Wed, Oct 21, 2015 at 10:51 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
> On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
>> Hmm...  On the one hand, yes, we do compute CRCs, but that's optional,
>> so enabling this unconditionally is probably too harsh.  OTOH we are
>> talking to the network, which means all sorts of delays, retransmission
>> issues, etc, so I wonder how exactly "unstable" pages behave when, say,
>> added to an skb - you can't write anything to a page until networking
>> is fully done with it and expect it to work.  It's particularly
>> alarming that you've seen corruptions.
>>
>> Currently the only users of this flag are block integrity stuff and
>> md-raid5, which makes me wonder what iscsi, nfs and others do in this
>> area.  There's an old ticket on this topic somewhere on the tracker, so
>> I'll need to research this.  Thanks for bringing this up!
>
> Hi Mike,
>
> I was hoping to grab you for a few minutes, but you weren't there...
>
> I spent a better part of today reading code and mailing lists on this
> topic.  It is of course a bug that we use sendpage() which inlines
> pages into an skb and do nothing to keep those pages stable.  We have
> csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc
> case is an obvious fix.
>
> I looked at drbd and iscsi and I think iscsi could do the same - ditch
> the fallback to sock_no_sendpage() in the datadgst_en case (and get rid
> of iscsi_sw_tcp_conn::sendpage member while at it).  Using stable pages
> rather than having a roll-your-own implementation which doesn't close
> the race but only narrows it sounds like a win, unless copying through
> sendmsg() is for some reason cheaper than stable-waiting?
>
> drbd still needs the non-zero-copy version for its async protocol for
> when they free the pages before the NIC has chance to put them on the
> wire.  md-raid5 it turns out has an option to essentially disable most
> of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate
> if that option is enabled.
>
> What I'm worried about is the !crc (!datadgst_en) case.  I'm failing to
> convince myself that mucking with sendpage()ed pages while they sit in
> the TCP queue (or anywhere in the networking stack, really), is safe -
> there is nothing to prevent pages from being modified after sendpage()
> returned and Ronny reports data corruptions that pretty much went away
> with BDI_CAP_STABLE_WRITES set.  I may be, after prolonged staring at
> this, starting to confuse fs with block, though.  How does that work in
> iscsi land?
>
> (There was/is also this [1] bug, which is kind of related and probably
> worth looking into at some point later.  ceph shouldn't be that easily
> affected - we've got state, but there is a ticket for it.)
>
> [1] http://www.spinics.net/lists/linux-nfs/msg34913.html

And now with Mike on the CC and a mention that at least one scenario of
[1] got fixed in NFS by a6b31d18b02f ("SUNRPC: Fix a data corruption
issue when retransmitting RPC calls").

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mark rbd requiring stable pages

2015-10-21 Thread Ilya Dryomov
On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
> Hmm...  On the one hand, yes, we do compute CRCs, but that's optional,
> so enabling this unconditionally is probably too harsh.  OTOH we are
> talking to the network, which means all sorts of delays, retransmission
> issues, etc, so I wonder how exactly "unstable" pages behave when, say,
> added to an skb - you can't write anything to a page until networking
> is fully done with it and expect it to work.  It's particularly
> alarming that you've seen corruptions.
>
> Currently the only users of this flag are block integrity stuff and
> md-raid5, which makes me wonder what iscsi, nfs and others do in this
> area.  There's an old ticket on this topic somewhere on the tracker, so
> I'll need to research this.  Thanks for bringing this up!

Hi Mike,

I was hoping to grab you for a few minutes, but you weren't there...

I spent a better part of today reading code and mailing lists on this
topic.  It is of course a bug that we use sendpage() which inlines
pages into an skb and do nothing to keep those pages stable.  We have
csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc
case is an obvious fix.

I looked at drbd and iscsi and I think iscsi could do the same - ditch
the fallback to sock_no_sendpage() in the datadgst_en case (and get rid
of iscsi_sw_tcp_conn::sendpage member while at it).  Using stable pages
rather than having a roll-your-own implementation which doesn't close
the race but only narrows it sounds like a win, unless copying through
sendmsg() is for some reason cheaper than stable-waiting?

drbd still needs the non-zero-copy version for its async protocol for
when they free the pages before the NIC has chance to put them on the
wire.  md-raid5 it turns out has an option to essentially disable most
of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate
if that option is enabled.

What I'm worried about is the !crc (!datadgst_en) case.  I'm failing to
convince myself that mucking with sendpage()ed pages while they sit in
the TCP queue (or anywhere in the networking stack, really), is safe -
there is nothing to prevent pages from being modified after sendpage()
returned and Ronny reports data corruptions that pretty much went away
with BDI_CAP_STABLE_WRITES set.  I may be, after prolonged staring at
this, starting to confuse fs with block, though.  How does that work in
iscsi land?

(There was/is also this [1] bug, which is kind of related and probably
worth looking into at some point later.  ceph shouldn't be that easily
affected - we've got state, but there is a ticket for it.)

[1] http://www.spinics.net/lists/linux-nfs/msg34913.html

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Net: ceph: messenger: Use local variable cursor in read_partial_msg_data()

2015-10-19 Thread Ilya Dryomov
On Mon, Oct 19, 2015 at 4:49 AM, Shraddha Barke  wrote:
> Use local variable cursor in place of &msg->cursor in
>  read_partial_msg_data()
>
> Signed-off-by: Shraddha Barke 
> ---
> Changes in v2-
>  Drop incorrect use of cursor
>
>  net/ceph/messenger.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index b9b0e3b..b087edd 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -2246,7 +2246,7 @@ static int read_partial_msg_data(struct ceph_connection 
> *con)
> if (do_datacrc)
> crc = con->in_data_crc;
> while (cursor->resid) {
> -   page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
> +   page = ceph_msg_data_next(cursor, &page_offset, &length,
> NULL);
> ret = ceph_tcp_recvpage(con->sock, page, page_offset, length);
> if (ret <= 0) {
> @@ -2258,7 +2258,7 @@ static int read_partial_msg_data(struct ceph_connection 
> *con)
>
> if (do_datacrc)
> crc = ceph_crc32c_page(crc, page, page_offset, ret);
> -   (void) ceph_msg_data_advance(&msg->cursor, (size_t)ret);
> +   (void) ceph_msg_data_advance(cursor, (size_t)ret);
> }
> if (do_datacrc)
> con->in_data_crc = crc;

This looks good.  I'd like you to update this patch to do the same for
write_partial_message_data().

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] net: ceph: messenger: Use local variable cursor instead of &msg->cursor

2015-10-19 Thread Ilya Dryomov
On Mon, Oct 19, 2015 at 6:29 PM, Shraddha Barke  wrote:
> Use local variable cursor in place of &msg->cursor in
> read_partial_msg_data() and write_partial_msg_data()
>
> Signed-off-by: Shraddha Barke 
> ---
> Changes in v3-
>  Replace &msg->cursor with cursor in write_partial_msg_data() too
>
>  net/ceph/messenger.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index b9b0e3b..01b567b 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -1552,8 +1552,8 @@ static int write_partial_message_data(struct 
> ceph_connection *con)
> bool need_crc;
> int ret;
>
> -   page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
> -   &last_piece);
> +   page = ceph_msg_data_next(cursor, &page_offset, &length,
> + &last_piece);
> ret = ceph_tcp_sendpage(con->sock, page, page_offset,
> length, !last_piece);
> if (ret <= 0) {
> @@ -1564,7 +1564,7 @@ static int write_partial_message_data(struct 
> ceph_connection *con)
> }
> if (do_datacrc && cursor->need_crc)
> crc = ceph_crc32c_page(crc, page, page_offset, 
> length);
> -   need_crc = ceph_msg_data_advance(&msg->cursor, (size_t)ret);
> +   need_crc = ceph_msg_data_advance(cursor, (size_t)ret);
> }
>
> dout("%s %p msg %p done\n", __func__, con, msg);
> @@ -2246,8 +2246,8 @@ static int read_partial_msg_data(struct ceph_connection 
> *con)
> if (do_datacrc)
> crc = con->in_data_crc;
> while (cursor->resid) {
> -   page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
> -   NULL);
> +   page = ceph_msg_data_next(cursor, &page_offset, &length,
> + NULL);
> ret = ceph_tcp_recvpage(con->sock, page, page_offset, length);
> if (ret <= 0) {
> if (do_datacrc)
> @@ -2258,7 +2258,7 @@ static int read_partial_msg_data(struct ceph_connection 
> *con)
>
> if (do_datacrc)
> crc = ceph_crc32c_page(crc, page, page_offset, ret);
> -   (void) ceph_msg_data_advance(&msg->cursor, (size_t)ret);
> +   (void) ceph_msg_data_advance(cursor, (size_t)ret);
> }
> if (do_datacrc)
> con->in_data_crc = crc;

Applied, see
https://github.com/ceph/ceph-client/commit/621a56fd69751d263795f4f35e65eff7daa3a470.

Generally, we prefix net/ceph commits with "libceph".  Different
subsystems have different conventions, use "git log --oneline <dir>"
to get a sense of what is preferred.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] rbd: return -ENOMEM instead of pool id if rbd_dev_create() fails

2015-10-19 Thread Ilya Dryomov
Returning pool id (i.e. >= 0) from a sysfs ->store() callback makes
userspace think it needs to retry the write.  Fix it - it's a leftover
from the times when the equivalent of rbd_dev_create() was the first
action in rbd_add().

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 drivers/block/rbd.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 07f666f4ca18..df795deffe77 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5394,7 +5394,7 @@ static ssize_t do_rbd_add(struct bus_type *bus,
struct rbd_spec *spec = NULL;
struct rbd_client *rbdc;
bool read_only;
-   int rc = -ENOMEM;
+   int rc;
 
if (!try_module_get(THIS_MODULE))
return -ENODEV;
@@ -5429,8 +5429,10 @@ static ssize_t do_rbd_add(struct bus_type *bus,
}
 
rbd_dev = rbd_dev_create(rbdc, spec, rbd_opts);
-   if (!rbd_dev)
+   if (!rbd_dev) {
+   rc = -ENOMEM;
goto err_out_client;
+   }
rbdc = NULL;/* rbd_dev now owns this */
spec = NULL;/* rbd_dev now owns this */
rbd_opts = NULL;/* rbd_dev now owns this */
-- 
2.4.3
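
(Background on the sysfs convention the changelog relies on: a ->store()
callback returns the number of bytes it consumed, so a non-negative value
smaller than count - e.g. a small pool id - looks like a short write and
makes userspace retry with the remaining bytes.  A minimal sketch of the
expected shape, with hypothetical names:

    static ssize_t example_store(struct bus_type *bus, const char *buf,
                                 size_t count)
    {
            int rc = example_parse_and_apply(buf, count);   /* hypothetical helper */

            if (rc < 0)
                    return rc;      /* negative errno is reported to the writer */
            return count;           /* claim the whole write on success */
    }

On success the callback should return count, and on failure a negative errno.)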

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] rbd: set device_type::release instead of device::release

2015-10-19 Thread Ilya Dryomov
No point in providing an empty device_type::release callback and then
setting device::release for each rbd_dev dynamically.

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 drivers/block/rbd.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 4917de726bdb..fa6767e9ed2a 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3986,14 +3986,12 @@ static const struct attribute_group *rbd_attr_groups[] 
= {
NULL
 };
 
-static void rbd_sysfs_dev_release(struct device *dev)
-{
-}
+static void rbd_dev_release(struct device *dev);
 
 static struct device_type rbd_device_type = {
.name   = "rbd",
.groups = rbd_attr_groups,
-   .release= rbd_sysfs_dev_release,
+   .release= rbd_dev_release,
 };
 
 static struct rbd_spec *rbd_spec_get(struct rbd_spec *spec)
@@ -4074,7 +4072,6 @@ static struct rbd_device *rbd_dev_create(struct 
rbd_client *rbdc,
rbd_dev->dev.bus = &rbd_bus_type;
rbd_dev->dev.type = &rbd_device_type;
rbd_dev->dev.parent = &rbd_root_dev;
-   rbd_dev->dev.release = rbd_dev_release;
device_initialize(&rbd_dev->dev);
 
rbd_dev->rbd_client = rbdc;
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] rbd: don't free rbd_dev outside of the release callback

2015-10-19 Thread Ilya Dryomov
struct rbd_device has struct device embedded in it, which means it's
part of kobject universe and has an unpredictable life cycle.  Freeing
its memory outside of the release callback is flawed, yet commits
200a6a8be5db ("rbd: don't destroy rbd_dev in device release function")
and 8ad42cd0c002 ("rbd: don't have device release destroy rbd_dev")
moved rbd_dev_destroy() out to rbd_dev_image_release().

This commit reverts most of that, the key points are:

- rbd_dev->dev is initialized in rbd_dev_create(), making it possible
  to use rbd_dev_destroy() - which is just a put_device() - both before
  we register with device core and after.

- rbd_dev_release() (the release callback) is the only place we
  kfree(rbd_dev).  It's also where we do module_put(), keeping the
  module unload race window as small as possible.

- We pin the module in rbd_dev_create(), but only for mapping
  rbd_dev-s.  Moving image related stuff out of struct rbd_device into
  another struct which isn't tied with sysfs and device core is long
  overdue, but until that happens, this will keep rbd module refcount
  (which users can observe with lsmod) sane.

Fixes: http://tracker.ceph.com/issues/12697

Cc: Alex Elder <el...@linaro.org>
Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 drivers/block/rbd.c | 89 -
 1 file changed, 47 insertions(+), 42 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index df795deffe77..4917de726bdb 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -418,8 +418,6 @@ MODULE_PARM_DESC(single_major, "Use a single major number 
for all rbd devices (d
 
 static int rbd_img_request_submit(struct rbd_img_request *img_request);
 
-static void rbd_dev_device_release(struct device *dev);
-
 static ssize_t rbd_add(struct bus_type *bus, const char *buf,
   size_t count);
 static ssize_t rbd_remove(struct bus_type *bus, const char *buf,
@@ -4038,6 +4036,25 @@ static void rbd_spec_free(struct kref *kref)
kfree(spec);
 }
 
+static void rbd_dev_release(struct device *dev)
+{
+   struct rbd_device *rbd_dev = dev_to_rbd_dev(dev);
+   bool need_put = !!rbd_dev->opts;
+
+   rbd_put_client(rbd_dev->rbd_client);
+   rbd_spec_put(rbd_dev->spec);
+   kfree(rbd_dev->opts);
+   kfree(rbd_dev);
+
+   /*
+* This is racy, but way better than putting module outside of
+* the release callback.  The race window is pretty small, so
+* doing something similar to dm (dm-builtin.c) is overkill.
+*/
+   if (need_put)
+   module_put(THIS_MODULE);
+}
+
 static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,
 struct rbd_spec *spec,
 struct rbd_options *opts)
@@ -4054,6 +4071,12 @@ static struct rbd_device *rbd_dev_create(struct 
rbd_client *rbdc,
INIT_LIST_HEAD(&rbd_dev->node);
init_rwsem(&rbd_dev->header_rwsem);
 
+   rbd_dev->dev.bus = &rbd_bus_type;
+   rbd_dev->dev.type = &rbd_device_type;
+   rbd_dev->dev.parent = &rbd_root_dev;
+   rbd_dev->dev.release = rbd_dev_release;
+   device_initialize(&rbd_dev->dev);
+
rbd_dev->rbd_client = rbdc;
rbd_dev->spec = spec;
rbd_dev->opts = opts;
@@ -4065,15 +4088,21 @@ static struct rbd_device *rbd_dev_create(struct 
rbd_client *rbdc,
rbd_dev->layout.fl_object_size = cpu_to_le32(1 << RBD_MAX_OBJ_ORDER);
rbd_dev->layout.fl_pg_pool = cpu_to_le32((u32) spec->pool_id);
 
+   /*
+* If this is a mapping rbd_dev (as opposed to a parent one),
+* pin our module.  We have a ref from do_rbd_add(), so use
+* __module_get().
+*/
+   if (rbd_dev->opts)
+   __module_get(THIS_MODULE);
+
return rbd_dev;
 }
 
 static void rbd_dev_destroy(struct rbd_device *rbd_dev)
 {
-   rbd_put_client(rbd_dev->rbd_client);
-   rbd_spec_put(rbd_dev->spec);
-   kfree(rbd_dev->opts);
-   kfree(rbd_dev);
+   if (rbd_dev)
+   put_device(&rbd_dev->dev);
 }
 
 /*
@@ -4699,27 +4728,6 @@ static int rbd_dev_header_info(struct rbd_device 
*rbd_dev)
return rbd_dev_v2_header_info(rbd_dev);
 }
 
-static int rbd_bus_add_dev(struct rbd_device *rbd_dev)
-{
-   struct device *dev;
-   int ret;
-
-   dev = &rbd_dev->dev;
-   dev->bus = &rbd_bus_type;
-   dev->type = &rbd_device_type;
-   dev->parent = &rbd_root_dev;
-   dev->release = rbd_dev_device_release;
-   dev_set_name(dev, "%d", rbd_dev->dev_id);
-   ret = device_register(dev);
-
-   return ret;
-}
-
-static void rbd_bus_del_dev(struct rbd_device *rbd_dev)
-{
-   device_unregister(&rbd_dev->dev);
-}
-
 /*
  * Get a unique rbd identifier for the given new rbd_dev, and add
  * the rbd_dev to the global lis

Re: [PATCH] Net: ceph: osd_client: Remove con argument in handle_reply

2015-10-18 Thread Ilya Dryomov
On Sun, Oct 18, 2015 at 10:25 AM, Shraddha Barke
 wrote:
> Since the function handle_reply does not use it's con argument,
> remove it.
>
> Signed-off-by: Shraddha Barke 
> ---
>  net/ceph/osd_client.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 80b94e3..e281f60 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -1745,8 +1745,7 @@ static void complete_request(struct ceph_osd_request 
> *req)
>   * handle osd op reply.  either call the callback if it is specified,
>   * or do the completion to wake up the waiting thread.
>   */
> -static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg,
> -struct ceph_connection *con)
> +static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg)
>  {
> void *p, *end;
> struct ceph_osd_request *req;
> @@ -2802,7 +2801,7 @@ static void dispatch(struct ceph_connection *con, 
> struct ceph_msg *msg)
> ceph_osdc_handle_map(osdc, msg);
> break;
> case CEPH_MSG_OSD_OPREPLY:
> -   handle_reply(osdc, msg, con);
> +   handle_reply(osdc, msg);
> break;
> case CEPH_MSG_WATCH_NOTIFY:
> handle_watch_notify(osdc, msg);

Applied, with a slightly tweaked change log - see
https://github.com/ceph/ceph-client/commit/81c842af2b39009ec1f8272a33b149ad0e845aaf.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Net: ceph: messenger: Use local variable cursor in read_partial_msg_data()

2015-10-18 Thread Ilya Dryomov
On Sun, Oct 18, 2015 at 12:00 PM, Shraddha Barke
 wrote:
> Use local variable cursor in place of &msg->cursor in
>  read_partial_msg_data()
>
> Signed-off-by: Shraddha Barke 
> ---
>  net/ceph/messenger.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index b9b0e3b..b3d1973 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -2231,7 +2231,7 @@ static int read_partial_message_section(struct 
> ceph_connection *con,
>  static int read_partial_msg_data(struct ceph_connection *con)
>  {
> struct ceph_msg *msg = con->in_msg;
> -   struct ceph_msg_data_cursor *cursor = &msg->cursor;
> +   struct ceph_msg_data_cursor *cursor = cursor;
  
This is not going to work...

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] rbd: don't leak parent_spec in rbd_dev_probe_parent()

2015-10-16 Thread Ilya Dryomov
On Thu, Oct 15, 2015 at 11:10 PM, Alex Elder <el...@ieee.org> wrote:
> On 10/11/2015 01:03 PM, Ilya Dryomov wrote:
>> Currently we leak parent_spec and trigger a "parent reference
>> underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails.
>> The problem is we take the !parent out_err branch and that only drops
>> refcounts; parent_spec that would've been freed had we called
>> rbd_dev_unparent() remains and triggers rbd_warn() in
>> rbd_dev_parent_put() - at that point we have parent_spec != NULL and
>> parent_ref == 0, so counter ends up being -1 after the decrement.
>
> OK, now that I understand the context...
>
> You now get extra references for the spec and client
> for the parent only after creating the parent device.
> I think the reason they logically belonged before
> the call to rbd_device_create() for the parent is
> because the client and spec pointers passed to that
> function carry with them references that are passed
> to the resulting rbd_device if successful.

Let me stress that those two get()s that I moved is not the point of
the patch at all.  Whether we do it before or after rbd_dev_create() is
pretty much irrelevant, the only reason I moved those calls is to make
the error path simpler.

>
> If creating the parent device fails, you unparent the
> original device, which will still have a null parent
> pointer.  The effect of unparenting in this case is
> dropping a reference to the parent's spec, and clearing
> the device's pointer to it.  This is confusing, but
> let's run with it.
>
> If creating the parent device succeeds, references to
> the client and parent spec are taken (basically, these
> belong to the just-created parent device).  The parent
> image is now probed.  If this fails, you again
> unparent the device.  We still have not set the
> device's parent pointer, so the effect is as before,
> dropping the parent spec reference and clearing
> the device's reference to it.  The error handling
> now destroys the parent, which drops references to
> its client and the spec.  Again, this seems
> confusing, as if we've dropped one more reference
> to the parent spec than desired.
>
> This logic now seems to work.  But it's working
> around a flaw in the caller I think.  Upon entry
> to rbd_dev_probe_parent(), a layered device will
> have have a non-null parent_spec pointer (and a
> reference to it), which will have been filled in
> by rbd_dev_v2_parent_info().
>
> Really, it should not be rbd_dev_probe_parent()
> that drops that parent spec reference on error.
> Instead, rbd_dev_image_probe() (which got the
> reference to the parent spec) should handle
> cleaning up the device's parent spec if an
> error occurs after it has been assigned.
>
> I'll wait for your response, I'd like to know
> if what I'm saying makes sense.

All of the above makes sense.  I agree that it is asymmetrical and that
it would have been better to have rbd_dev_image_probe() drop the last
ref on ->parent_spec.  But, that's not what all of the existing code is
doing.

Consider rbd_dev_probe_parent() without my patch.  There are two
out_err jumps.  As it is, if rbd_dev_create() fails, we only revert
those two get()s and return with alive ->parent_spec.  However, if
rbd_dev_image_probe() fails, we call rbd_dev_unparent(), which will put
->parent_spec.  Really, the issue is rbd_dev_unparent(), which not only
removes the parent rbd_dev, but also the parent spec.  All I did with
this patch was align both out_err cases to do the same, namely undo the
effects of rbd_dev_v2_parent_info() - I didn't want to create yet
another error path.
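
(A rough skeleton of the flow being described, illustrative only and not the
driver source:

    parent = rbd_dev_create(rbd_dev->rbd_client, rbd_dev->parent_spec, NULL);
    if (!parent) {
            ret = -ENOMEM;
            goto out_err;           /* with the patch: same cleanup as below */
    }
    /* the "two get()s": ref the client and the parent spec for the parent dev */
    ret = rbd_dev_image_probe(parent, false);
    if (ret < 0)
            goto out_err;

    rbd_dev->parent = parent;
    return 0;

    out_err:
    rbd_dev_unparent(rbd_dev);      /* also puts ->parent_spec */
    rbd_dev_destroy(parent);        /* safe for parent == NULL */
    return ret;

Both failure branches now funnel through the same cleanup, which is the
alignment described above.)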

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mark rbd requiring stable pages

2015-10-16 Thread Ilya Dryomov
On Thu, Oct 15, 2015 at 8:50 PM, Ronny Hegewald
 wrote:
> rbd requires stable pages, as it performs a crc of the page data before they
> are sent to the OSDs.
>
> But since kernel 3.9 (patch 1d1d1a767206fbe5d4c69493b7e6d2a8d08cc0a0 "mm: only
> enforce stable page writes if the backing device requires it") it is not
> assumed anymore that block devices require stable pages.
>
> This patch sets the necessary flag to get stable pages back for rbd.
>
> In a ceph installation that provides multiple ext4 formatted rbd devices "bad
> crc" messages appeared regularly (ca 1 message every 1-2 minutes on every OSD
> that provided the data for the rbd) in the OSD-logs before this patch. After
> this patch these messages are pretty much gone (only ca 1-2 / month / OSD).
>
> This patch seems also to fix data and filesystem corruptions on ext4 formatted
> rbd devices that were previously seen on pretty much a daily basis. But it is
> unknown at the moment why this is the case.
>
> Signed-off-by: Ronny Hegewald 
>
> ---
>
> That this patch really affects the mentioned corruption issue I could verify
> through the system logs. Since installation of this patch I
> have seen only 2-3 filesystem corruptions. But these could also just be
> leftovers of corruptions that happened before the installation but got noticed
> by ext4 only later, after the patched kernel was installed. This seems even
> more likely as I have not seen a single data corruption issue since the patch.
>
> --- linux/drivers/block/rbd.c.org   2015-10-07 01:32:55.90667 +
> +++ linux/drivers/block/rbd.c   2015-09-04 02:21:22.34999 +
> @@ -3786,6 +3786,7 @@
>
> blk_queue_merge_bvec(q, rbd_merge_bvec);
> disk->queue = q;
> +   disk->queue->backing_dev_info.capabilities |= BDI_CAP_STABLE_WRITES;
>
> q->queuedata = rbd_dev;

Hmm...  On the one hand, yes, we do compute CRCs, but that's optional,
so enabling this unconditionally is probably too harsh.  OTOH we are
talking to the network, which means all sorts of delays, retransmission
issues, etc, so I wonder how exactly "unstable" pages behave when, say,
added to an skb - you can't write anything to a page until networking
is fully done with it and expect it to work.  It's particularly
alarming that you've seen corruptions.

Currently the only users of this flag are block integrity stuff and
md-raid5, which makes me wonder what iscsi, nfs and others do in this
area.  There's an old ticket on this topic somewhere on the tracker, so
I'll need to research this.  Thanks for bringing this up!

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] rbd: don't leak parent_spec in rbd_dev_probe_parent()

2015-10-16 Thread Ilya Dryomov
On Fri, Oct 16, 2015 at 1:50 PM, Alex Elder <el...@ieee.org> wrote:
> On 10/16/2015 04:50 AM, Ilya Dryomov wrote:
>> On Thu, Oct 15, 2015 at 11:10 PM, Alex Elder <el...@ieee.org> wrote:
>>> On 10/11/2015 01:03 PM, Ilya Dryomov wrote:
>>>> Currently we leak parent_spec and trigger a "parent reference
>>>> underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails.
>>>> The problem is we take the !parent out_err branch and that only drops
>>>> refcounts; parent_spec that would've been freed had we called
>>>> rbd_dev_unparent() remains and triggers rbd_warn() in
>>>> rbd_dev_parent_put() - at that point we have parent_spec != NULL and
>>>> parent_ref == 0, so counter ends up being -1 after the decrement.
>>>
>>> OK, now that I understand the context...
>>>
>>> You now get extra references for the spec and client
>>> for the parent only after creating the parent device.
>>> I think the reason they logically belonged before
>>> the call to rbd_device_create() for the parent is
>>> because the client and spec pointers passed to that
>>> function carry with them references that are passed
>>> to the resulting rbd_device if successful.
>>
>> Let me stress that those two get()s that I moved is not the point of
>> the patch at all.  Whether we do it before or after rbd_dev_create() is
>> pretty much irrelevant, the only reason I moved those calls is to make
>> the error path simpler.
>
> Yes I understand that.
>
>>> If creating the parent device fails, you unparent the
>>> original device, which will still have a null parent
>>> pointer.  The effect of unparenting in this case is
>>> dropping a reference to the parent's spec, and clearing
>>> the device's pointer to it.  This is confusing, but
>>> let's run with it.
>>>
>>> If creating the parent device succeeds, references to
>>> the client and parent spec are taken (basically, these
>>> belong to the just-created parent device).  The parent
>>> image is now probed.  If this fails, you again
>>> unparent the device.  We still have not set the
>>> device's parent pointer, so the effect is as before,
>>> dropping the parent spec reference and clearing
>>> the device's reference to it.  The error handling
>>> now destroys the parent, which drops references to
>>> its client and the spec.  Again, this seems
>>> confusing, as if we've dropped one more reference
>>> to the parent spec than desired.
>>>
>>> This logic now seems to work.  But it's working
>>> around a flaw in the caller I think.  Upon entry
>>> to rbd_dev_probe_parent(), a layered device will
>>> have have a non-null parent_spec pointer (and a
>>> reference to it), which will have been filled in
>>> by rbd_dev_v2_parent_info().
>>>
>>> Really, it should not be rbd_dev_probe_parent()
>>> that drops that parent spec reference on error.
>>> Instead, rbd_dev_image_probe() (which got the
>>> reference to the parent spec) should handle
>>> cleaning up the device's parent spec if an
>>> error occurs after it has been assigned.
>>>
>>> I'll wait for your response, I'd like to know
>>> if what I'm saying makes sense.
>>
>> All of the above makes sense.  I agree that it is asymmetrical and that
>> it would have been better to have rbd_dev_image_probe() drop the last
>> ref on ->parent_spec.  But, that's not what all of the existing code is
>> doing.
>
> Agreed.
>
>> Consider rbd_dev_probe_parent() without my patch.  There are two
>> out_err jumps.  As it is, if rbd_dev_create() fails, we only revert
>> those two get()s and return with alive ->parent_spec.  However, if
>> rbd_dev_image_probe() fails, we call rbd_dev_unparent(), which will put
>> ->parent_spec.  Really, the issue is rbd_dev_unparent(), which not only
>> removes the parent rbd_dev, but also the parent spec.  All I did with
>> this patch was align both out_err cases to do the same, namely undo the
>> effects of rbd_dev_v2_parent_info() - I didn't want to create yet
>> another error path.
>
> OK.  Walking through the code I concluded that what
> you did made things "line up" properly.  But I just felt
> like it wasn't necessarily the *right* fix, but it was
> a correct fix.
>
> The only point in suggesting what I think would be a better
> fix is that it would make things easier to reason about
> and get correct, and in so doing, make it easier to
> maintain and adapt in the future.

Well, I'm fixing another rbd_dev refcount/use-after-free problem right
now, so I think the only thing that will really make this easier to
maintain is to refactor it into just a couple of symmetrical functions.
Making a single change (e.g. patching rbd_dev_unparent() to only touch
->parent and not ->parent_spec) is going to be invasive - there are
just way too many different error paths.

Thanks,

Ilya


Re: [PATCH] rbd: set max_sectors explicitly

2015-10-16 Thread Ilya Dryomov
On Fri, Oct 16, 2015 at 2:22 PM, Alex Elder <el...@ieee.org> wrote:
> On 10/07/2015 12:00 PM, Ilya Dryomov wrote:
>> Commit 30e2bc08b2bb ("Revert "block: remove artifical max_hw_sectors
>> cap"") restored a clamp on max_sectors.  It's now 2560 sectors instead
>> of 1024, but it's not good enough: we set max_hw_sectors to rbd object
>> size because we don't want object sized I/Os to be split, and the
>> default object size is 4M.
>>
>> So, set max_sectors to max_hw_sectors in rbd at queue init time.
>>
>> Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
>> ---
>>  drivers/block/rbd.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index 05072464d25e..04e69b4df664 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -3760,6 +3760,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
>>   /* set io sizes to object size */
>>   segment_size = rbd_obj_bytes(&rbd_dev->header);
>>   blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);
>> + q->limits.max_sectors = queue_max_hw_sectors(q);
>
> This currently is done by default by blk_queue_max_hw_sectors().
> Do you see any different behavior with this patch in place?
>
> This change seems at least harmless so if it's not too late:

Not quite.  The commit mentioned above reintroduced this bit into
blk_limits_max_hw_sectors() in 4.3-rc1:

241 limits->max_sectors = min_t(unsigned int, max_hw_sectors,
242 BLK_DEF_MAX_SECTORS);

So this patch fixes a performance regression.  If you spin up 4.3-rc1+
and map an rbd image without changing any settings, the best you'll see
is 1280k I/Os despite a 4096k (typically) max_hw_sectors.
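
The arithmetic behind those numbers, as a standalone sketch (not kernel
code; 2560 is BLK_DEF_MAX_SECTORS as of 4.3-rc1):

    #include <stdio.h>

    int main(void)
    {
            unsigned int object_size = 4 << 20;              /* default 4M rbd object */
            unsigned int max_hw_sectors = object_size / 512; /* 8192 sectors */
            unsigned int blk_def_max_sectors = 2560;         /* BLK_DEF_MAX_SECTORS */
            unsigned int max_sectors = max_hw_sectors < blk_def_max_sectors ?
                                       max_hw_sectors : blk_def_max_sectors;

            printf("without the patch: %uk\n", max_sectors / 2);    /* 1280k */
            printf("with the patch:    %uk\n", max_hw_sectors / 2); /* 4096k */
            return 0;
    }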

Thanks,

Ilya


Re: [PATCH] ceph/osd_client: add support for CEPH_OSD_OP_GETXATTR

2015-10-15 Thread Ilya Dryomov
On Thu, Oct 15, 2015 at 12:51 AM, David Disseldorp <dd...@suse.de> wrote:
> On Wed, 14 Oct 2015 19:57:46 +0200, Ilya Dryomov wrote:
>
>> On Wed, Oct 14, 2015 at 7:37 PM, David Disseldorp <dd...@suse.de> wrote:
> ...
>> > Ping, any feedback on the patch?
>>
>> The patch itself looks OK, except for the part where you rename a local
>> variable for no reason, AFAICT.
>
> Thanks for the feedback Ilya. I presume you're referring to the pagelist
> -> req_pagelist rename - I'll drop that change and send an update.

Also, the prefix for net/ceph patches is libceph.

>
>> We've discussed some of this last week.  As it is, all rbd image
>> properties are stored in omap, so PR info strings stored in xattrs is
>> something different, but it should be fine for now.  I'd rather not
>> merge any kernel patches related to rbd-target work until we see
>> a complete patchset though.  There's been a lot of back and forth
>> between Mike, Christoph and target people and the general approach had
>> changed at least twice, so I'd like to wait for things to settle down.
>
> I understand the desire to wait on any RBD-target changes until upstream
> consensus has been reached, but I'd argue that this change is isolated,
> and complementary to the existing SETXATTR / CMPXATTR support.

That's true, but we already have a bunch of unused code in net/ceph,
which was added using precisely this reasoning, so I'm a bit hesitant.

Thanks,

Ilya


Re: [PATCH] rbd: don't leak parent_spec in rbd_dev_probe_parent()

2015-10-15 Thread Ilya Dryomov
On Thu, Oct 15, 2015 at 7:10 PM, Alex Elder <el...@ieee.org> wrote:
> On 10/11/2015 01:03 PM, Ilya Dryomov wrote:
>> Currently we leak parent_spec and trigger a "parent reference
>> underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails.
>> The problem is we take the !parent out_err branch and that only drops
>> refcounts; parent_spec that would've been freed had we called
>> rbd_dev_unparent() remains and triggers rbd_warn() in
>> rbd_dev_parent_put() - at that point we have parent_spec != NULL and
>> parent_ref == 0, so counter ends up being -1 after the decrement.
>>
>> Redo rbd_dev_probe_parent() to fix this.
>
> I'm just starting to look at this.
>
> My up-front problem is that I want to understand why
> it's OK to no longer grab references to the parent spec
> and the client before creating the new parent device.
> Is it simply because the rbd_dev that's the subject
> of this call is already holding a reference to both,
> so no need to get another?  I think that's right, but
> you should say that in the explanation.  It's a
> change that's independent of the other change (which
> you describe).

I still grab references, I just moved that code after rbd_dev_create().
Fundamentally rbd_dev_create() is just a memory allocation, so whether
we acquire references before or after doesn't matter, except that
acquiring them before complicates the error path.

>
> OK, onto the change you describe.
>
> You say that we get the underflow if rbd_dev_create() fails.
>
> At that point in the function, we have:
> parent == NULL
> rbd_dev->parent_spec != NULL
> parent_spec holds a new reference to that
> rbdc holds a new reference to the client
>
> rbd_dev_create() only returns NULL if the kzalloc() fails.
> And if so, we jump to out_err, take the !parent branch,
> and we drop the references we took in rbdc and parent_spec
> before returning.
>
> Where is the leak?  (Actually, underflow means we're
> dropping more references than we took.)

We enter rbd_dev_probe_parent().  If ->parent_spec != NULL (and
therefore has a refcount of 1), we bump refcounts and proceed to
allocating parent rbd_dev.  If rbd_dev_create() fails, we drop
refcounts and exit rbd_dev_probe_parent() with ->parent_spec != NULL
and a refcount of 1 on that spec.  We take err_out_probe in
rbd_dev_image_probe() and through rbd_dev_unprobe() end up in
rbd_dev_parent_put().  Now, ->parent_spec != NULL but ->parent_ref == 0
because we never got to atomic_set(&rbd_dev->parent_ref, 1) in
rbd_dev_probe_parent().  Local counter ends up -1 after the decrement
and so instead of calling rbd_dev_unparent() which would have freed
->parent_spec we issue an underflow warning and bail.  ->parent_spec is
leaked.
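
A toy userspace model of that sequence, in case it helps (illustrative
only, not the driver code):

    #include <stdio.h>

    static int parent_ref;                /* never got to atomic_set(..., 1) */
    static int have_parent_spec = 1;      /* set up by rbd_dev_v2_parent_info() */

    static void parent_put(void)
    {
            int counter = --parent_ref;   /* 0 -> -1 */

            if (counter > 0)
                    return;
            if (counter == 0)
                    have_parent_spec = 0; /* rbd_dev_unparent() would free it */
            else
                    printf("parent reference underflow\n");  /* what we hit */
    }

    int main(void)
    {
            parent_put();
            printf("parent_spec leaked: %s\n", have_parent_spec ? "yes" : "no");
            return 0;
    }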

Thanks,

Ilya


Re: [PATCH] ceph/osd_client: add support for CEPH_OSD_OP_GETXATTR

2015-10-14 Thread Ilya Dryomov
On Wed, Oct 14, 2015 at 7:37 PM, David Disseldorp  wrote:
> On Fri,  9 Oct 2015 16:43:09 +0200, David Disseldorp wrote:
>
>> Allows for xattr retrieval. Response data buffer allocation is the
>> responsibility of the osd_req_op_xattr_init() caller.
>
> Ping, any feedback on the patch?

The patch itself looks OK, except for the part where you rename a local
variable for no reason, AFAICT.

We've discussed some of this last week.  As it is, all rbd image
properties are stored in omap, so PR info strings stored in xattrs is
something different, but it should be fine for now.  I'd rather not
merge any kernel patches related to rbd-target work until we see
a complete patchset though.  There's been a lot of back and forth
between Mike, Christoph and target people and the general approach had
changed at least twice, so I'd like to wait for things to settle down.

Thanks,

Ilya


Re: Kernel RBD Readahead

2015-10-13 Thread Ilya Dryomov
On Tue, Oct 13, 2015 at 11:33 AM, Olivier Bonvalet <ceph.l...@daevel.fr> wrote:
> On Tuesday, 25 August 2015 at 17:50 +0300, Ilya Dryomov wrote:
>> > Ok. I might try and create a 4.1 kernel with the blk-mq queue
>> depth/IO size + readahead +max_segments fixes in as I think the
>> TCP_NODELAY bug will still be present in my old 3.14 kernel.
>>
>> I can build 4.2-rc8 + readahead patch on gitbuilders for you.
>
>
> Hi,
>
> are those patches available in the 4.3 kernel?

4.3 isn't released yet.

Everything but a readahead fix will be in 4.3.  AFAICT readahead patch
was picked up by Andrew for 4.4 [1] - the problem is not in rbd, so we
have no direct control over it.

[1] http://comments.gmane.org/gmane.linux.kernel.commits.mm/96124

Thanks,

Ilya


Re: Reply: [PATCH] rbd: prevent kernel stack blow up on rbd map

2015-10-12 Thread Ilya Dryomov
On Mon, Oct 12, 2015 at 4:22 AM, Caoxudong  wrote:
> By the way, do you think it's necessary that we add the clone-chain-length 
> limit in user-space code too?

librbd is different in a lot of ways and there isn't a clean separation
between the client part (i.e. what is essentially reimplemented in the
kernel) and the rest (management and maintenance parts, etc).  It's
certainly not necessary, whether it's desirable - I'm not sure.  Also,
librbd limit, if we were to introduce one, would probably have to be
bigger.

Thanks,

Ilya


[PATCH] rbd: don't leak parent_spec in rbd_dev_probe_parent()

2015-10-11 Thread Ilya Dryomov
Currently we leak parent_spec and trigger a "parent reference
underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails.
The problem is we take the !parent out_err branch and that only drops
refcounts; parent_spec that would've been freed had we called
rbd_dev_unparent() remains and triggers rbd_warn() in
rbd_dev_parent_put() - at that point we have parent_spec != NULL and
parent_ref == 0, so counter ends up being -1 after the decrement.

Redo rbd_dev_probe_parent() to fix this.

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 drivers/block/rbd.c | 36 
 1 file changed, 16 insertions(+), 20 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index cd00e4653e49..ccbc3cbbf24e 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5134,41 +5134,37 @@ out_err:
 static int rbd_dev_probe_parent(struct rbd_device *rbd_dev)
 {
struct rbd_device *parent = NULL;
-   struct rbd_spec *parent_spec;
-   struct rbd_client *rbdc;
int ret;
 
if (!rbd_dev->parent_spec)
return 0;
-   /*
-* We need to pass a reference to the client and the parent
-* spec when creating the parent rbd_dev.  Images related by
-* parent/child relationships always share both.
-*/
-   parent_spec = rbd_spec_get(rbd_dev->parent_spec);
-   rbdc = __rbd_get_client(rbd_dev->rbd_client);
 
-   ret = -ENOMEM;
-   parent = rbd_dev_create(rbdc, parent_spec, NULL);
-   if (!parent)
+   parent = rbd_dev_create(rbd_dev->rbd_client, rbd_dev->parent_spec,
+   NULL);
+   if (!parent) {
+   ret = -ENOMEM;
goto out_err;
+   }
+
+   /*
+* Images related by parent/child relationships always share
+* rbd_client and spec/parent_spec, so bump their refcounts.
+*/
+   __rbd_get_client(rbd_dev->rbd_client);
+   rbd_spec_get(rbd_dev->parent_spec);
 
ret = rbd_dev_image_probe(parent, false);
if (ret < 0)
goto out_err;
+
rbd_dev->parent = parent;
atomic_set(&rbd_dev->parent_ref, 1);
-
return 0;
+
 out_err:
-   if (parent) {
-   rbd_dev_unparent(rbd_dev);
+   rbd_dev_unparent(rbd_dev);
+   if (parent)
rbd_dev_destroy(parent);
-   } else {
-   rbd_put_client(rbdc);
-   rbd_spec_put(parent_spec);
-   }
-
return ret;
 }
 
-- 
2.4.3



[PATCH] rbd: prevent kernel stack blow up on rbd map

2015-10-11 Thread Ilya Dryomov
Mapping an image with a long parent chain (e.g. image foo, whose parent
is bar, whose parent is baz, etc) currently leads to a kernel stack
overflow, due to the following recursion in the reply path:

  rbd_osd_req_callback()
rbd_obj_request_complete()
  rbd_img_obj_callback()
rbd_img_parent_read_callback()
  rbd_obj_request_complete()
...

Limit the parent chain to 8 images, which is ~3K worth of stack.  It's
probably a good thing to do regardless - performance with more than
a few images long parent chain is likely to be pretty bad.

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 drivers/block/rbd.c | 33 +++--
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index ccbc3cbbf24e..07f666f4ca18 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -96,6 +96,8 @@ static int atomic_dec_return_safe(atomic_t *v)
 #define RBD_MINORS_PER_MAJOR   256
 #define RBD_SINGLE_MAJOR_PART_SHIFT4
 
+#define RBD_MAX_PARENT_CHAIN_LEN   8
+
 #define RBD_SNAP_DEV_NAME_PREFIX   "snap_"
 #define RBD_MAX_SNAP_NAME_LEN  \
(NAME_MAX - (sizeof (RBD_SNAP_DEV_NAME_PREFIX) - 1))
@@ -426,7 +428,7 @@ static ssize_t rbd_add_single_major(struct bus_type *bus, 
const char *buf,
size_t count);
 static ssize_t rbd_remove_single_major(struct bus_type *bus, const char *buf,
   size_t count);
-static int rbd_dev_image_probe(struct rbd_device *rbd_dev, bool mapping);
+static int rbd_dev_image_probe(struct rbd_device *rbd_dev, int depth);
 static void rbd_spec_put(struct rbd_spec *spec);
 
 static int rbd_dev_id_to_minor(int dev_id)
@@ -5131,7 +5133,12 @@ out_err:
return ret;
 }
 
-static int rbd_dev_probe_parent(struct rbd_device *rbd_dev)
+/*
+ * @depth is rbd_dev_image_probe() -> rbd_dev_probe_parent() ->
+ * rbd_dev_image_probe() recursion depth, which means it's also the
+ * length of the already discovered part of the parent chain.
+ */
+static int rbd_dev_probe_parent(struct rbd_device *rbd_dev, int depth)
 {
struct rbd_device *parent = NULL;
int ret;
@@ -5139,6 +5146,12 @@ static int rbd_dev_probe_parent(struct rbd_device 
*rbd_dev)
if (!rbd_dev->parent_spec)
return 0;
 
+   if (++depth > RBD_MAX_PARENT_CHAIN_LEN) {
+   pr_info("parent chain is too long (%d)\n", depth);
+   ret = -EINVAL;
+   goto out_err;
+   }
+
parent = rbd_dev_create(rbd_dev->rbd_client, rbd_dev->parent_spec,
NULL);
if (!parent) {
@@ -5153,7 +5166,7 @@ static int rbd_dev_probe_parent(struct rbd_device 
*rbd_dev)
__rbd_get_client(rbd_dev->rbd_client);
rbd_spec_get(rbd_dev->parent_spec);
 
-   ret = rbd_dev_image_probe(parent, false);
+   ret = rbd_dev_image_probe(parent, depth);
if (ret < 0)
goto out_err;
 
@@ -5282,7 +5295,7 @@ static void rbd_dev_image_release(struct rbd_device 
*rbd_dev)
  * parent), initiate a watch on its header object before using that
  * object to get detailed information about the rbd image.
  */
-static int rbd_dev_image_probe(struct rbd_device *rbd_dev, bool mapping)
+static int rbd_dev_image_probe(struct rbd_device *rbd_dev, int depth)
 {
int ret;
 
@@ -5300,7 +5313,7 @@ static int rbd_dev_image_probe(struct rbd_device 
*rbd_dev, bool mapping)
if (ret)
goto err_out_format;
 
-   if (mapping) {
+   if (!depth) {
ret = rbd_dev_header_watch_sync(rbd_dev);
if (ret) {
if (ret == -ENOENT)
@@ -5321,7 +5334,7 @@ static int rbd_dev_image_probe(struct rbd_device 
*rbd_dev, bool mapping)
 * Otherwise this is a parent image, identified by pool, image
 * and snap ids - need to fill in names for those ids.
 */
-   if (mapping)
+   if (!depth)
ret = rbd_spec_fill_snap_id(rbd_dev);
else
ret = rbd_spec_fill_names(rbd_dev);
@@ -5343,12 +5356,12 @@ static int rbd_dev_image_probe(struct rbd_device 
*rbd_dev, bool mapping)
 * Need to warn users if this image is the one being
 * mapped and has a parent.
 */
-   if (mapping && rbd_dev->parent_spec)
+   if (!depth && rbd_dev->parent_spec)
rbd_warn(rbd_dev,
 "WARNING: kernel layering is EXPERIMENTAL!");
}
 
-   ret = rbd_dev_probe_parent(rbd_dev);
+   ret = rbd_dev_probe_parent(rbd_dev, depth);
if (ret)
goto err_out_probe;
 
@@ -5359,7 +5372,7 @@ static int rbd_dev_image_probe(struct rbd_device 
*rbd_dev, bool mapping)
 err_out_probe:

[PATCH] rbd: set max_sectors explicitly

2015-10-07 Thread Ilya Dryomov
Commit 30e2bc08b2bb ("Revert "block: remove artifical max_hw_sectors
cap"") restored a clamp on max_sectors.  It's now 2560 sectors instead
of 1024, but it's not good enough: we set max_hw_sectors to rbd object
size because we don't want object sized I/Os to be split, and the
default object size is 4M.

So, set max_sectors to max_hw_sectors in rbd at queue init time.

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 drivers/block/rbd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 05072464d25e..04e69b4df664 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3760,6 +3760,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
/* set io sizes to object size */
segment_size = rbd_obj_bytes(&rbd_dev->header);
blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);
+   q->limits.max_sectors = queue_max_hw_sectors(q);
blk_queue_max_segments(q, segment_size / SECTOR_SIZE);
blk_queue_max_segment_size(q, segment_size);
blk_queue_io_min(q, segment_size);
-- 
2.4.3



[PATCH] rbd: use writefull op for object size writes

2015-10-07 Thread Ilya Dryomov
This covers only the simplest case - an object size sized write, but
it's still useful in tiering setups when EC is used for the base tier
as writefull op can be proxied, saving an object promotion.

Even though updating ceph_osdc_new_request() to allow writefull should
just be a matter of fixing an assert, I didn't do it because its only
user is cephfs.  All other sites were updated.

Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b.

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 drivers/block/rbd.c   |  9 +++--
 net/ceph/osd_client.c | 13 +
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 04e69b4df664..cd00e4653e49 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1863,9 +1863,11 @@ static void rbd_osd_req_callback(struct ceph_osd_request 
*osd_req,
rbd_osd_read_callback(obj_request);
break;
case CEPH_OSD_OP_SETALLOCHINT:
-   rbd_assert(osd_req->r_ops[1].op == CEPH_OSD_OP_WRITE);
+   rbd_assert(osd_req->r_ops[1].op == CEPH_OSD_OP_WRITE ||
+  osd_req->r_ops[1].op == CEPH_OSD_OP_WRITEFULL);
/* fall through */
case CEPH_OSD_OP_WRITE:
+   case CEPH_OSD_OP_WRITEFULL:
rbd_osd_write_callback(obj_request);
break;
case CEPH_OSD_OP_STAT:
@@ -2401,7 +2403,10 @@ static void rbd_img_obj_request_fill(struct 
rbd_obj_request *obj_request,
opcode = CEPH_OSD_OP_ZERO;
}
} else if (op_type == OBJ_OP_WRITE) {
-   opcode = CEPH_OSD_OP_WRITE;
+   if (!offset && length == object_size)
+   opcode = CEPH_OSD_OP_WRITEFULL;
+   else
+   opcode = CEPH_OSD_OP_WRITE;
osd_req_op_alloc_hint_init(osd_request, num_ops,
object_size, object_size);
num_ops++;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 80b94e37c94a..f79ccac6699f 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -285,6 +285,7 @@ static void osd_req_op_data_release(struct ceph_osd_request 
*osd_req,
switch (op->op) {
case CEPH_OSD_OP_READ:
case CEPH_OSD_OP_WRITE:
+   case CEPH_OSD_OP_WRITEFULL:
ceph_osd_data_release(&op->extent.osd_data);
break;
case CEPH_OSD_OP_CALL:
@@ -485,13 +486,14 @@ void osd_req_op_extent_init(struct ceph_osd_request 
*osd_req,
size_t payload_len = 0;
 
BUG_ON(opcode != CEPH_OSD_OP_READ && opcode != CEPH_OSD_OP_WRITE &&
-  opcode != CEPH_OSD_OP_ZERO && opcode != CEPH_OSD_OP_TRUNCATE);
+  opcode != CEPH_OSD_OP_WRITEFULL && opcode != CEPH_OSD_OP_ZERO &&
+  opcode != CEPH_OSD_OP_TRUNCATE);
 
op->extent.offset = offset;
op->extent.length = length;
op->extent.truncate_size = truncate_size;
op->extent.truncate_seq = truncate_seq;
-   if (opcode == CEPH_OSD_OP_WRITE)
+   if (opcode == CEPH_OSD_OP_WRITE || opcode == CEPH_OSD_OP_WRITEFULL)
payload_len += length;
 
op->payload_len = payload_len;
@@ -670,9 +672,11 @@ static u64 osd_req_encode_op(struct ceph_osd_request *req,
break;
case CEPH_OSD_OP_READ:
case CEPH_OSD_OP_WRITE:
+   case CEPH_OSD_OP_WRITEFULL:
case CEPH_OSD_OP_ZERO:
case CEPH_OSD_OP_TRUNCATE:
-   if (src->op == CEPH_OSD_OP_WRITE)
+   if (src->op == CEPH_OSD_OP_WRITE ||
+   src->op == CEPH_OSD_OP_WRITEFULL)
request_data_len = src->extent.length;
dst->extent.offset = cpu_to_le64(src->extent.offset);
dst->extent.length = cpu_to_le64(src->extent.length);
@@ -681,7 +685,8 @@ static u64 osd_req_encode_op(struct ceph_osd_request *req,
dst->extent.truncate_seq =
cpu_to_le32(src->extent.truncate_seq);
osd_data = &src->extent.osd_data;
-   if (src->op == CEPH_OSD_OP_WRITE)
+   if (src->op == CEPH_OSD_OP_WRITE ||
+   src->op == CEPH_OSD_OP_WRITEFULL)
ceph_osdc_msg_data_add(req->r_request, osd_data);
else
ceph_osdc_msg_data_add(req->r_reply, osd_data);
-- 
2.4.3



Re: [PATCH] rbd: use writefull op for object size writes

2015-10-07 Thread Ilya Dryomov
On Wed, Oct 7, 2015 at 5:36 PM, Alex Elder <el...@ieee.org> wrote:
> On 10/07/2015 12:02 PM, Ilya Dryomov wrote:
>> This covers only the simplest case - an object size sized write, but
>> it's still useful in tiering setups when EC is used for the base tier
>> as writefull op can be proxied, saving an object promotion.
>>
>> Even though updating ceph_osdc_new_request() to allow writefull should
>> just be a matter of fixing an assert, I didn't do it because its only
>> user is cephfs.  All other sites were updated.
>>
>> Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b.
>
> I haven't looked at this at all.  But can you give me a
> short explanation of what "writefull" is?
>
> Full object write?

Well, in a way.  It replaces previous data, you can think of it as an
atomic truncate to 0 + write from offset 0.  So it always writes
(replaces) entire objects.
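
A toy illustration of those semantics (not OSD code - just modelling an
object as a byte buffer):

    #include <stdio.h>
    #include <string.h>

    struct object {
            char data[64];
            size_t len;
    };

    /* WRITEFULL: truncate to 0 + write from offset 0, atomically - the
     * object ends up being exactly the supplied data. */
    static void op_writefull(struct object *o, const char *buf, size_t len)
    {
            memset(o->data, 0, sizeof(o->data));   /* "truncate to 0" */
            memcpy(o->data, buf, len);             /* write from offset 0 */
            o->len = len;
    }

    int main(void)
    {
            struct object o = { "old contents", 12 };

            op_writefull(&o, "new", 3);
            printf("%zu bytes: %.*s\n", o.len, (int)o.len, o.data);  /* 3 bytes: new */
            return 0;
    }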

Thanks,

Ilya


Re: [PATCH] ceph:Remove unused goto labels in decode crush map functions

2015-10-02 Thread Ilya Dryomov
On Fri, Oct 2, 2015 at 9:48 PM, Nicholas Krause  wrote:
> This removes unused goto labels in decode crush map functions related
> to error paths due to them never being used on any error path for these
> particular functions in the file, osdmap.c.
>
> Signed-off-by: Nicholas Krause 
> ---
>  net/ceph/osdmap.c | 10 --
>  1 file changed, 10 deletions(-)
>
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 7d8f581..2f8e41c 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -59,8 +59,6 @@ static int crush_decode_uniform_bucket(void **p, void *end,
> ceph_decode_need(p, end, (1+b->h.size) * sizeof(u32), bad);
^^^
> b->item_weight = ceph_decode_32(p);
> return 0;
> -bad:
   ^^^
> -   return -EINVAL;
>  }

I realize that these macros are sneaky, but you should at least
compile-test your patches before you send them out.
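
For anyone wondering what "sneaky" means here: the decode macros hide
a goto targeting their last argument, so the label is used even though no
explicit goto appears in the function.  A simplified stand-in (not the
exact kernel definition) to show the shape:

    #include <stdio.h>
    #include <string.h>

    #define decode_need(p, end, n, bad_label)       \
            do {                                    \
                    if ((p) + (n) > (end))          \
                            goto bad_label;         \
            } while (0)

    static int decode_u32(const char *p, const char *end, unsigned int *out)
    {
            decode_need(p, end, 4, bad);    /* the hidden "goto bad" */
            memcpy(out, p, 4);
            return 0;
    bad:                                    /* remove this and it won't compile */
            return -1;
    }

    int main(void)
    {
            unsigned int v = 0;
            char buf[4] = { 1, 0, 0, 0 };

            printf("%d %u\n", decode_u32(buf, buf + sizeof(buf), &v), v);
            return 0;
    }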

Thanks,

Ilya


Re: [CEPH-DEVEL] [ceph-users] occasional failure to unmap rbd

2015-09-26 Thread Ilya Dryomov
On Sat, Sep 26, 2015 at 5:54 AM, Shinobu Kinjo  wrote:
> I think it's more helpful to put returned value in:
>
> # ./src/krbd.cc
> 530   cerr << "rbd: sysfs write failed" << std::endl;
>
> like:
>
> 530   cerr << "rbd: sysfs write failed (" << r << ")" << std::endl;
>
> So that we know exactly what **write** complains about.
> Because **write** has some return values in case of error.
>
> What do you think?

It's already doing that:

rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy

sysfs_write_rbd_remove() return value is propagated up and reported.
The code is written in such a way that return values are preserved and
never overwritten (hopefully!).

Thanks,

Ilya


Re: how to sepcify the point a rbd mapped?

2015-09-25 Thread Ilya Dryomov
On Fri, Sep 25, 2015 at 6:25 AM, Jaze Lee  wrote:
> Hello,
> I know we can map an rbd image to a block device in the kernel with the
> ‘rbd map’ command.
> But we cannot specify which block device. For example, if I have
> an rbd image named rbd_0,
> I want it mapped to /dev/rbd_0 when it is mapped.
>
> Is there some way to do that?

No, it's going to pick the lowest available ID.  The ID is picked
atomically by the kernel driver, you have no control over it.
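
(Think of it as a "lowest free ID" allocator - roughly like the sketch
below, which assumes an IDA and is for illustration only, not the exact
rbd code:)

    /* Illustration only: hand out the lowest available id, atomically. */
    #include <linux/idr.h>

    static DEFINE_IDA(rbd_ida);

    static int rbd_pick_dev_id(void)
    {
            return ida_simple_get(&rbd_ida, 0, 0, GFP_KERNEL);
    }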

If you want stable names, you should use udev symlinks, which look like

/dev/rbd/<pool>/<image>[@<snap>]

Thanks,

Ilya


Re: RBD mirroring CLI proposal ...

2015-09-23 Thread Ilya Dryomov
On Wed, Sep 23, 2015 at 4:08 PM, Jason Dillaman  wrote:
>> > In this case the commands look a little confusing to me, as from their
>> > names I would rather think they enable/disable mirror for existent
>> > images too. Also, I don't see a command to check what current
>> > behaviour is. And, I suppose it would be useful if we could configure
>> > other default features for a pool (exclusive-lock, object-map, ...)
>> > Also, I am not sure we should specify <pool> this way, as it is
>> > not consistent with other rbd commands. By default rbd operates on
>> > 'rbd' pool, which can be changed by --pool option. So what do you
>> > think if we have something similar to 'rbd feature' commands?
>> >
>> >   rbd [--pool <pool>] default-feature enable <feature>
>> >   rbd [--pool <pool>] default-feature disable <feature>
>> >   rbd [--pool <pool>] default-feature show [<feature>]
>> >
>> > (If <feature> is not specified in the last command, all features are
>> > shown).
>>
>> I haven't read the discussion in full, so feel free to ignore, but I'd
>> much rather have rbd create create images with the same set of features
>> enabled no matter which pool it's pointed to.
>>
>> I'm not clear on what exactly a pool policy is.  If it's just a set of
>> feature bits to enable at create time, I don't think it's worth
>> introducing at all.  If it's a set feature bits + a set of mirroring
>> related options, I think it should just be a set of those options.
>> Then, rbd create --enable-mirroring could be a way to create an image
>> with mirroring enabled.
>>
>> My point is a plain old "rbd create foo" shouldn't depend on any new
>> pool-level metadata.  It's not that hard to remember which features you
>> want for which pools and rbd create shortcuts like --enable-object-map
>> and --enable-mirroring would hide feature bit dependencies and save
>> typing.  --enable-mirroring would also serve as a ticket to go look at
>> new metadata and pull out any mirroring related defaults.
>>
>
> While I agree that I don't necessarily want to go down the road of specifying 
> per-pool default features, I think there is a definite use-case need to be 
> able to enable mirroring on a per-pool basis by default.  Without such an 
> ability, taking OpenStack for example, it would require changes to Glance, 
> Cinder, and Nova in order to support DR configurations -- or you could get it 
> automatically with a little pre-configuration.

So a pool policy is just a set of feature bits?

I think Cinder at least creates images with rbd_default_features from
ceph.conf and adds in layering if it's not set, meaning there is no
interface for passing through feature bits (or anything else really,
things like striping options, etc).  What pool-level default feature
bits infrastructure would do is replace a big (cluster-level) hammer
with a smaller (pool-level) hammer.  You'd have to add librbd APIs for
it and someone eventually will try to follow suit and add defaults for
other settings.  You said you weren't attempting to create a mechanism
to specify arbitrary default features for a given pool, but I think it
will come up in the future if we introduce this - it's only logical.

What we might want to do instead is use this mirroring milestone to add
support for a proper key-value interface for passing in features and
other settings for individual rbd images to OpenStack.  I assume it's
all python dicts with OpenStack, so it shouldn't be hard?  I know that
getting patches into OpenStack can be frustrating at times and I might
be underestimating the importance of the use case you have in mind, but
patching our OpenStack drivers rather than adding what essentially is
a workaround to librbd makes a lot more sense to me.

Thanks,

Ilya


Re: RBD mirroring CLI proposal ...

2015-09-23 Thread Ilya Dryomov
On Wed, Sep 23, 2015 at 9:28 PM, Jason Dillaman  wrote:
>> So a pool policy is just a set of feature bits?
>
> It would have to store additional details as well.
>
>> I think Cinder at least creates images with rbd_default_features from
>> ceph.conf and adds in layering if it's not set, meaning there is no
>> interface for passing through feature bits (or anything else really,
>> things like striping options, etc).  What pool-level default feature
>> bits infrastructure would do is replace a big (cluster-level) hammer
>> with a smaller (pool-level) hammer.  You'd have to add librbd APIs for
>> it and someone eventually will try to follow suit and add defaults for
>> other settings.  You said you weren't attempting to create a mechanism
>> to specify arbitrary default features for a given pool, but I think it
>> will come up in the future if we introduce this - it's only logical.
>>
>> What we might want to do instead is use this mirroring milestone to add
>> support for a proper key-value interface for passing in features and
>> other settings for individual rbd images to OpenStack.  I assume it's
>> all python dicts with OpenStack, so it shouldn't be hard?  I know that
>> getting patches into OpenStack can be frustrating at times and I might
>> be underestimating the importance of the use case you have in mind, but
>> patching our OpenStack drivers rather than adding what essentially is
>> a workaround to librbd makes a lot more sense to me.
>>
>
> It would be less work to skip adding the pool-level defaults which is a plus 
> given everything else required.  However, putting aside how long it would 
> take for the required changes to trickle down from OpenStack, Qemu, etc 
> (since I agree that shouldn't drive design), in some ways your proposal could 
> be seen as blurring the configuration encapsulation between clients and Ceph.
>
> Is the goal to configure my storage policies in one place or should I have to 
> update all my client configuration settings (not that big of a deal if you 
> are using something like Puppet to push down consistent configs across your 
> servers)? Trying to think like an end-user, I think I would prefer 
> configuring it once within the storage system itself.  I am not familiar with 
> any other storage systems that configure mirroring via OpenStack config 
> files, but I could be wrong since there are a lot of volume drivers now.

I'm not very familiar with OpenStack so I don't know either, I'm just
pointing out that, as far as at least cinder goes, we currently use
a cluster-wide default for something that is inherently a per-image
property, that there is no way to change it, and that there is a way to
configure only a small subset of settings.  I don't see it as blurring
the configuration encapsulation: if a user is creating an image from
OpenStack (or any other client for that matter), they should be able to
specify all the settings they want for a given image and not rely on
cluster-wide or pool-wide defaults.  (Maybe I'm too fixed on this idea
that per-image properties should be per-image and you are trying to
think bigger.  What I'm ranting about here is status quo, mirroring and
the new use cases and configuration challenges it brings along are
somewhat off to the side.)

I'm not against pool-level defaults per se, I just think if we go down
this road it's going to be hard to draw a line in the future, and I want
to make sure we are not adding it just to work around deficiencies in
our OpenStack drivers (and possibly librbd create-like APIs).

Thanks,

Ilya


Re: RBD mirroring CLI proposal ...

2015-09-23 Thread Ilya Dryomov
On Wed, Sep 23, 2015 at 9:33 AM, Mykola Golub  wrote:
> On Tue, Sep 22, 2015 at 01:32:49PM -0400, Jason Dillaman wrote:
>
>> > > * rbd mirror pool enable <pool>
>> > > This will, by default, ensure that all images created in this
>> > > pool have exclusive lock, journaling, and mirroring feature bits
>> > > enabled.
>> > >
>> > > * rbd mirror pool disable <pool>
>> > > This will clear the default image features for new images in this
>> > > pool.
>> >
>> > Will 'rbd mirror pool enable|disable' change behaviour only for newly
>> > created images in the pool or will enable|disable mirroring for
>> > existent images too?
>>
>> Since the goal is to set default pool behavior, it would only apply
>> to newly created images.  You can enable/disable on specific images
>> using the 'rbd mirror image enable/disable' commands.
>
> In this case the commands look a little confusing to me, as from their
> names I would rather think they enable/disable mirror for existent
> images too. Also, I don't see a command to check what current
> behaviour is. And, I suppose it would be useful if we could configure
> other default features for a pool (exclusive-lock, object-map, ...)
> Also, I am not sure we should specify <pool> this way, as it is
> not consistent with other rbd commands. By default rbd operates on
> 'rbd' pool, which can be changed by --pool option. So what do you
> think if we have something similar to 'rbd feature' commands?
>
>   rbd [--pool <pool>] default-feature enable <feature>
>   rbd [--pool <pool>] default-feature disable <feature>
>   rbd [--pool <pool>] default-feature show [<feature>]
>
> (If <feature> is not specified in the last command, all features are
> shown).

I haven't read the discussion in full, so feel free to ignore, but I'd
much rather have rbd create create images with the same set of features
enabled no matter which pool it's pointed to.

I'm not clear on what exactly a pool policy is.  If it's just a set of
feature bits to enable at create time, I don't think it's worth
introducing at all.  If it's a set feature bits + a set of mirroring
related options, I think it should just be a set of those options.
Then, rbd create --enable-mirroring could be a way to create an image
with mirroring enabled.

My point is a plain old "rbd create foo" shouldn't depend on any new
pool-level metadata.  It's not that hard to remember which features you
want for which pools and rbd create shortcuts like --enable-object-map
and --enable-mirroring would hide feature bit dependencies and save
typing.  --enable-mirroring would also serve as a ticket to go look at
new metadata and pull out any mirroring related defaults.

Thanks,

Ilya


Re: partprobe or partx or ... ?

2015-09-21 Thread Ilya Dryomov
On Sat, Sep 19, 2015 at 11:08 PM, Loic Dachary  wrote:
>
>
> On 19/09/2015 17:23, Loic Dachary wrote:
>> Hi Ilya,
>>
>> At present ceph-disk uses partprobe to ensure the kernel is aware of the 
>> latest partition changes after a new one is created, or after zapping the 
>> partition table. Although it works reliably (in the sense that the kernel is 
>> indeed aware of the desired partition layout), it goes as far as to remove 
>> all partition devices of the current kernel table, only to re-add them with 
>> the new partition table. The delay it implies is not an issue because 
> ceph-disk is rarely called. It however generates many udev events (dozens
>> remove/change/add for a two partition disk) and almost always creates border 
>> cases that are difficult to figure out and debug. While it is a good way to 
>> ensure that ceph-disk is idempotent and immune to race conditions, maybe it 
>> is needlessly hard.
>>
>> Do you know of a light weight alternative to partprobe ? In the past we've 
>> used partx but I remember it failed to address some border cases in 
>> non-intuitive ways. Do you know of another, simpler, approach to this ?
>>
>> Thanks in advance for your help :-)
>>
>
> For the record using /sys/block/sdX/device/rescan sounds good but does not 
> exist for devices created via devicemapper (used for dmcrypt and multipath).

Hi Loic,

Yeah, partprobe loops through the entire partition table, trying to
delete/add every slot.  As an aside, the in-kernel way to do this
(blockdev --rereadpt) is similar in that it also drops all partitions
and re-adds them later, but it's faster and probably generates fewer
change events.  The downside is that it won't work on a busy device.
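
(For reference, blockdev --rereadpt is essentially a single ioctl -
a minimal sketch:)

    /* Minimal sketch of what "blockdev --rereadpt" does: ask the kernel to
     * re-read the partition table.  Fails with EBUSY if any partition on
     * the device is in use. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    int main(int argc, char **argv)
    {
            int fd;

            if (argc != 2)
                    return 1;

            fd = open(argv[1], O_RDONLY);
            if (fd < 0 || ioctl(fd, BLKRRPART) < 0) {
                    perror("BLKRRPART");
                    return 1;
            }
            close(fd);
            return 0;
    }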

I don't think there is any alternative, except for using partx --add
with --nr, that is, targeting specific slots in the partition table.  If
all you are doing is adding partitions and zapping entire partition
tables, that may work well enough.

That said, given that the resulting delay (which can be in the seconds
range, especially if your disk happens to have a busy partition) isn't
a problem, what difference does it make?  What are you listening to
those events for?

/sys/block/sdX/device/rescan is sd only, and AFAIK it doesn't generally
trigger a re-read of a partition table.

Thanks,

Ilya


Re: [Ceph-community] Getting WARN in __kick_osd_requests doing stress testing

2015-09-18 Thread Ilya Dryomov
On Fri, Sep 18, 2015 at 9:48 AM, Abhishek L
 wrote:
> Redirecting to ceph-devel, where such a question might have a better
> chance of a reply.
>
> On Fri, Sep 18, 2015 at 4:03 AM,   wrote:
>> I'm running in a 3-node cluster and doing osd/rbd creation and deletion, and
>> ran across this WARN
>> Note, it only happened once (on one rbd add) after approximately 500 cycles
>> of the test, but was wondering if
>> someone can explain to me why this warning would be happening, and how I can
>> prevent it.
>>
>> Here is what my test script is doing:
>>
>> while(1):
>> create 5 ceph pools   - sleep 2 between each pool create
>> sleep 5
>> create 5 ceph volumes - sleep 2 between each pool create
>> sleep 5
>> delete 5 ceph volumes - sleep 2 between each pool create
>> sleep 5
>> delete 5 ceph pools   - sleep 2 between each pool create
>> sleep 5
>>
>>
>> 333940 Sep 17 00:31:54 10.0.41.9 [18372.272771] Call Trace:
>> 333941 Sep 17 00:31:54 10.0.41.9 [18372.273489]  []
>> dump_stack+0x45/0x57
>> 333942 Sep 17 00:31:54 10.0.41.9 [18372.274226]  []
>> warn_slowpath_common+0x97/0xe0
>> 333943 Sep 17 00:31:54 10.0.41.9 [18372.274923]  []
>> warn_slowpath_null+0x1a/0x20
>> 333944 Sep 17 00:31:54 10.0.41.9 [18372.275635]  []
>> __kick_osd_requests+0x1dc/0x240 [libceph]
>> 333945 Sep 17 00:31:54 10.0.41.9 [18372.276305]  []
>> osd_reset+0x57/0xa0 [libceph]
>> 333946 Sep 17 00:31:54 10.0.41.9 [18372.276962]  []
>> con_work+0x112/0x290 [libceph]
>> 333947 Sep 17 00:31:54 10.0.41.9 [18372.277608]  []
>> process_one_work+0x144/0x470
>> 333948 Sep 17 00:31:54 10.0.41.9 [18372.278247]  []
>> worker_thread+0x11e/0x450
>> 333949 Sep 17 00:31:54 10.0.41.9 [18372.278880]  [] ?
>> create_worker+0x1f0/0x1f0
>> 333950 Sep 17 00:31:54 10.0.41.9 [18372.279543]  []
>> kthread+0xc9/0xe0
>> 333951 Sep 17 00:31:54 10.0.41.9 [18372.280174]  [] ?
>> flush_kthread_worker+0x90/0x90
>> 333952 Sep 17 00:31:54 10.0.41.9 [18372.280803]  []
>> ret_from_fork+0x58/0x90
>> 333953 Sep 17 00:31:54 10.0.41.9 [18372.281430]  [] ?
>> flush_kthread_worker+0x90/0x90
>>
>> static void __kick_osd_requests(struct ceph_osd_client *osdc,
>> struct ceph_osd *osd)
>> {
>>  :
>> list_for_each_entry_safe(req, nreq, &osd->o_linger_requests,
>>  r_linger_osd_item) {
>> WARN_ON(!list_empty(&req->r_req_lru_item));
>> __kick_linger_request(req);
>> }
>> :
>> }

What is your kernel version?

There is no mention of rbd map/unmap in the pseudo code you provided.
How are you mapping/unmapping those rbd images?  More details or the
script itself would be nice to see.

Thanks,

Ilya


Re: [PATCH] libceph: advertise support for keepalive2

2015-09-16 Thread Ilya Dryomov
On Wed, Sep 16, 2015 at 9:28 AM, Yan, Zheng <uker...@gmail.com> wrote:
> On Mon, Sep 14, 2015 at 9:51 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
>> We are the client, but advertise keepalive2 anyway - for consistency,
>> if nothing else.  In the future the server might want to know whether
>> its clients support keepalive2.
>
> the kernel code still does not fully support KEEPALIVE2 (it does not
> recognize CEPH_MSGR_TAG_KEEPALIVE2 tag). I think it's better to not
> advertise support for keepalive2.

I guess by "does not recognize" you mean the kernel client knows how to
write TAG_KEEPALIVE2, but not how to read it?  The same is true about
TAG_KEEPALIVE tag and the reverse (we can read, but can't write) is
true about TAG_KEEPALIVE2_ACK.

What I'm getting at is the kernel client is the client, and so it
doesn't need to know how to read keepalive bytes or write keepalive
acks.  The server however might want to know if its clients can send
keepalive2 bytes and handle keepalive2 acks.  Does this make sense?

Thanks,

Ilya


[PATCH] libceph: advertise support for keepalive2

2015-09-14 Thread Ilya Dryomov
We are the client, but advertise keepalive2 anyway - for consistency,
if nothing else.  In the future the server might want to know whether
its clients support keepalive2.

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 include/linux/ceph/ceph_features.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/ceph/ceph_features.h 
b/include/linux/ceph/ceph_features.h
index 4763ad64e832..f89b31d45cc8 100644
--- a/include/linux/ceph/ceph_features.h
+++ b/include/linux/ceph/ceph_features.h
@@ -107,6 +107,7 @@ static inline u64 ceph_sanitize_features(u64 features)
 CEPH_FEATURE_OSDMAP_ENC |  \
 CEPH_FEATURE_CRUSH_TUNABLES3 | \
 CEPH_FEATURE_OSD_PRIMARY_AFFINITY |\
+CEPH_FEATURE_MSGR_KEEPALIVE2 | \
 CEPH_FEATURE_CRUSH_V4)
 
 #define CEPH_FEATURES_REQUIRED_DEFAULT   \
-- 
1.9.3



[PATCH] libceph: don't access invalid memory in keepalive2 path

2015-09-14 Thread Ilya Dryomov
This

struct ceph_timespec ceph_ts;
...
con_out_kvec_add(con, sizeof(ceph_ts), &ceph_ts);

wraps ceph_ts into a kvec and adds it to con->out_kvec array, yet
ceph_ts becomes invalid on return from prepare_write_keepalive().  As
a result, we send out bogus keepalive2 stamps.  Fix this by encoding
into a ceph_timespec member, similar to how acks are read and written.

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 include/linux/ceph/messenger.h | 4 +++-
 net/ceph/messenger.c   | 9 +
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 7e1252e97a30..b2371d9b51fa 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -238,6 +238,8 @@ struct ceph_connection {
bool out_kvec_is_msg; /* kvec refers to out_msg */
int out_more;/* there is more data after the kvecs */
__le64 out_temp_ack; /* for writing an ack */
+   struct ceph_timespec out_temp_keepalive2; /* for writing keepalive2
+stamp */
 
/* message in temps */
struct ceph_msg_header in_hdr;
@@ -248,7 +250,7 @@ struct ceph_connection {
int in_base_pos; /* bytes read */
__le64 in_temp_ack;  /* for reading an ack */
 
-   struct timespec last_keepalive_ack;
+   struct timespec last_keepalive_ack; /* keepalive2 ack stamp */
 
struct delayed_work work;   /* send|recv work */
unsigned long   delay;  /* current delay interval */
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 525f454f7531..b9b0e3b5da49 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1353,11 +1353,12 @@ static void prepare_write_keepalive(struct 
ceph_connection *con)
dout("prepare_write_keepalive %p\n", con);
con_out_kvec_reset(con);
if (con->peer_features & CEPH_FEATURE_MSGR_KEEPALIVE2) {
-   struct timespec ts = CURRENT_TIME;
-   struct ceph_timespec ceph_ts;
-   ceph_encode_timespec(&ceph_ts, &ts);
+   struct timespec now = CURRENT_TIME;
+
con_out_kvec_add(con, sizeof(tag_keepalive2), &tag_keepalive2);
-   con_out_kvec_add(con, sizeof(ceph_ts), &ceph_ts);
+   ceph_encode_timespec(&con->out_temp_keepalive2, &now);
+   con_out_kvec_add(con, sizeof(con->out_temp_keepalive2),
+                    &con->out_temp_keepalive2);
} else {
con_out_kvec_add(con, sizeof(tag_keepalive), &tag_keepalive);
}
-- 
1.9.3



Re: [PATCH 33/39] rbd: drop null test before destroy functions

2015-09-14 Thread Ilya Dryomov
On Sun, Sep 13, 2015 at 3:15 PM, Julia Lawall  wrote:
> Remove unneeded NULL test.
>
> The semantic patch that makes this change is as follows:
> (http://coccinelle.lip6.fr/)
>
> // <smpl>
> @@ expression x; @@
> -if (x != NULL) {
>   \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
>   x = NULL;
> -}
> // </smpl>
>
> Signed-off-by: Julia Lawall 
>
> ---
>  drivers/block/rbd.c |6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index d93a037..0507246 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -5645,10 +5645,8 @@ static int rbd_slab_init(void)
> if (rbd_segment_name_cache)
> return 0;
>  out_err:
> -   if (rbd_obj_request_cache) {
> -   kmem_cache_destroy(rbd_obj_request_cache);
> -   rbd_obj_request_cache = NULL;
> -   }
> +   kmem_cache_destroy(rbd_obj_request_cache);
> +   rbd_obj_request_cache = NULL;
>
> kmem_cache_destroy(rbd_img_request_cache);
> rbd_img_request_cache = NULL;

Applied.

Thanks,

Ilya


size_t and related types on mn10300

2015-09-13 Thread Ilya Dryomov
On Thu, Sep 10, 2015 at 10:57 AM, kbuild test robot
 wrote:
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git 
> master
> head:   22dc312d56ba077db27a9798b340e7d161f1df05
> commit: 5f1c79a71766ba656762636936edf708089bdb14 [12335/12685] libceph: check 
> data_len in ->alloc_msg()
> config: mn10300-allmodconfig (attached as .config)
> reproduce:
>   wget 
> https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
>  -O ~/bin/make.cross
>   chmod +x ~/bin/make.cross
>   git checkout 5f1c79a71766ba656762636936edf708089bdb14
>   # save the attached .config to linux build tree
>   make.cross ARCH=mn10300
>
> All warnings (new ones prefixed by >>):
>
>net/ceph/osd_client.c: In function 'get_reply':
>>> net/ceph/osd_client.c:2865:3: warning: format '%zu' expects argument of 
>>> type 'size_t', but argument 6 has type 'unsigned int' [-Wformat=]
>   pr_warn("%s osd%d tid %llu data %d > preallocated %zu, skipping\n",
>   ^
>
> vim +2865 net/ceph/osd_client.c
>
>   2849   req->r_reply, req->r_reply->con);
>   2850  ceph_msg_revoke_incoming(req->r_reply);
>   2851
>   2852  if (front_len > req->r_reply->front_alloc_len) {
>   2853  pr_warn("%s osd%d tid %llu front %d > preallocated 
> %d\n",
>   2854  __func__, osd->o_osd, req->r_tid, front_len,
>   2855  req->r_reply->front_alloc_len);
>   2856  m = ceph_msg_new(CEPH_MSG_OSD_OPREPLY, front_len, 
> GFP_NOFS,
>   2857   false);
>   2858  if (!m)
>   2859  goto out;
>   2860  ceph_msg_put(req->r_reply);
>   2861  req->r_reply = m;
>   2862  }
>   2863
>   2864  if (data_len > req->r_reply->data_length) {
>> 2865  pr_warn("%s osd%d tid %llu data %d > preallocated %zu, 
>> skipping\n",
>   2866  __func__, osd->o_osd, req->r_tid, data_len,
>   2867  req->r_reply->data_length);
>   2868  m = NULL;
>   2869  *skip = 1;
>   2870  goto out;
>   2871  }
>   2872
>   2873  m = ceph_msg_get(req->r_reply);

req->r_reply->data_length is size_t, formatted with %zu, as it should
be.  I think this is a problem with either mn10300 itself or the build
toolchain - compiling mn10300 defconfig throws a whole bunch of similar
errors all over the place.

arch/mn10300/include/uapi/asm/posix_types.h
 30 #if __GNUC__ == 4
 31 typedef unsigned int__kernel_size_t;
 32 typedef signed int  __kernel_ssize_t;
 33 #else
 34 typedef unsigned long   __kernel_size_t;
 35 typedef signed long __kernel_ssize_t;
 36 #endif

This came from commit 3ad001c04f1a ("MN10300: Fix size_t and ssize_t")
by David.  Now, if that's right, it should probably be >= 4.  But then

$ echo | /opt/gcc-4.9.0-nolibc/am33_2.0-linux/bin/am33_2.0-linux-gcc
-dM -E - | grep __SIZE_TYPE__
#define __SIZE_TYPE__ long unsigned int

and sure enough

gcc/config/mn10300/linux.h
 87 #undef SIZE_TYPE
 88 #undef PTRDIFF_TYPE
 89 #undef WCHAR_TYPE
 90 #undef WCHAR_TYPE_SIZE

gcc/config/mn10300/mn10300.h
125 #undef  SIZE_TYPE
126 #define SIZE_TYPE "unsigned int"

so it looks like "linux" variant uses gcc defaults, at least since
gcc.git commit 05381348ac78 (which is dated a few months after David's
commit made it into the kernel).  Can someone who cares about mn10300
or at least knows what it is look at this?
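
A userspace sketch of the class of warning, for anyone who wants to see it
without a cross-toolchain (compile with -Wformat on any target whose
__SIZE_TYPE__ isn't plain unsigned int):

    #include <stdio.h>

    /* Plays the role of mn10300's __kernel_size_t for gcc 4. */
    typedef unsigned int kernel_size_t;

    int main(void)
    {
            kernel_size_t len = 42;

            /* gcc checks %zu against its own __SIZE_TYPE__ ("long unsigned
             * int" for the linux variant), not against the kernel's size_t,
             * hence the -Wformat warning. */
            printf("len %zu\n", len);
            return 0;
    }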

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] libceph: use keepalive2 to verify the mon session is alive

2015-09-02 Thread Ilya Dryomov
On Wed, Sep 2, 2015 at 5:22 AM, Yan, Zheng  wrote:
> timespec_to_jiffies() does not work this way. It converts a time delta in the
> form of a timespec to a time delta in the form of jiffies.

Ah sorry, con->last_keepalive_ack is a realtime timespec from
userspace.

>
> I will update the patch according to the rest of the comments.

I still want ceph_con_keepalive_expired() to handle interval == 0
internally, so that opt->monc_ping_timeout can be passed without any
checks in the caller.
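
i.e. roughly this (an untested sketch of the behaviour being asked for,
not the final code):

    static bool ceph_con_keepalive_expired(struct ceph_connection *con,
                                           unsigned long interval)
    {
            if (interval) {
                    struct timespec now = CURRENT_TIME;
                    struct timespec ts;

                    jiffies_to_timespec(interval, &ts);
                    ts = timespec_add(con->last_keepalive_ack, ts);
                    return timespec_compare(&now, &ts) >= 0;
            }
            return false;    /* interval == 0: never expires */
    }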

I also noticed you added delay = max_t(unsigned long, delay, HZ); to
__schedule_delayed().  Is it really necessary?

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] libceph: use keepalive2 to verify the mon session is alive

2015-09-02 Thread Ilya Dryomov
On Wed, Sep 2, 2015 at 12:25 PM, Yan, Zheng <z...@redhat.com> wrote:
>
>> On Sep 2, 2015, at 17:12, Ilya Dryomov <idryo...@gmail.com> wrote:
>>
>> On Wed, Sep 2, 2015 at 5:22 AM, Yan, Zheng <z...@redhat.com> wrote:
>>> timespec_to_jiffies() does not work this way. it convert time delta in form 
>>> of timespec to time delta in form of jiffies.
>>
>> Ah sorry, con->last_keepalive_ack is a realtime timespec from
>> userspace.
>>
>>>
>>> I will updated the patch according  to the rest comments .
>>
>> I still want ceph_con_keepalive_expired() to handle interval == 0
>> internally, so that opt->monc_ping_timeout can be passed without any
>> checks in the caller.
>>
>> I also noticed you added delay = max_t(unsigned long, delay, HZ); to
>> __schedule_delayed().  Is it really necessary?
>>
>
> Updated. please review.

I was concerned about user facing option names and timeouts, since
that's what I had to deal with recently - those parts look OK to me
now.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] libceph: check data_len in ->alloc_msg()

2015-09-02 Thread Ilya Dryomov
Only ->alloc_msg() should check data_len of the incoming message
against the preallocated ceph_msg, doing it in the messenger is not
right.  The contract is that either ->alloc_msg() returns a ceph_msg
which will fit all of the portions of the incoming message, or it
returns NULL and possibly sets skip, signaling whether NULL is due to
an -ENOMEM.  ->alloc_msg() should be the only place where we make the
skip/no-skip decision.

I stumbled upon this while looking at con/osd ref counting.  Right now,
if we get a non-extent message with a larger data portion than we are
prepared for, ->alloc_msg() returns a ceph_msg, and then, when we skip
it in the messenger, we don't put the con/osd ref acquired in
ceph_con_in_msg_alloc() (which is normally put in process_message()),
so this also fixes a memory leak.

An existing BUG_ON in ceph_msg_data_cursor_init() ensures we don't
corrupt random memory should a buggy ->alloc_msg() return an unfit
ceph_msg.

While at it, I changed the "unknown tid" dout() to a pr_warn() to make
sure all skips are seen and unified format strings.

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 net/ceph/messenger.c  |  7 ---
 net/ceph/osd_client.c | 51 ++-
 2 files changed, 18 insertions(+), 40 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 36757d46ac40..525f454f7531 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -2337,13 +2337,6 @@ static int read_partial_message(struct ceph_connection 
*con)
return ret;
 
BUG_ON(!con->in_msg ^ skip);
-   if (con->in_msg && data_len > con->in_msg->data_length) {
-   pr_warn("%s skipping long message (%u > %zd)\n",
-   __func__, data_len, con->in_msg->data_length);
-   ceph_msg_put(con->in_msg);
-   con->in_msg = NULL;
-   skip = 1;
-   }
if (skip) {
/* skip this message */
dout("alloc_msg said skip message\n");
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 50033677c0fa..80b94e37c94a 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2817,8 +2817,9 @@ out:
 }
 
 /*
- * lookup and return message for incoming reply.  set up reply message
- * pages.
+ * Lookup and return message for incoming reply.  Don't try to do
+ * anything about a larger than preallocated data portion of the
+ * message at the moment - for now, just skip the message.
  */
 static struct ceph_msg *get_reply(struct ceph_connection *con,
  struct ceph_msg_header *hdr,
@@ -2836,10 +2837,10 @@ static struct ceph_msg *get_reply(struct 
ceph_connection *con,
	mutex_lock(&osdc->request_mutex);
req = __lookup_request(osdc, tid);
if (!req) {
-   *skip = 1;
+   pr_warn("%s osd%d tid %llu unknown, skipping\n",
+   __func__, osd->o_osd, tid);
m = NULL;
-   dout("get_reply unknown tid %llu from osd%d\n", tid,
-osd->o_osd);
+   *skip = 1;
goto out;
}
 
@@ -2849,10 +2850,9 @@ static struct ceph_msg *get_reply(struct ceph_connection 
*con,
ceph_msg_revoke_incoming(req->r_reply);
 
if (front_len > req->r_reply->front_alloc_len) {
-   pr_warn("get_reply front %d > preallocated %d (%u#%llu)\n",
-   front_len, req->r_reply->front_alloc_len,
-   (unsigned int)con->peer_name.type,
-   le64_to_cpu(con->peer_name.num));
+   pr_warn("%s osd%d tid %llu front %d > preallocated %d\n",
+   __func__, osd->o_osd, req->r_tid, front_len,
+   req->r_reply->front_alloc_len);
m = ceph_msg_new(CEPH_MSG_OSD_OPREPLY, front_len, GFP_NOFS,
 false);
if (!m)
@@ -2860,37 +2860,22 @@ static struct ceph_msg *get_reply(struct 
ceph_connection *con,
ceph_msg_put(req->r_reply);
req->r_reply = m;
}
-   m = ceph_msg_get(req->r_reply);
-
-   if (data_len > 0) {
-   struct ceph_osd_data *osd_data;
 
-   /*
-* XXX This is assuming there is only one op containing
-* XXX page data.  Probably OK for reads, but this
-* XXX ought to be done more generally.
-*/
-   osd_data = osd_req_op_extent_osd_data(req, 0);
-   if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES) {
-   if (osd_data->pages &&
-   unl

Re: [PATCH] libceph: use keepalive2 to verify the mon session is alive

2015-09-02 Thread Ilya Dryomov
On Wed, Sep 2, 2015 at 12:25 PM, Yan, Zheng <z...@redhat.com> wrote:
>
>> On Sep 2, 2015, at 17:12, Ilya Dryomov <idryo...@gmail.com> wrote:
>>
>> On Wed, Sep 2, 2015 at 5:22 AM, Yan, Zheng <z...@redhat.com> wrote:
>>> timespec_to_jiffies() does not work this way. it convert time delta in form 
>>> of timespec to time delta in form of jiffies.
>>
>> Ah sorry, con->last_keepalive_ack is a realtime timespec from
>> userspace.
>>
>>>
>>> I will updated the patch according  to the rest comments .
>>
>> I still want ceph_con_keepalive_expired() to handle interval == 0
>> internally, so that opt->monc_ping_timeout can be passed without any
>> checks in the caller.
>>
>> I also noticed you added delay = max_t(unsigned long, delay, HZ); to
>> __schedule_delayed().  Is it really necessary?
>>
>
> Updated. please review.

One other thing you might want to do is to / 3 instead of >> 2 in
delayed_work().  That way the tick interval in the non-hunting case is
going to be 10 seconds, which matches the default for ping interval
(mon_client_ping_interval) in userspace.
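
The arithmetic, assuming the default 30 second ping timeout implied by
the 10 second figure above (whole seconds here for simplicity; in the
kernel the delay would be in jiffies):

#include <stdio.h>

int main(void)
{
        unsigned long ping_timeout = 30;        /* seconds, assumed default */

        printf(">> 2 gives a %lus tick\n", ping_timeout >> 2);  /* 7  */
        printf("/ 3  gives a %lus tick\n", ping_timeout / 3);   /* 10 */
        return 0;
}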

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: devicemapper and udev events

2015-09-01 Thread Ilya Dryomov
On Tue, Sep 1, 2015 at 1:09 AM, Loic Dachary  wrote:
> Hi Ilya,
>
> While working on multipath, I noticed that udev add event are not triggered 
> when /dev/dm-0 etc. are created. Should I expect a udev add event for every 
> device ? Or does it not apply to devices created by / with devicemapper ?

I'd expect an event - are you sure it's not triggered?  Are you using
udevadm monitor or is it just some udev rule not getting executed?
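
If it helps, here is a minimal C equivalent of "udevadm monitor --udev
--subsystem-match=block" - straightforward libudev usage, build with
-ludev, error handling omitted:

#include <stdio.h>
#include <poll.h>
#include <libudev.h>

int main(void)
{
        struct udev *udev = udev_new();
        struct udev_monitor *mon = udev_monitor_new_from_netlink(udev, "udev");
        struct pollfd pfd;

        udev_monitor_filter_add_match_subsystem_devtype(mon, "block", NULL);
        udev_monitor_enable_receiving(mon);
        pfd.fd = udev_monitor_get_fd(mon);
        pfd.events = POLLIN;

        for (;;) {
                struct udev_device *dev;
                const char *node;

                if (poll(&pfd, 1, -1) <= 0)
                        continue;
                dev = udev_monitor_receive_device(mon);
                if (!dev)
                        continue;
                /* e.g. "add /dev/dm-0" or "change /dev/dm-0" */
                node = udev_device_get_devnode(dev);
                printf("%s %s\n", udev_device_get_action(dev),
                       node ? node : udev_device_get_sysname(dev));
                udev_device_unref(dev);
        }
        return 0;
}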

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] libceph: use keepalive2 to verify the mon session is alive

2015-09-01 Thread Ilya Dryomov
On Tue, Sep 1, 2015 at 5:21 PM, Yan, Zheng  wrote:
> Signed-off-by: Yan, Zheng 
> ---
>  include/linux/ceph/libceph.h   |  2 ++
>  include/linux/ceph/messenger.h |  4 +++
>  include/linux/ceph/msgr.h  |  4 ++-
>  net/ceph/ceph_common.c | 18 -
>  net/ceph/messenger.c   | 60 
> ++
>  net/ceph/mon_client.c  | 38 --
>  6 files changed, 111 insertions(+), 15 deletions(-)

Hi Zheng,

Just want to make sure you saw my "Re: " message - I had a few
suggestions.  This one looks like the one I commented on, straight from
the testing branch.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: format 2TB rbd device is too slow

2015-08-31 Thread Ilya Dryomov
On Mon, Aug 31, 2015 at 4:21 AM, Ma, Jianpeng  wrote:
> Ilya
>  I modify kernel code. The patch like this:
> [root@dev linux]# git diff
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index bc67a93..e4c4ea9 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -55,7 +55,7 @@
>   * universally 512 bytes.  These symbols are just slightly more
>   * meaningful than the bare numbers they represent.
>   */
> -#define SECTOR_SHIFT    9
> +#define SECTOR_SHIFT    12
>  #define SECTOR_SIZE     (1ULL << SECTOR_SHIFT)
>
>  /*
> @@ -3811,6 +3811,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
> blk_queue_io_min(q, segment_size);
> blk_queue_io_opt(q, segment_size);
>
> +   blk_queue_physical_block_size(q, SECTOR_SIZE);
> /* enable the discard support */
> queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
> q->limits.discard_granularity = segment_size;
>
> I don't know the exact why this can decrease the time.

Well, that's the answer ;)  set_capacity()'s @size is always in 512
byte sectors, and it's called as follows in rbd:

set_capacity(rbd_dev->disk, rbd_dev->mapping.size / SECTOR_SIZE);

By changing the SECTOR_SIZE from 512 to 4096 (i.e. by a factor of 8)
you just shrink the block device by the same factor.  What used to be
a 2T mapping would now be a 256G mapping, and since the amount of data
mkfs.xfs has to write is proportional to the size of the device, your
7 times decrease in time is easily explained by an 8 times decrease in
size of the mapping.
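
Spelling out the numbers (a trivial standalone check, 2T image size
assumed from your test):

#include <stdio.h>

int main(void)
{
        unsigned long long bytes = 2ULL << 40;          /* 2T image */

        printf("512-byte sectors:  %llu (correct)\n", bytes / 512);
        printf("4096-byte sectors: %llu (8x too few, i.e. a 256G device)\n",
               bytes / 4096);
        return 0;
}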

>
> For the -b of mkfs.xfs, I try for 65536. But it don't change any.

Right.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] rbd: fix double free on rbd_dev->header_name

2015-08-31 Thread Ilya Dryomov
If rbd_dev_image_probe() in rbd_dev_probe_parent() fails, header_name
is freed twice: once in rbd_dev_probe_parent() and then in its caller
rbd_dev_image_probe() (rbd_dev_image_probe() is called recursively to
handle parent images).

rbd_dev_probe_parent() is responsible for probing the parent, so it
shouldn't muck with the clone's fields.

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 drivers/block/rbd.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index bc67a93aa4f4..324bf35ec4dd 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5201,7 +5201,6 @@ static int rbd_dev_probe_parent(struct rbd_device 
*rbd_dev)
 out_err:
if (parent) {
rbd_dev_unparent(rbd_dev);
-   kfree(rbd_dev->header_name);
rbd_dev_destroy(parent);
} else {
rbd_put_client(rbdc);
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: format 2TB rbd device is too slow

2015-08-28 Thread Ilya Dryomov
On Thu, Aug 27, 2015 at 3:43 AM, huang jun hjwsm1...@gmail.com wrote:
 hi,llya

 2015-08-26 23:56 GMT+08:00 Ilya Dryomov idryo...@gmail.com:
 On Wed, Aug 26, 2015 at 6:22 PM, Haomai Wang haomaiw...@gmail.com wrote:
 On Wed, Aug 26, 2015 at 11:16 PM, huang jun hjwsm1...@gmail.com wrote:
 hi,all
 we create a 2TB rbd image, after map it to local,
 then we format it to xfs with 'mkfs.xfs /dev/rbd0', it spent 318
 seconds to finish, but  local physical disk with the same size just
 need 6 seconds.


 I think librbd have two PR related to this.

 After debug, we found there are two steps in rbd module during formating:
 a) send  233093 DELETE requests to osds(number_of_requests = 2TB / 4MB),
this step spent almost 92 seconds.

 I guess this(https://github.com/ceph/ceph/pull/4221/files) may help

 It's submitting deletes for non-existent objects, not zeroing.  The
 only thing that will really help here is the addition of rbd object map
 support to the kernel client.  That could happen in 4.4, but 4.5 is
 a safer bet.


 b) send 4238 messages like this: [set-alloc-hint object_size 4194304
 write_size 4194304,write 0~512] to osds, that spent 227 seconds.

 I think kernel rbd also need to use
 https://github.com/ceph/ceph/pull/4983/files

 set-alloc-hint may be a problem, but I think a bigger problem is the
 size of the write.  Are all those writes 512 bytes long?

 In another test to format 2TB rbd device,
 there are :
 2 messages,each write 131072 bytes
 4000 messages, each write 262144 bytes
 112 messages, each write 4096 bytes
 194 messages, each write 512 bytes

So the majority of writes is not 512 bytes long.  I don't think
disabling set-alloc-hint (and, as of now at least, you can't disable it
anyway) would drastically change the numbers.  If you are doing mkfs
right after creating and mapping an image for the first time, you can
add -K option to mkfs, which will tell it to not try to discard.  As
for the write phase, I can't suggest anything off hand.

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: format 2TB rbd device is too slow

2015-08-28 Thread Ilya Dryomov
On Fri, Aug 28, 2015 at 10:36 AM, Ma, Jianpeng jianpeng...@intel.com wrote:
 Hi Ilya,
We can change sector size from 512 to 4096. This can reduce the count of 
 write.
 I did a simple test: for 900G, mkfs.xfs -f
 For default: 1m10s
 Physical sector size = 4096:  0m10s.

 But if change sector size, we need rbd meta record this.

What exactly do you mean by changing the sector size?  xfs sector size
or physical block size of the rbd device?  If the latter, meaning
mkfs.xfs -s size 4096, I fail to see how it can result in the above
numbers.  If the former, you have to patch the kernel and I fail to see
how it can result in such an improvement too.  I probably didn't have
enough coffee, please elaborate.

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: format 2TB rbd device is too slow

2015-08-26 Thread Ilya Dryomov
On Wed, Aug 26, 2015 at 6:22 PM, Haomai Wang haomaiw...@gmail.com wrote:
 On Wed, Aug 26, 2015 at 11:16 PM, huang jun hjwsm1...@gmail.com wrote:
 hi,all
 we create a 2TB rbd image, after map it to local,
 then we format it to xfs with 'mkfs.xfs /dev/rbd0', it spent 318
 seconds to finish, but  local physical disk with the same size just
 need 6 seconds.


 I think librbd have two PR related to this.

 After debug, we found there are two steps in rbd module during formating:
 a) send  233093 DELETE requests to osds(number_of_requests = 2TB / 4MB),
this step spent almost 92 seconds.

 I guess this(https://github.com/ceph/ceph/pull/4221/files) may help

It's submitting deletes for non-existent objects, not zeroing.  The
only thing that will really help here is the addition of rbd object map
support to the kernel client.  That could happen in 4.4, but 4.5 is
a safer bet.


 b) send 4238 messages like this: [set-alloc-hint object_size 4194304
 write_size 4194304,write 0~512] to osds, that spent 227 seconds.

 I think kernel rbd also need to use
 https://github.com/ceph/ceph/pull/4983/files

set-alloc-hint may be a problem, but I think a bigger problem is the
size of the write.  Are all those writes 512 bytes long?

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel RBD Readahead

2015-08-25 Thread Ilya Dryomov
On Tue, Aug 25, 2015 at 5:05 PM, Nick Fisk n...@fisk.me.uk wrote:
 -Original Message-
 From: Ilya Dryomov [mailto:idryo...@gmail.com]
 Sent: 25 August 2015 09:45
 To: Nick Fisk n...@fisk.me.uk
 Cc: Ceph Development ceph-devel@vger.kernel.org
 Subject: Re: Kernel RBD Readahead

 On Tue, Aug 25, 2015 at 10:40 AM, Nick Fisk n...@fisk.me.uk wrote:
  I have done two tests one with 1MB objects and another with 4MB objects,
 my cluster is a little busier than when I did the quick test yesterday, so 
 all
 speeds are slightly down across the board but you can see the scaling effect
 nicely. Results:
 
  1MB Order RBD
  Readahead   DD-Null Speed
  128         18MB/s
  1024        32MB/s
  2048        40MB/s
  4096        58MB/s
  8192        75MB/s
  16384       91MB/s
  32768       160MB/s
 
  4MB Order RBD
  128         42MB/s
  1024        56MB/s
  2048        61MB/s
  4096        98MB/s
  8192        121MB/s
  16384       170MB/s
  32768       195MB/s
  65536       221MB/s
  131072      271MB/s
 
  I think the results confirm my suspicions, where a full stripe in a raid 
  array
 will usually only be a couple of MB (eg 256kb chunk * 8 disks) and so a
 relatively small readahead will involve all the disks for max performance. 
 In a
 Ceph RBD a full stripe will be 4MB * number of OSD's in the cluster. So I 
 think
 that if sequential read performance is the only goal, then readahead
 probably needs to equal that figure, which could be massive. But in reality
 like me you will probably find that  you get sufficient performance at a 
 lower
 value. Of course all this theory could all change when the kernel client gets
 striping support.
 
  However in terms of a default, that’s a tricky one. Even setting it to 4096
 would probably start to have a negative impact on pure random IO latency.
 Each read would make a OSD read a whole 4MB object, see small table below
 for IOP/Read size for disks in my cluster. I would imagine somewhere
 between 256-1024 would be a good trade off between where the OSD disks
 latency starts to rise. Users would need to be aware of their workload and
 tweak readahead if needed.
 
  IOPs
  (4k Random Read)    83
  (64k Random Read)   81
  (256k Random Read)  73
  (1M Random Read)    52
  (4M Random Read)    25

 Yeah, we want a sensible default, but it's always going to be a trade off.
 librbd has readahead knobs, but the only real use case there is shortening
 qemu boot times, so we can't copy those settings.  I'll have to think about 
 it
 some more - it might make more sense to leave things as is.  Users with large
 sequential read workloads should know to check and adjust readahead
 settings, and likely won't be satisfied with 1x object size anyway.

 Ok. I might try and create a 4.1 kernel with the blk-mq queue depth/IO size + 
 readahead +max_segments fixes in as I'm think the TCP_NODELAY bug will still 
 be present in my old 3.14 kernel.

I can build 4.2-rc8 + readahead patch on gitbuilders for you.

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel RBD Readahead

2015-08-25 Thread Ilya Dryomov
On Tue, Aug 25, 2015 at 10:40 AM, Nick Fisk n...@fisk.me.uk wrote:
 I have done two tests one with 1MB objects and another with 4MB objects, my 
 cluster is a little busier than when I did the quick test yesterday, so all 
 speeds are slightly down across the board but you can see the scaling effect 
 nicely. Results:

 1MB Order RBD
 Readahead   DD-Null Speed
 128         18MB/s
 1024        32MB/s
 2048        40MB/s
 4096        58MB/s
 8192        75MB/s
 16384       91MB/s
 32768       160MB/s

 4MB Order RBD
 128         42MB/s
 1024        56MB/s
 2048        61MB/s
 4096        98MB/s
 8192        121MB/s
 16384       170MB/s
 32768       195MB/s
 65536       221MB/s
 131072      271MB/s

 I think the results confirm my suspicions, where a full stripe in a raid 
 array will usually only be a couple of MB (eg 256kb chunk * 8 disks) and so a 
 relatively small readahead will involve all the disks for max performance. In 
 a Ceph RBD a full stripe will be 4MB * number of OSD's in the cluster. So I 
 think that if sequential read performance is the only goal, then readahead 
 probably needs to equal that figure, which could be massive. But in reality 
 like me you will probably find that  you get sufficient performance at a 
 lower value. Of course all this theory could all change when the kernel 
 client gets striping support.

 However in terms of a default, that’s a tricky one. Even setting it to 4096 
 would probably start to have a negative impact on pure random IO latency. 
 Each read would make a OSD read a whole 4MB object, see small table below for 
 IOP/Read size for disks in my cluster. I would imagine somewhere between 
 256-1024 would be a good trade off between where the OSD disks latency starts 
 to rise. Users would need to be aware of their workload and tweak readahead 
 if needed.

 IOPs
 (4k Random Read)    83
 (64k Random Read)   81
 (256k Random Read)  73
 (1M Random Read)    52
 (4M Random Read)    25

Yeah, we want a sensible default, but it's always going to be a trade
off.  librbd has readahead knobs, but the only real use case there is
shortening qemu boot times, so we can't copy those settings.  I'll have
to think about it some more - it might make more sense to leave things
as is.  Users with large sequential read workloads should know to check
and adjust readahead settings, and likely won't be satisfied with 1x
object size anyway.

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel RBD Readahead

2015-08-24 Thread Ilya Dryomov
On Sun, Aug 23, 2015 at 10:23 PM, Nick Fisk n...@fisk.me.uk wrote:
 -Original Message-
 From: Ilya Dryomov [mailto:idryo...@gmail.com]
 Sent: 23 August 2015 18:33
 To: Nick Fisk n...@fisk.me.uk
 Cc: Ceph Development ceph-devel@vger.kernel.org
 Subject: Re: Kernel RBD Readahead

 On Sat, Aug 22, 2015 at 11:45 PM, Nick Fisk n...@fisk.me.uk wrote:
  Hi Ilya,
 
  I was wondering if I could just get your thoughts on a matter I have
  run into?
 
  Its surrounding read performance of the RBD kernel client and blk-mq,
  mainly when doing large single threaded reads. During testing
  performance seems to be limited to around 40MB/s, which is probably
  fairly similar to what you would expect to get from a single OSD. This
  is to be expected as an RBD is just a long chain of objects each on a
  different OSD which is being read through in order one at a time.
 
  In theory readahead should make up for this by making the RBD client
  read from several OSD’s ahead of the current required block. However
  from what I can see it seems that setting a readahead value higher
  than max_sectors_kb doesn’t appear to have any effect, meaning that
  readahead is limited to each object that is currently being read.
  Would you be able to confirm if this is correct and if this is by design?

 [CCing ceph-devel]

 Certainly not by design.  rbd is just a block device driver, so if the kernel
 submits a readahead read, it will obey and carry it out in full.
 The readahead is driven by the VM in pages, it doesn't care about rbd object
 boundaries and such.

 That said, one problem is in the VM subsystem, where readaheads get
 capped at 512 pages (= 2M).  If you do a simple single threaded read test,
 you'll see 4096 sector (= 2M) I/Os instead of object size I/Os:

 $ rbd info foo | grep order
 order 24 (16384 kB objects)
 $ blockdev --getra /dev/rbd0
 32768
 $ dd if=/dev/rbd0 of=/dev/null bs=32M
 # avgrq-sz is 4096.00

 This was introduced in commit 6d2be915e589 (mm/readahead.c: fix
 readahead failure for memoryless NUMA nodes and limit readahead pages)
 [1], which went into 3.15.  The hard limit was Linus' suggestion, apparently.

 #define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)
 /*
  * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
  * sensible upper limit.
  */
 unsigned long max_sane_readahead(unsigned long nr) {
 return min(nr, MAX_READAHEAD);
 }

 This limit used to be dynamic and depended on the number of free pages in
 the system.  There has been an attempt to bring that behaviour back [2], but
 it didn't go very far as far as getting into mainline.  It looks like Red 
 Hat and
 Oracle are shipping [2] in some of their kernels though.  If you apply it, 
 you'll
 see 32768 sector (= 16M) I/Os in the above test, which is how it should be.

 [1]
 https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6d
 2be915e589b58cb11418cbe1f22ff90732b6ac
 [2] http://thread.gmane.org/gmane.linux.kernel/1893680

 One thing we should be doing is setting read_ahead_kb to object size, the
 default 128k doesn't really cut it for rbd.  I'll send a patch for that.

 Thanks,

 Ilya


 Thanks for your response.

 I do see the IO's being limited to 4096 sectors in the 4.1 kernel and so that 
 is likely to be partly the cause of the poor performance I am seeing. However 
 I tried a 3.14 kernel and saw the same level performance, but this time the 
 IO's were limited to 1024 sectors. The queue depth was at around 8 so I guess 
 this means its submitting 8*512kb IO's up to the max_sectors_kb limit of 
 4096KB. From the OSD point of view, this will still be accessing 1 OSD at a 
 time.

 Maybe I'm expecting the wrong results but I was expecting that either 1 of 
 these 2 scenarios would happen.

 1. Kernel submits a large enough IO to satisfy the readahead value, 
 max_sectors_kb would need to be higher than the object size (currently not 
 possible) and the RADOS layer would be responsible for doing the parallel 
 reads to the OSD's to satisfy it.

 2. Kernel recognises that the readahead is bigger than the max_sectors_kb 
 value and submits several IO's in parallel to the RBD device to satisfy the 
 readahead request. Ie 32MB readahead would submit 8x4MB IO's in parallel.

 Please let me know if I have got the wrong idea here. But in my head either 
 solution should improve sequential reads by a large amount, with the 2nd 
 possibly slightly better as you are only waiting on the 1st OSD to respond to 
 complete the request.

 Thanks for including the Ceph-Devel list, unfortunately despite several 
 attempts I have not been able to post to this list after subscribing, please 
 can you forward any correspondence if you think it would be useful to share.

Did you remember to set max_sectors_kb to max_hw_sectors_kb?  The block
layer in 3.14 leaves max_sectors_kb at 512, even when max_hw_sectors_kb
is set to a much bigger value by the driver

Re: Kernel RBD Readahead

2015-08-24 Thread Ilya Dryomov
On Mon, Aug 24, 2015 at 5:43 PM, Ilya Dryomov idryo...@gmail.com wrote:
 On Sun, Aug 23, 2015 at 10:23 PM, Nick Fisk n...@fisk.me.uk wrote:
 -Original Message-
 From: Ilya Dryomov [mailto:idryo...@gmail.com]
 Sent: 23 August 2015 18:33
 To: Nick Fisk n...@fisk.me.uk
 Cc: Ceph Development ceph-devel@vger.kernel.org
 Subject: Re: Kernel RBD Readahead

 On Sat, Aug 22, 2015 at 11:45 PM, Nick Fisk n...@fisk.me.uk wrote:
  Hi Ilya,
 
  I was wondering if I could just get your thoughts on a matter I have
  run into?
 
  Its surrounding read performance of the RBD kernel client and blk-mq,
  mainly when doing large single threaded reads. During testing
  performance seems to be limited to around 40MB/s, which is probably
  fairly similar to what you would expect to get from a single OSD. This
  is to be expected as an RBD is just a long chain of objects each on a
  different OSD which is being read through in order one at a time.
 
  In theory readahead should make up for this by making the RBD client
  read from several OSD’s ahead of the current required block. However
  from what I can see it seems that setting a readahead value higher
  than max_sectors_kb doesn’t appear to have any effect, meaning that
  readahead is limited to each object that is currently being read.
  Would you be able to confirm if this is correct and if this is by design?

 [CCing ceph-devel]

 Certainly not by design.  rbd is just a block device driver, so if the 
 kernel
 submits a readahead read, it will obey and carry it out in full.
 The readahead is driven by the VM in pages, it doesn't care about rbd object
 boundaries and such.

 That said, one problem is in the VM subsystem, where readaheads get
 capped at 512 pages (= 2M).  If you do a simple single threaded read test,
 you'll see 4096 sector (= 2M) I/Os instead of object size I/Os:

 $ rbd info foo | grep order
 order 24 (16384 kB objects)
 $ blockdev --getra /dev/rbd0
 32768
 $ dd if=/dev/rbd0 of=/dev/null bs=32M
 # avgrq-sz is 4096.00

 This was introduced in commit 6d2be915e589 (mm/readahead.c: fix
 readahead failure for memoryless NUMA nodes and limit readahead pages)
 [1], which went into 3.15.  The hard limit was Linus' suggestion, 
 apparently.

 #define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)
 /*
  * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
  * sensible upper limit.
  */
 unsigned long max_sane_readahead(unsigned long nr) {
 return min(nr, MAX_READAHEAD);
 }

 This limit used to be dynamic and depended on the number of free pages in
 the system.  There has been an attempt to bring that behaviour back [2], but
 it didn't go very far as far as getting into mainline.  It looks like Red 
 Hat and
 Oracle are shipping [2] in some of their kernels though.  If you apply it, 
 you'll
 see 32768 sector (= 16M) I/Os in the above test, which is how it should be.

 [1]
 https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6d
 2be915e589b58cb11418cbe1f22ff90732b6ac
 [2] http://thread.gmane.org/gmane.linux.kernel/1893680

 One thing we should be doing is setting read_ahead_kb to object size, the
 default 128k doesn't really cut it for rbd.  I'll send a patch for that.

 Thanks,

 Ilya


 Thanks for your response.

 I do see the IO's being limited to 4096 sectors in the 4.1 kernel and so 
 that is likely to be partly the cause of the poor performance I am seeing. 
 However I tried a 3.14 kernel and saw the same level performance, but this 
 time the IO's were limited to 1024 sectors. The queue depth was at around 8 
 so I guess this means its submitting 8*512kb IO's up to the max_sectors_kb 
 limit of 4096KB. From the OSD point of view, this will still be accessing 1 
 OSD at a time.

 Maybe I'm expecting the wrong results but I was expecting that either 1 of 
 these 2 scenarios would happen.

 1. Kernel submits a large enough IO to satisfy the readahead value, 
 max_sectors_kb would need to be higher than the object size (currently not 
 possible) and the RADOS layer would be responsible for doing the parallel 
 reads to the OSD's to satisfy it.

 2. Kernel recognises that the readahead is bigger than the max_sectors_kb 
 value and submits several IO's in parallel to the RBD device to satisfy the 
 readahead request. Ie 32MB readahead would submit 8x4MB IO's in parallel.

 Please let me know if I have got the wrong idea here. But in my head either 
 solution should improve sequential reads by a large amount, with the 2nd 
 possibly slightly better as you are only waiting on the 1st OSD to respond 
 to complete the request.

 Thanks for including the Ceph-Devel list, unfortunately despite several 
 attempts I have not been able to post to this list after subscribing, please 
 can you forward any correspondence if you think it would be useful to share.

 Did you remember to set max_sectors_kb to max_hw_sectors_kb?  The block
 layer in 3.14 leaves max_sectors_kb

Re: Kernel RBD Readahead

2015-08-24 Thread Ilya Dryomov
On Mon, Aug 24, 2015 at 7:00 PM, Nick Fisk n...@fisk.me.uk wrote:
 -Original Message-
 From: Ilya Dryomov [mailto:idryo...@gmail.com]
 Sent: 24 August 2015 16:07
 To: Nick Fisk n...@fisk.me.uk
 Cc: Ceph Development ceph-devel@vger.kernel.org
 Subject: Re: Kernel RBD Readahead

 On Mon, Aug 24, 2015 at 5:43 PM, Ilya Dryomov idryo...@gmail.com
 wrote:
  On Sun, Aug 23, 2015 at 10:23 PM, Nick Fisk n...@fisk.me.uk wrote:
  -Original Message-
  From: Ilya Dryomov [mailto:idryo...@gmail.com]
  Sent: 23 August 2015 18:33
  To: Nick Fisk n...@fisk.me.uk
  Cc: Ceph Development ceph-devel@vger.kernel.org
  Subject: Re: Kernel RBD Readahead
 
  On Sat, Aug 22, 2015 at 11:45 PM, Nick Fisk n...@fisk.me.uk wrote:
   Hi Ilya,
  
   I was wondering if I could just get your thoughts on a matter I
   have run into?
  
   Its surrounding read performance of the RBD kernel client and
   blk-mq, mainly when doing large single threaded reads. During
   testing performance seems to be limited to around 40MB/s, which is
   probably fairly similar to what you would expect to get from a
   single OSD. This is to be expected as an RBD is just a long chain
   of objects each on a different OSD which is being read through in order
 one at a time.
  
   In theory readahead should make up for this by making the RBD
   client read from several OSD’s ahead of the current required
   block. However from what I can see it seems that setting a
   readahead value higher than max_sectors_kb doesn’t appear to have
   any effect, meaning that readahead is limited to each object that is
 currently being read.
   Would you be able to confirm if this is correct and if this is by 
   design?
 
  [CCing ceph-devel]
 
  Certainly not by design.  rbd is just a block device driver, so if
  the kernel submits a readahead read, it will obey and carry it out in 
  full.
  The readahead is driven by the VM in pages, it doesn't care about
  rbd object boundaries and such.
 
  That said, one problem is in the VM subsystem, where readaheads get
  capped at 512 pages (= 2M).  If you do a simple single threaded read
  test, you'll see 4096 sector (= 2M) I/Os instead of object size I/Os:
 
  $ rbd info foo | grep order
  order 24 (16384 kB objects)
  $ blockdev --getra /dev/rbd0
  32768
  $ dd if=/dev/rbd0 of=/dev/null bs=32M
  # avgrq-sz is 4096.00
 
  This was introduced in commit 6d2be915e589 (mm/readahead.c: fix
  readahead failure for memoryless NUMA nodes and limit readahead
  pages) [1], which went into 3.15.  The hard limit was Linus' suggestion,
 apparently.
 
  #define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)
  /*
   * Given a desired number of PAGE_CACHE_SIZE readahead pages,
 return
  a
   * sensible upper limit.
   */
  unsigned long max_sane_readahead(unsigned long nr) {
  return min(nr, MAX_READAHEAD); }
 
  This limit used to be dynamic and depended on the number of free
  pages in the system.  There has been an attempt to bring that
  behaviour back [2], but it didn't go very far as far as getting into
  mainline.  It looks like Red Hat and Oracle are shipping [2] in some
  of their kernels though.  If you apply it, you'll see 32768 sector (= 
  16M)
 I/Os in the above test, which is how it should be.
 
  [1]
  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/comm
  it/?id=6d 2be915e589b58cb11418cbe1f22ff90732b6ac
  [2] http://thread.gmane.org/gmane.linux.kernel/1893680
 
  One thing we should be doing is setting read_ahead_kb to object
  size, the default 128k doesn't really cut it for rbd.  I'll send a patch 
  for
 that.
 
  Thanks,
 
  Ilya
 
 
  Thanks for your response.
 
  I do see the IO's being limited to 4096 sectors in the 4.1 kernel and so 
  that
 is likely to be partly the cause of the poor performance I am seeing. However
 I tried a 3.14 kernel and saw the same level performance, but this time the
 IO's were limited to 1024 sectors. The queue depth was at around 8 so I
 guess this means its submitting 8*512kb IO's up to the max_sectors_kb limit
 of 4096KB. From the OSD point of view, this will still be accessing 1 OSD at 
 a
 time.
 
  Maybe I'm expecting the wrong results but I was expecting that either 1
 of these 2 scenarios would happen.
 
  1. Kernel submits a large enough IO to satisfy the readahead value,
 max_sectors_kb would need to be higher than the object size (currently not
 possible) and the RADOS layer would be responsible for doing the parallel
 reads to the OSD's to satisfy it.
 
  2. Kernel recognises that the readahead is bigger than the
 max_sectors_kb value and submits several IO's in parallel to the RBD device
 to satisfy the readahead request. Ie 32MB readahead would submit 8x4MB
 IO's in parallel.
 
  Please let me know if I have got the wrong idea here. But in my head
 either solution should improve sequential reads by a large amount, with the
 2nd possibly slightly better as you are only waiting on the 1st OSD

Re: Kernel RBD Readahead

2015-08-24 Thread Ilya Dryomov
On Mon, Aug 24, 2015 at 11:11 PM, Nick Fisk n...@fisk.me.uk wrote:




 -Original Message-
 From: Ilya Dryomov [mailto:idryo...@gmail.com]
 Sent: 24 August 2015 18:19
 To: Nick Fisk n...@fisk.me.uk
 Cc: Ceph Development ceph-devel@vger.kernel.org
 Subject: Re: Kernel RBD Readahead

 On Mon, Aug 24, 2015 at 7:00 PM, Nick Fisk n...@fisk.me.uk wrote:
  -Original Message-
  From: Ilya Dryomov [mailto:idryo...@gmail.com]
  Sent: 24 August 2015 16:07
  To: Nick Fisk n...@fisk.me.uk
  Cc: Ceph Development ceph-devel@vger.kernel.org
  Subject: Re: Kernel RBD Readahead
 
  On Mon, Aug 24, 2015 at 5:43 PM, Ilya Dryomov idryo...@gmail.com
  wrote:
   On Sun, Aug 23, 2015 at 10:23 PM, Nick Fisk n...@fisk.me.uk wrote:
   -Original Message-
   From: Ilya Dryomov [mailto:idryo...@gmail.com]
   Sent: 23 August 2015 18:33
   To: Nick Fisk n...@fisk.me.uk
   Cc: Ceph Development ceph-devel@vger.kernel.org
   Subject: Re: Kernel RBD Readahead
  
   On Sat, Aug 22, 2015 at 11:45 PM, Nick Fisk n...@fisk.me.uk wrote:
Hi Ilya,
   
I was wondering if I could just get your thoughts on a matter I
have run into?
   
Its surrounding read performance of the RBD kernel client and
blk-mq, mainly when doing large single threaded reads. During
testing performance seems to be limited to around 40MB/s, which
is probably fairly similar to what you would expect to get from
a single OSD. This is to be expected as an RBD is just a long
chain of objects each on a different OSD which is being read
through in order
  one at a time.
   
In theory readahead should make up for this by making the RBD
client read from several OSD’s ahead of the current required
block. However from what I can see it seems that setting a
readahead value higher than max_sectors_kb doesn’t appear to
have any effect, meaning that readahead is limited to each
object that is
  currently being read.
Would you be able to confirm if this is correct and if this is by
 design?
  
   [CCing ceph-devel]
  
   Certainly not by design.  rbd is just a block device driver, so
   if the kernel submits a readahead read, it will obey and carry it out 
   in
 full.
   The readahead is driven by the VM in pages, it doesn't care about
   rbd object boundaries and such.
  
   That said, one problem is in the VM subsystem, where readaheads
   get capped at 512 pages (= 2M).  If you do a simple single
   threaded read test, you'll see 4096 sector (= 2M) I/Os instead of
 object size I/Os:
  
   $ rbd info foo | grep order
   order 24 (16384 kB objects)
   $ blockdev --getra /dev/rbd0
   32768
   $ dd if=/dev/rbd0 of=/dev/null bs=32M
   # avgrq-sz is 4096.00
  
   This was introduced in commit 6d2be915e589 (mm/readahead.c: fix
   readahead failure for memoryless NUMA nodes and limit readahead
   pages) [1], which went into 3.15.  The hard limit was Linus'
   suggestion,
  apparently.
  
   #define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)
   /*
* Given a desired number of PAGE_CACHE_SIZE readahead pages,
  return
   a
* sensible upper limit.
*/
   unsigned long max_sane_readahead(unsigned long nr) {
   return min(nr, MAX_READAHEAD); }
  
   This limit used to be dynamic and depended on the number of free
   pages in the system.  There has been an attempt to bring that
   behaviour back [2], but it didn't go very far as far as getting
   into mainline.  It looks like Red Hat and Oracle are shipping [2]
   in some of their kernels though.  If you apply it, you'll see
   32768 sector (= 16M)
  I/Os in the above test, which is how it should be.
  
   [1]
   https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/c
   omm it/?id=6d 2be915e589b58cb11418cbe1f22ff90732b6ac
   [2] http://thread.gmane.org/gmane.linux.kernel/1893680
  
   One thing we should be doing is setting read_ahead_kb to object
   size, the default 128k doesn't really cut it for rbd.  I'll send
   a patch for
  that.
  
   Thanks,
  
   Ilya
  
  
   Thanks for your response.
  
   I do see the IO's being limited to 4096 sectors in the 4.1 kernel
   and so that
  is likely to be partly the cause of the poor performance I am seeing.
  However I tried a 3.14 kernel and saw the same level performance, but
  this time the IO's were limited to 1024 sectors. The queue depth was
  at around 8 so I guess this means its submitting 8*512kb IO's up to
  the max_sectors_kb limit of 4096KB. From the OSD point of view, this
  will still be accessing 1 OSD at a time.
  
   Maybe I'm expecting the wrong results but I was expecting that
   either 1
  of these 2 scenarios would happen.
  
   1. Kernel submits a large enough IO to satisfy the readahead
   value,
  max_sectors_kb would need to be higher than the object size
  (currently not
  possible) and the RADOS layer would be responsible for doing the
  parallel reads to the OSD's to satisfy it.
  
   2. Kernel recognises

Re: Kernel RBD Readahead

2015-08-23 Thread Ilya Dryomov
On Sat, Aug 22, 2015 at 11:45 PM, Nick Fisk n...@fisk.me.uk wrote:
 Hi Ilya,

 I was wondering if I could just get your thoughts on a matter I have run
 into?

 Its surrounding read performance of the RBD kernel client and blk-mq, mainly
 when doing large single threaded reads. During testing performance seems to
 be limited to around 40MB/s, which is probably fairly similar to what you
 would expect to get from a single OSD. This is to be expected as an RBD is
 just a long chain of objects each on a different OSD which is being read
 through in order one at a time.

 In theory readahead should make up for this by making the RBD client read
 from several OSD’s ahead of the current required block. However from what I
 can see it seems that setting a readahead value higher than max_sectors_kb
 doesn’t appear to have any effect, meaning that readahead is limited to each
 object that is currently being read. Would you be able to confirm if this is
 correct and if this is by design?

[CCing ceph-devel]

Certainly not by design.  rbd is just a block device driver, so if the
kernel submits a readahead read, it will obey and carry it out in full.
The readahead is driven by the VM in pages, it doesn't care about rbd
object boundaries and such.

That said, one problem is in the VM subsystem, where readaheads get
capped at 512 pages (= 2M).  If you do a simple single threaded read
test, you'll see 4096 sector (= 2M) I/Os instead of object size I/Os:

$ rbd info foo | grep order
order 24 (16384 kB objects)
$ blockdev --getra /dev/rbd0
32768
$ dd if=/dev/rbd0 of=/dev/null bs=32M
# avgrq-sz is 4096.00

This was introduced in commit 6d2be915e589 (mm/readahead.c: fix
readahead failure for memoryless NUMA nodes and limit readahead pages)
[1], which went into 3.15.  The hard limit was Linus' suggestion,
apparently.

#define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)
/*
 * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
 * sensible upper limit.
 */
unsigned long max_sane_readahead(unsigned long nr)
{
return min(nr, MAX_READAHEAD);
}

This limit used to be dynamic and depended on the number of free pages
in the system.  There has been an attempt to bring that behaviour back
[2], but it didn't go very far as far as getting into mainline.  It
looks like Red Hat and Oracle are shipping [2] in some of their kernels
though.  If you apply it, you'll see 32768 sector (= 16M) I/Os in the
above test, which is how it should be.

[1] 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6d2be915e589b58cb11418cbe1f22ff90732b6ac
[2] http://thread.gmane.org/gmane.linux.kernel/1893680

One thing we should be doing is setting read_ahead_kb to object size,
the default 128k doesn't really cut it for rbd.  I'll send a patch for
that.
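
In the meantime, bumping it by hand is trivial - a sketch, with the
device name and the 16M object size from the example above as
assumptions (echoing into the sysfs knob does the same thing):

#include <stdio.h>

int main(void)
{
        /* device name and object size are assumptions for the example */
        const char *knob = "/sys/block/rbd0/queue/read_ahead_kb";
        unsigned long object_kb = 16384;        /* order 24 -> 16M objects */
        FILE *f = fopen(knob, "w");

        if (!f) {
                perror(knob);
                return 1;
        }
        fprintf(f, "%lu\n", object_kb);
        fclose(f);
        return 0;
}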

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: /sys/block and /dev and partitions

2015-08-18 Thread Ilya Dryomov
On Tue, Aug 18, 2015 at 12:46 AM, Loic Dachary l...@dachary.org wrote:
 Hi Ilya,

 For regular devices such as /dev/vdb2 or /dev/sda3, do you think it is safe 
 to use /sys/dev/block/M:m/partition to figure out the partition number ? Or 
 could it vary depending on the disk driver or the partition table layout ? Or 
 the kernel version ?

Yes, I think it is.  partition was added specifically for finding out
partition numbers after extended devt scheme was merged, because it
made it impossible to use arithmetic on minor numbers.  That happened
around 2.6.27 - 2.6.28, so it's a more recent one compared to others,
but I don't think you care about anything pre 2.6.32.

(You could choose not to rely on "partition" and access() "start"
instead for figuring out if the thing is a partition - "start" has been
around forever - but then the only way to figure out the partition
index would be to parse the name, which you want to avoid.)
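
Something like this is all it takes - a sketch with minimal error
handling, using the /sys/dev/block/<maj>:<min>/partition attribute
discussed above:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/* 0 = whole disk, >0 = partition index, <0 = not a block device / error */
static int partition_index(const char *devnode)
{
        struct stat st;
        char path[64];
        FILE *f;
        int idx;

        if (stat(devnode, &st) < 0 || !S_ISBLK(st.st_mode))
                return -1;

        snprintf(path, sizeof(path), "/sys/dev/block/%u:%u/partition",
                 major(st.st_rdev), minor(st.st_rdev));

        f = fopen(path, "r");
        if (!f)
                return 0;       /* no "partition" attribute -> whole device */

        if (fscanf(f, "%d", &idx) != 1)
                idx = -1;
        fclose(f);
        return idx;
}

int main(int argc, char **argv)
{
        if (argc > 1)
                printf("%s: %d\n", argv[1], partition_index(argv[1]));
        return 0;
}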

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rbd: default map options, new options, misc (Re: creating the issue for the v0.94.4 release)

2015-08-18 Thread Ilya Dryomov
On Mon, Aug 17, 2015 at 8:41 PM, Loic Dachary l...@dachary.org wrote:


 On 17/08/2015 18:31, Josh Durgin wrote:
 Yes, this would make sense in firefly too, since it's useful for newer 
 kernels regardless of ceph userspace versions.

 Ok, I'll schedule it for backport. Is there an issue associated to this pull 
 request ?

No, probably not.

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: /sys/block and /dev and partitions

2015-08-15 Thread Ilya Dryomov
On Sat, Aug 15, 2015 at 11:56 PM, Loic Dachary l...@dachary.org wrote:
 Hi Ilya,

 On 15/08/2015 19:42, Ilya Dryomov wrote:
 On Sat, Aug 15, 2015 at 6:35 PM, Loic Dachary l...@dachary.org wrote:
 Hi Sage,

 On 15/08/2015 16:28, Sage Weil wrote:
 On Sat, 15 Aug 2015, Loic Dachary wrote:
 Hi,

 Is there a portable and consistent way to figure out if a given /dev/XXX
 path (for instance /dev/dm-1) is a partition of a whole device ?
 Although checking /sys/block/dm-1/dm/name for a number at the end (like
 mpatha1 or mpatha2) would probably work, it feels like a fragile hack.
 Looking into /sys/block/dm-1/slaves will lead to
 /sys/block/dm-1/slaves/dm-0 and we can check that
 /sys/block/dm-*/subsystem is class/block. But that does not necessarily
 mean dm-1 is a partition of dm-0, just that it's a slave of dm-0.

 Take a look at is_partition in ceph-disk, whih is the best I came up with.
 Basically it checks if the device name appears as /sys/block/*/$foo...

 For regular devices, you can access() /sys/dev/block/maj:min/partition.
 If it's there, it's a partition - no need to iterate over /sys/block.

 I added http://tracker.ceph.com/issues/12706 for when someone has time to 
 rework that part of the code.



 That is consistently updated for /dev/sdb or /dev/vdb but things are 
 different when using multipath. I'll rely on /sys/block/dm-?/dm/name 
 instead until a better solution is found.

 A better way might be to rely on the fact that a dm partition will
 necessarily have its uuid prefixed by part.  In that case, it should
 be safe to assume that the thing in slaves is a whole disk - I think
 that's what various util-linux tools do.  However, IIRC the dm uuid is
 optional, so that won't work on a dm device without a uuid.

 It looks like multipath on both CentOS 7 and Ubuntu 14.04 set the uuid in 
 this way.

 Is it also safe to assume that if the uuid is:

 $ cat /sys/dev/block/253:?/dm/uuid
 mpath-3533007d0
 part1-mpath-3533007d0
 part2-mpath-3533007d0

 it means these were created by multipath because of the mpath ? When asking 
 dmsetup with:

Yes, I think so.  I'm pretty sure these partid- and mpath-
prefixes were devised for exactly this purpose.

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: /sys/block and /dev and partitions

2015-08-15 Thread Ilya Dryomov
On Sat, Aug 15, 2015 at 6:35 PM, Loic Dachary l...@dachary.org wrote:
 Hi Sage,

 On 15/08/2015 16:28, Sage Weil wrote:
 On Sat, 15 Aug 2015, Loic Dachary wrote:
 Hi,

 Is there a portable and consistent way to figure out if a given /dev/XXX
 path (for instance /dev/dm-1) is a partition of a whole device ?
 Although checking /sys/block/dm-1/dm/name for a number at the end (like
 mpatha1 or mpatha2) would probably work, it feels like a fragile hack.
 Looking into /sys/block/dm-1/slaves will lead to
 /sys/block/dm-1/slaves/dm-0 and we can check that
 /sys/block/dm-*/subsystem is class/block. But that does not necessarily
 mean dm-1 is a partition of dm-0, just that it's a slave of dm-0.

 Take a look at is_partition in ceph-disk, whih is the best I came up with.
 Basically it checks if the device name appears as /sys/block/*/$foo...

For regular devices, you can access() /sys/dev/block/maj:min/partition.
If it's there, it's a partition - no need to iterate over /sys/block.


 That is consistently updated for /dev/sdb or /dev/vdb but things are 
 different when using multipath. I'll rely on /sys/block/dm-?/dm/name instead 
 until a better solution is found.

A better way might be to rely on the fact that a dm partition will
necessarily have its uuid prefixed by part.  In that case, it should
be safe to assume that the thing in slaves is a whole disk - I think
that's what various util-linux tools do.  However, IIRC the dm uuid is
optional, so that won't work on a dm device without a uuid.
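
A rough sketch of that heuristic, for illustration only - the 253:1
pair mirrors the dm-1 examples in this thread, and no uuid means no
answer:

#include <stdio.h>
#include <string.h>

/* 1 = partition mapping, 0 = not, -1 = no dm uuid / not a dm device */
static int dm_uuid_is_partition(unsigned int maj, unsigned int min)
{
        char path[64], uuid[128];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/dev/block/%u:%u/dm/uuid",
                 maj, min);
        f = fopen(path, "r");
        if (!f)
                return -1;
        if (!fgets(uuid, sizeof(uuid), f))
                uuid[0] = '\0';
        fclose(f);

        if (uuid[0] == '\0' || uuid[0] == '\n')
                return -1;      /* uuid not set, can't tell */
        return strncmp(uuid, "part", 4) == 0;
}

int main(void)
{
        printf("dm-1 a partition mapping? %d\n", dm_uuid_is_partition(253, 1));
        return 0;
}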

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rbd object map

2015-08-07 Thread Ilya Dryomov
On Fri, Aug 7, 2015 at 9:13 AM, Max Yehorov myeho...@skytap.com wrote:
 Hi,

 Object map feature was added in hammer. It is possible to create and
 delete an image with object map enabled using rbd, though it is not
 possible to map the image which has object map or exclusive lock
 features enabled, e.g.:

   ceph@ceph1:~$ rbd create foo --size=42 --image-format=2 --image-features=15
   ceph@ceph1:~$ sudo rbd map foo
   rbd: sysfs write failed
   rbd: map failed: (6) No such device or address


 I checked rbd.c in linux/drivers/block/rbd.c, it looks like that it
 only supports layering and striping as of kernel 4.1 branch.

Correct, and note that RBD_FEATURE_STRIPINGV2 is only partially
supported - you can map a STRIPINGV2 image only if it was created
with default striping parameters, i.e. you didn't specify --stripe-unit
or --stripe-count at create time.


   #define RBD_FEATURE_LAYERING(10)
   #define RBD_FEATURE_STRIPINGV2  (11)
   #define RBD_FEATURES_ALL \
 (RBD_FEATURE_LAYERING | RBD_FEATURE_STRIPINGV2)

   /* Features supported by this (client software) implementation. */

   #define RBD_FEATURES_SUPPORTED  (RBD_FEATURES_ALL)

 Is object map a partial feature in hammer? If so, will there be a
 complete support of object map in some upcoming ceph release?

ceph releases and kernel releases are not tied together.  object map is
fully supported by librbd, it's just not supported by the kernel client
yet.  Work is under way, but it won't show up in anything earlier than
kernel 4.4.

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Newbie question about metadata_list.

2015-08-07 Thread Ilya Dryomov
On Fri, Aug 7, 2015 at 10:12 AM, Łukasz Szymczyk
lukasz.szymc...@corp.ovh.com wrote:
 On Thu, 6 Aug 2015 15:01:54 +0300
 Ilya Dryomov idryo...@gmail.com wrote:

 Hi,

 On Thu, Aug 6, 2015 at 12:26 PM, Łukasz Szymczyk
 lukasz.szymc...@corp.ovh.com wrote:
  Hi,
 
  I'm writing some program to replace image in cluster with it's copy.
  But I have problem with metadata_list.

  librbd::ImageCtx *ic = (librbd::ImageCtx*)im;
  std::string start;
  int max = 1000;
  bufferlist in, out;
  ::encode(start, in);
  ::encode(max, in);
   ret = ((librados::IoCtx*)io)->exec(ic->header_oid, "rbd", 
   "metadata_list", in, out);
   if (ret < 0) printf("fail\n");

 You should use rbd_metadata_list() C API instead of this.

 Looks like my ceph version doesn't have it.


 
  return 0;
 
  }
 
  So, my question is: what should be set/enabled to get those metadata?
  Or maybe what I'm doing wrong here.

 Try rbd image-meta list image?

 I've got this:
 # rbd image-meta list image
 rbd: error parsing command 'image-meta'; -h or --help for usage


 It's a fairly recent feature, are you sure your OSDs support it?
 What's the output of ceph daemon osd.0 version?

 # ceph daemon osd.0 version
 {"version":"0.80.9"}

image metadata was introduced in v9.0.0.

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Newbie question about metadata_list.

2015-08-06 Thread Ilya Dryomov
On Thu, Aug 6, 2015 at 12:26 PM, Łukasz Szymczyk
lukasz.szymc...@corp.ovh.com wrote:
 Hi,

 I'm writing some program to replace image in cluster with it's copy.
 But I have problem with metadata_list.
 I created pool:
 #rados mkpool dupa
 then I created image:
 #rbd create --size 1000 -p mypool image --image-format 2

 Below is code which tries to get metadata, but it fails with -EOPNOTSUPP.
 I compile it with command like this:
 g++ file.cpp -lrbd -lrados -I/ceph/source/directory/src

 #include <stdio.h>
 #include <stdlib.h>
 #include "include/rbd/librbd.h"
 #include "include/rbd/librbd.hpp"
 using namespace ceph;
 #include "librbd/ImageCtx.h"

 int main() {
 rados_t clu;
 int ret = rados_create(&clu, NULL);
 if (ret) return -1;

 ret = rados_conf_read_file(clu, NULL);
 if (ret) return -1;

 rados_conf_parse_env(clu, NULL);
 ret = rados_connect(clu);
 if (ret) return -1;

 rados_ioctx_t io;
 ret = rados_ioctx_create(clu, "mypool", &io);
 if (ret) return -1;

 rbd_image_t im;
 ret = rbd_open(io, "image", &im, NULL);
 if (ret) return -1;

 librbd::ImageCtx *ic = (librbd::ImageCtx*)im;
 std::string start;
 int max = 1000;
 bufferlist in, out;
 ::encode(start, in);
 ::encode(max, in);
 ret = ((librados::IoCtx*)io)->exec(ic->header_oid, "rbd", 
 "metadata_list", in, out);
 if (ret < 0) printf("fail\n");

You should use rbd_metadata_list() C API instead of this.
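
For reference, a sketch of what that could look like, dropped in after
your rbd_open() - the rbd_metadata_list() prototype here is from memory
of librbd.h (keys and values come back as '\0'-separated lists), so
double-check it against your installed headers:

#include <stdio.h>
#include <string.h>
#include <rbd/librbd.h>

static int dump_image_metadata(rbd_image_t image)
{
        char keys[4096], vals[4096];
        size_t keys_len = sizeof(keys), vals_len = sizeof(vals);
        char *k, *v;
        int ret;

        /* start key "", up to 1000 entries; lengths are updated to the
         * number of bytes actually returned (as far as I remember) */
        ret = rbd_metadata_list(image, "", 1000, keys, &keys_len,
                                vals, &vals_len);
        if (ret < 0)
                return ret;

        for (k = keys, v = vals; k < keys + keys_len;
             k += strlen(k) + 1, v += strlen(v) + 1)
                printf("%s = %s\n", k, v);
        return 0;
}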


 return 0;

 }

 So, my question is: what should be set/enabled to get those metadata?
 Or maybe what I'm doing wrong here.

Try rbd image-meta list image?

It's a fairly recent feature, are you sure your OSDs support it?
What's the output of ceph daemon osd.0 version?

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] rbd: fix copyup completion race

2015-07-29 Thread Ilya Dryomov
On Tue, Jul 28, 2015 at 8:48 PM, Alex Elder el...@ieee.org wrote:
 On 07/17/2015 05:36 AM, Ilya Dryomov wrote:
 For write/discard obj_requests that involved a copyup method call, the
 opcode of the first op is CEPH_OSD_OP_CALL and the ->callback is
 rbd_img_obj_copyup_callback().  The latter frees copyup pages, sets
 ->xferred and delegates to rbd_img_obj_callback(), the normal image
 object callback, for reporting to block layer and putting refs.

 First of all, I'm really sorry it took so long to get to
 reviewing this.  The short of this is it looks good, but
 it would be nice for you to read my description and affirm
 it matches reality.  I also suggest you rename a function.

 rbd_osd_req_callback() however treats CEPH_OSD_OP_CALL as a trivial op,
 which means obj_request is marked done in rbd_osd_trivial_callback(),

 This callback path really should be fixed so that every op in
 the r_ops[] array gets handled separately.  That was sort of
 planned, but we initially only had the one case (write copyup),
 and it only had two ops.  Now we have 1, 2, or 3, and in principle
 there could be more, so it would be better to support things
 generically.  The obj_request should only be marked done *after*
 all individual op callbacks have been processed.  But... that
 can be done another day.

Yeah, I'm planning on reworking the entire completion infrastructure.
The sneakiness of rbd_img_obj_callback() is slightly above the level
I'm comfortable with and the things we have to do to make it work
(bumping img_request kref for each obj_request, for example) are
confusing - it should just be a plain atomic_t.  The fact that
rbd_img_obj_callback() may be called after we think we are done with
the request (i.e. after rbd_img_request_complete() was called) is also
alarming to me.  I stared at it for a while and for now convinced
myself there are no *obvious* bugs, but who knows...

All that coupled with the sheer number of functions ending with
_callback (some of which are actual callbacks but most are just called
from those callbacks) makes the code hard to follow and even harder to
make changes to, so I'm going to try to restructure it, hopefully in
the near future.
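
To illustrate what I mean by a plain counter, here is a toy userspace
sketch (names are made up, this is not what rbd.c does today): each
completing obj_request just drops a count and whoever hits zero completes
the parent, no per-child kref juggling.

#include <stdatomic.h>
#include <stdio.h>

/* parent request tracking its in-flight children with a plain counter */
struct img_request {
    atomic_int pending;
};

static void img_request_complete(struct img_request *img)
{
    printf("img_request %p complete\n", (void *)img);
}

/* called once per finished child; the one that drops the count to zero
 * completes the parent */
static void obj_request_done(struct img_request *img)
{
    if (atomic_fetch_sub(&img->pending, 1) == 1)
        img_request_complete(img);
}

int main(void)
{
    struct img_request img = { .pending = 3 };
    obj_request_done(&img);
    obj_request_done(&img);
    obj_request_done(&img);  /* third completion prints the message */
    return 0;
}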


 *before* ->callback is invoked and rbd_img_obj_copyup_callback() has
 a chance to run.  Marking obj_request done essentially means giving
 rbd_img_obj_callback() a license to end it at any moment, so if another
 obj_request from the same img_request is being completed concurrently,
 rbd_img_obj_end_request() may very well be called on such a prematurely
 marked done request:

 <obj_request-1/2 reply>
 handle_reply()
   rbd_osd_req_callback()
     rbd_osd_trivial_callback()
     rbd_obj_request_complete()
     rbd_img_obj_copyup_callback()
       rbd_img_obj_callback()
                                     <obj_request-2/2 reply>
                                     handle_reply()
                                       rbd_osd_req_callback()
                                         rbd_osd_trivial_callback()
         for_each_obj_request(obj_request->img_request) {
           rbd_img_obj_end_request(obj_request-1/2)
           rbd_img_obj_end_request(obj_request-2/2) <--
         }

 Calling rbd_img_obj_end_request() on such a request leads to trouble,
 in particular because its ->xferred is 0.  We report 0 to the block
 layer with blk_update_request(), get back 1 for "this request has more
 data in flight" and then trip on

 rbd_assert(more ^ (which == img_request->obj_request_count));

 Another of these pesky assertions actually identifies a problem.

Most of those rbd_assert() assert pointer != NULL, which is pretty
useless.  However, there is definitely a bunch of useful ones, this
one being one of those, of course.


 with rhs (which == ...) being 1 because rbd_img_obj_end_request() has
 been called for both requests and lhs (more) being 1 because we haven't
 got a chance to set ->xferred in rbd_img_obj_copyup_callback() yet.

 To fix this, leverage that rbd wants to call class methods in only two
 cases: one is a generic method call wrapper (obj_request is standalone)
 and the other is a copyup (obj_request is part of an img_request).  So
 make a dedicated handler for CEPH_OSD_OP_CALL and directly invoke
 rbd_img_obj_copyup_callback() from it if obj_request is part of an
 img_request, similar to how CEPH_OSD_OP_READ handler invokes
 rbd_img_obj_request_read_callback().
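
Concretely, the CEPH_OSD_OP_CALL case gets its own handler, roughly along
these lines (a sketch, the actual patch may differ in details; with this,
the copyup callback rather than the trivial callback becomes responsible
for marking the obj_request done):

static void rbd_osd_call_callback(struct rbd_obj_request *obj_request)
{
        dout("%s: obj %p\n", __func__, obj_request);

        /* copyup on an image object vs a standalone method call */
        if (obj_request_img_data_test(obj_request))
                rbd_img_obj_copyup_callback(obj_request);
        else
                obj_request_done_set(obj_request);
}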

 OK, I understand what you're doing here.  I'm going to restate
 it just to confirm that.

 When an OSD request completes, rbd_osd_req_callback() is called.
 It examines the first op in the object request's r_ops[] array.

 If the first op in an r_ops[] array is a CALL, you handle it with
 a new OSD request callback (sub)function rbd_osd_call_callback().
 (That's a reasonable structural change to make anyway, I think.)
 That function now distinguishes between a copyup request (which
 will be for an image object) and a simple method call (which will
 be a request on an object not associated

Re: max rbd devices

2015-07-28 Thread Ilya Dryomov
On Tue, Jul 28, 2015 at 5:19 PM, Wyllys Ingersoll
wyllys.ingers...@keepertech.com wrote:
 Actually, it's a 4.1rc8 kernel. And modinfo shows that the single_major
 parameter defaults to false.

 $ sudo modinfo rbd
 filename:   
 /lib/modules/4.1.0-040100rc8-generic/kernel/drivers/block/rbd.ko
 license:GPL
 description:RADOS Block Device (RBD) driver
 author: Jeff Garzik j...@garzik.org
 author: Yehuda Sadeh yeh...@hq.newdream.net
 author: Sage Weil s...@newdream.net
 author: Alex Elder el...@inktank.com
 srcversion: C0EF017464E250E25E466E8
 depends:libceph
 intree: Y
 vermagic:   4.1.0-040100rc8-generic SMP mod_unload modversions
 signer: Build time autogenerated kernel key
 sig_key:FC:1C:37:16:92:C8:E7:E5:17:9C:06:C5:5F:C4:DF:7E:19:00:A9:0F
 sig_hashalgo:   sha512
 parm:   single_major:Use a single major number for all rbd
 devices (default: false) (bool)

Yes, it defaults to false.  But if rbd.ko isn't loaded by the time you
do "rbd map" (which it shouldn't be), the rbd CLI tool will load it for
you with single_major=Y.  That's why I asked if you loaded it by hand.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: max rbd devices

2015-07-28 Thread Ilya Dryomov
On Tue, Jul 28, 2015 at 5:59 PM, Wyllys Ingersoll
wyllys.ingers...@keepertech.com wrote:
 Interesting to note - /etc/modules contains a line for rbd, which
 means it gets loaded at boot time and NOT via the rbd cli, so perhaps
 this is why it is loaded with the flag set to N.

Yeah, that's probably it - try removing it and reboot.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: max rbd devices

2015-07-28 Thread Ilya Dryomov
On Tue, Jul 28, 2015 at 5:35 PM, Wyllys Ingersoll
wyllys.ingers...@keepertech.com wrote:
 OK, I misunderstood. It is not loaded by hand. We have permanent
 mappings listed in /etc/ceph/rbdmap which then get mapped
 automatically when the system boots.  So, in that case, shouldn't
 single_major be set to Y, since it essentially is just calling rbd map
 from an init script?

Let's step back a second.  What is the output of

$ cat /sys/module/rbd/parameters/single_major

right now?

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

