Re: Reply: Reboot blocked when undoing unmap op.
On Mon, Jan 4, 2016 at 10:51 AM, Wukongming wrote:
> Hi, Ilya,
>
> It is an old problem. When you say "when you issue a reboot, daemons get
> killed and the kernel client ends up waiting for them to come back,
> because of outstanding writes issued by umount called by systemd (or
> whatever)."
>
> Do you mean that if umount rbd succeeds, the kernel client process will
> stop waiting? What kind of communication mechanism exists between
> libceph and the daemons (or ceph userspace)?

If you umount the filesystem on top of rbd and unmap the rbd image,
there won't be anything to wait for. In fact, if there aren't any other
rbd images mapped, libceph will clean up after itself and exit.

If you umount the filesystem on top of rbd but don't unmap the image,
libceph will remain there, along with some amount of communication
(keepalive messages, watch requests, etc). However, all of that is
internal and is unlikely to block reboot.

If you don't umount the filesystem, your init system will try to umount
it, issuing FS requests to the rbd device. We don't want to drop those
requests, so, if the daemons are gone by then, libceph ends up blocking.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
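The teardown order described above can be sketched as a script. The device and mount point below are hypothetical, and the commands are echoed rather than executed so the ordering can be inspected without a real cluster:

```shell
#!/bin/sh
# Hypothetical paths; adjust for your system.
DEV=${DEV:-/dev/rbd0}
MNT=${MNT:-/mnt/rbd}

# Detach the filesystem first, then unmap the image; once nothing is
# mapped, libceph cleans up after itself and reboot has nothing to wait
# for. Drop the echo prefixes to actually run the commands.
teardown() {
    echo "umount $MNT"
    echo "rbd unmap $DEV"
}

teardown
```

Running the umount before the unmap is the whole point: reversing the order leaves FS requests with nowhere to go.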
Re: puzzling disappearance of /dev/sdc1
On Fri, Dec 18, 2015 at 1:38 PM, Loic Dachary wrote:
> Hi Ilya,
>
> It turns out that sgdisk 0.8.6 -i 2 /dev/vdb removes partitions and
> re-adds them on CentOS 7 with a 3.10.0-229.11.1.el7 kernel, in the same
> way partprobe does. It is used intensively by ceph-disk and inevitably
> leads to races where a device temporarily disappears. The same command
> (sgdisk 0.8.8) on Ubuntu 14.04 with a 3.13.0-62-generic kernel only
> generates two udev change events and does not remove / add partitions.
> The source code between sgdisk 0.8.6 and sgdisk 0.8.8 did not change in
> a significant way and the output of strace -e ioctl sgdisk -i 2
> /dev/vdb is identical in both environments.
>
> ioctl(3, BLKGETSIZE, 20971520) = 0
> ioctl(3, BLKGETSIZE64, 10737418240) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
> ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
> ioctl(3, BLKGETSIZE, 20971520) = 0
> ioctl(3, BLKGETSIZE64, 10737418240) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, BLKGETSIZE, 20971520) = 0
> ioctl(3, BLKGETSIZE64, 10737418240) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, BLKSSZGET, 512) = 0
> ioctl(3, BLKSSZGET, 512) = 0
>
> This leads me to the conclusion that the difference is in how the
> kernel reacts to these ioctls.

I'm pretty sure it's not the kernel versions that matter here, but
systemd versions. Those are all get-property ioctls, and I don't think
sgdisk -i does anything with the partition table.
What it probably does, though, is open the disk for write for some
reason. When it closes it, udevd (the systemd-udevd process) picks that
close up via inotify and issues the BLKRRPART ioctl, instructing the
kernel to re-read the partition table. Technically, that's different
from what partprobe does, but it still generates those udev events you
are seeing in the monitor. AFAICT udevd started doing this in v214.

Thanks,

                Ilya
Re: puzzling disappearance of /dev/sdc1
On Thu, Dec 17, 2015 at 3:10 PM, Loic Dachary wrote:
> Hi Sage,
>
> On 17/12/2015 14:31, Sage Weil wrote:
>> On Thu, 17 Dec 2015, Loic Dachary wrote:
>>> Hi Ilya,
>>>
>>> This is another puzzling behavior (the log of all commands is at
>>> http://tracker.ceph.com/issues/14094#note-4). In a nutshell, after a
>>> series of sgdisk -i commands to examine various devices including
>>> /dev/sdc1, the /dev/sdc1 file disappears (and I think it will show up
>>> again, although I don't have definitive proof of this).
>>>
>>> It looks like a side effect of a previous partprobe command, the only
>>> command I can think of that removes / re-adds devices. I thought
>>> calling udevadm settle after running partprobe would be enough to
>>> ensure partprobe completed (and since it takes as much as 2mn30 to
>>> return, I would be shocked if it does not ;-).

Yeah, IIRC partprobe goes through every slot in the partition table,
trying to first remove and then add the partition back. But, I don't
see any mention of partprobe in the log you referred to.

Should udevadm settle for a few vd* devices be taking that much time?
I'd investigate that regardless of the issue at hand.

>>> Any idea? I desperately try to find a consistent behavior, something
>>> reliable that we could use to say: "wait for the partition table to
>>> be up to date in the kernel and all udev events generated by the
>>> partition table update to complete".
>>
>> I wonder if the underlying issue is that we shouldn't be calling
>> udevadm settle from something running from udev. Instead, if a
>> udev-triggered run of ceph-disk does something that changes the
>> partitions, it should just exit and let udevadm run ceph-disk again
>> on the new devices...?
>
> Unless I missed something this is on CentOS 7 and ceph-disk is only
> called from udev as ceph-disk trigger, which does nothing else but
> asynchronously delegate the work to systemd.
> Therefore there is no udevadm settle from within udev (which would
> deadlock and timeout every time... I hope ;-).

That's a sure lockup, until one of them times out.

How are you delegating to systemd? Is it to avoid long-running udev
events? I'm probably missing something - udevadm settle wouldn't block
on anything other than udev, so if you are shipping work off to
somewhere else, udev can't be relied upon for waiting.

Thanks,

                Ilya
Re: understanding partprobe failure
On Thu, Dec 17, 2015 at 1:19 PM, Loic Dachary wrote:
> Hi Ilya,
>
> I'm seeing a partprobe failure right after a disk was zapped with
> sgdisk --clear --mbrtogpt -- /dev/vdb:
>
> partprobe /dev/vdb failed: Error: Partition(s) 1 on /dev/vdb have been
> written, but we have been unable to inform the kernel of the change,
> probably because it/they are in use. As a result, the old partition(s)
> will remain in use. You should reboot now before making further
> changes.
>
> Waiting 60 seconds (see the log below) and trying again succeeds. The
> partprobe call is guarded by udevadm settle to prevent udev actions
> from racing, and nothing else goes on in the machine.
>
> Any idea how that could happen?
>
> Cheers
>
> 2015-12-17 11:46:10,356.356 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:DEBUG:ceph-disk:get_dm_uuid /dev/vdb uuid path is /sys/dev/block/253:16/dm/uuid
> 2015-12-17 11:46:10,357.357 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:DEBUG:ceph-disk:Zapping partition table on /dev/vdb
> 2015-12-17 11:46:10,358.358 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:INFO:ceph-disk:Running command: /usr/sbin/sgdisk --zap-all -- /dev/vdb
> 2015-12-17 11:46:10,365.365 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Caution: invalid backup GPT header, but valid main header; regenerating
> 2015-12-17 11:46:10,366.366 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:backup header from main header.
> 2015-12-17 11:46:10,366.366 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
> 2015-12-17 11:46:10,366.366 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Warning! Main and backup partition tables differ! Use the 'c' and 'e' options
> 2015-12-17 11:46:10,367.367 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:on the recovery & transformation menu to examine the two tables.
> 2015-12-17 11:46:10,367.367 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
> 2015-12-17 11:46:10,367.367 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Warning! One or more CRCs don't match. You should repair the disk!
> 2015-12-17 11:46:10,368.368 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
> 2015-12-17 11:46:11,413.413 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
> 2015-12-17 11:46:11,414.414 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
> 2015-12-17 11:46:11,414.414 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:verification and recovery are STRONGLY recommended.
> 2015-12-17 11:46:11,414.414 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:
> 2015-12-17 11:46:11,415.415 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Warning: The kernel is still using the old partition table.
> 2015-12-17 11:46:11,415.415 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:The new table will be used at the next reboot.
> 2015-12-17 11:46:11,416.416 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:GPT data structures destroyed! You may now partition the disk using fdisk or
> 2015-12-17 11:46:11,416.416 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:other utilities.
> 2015-12-17 11:46:11,416.416 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:INFO:ceph-disk:Running command: /usr/sbin/sgdisk --clear --mbrtogpt -- /dev/vdb
> 2015-12-17 11:46:12,504.504 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Creating new GPT entries.
> 2015-12-17 11:46:12,505.505 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:Warning: The kernel is still using the old partition table.
> 2015-12-17 11:46:12,505.505 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:The new table will be used at the next reboot.
> 2015-12-17 11:46:12,505.505 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:The operation has completed successfully.
> 2015-12-17 11:46:12,506.506 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:DEBUG:ceph-disk:Calling partprobe on zapped device /dev/vdb
> 2015-12-17 11:46:12,507.507 INFO:tasks.workunit.client.0.target167114233028.stderr:DEBUG:CephDisk:INFO:ceph-disk:Running command: /usr/bin/udevadm settle --timeout=600
> 2015-12-17 11:46:15,427.427 >
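Since the log shows the same partprobe succeeding after a 60-second wait, one pragmatic workaround (not something this thread prescribes, just a sketch) is to settle udev first and retry the call a few times:

```shell
#!/bin/sh
# Generic retry helper: run "$@" up to $1 times, sleeping between
# attempts. The failure above is transient - it clears once whatever
# still holds the device open (udev event handling) lets go.
retry() {
    tries=$1; shift
    n=0
    until "$@"; do
        n=$((n + 1))
        if [ "$n" -ge "$tries" ]; then
            return 1
        fi
        sleep 1
    done
}

# Hypothetical usage against a freshly zapped disk:
#   udevadm settle --timeout=600
#   retry 5 partprobe /dev/vdb
```

The settle-then-retry shape matters: settling first drains queued events, and the retries absorb whatever races remain.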
Re: Rbd map failure in 3.16.0-55
On Sat, Dec 12, 2015 at 7:56 AM, Varada Kari wrote:
> Hi all,
>
> We are working on the jewel branch on a test cluster to validate some
> of the fixes, but ran into the following error when mapping an image
> using krbd on Ubuntu 14.04.2 with the 3.16.0-55 kernel:
>
> $ sudo rbd map -p pool1 rbd1
> rbd: sysfs write failed
> rbd: map failed: (5) Input/output error
>
> $ uname -a
> Linux 3.16.0-55-generic #74~14.04.1-Ubuntu SMP Tue Nov 17 10:15:59 UTC
> 2015 x86_64 x86_64 x86_64 GNU/Linux
>
> $ dmesg
> [11082.199006] libceph: read_partial_message bad hdr crc 2112154322 !=
> expected 0
> [11082.209414] libceph: mon0 x.x.x.x:6789 socket error on read
> [11092.238317] libceph: read_partial_message bad hdr crc 2112154322 !=
> expected 0
> [11092.248982] libceph: mon0 x.x.x.x:6789 socket error on read
> ...
>
> When I looked at the ceph_msg_header on the 3.16 kernel
> (http://lxr.free-electrons.com/source/include/linux/ceph/msgr.h?v=3.16),
> there is one field changed from the jewel branch:
>
> 145 struct ceph_msg_header {
> ...
> 159
> 160         /* oldest code we think can decode this. unknown if zero. */
> 161         __le16 compat_version;  <<< new one, present from 3.19 onwards
> 162         __le16 reserved;
> 163         __le32 crc;     /* header crc32c */
> 164 } __attribute__ ((packed));

That's not a problem - those 16 bits used to be reserved.

> How can I map the image on a 3.16 kernel, apart from upgrading the
> kernel? Do we have any branch where I can build the modules with the
> latest changes?

I'm guessing you have ms_crc_header set to false - the kernel client
always checks header (and middle) checksums, hence the mismatch.

Thanks,

                Ilya
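If the cluster really was configured with header CRCs off, a ceph.conf fragment along these lines (a sketch; option names as in jewel-era userspace, verify against your release) would keep kernel clients working while still skipping data CRCs:

```ini
[global]
# Kernel clients always verify header (and middle) CRCs, so leave
# header CRCs enabled; data CRCs can be disabled independently.
ms crc header = true
ms crc data = false
```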
Re: Rbd map failure in 3.16.0-55
On Sat, Dec 12, 2015 at 6:42 PM, Somnath Roy wrote:
> Ilya,
> If we map with 'nocrc' would that help?

No, it will disable data crcs; header and middle crcs will still be
checked. The header/data separation in userspace is fairly new - if
that's something you care about, I can port it for 4.5, it should be
trivial enough.

Thanks,

                Ilya
Re: CEPH_MAX_OID_NAME_LEN in rbd kernel driver
On Fri, Dec 11, 2015 at 3:48 PM, Jean-Tiare Le Bigot wrote:
> Hi,
>
> I hit a use case where rbd was failing to map an image because of the
> name length.
>
> dmesg output:
>
> WARNING: CPU: 0 PID: 20851 at include/linux/ceph/osdmap.h:97
> rbd_osd_req_create.isra.29+0xdd/0x140()
>
> ceph_oid_set_name
> 'rbd_id.docker:devel:ib8k2cxvci9sh7b4ly8wwvpyjf6kjoyuero1k25zal3ypc916i04x78tbmn60gjm6pyp38lmjq:latest'
> len 101 vs 100, truncating
>
> It turns out the maximum ceph object name length is controlled by the
> constant CEPH_MAX_OID_NAME_LEN. I naively tried to increase it to 500,
> which appears to work around (fix?) the issue.
>
> diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
> index 49ff69f..cf34bb0 100644
> --- a/include/linux/ceph/osdmap.h
> +++ b/include/linux/ceph/osdmap.h
> @@ -50,7 +50,7 @@ struct ceph_object_locator {
>   *
>   * (probably outdated: must be >= RBD_MAX_MD_NAME_LEN -- currently 100)
>   */
> -#define CEPH_MAX_OID_NAME_LEN 100
> +#define CEPH_MAX_OID_NAME_LEN 500
>
>  struct ceph_object_id {
>         char name[CEPH_MAX_OID_NAME_LEN];
>
> What is the impact of such a change? Should I submit a proper patch?
> And, last but not least, shouldn't it return an error instead of
> truncating the name and failing because, obviously, the object does not
> exist? Worse, it could exist with different data, which would be an
> interesting vector for injecting data into a third-party image...

The impact is heavier allocations - each request has two of those, and
the encoding buffer is also CEPH_MAX_OID_NAME_LEN, not the actual name
length. I was actually considering lowering it, but I didn't consider
image names from hell like that. librbd has #define
RBD_MAX_OBJ_NAME_SIZE 96, but it's not enforced.

Is that ID between devel: and :latest fixed-size in the docker world?
I.e. would bumping it to something less than 500 do?

I don't think injecting data is a valid concern - if you can create an
image in a pool, you can write to it, and if you can write to it...
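For reference, the reported name really is one byte over the limit; a quick shell check of the length (this only reproduces the arithmetic in the warning, it is not a fix):

```shell
#!/bin/sh
# The object name from the dmesg warning above: rbd_id.<image-name>.
OID='rbd_id.docker:devel:ib8k2cxvci9sh7b4ly8wwvpyjf6kjoyuero1k25zal3ypc916i04x78tbmn60gjm6pyp38lmjq:latest'

# ${#OID} is the name length; CEPH_MAX_OID_NAME_LEN is 100, so this
# name is truncated by one byte.
echo ${#OID}    # -> 101
```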
It's better to return an error, of course.

Thanks,

                Ilya
Re: Kernel RBD hang on OSD Failure
On Fri, Dec 11, 2015 at 1:37 AM, Matt Conner wrote:
> Hi Ilya,
>
> I had already recovered but I managed to recreate the problem again. I ran

How did you recover?

> the commands against rbd_data.f54f9422698a8. which was one of those
> listed in osdc this time. We have 2048 PGs in the pool so the list is
> long.

That's not going to work - I need it in a consistent "stuck" state so
I can match the outputs. I understand you can't keep it that way for
a long time, so can you please reproduce it and email me off list with
osdmap, osdc and ceph -s right away? You have a lot of OSDs, but it
also doesn't seem to take a long time to recreate. Bump debug ms to
1 on all OSDs.

> As for when I fetched the object using rados, it grabbed it without
> issue.
>
> osd map:
> osdmap e6247 pool 'NAS-ic2gw01' (3) object
> 'rbd_data.f54f9422698a8.' -> pg 3.cac46c43 (3.443) -> up
> ([33,56], p33) acting ([33,56], p33)
>
> osdmaptool:
> pool 3 pg_num 2048
> 3.0 [23,138] 23

If it's long, better to compress and attach.

Thanks,

                Ilya
Re: Kernel RBD hang on OSD Failure
On Fri, Dec 11, 2015 at 12:16 AM, Matt Conner wrote:
> Hi Ilya,
>
> Took me a little time to reproduce, but I've once again got into that
> state. In the past I had reproduced the issue with a single OSD
> failure, but in this case I failed an entire server. Ignore the
> "noout" flag as I set it to prevent rebalance during my attempts to
> reproduce the issue.
>
> osdmap:
>
> epoch 6191
> flags
> pool 0 pg_num 64 (63) read_tier -1 write_tier -1
> pool 1 pg_num 64 (63) read_tier -1 write_tier -1
> pool 2 pg_num 64 (63) read_tier -1 write_tier -1
> pool 3 pg_num 2048 (2047) read_tier -1 write_tier -1
> pool 5 pg_num 2048 (2047) read_tier -1 write_tier -1
> pool 6 pg_num 2048 (2047) read_tier -1 write_tier -1
> pool 7 pg_num 2048 (2047) read_tier -1 write_tier -1
> osd0 10.10.30.55:6830 100% (exists, up) 100%
> osd1 10.10.30.53:6809 100% (exists, up) 100%
> osd2 10.10.30.54:6839 100% (exists, up) 100%
> osd3 10.10.30.56:6833 100% (exists, up) 100%
> osd4 10.10.30.55:6848 100% (exists, up) 100%
> osd5 10.10.30.53:6854 100% (exists, up) 100%
> osd6 10.10.30.56:6818 100% (exists, up) 100%
> osd7 10.10.30.54:6830 100% (exists, up) 100%
> osd8 10.10.30.55:6839 100% (exists, up) 100%
> osd9 10.10.30.56:6845 100% (exists, up) 100%
> osd10 10.10.30.53:6860 100% (exists, up) 100%
> osd11 10.10.30.54:6842 100% (exists, up) 100%
> osd12 10.10.30.55:6836 100% (exists, up) 100%
> osd13 10.10.30.56:6863 100% (exists, up) 100%
> osd14 10.10.30.53:6839 100% (exists, up) 100%
> osd15 10.10.30.54:6866 100% (exists, up) 100%
> osd16 10.10.30.55:6869 100% (exists, up) 100%
> osd17 10.10.30.56:6812 100% (exists, up) 100%
> osd18 10.10.30.53:6824 100% (exists, up) 100%
> osd19 10.10.30.54:6818 100% (exists, up) 100%
> osd20 10.10.30.55:6833 100% (exists, up) 100%
> osd21 10.10.30.56:6869 100% (exists, up) 100%
> osd22 10.10.30.53:6857 100% (exists, up) 100%
> osd23 10.10.30.54:6833 100% (exists, up) 100%
> osd24 10.10.30.55:6803 100% (exists, up) 100%
> osd25 10.10.30.56:6821 100% (exists, up) 100%
> osd26 10.10.30.54:6848 100% (exists, up) 100%
> osd27 10.10.30.53:6848 100% (exists, up) 100%
> osd28 10.10.30.55:6863 100% (exists, up) 100%
> osd29 10.10.30.56:6827 100% (exists, up) 100%
> osd30 10.10.30.54:6836 100% (exists, up) 100%
> osd31 10.10.30.53:6863 100% (exists, up) 100%
> osd32 10.10.30.55:6845 100% (exists, up) 100%
> osd33 10.10.30.56:6839 100% (exists, up) 100%
> osd34 10.10.30.54:6851 100% (exists, up) 100%
> osd35 10.10.30.53:6845 100% (exists, up) 100%
> osd36 10.10.30.55:6854 100% (exists, up) 100%
> osd37 10.10.30.56:6851 100% (exists, up) 100%
> osd38 10.10.30.54:6806 100% (exists, up) 100%
> osd39 10.10.30.53:6824 0% (doesn't exist) 100%
> osd40 10.10.30.55:6815 100% (exists, up) 100%
> osd41 10.10.30.56:6842 100% (exists, up) 100%
> osd42 10.10.30.54:6863 100% (exists, up) 100%
> osd43 10.10.30.53:6806 100% (exists, up) 100%
> osd44 10.10.30.55:6821 100% (exists, up) 100%
> osd45 10.10.30.56:6803 100% (exists, up) 100%
> osd46 10.10.30.54:6812 100% (exists, up) 100%
> osd47 10.10.30.53:6866 100% (exists, up) 100%
> osd48 10.10.30.55:6812 100% (exists, up) 100%
> osd49 10.10.30.56:6815 100% (exists, up) 100%
> osd50 10.10.30.53:6842 100% (exists, up) 100%
> osd51 10.10.30.54:6854 100% (exists, up) 100%
> osd52 10.10.30.55:6842 100% (exists, up) 100%
> osd53 10.10.30.56:6836 100% (exists, up) 100%
> osd54 10.10.30.53:6803 100% (exists, up) 100%
> osd55 10.10.30.54:6869 100% (exists, up) 100%
> osd56 10.10.30.55:6866 100% (exists, up) 100%
> osd57 10.10.30.56:6860 100% (exists, up) 100%
> osd58 10.10.30.53:6800 100% (exists, up) 100%
> osd59 10.10.30.54:6803 100% (exists, up) 100%
> osd60 10.10.30.55:6824 100% (exists, up) 100%
> osd61 10.10.30.56:6804 100% (exists, up) 100%
> osd62 10.10.30.53:6815 100% (exists, up) 100%
> osd63 10.10.30.54:6857 100% (exists, up) 100%
> osd64 10.10.30.55:6800 100% (exists, up) 100%
> osd65 10.10.30.56:6866 100% (exists, up) 100%
> osd66 10.10.30.53:6836 100% (exists, up) 100%
> osd67 10.10.30.54:6845 100% (exists, up) 100%
> osd68 10.10.30.55:6860 100% (exists, up) 100%
> osd69 10.10.30.56:6824 100% (exists, up) 100%
> osd70 10.10.30.53:6821 100% (exists, up) 100%
> osd71 10.10.30.54:6809 100% (exists, up) 100%
> osd72 10.10.30.55:6806 100% (exists, up) 100%
> osd73 10.10.30.56:6809 100% (exists, up) 100%
> osd74 10.10.30.53:6818 100% (exists, up) 100%
> osd75 10.10.30.54:6815 100% (exists, up) 100%
> osd76 10.10.30.55:6827 100% (exists, up) 100%
> osd77 10.10.30.56:6830 100% (exists, up) 100%
> osd78 10.10.30.53:6833 100% (exists, up) 100%
> osd79 10.10.30.54:6800 100% (exists, up) 100%
> osd80 10.10.30.55:6851 100% (exists, up) 100%
> osd81 10.10.30.56:6800 100% (exists, up) 100%
> osd82 10.10.30.53:6812 100% (exists, up) 100%
> osd83 10.10.30.54:6821 100% (exists, up) 100%
> osd84 10.10.30.55:6857 100% (exists, up) 100%
> osd85 10.10.30.56:6848 100% (exists, up) 100%
> osd86 10.10.30.53:6830 100% (exists, up) 100%
> osd87 10.10.30.54:6827 100%
Re: [ceph-users] Kernel RBD hang on OSD Failure
On Tue, Dec 8, 2015 at 10:57 AM, Tom Christensen wrote:
> We aren't running NFS, but regularly use the kernel driver to map RBDs
> and mount filesystems in same. We see very similar behavior across
> nearly all kernel versions we've tried. In my experience only very few
> versions of the kernel driver survive any sort of crush map
> change/update while something is mapped. In fact, in the last 2 years
> I think I've only seen this work on 1 kernel version; unfortunately
> it's badly out of date and we can't run it in our environment anymore.
> I think it was a 3.0 kernel running on ubuntu 12.04. We have just
> recently started trying to find a kernel that will survive OSD outages
> or changes to the cluster. We're on ubuntu 14.04, and have tried 3.16,
> 3.19.0-25, 4.3, and 4.2 without success in the last week. We only map
> 1-3 RBDs per client machine at a time, but we regularly get processes
> stuck in D state which are accessing the filesystem inside the RBD,
> and have to hard reboot the RBD client machine. This is always
> associated with a cluster change in some way: reweighting OSDs,
> rebooting an OSD host, restarting an individual OSD, adding OSDs, and
> removing OSDs all cause the kernel client to hang. If no change is
> made to the cluster, the kernel client will be happy for weeks.

There are a couple of known bugs in the remap/resubmit area, but those
are supposedly corner cases (like *all* the OSDs going down and then
back up, etc). I had no idea it was that severe and goes that far back.
Apparently triggering it requires a heavier load, as we've never seen
anything like that in our tests.

For unrelated reasons, the remap/resubmit code is getting entirely
rewritten for kernel 4.5, so, if you've been dealing with this issue
for the last two years (I don't remember seeing any tickets listing
that many kernel versions and not mentioning NFS), I'm afraid the best
course of action for you would be to wait for 4.5 to come out and try
it.
If you'd be willing to test out an early version on one or more of your
client boxes, I can ping you when it's ready.

I'll take a look at 3.0 vs 3.16 with an eye on the remap code. Did you
happen to try 3.10?

It sounds like you can reproduce this pretty easily. Can you get it to
lock up and do:

# cat /sys/kernel/debug/ceph/*/osdmap
# cat /sys/kernel/debug/ceph/*/osdc
$ ceph status

a bunch of times? I have a hunch that the kernel client simply fails to
request enough new osdmaps after the cluster topology changes under
load.

Thanks,

                Ilya
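A small helper along these lines (a hypothetical script, parameterized so its output can be checked without a live client) gathers the requested debugfs files in one pass; `ceph status` is left to run separately since it needs cluster credentials:

```shell
#!/bin/sh
# Dump the libceph debugfs state for every client instance. $1
# optionally overrides the debugfs root, which also makes the function
# easy to unit-test against a fake directory tree.
collect() {
    root=${1:-/sys/kernel/debug/ceph}
    for f in "$root"/*/osdmap "$root"/*/osdc; do
        if [ -e "$f" ]; then
            echo "== $f"
            cat "$f"
        fi
    done
}

collect
```

Running it in a loop with a `sleep` between iterations, as suggested above, captures how (or whether) the client's osdmap epoch advances while the cluster topology changes.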
Re: Kernel RBD hang on OSD Failure
On Mon, Dec 7, 2015 at 9:56 PM, Matt Conner wrote:
> Hi,
>
> We have a Ceph cluster in which we have been having issues with RBD
> clients hanging when an OSD failure occurs. We are using a NAS gateway
> server which maps RBD images to filesystems and serves the filesystems
> out via NFS. The gateway server has close to 180 NFS clients, and
> almost every time even 1 OSD goes down during heavy load, the NFS
> exports lock up and the clients are unable to access the NAS share via
> NFS. When the OSD fails, Ceph recovers without issue, but the gateway
> kernel RBD module appears to get stuck waiting on the now failed OSD.
> Note that this works correctly under lighter loads.
>
> From what we have been able to determine, the NFS server daemon hangs
> waiting for I/O from the OSD that went out and never recovers.
> Similarly, attempting to access files from the exported FS locally on
> the gateway server will result in a similar hang. We also noticed that
> Ceph health details will continue to report blocked I/O on the now
> down OSD until either the OSD is recovered or the gateway server is
> rebooted. Based on a few kernel logs from NFS and PVS, we were able to
> trace the problem to the RBD kernel module.
>
> Unfortunately, the only way we have been able to recover our gateway
> is by hard rebooting the server.
>
> Has anyone else encountered this issue and/or have a possible
> solution? Are there suggestions for getting more detailed debugging
> information from the RBD kernel module?

Dumping osdc, osdmap and ceph status when it gets stuck would be
a start:

# cat /sys/kernel/debug/ceph/*/osdmap
# cat /sys/kernel/debug/ceph/*/osdc
$ ceph status

Thanks,

                Ilya
Re: testing the /dev/cciss/c0d0 device names
On Sat, Dec 5, 2015 at 7:36 PM, Loic Dachary wrote:
> Hi Ilya,
>
> ceph-disk has special handling for device names like /dev/cciss/c0d1
> [1] and it was partially broken when support for device mapper was
> introduced. Ideally there would be a way to test that support when
> running the ceph-disk suite [2]. Do you know of a way to do that
> without having the hardware for which this driver is designed?
>
> Maybe this convention (/dev/cciss/c0d0 being mapped to
> /sys/block/cciss!c0d0) is not unique to this driver and I could use
> another to validate that the name conversion from X/Y to X!Y and vice
> versa is handled as it should be?

No, it's not unique. The driver core does strreplace(s, '/', '!') at
register time to work around such block devices. The list includes
DAC960, aoeblk, cciss, cpqarray, sx8 and probably more, but I don't
think anything widespread uses this naming scheme. IIRC dm actually
won't let you name a device with anything that contains a slash.

If you really wanted to, you could set up aoeblk, I guess, but a simple
unit test should be more than enough ;)

Thanks,

                Ilya
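The conversion Loic wants to validate is purely mechanical, so the suggested unit test can be sketched in a few lines of shell (the function names are mine, not ceph-disk's):

```shell
#!/bin/sh
# The driver core replaces '/' with '!' when registering a block
# device, so /dev/cciss/c0d0 shows up as /sys/block/cciss!c0d0; the
# reverse mapping gets you from a sysfs name back to the dev node.
dev_to_sysfs() { printf '%s\n' "$1" | tr '/' '!'; }
sysfs_to_dev() { printf '%s\n' "$1" | tr '!' '/'; }

dev_to_sysfs 'cciss/c0d0'    # -> cciss!c0d0
sysfs_to_dev 'cciss!c0d0'    # -> cciss/c0d0
```

Because the mapping is a simple character substitution, the same test covers any driver using the scheme (cpqarray, sx8, etc.) without needing the hardware.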
[PATCH] rbd: don't put snap_context twice in rbd_queue_workfn()
Commit 4e752f0ab0e8 ("rbd: access snapshot context and mapping size
safely") moved ceph_get_snap_context() out of rbd_img_request_create()
and into rbd_queue_workfn(), adding a ceph_put_snap_context() to the
error path in rbd_queue_workfn(). However, rbd_img_request_create()
consumes a ref on snapc, so calling ceph_put_snap_context() after
a successful rbd_img_request_create() leads to an extra put. Fix it.

Cc: sta...@vger.kernel.org # 3.18+
Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 drivers/block/rbd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 24a757e97d56..4a876785b68c 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3442,6 +3442,7 @@ static void rbd_queue_workfn(struct work_struct *work)
 		goto err_rq;
 	}
 	img_request->rq = rq;
+	snapc = NULL; /* img_request consumes a ref */
 
 	if (op_type == OBJ_OP_DISCARD)
 		result = rbd_img_request_fill(img_request, OBJ_REQUEST_NODATA,
-- 
2.4.3
Re: block-rbd: One function call less in rbd_dev_probe_parent() after error detection
On Thu, Nov 26, 2015 at 8:54 AM, SF Markus Elfring wrote:
>>> I interpreted the eventual passing of a null pointer to the
>>> rbd_dev_destroy() function as an indication for further source code
>>> adjustments.
>>
>> If all error paths could be adjusted so that NULL pointers are never
>> passed in, destroy functions wouldn't need to have a NULL check,
>> would they?
>
> How do you think about clarifying the corresponding implementation
> details a bit more?
>
> * Why was the function "rbd_dev_probe_parent" implemented in a way
>   that relies on a sanity check in the function "rbd_dev_destroy"?

Because it's not a bad thing? What's wrong with an init to NULL,
a possible assignment, in this case from rbd_dev_create(), and an
unconditional rbd_dev_destroy()?

The NULL check in rbd_dev_destroy() is not a sanity check, it's
a feature. It's not there to "fixup" callers that pass NULL - it's
there because it is _expected_ that some callers will pass NULL.

> * What are the chances to restructure the source code a bit (like
>   changing a few jump labels) so that it would also work without an
>   extra function call during error handling there?

As I said in my reply to Dan, the problem with rbd_dev_probe_parent()
is the calling code, which expects it to call unparent if
->parent_spec. This makes it stand out and confuses people, but can't
be fixed without refactoring a bunch of other code. The extra function
call is *not* a problem.

Thanks,

                Ilya
Re: [PATCH 2/2] block-rbd: One function call less in rbd_dev_probe_parent() after error detection
On Wed, Nov 25, 2015 at 12:55 PM, Dan Carpenter <dan.carpen...@oracle.com> wrote: > On Tue, Nov 24, 2015 at 09:21:06PM +0100, Ilya Dryomov wrote: >> >> Cleanup here is (and should be) done in reverse order. >> > > > Yes. This is true. > >> > I have got an other impression about the appropriate order for the >> > corresponding >> > clean-up function calls. >> > >> > >> >> We allocate parent rbd_device and then link it with what we already have, >> > >> > I guess that we have got a different understanding about the relevant >> > "linking". >> >> Well, there isn't any _literal_ linking (e.g. adding to a link list, >> etc) in this case. We just bump some refs and do probe to fill in the >> newly allocated parent. If probe fails, we put refs and free parent, >> reversing the "alloc parent, bump refs" order. >> >> The actual linking (rbd_dev->parent = parent) is done right before >> returning so we never have to undo it in rbd_dev_probe_parent() and >> that's the only reason your patch probably doesn't break anything. >> Think about what happens if, after your patch is applied, someone moves >> that assignment up or adds an extra step that can fail after it... >> > > The problem is that the unwind code should be a mirror of the allocate > code but rbd_dev_unparent() doesn't mirror anything. Generally, writing > future proof stubs like this is a wrong thing because predicting the > future is hard and in the mean time we are left stubs which confuse > everyone. It's not a future proof stub. It's just some crufty code that was fixed over time to not leak things. I won't defend it - it is confusing and could definitely be improved - but that can't be done without refactoring a fair bunch of calling code. A patch changing rbd_dev_probe_parent() alone just won't do it. > >> If all error paths could be adjusted so that NULL pointers are never >> passed in, destroy functions wouldn't need to have a NULL check, would >> they? > > Yep. We agree on the right way to do it. 
I am probably the number one > kernel developer for removing the most sanity checks. :P (As opposed > to patch 1/1 where we now rely on the sanity check inside > rbd_dev_destroy().) > > drivers/block/rbd.c > 5149 static int rbd_dev_probe_parent(struct rbd_device *rbd_dev, int depth) > 5150 { > 5151 struct rbd_device *parent = NULL; > 5152 int ret; > 5153 > 5154 if (!rbd_dev->parent_spec) > 5155 return 0; > 5156 > 5157 if (++depth > RBD_MAX_PARENT_CHAIN_LEN) { > 5158 pr_info("parent chain is too long (%d)\n", depth); > 5159 ret = -EINVAL; > 5160 goto out_err; > > We haven't allocated anything so this should just be return -EINVAL; > In the original code, we decrement the kref count on ->parent_spec on > this error path so that is a classic One Err Bug. The caller expects rbd_dev->parent_spec to be put on any error. Notice that we return right away if !rbd_dev->parent_spec. > > 5161 } > 5162 > 5163 parent = rbd_dev_create(rbd_dev->rbd_client, > rbd_dev->parent_spec, > 5164 NULL); > 5165 if (!parent) { > 5166 ret = -ENOMEM; > 5167 goto out_err; > > Still haven't allocated anything so return -ENOMEM, but if we fail after > this point we will need to call rbd_dev_destroy(). > > 5168 } > 5169 > 5170 /* > 5171 * Images related by parent/child relationships always share > 5172 * rbd_client and spec/parent_spec, so bump their refcounts. > 5173 */ > 5174 __rbd_get_client(rbd_dev->rbd_client); > 5175 rbd_spec_get(rbd_dev->parent_spec); > > We will need to put these on any later error paths. And we do, in rbd_dev_destroy(parent), since these are references for the parent. > > 5176 > 5177 ret = rbd_dev_image_probe(parent, depth); > 5178 if (ret < 0) > 5179 goto out_err; > > Ok. We need to put the ->parent_spec, ->rbd_client and free the parent. 
> > 5180 > 5181 rbd_dev->parent = parent; > 5182 atomic_set(&rbd_dev->parent_ref, 1); > 5183 return 0; > 5184 > 5185 out_err: > 5186 rbd_dev_unparent(rbd_dev); > > This is a complicated way to say rbd_spec_put(rbd_dev->parent_spec); > > Also, is it really necessary to set ->parent_spec to NULL? If
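The setup/unwind symmetry Ilya is defending can be shown with a small userspace model. All names here (parent_create(), probe_parent(), ...) are illustrative stand-ins for the rbd.c pattern, not the actual functions:

```c
#include <assert.h>
#include <stdlib.h>

/* Set up in order "alloc, take refs, probe"; on failure, unwind in
 * exactly the reverse order before returning. */
static int refs;

struct parent {
	int probed;
};

static struct parent *parent_create(void)
{
	struct parent *p = calloc(1, sizeof(*p));

	if (p)
		refs++;		/* stands in for the ref bumps */
	return p;
}

static void parent_destroy(struct parent *p)
{
	free(p);
	refs--;			/* puts the refs taken at create time */
}

static int probe(struct parent *p, int fail)
{
	if (fail)
		return -1;
	p->probed = 1;
	return 0;
}

static int probe_parent(int fail, struct parent **out)
{
	struct parent *p;
	int ret;

	p = parent_create();		/* alloc + bump refs */
	if (!p)
		return -1;

	ret = probe(p, fail);
	if (ret < 0)
		goto out_err;

	*out = p;			/* "link" only after full success */
	return 0;

out_err:
	parent_destroy(p);		/* reverse of the setup above */
	return ret;
}
```

Because the "linking" assignment happens only on full success, the error path never has to undo it - which is the contract the rejected patch would have made fragile.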
Re: block-rbd: One function call less in rbd_dev_probe_parent() after error detection
On Tue, Nov 24, 2015 at 9:34 PM, SF Markus Elfring wrote:
>> Well, there isn't any _literal_ linking (e.g. adding to a link list,
>> etc) in this case. We just bump some refs and do probe to fill in the
>> newly allocated parent.
>
> Thanks for your clarification.
>
>
>> The actual linking (rbd_dev->parent = parent) is done right before
>> returning so we never have to undo it in rbd_dev_probe_parent() and
>> that's the only reason your patch probably doesn't break anything.
>
> Is this function implementation just also affected by an issue
> which is mentioned in the Linux document "CodingStyle" as "one err bugs"?

No, why? "one err bug" as per CodingStyle is a NULL deref on line 2 if foo is NULL. If it was just "err: kfree(foo); return ret;", a NULL foo would be perfectly OK.

1 err:
2 kfree(foo->bar);
3 kfree(foo);
4 return ret;

If you can spot such a NULL deref in rbd_dev_probe_parent(), I'd gladly take a patch.

>
>
>> Think about what happens if, after your patch is applied, someone moves
>> that assignment up or adds an extra step that can fail after it...
>
> Is such a software maintenance concern really enough to delay (or reject)
> my second update suggestion in this small patch series?

Yes - it's rejected because it messes up the order of cleanup for no good reason. I realize why you think the patch is correct and it's not without merit, but it just doesn't fit the weird rbd_dev_probe_parent() contract.

Thanks, Ilya -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
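For readers unfamiliar with the CodingStyle term, here is a hedged userspace sketch of how an error label can be written so that a NULL foo is harmless; foo/bar are the same placeholder names as in the snippet above, and make_foo() is purely illustrative:

```c
#include <assert.h>
#include <stdlib.h>

struct foo {
	char *bar;
};

/* The "one err bug" happens when a single err label dereferences a
 * pointer that may still be NULL (kfree(foo->bar) with foo == NULL).
 * The guard below avoids it; free(NULL), like kfree(NULL), is a no-op. */
static int make_foo(int fail_outer, int fail_inner, struct foo **out)
{
	struct foo *foo = NULL;
	int ret = -1;

	if (fail_outer)
		goto err;	/* foo still NULL: err path must not deref it */

	foo = malloc(sizeof(*foo));
	if (!foo)
		goto err;

	foo->bar = malloc(16);
	if (fail_inner || !foo->bar)
		goto err;

	*out = foo;
	return 0;
err:
	if (foo)		/* guard against the NULL deref on "line 2" */
		free(foo->bar);
	free(foo);		/* safe even when foo is NULL */
	return ret;
}
```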
Re: [CEPH-DEVEL] [ceph-users] occasional failure to unmap rbd
On Tue, Nov 24, 2015 at 12:12 AM, Markus Kienast <elias1...@gmail.com> wrote: > Kernel Version > elias@paris3:~$ uname -a > Linux paris3.sfe.tv 3.16.0-28-generic #38-Ubuntu SMP Sat Dec 13 > 16:13:28 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux > > Output of dmesg and /var/log/dmesg attached. > But does not show much except for one mon being down. > The mon is down for hardware reasons. > > > > On Mon, Nov 23, 2015 at 11:26 PM, Ilya Dryomov <idryo...@gmail.com> wrote: >> >> On Mon, Nov 23, 2015 at 11:03 PM, Markus Kienast <m...@trickkiste.at> wrote: >> > I am having the same issue here. >> >> Which kernel are you running? Could you attach your dmesg? >> >> > >> > root@paris3:/etc/neutron# rbd unmap /dev/rbd0 >> > rbd: failed to remove rbd device: (16) Device or resource busy >> > rbd: remove failed: (16) Device or resource busy >> > >> > root@paris3:/etc/neutron# rbd info -p volumes >> > volume-f3ab6892-f35e-4b98-8832-efbaaa2f4ca2 >> > 2015-11-23 22:42:06.842697 7f2d57e49700 0 -- :/2760503703 >> >> > 10.90.90.4:6789/0 pipe(0x1773250 sd=3 :0 s=1 pgs=0 cs=0 l=1 >> > c=0x17734e0).fault >> > rbd image 'volume-f3ab6892-f35e-4b98-8832-efbaaa2f4ca2': >> > size 500 GB in 128000 objects >> > order 22 (4096 kB objects) >> > block_name_prefix: rbd_data.1b6d9e2aaa998b >> > format: 2 >> > features: layering >> > root@paris3:/etc/neutron# rados -p volumes listwatchers >> > rbd_header.1b6d9e2aaa998b >> > 2015-11-23 22:42:58.546723 7fec94fec700 0 -- :/2519796249 >> >> > 10.90.90.4:6789/0 pipe(0x9cf260 sd=3 :0 s=1 pgs=0 cs=0 l=1 >> > c=0x9cf4f0).fault >> >> Did you root cause these faults? > > Hardware failure caused these faults. > >> >> > watcher=10.90.90.3:0/3293327848 client.8471177 cookie=1 >> > >> > root@paris3:/etc/neutron# ps ax | grep rbd >> > 7814 ?S 0:00 [jbd2/rbd0-8] >> >> Was there an ext filesystem involved? How was it umounted - do you >> have a "umount " process stuck in D state? > > Yes, all these RBDs are formatted with ext4. 
I am regularly using them > with openstack and have never had any problems. > I did "unmount " and the unmount process did actually > finish just fine. > Where can I look up, if it is stuck in "D" state? > >> >> > 11003 ?S 0:00 [jbd2/rbd1-8] >> > 14042 ?S 0:00 [jbd2/rbd2p1-8] >> > 24228 ?S 0:00 [jbd2/rbd3-8] >> > >> > root@paris3:/etc/neutron# ceph --version >> > ceph version 0.80.11 (8424145d49264624a3b0a204aedb127835161070) >> > >> > root@paris3:/etc/neutron# ls /sys/block/rbd0/holders/ >> > returns nothing >> > >> > root@paris3:/etc/neutron# fuser -amv /dev/rbd0 >> > USERPID ACCESS COMMAND >> > /dev/rbd0: >> >> What's the output of "cat /sys/bus/rbd/devices/0/client_id"? > > root@paris3:~# cat /sys/bus/rbd/devices/0/client_id > client8471177 > >> >> What's the output of "sudo cat /sys/kernel/debug/ceph/*/osdc"? > > root@paris3:~# ls -l /sys/kernel/debug/ceph/ > total 0 > drwxr-xr-x 2 root root 0 Feb 4 2015 > 32ba3117-e320-49fc-aabd-f100d5a7e94b.client7663711 > drwxr-xr-x 2 root root 0 Nov 23 11:41 > 32ba3117-e320-49fc-aabd-f100d5a7e94b.client8471177 > > root@paris3:~# cat > /sys/kernel/debug/ceph/32ba3117-e320-49fc-aabd-f100d5a7e94b.client8471177/osdc > has no output This means there are no outstanding/hung rbd I/Os. According to you, umount completed successfully, and yet there is a jbd2/rbd0-8 kthread hanging around, keeping /dev/rbd0 open and holding a ref to it. A quick search produced two similar reports: [1] https://ask.fedoraproject.org/en/question/7572/how-to-stop-kernel-ext4-journaling-thread/ [2] http://lists.openwall.net/linux-ext4/2015/10/24/11 The only difference as far I can tell is those people noticed the jbd2 thread because they wanted to run fsck, while you ran into it because you tried to do "rbd unmap". Neither mentions rbd. Look at [2], did you at any point see any similar errors in dmesg? 
> > root@paris3:~# cat > /sys/kernel/debug/ceph/32ba3117-e320-49fc-aabd-f100d5a7e94b.client7663711/osdc > hangs with no output It shouldn't hang, so it could be unrelated. Given the "Feb 4 2015" timestamp, I'm going to assume you haven't rebooted this box in a long time? If so, do you remember what happened around that date? Do you keep syslog archives? I'd be interested in seeing everything you have for Feb 3 - Feb 5. To try to figure out where it's hanging, can you do

# cat /sys/kernel/debug/ceph/32ba3117-e320-49fc-aabd-f100d5a7e94b.client7663711/osdc
<it'll hang, grab its PID from ps output>
# cat /proc/$PID/stack

Thanks, Ilya
Re: Reboot blocked when undoing unmap op.
On Fri, Nov 20, 2015 at 3:19 AM, Wukongming wrote: > Hi Sage, > > I created a rbd image, and mapped to a local which means I can find > /dev/rbd0, at this time I reboot the system, in last step of shutting down, > it blocked with an error > > [235618.0202207] libceph: connect 172.16.57.252:6789 error -101. > > My Works’ Env: > > Ubuntu kernel 3.19.0 > Ceph 0.94.5 > A cluster of 2 Servers with iscsitgt and open-iscsi, both as server and > client. Multipath process is on but not affect this issue. I’ve tried > stopping multipath, but the issue still there. > I map a rbd image to a local, why show me a connect error? > > I saw your reply on > http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/13077, but just > apart. Is this issue resolved and how? Yeah, this has been a long-standing problem with libceph/rbd. The issue is that you *have* to umount (and ideally also unmap, but unmap isn't strictly necessary) before you reboot. Otherwise (and I assume by mapped to a local you mean you've got MONs and OSDs on the same node as you do rbd map), when you issue a reboot, daemons get killed and the kernel client ends up waiting for them to come back, because of outstanding writes issued by umount called by systemd (or whatever). There are other variations of this, but it all comes down to you having to cold reboot. The right fix is to have all init systems sequence the killing of ceph daemons after the umount/unmap. I also toyed with adding a reboot notifier for libceph to save a cold reboot, but the problem with that in the general case is data integrity. However, in cases like the one I described above, there is no going back so we might as well kill libceph through a notifier. I have an incomplete patch somewhere, but it really shouldn't be necessary... Thanks, Ilya
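The reboot-notifier idea Ilya mentions can be modeled in userspace. The notifier_block type and registration function below are simplified stand-ins for the kernel's notifier-chain machinery, and libceph_reboot() is purely hypothetical - this is a sketch of the concept, not the incomplete patch:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of a notifier chain: callbacks registered ahead of
 * time, all invoked when the "reboot" event fires. */
struct notifier_block {
	int (*call)(struct notifier_block *nb, unsigned long action);
	struct notifier_block *next;
};

static struct notifier_block *reboot_chain;

static void register_reboot_nb(struct notifier_block *nb)
{
	nb->next = reboot_chain;
	reboot_chain = nb;
}

static void run_reboot_chain(unsigned long action)
{
	struct notifier_block *nb;

	for (nb = reboot_chain; nb; nb = nb->next)
		nb->call(nb, action);
}

static int libceph_alive = 1;

/* Hypothetical hook: tear down libceph state (abort outstanding
 * requests, close sessions) so shutdown can't block on dead daemons. */
static int libceph_reboot(struct notifier_block *nb, unsigned long action)
{
	(void)nb;
	(void)action;
	libceph_alive = 0;
	return 0;
}

static struct notifier_block libceph_nb = { .call = libceph_reboot };
```

As the email notes, the hard part is not the mechanism but data integrity: dropping outstanding writes is only acceptable when there is genuinely no way for the daemons to come back.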
Re: Kernel client OSD message version
On Thu, Nov 19, 2015 at 11:27 AM, Lakis, Jacek wrote: > Ilya, thank you for the quick reply. > > For example it's about split decoding. I'm asking not because of specific > changes, I'm rather curious about when we should sync the kernel client > encoding to the master (or stable versions). It's usually done when the need arises, like if the kernel client starts using a field that isn't in the old encoding. We don't update for each new ceph release. Thanks, Ilya
Re: [PATCH 2/3] net/ceph: do not define list_entry_next
On Wed, Nov 18, 2015 at 1:13 PM, Sergey Senozhatsky wrote:
> Cosmetic.
>
> Do not define list_entry_next() and use list_next_entry()
> from list.h.
>
> Signed-off-by: Sergey Senozhatsky
> ---
> net/ceph/messenger.c | 8 +++-
> 1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index 9981039..77ccca9 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -10,6 +10,7 @@
> #include
> #include
> #include
> +#include <linux/list.h>
> #ifdef CONFIG_BLOCK
> #include
> #endif /* CONFIG_BLOCK */
> @@ -23,9 +24,6 @@
> #include
> #include
>
> -#define list_entry_next(pos, member) \
> - list_entry(pos->member.next, typeof(*pos), member)
> -
> /*
> * Ceph uses the messenger to exchange ceph_msg messages with other
> * hosts in the system. The messenger provides ordered and reliable
> @@ -1042,7 +1040,7 @@ static bool ceph_msg_data_pagelist_advance(struct ceph_msg_data_cursor *cursor,
> /* Move on to the next page */
>
> BUG_ON(list_is_last(&cursor->page->lru, &pagelist->head));
> - cursor->page = list_entry_next(cursor->page, lru);
> + cursor->page = list_next_entry(cursor->page, lru);
> cursor->last_piece = cursor->resid <= PAGE_SIZE;
>
> return true;
> @@ -1166,7 +1164,7 @@ static bool ceph_msg_data_advance(struct ceph_msg_data_cursor *cursor,
> if (!cursor->resid && cursor->total_resid) {
> WARN_ON(!cursor->last_piece);
> BUG_ON(list_is_last(&cursor->data->links, cursor->data_head));
> - cursor->data = list_entry_next(cursor->data, links);
> + cursor->data = list_next_entry(cursor->data, links);
> __ceph_msg_data_cursor_init(cursor);
> new_piece = true;
> }

Someone beat you to it ;)

https://github.com/ceph/ceph-client/commit/76b4a27faebb369c1c50df01ef08b614a2854fc5

Thanks, Ilya
Re: [PATCH 2/3] libceph: use list_next_entry instead of list_entry_next
On Mon, Nov 16, 2015 at 2:46 PM, Geliang Tang wrote:
> list_next_entry has been defined in list.h, so I replace list_entry_next
> with it.
>
> Signed-off-by: Geliang Tang
> ---
> net/ceph/messenger.c | 7 ++-
> 1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index 9981039..b1d1489 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -23,9 +23,6 @@
> #include
> #include
>
> -#define list_entry_next(pos, member) \
> - list_entry(pos->member.next, typeof(*pos), member)
> -
> /*
> * Ceph uses the messenger to exchange ceph_msg messages with other
> * hosts in the system. The messenger provides ordered and reliable
> @@ -1042,7 +1039,7 @@ static bool ceph_msg_data_pagelist_advance(struct ceph_msg_data_cursor *cursor,
> /* Move on to the next page */
>
> BUG_ON(list_is_last(&cursor->page->lru, &pagelist->head));
> - cursor->page = list_entry_next(cursor->page, lru);
> + cursor->page = list_next_entry(cursor->page, lru);
> cursor->last_piece = cursor->resid <= PAGE_SIZE;
>
> return true;
> @@ -1166,7 +1163,7 @@ static bool ceph_msg_data_advance(struct ceph_msg_data_cursor *cursor,
> if (!cursor->resid && cursor->total_resid) {
> WARN_ON(!cursor->last_piece);
> BUG_ON(list_is_last(&cursor->data->links, cursor->data_head));
> - cursor->data = list_entry_next(cursor->data, links);
> + cursor->data = list_next_entry(cursor->data, links);
> __ceph_msg_data_cursor_init(cursor);
> new_piece = true;
> }

Applied.

Thanks, Ilya
request_queue use-after-free - inode_detach_wb()
Hello, Last week, while running an rbd test which does a lot of maps and unmaps (read losetup / losetup -d) with slab debugging enabled, I hit the attached splat. That 6a byte corresponds to the atomic_long_t count of the percpu_ref refcnt in request_queue::backing_dev_info::wb, pointing to a percpu_ref_put() on a freed memory. It hasn't reproduced since. After a prolonged stare at rbd (we've just fixed an rbd vs sysfs lifecycle issue, so I naturally assumed we either missed something or it had something to do with that patch) I looked wider and concluded that the most likely place a stray percpu_ref_put() could have come from was inode_detach_wb(). It's called from __destroy_inode(), which means iput(), which means bdput(). Looking at __blkdev_put(), the issue becomes clear: we are taking precautions to flush before calling out to ->release() because, at least according to the comment, ->release() can free queue; we are recording owner pointer because put_disk() may free both gendisk and queue, and then, after all that, we are calling bdput() which may touch the queue through wb_put() in inode_detach_wb(). (The fun part is wb_put() is supposed to be a noop for root wbs, but slab debugging interferes with that by poisoning wb->bdi pointer.) 1514 * dirty data before. 
1515 */
1516 bdev_write_inode(bdev);
1517 }
1518 if (bdev->bd_contains == bdev) {
1519 if (disk->fops->release)
1520 disk->fops->release(disk, mode);
1521 }
1522 if (!bdev->bd_openers) {
1523 struct module *owner = disk->fops->owner;
1524
1525 disk_put_part(bdev->bd_part);
1526 bdev->bd_part = NULL;
1527 bdev->bd_disk = NULL;
1528 if (bdev != bdev->bd_contains)
1529 victim = bdev->bd_contains;
1530 bdev->bd_contains = NULL;
1531
1532 put_disk(disk); <-- may free q
1533 module_put(owner);
1534 }
1535 mutex_unlock(&bdev->bd_mutex);
1536 bdput(bdev); <-- may touch q.backing_dev_info.wb

To reproduce, apply the attached patch (systemd-udevd condition is just a convenience: udev reacts to change events by getting the bdev which it then has to put), boot with slub_debug=,blkdev_queue and do:

$ sudo modprobe loop
$ sudo losetup /dev/loop0 foo.img
$ sudo dd if=/dev/urandom of=/dev/loop0 bs=1M count=1
$ sudo losetup -d /dev/loop0
$ sudo rmmod loop

(rmmod is key - it's the only way to get loop to do put_disk(). For rbd, it's just rbd map - dd - rbd unmap.) In the past we used to reassign to default_backing_dev_info here, but it was nuked in b83ae6d42143 ("fs: remove mapping->backing_dev_info"). Shortly after that cgroup-specific writebacks patches from Tejun got merged, adding inode::i_wb and inode_detach_wb() call. The fix seems to be to detach the inode earlier, but I'm not familiar enough with cgroups code, so sending my findings instead of a patch. Christoph, Tejun? Thanks, Ilya [18513.199040] = [18513.199459] BUG blkdev_queue (Not tainted): Poison overwritten [18513.199459] - [18513.199459] [18513.205765] Disabling lock debugging due to kernel taint [18513.205765] INFO: 0x8800659a05d8-0x8800659a05d8.
First byte 0x6a instead of 0x6b [18513.205765] INFO: Allocated in blk_alloc_queue_node+0x28/0x2c0 age=10215 cpu=1 pid=1920 [18513.205765] __slab_alloc.constprop.50+0x4d5/0x540 [18513.205765] kmem_cache_alloc+0x2ba/0x320 [18513.205765] blk_alloc_queue_node+0x28/0x2c0 [18513.205765] blk_mq_init_queue+0x20/0x60 [18513.205765] do_rbd_add.isra.23+0x833/0xd70 [18513.205765] rbd_add+0x1d/0x30 [18513.205765] bus_attr_store+0x25/0x30 [18513.205765] sysfs_kf_write+0x45/0x60 [18513.205765] kernfs_fop_write+0x141/0x190 [18513.205765] __vfs_write+0x28/0xe0 [18513.205765] vfs_write+0xa2/0x180 [18513.205765] SyS_write+0x49/0xa0 [18513.219265] entry_SYSCALL_64_fastpath+0x12/0x6f [18513.219265] INFO: Freed in blk_free_queue_rcu+0x1c/0x20 age=122 cpu=1 pid=1959 [18513.219265] __slab_free+0x148/0x290 [18513.219265] kmem_cache_free+0x2b7/0x340 [18513.219265] blk_free_queue_rcu+0x1c/0x20 [18513.219265] rcu_process_callbacks+0x2fb/0x820 [18513.219265] __do_softirq+0xd4/0x460 [18513.219265] irq_exit+0x95/0xa0 [18513.219265] smp_apic_timer_interrupt+0x42/0x50 [18513.219265] apic_timer_interrupt+0x81/0x90 [18513.219265] __slab_free+0xb5/0x290 [18513.219265] kmem_cache_free+0x2b7/0x340 [18513.219265] ptlock_free+0x19/0x20 [18513.219265] ___pte_free_tlb+0x22/0x50 [18513.219265] free_pgd_range+0x258/0x440 [18513.219265] free_pgtables+0xc4/0x120 [18513.219265] exit_mmap+0xc3/0x130 [18513.219265] mmput+0x3d/0xf0 [18513.219265] INFO: Slab 0xea0001966800 objects=9 used=9
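The ordering hazard described above can be reduced to a toy refcount model: the bdev inode effectively holds a reference into the queue (via the wb), so the queue must not reach refcount zero in put_disk() while that reference is still live. All names below are illustrative stand-ins, not kernel API:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model: a queue whose last put frees it, plus an inode that
 * keeps a reference into the queue (standing in for inode->i_wb). */
struct queue {
	int refs;
	int freed;
};

static void queue_get(struct queue *q)
{
	q->refs++;
}

static void queue_put(struct queue *q)
{
	if (--q->refs == 0)
		q->freed = 1;	/* stands in for blk_free_queue_rcu() */
}

struct bdev_inode {
	struct queue *wb_ref;
};

/* The "detach the inode earlier" idea: drop the inode's reference
 * before the disk's final put, so bdput() never touches freed memory. */
static void inode_detach_wb_model(struct bdev_inode *inode)
{
	if (inode->wb_ref) {
		queue_put(inode->wb_ref);
		inode->wb_ref = NULL;
	}
}
```

In the buggy ordering, put_disk() drops the last disk reference first and frees the queue, and the later bdput() path then does the equivalent of queue_put() on freed memory - exactly the stray percpu_ref_put() the slab poison caught.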
Re: [PATCH] ceph:Fix error handling in the function down_reply
On Mon, Nov 9, 2015 at 11:15 AM, Yan, Zheng wrote:
>
>> On Nov 9, 2015, at 11:11, Nicholas Krause wrote:
>>
>> This fixes error handling in the function down_reply in order to
>> properly check and jump to the goto label, out_err for this
>> particular function if an error code is returned by any function
>> called in down_reply and therefore make checking be included
>> for the call to ceph_update_snap_trace in order to comply with
>> these error handling checks/paths.
>>
>> Signed-off-by: Nicholas Krause
>> ---
>> fs/ceph/mds_client.c | 11 +++
>> 1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>> index 51cb02d..0b01f94 100644
>> --- a/fs/ceph/mds_client.c
>> +++ b/fs/ceph/mds_client.c
>> @@ -2495,14 +2495,17 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
>> realm = NULL;
>> if (rinfo->snapblob_len) {
>> down_write(&mdsc->snap_rwsem);
>> - ceph_update_snap_trace(mdsc, rinfo->snapblob,
>> - rinfo->snapblob + rinfo->snapblob_len,
>> - le32_to_cpu(head->op) == CEPH_MDS_OP_RMSNAP,
>> - &realm);
>> + err = ceph_update_snap_trace(mdsc, rinfo->snapblob,
>> + rinfo->snapblob + rinfo->snapblob_len,
>> + le32_to_cpu(head->op) == CEPH_MDS_OP_RMSNAP,
>> + &realm);
>> downgrade_write(&mdsc->snap_rwsem);
>> } else {
>> down_read(&mdsc->snap_rwsem);
>> }
>> +
>> + if (err)
>> + goto out_err;
>>
>> /* insert trace into our cache */
>> mutex_lock(&req->r_fill_mutex);
>
> Applied, thanks

This looks to me like it'd leave snap_rwsem locked for read? Also, the name of the function in question is handle_reply(), not down_reply(). I'll revert testing.

Thanks, Ilya
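The locking leak Ilya spots is easy to see in a compact model: both branches of the if/else end up holding snap_rwsem (for read), so an error exit placed after them must drop the lock before bailing out. This sketch shows a structure that would not leak the lock - it is a hedged illustration of the hazard, not the actual mds_client.c fix:

```c
#include <assert.h>

/* Fake read lock, modeled with a counter so a leak is observable. */
static int read_locked;

static void down_read_(void)
{
	read_locked++;
}

static void up_read_(void)
{
	read_locked--;
}

/* Shape of handle_reply(): the snapblob branch does the update (which
 * can fail) and still ends up read-locked via downgrade_write(); the
 * other branch takes the read lock directly. */
static int handle_reply_model(int have_snapblob, int fail)
{
	int err = 0;

	if (have_snapblob) {
		/* down_write + ceph_update_snap_trace + downgrade_write */
		if (fail)
			err = -1;
		down_read_();	/* read-locked either way */
	} else {
		down_read_();
	}

	if (err) {
		up_read_();	/* without this, we return still locked */
		return err;
	}

	/* ... insert trace into our cache ... */
	up_read_();
	return 0;
}
```

The reverted patch jumped to out_err at this point without the unlock, which is precisely the "leave snap_rwsem locked for read" problem.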
[PATCH 2/4] libceph: drop authorizer check from cephx msg signing routines
I don't see a way for auth->authorizer to be NULL in ceph_x_sign_message() or ceph_x_check_message_signature().

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
net/ceph/auth_x.c | 5 +
1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/ceph/auth_x.c b/net/ceph/auth_x.c
index 65054fd31b97..3a544ca6b5ce 100644
--- a/net/ceph/auth_x.c
+++ b/net/ceph/auth_x.c
@@ -697,8 +697,7 @@ static int ceph_x_sign_message(struct ceph_auth_handshake *auth,
struct ceph_msg *msg)
{
int ret;
- if (!auth->authorizer)
- return 0;
+
ret = calcu_signature((struct ceph_x_authorizer *)auth->authorizer, msg, &msg->footer.sig);
if (ret < 0)
@@ -713,8 +712,6 @@ static int ceph_x_check_message_signature(struct ceph_auth_handshake *auth,
__le64 sig_check;
int ret;
- if (!auth->authorizer)
- return 0;
ret = calcu_signature((struct ceph_x_authorizer *)auth->authorizer, msg, &sig_check);
if (ret < 0)
--
2.4.3
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] libceph: msg signing callouts don't need con argument
We can use msg->con instead - at the point we sign an outgoing message or check the signature on the incoming one, msg->con is always set. We wouldn't know how to sign a message without an associated session (i.e. msg->con == NULL) and being able to sign a message using an explicitly provided authorizer is of no use. Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- fs/ceph/mds_client.c | 14 -- include/linux/ceph/messenger.h | 5 ++--- net/ceph/messenger.c | 4 ++-- net/ceph/osd_client.c | 14 -- 4 files changed, 20 insertions(+), 17 deletions(-) diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 89838a226fe9..e7b130a637f9 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -3942,17 +3942,19 @@ static struct ceph_msg *mds_alloc_msg(struct ceph_connection *con, return msg; } -static int sign_message(struct ceph_connection *con, struct ceph_msg *msg) +static int mds_sign_message(struct ceph_msg *msg) { - struct ceph_mds_session *s = con->private; + struct ceph_mds_session *s = msg->con->private; struct ceph_auth_handshake *auth = >s_auth; + return ceph_auth_sign_message(auth, msg); } -static int check_message_signature(struct ceph_connection *con, struct ceph_msg *msg) +static int mds_check_message_signature(struct ceph_msg *msg) { - struct ceph_mds_session *s = con->private; + struct ceph_mds_session *s = msg->con->private; struct ceph_auth_handshake *auth = >s_auth; + return ceph_auth_check_message_signature(auth, msg); } @@ -3965,8 +3967,8 @@ static const struct ceph_connection_operations mds_con_ops = { .invalidate_authorizer = invalidate_authorizer, .peer_reset = peer_reset, .alloc_msg = mds_alloc_msg, - .sign_message = sign_message, - .check_message_signature = check_message_signature, + .sign_message = mds_sign_message, + .check_message_signature = mds_check_message_signature, }; /* eof */ diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h index b2371d9b51fa..3687ff0f0133 100644 --- 
a/include/linux/ceph/messenger.h +++ b/include/linux/ceph/messenger.h @@ -43,10 +43,9 @@ struct ceph_connection_operations { struct ceph_msg * (*alloc_msg) (struct ceph_connection *con, struct ceph_msg_header *hdr, int *skip); - int (*sign_message) (struct ceph_connection *con, struct ceph_msg *msg); - int (*check_message_signature) (struct ceph_connection *con, - struct ceph_msg *msg); + int (*sign_message) (struct ceph_msg *msg); + int (*check_message_signature) (struct ceph_msg *msg); }; /* use format string %s%d */ diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index fce6ad636613..805f6f82139f 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -1205,7 +1205,7 @@ static void prepare_write_message_footer(struct ceph_connection *con) con->out_kvec[v].iov_base = >footer; if (con->peer_features & CEPH_FEATURE_MSG_AUTH) { if (con->ops->sign_message) - con->ops->sign_message(con, m); + con->ops->sign_message(m); else m->footer.sig = 0; con->out_kvec[v].iov_len = sizeof(m->footer); @@ -2422,7 +2422,7 @@ static int read_partial_message(struct ceph_connection *con) } if (need_sign && con->ops->check_message_signature && - con->ops->check_message_signature(con, m)) { + con->ops->check_message_signature(m)) { pr_err("read_partial_message %p signature check failed\n", m); return -EBADMSG; } diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 191bc21cecea..118e4ce37ecc 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -2979,17 +2979,19 @@ static int invalidate_authorizer(struct ceph_connection *con) return ceph_monc_validate_auth(>client->monc); } -static int sign_message(struct ceph_connection *con, struct ceph_msg *msg) +static int osd_sign_message(struct ceph_msg *msg) { - struct ceph_osd *o = con->private; + struct ceph_osd *o = msg->con->private; struct ceph_auth_handshake *auth = >o_auth; + return ceph_auth_sign_message(auth, msg); } -static int check_message_signature(struct ceph_connection *con, struct ceph_msg 
*msg) +static int osd_check_message_signature(struct ceph_msg *msg) { - struct ceph_osd *o = con->private; + struct ceph_osd *o = msg->con->private; struct ceph_auth_handshake *auth = >o_auth; + return ceph_auth_check_message_signature(auth, ms
[PATCH 0/4] libceph: nocephx_sign_messages option + misc
Hello, This adds nocephx_sign_messages libceph option (a lack of which is something people are running into, see [1]), plus a couple of related cleanups. [1] https://forum.proxmox.com/threads/24116-new-krbd-option-on-pve4-don-t-work Thanks, Ilya Ilya Dryomov (4): libceph: msg signing callouts don't need con argument libceph: drop authorizer check from cephx msg signing routines libceph: stop duplicating client fields in messenger libceph: add nocephx_sign_messages option fs/ceph/mds_client.c | 14 -- include/linux/ceph/libceph.h | 4 +++- include/linux/ceph/messenger.h | 16 +++- net/ceph/auth_x.c | 8 ++-- net/ceph/ceph_common.c | 18 +- net/ceph/messenger.c | 32 net/ceph/osd_client.c | 14 -- 7 files changed, 53 insertions(+), 53 deletions(-) -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Re: [PATCH] mark rbd requiring stable pages
On Fri, Oct 23, 2015 at 9:06 PM, Ilya Dryomov <idryo...@gmail.com> wrote: > On Fri, Oct 23, 2015 at 9:00 PM, ronny.hegew...@online.de > <ronny.hegew...@online.de> wrote: >>> Could you share the entire log snippet for those 10 minutes? >> >> Thats all in the logs. But if more information would be useful tell me >> which logs >> to activate and i will give it another run. At least this part is easy to >> reproduce. >> >>> Which kernel was this on? >> >> The latest kernel i used which produced the corruption was 3.19.8. >> The earliest one was 3.11. > > No need for now, I'll poke around and report back. So the "bad crc" errors are of course easily reproducible, but I haven't managed to reproduce ext4 corruptions. I amended your patch to only require stable pages in case we actually compute checksums, see https://github.com/ceph/ceph-client/commit/4febcceb866822c1a1aee2836c9c810e3ef29bbf. Any other data points you can share? Can you describe your cluster (boxes, OSDs, clients, rbds mapped - where, how many, ext4 mkfs and mount options, etc) in more detail? Is there anything special about your setup that you can think of? You've mentioned that the best test case in your experience is kernel compilation. What .config are you using, how many threads (make -jX) and how long does it take to build a kernel with that .config and that number of threads? You have more than one rbd device mapped on the same box - how many exactly, do you put any load on the rest while the kernel is compiling on one of them? What about rbd devices mapped on other boxes? You get the idea - every bit counts. Thanks, Ilya -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: ceph: osd_client: change osd_req_op_data() macro
On Thu, Oct 22, 2015 at 5:06 PM, Ioana Ciornei wrote:
> This patch changes the osd_req_op_data() macro to not evaluate
> parameters more than once in order to follow the kernel coding style.
>
> Signed-off-by: Ioana Ciornei
> Reviewed-by: Alex Elder
> ---
> net/ceph/osd_client.c | 10 ++
> 1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index a362d7e..856e8f8 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -120,10 +120,12 @@ static void ceph_osd_data_bio_init(struct ceph_osd_data *osd_data,
> }
> #endif /* CONFIG_BLOCK */
>
> -#define osd_req_op_data(oreq, whch, typ, fld) \
> - ({ \
> - BUG_ON(whch >= (oreq)->r_num_ops); \
> - &(oreq)->r_ops[whch].typ.fld; \
> +#define osd_req_op_data(oreq, whch, typ, fld)\
> + ({\
> + struct ceph_osd_request *__oreq = (oreq); \
> + unsigned int __whch = (whch); \
> + BUG_ON(__whch >= __oreq->r_num_ops); \
> + &__oreq->r_ops[__whch].typ.fld; \
> }) 
>
> static struct ceph_osd_data *

For some reason this ended up in Spam - you should CC subsystem maintainer(s) on any patch, as noted in Documentation/SubmittingPatches. Applied with minor changes, see https://github.com/ceph/ceph-client/commit/51dcb83f3d1143819fe0791c112e1b1f830e457d.

Thanks, Ilya
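The double-evaluation problem this patch fixes is easy to demonstrate in plain C with GNU statement expressions. op_data_naive/op_data_fixed below mirror the before/after shape of osd_req_op_data() with simplified, illustrative types (assert stands in for BUG_ON):

```c
#include <assert.h>

struct req {
	int vals[4];
	int num;
};

/* Before: `whch` appears twice, so a side-effecting argument like
 * i++ is evaluated twice - once in the bounds check, once in the
 * index. */
#define op_data_naive(oreq, whch)		\
	({					\
		assert((whch) < (oreq)->num);	\
		&(oreq)->vals[(whch)];		\
	})

/* After: each argument is cached in a local, so it is evaluated
 * exactly once regardless of how many times the body uses it. */
#define op_data_fixed(oreq, whch)		\
	({					\
		struct req *__o = (oreq);	\
		int __w = (whch);		\
		assert(__w < __o->num);		\
		&__o->vals[__w];		\
	})
```

With the naive macro, op_data_naive(&r, i++) bumps i twice and the bounds check tests a different index than the one actually used - exactly the class of surprise the coding style rule exists to prevent.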
[PATCH] libceph: introduce ceph_x_authorizer_cleanup()
Commit ae385eaf24dc ("libceph: store session key in cephx authorizer")
introduced ceph_x_authorizer::session_key, but didn't update all the
exit/error paths.  Introduce ceph_x_authorizer_cleanup() to encapsulate
ceph_x_authorizer cleanup and switch to it.  This fixes ceph_x_destroy(),
which currently always leaks the key, and ceph_x_build_authorizer() error
paths.

Cc: Yan, Zheng <z...@redhat.com>
Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 net/ceph/auth_x.c | 28 +++++++++++++++++-----------
 net/ceph/crypto.h |  4 +++-
 2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/net/ceph/auth_x.c b/net/ceph/auth_x.c
index ba6eb17226da..65054fd31b97 100644
--- a/net/ceph/auth_x.c
+++ b/net/ceph/auth_x.c
@@ -279,6 +279,15 @@ bad:
 	return -EINVAL;
 }
 
+static void ceph_x_authorizer_cleanup(struct ceph_x_authorizer *au)
+{
+	ceph_crypto_key_destroy(&au->session_key);
+	if (au->buf) {
+		ceph_buffer_put(au->buf);
+		au->buf = NULL;
+	}
+}
+
 static int ceph_x_build_authorizer(struct ceph_auth_client *ac,
 				   struct ceph_x_ticket_handler *th,
 				   struct ceph_x_authorizer *au)
@@ -297,7 +306,7 @@ static int ceph_x_build_authorizer(struct ceph_auth_client *ac,
 	ceph_crypto_key_destroy(&au->session_key);
 	ret = ceph_crypto_key_clone(&au->session_key, &th->session_key);
 	if (ret)
-		return ret;
+		goto out_au;
 
 	maxlen = sizeof(*msg_a) + sizeof(msg_b) +
 		ceph_x_encrypt_buflen(ticket_blob_len);
@@ -309,8 +318,8 @@ static int ceph_x_build_authorizer(struct ceph_auth_client *ac,
 	if (!au->buf) {
 		au->buf = ceph_buffer_new(maxlen, GFP_NOFS);
 		if (!au->buf) {
-			ceph_crypto_key_destroy(&au->session_key);
-			return -ENOMEM;
+			ret = -ENOMEM;
+			goto out_au;
 		}
 	}
 	au->service = th->service;
@@ -340,7 +349,7 @@ static int ceph_x_build_authorizer(struct ceph_auth_client *ac,
 	ret = ceph_x_encrypt(&au->session_key, &msg_b, sizeof(msg_b),
 			     p, end - p);
 	if (ret < 0)
-		goto out_buf;
+		goto out_au;
 	p += ret;
 	au->buf->vec.iov_len = p - au->buf->vec.iov_base;
 	dout(" built authorizer nonce %llx len %d\n", au->nonce,
@@ -348,9 +357,8 @@ static int ceph_x_build_authorizer(struct ceph_auth_client *ac,
 	BUG_ON(au->buf->vec.iov_len > maxlen);
 	return 0;
 
-out_buf:
-	ceph_buffer_put(au->buf);
-	au->buf = NULL;
+out_au:
+	ceph_x_authorizer_cleanup(au);
 	return ret;
 }
 
@@ -624,8 +632,7 @@ static void ceph_x_destroy_authorizer(struct ceph_auth_client *ac,
 {
 	struct ceph_x_authorizer *au = (void *)a;
 
-	ceph_crypto_key_destroy(&au->session_key);
-	ceph_buffer_put(au->buf);
+	ceph_x_authorizer_cleanup(au);
 	kfree(au);
 }
 
@@ -653,8 +660,7 @@ static void ceph_x_destroy(struct ceph_auth_client *ac)
 		remove_ticket_handler(ac, th);
 	}
 
-	if (xi->auth_authorizer.buf)
-		ceph_buffer_put(xi->auth_authorizer.buf);
+	ceph_x_authorizer_cleanup(&xi->auth_authorizer);
 
 	kfree(ac->private);
 	ac->private = NULL;
diff --git a/net/ceph/crypto.h b/net/ceph/crypto.h
index d1498224c49d..2e9cab09f37b 100644
--- a/net/ceph/crypto.h
+++ b/net/ceph/crypto.h
@@ -16,8 +16,10 @@ struct ceph_crypto_key {
 
 static inline void ceph_crypto_key_destroy(struct ceph_crypto_key *key)
 {
-	if (key)
+	if (key) {
 		kfree(key->key);
+		key->key = NULL;
+	}
 }
 
 int ceph_crypto_key_clone(struct ceph_crypto_key *dst,
-- 
2.4.3
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
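A cleanup helper like this works because it is idempotent: every pointer is NULLed as it is freed (the crypto.h hunk does the same for key->key), so any exit path can call it, even repeatedly. A minimal userspace sketch of the pattern - illustrative names, not the kernel API:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-ins for ceph_x_authorizer's owned resources. */
struct authorizer {
	char *session_key;
	char *buf;
};

/* Idempotent cleanup: frees what is held and NULLs the pointers,
 * so it is safe to call from any error path and safe to call twice. */
static void authorizer_cleanup(struct authorizer *au)
{
	free(au->session_key);
	au->session_key = NULL;
	free(au->buf);
	au->buf = NULL;
}

static int authorizer_build(struct authorizer *au, int fail_late)
{
	au->session_key = malloc(4);
	if (!au->session_key)
		goto out;
	strcpy(au->session_key, "key");

	au->buf = malloc(128);
	if (!au->buf || fail_late)
		goto out;	/* single exit label, no per-path free lists */
	return 0;
out:
	authorizer_cleanup(au);
	return -1;
}
```

Calling authorizer_cleanup() after a failed build, or twice in a row, is a no-op - which is exactly what lets every error path funnel through a single out label.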
Re: Re: [PATCH] mark rbd requiring stable pages
On Fri, Oct 23, 2015 at 9:00 PM, ronny.hegew...@online.de wrote: >> Could you share the entire log snippet for those 10 minutes? > > Thats all in the logs. But if more information would be useful tell me which logs > to activate and i will give it another run. At least this part is easy to reproduce. > >> Which kernel was this on? > > The latest kernel i used which produced the corruption was 3.19.8. > The earliest one was 3.11. No need for now, I'll poke around and report back. Thanks, Ilya
[PATCH] rbd: remove duplicate calls to rbd_dev_mapping_clear()
Commit d1cf5788450e ("rbd: set mapping info earlier") defined
rbd_dev_mapping_clear(), but, just a few days after, commit f35a4dee14c3
("rbd: set the mapping size and features later") moved
rbd_dev_mapping_set() calls and added another rbd_dev_mapping_clear()
call instead of moving the old one.  Around the same time, another
duplicate was introduced in rbd_dev_device_release() - kill both.

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 drivers/block/rbd.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index cf087c5a6aa5..f8f4beb72b84 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5251,8 +5251,6 @@ err_out_blkdev:
 	unregister_blkdev(rbd_dev->major, rbd_dev->name);
 err_out_id:
 	rbd_dev_id_put(rbd_dev);
-	rbd_dev_mapping_clear(rbd_dev);
-
 	return ret;
 }
 
@@ -5507,7 +5505,6 @@ static void rbd_dev_device_release(struct rbd_device *rbd_dev)
 	if (!single_major)
 		unregister_blkdev(rbd_dev->major, rbd_dev->name);
 	rbd_dev_id_put(rbd_dev);
-	rbd_dev_mapping_clear(rbd_dev);
 }
 
 static void rbd_dev_remove_parent(struct rbd_device *rbd_dev)
-- 
2.4.3
Re: [PATCH] mark rbd requiring stable pages
On Fri, Oct 23, 2015 at 1:56 AM, Ronny Hegewald <ronny.hegew...@online.de> wrote: > On Thursday 22 October 2015, Ilya Dryomov wrote: >> Well, checksum mismatches are to be expected given what we are doing >> now, but I wouldn't expect any data corruptions. Ronny writes that he >> saw frequent ext4 corruptions on krbd devices before he enabled stable >> pages, which leads me to believe that the !crc case, for which we won't >> be setting BDI_CAP_STABLE_WRITES, is going to be/remain broken. Ronny, >> could you describe it in more detail and maybe share some of those osd >> logs with bad crc messages? >> > This is from a 10 minute period from one of the OSDs. > > 23:11:02.423728 ce5dfb70 0 bad crc in data 1657725429 != exp 496797267 > 23:11:37.586411 ce5dfb70 0 bad crc in data 1216602498 != exp 111161 > 23:12:07.805675 cc3ffb70 0 bad crc in data 3140625666 != exp 2614069504 > 23:12:44.485713 c96ffb70 0 bad crc in data 1712148977 != exp 3239079328 > 23:13:24.746217 ce5dfb70 0 bad crc in data 144620426 != exp 3156694286 > 23:13:52.792367 ce5dfb70 0 bad crc in data 4033880920 != exp 4159672481 > 23:14:22.958999 c96ffb70 0 bad crc in data 847688321 != exp 1551499144 > 23:16:35.015629 ce5dfb70 0 bad crc in data 2790209714 != exp 3779604715 > 23:17:48.482049 c96ffb70 0 bad crc in data 1563466764 != exp 528198494 > 23:19:28.925357 cc3ffb70 0 bad crc in data 1764275395 != exp 2075504274 > 23:19:59.039843 cc3ffb70 0 bad crc in data 2960172683 != exp 1215950691 Could you share the entire log snippet for those 10 minutes? > > The filesystem corruptions are usually ones with messages of > > EXT4-fs error (device rbd4): ext4_mb_generate_buddy:757: group 155, block > bitmap and bg descriptor inconsistent: 23625 vs 23660 free clusters > > These were pretty common, at least every other day, often multiple times a > day. > > Sometimes there was a additional > > JBD2: Spotted dirty metadata buffer (dev = rbd4, blocknr = 0). 
There's a risk > of filesystem corruption in case of system crash. > > Another type of Filesystem corruption i experienced through kernel > compilations, that lead to the following messages. > > EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #282221) - > no `.' or `..' > EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #273062) - > no `.' or `..' > EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #272270) - > no `.' or `..' > EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #282254) - > no `.' or `..' > EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #273070) - > no `.' or `..' > EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #272308) - > no `.' or `..' > EXT4-fs error (device rbd3): ext4_lookup:1417: inode #270033: comm rm: deleted > inode referenced: 270039 > last message repeated 2 times > EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #271534) - > no `.' or `..' > EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #271275) - > no `.' or `..' > EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #282290) - > no `.' or `..' > EXT4-fs warning (device rbd3): empty_dir:2488: bad directory (dir #281914) - > no `.' or `..' 
> EXT4-fs error (device rbd3): ext4_lookup:1417: inode #270033: comm rm: deleted > inode referenced: 270039 > last message repeated 2 times > kernel: EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: > deleted inode referenced: 282221 > EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted > inode referenced: 282221 > EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted > inode referenced: 281914 > EXT4-fs error (device rbd3): ext4_lookup:1417: inode #273018: comm rm: deleted > inode referenced: 281914 > EXT4-fs error: 243 callbacks suppressed > EXT4-fs error (device rbd3): ext4_lookup:1417: inode #282002: comm cp: deleted > inode referenced: 45375 > kernel: EXT4-fs error (device rbd3): ext4_lookup:1417: inode #282002: comm cp: > deleted inode referenced: 45371 > > The result was that various files and directories in the kernel sourcedir > couldn't be accessed anymore and even fsck couldn't repair it, so i had to > finally delete it. But these ones were pretty rare. > > Another issue were the data-corruptions in the files itself, that happened > independently from the filesystem-corruptions. These happened on most days, > sometimes only once, sometimes multiple times a day. > > Newly written files that contained corrupted data seem to always have it only > at one place. These corrupt data replaced the original data from the file, but > never changed the
Re: [PATCH] mark rbd requiring stable pages
On Thu, Oct 22, 2015 at 7:22 PM, Mike Christie <micha...@cs.wisc.edu> wrote: > On 10/22/15, 11:52 AM, Ilya Dryomov wrote: >> >> On Thu, Oct 22, 2015 at 5:37 PM, Mike Christie <micha...@cs.wisc.edu> >> wrote: >>> >>> On 10/22/2015 06:20 AM, Ilya Dryomov wrote: >>>> >>>> >>>>>> >>>>>> If we are just talking about if stable pages are not used, and someone >>>>>> is re-writing data to a page after the page has already been submitted >>>>>> to the block layer (I mean the page is on some bio which is on a >>>>>> request >>>>>> which is on some request_queue scheduler list or basically anywhere in >>>>>> the block layer), then I was saying this can occur with any block >>>>>> driver. There is nothing that is preventing this from happening with a >>>>>> FC driver or nvme or cciss or in dm or whatever. The app/user can >>>>>> rewrite as late as when we are in the make_request_fn/request_fn. >>>>>> >>>>>> I think I am misunderstanding your question because I thought this is >>>>>> expected behavior, and there is nothing drivers can do if the app is >>>>>> not >>>>>> doing a flush/sync between these types of write sequences. >>>> >>>> I don't see a problem with rewriting as late as when we are in >>>> request_fn() (or in a wq after being put there by request_fn()). Where >>>> I thought there *might* be an issue is rewriting after sendpage(), if >>>> sendpage() is used - perhaps some sneaky sequence similar to that >>>> retransmit bug that would cause us to *transmit* incorrect bytes (as >>>> opposed to *re*transmit) or something of that nature? >>> >>> >>> >>> Just to make sure we are on the same page. >>> >>> Are you concerned about the tcp/net layer retransmitting due to it >>> detecting a issue as part of the tcp protocol, or are you concerned >>> about rbd/libceph initiating a retry like with the nfs issue? >> >> >> The former, tcp/net layer. I'm just conjecturing though. >> > > For iscsi, we normally use the sendpage path. 
Data digests are off by > default and some distros do not even allow you to turn them on, so our > sendpage path has got a lot of testing and we have not seen any corruptions. > Not saying it is not possible, but just saying we have not seen any. Great, that's reassuring. > > It could be due to a recent change. Ronny, tell us about the workload and I > will check iscsi. > > Oh yeah, for the tcp/net retransmission case, I had said offlist, I thought > there might be a issue with iscsi but I guess I was wrong, so I have not > seen any issues with that either. I'll drop my concerns then. Those corruptions could be a bug in ceph reconnect code or something else - regardless, that's separate from the issue at hand. Thanks, Ilya -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mark rbd requiring stable pages
On Thu, Oct 22, 2015 at 6:07 AM, Mike Christie <micha...@cs.wisc.edu> wrote: > On 10/21/2015 03:57 PM, Ilya Dryomov wrote: >> On Wed, Oct 21, 2015 at 10:51 PM, Ilya Dryomov <idryo...@gmail.com> wrote: >>> On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov <idryo...@gmail.com> wrote: >>>> Hmm... On the one hand, yes, we do compute CRCs, but that's optional, >>>> so enabling this unconditionally is probably too harsh. OTOH we are >>>> talking to the network, which means all sorts of delays, retransmission >>>> issues, etc, so I wonder how exactly "unstable" pages behave when, say, >>>> added to an skb - you can't write anything to a page until networking >>>> is fully done with it and expect it to work. It's particularly >>>> alarming that you've seen corruptions. >>>> >>>> Currently the only users of this flag are block integrity stuff and >>>> md-raid5, which makes me wonder what iscsi, nfs and others do in this >>>> area. There's an old ticket on this topic somewhere on the tracker, so >>>> I'll need to research this. Thanks for bringing this up! >>> >>> Hi Mike, >>> >>> I was hoping to grab you for a few minutes, but you weren't there... >>> >>> I spent a better part of today reading code and mailing lists on this >>> topic. It is of course a bug that we use sendpage() which inlines >>> pages into an skb and do nothing to keep those pages stable. We have >>> csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc >>> case is an obvious fix. >>> >>> I looked at drbd and iscsi and I think iscsi could do the same - ditch >>> the fallback to sock_no_sendpage() in the datadgst_en case (and get rid >>> of iscsi_sw_tcp_conn::sendpage member while at it). Using stable pages >>> rather than having a roll-your-own implementation which doesn't close >>> the race but only narrows it sounds like a win, unless copying through >>> sendmsg() is for some reason cheaper than stable-waiting? 
> > Yeah, that is what I was saying on the call the other day, but the > reception was bad. We only have the sendmsg code path when digest are on > because that code came before stable pages. When stable pages were > created, it was on by default but did not cover all the cases, so we > left the code. It then handled most scenarios, but I just never got > around to removing old the code. However, it was set to off by default > so I left it and made this patch for iscsi to turn on stable pages: > > [this patch only enabled stable pages when digests/crcs are on and dif > not remove the code yet] > https://groups.google.com/forum/#!topic/open-iscsi/n4jvWK7BPYM > > I did not really like the layering so I have not posted it for inclusion. Good to know I got it right ;) > > > >>> >>> drbd still needs the non-zero-copy version for its async protocol for >>> when they free the pages before the NIC has chance to put them on the >>> wire. md-raid5 it turns out has an option to essentially disable most >>> of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate >>> if that option is enabled. >>> >>> What I'm worried about is the !crc (!datadgst_en) case. I'm failing to >>> convince myself that mucking with sendpage()ed pages while they sit in >>> the TCP queue (or anywhere in the networking stack, really), is safe - >>> there is nothing to prevent pages from being modified after sendpage() >>> returned and Ronny reports data corruptions that pretty much went away >>> with BDI_CAP_STABLE_WRITES set. I may be, after prolonged staring at >>> this, starting to confuse fs with block, though. How does that work in >>> iscsi land? > > This is what I was trying to ask about in the call the other day. Where > is the corruption that Ronny was seeing. Was it checksum mismatches on > data being written, or is incorrect meta data being written, etc? Well, checksum mismatches are to be expected given what we are doing now, but I wouldn't expect any data corruptions. 
Ronny writes that he saw frequent ext4 corruptions on krbd devices before he enabled stable pages, which leads me to believe that the !crc case, for which we won't be setting BDI_CAP_STABLE_WRITES, is going to be/remain broken. Ronny, could you describe it in more detail and maybe share some of those osd logs with bad crc messages? > > If we are just talking about if stable pages are not used, and someone > is re-writing data to a page after the page has already been submitted > to the block layer (I mean the pag
Re: [PATCH] mark rbd requiring stable pages
On Thu, Oct 22, 2015 at 5:37 PM, Mike Christie <micha...@cs.wisc.edu> wrote: > On 10/22/2015 06:20 AM, Ilya Dryomov wrote: >> >>> > >>> > If we are just talking about if stable pages are not used, and someone >>> > is re-writing data to a page after the page has already been submitted >>> > to the block layer (I mean the page is on some bio which is on a request >>> > which is on some request_queue scheduler list or basically anywhere in >>> > the block layer), then I was saying this can occur with any block >>> > driver. There is nothing that is preventing this from happening with a >>> > FC driver or nvme or cciss or in dm or whatever. The app/user can >>> > rewrite as late as when we are in the make_request_fn/request_fn. >>> > >>> > I think I am misunderstanding your question because I thought this is >>> > expected behavior, and there is nothing drivers can do if the app is not >>> > doing a flush/sync between these types of write sequences. >> I don't see a problem with rewriting as late as when we are in >> request_fn() (or in a wq after being put there by request_fn()). Where >> I thought there *might* be an issue is rewriting after sendpage(), if >> sendpage() is used - perhaps some sneaky sequence similar to that >> retransmit bug that would cause us to *transmit* incorrect bytes (as >> opposed to *re*transmit) or something of that nature? > > > Just to make sure we are on the same page. > > Are you concerned about the tcp/net layer retransmitting due to it > detecting a issue as part of the tcp protocol, or are you concerned > about rbd/libceph initiating a retry like with the nfs issue? The former, tcp/net layer. I'm just conjecturing though. (We don't have the nfs issue, because even if the client sends such a retransmit (which it won't), the primary OSD will reject it as a dup.) 
Thanks, Ilya
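The dup-rejection parenthetical deserves a sketch: ceph ops carry a client-assigned transaction id, and a server that remembers already-applied tids can drop a resend instead of applying it twice. A toy model of that check - hypothetical names, and the real OSD logic also considers the client session and replay versions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_TIDS 64

/* Toy per-client replay table: remembers which transaction ids
 * have already been applied, so a retransmitted op is a no-op. */
struct replay_table {
	uint64_t applied[MAX_TIDS];
	int n;
};

static bool already_applied(const struct replay_table *rt, uint64_t tid)
{
	for (int i = 0; i < rt->n; i++)
		if (rt->applied[i] == tid)
			return true;
	return false;
}

/* Returns true if the op was applied, false if dropped as a dup. */
static bool handle_op(struct replay_table *rt, uint64_t tid)
{
	if (already_applied(rt, tid))
		return false;
	rt->applied[rt->n++] = tid;	/* "apply" the op */
	return true;
}
```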
Re: [PATCH] mark rbd requiring stable pages
On Wed, Oct 21, 2015 at 10:51 PM, Ilya Dryomov <idryo...@gmail.com> wrote: > On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov <idryo...@gmail.com> wrote: >> Hmm... On the one hand, yes, we do compute CRCs, but that's optional, >> so enabling this unconditionally is probably too harsh. OTOH we are >> talking to the network, which means all sorts of delays, retransmission >> issues, etc, so I wonder how exactly "unstable" pages behave when, say, >> added to an skb - you can't write anything to a page until networking >> is fully done with it and expect it to work. It's particularly >> alarming that you've seen corruptions. >> >> Currently the only users of this flag are block integrity stuff and >> md-raid5, which makes me wonder what iscsi, nfs and others do in this >> area. There's an old ticket on this topic somewhere on the tracker, so >> I'll need to research this. Thanks for bringing this up! > > Hi Mike, > > I was hoping to grab you for a few minutes, but you weren't there... > > I spent a better part of today reading code and mailing lists on this > topic. It is of course a bug that we use sendpage() which inlines > pages into an skb and do nothing to keep those pages stable. We have > csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc > case is an obvious fix. > > I looked at drbd and iscsi and I think iscsi could do the same - ditch > the fallback to sock_no_sendpage() in the datadgst_en case (and get rid > of iscsi_sw_tcp_conn::sendpage member while at it). Using stable pages > rather than having a roll-your-own implementation which doesn't close > the race but only narrows it sounds like a win, unless copying through > sendmsg() is for some reason cheaper than stable-waiting? > > drbd still needs the non-zero-copy version for its async protocol for > when they free the pages before the NIC has chance to put them on the > wire. 
md-raid5 it turns out has an option to essentially disable most > of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate > if that option is enabled. > > What I'm worried about is the !crc (!datadgst_en) case. I'm failing to > convince myself that mucking with sendpage()ed pages while they sit in > the TCP queue (or anywhere in the networking stack, really), is safe - > there is nothing to prevent pages from being modified after sendpage() > returned and Ronny reports data corruptions that pretty much went away > with BDI_CAP_STABLE_WRITES set. I may be, after prolonged staring at > this, starting to confuse fs with block, though. How does that work in > iscsi land? > > (There was/is also this [1] bug, which is kind of related and probably > worth looking into at some point later. ceph shouldn't be that easily > affected - we've got state, but there is a ticket for it.) > > [1] http://www.spinics.net/lists/linux-nfs/msg34913.html And now with Mike on the CC and a mention that at least one scenario of [1] got fixed in NFS by a6b31d18b02f ("SUNRPC: Fix a data corruption issue when retransmitting RPC calls"). Thanks, Ilya -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mark rbd requiring stable pages
On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov <idryo...@gmail.com> wrote: > Hmm... On the one hand, yes, we do compute CRCs, but that's optional, > so enabling this unconditionally is probably too harsh. OTOH we are > talking to the network, which means all sorts of delays, retransmission > issues, etc, so I wonder how exactly "unstable" pages behave when, say, > added to an skb - you can't write anything to a page until networking > is fully done with it and expect it to work. It's particularly > alarming that you've seen corruptions. > > Currently the only users of this flag are block integrity stuff and > md-raid5, which makes me wonder what iscsi, nfs and others do in this > area. There's an old ticket on this topic somewhere on the tracker, so > I'll need to research this. Thanks for bringing this up! Hi Mike, I was hoping to grab you for a few minutes, but you weren't there... I spent a better part of today reading code and mailing lists on this topic. It is of course a bug that we use sendpage() which inlines pages into an skb and do nothing to keep those pages stable. We have csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc case is an obvious fix. I looked at drbd and iscsi and I think iscsi could do the same - ditch the fallback to sock_no_sendpage() in the datadgst_en case (and get rid of iscsi_sw_tcp_conn::sendpage member while at it). Using stable pages rather than having a roll-your-own implementation which doesn't close the race but only narrows it sounds like a win, unless copying through sendmsg() is for some reason cheaper than stable-waiting? drbd still needs the non-zero-copy version for its async protocol for when they free the pages before the NIC has chance to put them on the wire. md-raid5 it turns out has an option to essentially disable most of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate if that option is enabled. What I'm worried about is the !crc (!datadgst_en) case. 
I'm failing to convince myself that mucking with sendpage()ed pages while they sit in the TCP queue (or anywhere in the networking stack, really), is safe - there is nothing to prevent pages from being modified after sendpage() returned and Ronny reports data corruptions that pretty much went away with BDI_CAP_STABLE_WRITES set. I may be, after prolonged staring at this, starting to confuse fs with block, though. How does that work in iscsi land? (There was/is also this [1] bug, which is kind of related and probably worth looking into at some point later. ceph shouldn't be that easily affected - we've got state, but there is a ticket for it.) [1] http://www.spinics.net/lists/linux-nfs/msg34913.html Thanks, Ilya -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
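The sendpage() worry above can be modeled in a few lines of userspace C: with zero-copy submission the checksum is computed at queue time but the bytes are only read at transmit time, so a rewrite in between makes the wire bytes disagree with the checksum - the same shape as the "bad crc in data" messages in Ronny's OSD logs. This is a toy model, not the ceph messenger; sum32() stands in for CRC32c:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t sum32(const unsigned char *p, size_t len)
{
	uint32_t s = 0;
	while (len--)
		s = s * 31 + *p++;
	return s;
}

/* Zero-copy "send": remember a pointer and the checksum computed now. */
struct queued_send {
	const unsigned char *page;	/* not copied - like sendpage() */
	size_t len;
	uint32_t crc_at_submit;
};

static void submit(struct queued_send *q, const unsigned char *page, size_t len)
{
	q->page = page;
	q->len = len;
	q->crc_at_submit = sum32(page, len);
}

/* The "NIC" reads the page only now, long after submit(). */
static int transmit_crc_ok(const struct queued_send *q)
{
	return sum32(q->page, q->len) == q->crc_at_submit;
}
```

With stable pages the page cannot change between submit() and transmit; without them, a rewrite while the page is "in flight" produces exactly this mismatch.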
Re: [PATCH v2] Net: ceph: messenger: Use local variable cursor in read_partial_msg_data()
On Mon, Oct 19, 2015 at 4:49 AM, Shraddha Barke wrote:
> Use local variable cursor in place of &msg->cursor in
> read_partial_msg_data()
>
> Signed-off-by: Shraddha Barke
> ---
> Changes in v2-
>  Drop incorrect use of cursor
>
>  net/ceph/messenger.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index b9b0e3b..b087edd 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -2246,7 +2246,7 @@ static int read_partial_msg_data(struct ceph_connection *con)
>         if (do_datacrc)
>                 crc = con->in_data_crc;
>         while (cursor->resid) {
> -               page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
> +               page = ceph_msg_data_next(cursor, &page_offset, &length,
>                                           NULL);
>                 ret = ceph_tcp_recvpage(con->sock, page, page_offset, length);
>                 if (ret <= 0) {
> @@ -2258,7 +2258,7 @@ static int read_partial_msg_data(struct ceph_connection *con)
>
>                 if (do_datacrc)
>                         crc = ceph_crc32c_page(crc, page, page_offset, ret);
> -               (void) ceph_msg_data_advance(&msg->cursor, (size_t)ret);
> +               (void) ceph_msg_data_advance(cursor, (size_t)ret);
>         }
>         if (do_datacrc)
>                 con->in_data_crc = crc;

This looks good.  I'd like you to update this patch to do the same for
write_partial_message_data().

Thanks,

                Ilya
Re: [PATCH v3] net: ceph: messenger: Use local variable cursor instead of &msg->cursor
On Mon, Oct 19, 2015 at 6:29 PM, Shraddha Barke wrote:
> Use local variable cursor in place of &msg->cursor in
> read_partial_msg_data() and write_partial_msg_data()
>
> Signed-off-by: Shraddha Barke
> ---
> Changes in v3-
>  Replace &msg->cursor with cursor in write_partial_msg_data() too
>
>  net/ceph/messenger.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index b9b0e3b..01b567b 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -1552,8 +1552,8 @@ static int write_partial_message_data(struct ceph_connection *con)
>                 bool need_crc;
>                 int ret;
>
> -               page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
> -                                         &last_piece);
> +               page = ceph_msg_data_next(cursor, &page_offset, &length,
> +                                         &last_piece);
>                 ret = ceph_tcp_sendpage(con->sock, page, page_offset,
>                                         length, !last_piece);
>                 if (ret <= 0) {
> @@ -1564,7 +1564,7 @@ static int write_partial_message_data(struct ceph_connection *con)
>                 }
>                 if (do_datacrc && cursor->need_crc)
>                         crc = ceph_crc32c_page(crc, page, page_offset, length);
> -               need_crc = ceph_msg_data_advance(&msg->cursor, (size_t)ret);
> +               need_crc = ceph_msg_data_advance(cursor, (size_t)ret);
>         }
>
>         dout("%s %p msg %p done\n", __func__, con, msg);
> @@ -2246,8 +2246,8 @@ static int read_partial_msg_data(struct ceph_connection *con)
>         if (do_datacrc)
>                 crc = con->in_data_crc;
>         while (cursor->resid) {
> -               page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
> -                                         NULL);
> +               page = ceph_msg_data_next(cursor, &page_offset, &length,
> +                                         NULL);
>                 ret = ceph_tcp_recvpage(con->sock, page, page_offset, length);
>                 if (ret <= 0) {
> @@ -2258,7 +2258,7 @@ static int read_partial_msg_data(struct ceph_connection *con)
>
>                 if (do_datacrc)
>                         crc = ceph_crc32c_page(crc, page, page_offset, ret);
> -               (void) ceph_msg_data_advance(&msg->cursor, (size_t)ret);
> +               (void) ceph_msg_data_advance(cursor, (size_t)ret);
>         }
>         if (do_datacrc)
>                 con->in_data_crc = crc;

Applied, see
https://github.com/ceph/ceph-client/commit/621a56fd69751d263795f4f35e65eff7daa3a470.

Generally, we prefix net/ceph commits with "libceph".  Different
subsystems have different conventions, use "git log --oneline <dir>" to
get a sense of what is preferred.

Thanks,

                Ilya
[PATCH] rbd: return -ENOMEM instead of pool id if rbd_dev_create() fails
Returning pool id (i.e. >= 0) from a sysfs ->store() callback makes
userspace think it needs to retry the write.  Fix it - it's a leftover
from the times when the equivalent of rbd_dev_create() was the first
action in rbd_add().

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 drivers/block/rbd.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 07f666f4ca18..df795deffe77 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5394,7 +5394,7 @@ static ssize_t do_rbd_add(struct bus_type *bus,
 	struct rbd_spec *spec = NULL;
 	struct rbd_client *rbdc;
 	bool read_only;
-	int rc = -ENOMEM;
+	int rc;
 
 	if (!try_module_get(THIS_MODULE))
 		return -ENODEV;
@@ -5429,8 +5429,10 @@ static ssize_t do_rbd_add(struct bus_type *bus,
 	}
 
 	rbd_dev = rbd_dev_create(rbdc, spec, rbd_opts);
-	if (!rbd_dev)
+	if (!rbd_dev) {
+		rc = -ENOMEM;
 		goto err_out_client;
+	}
 	rbdc = NULL;		/* rbd_dev now owns this */
 	spec = NULL;		/* rbd_dev now owns this */
 	rbd_opts = NULL;	/* rbd_dev now owns this */
-- 
2.4.3
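The userspace side of this bug is worth spelling out: write(2) on a sysfs file returns whatever ->store() returns, and a careful writer treats a positive return smaller than the submitted count as a partial write and retries with the remaining bytes - re-invoking ->store() with the tail of the buffer. A userspace model of that loop (store_cb stands in for the sysfs callback; none of these names are kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* Model: a ->store() callback returns bytes consumed or negative errno. */
typedef long (*store_cb)(const char *buf, size_t count);

/* Echo of a careful writer: keep writing the unconsumed tail.  Returns
 * the number of store calls made, or -1 on error / still looping. */
static int write_all(store_cb cb, const char *buf, size_t count, int max_calls)
{
	int calls = 0;
	while (count > 0 && calls < max_calls) {
		long ret = cb(buf, count);
		calls++;
		if (ret < 0)
			return -1;
		buf += ret;
		count -= (size_t)ret;
	}
	return count == 0 ? calls : -1;
}

/* Buggy rbd_add(): mapping succeeded, but it returned the pool id (say, 2). */
static long store_returns_pool_id(const char *buf, size_t count)
{
	(void)buf; (void)count;
	return 2;
}

/* Fixed rbd_add(): consume the whole write on success. */
static long store_returns_count(const char *buf, size_t count)
{
	(void)buf;
	return (long)count;
}
```

With the buggy return value, a 12-byte command whose image lands in pool 2 gets re-issued six times with ever-shorter tails; returning count makes a single call suffice.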
[PATCH] rbd: set device_type::release instead of device::release
No point in providing an empty device_type::release callback and then
setting device::release for each rbd_dev dynamically.

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 drivers/block/rbd.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 4917de726bdb..fa6767e9ed2a 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3986,14 +3986,12 @@ static const struct attribute_group *rbd_attr_groups[] = {
 	NULL
 };
 
-static void rbd_sysfs_dev_release(struct device *dev)
-{
-}
+static void rbd_dev_release(struct device *dev);
 
 static struct device_type rbd_device_type = {
 	.name		= "rbd",
 	.groups		= rbd_attr_groups,
-	.release	= rbd_sysfs_dev_release,
+	.release	= rbd_dev_release,
 };
 
 static struct rbd_spec *rbd_spec_get(struct rbd_spec *spec)
@@ -4074,7 +4072,6 @@ static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,
 	rbd_dev->dev.bus = &rbd_bus_type;
 	rbd_dev->dev.type = &rbd_device_type;
 	rbd_dev->dev.parent = &rbd_root_dev;
-	rbd_dev->dev.release = rbd_dev_release;
 	device_initialize(&rbd_dev->dev);
 
 	rbd_dev->rbd_client = rbdc;
-- 
2.4.3
[PATCH] rbd: don't free rbd_dev outside of the release callback
struct rbd_device has struct device embedded in it, which means it's part of kobject universe and has an unpredictable life cycle. Freeing its memory outside of the release callback is flawed, yet commits 200a6a8be5db ("rbd: don't destroy rbd_dev in device release function") and 8ad42cd0c002 ("rbd: don't have device release destroy rbd_dev") moved rbd_dev_destroy() out to rbd_dev_image_release(). This commit reverts most of that, the key points are: - rbd_dev->dev is initialized in rbd_dev_create(), making it possible to use rbd_dev_destroy() - which is just a put_device() - both before we register with device core and after. - rbd_dev_release() (the release callback) is the only place we kfree(rbd_dev). It's also where we do module_put(), keeping the module unload race window as small as possible. - We pin the module in rbd_dev_create(), but only for mapping rbd_dev-s. Moving image related stuff out of struct rbd_device into another struct which isn't tied with sysfs and device core is long overdue, but until that happens, this will keep rbd module refcount (which users can observe with lsmod) sane. 
Fixes: http://tracker.ceph.com/issues/12697 Cc: Alex Elder <el...@linaro.org> Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- drivers/block/rbd.c | 89 - 1 file changed, 47 insertions(+), 42 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index df795deffe77..4917de726bdb 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -418,8 +418,6 @@ MODULE_PARM_DESC(single_major, "Use a single major number for all rbd devices (d static int rbd_img_request_submit(struct rbd_img_request *img_request); -static void rbd_dev_device_release(struct device *dev); - static ssize_t rbd_add(struct bus_type *bus, const char *buf, size_t count); static ssize_t rbd_remove(struct bus_type *bus, const char *buf, @@ -4038,6 +4036,25 @@ static void rbd_spec_free(struct kref *kref) kfree(spec); } +static void rbd_dev_release(struct device *dev) +{ + struct rbd_device *rbd_dev = dev_to_rbd_dev(dev); + bool need_put = !!rbd_dev->opts; + + rbd_put_client(rbd_dev->rbd_client); + rbd_spec_put(rbd_dev->spec); + kfree(rbd_dev->opts); + kfree(rbd_dev); + + /* +* This is racy, but way better than putting module outside of +* the release callback. The race window is pretty small, so +* doing something similar to dm (dm-builtin.c) is overkill. 
+*/ + if (need_put) + module_put(THIS_MODULE); +} + static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc, struct rbd_spec *spec, struct rbd_options *opts) @@ -4054,6 +4071,12 @@ static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc, INIT_LIST_HEAD(&rbd_dev->node); init_rwsem(&rbd_dev->header_rwsem); + rbd_dev->dev.bus = &rbd_bus_type; + rbd_dev->dev.type = &rbd_device_type; + rbd_dev->dev.parent = &rbd_root_dev; + rbd_dev->dev.release = rbd_dev_release; + device_initialize(&rbd_dev->dev); + rbd_dev->rbd_client = rbdc; rbd_dev->spec = spec; rbd_dev->opts = opts; @@ -4065,15 +4088,21 @@ static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc, rbd_dev->layout.fl_object_size = cpu_to_le32(1 << RBD_MAX_OBJ_ORDER); rbd_dev->layout.fl_pg_pool = cpu_to_le32((u32) spec->pool_id); + /* +* If this is a mapping rbd_dev (as opposed to a parent one), +* pin our module. We have a ref from do_rbd_add(), so use +* __module_get(). +*/ + if (rbd_dev->opts) + __module_get(THIS_MODULE); + return rbd_dev; } static void rbd_dev_destroy(struct rbd_device *rbd_dev) { - rbd_put_client(rbd_dev->rbd_client); - rbd_spec_put(rbd_dev->spec); - kfree(rbd_dev->opts); - kfree(rbd_dev); + if (rbd_dev) + put_device(&rbd_dev->dev); } /* @@ -4699,27 +4728,6 @@ static int rbd_dev_header_info(struct rbd_device *rbd_dev) return rbd_dev_v2_header_info(rbd_dev); } -static int rbd_bus_add_dev(struct rbd_device *rbd_dev) -{ - struct device *dev; - int ret; - - dev = &rbd_dev->dev; - dev->bus = &rbd_bus_type; - dev->type = &rbd_device_type; - dev->parent = &rbd_root_dev; - dev->release = rbd_dev_device_release; - dev_set_name(dev, "%d", rbd_dev->dev_id); - ret = device_register(dev); - - return ret; -} - -static void rbd_bus_del_dev(struct rbd_device *rbd_dev) -{ - device_unregister(&rbd_dev->dev); -} - /* * Get a unique rbd identifier for the given new rbd_dev, and add * the rbd_dev to the global lis
Re: [PATCH] Net: ceph: osd_client: Remove con argument in handle_reply
On Sun, Oct 18, 2015 at 10:25 AM, Shraddha Barke wrote: > Since the function handle_reply does not use it's con argument, > remove it. > > Signed-off-by: Shraddha Barke > --- > net/ceph/osd_client.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) > > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c > index 80b94e3..e281f60 100644 > --- a/net/ceph/osd_client.c > +++ b/net/ceph/osd_client.c > @@ -1745,8 +1745,7 @@ static void complete_request(struct ceph_osd_request > *req) > * handle osd op reply. either call the callback if it is specified, > * or do the completion to wake up the waiting thread. > */ > -static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg, > -struct ceph_connection *con) > +static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg) > { > void *p, *end; > struct ceph_osd_request *req; > @@ -2802,7 +2801,7 @@ static void dispatch(struct ceph_connection *con, > struct ceph_msg *msg) > ceph_osdc_handle_map(osdc, msg); > break; > case CEPH_MSG_OSD_OPREPLY: > - handle_reply(osdc, msg, con); > + handle_reply(osdc, msg); > break; > case CEPH_MSG_WATCH_NOTIFY: > handle_watch_notify(osdc, msg); Applied, with a slightly tweaked change log - see https://github.com/ceph/ceph-client/commit/81c842af2b39009ec1f8272a33b149ad0e845aaf. Thanks, Ilya
Re: [PATCH] Net: ceph: messenger: Use local variable cursor in read_partial_msg_data()
On Sun, Oct 18, 2015 at 12:00 PM, Shraddha Barke wrote: > Use local variable cursor in place of &msg->cursor in > read_partial_msg_data() > > Signed-off-by: Shraddha Barke > --- > net/ceph/messenger.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c > index b9b0e3b..b3d1973 100644 > --- a/net/ceph/messenger.c > +++ b/net/ceph/messenger.c > @@ -2231,7 +2231,7 @@ static int read_partial_message_section(struct > ceph_connection *con, > static int read_partial_msg_data(struct ceph_connection *con) > { > struct ceph_msg *msg = con->in_msg; > - struct ceph_msg_data_cursor *cursor = &msg->cursor; > + struct ceph_msg_data_cursor *cursor = cursor; This is not going to work... Thanks, Ilya
Re: [PATCH] rbd: don't leak parent_spec in rbd_dev_probe_parent()
On Thu, Oct 15, 2015 at 11:10 PM, Alex Elder <el...@ieee.org> wrote: > On 10/11/2015 01:03 PM, Ilya Dryomov wrote: >> Currently we leak parent_spec and trigger a "parent reference >> underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails. >> The problem is we take the !parent out_err branch and that only drops >> refcounts; parent_spec that would've been freed had we called >> rbd_dev_unparent() remains and triggers rbd_warn() in >> rbd_dev_parent_put() - at that point we have parent_spec != NULL and >> parent_ref == 0, so counter ends up being -1 after the decrement. > > OK, now that I understand the context... > > You now get extra references for the spec and client > for the parent only after creating the parent device. > I think the reason they logically belonged before > the call to rbd_device_create() for the parent is > because the client and spec pointers passed to that > function carry with them references that are passed > to the resulting rbd_device if successful. Let me stress that those two get()s that I moved is not the point of the patch at all. Whether we do it before or after rbd_dev_create() is pretty much irrelevant, the only reason I moved those calls is to make the error path simpler. > > If creating the parent device fails, you unparent the > original device, which will still have a null parent > pointer. The effect of unparenting in this case is > dropping a reference to the parent's spec, and clearing > the device's pointer to it. This is confusing, but > let's run with it. > > If creating the parent device succeeds, references to > the client and parent spec are taken (basically, these > belong to the just-created parent device). The parent > image is now probed. If this fails, you again > unparent the device. We still have not set the > device's parent pointer, so the effect is as before, > dropping the parent spec reference and clearing > the device's reference to it. 
The error handling > now destroys the parent, which drops references to > its client and the spec. Again, this seems > confusing, as if we've dropped one more reference > to the parent spec than desired. > > This logic now seems to work. But it's working > around a flaw in the caller I think. Upon entry > to rbd_dev_probe_parent(), a layered device will > have have a non-null parent_spec pointer (and a > reference to it), which will have been filled in > by rbd_dev_v2_parent_info(). > > Really, it should not be rbd_dev_probe_parent() > that drops that parent spec reference on error. > Instead, rbd_dev_image_probe() (which got the > reference to the parent spec) should handle > cleaning up the device's parent spec if an > error occurs after it has been assigned. > > I'll wait for your response, I'd like to know > if what I'm saying makes sense. All of the above makes sense. I agree that it is asymmetrical and that it would have been better to have rbd_dev_image_probe() drop the last ref on ->parent_spec. But, that's not what all of the existing code is doing. Consider rbd_dev_probe_parent() without my patch. There are two out_err jumps. As it is, if rbd_dev_create() fails, we only revert those two get()s and return with alive ->parent_spec. However, if rbd_dev_image_probe() fails, we call rbd_dev_unparent(), which will put ->parent_spec. Really, the issue is rbd_dev_unparent(), which not only removes the parent rbd_dev, but also the parent spec. All I did with this patch was align both out_err cases to do the same, namely undo the effects of rbd_dev_v2_parent_info() - I didn't want to create yet another error path. Thanks, Ilya -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mark rbd requiring stable pages
On Thu, Oct 15, 2015 at 8:50 PM, Ronny Hegewaldwrote: > rbd requires stable pages, as it performs a crc of the page data before they > are send to the OSDs. > > But since kernel 3.9 (patch 1d1d1a767206fbe5d4c69493b7e6d2a8d08cc0a0 "mm: only > enforce stable page writes if the backing device requires it") it is not > assumed anymore that block devices require stable pages. > > This patch sets the necessary flag to get stable pages back for rbd. > > In a ceph installation that provides multiple ext4 formatted rbd devices "bad > crc" messages appeared regularly (ca 1 message every 1-2 minutes on every OSD > that provided the data for the rbd) in the OSD-logs before this patch. After > this patch this messages are pretty much gone (only ca 1-2 / month / OSD). > > This patch seems also to fix data and filesystem corruptions on ext4 formatted > rbd devices that were previously seen on pretty much a daily basis. But it is > unknown at the moment why this is the case. > > Signed-off-by: Ronny Hegewald > > --- > > That the mentioned corruption issue is really affected through this patch > i could verify through the system logs. Since installation of this patch i > have seen only a 2-3 filesystem corruptions. But these could be also just > leftovers of corruptions that happened before the installation but got noticed > from ext4 only later after the patched kernel was installed. This seems even > more likely as i have seen not a single data corruption issue since the patch. > > --- linux/drivers/block/rbd.c.org 2015-10-07 01:32:55.90667 + > +++ linux/drivers/block/rbd.c 2015-09-04 02:21:22.34999 + > @@ -3786,6 +3786,7 @@ > > blk_queue_merge_bvec(q, rbd_merge_bvec); > disk->queue = q; > + disk->queue->backing_dev_info.capabilities |= BDI_CAP_STABLE_WRITES; > > q->queuedata = rbd_dev; Hmm... On the one hand, yes, we do compute CRCs, but that's optional, so enabling this unconditionally is probably too harsh. 
OTOH we are talking to the network, which means all sorts of delays, retransmission issues, etc, so I wonder how exactly "unstable" pages behave when, say, added to an skb - you can't write anything to a page until networking is fully done with it and expect it to work. It's particularly alarming that you've seen corruptions. Currently the only users of this flag are block integrity stuff and md-raid5, which makes me wonder what iscsi, nfs and others do in this area. There's an old ticket on this topic somewhere on the tracker, so I'll need to research this. Thanks for bringing this up! Ilya
Re: [PATCH] rbd: don't leak parent_spec in rbd_dev_probe_parent()
On Fri, Oct 16, 2015 at 1:50 PM, Alex Elder <el...@ieee.org> wrote: > On 10/16/2015 04:50 AM, Ilya Dryomov wrote: >> On Thu, Oct 15, 2015 at 11:10 PM, Alex Elder <el...@ieee.org> wrote: >>> On 10/11/2015 01:03 PM, Ilya Dryomov wrote: >>>> Currently we leak parent_spec and trigger a "parent reference >>>> underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails. >>>> The problem is we take the !parent out_err branch and that only drops >>>> refcounts; parent_spec that would've been freed had we called >>>> rbd_dev_unparent() remains and triggers rbd_warn() in >>>> rbd_dev_parent_put() - at that point we have parent_spec != NULL and >>>> parent_ref == 0, so counter ends up being -1 after the decrement. >>> >>> OK, now that I understand the context... >>> >>> You now get extra references for the spec and client >>> for the parent only after creating the parent device. >>> I think the reason they logically belonged before >>> the call to rbd_device_create() for the parent is >>> because the client and spec pointers passed to that >>> function carry with them references that are passed >>> to the resulting rbd_device if successful. >> >> Let me stress that those two get()s that I moved is not the point of >> the patch at all. Whether we do it before or after rbd_dev_create() is >> pretty much irrelevant, the only reason I moved those calls is to make >> the error path simpler. > > Yes I understand that. > >>> If creating the parent device fails, you unparent the >>> original device, which will still have a null parent >>> pointer. The effect of unparenting in this case is >>> dropping a reference to the parent's spec, and clearing >>> the device's pointer to it. This is confusing, but >>> let's run with it. >>> >>> If creating the parent device succeeds, references to >>> the client and parent spec are taken (basically, these >>> belong to the just-created parent device). The parent >>> image is now probed. If this fails, you again >>> unparent the device. 
We still have not set the >>> device's parent pointer, so the effect is as before, >>> dropping the parent spec reference and clearing >>> the device's reference to it. The error handling >>> now destroys the parent, which drops references to >>> its client and the spec. Again, this seems >>> confusing, as if we've dropped one more reference >>> to the parent spec than desired. >>> >>> This logic now seems to work. But it's working >>> around a flaw in the caller I think. Upon entry >>> to rbd_dev_probe_parent(), a layered device will >>> have have a non-null parent_spec pointer (and a >>> reference to it), which will have been filled in >>> by rbd_dev_v2_parent_info(). >>> >>> Really, it should not be rbd_dev_probe_parent() >>> that drops that parent spec reference on error. >>> Instead, rbd_dev_image_probe() (which got the >>> reference to the parent spec) should handle >>> cleaning up the device's parent spec if an >>> error occurs after it has been assigned. >>> >>> I'll wait for your response, I'd like to know >>> if what I'm saying makes sense. >> >> All of the above makes sense. I agree that it is asymmetrical and that >> it would have been better to have rbd_dev_image_probe() drop the last >> ref on ->parent_spec. But, that's not what all of the existing code is >> doing. > > Agreed. > >> Consider rbd_dev_probe_parent() without my patch. There are two >> out_err jumps. As it is, if rbd_dev_create() fails, we only revert >> those two get()s and return with alive ->parent_spec. However, if >> rbd_dev_image_probe() fails, we call rbd_dev_unparent(), which will put >> ->parent_spec. Really, the issue is rbd_dev_unparent(), which not only >> removes the parent rbd_dev, but also the parent spec. All I did with >> this patch was align both out_err cases to do the same, namely undo the >> effects of rbd_dev_v2_parent_info() - I didn't want to create yet >> another error path. > > OK. 
Walking through the code I did concluded that what > you did made things "line up" properly. But I just felt > like it wasn't necessarily the *right* fix, but it was > a correct fix. > > The only point in suggesting what I think would be a better > fix is that it would make things easier to reason about > and get correct, and in so doing, make it easier to > maintain and adapt in the future. Well, I'm fixing another rbd_dev refcount/use-after-free problem right now, so I think the only thing that will really make it easier to maintain this is refactor into just a couple of symmetrical functions. Making a single change (e.g. patching rbd_dev_unparent() to only touch ->parent and not ->parent_spec) is going to be invasive - there are just way too many different error paths. Thanks, Ilya -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: set max_sectors explicitly
On Fri, Oct 16, 2015 at 2:22 PM, Alex Elder <el...@ieee.org> wrote: > On 10/07/2015 12:00 PM, Ilya Dryomov wrote: >> Commit 30e2bc08b2bb ("Revert "block: remove artifical max_hw_sectors >> cap"") restored a clamp on max_sectors. It's now 2560 sectors instead >> of 1024, but it's not good enough: we set max_hw_sectors to rbd object >> size because we don't want object sized I/Os to be split, and the >> default object size is 4M. >> >> So, set max_sectors to max_hw_sectors in rbd at queue init time. >> >> Signed-off-by: Ilya Dryomov <idryo...@gmail.com> >> --- >> drivers/block/rbd.c | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c >> index 05072464d25e..04e69b4df664 100644 >> --- a/drivers/block/rbd.c >> +++ b/drivers/block/rbd.c >> @@ -3760,6 +3760,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev) >> /* set io sizes to object size */ >> segment_size = rbd_obj_bytes(&rbd_dev->header); >> blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE); >> + q->limits.max_sectors = queue_max_hw_sectors(q); > > This currently is done by default by blk_queue_max_hw_sectors(). > Do you see any different behavior with this patch in place? > > This change seems at least harmless so if it's not too late: Not quite. The commit mentioned above reintroduced this bit into blk_limits_max_hw_sectors() in 4.3-rc1: 241 limits->max_sectors = min_t(unsigned int, max_hw_sectors, 242 BLK_DEF_MAX_SECTORS); So this patch fixes a performance regression. If you spin up 4.3-rc1+ and map an rbd image without changing any settings, the best you'll see is 1280k I/Os despite a 4096k (typically) max_hw_sectors. Thanks, Ilya
Re: [PATCH] ceph/osd_client: add support for CEPH_OSD_OP_GETXATTR
On Thu, Oct 15, 2015 at 12:51 AM, David Disseldorp <dd...@suse.de> wrote: > On Wed, 14 Oct 2015 19:57:46 +0200, Ilya Dryomov wrote: > >> On Wed, Oct 14, 2015 at 7:37 PM, David Disseldorp <dd...@suse.de> wrote: > ... >> > Ping, any feedback on the patch? >> >> The patch itself looks OK, except for the part where you rename a local >> variable for no reason, AFACT. > > Thanks for the feedback Ilya. I presume you're referring to the pagelist > -> req_pagelist rename - I'll drop that change and send an update. Also, the prefix for net/ceph patches is libceph. > >> We've discussed some of this last week. As it is, all rbd image >> properties are stored in omap, so PR info strings stored in xattrs is >> something different, but it should be fine for now. I'd rather not >> merge any kernel patches related to rbd-target work until we see >> a complete patchset though. There's been a lot of back and forth >> between Mike, Christoph and target people and the general approach had >> changed at least twice, so I'd like to wait for things to settle down. > > I understand the desire to wait on any RBD-target changes until upstream > consensus has been reached, but I'd argue that this change is isolated, > and complementary to the existing SETXATTR / CMPXATTR support. That's true, but we already have a bunch of unused code in net/ceph, which was added using precisely this reasoning, so I'm a bit hesitant. Thanks, Ilya -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: don't leak parent_spec in rbd_dev_probe_parent()
On Thu, Oct 15, 2015 at 7:10 PM, Alex Elder <el...@ieee.org> wrote: > On 10/11/2015 01:03 PM, Ilya Dryomov wrote: >> Currently we leak parent_spec and trigger a "parent reference >> underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails. >> The problem is we take the !parent out_err branch and that only drops >> refcounts; parent_spec that would've been freed had we called >> rbd_dev_unparent() remains and triggers rbd_warn() in >> rbd_dev_parent_put() - at that point we have parent_spec != NULL and >> parent_ref == 0, so counter ends up being -1 after the decrement. >> >> Redo rbd_dev_probe_parent() to fix this. > > I'm just starting to look at this. > > My up-front problem is that I want to understand why > it's OK to no longer grab references to the parent spec > and the client before creating the new parent device. > Is it simply because the rbd_dev that's the subject > of this call is already holding a reference to both, > so no need to get another? I think that's right, but > you should say that in the explanation. It's a > change that's independent of the other change (which > you describe). I still grab references, I just moved that code after rbd_dev_create(). Fundamentally rbd_dev_create() is just a memory allocation, so whether we acquire references before or after doesn't matter, except that acquiring them before complicates the error path. > > OK, onto the change you describe. > > You say that we get the underflow if rbd_dev_create() fails. > > At that point in the function, we have: > parent == NULL > rbd_dev->parent_spec != NULL > parent_spec holds a new reference to that > rbdc holds a new reference to the client > > rbd_dev_create() only returns NULL if the kzalloc() fails. > And if so, we jump to out_err, take the !parent branch, > and we drop the references we took in rbdc and parent_spec > before returning. > > Where is the leak? (Actually, underflow means we're > dropping more references than we took.) 
We enter rbd_dev_probe_parent(). If ->parent_spec != NULL (and therefore has a refcount of 1), we bump refcounts and proceed to allocating parent rbd_dev. If rbd_dev_create() fails, we drop refcounts and exit rbd_dev_probe_parent() with ->parent_spec != NULL and a refcount of 1 on that spec. We take err_out_probe in rbd_dev_image_probe() and through rbd_dev_unprobe() end up in rbd_dev_parent_put(). Now, ->parent_spec != NULL but ->parent_ref == 0 because we never got to atomic_set(&rbd_dev->parent_ref, 1) in rbd_dev_probe_parent(). Local counter ends up -1 after the decrement and so instead of calling rbd_dev_unparent() which would have freed ->parent_spec we issue an underflow warning and bail. ->parent_spec is leaked. Thanks, Ilya
Re: [PATCH] ceph/osd_client: add support for CEPH_OSD_OP_GETXATTR
On Wed, Oct 14, 2015 at 7:37 PM, David Disseldorp wrote: > On Fri, 9 Oct 2015 16:43:09 +0200, David Disseldorp wrote: > >> Allows for xattr retrieval. Response data buffer allocation is the >> responsibility of the osd_req_op_xattr_init() caller. > > Ping, any feedback on the patch? The patch itself looks OK, except for the part where you rename a local variable for no reason, AFAICT. We've discussed some of this last week. As it is, all rbd image properties are stored in omap, so PR info strings stored in xattrs is something different, but it should be fine for now. I'd rather not merge any kernel patches related to rbd-target work until we see a complete patchset though. There's been a lot of back and forth between Mike, Christoph and target people and the general approach had changed at least twice, so I'd like to wait for things to settle down. Thanks, Ilya
Re: Kernel RBD Readahead
On Tue, Oct 13, 2015 at 11:33 AM, Olivier Bonvalet <ceph.l...@daevel.fr> wrote: > On Tuesday, 25 August 2015 at 17:50 +0300, Ilya Dryomov wrote: >> > Ok. I might try and create a 4.1 kernel with the blk-mq queue >> depth/IO size + readahead + max_segments fixes in, as I think the >> TCP_NODELAY bug will still be present in my old 3.14 kernel. >> >> I can build 4.2-rc8 + readahead patch on gitbuilders for you. > > > Hi, > > are those patches available in the 4.3 kernel? 4.3 isn't released yet. Everything but a readahead fix will be in 4.3. AFAICT the readahead patch was picked up by Andrew for 4.4 [1] - the problem is not in rbd, so we have no direct control over it. [1] http://comments.gmane.org/gmane.linux.kernel.commits.mm/96124 Thanks, Ilya
Re: Reply: [PATCH] rbd: prevent kernel stack blow up on rbd map
On Mon, Oct 12, 2015 at 4:22 AM, Caoxudong wrote: > By the way, do you think it's necessary that we add the clone-chain-length > limit in user-space code too? librbd is different in a lot of ways and there isn't a clean separation between the client part (i.e. what is essentially reimplemented in the kernel) and the rest (management and maintenance parts, etc). It's certainly not necessary; whether it's desirable - I'm not sure. Also, a librbd limit, if we were to introduce one, would probably have to be bigger. Thanks, Ilya
[PATCH] rbd: don't leak parent_spec in rbd_dev_probe_parent()
Currently we leak parent_spec and trigger a "parent reference underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails. The problem is we take the !parent out_err branch and that only drops refcounts; parent_spec that would've been freed had we called rbd_dev_unparent() remains and triggers rbd_warn() in rbd_dev_parent_put() - at that point we have parent_spec != NULL and parent_ref == 0, so counter ends up being -1 after the decrement. Redo rbd_dev_probe_parent() to fix this. Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- drivers/block/rbd.c | 36 1 file changed, 16 insertions(+), 20 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index cd00e4653e49..ccbc3cbbf24e 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -5134,41 +5134,37 @@ out_err: static int rbd_dev_probe_parent(struct rbd_device *rbd_dev) { struct rbd_device *parent = NULL; - struct rbd_spec *parent_spec; - struct rbd_client *rbdc; int ret; if (!rbd_dev->parent_spec) return 0; - /* -* We need to pass a reference to the client and the parent -* spec when creating the parent rbd_dev. Images related by -* parent/child relationships always share both. -*/ - parent_spec = rbd_spec_get(rbd_dev->parent_spec); - rbdc = __rbd_get_client(rbd_dev->rbd_client); - ret = -ENOMEM; - parent = rbd_dev_create(rbdc, parent_spec, NULL); - if (!parent) + parent = rbd_dev_create(rbd_dev->rbd_client, rbd_dev->parent_spec, + NULL); + if (!parent) { + ret = -ENOMEM; goto out_err; + } + + /* +* Images related by parent/child relationships always share +* rbd_client and spec/parent_spec, so bump their refcounts. 
+*/ + __rbd_get_client(rbd_dev->rbd_client); + rbd_spec_get(rbd_dev->parent_spec); ret = rbd_dev_image_probe(parent, false); if (ret < 0) goto out_err; + rbd_dev->parent = parent; atomic_set(&rbd_dev->parent_ref, 1); - return 0; + out_err: - if (parent) { - rbd_dev_unparent(rbd_dev); + rbd_dev_unparent(rbd_dev); + if (parent) rbd_dev_destroy(parent); - } else { - rbd_put_client(rbdc); - rbd_spec_put(parent_spec); - } - return ret; } -- 2.4.3
[PATCH] rbd: prevent kernel stack blow up on rbd map
Mapping an image with a long parent chain (e.g. image foo, whose parent is bar, whose parent is baz, etc) currently leads to a kernel stack overflow, due to the following recursion in the reply path: rbd_osd_req_callback() rbd_obj_request_complete() rbd_img_obj_callback() rbd_img_parent_read_callback() rbd_obj_request_complete() ... Limit the parent chain to 8 images, which is ~3K worth of stack. It's probably a good thing to do regardless - performance with more than a few images long parent chain is likely to be pretty bad. Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- drivers/block/rbd.c | 33 +++-- 1 file changed, 23 insertions(+), 10 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index ccbc3cbbf24e..07f666f4ca18 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -96,6 +96,8 @@ static int atomic_dec_return_safe(atomic_t *v) #define RBD_MINORS_PER_MAJOR 256 #define RBD_SINGLE_MAJOR_PART_SHIFT4 +#define RBD_MAX_PARENT_CHAIN_LEN 8 + #define RBD_SNAP_DEV_NAME_PREFIX "snap_" #define RBD_MAX_SNAP_NAME_LEN \ (NAME_MAX - (sizeof (RBD_SNAP_DEV_NAME_PREFIX) - 1)) @@ -426,7 +428,7 @@ static ssize_t rbd_add_single_major(struct bus_type *bus, const char *buf, size_t count); static ssize_t rbd_remove_single_major(struct bus_type *bus, const char *buf, size_t count); -static int rbd_dev_image_probe(struct rbd_device *rbd_dev, bool mapping); +static int rbd_dev_image_probe(struct rbd_device *rbd_dev, int depth); static void rbd_spec_put(struct rbd_spec *spec); static int rbd_dev_id_to_minor(int dev_id) @@ -5131,7 +5133,12 @@ out_err: return ret; } -static int rbd_dev_probe_parent(struct rbd_device *rbd_dev) +/* + * @depth is rbd_dev_image_probe() -> rbd_dev_probe_parent() -> + * rbd_dev_image_probe() recursion depth, which means it's also the + * length of the already discovered part of the parent chain. 
+ */ +static int rbd_dev_probe_parent(struct rbd_device *rbd_dev, int depth) { struct rbd_device *parent = NULL; int ret; @@ -5139,6 +5146,12 @@ static int rbd_dev_probe_parent(struct rbd_device *rbd_dev) if (!rbd_dev->parent_spec) return 0; + if (++depth > RBD_MAX_PARENT_CHAIN_LEN) { + pr_info("parent chain is too long (%d)\n", depth); + ret = -EINVAL; + goto out_err; + } + parent = rbd_dev_create(rbd_dev->rbd_client, rbd_dev->parent_spec, NULL); if (!parent) { @@ -5153,7 +5166,7 @@ static int rbd_dev_probe_parent(struct rbd_device *rbd_dev) __rbd_get_client(rbd_dev->rbd_client); rbd_spec_get(rbd_dev->parent_spec); - ret = rbd_dev_image_probe(parent, false); + ret = rbd_dev_image_probe(parent, depth); if (ret < 0) goto out_err; @@ -5282,7 +5295,7 @@ static void rbd_dev_image_release(struct rbd_device *rbd_dev) * parent), initiate a watch on its header object before using that * object to get detailed information about the rbd image. */ -static int rbd_dev_image_probe(struct rbd_device *rbd_dev, bool mapping) +static int rbd_dev_image_probe(struct rbd_device *rbd_dev, int depth) { int ret; @@ -5300,7 +5313,7 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, bool mapping) if (ret) goto err_out_format; - if (mapping) { + if (!depth) { ret = rbd_dev_header_watch_sync(rbd_dev); if (ret) { if (ret == -ENOENT) @@ -5321,7 +5334,7 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, bool mapping) * Otherwise this is a parent image, identified by pool, image * and snap ids - need to fill in names for those ids. */ - if (mapping) + if (!depth) ret = rbd_spec_fill_snap_id(rbd_dev); else ret = rbd_spec_fill_names(rbd_dev); @@ -5343,12 +5356,12 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, bool mapping) * Need to warn users if this image is the one being * mapped and has a parent. 
*/ - if (mapping && rbd_dev->parent_spec) + if (!depth && rbd_dev->parent_spec) rbd_warn(rbd_dev, "WARNING: kernel layering is EXPERIMENTAL!"); } - ret = rbd_dev_probe_parent(rbd_dev); + ret = rbd_dev_probe_parent(rbd_dev, depth); if (ret) goto err_out_probe; @@ -5359,7 +5372,7 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, bool mapping) err_out_probe:
[PATCH] rbd: set max_sectors explicitly
Commit 30e2bc08b2bb ("Revert "block: remove artifical max_hw_sectors cap"") restored a clamp on max_sectors. It's now 2560 sectors instead of 1024, but it's not good enough: we set max_hw_sectors to rbd object size because we don't want object sized I/Os to be split, and the default object size is 4M.

So, set max_sectors to max_hw_sectors in rbd at queue init time.

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 drivers/block/rbd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 05072464d25e..04e69b4df664 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3760,6 +3760,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
 	/* set io sizes to object size */
 	segment_size = rbd_obj_bytes(&rbd_dev->header);
 	blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);
+	q->limits.max_sectors = queue_max_hw_sectors(q);
 	blk_queue_max_segments(q, segment_size / SECTOR_SIZE);
 	blk_queue_max_segment_size(q, segment_size);
 	blk_queue_io_min(q, segment_size);
--
2.4.3
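To see why the 2560-sector clamp is a problem for rbd, it helps to do the arithmetic: a default 4M object is 8192 sectors, so a full-object request would be split several ways. A small sketch (helper names are hypothetical, not kernel API):

```c
#define SECTOR_SIZE 512u

/* Number of 512-byte sectors in an I/O of the given byte length. */
static unsigned int bytes_to_sectors(unsigned int bytes)
{
	return bytes / SECTOR_SIZE;
}

/* The block layer caps each request at max_sectors; with the old
 * 2560-sector clamp an object-sized request gets split, with
 * max_sectors raised to max_hw_sectors it goes out whole. */
static unsigned int effective_sectors(unsigned int req_sectors,
				      unsigned int max_sectors)
{
	return req_sectors < max_sectors ? req_sectors : max_sectors;
}
```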
[PATCH] rbd: use writefull op for object size writes
This covers only the simplest case - an object size sized write, but it's still useful in tiering setups when EC is used for the base tier as writefull op can be proxied, saving an object promotion.

Even though updating ceph_osdc_new_request() to allow writefull should just be a matter of fixing an assert, I didn't do it because its only user is cephfs. All other sites were updated.

Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b.

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 drivers/block/rbd.c   |  9 +++++++--
 net/ceph/osd_client.c | 13 +++++++++----
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 04e69b4df664..cd00e4653e49 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1863,9 +1863,11 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req,
 		rbd_osd_read_callback(obj_request);
 		break;
 	case CEPH_OSD_OP_SETALLOCHINT:
-		rbd_assert(osd_req->r_ops[1].op == CEPH_OSD_OP_WRITE);
+		rbd_assert(osd_req->r_ops[1].op == CEPH_OSD_OP_WRITE ||
+			   osd_req->r_ops[1].op == CEPH_OSD_OP_WRITEFULL);
 		/* fall through */
 	case CEPH_OSD_OP_WRITE:
+	case CEPH_OSD_OP_WRITEFULL:
 		rbd_osd_write_callback(obj_request);
 		break;
 	case CEPH_OSD_OP_STAT:
@@ -2401,7 +2403,10 @@ static void rbd_img_obj_request_fill(struct rbd_obj_request *obj_request,
 			opcode = CEPH_OSD_OP_ZERO;
 		}
 	} else if (op_type == OBJ_OP_WRITE) {
-		opcode = CEPH_OSD_OP_WRITE;
+		if (!offset && length == object_size)
+			opcode = CEPH_OSD_OP_WRITEFULL;
+		else
+			opcode = CEPH_OSD_OP_WRITE;
 		osd_req_op_alloc_hint_init(osd_request, num_ops,
 					object_size, object_size);
 		num_ops++;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 80b94e37c94a..f79ccac6699f 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -285,6 +285,7 @@ static void osd_req_op_data_release(struct ceph_osd_request *osd_req,
 	switch (op->op) {
 	case CEPH_OSD_OP_READ:
 	case CEPH_OSD_OP_WRITE:
+	case CEPH_OSD_OP_WRITEFULL:
 		ceph_osd_data_release(&op->extent.osd_data);
 		break;
 	case CEPH_OSD_OP_CALL:
@@ -485,13 +486,14 @@ void osd_req_op_extent_init(struct ceph_osd_request *osd_req,
 	size_t payload_len = 0;

 	BUG_ON(opcode != CEPH_OSD_OP_READ && opcode != CEPH_OSD_OP_WRITE &&
-	       opcode != CEPH_OSD_OP_ZERO && opcode != CEPH_OSD_OP_TRUNCATE);
+	       opcode != CEPH_OSD_OP_WRITEFULL && opcode != CEPH_OSD_OP_ZERO &&
+	       opcode != CEPH_OSD_OP_TRUNCATE);

 	op->extent.offset = offset;
 	op->extent.length = length;
 	op->extent.truncate_size = truncate_size;
 	op->extent.truncate_seq = truncate_seq;
-	if (opcode == CEPH_OSD_OP_WRITE)
+	if (opcode == CEPH_OSD_OP_WRITE || opcode == CEPH_OSD_OP_WRITEFULL)
 		payload_len += length;

 	op->payload_len = payload_len;
@@ -670,9 +672,11 @@ static u64 osd_req_encode_op(struct ceph_osd_request *req,
 		break;
 	case CEPH_OSD_OP_READ:
 	case CEPH_OSD_OP_WRITE:
+	case CEPH_OSD_OP_WRITEFULL:
 	case CEPH_OSD_OP_ZERO:
 	case CEPH_OSD_OP_TRUNCATE:
-		if (src->op == CEPH_OSD_OP_WRITE)
+		if (src->op == CEPH_OSD_OP_WRITE ||
+		    src->op == CEPH_OSD_OP_WRITEFULL)
 			request_data_len = src->extent.length;
 		dst->extent.offset = cpu_to_le64(src->extent.offset);
 		dst->extent.length = cpu_to_le64(src->extent.length);
@@ -681,7 +685,8 @@ static u64 osd_req_encode_op(struct ceph_osd_request *req,
 		dst->extent.truncate_seq =
 			cpu_to_le32(src->extent.truncate_seq);
 		osd_data = &src->extent.osd_data;
-		if (src->op == CEPH_OSD_OP_WRITE)
+		if (src->op == CEPH_OSD_OP_WRITE ||
+		    src->op == CEPH_OSD_OP_WRITEFULL)
 			ceph_osdc_msg_data_add(req->r_request, osd_data);
 		else
 			ceph_osdc_msg_data_add(req->r_reply, osd_data);
--
2.4.3
Re: [PATCH] rbd: use writefull op for object size writes
On Wed, Oct 7, 2015 at 5:36 PM, Alex Elder <el...@ieee.org> wrote: > On 10/07/2015 12:02 PM, Ilya Dryomov wrote: >> This covers only the simplest case - an object size sized write, but >> it's still useful in tiering setups when EC is used for the base tier >> as writefull op can be proxied, saving an object promotion. >> >> Even though updating ceph_osdc_new_request() to allow writefull should >> just be a matter of fixing an assert, I didn't do it because its only >> user is cephfs. All other sites were updated. >> >> Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b. > > I haven't looked at this at all. But can you give me a > short explanation of what "writefull" is? > > Full object write? Well, in a way. It replaces previous data, you can think of it as an atomic truncate to 0 + write from offset 0. So it always writes (replaces) entire objects. Thanks, Ilya
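The selection rule described above - use writefull only when the write replaces an entire object - can be stated as a standalone helper. A sketch (the helper and the enum values are hypothetical; they stand in for `CEPH_OSD_OP_WRITE`/`CEPH_OSD_OP_WRITEFULL`):

```c
enum write_op {
	OP_WRITE = 1,		/* extent write, offset + length */
	OP_WRITEFULL = 2,	/* atomic replace of the whole object */
};

/* A write qualifies for WRITEFULL only if it starts at offset 0 and
 * covers exactly one object; anything else must stay a plain extent
 * WRITE, since WRITEFULL discards whatever data the object held. */
static enum write_op pick_write_op(unsigned long long offset,
				   unsigned long long length,
				   unsigned long long object_size)
{
	if (!offset && length == object_size)
		return OP_WRITEFULL;
	return OP_WRITE;
}
```

This is also why the optimization matters for EC base tiers: a full-object replace carries no read-modify-write dependency on the old contents, so the op can be proxied instead of forcing a promotion.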
Re: [PATCH] ceph:Remove unused goto labels in decode crush map functions
On Fri, Oct 2, 2015 at 9:48 PM, Nicholas Krause wrote: > This removes unused goto labels in decode crush map functions related > to error paths due to them never being used on any error path for these > particular functions in the file, osdmap.c. > > Signed-off-by: Nicholas Krause > --- > net/ceph/osdmap.c | 10 -- > 1 file changed, 10 deletions(-) > > diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c > index 7d8f581..2f8e41c 100644 > --- a/net/ceph/osdmap.c > +++ b/net/ceph/osdmap.c > @@ -59,8 +59,6 @@ static int crush_decode_uniform_bucket(void **p, void *end, > ceph_decode_need(p, end, (1+b->h.size) * sizeof(u32), bad); ^^^ > b->item_weight = ceph_decode_32(p); > return 0; > -bad: ^^^ > - return -EINVAL; > } I realize that these macros are sneaky, but you should at least compile-test your patches before you send them out. Thanks, Ilya
Re: [CEPH-DEVEL] [ceph-users] occasional failure to unmap rbd
On Sat, Sep 26, 2015 at 5:54 AM, Shinobu Kinjo wrote: > I think it's more helpful to put returned value in: > > # ./src/krbd.cc > 530 cerr << "rbd: sysfs write failed" << std::endl; > > like: > > 530 cerr << "rbd: sysfs write failed (" << r << ")" << std::endl; > > So that we exactly know what **write** complains about. > Because **write** has some return values in case of error. > > What do you think? It's already doing that: rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy sysfs_write_rbd_remove() return value is propagated up and reported. The code is written in such a way that return values are preserved and never overwritten (hopefully!). Thanks, Ilya
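The "return values are preserved and never overwritten" convention mentioned above can be illustrated with a tiny helper (hypothetical, not the actual krbd code): the primary operation's errno always wins over any later cleanup failure.

```c
/* Sketch of first-error-wins propagation: both arguments are 0 or a
 * negative errno. The primary operation's error is never clobbered
 * by a failure in subsequent cleanup. */
static int do_unmap(int sysfs_ret, int cleanup_ret)
{
	int r = sysfs_ret;		/* primary: sysfs remove write */

	if (cleanup_ret && !r)		/* record cleanup errors only if */
		r = cleanup_ret;	/* nothing failed earlier */
	return r;
}
```

With this shape, the `(16) Device or resource busy` that `rbd unmap` prints is guaranteed to be the errno the kernel's sysfs write actually returned.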
Re: how to sepcify the point a rbd mapped?
On Fri, Sep 25, 2015 at 6:25 AM, Jaze Lee wrote: > Hello, > I know we can map a rbd image to a block device into kernel by > 'rbd map' command. > But we can not specify which block device. For example, if i have > a rbd named rbd_0, > i want it mapped into /dev/rbd_0 when it mapped. > > Is there some way to do that? No, it's going to pick the lowest available ID. The ID is picked atomically by the kernel driver, you have no control over it. If you want stable names, you should use udev symlinks, which look like /dev/rbd/<pool>/<image>[@<snap>] Thanks, Ilya
Re: RBD mirroring CLI proposal ...
On Wed, Sep 23, 2015 at 4:08 PM, Jason Dillaman wrote: >> > In this case the commands look a little confusing to me, as from their >> > names I would rather think they enable/disable mirror for existent >> > images too. Also, I don't see a command to check what current >> > behaviour is. And, I suppose it would be useful if we could configure >> > other default features for a pool (exclusive-lock, object-map, ...) >> > Also, I am not sure we should specify <pool> this way, as it is >> > not consistent with other rbd commands. By default rbd operates on >> > 'rbd' pool, which can be changed by --pool option. So what do you >> > think if we have something similar to 'rbd feature' commands? >> > >> > rbd [--pool <pool>] default-feature enable <feature> >> > rbd [--pool <pool>] default-feature disable <feature> >> > rbd [--pool <pool>] default-feature show [<feature>] >> > >> > (If <feature> is not specified in the last command, all features are >> > shown). >> >> I haven't read the discussion in full, so feel free to ignore, but I'd >> much rather have rbd create create images with the same set of features >> enabled no matter which pool it's pointed to. >> >> I'm not clear on what exactly a pool policy is. If it's just a set of >> feature bits to enable at create time, I don't think it's worth >> introducing at all. If it's a set of feature bits + a set of mirroring >> related options, I think it should just be a set of those options. >> Then, rbd create --enable-mirroring could be a way to create an image >> with mirroring enabled. >> >> My point is a plain old "rbd create foo" shouldn't depend on any new >> pool-level metadata. It's not that hard to remember which features you >> want for which pools and rbd create shortcuts like --enable-object-map >> and --enable-mirroring would hide feature bit dependencies and save >> typing. --enable-mirroring would also serve as a ticket to go look at >> new metadata and pull out any mirroring related defaults.
>> > > While I agree that I don't necessarily want to go down the road of specifying > per-pool default features, I think there is a definite use-case need to be > able to enable mirroring on a per-pool basis by default. Without such an > ability, taking OpenStack for example, it would require changes to Glance, > Cinder, and Nova in order to support DR configurations -- or you could get it > automatically with a little pre-configuration. So a pool policy is just a set of feature bits? I think Cinder at least creates images with rbd_default_features from ceph.conf and adds in layering if it's not set, meaning there is no interface for passing through feature bits (or anything else really, things like striping options, etc). What pool-level default feature bits infrastructure would do is replace a big (cluster-level) hammer with a smaller (pool-level) hammer. You'd have to add librbd APIs for it and someone eventually will try to follow suit and add defaults for other settings. You said you weren't attempting to create a mechanism to specify arbitrary default features for a given pool, but I think it will come up in the future if we introduce this - it's only logical. What we might want to do instead is use this mirroring milestone to add support for a proper key-value interface for passing in features and other settings for individual rbd images to OpenStack. I assume it's all python dicts with OpenStack, so it shouldn't be hard? I know that getting patches into OpenStack can be frustrating at times and I might be underestimating the importance of the use case you have in mind, but patching our OpenStack drivers rather than adding what essentially is a workaround to librbd makes a lot more sense to me. Thanks, Ilya
Re: RBD mirroring CLI proposal ...
On Wed, Sep 23, 2015 at 9:28 PM, Jason Dillaman wrote: >> So a pool policy is just a set of feature bits? > > It would have to store additional details as well. > >> I think Cinder at least creates images with rbd_default_features from >> ceph.conf and adds in layering if it's not set, meaning there is no >> interface for passing through feature bits (or anything else really, >> things like striping options, etc). What pool-level default feature >> bits infrastructure would do is replace a big (cluster-level) hammer >> with a smaller (pool-level) hammer. You'd have to add librbd APIs for >> it and someone eventually will try to follow suit and add defaults for >> other settings. You said you weren't attempting to create a mechanism >> to specify arbitrary default features for a given pool, but I think it >> will come up in the future if we introduce this - it's only logical. >> >> What we might want to do instead is use this mirroring milestone to add >> support for a proper key-value interface for passing in features and >> other settings for individual rbd images to OpenStack. I assume it's >> all python dicts with OpenStack, so it shouldn't be hard? I know that >> getting patches into OpenStack can be frustrating at times and I might >> be underestimating the importance of the use case you have in mind, but >> patching our OpenStack drivers rather than adding what essentially is >> a workaround to librbd makes a lot more sense to me. >> > > It would be less work to skip adding the pool-level defaults which is a plus > given everything else required. However, putting aside how long it would > take for the required changes to trickle down from OpenStack, Qemu, etc > (since I agree that shouldn't drive design), in some ways your proposal could > be seen as blurring the configuration encapsulation between clients and Ceph.
> > Is the goal to configure my storage policies in one place or should I have to > update all my client configuration settings (not that big of a deal if you > are using something like Puppet to push down consistent configs across your > servers)? Trying to think like an end-user, I think I would prefer > configuring it once within the storage system itself. I am not familiar with > any other storage systems that configure mirroring via OpenStack config > files, but I could be wrong since there are a lot of volume drivers now. I'm not very familiar with OpenStack so I don't know either, I'm just pointing out that, as far as at least cinder goes, we currently use a cluster-wide default for something that is inherently a per-image property, that there is no way to change it, and that there is a way to configure only a small subset of settings. I don't see it as blurring the configuration encapsulation: if a user is creating an image from OpenStack (or any other client for that matter), they should be able to specify all the settings they want for a given image and not rely on cluster-wide or pool-wide defaults. (Maybe I'm too fixed on this idea that per-image properties should be per-image and you are trying to think bigger. What I'm ranting about here is status quo, mirroring and the new use cases and configuration challenges it brings along are somewhat off to the side.) I'm not against pool-level defaults per se, I just think if we go down this road it's going to be hard to draw a line in the future, and I want to make sure we are not adding it just to work around deficiencies in our OpenStack drivers (and possibly librbd create-like APIs). Thanks, Ilya
Re: RBD mirroring CLI proposal ...
On Wed, Sep 23, 2015 at 9:33 AM, Mykola Golub wrote: > On Tue, Sep 22, 2015 at 01:32:49PM -0400, Jason Dillaman wrote: > >> > > * rbd mirror pool enable >> > > This will, by default, ensure that all images created in this >> > > pool have exclusive lock, journaling, and mirroring feature bits >> > > enabled. >> > > >> > > * rbd mirror pool disable >> > > This will clear the default image features for new images in this >> > > pool. >> > >> > Will 'rbd mirror pool enable|disable' change behaviour only for newly >> > created images in the pool or will enable|disable mirroring for >> > existent images too? >> >> Since the goal is to set default pool behavior, it would only apply >> to newly created images. You can enable/disable on specific images >> using the 'rbd mirror image enable/disable' commands. > > In this case the commands look a little confusing to me, as from their > names I would rather think they enable/disable mirror for existent > images too. Also, I don't see a command to check what current > behaviour is. And, I suppose it would be useful if we could configure > other default features for a pool (exclusive-lock, object-map, ...) > Also, I am not sure we should specify <pool> this way, as it is > not consistent with other rbd commands. By default rbd operates on > 'rbd' pool, which can be changed by --pool option. So what do you > think if we have something similar to 'rbd feature' commands? > > rbd [--pool <pool>] default-feature enable <feature> > rbd [--pool <pool>] default-feature disable <feature> > rbd [--pool <pool>] default-feature show [<feature>] > > (If <feature> is not specified in the last command, all features are > shown). I haven't read the discussion in full, so feel free to ignore, but I'd much rather have rbd create create images with the same set of features enabled no matter which pool it's pointed to. I'm not clear on what exactly a pool policy is. If it's just a set of feature bits to enable at create time, I don't think it's worth introducing at all.
If it's a set of feature bits + a set of mirroring related options, I think it should just be a set of those options. Then, rbd create --enable-mirroring could be a way to create an image with mirroring enabled. My point is a plain old "rbd create foo" shouldn't depend on any new pool-level metadata. It's not that hard to remember which features you want for which pools and rbd create shortcuts like --enable-object-map and --enable-mirroring would hide feature bit dependencies and save typing. --enable-mirroring would also serve as a ticket to go look at new metadata and pull out any mirroring related defaults. Thanks, Ilya
Re: partprobe or partx or ... ?
On Sat, Sep 19, 2015 at 11:08 PM, Loic Dachary wrote: > > > On 19/09/2015 17:23, Loic Dachary wrote: >> Hi Ilya, >> >> At present ceph-disk uses partprobe to ensure the kernel is aware of the >> latest partition changes after a new one is created, or after zapping the >> partition table. Although it works reliably (in the sense that the kernel is >> indeed aware of the desired partition layout), it goes as far as to remove >> all partition devices of the current kernel table, only to re-add them with >> the new partition table. The delay it implies is not an issue because >> ceph-disk is rarely called. It however generates many udev events (dozens of >> remove/change/add for a two partition disk) and almost always creates border >> cases that are difficult to figure out and debug. While it is a good way to >> ensure that ceph-disk is idempotent and immune to race conditions, maybe it >> is needlessly hard. >> >> Do you know of a light weight alternative to partprobe ? In the past we've >> used partx but I remember it failed to address some border cases in >> non-intuitive ways. Do you know of another, simpler, approach to this ? >> >> Thanks in advance for your help :-) >> > > For the record, using /sys/block/sdX/device/rescan sounds good but does not > exist for devices created via devicemapper (used for dmcrypt and multipath). Hi Loic, Yeah, partprobe loops through the entire partition table, trying to delete/add every slot. As an aside, the in-kernel way to do this (blockdev --rereadpt) is similar in that it also drops all partitions and re-adds them later, but it's faster and probably generates fewer change events. The downside is it won't work on a busy device. I don't think there is any alternative, except for using partx --add with --nr, that is, targeting specific slots in the partition table. If all you are doing is adding partitions and zapping entire partition tables, that may work well enough.
That said, given that the resulting delay (which can be in the seconds range, especially if your disk happens to have a busy partition) isn't a problem, what difference does it make? What are you listening to those events for? /sys/block/sdX/device/rescan is sd only, and AFAIK it doesn't generally trigger a re-read of a partition table. Thanks, Ilya
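For reference, the in-kernel re-read mentioned above (what `blockdev --rereadpt` issues) is a single `BLKRRPART` ioctl. A minimal sketch of issuing it from C (error handling simplified; it fails with EBUSY if any partition on the device is in use):

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* BLKRRPART */

/* Ask the kernel to drop and re-scan the partition table of a block
 * device, like "blockdev --rereadpt /dev/sdX". Returns 0 on success,
 * -1 with errno set otherwise (EBUSY if a partition is mounted). */
static int reread_partition_table(const char *dev)
{
	int fd = open(dev, O_RDONLY);
	int ret;

	if (fd < 0)
		return -1;
	ret = ioctl(fd, BLKRRPART);
	close(fd);
	return ret;
}
```

By contrast, `partx --add --nr N /dev/sdX` uses the finer-grained `BLKPG` partition ioctls to touch only slot N, which is why it generates far fewer udev events.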
Re: [Ceph-community] Getting WARN in __kick_osd_requests doing stress testing
On Fri, Sep 18, 2015 at 9:48 AM, Abhishek L wrote: > Redirecting to ceph-devel, where such a question might have a better > chance of a reply. > > On Fri, Sep 18, 2015 at 4:03 AM, wrote: >> I'm running in a 3-node cluster and doing osd/rbd creation and deletion, and >> ran across this WARN. >> Note, it only happened once (on one rbd add) after approximately 500 cycles >> of the test, but was wondering if >> someone can explain to me why this warning would be happening, and how I can >> prevent it. >> >> Here is what my test script is doing: >> >> while(1): >> create 5 ceph pools - sleep 2 between each pool create >> sleep 5 >> create 5 ceph volumes - sleep 2 between each pool create >> sleep 5 >> delete 5 ceph volumes - sleep 2 between each pool create >> sleep 5 >> delete 5 ceph pools - sleep 2 between each pool create >> sleep 5 >> >> 333940 Sep 17 00:31:54 10.0.41.9 [18372.272771] Call Trace: >> 333941 Sep 17 00:31:54 10.0.41.9 [18372.273489] [] dump_stack+0x45/0x57 >> 333942 Sep 17 00:31:54 10.0.41.9 [18372.274226] [] warn_slowpath_common+0x97/0xe0 >> 333943 Sep 17 00:31:54 10.0.41.9 [18372.274923] [] warn_slowpath_null+0x1a/0x20 >> 333944 Sep 17 00:31:54 10.0.41.9 [18372.275635] [] __kick_osd_requests+0x1dc/0x240 [libceph] >> 333945 Sep 17 00:31:54 10.0.41.9 [18372.276305] [] osd_reset+0x57/0xa0 [libceph] >> 333946 Sep 17 00:31:54 10.0.41.9 [18372.276962] [] con_work+0x112/0x290 [libceph] >> 333947 Sep 17 00:31:54 10.0.41.9 [18372.277608] [] process_one_work+0x144/0x470 >> 333948 Sep 17 00:31:54 10.0.41.9 [18372.278247] [] worker_thread+0x11e/0x450 >> 333949 Sep 17 00:31:54 10.0.41.9 [18372.278880] [] ? create_worker+0x1f0/0x1f0 >> 333950 Sep 17 00:31:54 10.0.41.9 [18372.279543] [] kthread+0xc9/0xe0 >> 333951 Sep 17 00:31:54 10.0.41.9 [18372.280174] [] ? flush_kthread_worker+0x90/0x90 >> 333952 Sep 17 00:31:54 10.0.41.9 [18372.280803] [] ret_from_fork+0x58/0x90 >> 333953 Sep 17 00:31:54 10.0.41.9 [18372.281430] [] ? flush_kthread_worker+0x90/0x90 >>
>> static void __kick_osd_requests(struct ceph_osd_client *osdc,
>>                                 struct ceph_osd *osd)
>> {
>>     :
>>     list_for_each_entry_safe(req, nreq, &osd->o_linger_requests,
>>                              r_linger_osd_item) {
>>         WARN_ON(!list_empty(&req->r_req_lru_item));
>>         __kick_linger_request(req);
>>     }
>>     :
>> }
What is your kernel version? There is no mention of rbd map/unmap in the pseudo code you provided. How are you mapping/unmapping those rbd images? More details or the script itself would be nice to see. Thanks, Ilya
Re: [PATCH] libceph: advertise support for keepalive2
On Wed, Sep 16, 2015 at 9:28 AM, Yan, Zheng <uker...@gmail.com> wrote: > On Mon, Sep 14, 2015 at 9:51 PM, Ilya Dryomov <idryo...@gmail.com> wrote: >> We are the client, but advertise keepalive2 anyway - for consistency, >> if nothing else. In the future the server might want to know whether >> its clients support keepalive2. > > the kernel code still does not fully support KEEPALIVE2 (it does not > recognize CEPH_MSGR_TAG_KEEPALIVE2 tag). I think it's better to not > advertise support for keepalive2. I guess by "does not recognize" you mean the kernel client knows how to write TAG_KEEPALIVE2, but not how to read it? The same is true about TAG_KEEPALIVE tag and the reverse (we can read, but can't write) is true about TAG_KEEPALIVE2_ACK. What I'm getting at is the kernel client is the client, and so it doesn't need to know how to read keepalive bytes or write keepalive acks. The server however might want to know if its clients can send keepalive2 bytes and handle keepalive2 acks. Does this make sense? Thanks, Ilya
[PATCH] libceph: advertise support for keepalive2
We are the client, but advertise keepalive2 anyway - for consistency, if nothing else. In the future the server might want to know whether its clients support keepalive2.

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 include/linux/ceph/ceph_features.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h
index 4763ad64e832..f89b31d45cc8 100644
--- a/include/linux/ceph/ceph_features.h
+++ b/include/linux/ceph/ceph_features.h
@@ -107,6 +107,7 @@ static inline u64 ceph_sanitize_features(u64 features)
 	 CEPH_FEATURE_OSDMAP_ENC |		\
 	 CEPH_FEATURE_CRUSH_TUNABLES3 |		\
 	 CEPH_FEATURE_OSD_PRIMARY_AFFINITY |	\
+	 CEPH_FEATURE_MSGR_KEEPALIVE2 |		\
 	 CEPH_FEATURE_CRUSH_V4)

 #define CEPH_FEATURES_REQUIRED_DEFAULT	\
--
1.9.3
[PATCH] libceph: don't access invalid memory in keepalive2 path
This

    struct ceph_timespec ceph_ts;
    ...
    con_out_kvec_add(con, sizeof(ceph_ts), &ceph_ts);

wraps ceph_ts into a kvec and adds it to con->out_kvec array, yet ceph_ts becomes invalid on return from prepare_write_keepalive(). As a result, we send out bogus keepalive2 stamps. Fix this by encoding into a ceph_timespec member, similar to how acks are read and written.

Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
---
 include/linux/ceph/messenger.h | 4 +++-
 net/ceph/messenger.c           | 9 +++++----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 7e1252e97a30..b2371d9b51fa 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -238,6 +238,8 @@ struct ceph_connection {
 	bool out_kvec_is_msg; /* kvec refers to out_msg */
 	int out_more;        /* there is more data after the kvecs */
 	__le64 out_temp_ack; /* for writing an ack */
+	struct ceph_timespec out_temp_keepalive2; /* for writing keepalive2
+						     stamp */

 	/* message in temps */
 	struct ceph_msg_header in_hdr;
@@ -248,7 +250,7 @@ struct ceph_connection {
 	int in_base_pos;     /* bytes read */
 	__le64 in_temp_ack;  /* for reading an ack */

-	struct timespec last_keepalive_ack;
+	struct timespec last_keepalive_ack; /* keepalive2 ack stamp */

 	struct delayed_work work;	    /* send|recv work */
 	unsigned long delay;	    /* current delay interval */
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 525f454f7531..b9b0e3b5da49 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1353,11 +1353,12 @@ static void prepare_write_keepalive(struct ceph_connection *con)
 	dout("prepare_write_keepalive %p\n", con);
 	con_out_kvec_reset(con);
 	if (con->peer_features & CEPH_FEATURE_MSGR_KEEPALIVE2) {
-		struct timespec ts = CURRENT_TIME;
-		struct ceph_timespec ceph_ts;
+		struct timespec now = CURRENT_TIME;
+
-		ceph_encode_timespec(&ceph_ts, &ts);
 		con_out_kvec_add(con, sizeof(tag_keepalive2), &tag_keepalive2);
-		con_out_kvec_add(con, sizeof(ceph_ts), &ceph_ts);
+		ceph_encode_timespec(&con->out_temp_keepalive2, &now);
+		con_out_kvec_add(con, sizeof(con->out_temp_keepalive2),
+				 &con->out_temp_keepalive2);
 	} else {
 		con_out_kvec_add(con, sizeof(tag_keepalive), &tag_keepalive);
 	}
--
1.9.3
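The underlying bug pattern is plain C lifetime misuse: a scatter-gather vector that points into a stack frame which is gone by the time the vector is consumed. A simplified userspace sketch (`struct iovec` standing in for the kernel kvec; all names hypothetical):

```c
#include <string.h>
#include <sys/uio.h>

static struct iovec out_kvec[4];
static int out_kvec_cnt;

/* BUGGY: stamp lives on this function's stack, but the iovec is only
 * consumed later (by the connection worker in the real code) -- by
 * then the pointed-to bytes are gone. */
static void prepare_buggy(void)
{
	char stamp[16] = "stack-lifetime";

	out_kvec[out_kvec_cnt].iov_base = stamp;	/* dangling on return */
	out_kvec[out_kvec_cnt].iov_len = sizeof(stamp);
	out_kvec_cnt++;
}

/* FIXED: point the iovec at storage that outlives the preparing
 * function, as the patch does with con->out_temp_keepalive2. */
static char temp_stamp[16];

static void prepare_fixed(const char *now)
{
	strncpy(temp_stamp, now, sizeof(temp_stamp) - 1);
	out_kvec[out_kvec_cnt].iov_base = temp_stamp;
	out_kvec[out_kvec_cnt].iov_len = sizeof(temp_stamp);
	out_kvec_cnt++;
}
```

This is also why `out_temp_ack` already lived in `struct ceph_connection`: anything queued by reference must share the connection's lifetime.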
Re: [PATCH 33/39] rbd: drop null test before destroy functions
On Sun, Sep 13, 2015 at 3:15 PM, Julia Lawall wrote: > Remove unneeded NULL test. > > The semantic patch that makes this change is as follows: > (http://coccinelle.lip6.fr/) > > // > @@ expression x; @@ > -if (x != NULL) { > \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x); > x = NULL; > -} > // > > Signed-off-by: Julia Lawall > > --- > drivers/block/rbd.c |6 ++ > 1 file changed, 2 insertions(+), 4 deletions(-) > > diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c > index d93a037..0507246 100644 > --- a/drivers/block/rbd.c > +++ b/drivers/block/rbd.c > @@ -5645,10 +5645,8 @@ static int rbd_slab_init(void) > if (rbd_segment_name_cache) > return 0; > out_err: > - if (rbd_obj_request_cache) { > - kmem_cache_destroy(rbd_obj_request_cache); > - rbd_obj_request_cache = NULL; > - } > + kmem_cache_destroy(rbd_obj_request_cache); > + rbd_obj_request_cache = NULL; > > kmem_cache_destroy(rbd_img_request_cache); > rbd_img_request_cache = NULL; Applied. Thanks, Ilya
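The rationale for the cleanup above is the same as for `free()`: the destroy functions accept NULL and do nothing, so the guard is pure noise. A userspace analogue (the `release()` helper is hypothetical):

```c
#include <stdlib.h>

/* Like kmem_cache_destroy()/mempool_destroy(), free(NULL) is defined
 * to be a no-op, so no NULL check is needed before calling it. */
static void release(void **p)
{
	free(*p);	/* safe even when *p == NULL */
	*p = NULL;	/* keep the pointer from dangling */
}
```

Calling it twice in a row is harmless, which is exactly the property the error path in `rbd_slab_init()` relies on.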
size_t and related types on mn10300
On Thu, Sep 10, 2015 at 10:57 AM, kbuild test robot wrote: > tree: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git > master > head: 22dc312d56ba077db27a9798b340e7d161f1df05 > commit: 5f1c79a71766ba656762636936edf708089bdb14 [12335/12685] libceph: check > data_len in ->alloc_msg() > config: mn10300-allmodconfig (attached as .config) > reproduce: > wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross > chmod +x ~/bin/make.cross > git checkout 5f1c79a71766ba656762636936edf708089bdb14 > # save the attached .config to linux build tree > make.cross ARCH=mn10300 > > All warnings (new ones prefixed by >>): > > net/ceph/osd_client.c: In function 'get_reply': >>> net/ceph/osd_client.c:2865:3: warning: format '%zu' expects argument of >>> type 'size_t', but argument 6 has type 'unsigned int' [-Wformat=] > pr_warn("%s osd%d tid %llu data %d > preallocated %zu, skipping\n", > ^ > > vim +2865 net/ceph/osd_client.c > > 2849 req->r_reply, req->r_reply->con); > 2850 ceph_msg_revoke_incoming(req->r_reply); > 2851 > 2852 if (front_len > req->r_reply->front_alloc_len) { > 2853 pr_warn("%s osd%d tid %llu front %d > preallocated %d\n", > 2854 __func__, osd->o_osd, req->r_tid, front_len, > 2855 req->r_reply->front_alloc_len); > 2856 m = ceph_msg_new(CEPH_MSG_OSD_OPREPLY, front_len, GFP_NOFS, > 2857 false); > 2858 if (!m) > 2859 goto out; > 2860 ceph_msg_put(req->r_reply); > 2861 req->r_reply = m; > 2862 } > 2863 > 2864 if (data_len > req->r_reply->data_length) { >> 2865 pr_warn("%s osd%d tid %llu data %d > preallocated %zu, skipping\n", > 2866 __func__, osd->o_osd, req->r_tid, data_len, > 2867 req->r_reply->data_length); > 2868 m = NULL; > 2869 *skip = 1; > 2870 goto out; > 2871 } > 2872 > 2873 m = ceph_msg_get(req->r_reply); req->r_reply->data_length is size_t, formatted with %zu, as it should be.
I think this is a problem with either mn10300 itself or the build toolchain - compiling mn10300 defconfig throws a whole bunch of similar errors all over the place. arch/mn10300/include/uapi/asm/posix_types.h 30 #if __GNUC__ == 4 31 typedef unsigned int __kernel_size_t; 32 typedef signed int __kernel_ssize_t; 33 #else 34 typedef unsigned long __kernel_size_t; 35 typedef signed long __kernel_ssize_t; 36 #endif This came from commit 3ad001c04f1a ("MN10300: Fix size_t and ssize_t") by David. Now, if that's right, it should probably be >= 4. But then $ echo | /opt/gcc-4.9.0-nolibc/am33_2.0-linux/bin/am33_2.0-linux-gcc -dM -E - | grep __SIZE_TYPE__ #define __SIZE_TYPE__ long unsigned int and sure enough gcc/config/mn10300/linux.h 87 #undef SIZE_TYPE 88 #undef PTRDIFF_TYPE 89 #undef WCHAR_TYPE 90 #undef WCHAR_TYPE_SIZE gcc/config/mn10300/mn10300.h 125 #undef SIZE_TYPE 126 #define SIZE_TYPE "unsigned int" so it looks like the "linux" variant uses gcc defaults, at least since gcc.git commit 05381348ac78 (which is dated a few months after David's commit made it into the kernel). Can someone who cares about mn10300 or at least knows what it is look at this? Thanks, Ilya
Re: [PATCH] libceph: use keepalive2 to verify the mon session is alive
On Wed, Sep 2, 2015 at 5:22 AM, Yan, Zheng wrote: > timespec_to_jiffies() does not work this way. it convert time delta in form > of timespec to time delta in form of jiffies. Ah sorry, con->last_keepalive_ack is a realtime timespec from userspace. > > I will updated the patch according to the rest comments . I still want ceph_con_keepalive_expired() to handle interval == 0 internally, so that opt->monc_ping_timeout can be passed without any checks in the caller. I also noticed you added delay = max_t(unsigned long, delay, HZ); to __schedule_delayed(). Is it really necessary? Thanks, Ilya
Re: [PATCH] libceph: use keepalive2 to verify the mon session is alive
On Wed, Sep 2, 2015 at 12:25 PM, Yan, Zheng <z...@redhat.com> wrote: > >> On Sep 2, 2015, at 17:12, Ilya Dryomov <idryo...@gmail.com> wrote: >> >> On Wed, Sep 2, 2015 at 5:22 AM, Yan, Zheng <z...@redhat.com> wrote: >>> timespec_to_jiffies() does not work this way. it convert time delta in form >>> of timespec to time delta in form of jiffies. >> >> Ah sorry, con->last_keepalive_ack is a realtime timespec from >> userspace. >> >>> >>> I will updated the patch according to the rest comments . >> >> I still want ceph_con_keepalive_expired() to handle interval == 0 >> internally, so that opt->monc_ping_timeout can be passed without any >> checks in the caller. >> >> I also noticed you added delay = max_t(unsigned long, delay, HZ); to >> __schedule_delayed(). Is it really necessary? >> > > Updated. please review. I was concerned about user facing option names and timeouts, since that's what I had to deal with recently - those parts look OK to me now. Thanks, Ilya -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] libceph: check data_len in ->alloc_msg()
Only ->alloc_msg() should check data_len of the incoming message against the preallocated ceph_msg, doing it in the messenger is not right. The contract is that either ->alloc_msg() returns a ceph_msg which will fit all of the portions of the incoming message, or it returns NULL and possibly sets skip, signaling whether NULL is due to an -ENOMEM. ->alloc_msg() should be the only place where we make the skip/no-skip decision. I stumbled upon this while looking at con/osd ref counting. Right now, if we get a non-extent message with a larger data portion than we are prepared for, ->alloc_msg() returns a ceph_msg, and then, when we skip it in the messenger, we don't put the con/osd ref acquired in ceph_con_in_msg_alloc() (which is normally put in process_message()), so this also fixes a memory leak. An existing BUG_ON in ceph_msg_data_cursor_init() ensures we don't corrupt random memory should a buggy ->alloc_msg() return an unfit ceph_msg. While at it, I changed the "unknown tid" dout() to a pr_warn() to make sure all skips are seen and unified format strings. 
Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- net/ceph/messenger.c | 7 --- net/ceph/osd_client.c | 51 ++- 2 files changed, 18 insertions(+), 40 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 36757d46ac40..525f454f7531 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -2337,13 +2337,6 @@ static int read_partial_message(struct ceph_connection *con) return ret; BUG_ON(!con->in_msg ^ skip); - if (con->in_msg && data_len > con->in_msg->data_length) { - pr_warn("%s skipping long message (%u > %zd)\n", - __func__, data_len, con->in_msg->data_length); - ceph_msg_put(con->in_msg); - con->in_msg = NULL; - skip = 1; - } if (skip) { /* skip this message */ dout("alloc_msg said skip message\n"); diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 50033677c0fa..80b94e37c94a 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -2817,8 +2817,9 @@ out: } /* - * lookup and return message for incoming reply. set up reply message - * pages. + * Lookup and return message for incoming reply. Don't try to do + * anything about a larger than preallocated data portion of the + * message at the moment - for now, just skip the message. 
*/ static struct ceph_msg *get_reply(struct ceph_connection *con, struct ceph_msg_header *hdr, @@ -2836,10 +2837,10 @@ static struct ceph_msg *get_reply(struct ceph_connection *con, mutex_lock(>request_mutex); req = __lookup_request(osdc, tid); if (!req) { - *skip = 1; + pr_warn("%s osd%d tid %llu unknown, skipping\n", + __func__, osd->o_osd, tid); m = NULL; - dout("get_reply unknown tid %llu from osd%d\n", tid, -osd->o_osd); + *skip = 1; goto out; } @@ -2849,10 +2850,9 @@ static struct ceph_msg *get_reply(struct ceph_connection *con, ceph_msg_revoke_incoming(req->r_reply); if (front_len > req->r_reply->front_alloc_len) { - pr_warn("get_reply front %d > preallocated %d (%u#%llu)\n", - front_len, req->r_reply->front_alloc_len, - (unsigned int)con->peer_name.type, - le64_to_cpu(con->peer_name.num)); + pr_warn("%s osd%d tid %llu front %d > preallocated %d\n", + __func__, osd->o_osd, req->r_tid, front_len, + req->r_reply->front_alloc_len); m = ceph_msg_new(CEPH_MSG_OSD_OPREPLY, front_len, GFP_NOFS, false); if (!m) @@ -2860,37 +2860,22 @@ static struct ceph_msg *get_reply(struct ceph_connection *con, ceph_msg_put(req->r_reply); req->r_reply = m; } - m = ceph_msg_get(req->r_reply); - - if (data_len > 0) { - struct ceph_osd_data *osd_data; - /* -* XXX This is assuming there is only one op containing -* XXX page data. Probably OK for reads, but this -* XXX ought to be done more generally. -*/ - osd_data = osd_req_op_extent_osd_data(req, 0); - if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES) { - if (osd_data->pages && - unl
Re: [PATCH] libceph: use keepalive2 to verify the mon session is alive
On Wed, Sep 2, 2015 at 12:25 PM, Yan, Zheng <z...@redhat.com> wrote: > >> On Sep 2, 2015, at 17:12, Ilya Dryomov <idryo...@gmail.com> wrote: >> >> On Wed, Sep 2, 2015 at 5:22 AM, Yan, Zheng <z...@redhat.com> wrote: >>> timespec_to_jiffies() does not work this way. it convert time delta in form >>> of timespec to time delta in form of jiffies. >> >> Ah sorry, con->last_keepalive_ack is a realtime timespec from >> userspace. >> >>> >>> I will updated the patch according to the rest comments . >> >> I still want ceph_con_keepalive_expired() to handle interval == 0 >> internally, so that opt->monc_ping_timeout can be passed without any >> checks in the caller. >> >> I also noticed you added delay = max_t(unsigned long, delay, HZ); to >> __schedule_delayed(). Is it really necessary? >> > > Updated. please review. One other thing you might want to do is to / 3 instead of >> 2 in delayed_work(). That way the tick interval in the non-hunting case is going to be 10 seconds, which matches the default for ping interval (mon_client_ping_interval) in userspace. Thanks, Ilya -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: devicemapper and udev events
On Tue, Sep 1, 2015 at 1:09 AM, Loic Dachary wrote: > Hi Ilya, > > While working on multipath, I noticed that udev add event are not triggered > when /dev/dm-0 etc. are created. Should I expect a udev add event for every > device ? Or does it not apply to devices created by / with devicemapper ? I'd expect an event - are you sure it's not triggered? Are you using udevadm monitor or is it just some udev rule not getting executed? Thanks, Ilya
Re: [PATCH] libceph: use keepalive2 to verify the mon session is alive
On Tue, Sep 1, 2015 at 5:21 PM, Yan, Zheng wrote: > Signed-off-by: Yan, Zheng > --- > include/linux/ceph/libceph.h | 2 ++ > include/linux/ceph/messenger.h | 4 +++ > include/linux/ceph/msgr.h | 4 ++- > net/ceph/ceph_common.c | 18 - > net/ceph/messenger.c | 60 > ++ > net/ceph/mon_client.c | 38 -- > 6 files changed, 111 insertions(+), 15 deletions(-) Hi Zheng, Just want to make sure you saw my "Re: " message - I had a few suggestions. This one looks like the one I commented on, straight from the testing branch. Thanks, Ilya
Re: format 2TB rbd device is too slow
On Mon, Aug 31, 2015 at 4:21 AM, Ma, Jianpengwrote: > Ilya > I modify kernel code. The patch like this: > [root@dev linux]# git diff > diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c > index bc67a93..e4c4ea9 100644 > --- a/drivers/block/rbd.c > +++ b/drivers/block/rbd.c > @@ -55,7 +55,7 @@ > * universally 512 bytes. These symbols are just slightly more > * meaningful than the bare numbers they represent. > */ > -#defineSECTOR_SHIFT9 > +#defineSECTOR_SHIFT12 > #defineSECTOR_SIZE (1ULL << SECTOR_SHIFT) > > /* > @@ -3811,6 +3811,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev) > blk_queue_io_min(q, segment_size); > blk_queue_io_opt(q, segment_size); > > + blk_queue_physical_block_size(q, SECTOR_SIZE); > /* enable the discard support */ > queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q); > q->limits.discard_granularity = segment_size; > > I don't know the exact why this can decrease the time. Well, that's the answer ;) set_capacity()'s @size is always in 512 byte sectors, and it's called as follows in rbd: set_capacity(rbd_dev->disk, rbd_dev->mapping.size / SECTOR_SIZE); By changing the SECTOR_SIZE from 512 to 4096 (i.e. by a factor of 8) you just shrink the block device by the same factor. What used to be a 2T mapping would now be a 256G mapping, and since the amount of data mkfs.xfs has to write is proportional to the size of the device, your 7 times decrease in time is easily explained by an 8 times decrease in size of the mapping. > > For the -b of mkfs.xfs, I try for 65536. But it don't change any. Right. Thanks, Ilya -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] rbd: fix double free on rbd_dev->header_name
If rbd_dev_image_probe() in rbd_dev_probe_parent() fails, header_name is freed twice: once in rbd_dev_probe_parent() and then in its caller rbd_dev_image_probe() (rbd_dev_image_probe() is called recursively to handle parent images). rbd_dev_probe_parent() is responsible for probing the parent, so it shouldn't muck with clone's fields. Signed-off-by: Ilya Dryomov <idryo...@gmail.com> --- drivers/block/rbd.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index bc67a93aa4f4..324bf35ec4dd 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -5201,7 +5201,6 @@ static int rbd_dev_probe_parent(struct rbd_device *rbd_dev) out_err: if (parent) { rbd_dev_unparent(rbd_dev); - kfree(rbd_dev->header_name); rbd_dev_destroy(parent); } else { rbd_put_client(rbdc); -- 1.9.3
Re: format 2TB rbd device is too slow
On Thu, Aug 27, 2015 at 3:43 AM, huang jun hjwsm1...@gmail.com wrote: hi,llya 2015-08-26 23:56 GMT+08:00 Ilya Dryomov idryo...@gmail.com: On Wed, Aug 26, 2015 at 6:22 PM, Haomai Wang haomaiw...@gmail.com wrote: On Wed, Aug 26, 2015 at 11:16 PM, huang jun hjwsm1...@gmail.com wrote: hi,all we create a 2TB rbd image, after map it to local, then we format it to xfs with 'mkfs.xfs /dev/rbd0', it spent 318 seconds to finish, but local physical disk with the same size just need 6 seconds. I think librbd have two PR related to this. After debug, we found there are two steps in rbd module during formating: a) send 233093 DELETE requests to osds(number_of_requests = 2TB / 4MB), this step spent almost 92 seconds. I guess this(https://github.com/ceph/ceph/pull/4221/files) may help It's submitting deletes for non-existent objects, not zeroing. The only thing that will really help here is the addition of rbd object map support to the kernel client. That could happen in 4.4, but 4.5 is a safer bet. b) send 4238 messages like this: [set-alloc-hint object_size 4194304 write_size 4194304,write 0~512] to osds, that spent 227 seconds. I think kernel rbd also need to use https://github.com/ceph/ceph/pull/4983/files set-alloc-hint may be a problem, but I think a bigger problem is the size of the write. Are all those writes 512 bytes long? In another test to format 2TB rbd device, there are : 2 messages,each write 131072 bytes 4000 messages, each write 262144 bytes 112 messages, each write 4096 bytes 194 messages, each write 512 bytes So the majority of writes is not 512 bytes long. I don't think disabling set-alloc-hint (and, as of now at least, you can't disable it anyway) would drastically change the numbers. If you are doing mkfs right after creating and mapping an image for the first time, you can add -K option to mkfs, which will tell it to not try to discard. As for the write phase, I can't suggest anything off hand. 
Thanks, Ilya
Re: format 2TB rbd device is too slow
On Fri, Aug 28, 2015 at 10:36 AM, Ma, Jianpeng jianpeng...@intel.com wrote: Hi Ilya, We can change sector size from 512 to 4096. This can reduce the count of write. I did a simple test: for 900G, mkfs.xfs -f For default: 1m10s Physical sector size = 4096: 0m10s. But if change sector size, we need rbd meta record this. What exactly do you mean by changing the sector size? xfs sector size or physical block size of the rbd device? If the latter, meaning mkfs.xfs -s size 4096, I fail to see how it can result in the above numbers. If the former, you have to patch the kernel and I fail to see how it can result in such an improvement too. I probably didn't have enough coffee, please elaborate. Thanks, Ilya
Re: format 2TB rbd device is too slow
On Wed, Aug 26, 2015 at 6:22 PM, Haomai Wang haomaiw...@gmail.com wrote: On Wed, Aug 26, 2015 at 11:16 PM, huang jun hjwsm1...@gmail.com wrote: hi,all we create a 2TB rbd image, after map it to local, then we format it to xfs with 'mkfs.xfs /dev/rbd0', it spent 318 seconds to finish, but local physical disk with the same size just need 6 seconds. I think librbd have two PR related to this. After debug, we found there are two steps in rbd module during formating: a) send 233093 DELETE requests to osds(number_of_requests = 2TB / 4MB), this step spent almost 92 seconds. I guess this(https://github.com/ceph/ceph/pull/4221/files) may help It's submitting deletes for non-existent objects, not zeroing. The only thing that will really help here is the addition of rbd object map support to the kernel client. That could happen in 4.4, but 4.5 is a safer bet. b) send 4238 messages like this: [set-alloc-hint object_size 4194304 write_size 4194304,write 0~512] to osds, that spent 227 seconds. I think kernel rbd also need to use https://github.com/ceph/ceph/pull/4983/files set-alloc-hint may be a problem, but I think a bigger problem is the size of the write. Are all those writes 512 bytes long? Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel RBD Readahead
On Tue, Aug 25, 2015 at 5:05 PM, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: Ilya Dryomov [mailto:idryo...@gmail.com] Sent: 25 August 2015 09:45 To: Nick Fisk n...@fisk.me.uk Cc: Ceph Development ceph-devel@vger.kernel.org Subject: Re: Kernel RBD Readahead On Tue, Aug 25, 2015 at 10:40 AM, Nick Fisk n...@fisk.me.uk wrote: I have done two tests one with 1MB objects and another with 4MB objects, my cluster is a little busier than when I did the quick test yesterday, so all speeds are slightly down across the board but you can see the scaling effect nicely. Results: 1MB Order RBD Readahead DD-Null Speed 128 18MB/s 102432MB/s 204840MB/s 409658MB/s 819275MB/s 16384 91MB/s 32768 160MB/s 4MB Order RBD 128 42MB/s 102456MB/s 204861MB/s 409698MB/s 8192121MB/s 16384 170MB/s 32768 195MB/s 65536 221MB/s 131072 271MB/s I think the results confirm my suspicions, where a full stripe in a raid array will usually only be a couple of MB (eg 256kb chunk * 8 disks) and so a relatively small readahead will involve all the disks for max performance. In a Ceph RBD a full stripe will be 4MB * number of OSD's in the cluster. So I think that if sequential read performance is the only goal, then readahead probably needs to equal that figure, which could be massive. But in reality like me you will probably find that you get sufficient performance at a lower value. Of course all this theory could all change when the kernel client gets striping support. However in terms of a default, that’s a tricky one. Even setting it to 4096 would probably start to have a negative impact on pure random IO latency. Each read would make a OSD read a whole 4MB object, see small table below for IOP/Read size for disks in my cluster. I would imagine somewhere between 256-1024 would be a good trade off between where the OSD disks latency starts to rise. Users would need to be aware of their workload and tweak readahead if needed. 
IOPs (4k Random Read)83 (64k Random Read) 81 (256k Random Read) 73 (1M Random Read)52 (4M Random Read)25 Yeah, we want a sensible default, but it's always going to be a trade off. librbd has readahead knobs, but the only real use case there is shortening qemu boot times, so we can't copy those settings. I'll have to think about it some more - it might make more sense to leave things as is. Users with large sequential read workloads should know to check and adjust readahead settings, and likely won't be satisfied with 1x object size anyway. Ok. I might try and create a 4.1 kernel with the blk-mq queue depth/IO size + readahead +max_segments fixes in as I'm think the TCP_NODELAY bug will still be present in my old 3.14 kernel. I can build 4.2-rc8 + readahead patch on gitbuilders for you. Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel RBD Readahead
On Tue, Aug 25, 2015 at 10:40 AM, Nick Fisk n...@fisk.me.uk wrote: I have done two tests one with 1MB objects and another with 4MB objects, my cluster is a little busier than when I did the quick test yesterday, so all speeds are slightly down across the board but you can see the scaling effect nicely. Results: 1MB Order RBD Readahead DD-Null Speed 128 18MB/s 102432MB/s 204840MB/s 409658MB/s 819275MB/s 16384 91MB/s 32768 160MB/s 4MB Order RBD 128 42MB/s 102456MB/s 204861MB/s 409698MB/s 8192121MB/s 16384 170MB/s 32768 195MB/s 65536 221MB/s 131072 271MB/s I think the results confirm my suspicions, where a full stripe in a raid array will usually only be a couple of MB (eg 256kb chunk * 8 disks) and so a relatively small readahead will involve all the disks for max performance. In a Ceph RBD a full stripe will be 4MB * number of OSD's in the cluster. So I think that if sequential read performance is the only goal, then readahead probably needs to equal that figure, which could be massive. But in reality like me you will probably find that you get sufficient performance at a lower value. Of course all this theory could all change when the kernel client gets striping support. However in terms of a default, that’s a tricky one. Even setting it to 4096 would probably start to have a negative impact on pure random IO latency. Each read would make a OSD read a whole 4MB object, see small table below for IOP/Read size for disks in my cluster. I would imagine somewhere between 256-1024 would be a good trade off between where the OSD disks latency starts to rise. Users would need to be aware of their workload and tweak readahead if needed. IOPs (4k Random Read)83 (64k Random Read) 81 (256k Random Read) 73 (1M Random Read)52 (4M Random Read)25 Yeah, we want a sensible default, but it's always going to be a trade off. librbd has readahead knobs, but the only real use case there is shortening qemu boot times, so we can't copy those settings. 
I'll have to think about it some more - it might make more sense to leave things as is. Users with large sequential read workloads should know to check and adjust readahead settings, and likely won't be satisfied with 1x object size anyway. Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel RBD Readahead
On Sun, Aug 23, 2015 at 10:23 PM, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: Ilya Dryomov [mailto:idryo...@gmail.com] Sent: 23 August 2015 18:33 To: Nick Fisk n...@fisk.me.uk Cc: Ceph Development ceph-devel@vger.kernel.org Subject: Re: Kernel RBD Readahead On Sat, Aug 22, 2015 at 11:45 PM, Nick Fisk n...@fisk.me.uk wrote: Hi Ilya, I was wondering if I could just get your thoughts on a matter I have run into? Its surrounding read performance of the RBD kernel client and blk-mq, mainly when doing large single threaded reads. During testing performance seems to be limited to around 40MB/s, which is probably fairly similar to what you would expect to get from a single OSD. This is to be expected as an RBD is just a long chain of objects each on a different OSD which is being read through in order one at a time. In theory readahead should make up for this by making the RBD client read from several OSD’s ahead of the current required block. However from what I can see it seems that setting a readahead value higher than max_sectors_kb doesn’t appear to have any effect, meaning that readahead is limited to each object that is currently being read. Would you be able to confirm if this is correct and if this is by design? [CCing ceph-devel] Certainly not by design. rbd is just a block device driver, so if the kernel submits a readahead read, it will obey and carry it out in full. The readahead is driven by the VM in pages, it doesn't care about rbd object boundaries and such. That said, one problem is in the VM subsystem, where readaheads get capped at 512 pages (= 2M). 
If you do a simple single threaded read test, you'll see 4096 sector (= 2M) I/Os instead of object size I/Os: $ rbd info foo | grep order order 24 (16384 kB objects) $ blockdev --getra /dev/rbd0 32768 $ dd if=/dev/rbd0 of=/dev/null bs=32M # avgrq-sz is 4096.00 This was introduced in commit 6d2be915e589 (mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages) [1], which went into 3.15. The hard limit was Linus' suggestion, apparently. #define MAX_READAHEAD ((512*4096)/PAGE_CACHE_SIZE) /* * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a * sensible upper limit. */ unsigned long max_sane_readahead(unsigned long nr) { return min(nr, MAX_READAHEAD); } This limit used to be dynamic and depended on the number of free pages in the system. There has been an attempt to bring that behaviour back [2], but it didn't go very far as far as getting into mainline. It looks like Red Hat and Oracle are shipping [2] in some of their kernels though. If you apply it, you'll see 32768 sector (= 16M) I/Os in the above test, which is how it should be. [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6d 2be915e589b58cb11418cbe1f22ff90732b6ac [2] http://thread.gmane.org/gmane.linux.kernel/1893680 One thing we should be doing is setting read_ahead_kb to object size, the default 128k doesn't really cut it for rbd. I'll send a patch for that. Thanks, Ilya Thanks for your response. I do see the IO's being limited to 4096 sectors in the 4.1 kernel and so that is likely to be partly the cause of the poor performance I am seeing. However I tried a 3.14 kernel and saw the same level performance, but this time the IO's were limited to 1024 sectors. The queue depth was at around 8 so I guess this means its submitting 8*512kb IO's up to the max_sectors_kb limit of 4096KB. From the OSD point of view, this will still be accessing 1 OSD at a time. 
Maybe I'm expecting the wrong results but I was expecting that either 1 of these 2 scenarios would happen. 1. Kernel submits a large enough IO to satisfy the readahead value, max_sectors_kb would need to be higher than the object size (currently not possible) and the RADOS layer would be responsible for doing the parallel reads to the OSD's to satisfy it. 2. Kernel recognises that the readahead is bigger than the max_sectors_kb value and submits several IO's in parallel to the RBD device to satisfy the readahead request. Ie 32MB readahead would submit 8x4MB IO's in parallel. Please let me know if I have got the wrong idea here. But in my head either solution should improve sequential reads by a large amount, with the 2nd possibly slightly better as you are only waiting on the 1st OSD to respond to complete the request. Thanks for including the Ceph-Devel list, unfortunately despite several attempts I have not been able to post to this list after subscribing, please can you forward any correspondence if you think it would be useful to share. Did you remember to set max_sectors_kb to max_hw_sectors_kb? The block layer in 3.14 leaves max_sectors_kb at 512, even when max_hw_sectors_kb is set to a much bigger value by the driver
Re: Kernel RBD Readahead
On Mon, Aug 24, 2015 at 5:43 PM, Ilya Dryomov idryo...@gmail.com wrote: On Sun, Aug 23, 2015 at 10:23 PM, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: Ilya Dryomov [mailto:idryo...@gmail.com] Sent: 23 August 2015 18:33 To: Nick Fisk n...@fisk.me.uk Cc: Ceph Development ceph-devel@vger.kernel.org Subject: Re: Kernel RBD Readahead On Sat, Aug 22, 2015 at 11:45 PM, Nick Fisk n...@fisk.me.uk wrote: Hi Ilya, I was wondering if I could just get your thoughts on a matter I have run into? Its surrounding read performance of the RBD kernel client and blk-mq, mainly when doing large single threaded reads. During testing performance seems to be limited to around 40MB/s, which is probably fairly similar to what you would expect to get from a single OSD. This is to be expected as an RBD is just a long chain of objects each on a different OSD which is being read through in order one at a time. In theory readahead should make up for this by making the RBD client read from several OSD’s ahead of the current required block. However from what I can see it seems that setting a readahead value higher than max_sectors_kb doesn’t appear to have any effect, meaning that readahead is limited to each object that is currently being read. Would you be able to confirm if this is correct and if this is by design? [CCing ceph-devel] Certainly not by design. rbd is just a block device driver, so if the kernel submits a readahead read, it will obey and carry it out in full. The readahead is driven by the VM in pages, it doesn't care about rbd object boundaries and such. That said, one problem is in the VM subsystem, where readaheads get capped at 512 pages (= 2M). 
If you do a simple single threaded read test, you'll see 4096 sector (= 2M) I/Os instead of object size I/Os: $ rbd info foo | grep order order 24 (16384 kB objects) $ blockdev --getra /dev/rbd0 32768 $ dd if=/dev/rbd0 of=/dev/null bs=32M # avgrq-sz is 4096.00 This was introduced in commit 6d2be915e589 (mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages) [1], which went into 3.15. The hard limit was Linus' suggestion, apparently. #define MAX_READAHEAD ((512*4096)/PAGE_CACHE_SIZE) /* * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a * sensible upper limit. */ unsigned long max_sane_readahead(unsigned long nr) { return min(nr, MAX_READAHEAD); } This limit used to be dynamic and depended on the number of free pages in the system. There has been an attempt to bring that behaviour back [2], but it didn't go very far as far as getting into mainline. It looks like Red Hat and Oracle are shipping [2] in some of their kernels though. If you apply it, you'll see 32768 sector (= 16M) I/Os in the above test, which is how it should be. [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6d 2be915e589b58cb11418cbe1f22ff90732b6ac [2] http://thread.gmane.org/gmane.linux.kernel/1893680 One thing we should be doing is setting read_ahead_kb to object size, the default 128k doesn't really cut it for rbd. I'll send a patch for that. Thanks, Ilya Thanks for your response. I do see the IO's being limited to 4096 sectors in the 4.1 kernel and so that is likely to be partly the cause of the poor performance I am seeing. However I tried a 3.14 kernel and saw the same level performance, but this time the IO's were limited to 1024 sectors. The queue depth was at around 8 so I guess this means its submitting 8*512kb IO's up to the max_sectors_kb limit of 4096KB. From the OSD point of view, this will still be accessing 1 OSD at a time. 
Maybe I'm expecting the wrong results but I was expecting that either 1 of these 2 scenarios would happen. 1. Kernel submits a large enough IO to satisfy the readahead value, max_sectors_kb would need to be higher than the object size (currently not possible) and the RADOS layer would be responsible for doing the parallel reads to the OSD's to satisfy it. 2. Kernel recognises that the readahead is bigger than the max_sectors_kb value and submits several IO's in parallel to the RBD device to satisfy the readahead request. Ie 32MB readahead would submit 8x4MB IO's in parallel. Please let me know if I have got the wrong idea here. But in my head either solution should improve sequential reads by a large amount, with the 2nd possibly slightly better as you are only waiting on the 1st OSD to respond to complete the request. Thanks for including the Ceph-Devel list, unfortunately despite several attempts I have not been able to post to this list after subscribing, please can you forward any correspondence if you think it would be useful to share. Did you remember to set max_sectors_kb to max_hw_sectors_kb? The block layer in 3.14 leaves max_sectors_kb
Re: Kernel RBD Readahead
On Mon, Aug 24, 2015 at 7:00 PM, Nick Fisk n...@fisk.me.uk wrote:

-----Original Message-----
From: Ilya Dryomov [mailto:idryo...@gmail.com]
Sent: 24 August 2015 16:07
To: Nick Fisk n...@fisk.me.uk
Cc: Ceph Development ceph-devel@vger.kernel.org
Subject: Re: Kernel RBD Readahead

On Mon, Aug 24, 2015 at 5:43 PM, Ilya Dryomov idryo...@gmail.com wrote:

On Sun, Aug 23, 2015 at 10:23 PM, Nick Fisk n...@fisk.me.uk wrote:

-----Original Message-----
From: Ilya Dryomov [mailto:idryo...@gmail.com]
Sent: 23 August 2015 18:33
To: Nick Fisk n...@fisk.me.uk
Cc: Ceph Development ceph-devel@vger.kernel.org
Subject: Re: Kernel RBD Readahead

On Sat, Aug 22, 2015 at 11:45 PM, Nick Fisk n...@fisk.me.uk wrote:

Hi Ilya,

I was wondering if I could just get your thoughts on a matter I have run into? It's surrounding read performance of the RBD kernel client and blk-mq, mainly when doing large single threaded reads.

During testing, performance seems to be limited to around 40MB/s, which is probably fairly similar to what you would expect to get from a single OSD. This is to be expected, as an RBD is just a long chain of objects, each on a different OSD, which is being read through in order one at a time.

In theory readahead should make up for this by making the RBD client read from several OSDs ahead of the currently required block. However, from what I can see it seems that setting a readahead value higher than max_sectors_kb doesn't appear to have any effect, meaning that readahead is limited to each object that is currently being read. Would you be able to confirm if this is correct and if this is by design?

[CCing ceph-devel]

Certainly not by design. rbd is just a block device driver, so if the kernel submits a readahead read, it will obey and carry it out in full. The readahead is driven by the VM in pages; it doesn't care about rbd object boundaries and such. That said, one problem is in the VM subsystem, where readaheads get capped at 512 pages (= 2M).
If you do a simple single threaded read test, you'll see 4096 sector (= 2M) I/Os instead of object size I/Os:

    $ rbd info foo | grep order
            order 24 (16384 kB objects)
    $ blockdev --getra /dev/rbd0
    32768
    $ dd if=/dev/rbd0 of=/dev/null bs=32M   # avgrq-sz is 4096.00

This was introduced in commit 6d2be915e589 (mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages) [1], which went into 3.15. The hard limit was Linus' suggestion, apparently.

    #define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)

    /*
     * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
     * sensible upper limit.
     */
    unsigned long max_sane_readahead(unsigned long nr)
    {
            return min(nr, MAX_READAHEAD);
    }

This limit used to be dynamic and depended on the number of free pages in the system. There has been an attempt to bring that behaviour back [2], but it didn't go very far as far as getting into mainline. It looks like Red Hat and Oracle are shipping [2] in some of their kernels though. If you apply it, you'll see 32768 sector (= 16M) I/Os in the above test, which is how it should be.

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6d2be915e589b58cb11418cbe1f22ff90732b6ac
[2] http://thread.gmane.org/gmane.linux.kernel/1893680

One thing we should be doing is setting read_ahead_kb to the object size - the default 128k doesn't really cut it for rbd. I'll send a patch for that.

Thanks,

                Ilya

Thanks for your response. I do see the I/Os being limited to 4096 sectors in the 4.1 kernel, so that is likely to be partly the cause of the poor performance I am seeing. However, I tried a 3.14 kernel and saw the same level of performance, but this time the I/Os were limited to 1024 sectors. The queue depth was at around 8, so I guess this means it's submitting 8 x 512 KB I/Os up to the max_sectors_kb limit of 4096 KB. From the OSD's point of view, this will still be accessing one OSD at a time.
Maybe I'm expecting the wrong results but I was expecting that either 1 of these 2 scenarios would happen. 1. Kernel submits a large enough IO to satisfy the readahead value, max_sectors_kb would need to be higher than the object size (currently not possible) and the RADOS layer would be responsible for doing the parallel reads to the OSD's to satisfy it. 2. Kernel recognises that the readahead is bigger than the max_sectors_kb value and submits several IO's in parallel to the RBD device to satisfy the readahead request. Ie 32MB readahead would submit 8x4MB IO's in parallel. Please let me know if I have got the wrong idea here. But in my head either solution should improve sequential reads by a large amount, with the 2nd possibly slightly better as you are only waiting on the 1st OSD
Re: Kernel RBD Readahead
On Mon, Aug 24, 2015 at 11:11 PM, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: Ilya Dryomov [mailto:idryo...@gmail.com] Sent: 24 August 2015 18:19 To: Nick Fisk n...@fisk.me.uk Cc: Ceph Development ceph-devel@vger.kernel.org Subject: Re: Kernel RBD Readahead On Mon, Aug 24, 2015 at 7:00 PM, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: Ilya Dryomov [mailto:idryo...@gmail.com] Sent: 24 August 2015 16:07 To: Nick Fisk n...@fisk.me.uk Cc: Ceph Development ceph-devel@vger.kernel.org Subject: Re: Kernel RBD Readahead On Mon, Aug 24, 2015 at 5:43 PM, Ilya Dryomov idryo...@gmail.com wrote: On Sun, Aug 23, 2015 at 10:23 PM, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: Ilya Dryomov [mailto:idryo...@gmail.com] Sent: 23 August 2015 18:33 To: Nick Fisk n...@fisk.me.uk Cc: Ceph Development ceph-devel@vger.kernel.org Subject: Re: Kernel RBD Readahead On Sat, Aug 22, 2015 at 11:45 PM, Nick Fisk n...@fisk.me.uk wrote: Hi Ilya, I was wondering if I could just get your thoughts on a matter I have run into? Its surrounding read performance of the RBD kernel client and blk-mq, mainly when doing large single threaded reads. During testing performance seems to be limited to around 40MB/s, which is probably fairly similar to what you would expect to get from a single OSD. This is to be expected as an RBD is just a long chain of objects each on a different OSD which is being read through in order one at a time. In theory readahead should make up for this by making the RBD client read from several OSD’s ahead of the current required block. However from what I can see it seems that setting a readahead value higher than max_sectors_kb doesn’t appear to have any effect, meaning that readahead is limited to each object that is currently being read. Would you be able to confirm if this is correct and if this is by design? [CCing ceph-devel] Certainly not by design. 
rbd is just a block device driver, so if the kernel submits a readahead read, it will obey and carry it out in full. The readahead is driven by the VM in pages; it doesn't care about rbd object boundaries and such. That said, one problem is in the VM subsystem, where readaheads get capped at 512 pages (= 2M). If you do a simple single threaded read test, you'll see 4096 sector (= 2M) I/Os instead of object size I/Os:

    $ rbd info foo | grep order
            order 24 (16384 kB objects)
    $ blockdev --getra /dev/rbd0
    32768
    $ dd if=/dev/rbd0 of=/dev/null bs=32M   # avgrq-sz is 4096.00

This was introduced in commit 6d2be915e589 (mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages) [1], which went into 3.15. The hard limit was Linus' suggestion, apparently.

    #define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)

    /*
     * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
     * sensible upper limit.
     */
    unsigned long max_sane_readahead(unsigned long nr)
    {
            return min(nr, MAX_READAHEAD);
    }

This limit used to be dynamic and depended on the number of free pages in the system. There has been an attempt to bring that behaviour back [2], but it didn't go very far as far as getting into mainline. It looks like Red Hat and Oracle are shipping [2] in some of their kernels though. If you apply it, you'll see 32768 sector (= 16M) I/Os in the above test, which is how it should be.

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6d2be915e589b58cb11418cbe1f22ff90732b6ac
[2] http://thread.gmane.org/gmane.linux.kernel/1893680

One thing we should be doing is setting read_ahead_kb to the object size - the default 128k doesn't really cut it for rbd. I'll send a patch for that.

Thanks,

                Ilya

Thanks for your response. I do see the I/Os being limited to 4096 sectors in the 4.1 kernel, so that is likely to be partly the cause of the poor performance I am seeing.
However I tried a 3.14 kernel and saw the same level performance, but this time the IO's were limited to 1024 sectors. The queue depth was at around 8 so I guess this means its submitting 8*512kb IO's up to the max_sectors_kb limit of 4096KB. From the OSD point of view, this will still be accessing 1 OSD at a time. Maybe I'm expecting the wrong results but I was expecting that either 1 of these 2 scenarios would happen. 1. Kernel submits a large enough IO to satisfy the readahead value, max_sectors_kb would need to be higher than the object size (currently not possible) and the RADOS layer would be responsible for doing the parallel reads to the OSD's to satisfy it. 2. Kernel recognises
Re: Kernel RBD Readahead
On Sat, Aug 22, 2015 at 11:45 PM, Nick Fisk n...@fisk.me.uk wrote:

Hi Ilya,

I was wondering if I could just get your thoughts on a matter I have run into? It's surrounding read performance of the RBD kernel client and blk-mq, mainly when doing large single threaded reads.

During testing, performance seems to be limited to around 40MB/s, which is probably fairly similar to what you would expect to get from a single OSD. This is to be expected, as an RBD is just a long chain of objects, each on a different OSD, which is being read through in order one at a time. In theory readahead should make up for this by making the RBD client read from several OSDs ahead of the currently required block. However, from what I can see it seems that setting a readahead value higher than max_sectors_kb doesn't appear to have any effect, meaning that readahead is limited to each object that is currently being read. Would you be able to confirm if this is correct and if this is by design?

[CCing ceph-devel]

Certainly not by design. rbd is just a block device driver, so if the kernel submits a readahead read, it will obey and carry it out in full. The readahead is driven by the VM in pages; it doesn't care about rbd object boundaries and such. That said, one problem is in the VM subsystem, where readaheads get capped at 512 pages (= 2M). If you do a simple single threaded read test, you'll see 4096 sector (= 2M) I/Os instead of object size I/Os:

    $ rbd info foo | grep order
            order 24 (16384 kB objects)
    $ blockdev --getra /dev/rbd0
    32768
    $ dd if=/dev/rbd0 of=/dev/null bs=32M   # avgrq-sz is 4096.00

This was introduced in commit 6d2be915e589 (mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages) [1], which went into 3.15. The hard limit was Linus' suggestion, apparently.

    #define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)

    /*
     * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
     * sensible upper limit.
     */
    unsigned long max_sane_readahead(unsigned long nr)
    {
            return min(nr, MAX_READAHEAD);
    }

This limit used to be dynamic and depended on the number of free pages in the system. There has been an attempt to bring that behaviour back [2], but it didn't go very far as far as getting into mainline. It looks like Red Hat and Oracle are shipping [2] in some of their kernels though. If you apply it, you'll see 32768 sector (= 16M) I/Os in the above test, which is how it should be.

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6d2be915e589b58cb11418cbe1f22ff90732b6ac
[2] http://thread.gmane.org/gmane.linux.kernel/1893680

One thing we should be doing is setting read_ahead_kb to the object size - the default 128k doesn't really cut it for rbd. I'll send a patch for that.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
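The arithmetic behind the numbers in the thread above can be sketched in a few lines (an illustration only, not code from the thread; the helper names are made up, and it assumes 4 KiB pages and 512-byte sectors):

```python
PAGE_SIZE = 4096    # bytes; stands in for PAGE_CACHE_SIZE above
SECTOR = 512        # bytes per sector

# The 3.15+ hard cap from commit 6d2be915e589: 512 pages.
MAX_READAHEAD = (512 * 4096) // PAGE_SIZE

def max_sane_readahead(nr_pages):
    # Python model of the kernel clamp quoted above.
    return min(nr_pages, MAX_READAHEAD)

def object_size_kb(order):
    # rbd object size is 2**order bytes; order 24 => 16384 kB objects.
    return (1 << order) // 1024

# blockdev --getra reported 32768 sectors; express that in pages and clamp:
ra_pages = 32768 * SECTOR // PAGE_SIZE      # 4096 pages requested
granted = max_sane_readahead(ra_pages)      # capped at 512 pages
print(granted * PAGE_SIZE // SECTOR)        # 4096 sectors, i.e. 2M I/Os
print(object_size_kb(24))                   # 16384, the suggested read_ahead_kb
```

This reproduces both observations: the 32768-sector readahead window collapses to 4096-sector requests under the 512-page cap, and an order-24 image wants read_ahead_kb of 16384 rather than the 128 default.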
Re: /sys/block and /dev and partitions
On Tue, Aug 18, 2015 at 12:46 AM, Loic Dachary l...@dachary.org wrote: Hi Ilya, For regular devices such as /dev/vdb2 or /dev/sda3, do you think it is safe to use /sys/dev/block/M:m/partition to figure out the partition number ? Or could it vary depending on the disk driver or the partition table layout ? Or the kernel version ? Yes, I think it is. partition was added specifically for finding out partition numbers after extended devt scheme was merged, because it made it impossible to use arithmetic on minor numbers. That happened around 2.6.27 - 2.6.28, so it's a more recent one compared to others, but I don't think you care about anything pre 2.6.32. (You could choose not to rely on partition and access() start for figuring out if the thing is a partition - it's been around forever, but then, the only way to figure out the partition index would be to parse the name, which you want to avoid.) Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
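The check described in that answer can be sketched as follows (illustrative Python, not ceph-disk's actual implementation; the function names are made up, and it assumes only the sysfs layout described above, where a partition's /sys/dev/block/M:m directory carries a "partition" attribute):

```python
import os

def is_partition(sysfs_dev_dir):
    # A partition's /sys/dev/block/<maj>:<min>/ directory has a 'partition'
    # attribute holding the partition number; whole devices don't have it.
    return os.access(os.path.join(sysfs_dev_dir, "partition"), os.F_OK)

def partition_number(sysfs_dev_dir):
    # Only meaningful when is_partition() is true.
    with open(os.path.join(sysfs_dev_dir, "partition")) as f:
        return int(f.read().strip())
```

For example, is_partition("/sys/dev/block/8:17") would be true for a typical /dev/sdb1, with no need to iterate over /sys/block or parse device names.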
Re: rbd: default map options, new options, misc (Re: creating the issue for the v0.94.4 release)
On Mon, Aug 17, 2015 at 8:41 PM, Loic Dachary l...@dachary.org wrote: On 17/08/2015 18:31, Josh Durgin wrote: Yes, this would make sense in firefly too, since it's useful for newer kernels regardless of ceph userspace versions. Ok, I'll schedule it for backport. Is there an issue associated to this pull request ? No, probably not. Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: /sys/block and /dev and partitions
On Sat, Aug 15, 2015 at 11:56 PM, Loic Dachary l...@dachary.org wrote:

Hi Ilya,

On 15/08/2015 19:42, Ilya Dryomov wrote:

On Sat, Aug 15, 2015 at 6:35 PM, Loic Dachary l...@dachary.org wrote:

Hi Sage,

On 15/08/2015 16:28, Sage Weil wrote:

On Sat, 15 Aug 2015, Loic Dachary wrote:

Hi,

Is there a portable and consistent way to figure out if a given /dev/XXX path (for instance /dev/dm-1) is a partition of a whole device? Although checking /sys/block/dm-1/dm/name for a number at the end (like mpatha1 or mpatha2) would probably work, it feels like a fragile hack. Looking into /sys/block/dm-1/slaves will lead to /sys/block/dm-1/slaves/dm-0 and we can check that /sys/block/dm-*/subsystem is class/block. But that does not necessarily mean dm-1 is a partition of dm-0, just that it's a slave of dm-0.

Take a look at is_partition in ceph-disk, which is the best I came up with. Basically it checks if the device name appears as /sys/block/*/$foo...

For regular devices, you can access() /sys/dev/block/maj:min/partition. If it's there, it's a partition - no need to iterate over /sys/block.

I added http://tracker.ceph.com/issues/12706 for when someone has time to rework that part of the code. That is consistently updated for /dev/sdb or /dev/vdb, but things are different when using multipath. I'll rely on /sys/block/dm-?/dm/name instead until a better solution is found.

A better way might be to rely on the fact that a dm partition will necessarily have its uuid prefixed by part. In that case, it should be safe to assume that the thing in slaves is a whole disk - I think that's what various util-linux tools do. However, IIRC the dm uuid is optional, so that won't work on a dm device without a uuid.

It looks like multipath on both CentOS 7 and Ubuntu 14.04 sets the uuid in this way. Is it also safe to assume that if the uuid is:

    $ cat /sys/dev/block/253:?/dm/uuid
    mpath-3533007d0
    part1-mpath-3533007d0
    part2-mpath-3533007d0

it means these were created by multipath, because of the mpath?
When asking dmsetup with:

Yes, I think so. I'm pretty sure these part<N>- and mpath- prefixes were devised for exactly this purpose.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
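That uuid convention lends itself to a simple classifier; a sketch (illustrative only, the helper name is invented, and it assumes uuids shaped like the multipath examples in the thread, i.e. "mpath-<wwid>" for the map and "part<N>-mpath-<wwid>" for its partitions):

```python
def parse_dm_uuid(uuid):
    # A leading 'part<N>-' prefix marks a dm partition; the remainder is
    # the uuid of the parent map. Anything else is treated as a whole map.
    if uuid.startswith("part"):
        prefix, _, parent = uuid.partition("-")
        return {"partition": int(prefix[4:]), "parent": parent}
    return {"partition": None, "parent": uuid}

print(parse_dm_uuid("part1-mpath-3533007d0"))
print(parse_dm_uuid("mpath-3533007d0"))
```

As noted above, the dm uuid is optional, so a robust version would also have to handle devices that carry no uuid at all.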
Re: /sys/block and /dev and partitions
On Sat, Aug 15, 2015 at 6:35 PM, Loic Dachary l...@dachary.org wrote:

Hi Sage,

On 15/08/2015 16:28, Sage Weil wrote:

On Sat, 15 Aug 2015, Loic Dachary wrote:

Hi,

Is there a portable and consistent way to figure out if a given /dev/XXX path (for instance /dev/dm-1) is a partition of a whole device? Although checking /sys/block/dm-1/dm/name for a number at the end (like mpatha1 or mpatha2) would probably work, it feels like a fragile hack. Looking into /sys/block/dm-1/slaves will lead to /sys/block/dm-1/slaves/dm-0 and we can check that /sys/block/dm-*/subsystem is class/block. But that does not necessarily mean dm-1 is a partition of dm-0, just that it's a slave of dm-0.

Take a look at is_partition in ceph-disk, which is the best I came up with. Basically it checks if the device name appears as /sys/block/*/$foo...

For regular devices, you can access() /sys/dev/block/maj:min/partition. If it's there, it's a partition - no need to iterate over /sys/block.

That is consistently updated for /dev/sdb or /dev/vdb, but things are different when using multipath. I'll rely on /sys/block/dm-?/dm/name instead until a better solution is found.

A better way might be to rely on the fact that a dm partition will necessarily have its uuid prefixed by part. In that case, it should be safe to assume that the thing in slaves is a whole disk - I think that's what various util-linux tools do. However, IIRC the dm uuid is optional, so that won't work on a dm device without a uuid.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rbd object map
On Fri, Aug 7, 2015 at 9:13 AM, Max Yehorov myeho...@skytap.com wrote:

Hi,

The object map feature was added in hammer. It is possible to create and delete an image with object map enabled using rbd, though it is not possible to map an image which has the object map or exclusive lock features enabled, e.g.:

    ceph@ceph1:~$ rbd create foo --size=42 --image-format=2 --image-features=15
    ceph@ceph1:~$ sudo rbd map foo
    rbd: sysfs write failed
    rbd: map failed: (6) No such device or address

I checked rbd.c in linux/drivers/block/rbd.c; it looks like it only supports layering and striping as of the kernel 4.1 branch.

Correct, and note that RBD_FEATURE_STRIPINGV2 is only partially supported - you can map a STRIPINGV2 image only if it was created with default striping parameters, i.e. you didn't specify --stripe-unit or --stripe-count at create time.

    #define RBD_FEATURE_LAYERING    (1<<0)
    #define RBD_FEATURE_STRIPINGV2  (1<<1)
    #define RBD_FEATURES_ALL \
            (RBD_FEATURE_LAYERING | RBD_FEATURE_STRIPINGV2)

    /* Features supported by this (client software) implementation. */

    #define RBD_FEATURES_SUPPORTED  (RBD_FEATURES_ALL)

Is object map a partial feature in hammer? If so, will there be complete support of object map in some upcoming ceph release?

ceph releases and kernel releases are not tied together. object map is fully supported by librbd, it's just not supported by the kernel client yet. Work is under way, but it won't show up in anything earlier than kernel 4.4.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
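The map failure is easy to see once the feature bits are written out; a sketch (the exclusive-lock and object-map bit positions are assumptions based on hammer-era librbd, and the helper name is made up):

```python
RBD_FEATURE_LAYERING       = 1 << 0
RBD_FEATURE_STRIPINGV2     = 1 << 1
RBD_FEATURE_EXCLUSIVE_LOCK = 1 << 2   # assumed librbd bit position
RBD_FEATURE_OBJECT_MAP     = 1 << 3   # assumed librbd bit position

# What the 4.1 kernel client advertises (RBD_FEATURES_SUPPORTED above):
KERNEL_41_SUPPORTED = RBD_FEATURE_LAYERING | RBD_FEATURE_STRIPINGV2

def unsupported_bits(image_features, supported=KERNEL_41_SUPPORTED):
    # Non-zero means the kernel refuses the map, as in the failure above.
    return image_features & ~supported

# --image-features=15 requests all four bits; two are unknown to 4.1:
print(bin(unsupported_bits(15)))   # exclusive-lock | object-map
```

With --image-features=3 (layering + striping) the mask comes out zero and the map would be accepted, which matches the thread's diagnosis.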
Re: Newbie question about metadata_list.
On Fri, Aug 7, 2015 at 10:12 AM, Łukasz Szymczyk lukasz.szymc...@corp.ovh.com wrote:

On Thu, 6 Aug 2015 15:01:54 +0300 Ilya Dryomov idryo...@gmail.com wrote:

Hi,

On Thu, Aug 6, 2015 at 12:26 PM, Łukasz Szymczyk lukasz.szymc...@corp.ovh.com wrote:

Hi,

I'm writing a program to replace an image in the cluster with its copy, but I have a problem with metadata_list.

    librbd::ImageCtx *ic = (librbd::ImageCtx*)im;
    std::string start;
    int max = 1000;
    bufferlist in, out;
    ::encode(start, in);
    ::encode(max, in);
    ret = ((librados::IoCtx*)io)->exec(ic->header_oid, "rbd", "metadata_list", in, out);
    if (ret < 0)
        printf("fail\n");

You should use the rbd_metadata_list() C API instead of this.

Looks like my ceph version doesn't have it.

        return 0;
    }

So, my question is: what should be set/enabled to get those metadata? Or maybe what I'm doing wrong here.

Try "rbd image-meta list image"?

I've got this:

    # rbd image-meta list image
    rbd: error parsing command 'image-meta'; -h or --help for usage

It's a fairly recent feature, are you sure your OSDs support it? What's the output of "ceph daemon osd.0 version"?

    # ceph daemon osd.0 version
    {"version":"0.80.9"}

image metadata was introduced in v9.0.0.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
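The version gate in that answer boils down to one comparison; a sketch (the function name is invented, and it assumes ceph's usual major.minor.patch version strings as reported by "ceph daemon osd.0 version"):

```python
def supports_image_meta(osd_version):
    # rbd image metadata (rbd image-meta, rbd_metadata_list) was introduced
    # in v9.0.0; firefly OSDs (0.80.x) return -EOPNOTSUPP for metadata_list.
    major = int(osd_version.lstrip("v").split(".")[0])
    return major >= 9

print(supports_image_meta("0.80.9"))
print(supports_image_meta("9.0.0"))
```

For the cluster in the thread, supports_image_meta("0.80.9") is false, which is consistent with both the -EOPNOTSUPP from the class method call and the cli not knowing the image-meta subcommand.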
Re: Newbie question about metadata_list.
On Thu, Aug 6, 2015 at 12:26 PM, Łukasz Szymczyk lukasz.szymc...@corp.ovh.com wrote:

Hi,

I'm writing a program to replace an image in the cluster with its copy, but I have a problem with metadata_list. I created a pool:

    # rados mkpool dupa

then I created an image:

    # rbd create --size 1000 -p mypool image --image-format 2

Below is code which tries to get metadata, but it fails with -EOPNOTSUPP. I compile it with a command like this:

    g++ file.cpp -lrbd -lrados -I/ceph/source/directory/src

    #include <stdio.h>
    #include <stdlib.h>
    #include "include/rbd/librbd.h"
    #include "include/rbd/librbd.hpp"

    using namespace ceph;

    #include "librbd/ImageCtx.h"

    int main()
    {
        rados_t clu;
        int ret = rados_create(&clu, NULL);
        if (ret)
            return -1;
        ret = rados_conf_read_file(clu, NULL);
        if (ret)
            return -1;
        rados_conf_parse_env(clu, NULL);
        ret = rados_connect(clu);
        if (ret)
            return -1;
        rados_ioctx_t io;
        ret = rados_ioctx_create(clu, "mypool", &io);
        if (ret)
            return -1;
        rbd_image_t im;
        ret = rbd_open(io, "image", &im, NULL);
        if (ret)
            return -1;
        librbd::ImageCtx *ic = (librbd::ImageCtx*)im;
        std::string start;
        int max = 1000;
        bufferlist in, out;
        ::encode(start, in);
        ::encode(max, in);
        ret = ((librados::IoCtx*)io)->exec(ic->header_oid, "rbd", "metadata_list", in, out);
        if (ret < 0)
            printf("fail\n");

You should use the rbd_metadata_list() C API instead of this.

        return 0;
    }

So, my question is: what should be set/enabled to get those metadata? Or maybe what I'm doing wrong here.

Try "rbd image-meta list image"? It's a fairly recent feature, are you sure your OSDs support it? What's the output of "ceph daemon osd.0 version"?

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: fix copyup completion race
On Tue, Jul 28, 2015 at 8:48 PM, Alex Elder el...@ieee.org wrote:

On 07/17/2015 05:36 AM, Ilya Dryomov wrote:

For write/discard obj_requests that involved a copyup method call, the opcode of the first op is CEPH_OSD_OP_CALL and the ->callback is rbd_img_obj_copyup_callback(). The latter frees copyup pages, sets ->xferred and delegates to rbd_img_obj_callback(), the normal image object callback, for reporting to the block layer and putting refs.

First of all, I'm really sorry it took so long to get to reviewing this. The short of this is it looks good, but it would be nice for you to read my description and affirm it matches reality. I also suggest you rename a function.

rbd_osd_req_callback() however treats CEPH_OSD_OP_CALL as a trivial op, which means obj_request is marked done in rbd_osd_trivial_callback(),

This callback path really should be fixed so that every op in the r_ops[] array gets handled separately. That was sort of planned, but we initially only had the one case (write copyup), and it only had two ops. Now we have 1, 2, or 3, and in principle there could be more, so it would be better to support things generically. The obj_request should only be marked done *after* all individual op callbacks have been processed. But... that can be done another day.

Yeah, I'm planning on reworking the entire completion infrastructure. The sneakiness of rbd_img_obj_callback() is slightly above the level I'm comfortable with, and the things we have to do to make it work (bumping img_request kref for each obj_request, for example) are confusing - it should just be a plain atomic_t. The fact that rbd_img_obj_callback() may be called after we think we are done with the request (i.e. after rbd_img_request_complete() was called) is also alarming to me. I stared at it for a while and for now convinced myself there are no *obvious* bugs, but who knows...
All that, coupled with the sheer number of functions ending with _callback (some of which are actual callbacks, but most are just called from those callbacks), makes the code hard to follow and even harder to make changes to, so I'm going to try to restructure it, hopefully in the near future.

*before* ->callback is invoked and rbd_img_obj_copyup_callback() has a chance to run. Marking obj_request done essentially means giving rbd_img_obj_callback() a license to end it at any moment, so if another obj_request from the same img_request is being completed concurrently, rbd_img_obj_end_request() may very well be called on such a prematurely marked done request:

    obj_request-1/2 reply
    handle_reply()
      rbd_osd_req_callback()
        rbd_osd_trivial_callback()
        rbd_obj_request_complete()
        rbd_img_obj_copyup_callback()
          rbd_img_obj_callback()
                                        obj_request-2/2 reply
                                        handle_reply()
                                          rbd_osd_req_callback()
                                            rbd_osd_trivial_callback()
            for_each_obj_request(obj_request->img_request) {
              rbd_img_obj_end_request(obj_request-1/2)
              rbd_img_obj_end_request(obj_request-2/2)   <--
            }

Calling rbd_img_obj_end_request() on such a request leads to trouble, in particular because its ->xferred is 0. We report 0 to the block layer with blk_update_request(), get back 1 for "this request has more data in flight" and then trip on

    rbd_assert(more ^ (which == img_request->obj_request_count));

Another of these pesky assertions actually identifies a problem. Most of those rbd_assert() assert pointer != NULL, which is pretty useless. However, there is definitely a bunch of useful ones, this one being one of those, of course.

with rhs (which == ...) being 1 because rbd_img_obj_end_request() has been called for both requests and lhs (more) being 1 because we haven't got a chance to set ->xferred in rbd_img_obj_copyup_callback() yet.
To fix this, leverage that rbd wants to call class methods in only two cases: one is a generic method call wrapper (obj_request is standalone) and the other is a copyup (obj_request is part of an img_request). So make a dedicated handler for CEPH_OSD_OP_CALL and directly invoke rbd_img_obj_copyup_callback() from it if obj_request is part of an img_request, similar to how CEPH_OSD_OP_READ handler invokes rbd_img_obj_request_read_callback(). OK, I understand what you're doing here. I'm going to restate it just to confirm that. When an OSD request completes, rbd_osd_req_callback() is called. It examines the first op in the object request's r_ops[] array. If the first op in an r_ops[] array is a CALL, you handle it with a new OSD request callback (sub)function rbd_osd_call_callback(). (That's a reasonable structural change to make anyway, I think.) That function now distinguishes between a copyup request (which will be for an image object) and a simple method call (which will be a request on an object not associated
Re: max rbd devices
On Tue, Jul 28, 2015 at 5:19 PM, Wyllys Ingersoll wyllys.ingers...@keepertech.com wrote:

Actually, it's a 4.1rc8 kernel, and modinfo shows that the single_major parameter defaults to false.

    $ sudo modinfo rbd
    filename:       /lib/modules/4.1.0-040100rc8-generic/kernel/drivers/block/rbd.ko
    license:        GPL
    description:    RADOS Block Device (RBD) driver
    author:         Jeff Garzik j...@garzik.org
    author:         Yehuda Sadeh yeh...@hq.newdream.net
    author:         Sage Weil s...@newdream.net
    author:         Alex Elder el...@inktank.com
    srcversion:     C0EF017464E250E25E466E8
    depends:        libceph
    intree:         Y
    vermagic:       4.1.0-040100rc8-generic SMP mod_unload modversions
    signer:         Build time autogenerated kernel key
    sig_key:        FC:1C:37:16:92:C8:E7:E5:17:9C:06:C5:5F:C4:DF:7E:19:00:A9:0F
    sig_hashalgo:   sha512
    parm:           single_major:Use a single major number for all rbd devices (default: false) (bool)

Yes, it defaults to false. But if rbd.ko isn't loaded by the time you do "rbd map" (which it shouldn't be), the rbd cli tool will load it for you with single_major=Y. That's why I asked if you loaded it by hand.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: max rbd devices
On Tue, Jul 28, 2015 at 5:59 PM, Wyllys Ingersoll wyllys.ingers...@keepertech.com wrote: Interesting to note - /etc/modules contains a line for rbd, which means it gets loaded at boot time and NOT via the rbd cli, so perhaps this is why it is loaded with the flag set to N. Yeah, that's probably it - try removing it and reboot. Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: max rbd devices
On Tue, Jul 28, 2015 at 5:35 PM, Wyllys Ingersoll wyllys.ingers...@keepertech.com wrote: OK, I misunderstood. It is not loaded by hand. We have permanent mappings listed in /etc/ceph/rbdmap which then get mapped automatically when the system boots. So, in that case, shouldn't single_major=Y since it essentially is just calling rbd map from an init script? Let's step back a second. What is the output of $ cat /sys/module/rbd/parameters/single_major right now? Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
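The check Ilya is asking for amounts to reading one sysfs attribute; a sketch (illustrative only, with the path taken as a parameter so the helper name and default are assumptions, and it assumes the usual 'Y'/'N' read-back of a bool module parameter):

```python
def single_major_enabled(path="/sys/module/rbd/parameters/single_major"):
    # A bool module parameter reads back as 'Y' or 'N' from sysfs.
    with open(path) as f:
        return f.read().strip() == "Y"
```

Per the thread, an rbd.ko pulled in at boot via /etc/modules would read back N, while one loaded by the rbd cli at "rbd map" time would read back Y.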