Re: puzzling disappearance of /dev/sdc1
Hi Ilya,

It turns out that sgdisk 0.8.6 -i 2 /dev/vdb removes partitions and re-adds them on CentOS 7 with a 3.10.0-229.11.1.el7 kernel, in the same way partprobe does. It is used intensively by ceph-disk and inevitably leads to races where a device temporarily disappears. The same command (sgdisk 0.8.8) on Ubuntu 14.04 with a 3.13.0-62-generic kernel only generates two udev change events and does not remove / add partitions. The source code between sgdisk 0.8.6 and sgdisk 0.8.8 did not change in a significant way, and the output of strace -e ioctl sgdisk -i 2 /dev/vdb is identical in both environments:

ioctl(3, BLKGETSIZE, 20971520) = 0
ioctl(3, BLKGETSIZE64, 10737418240) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
ioctl(3, BLKGETSIZE, 20971520) = 0
ioctl(3, BLKGETSIZE64, 10737418240) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKGETSIZE, 20971520) = 0
ioctl(3, BLKGETSIZE64, 10737418240) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0

This leads me to the conclusion that the difference is in how the kernel reacts to these ioctls. What do you think?

Cheers

On 17/12/2015 17:26, Ilya Dryomov wrote:
> On Thu, Dec 17, 2015 at 3:10 PM, Loic Dachary wrote:
>> Hi Sage,
>>
>> On 17/12/2015 14:31, Sage Weil wrote:
>>> On Thu, 17 Dec 2015, Loic Dachary wrote:
>>>> Hi Ilya,
>>>>
>>>> This is another puzzling behavior (the log of all commands is at http://tracker.ceph.com/issues/14094#note-4).
>>>> In a nutshell, after a series of sgdisk -i commands to examine various devices including /dev/sdc1, the /dev/sdc1 file disappears (and I think it will show up again, although I don't have definitive proof of this). It looks like a side effect of a previous partprobe command, the only command I can think of that removes / re-adds devices. I thought calling udevadm settle after running partprobe would be enough to ensure partprobe completed (and since it takes as much as 2min30s to return, I would be shocked if it did not ;-).
>
> Yeah, IIRC partprobe goes through every slot in the partition table, trying to first remove and then add the partition back. But I don't see any mention of partprobe in the log you referred to.
>
> Should udevadm settle for a few vd* devices be taking that much time? I'd investigate that regardless of the issue at hand.
>
>>>> Any idea? I desperately try to find a consistent behavior, something reliable that we could use to say: "wait for the partition table to be up to date in the kernel and all udev events generated by the partition table update to complete".
>>>
>>> I wonder if the underlying issue is that we shouldn't be calling udevadm settle from something running from udev. Instead, if a udev-triggered run of ceph-disk does something that changes the partitions, it should just exit and let udevadm run ceph-disk again on the new devices...?
>>
>> Unless I missed something, this is on CentOS 7 and ceph-disk is only called from udev as ceph-disk trigger, which does nothing else but asynchronously delegate the work to systemd. Therefore there is no udevadm settle from within udev (which would deadlock and timeout every time... I hope ;-).
>
> That's a sure lockup, until one of them times out.
>
> How are you delegating to systemd? Is it to avoid long-running udev events?
> I'm probably missing something - udevadm settle wouldn't block on anything other than udev, so if you are shipping work off to somewhere else, udev can't be relied upon for waiting.
>
> Thanks,
>
> Ilya

--
Loïc Dachary, Artisan Logiciel Libre
Re: puzzling disappearance of /dev/sdc1
On Fri, Dec 18, 2015 at 1:38 PM, Loic Dachary wrote:
> Hi Ilya,
>
> It turns out that sgdisk 0.8.6 -i 2 /dev/vdb removes partitions and re-adds them on CentOS 7 with a 3.10.0-229.11.1.el7 kernel, in the same way partprobe does. It is used intensively by ceph-disk and inevitably leads to races where a device temporarily disappears. The same command (sgdisk 0.8.8) on Ubuntu 14.04 with a 3.13.0-62-generic kernel only generates two udev change events and does not remove / add partitions. The source code between sgdisk 0.8.6 and sgdisk 0.8.8 did not change in a significant way and the output of strace -e ioctl sgdisk -i 2 /dev/vdb is identical in both environments.
>
> [ioctl trace quoted in the previous message]
>
> This leads me to the conclusion that the difference is in how the kernel reacts to these ioctls.

I'm pretty sure it's not the kernel versions that matter here, but systemd versions. Those are all get-property ioctls, and I don't think sgdisk -i does anything with the partition table.
What it probably does, though, is open the disk for write for some reason. When it closes it, udevd (the systemd-udevd process) picks that close up via inotify and issues the BLKRRPART ioctl, instructing the kernel to re-read the partition table. Technically, that's different from what partprobe does, but it still generates those udev events you are seeing in the monitor.

AFAICT udevd started doing this in v214.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
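The rescan Ilya describes can be reproduced by hand. Below is a minimal Python sketch of the BLKRRPART ioctl that udevd issues after the close-on-write event; the constant comes from <linux/fs.h> and the helper name is my own (an illustration, not udevd's code):

```python
import fcntl
import os

# BLKRRPART = _IO(0x12, 95) from <linux/fs.h>: ask the kernel to
# re-read the partition table of a block device.
BLKRRPART = 0x125F

def reread_partition_table(path):
    """Issue the same ioctl udevd issues after a close-on-write event.

    Raises OSError(ENOTTY) if `path` is not a block device, and
    OSError(EBUSY) if a partition on it is in use.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, BLKRRPART)
    finally:
        os.close(fd)
```

Running it against a disk while `udevadm monitor` is watching should show the same remove/add events the original message describes.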
Re: puzzling disappearance of /dev/sdc1
On 18/12/2015 16:31, Ilya Dryomov wrote:
> On Fri, Dec 18, 2015 at 1:38 PM, Loic Dachary wrote:
>> Hi Ilya,
>>
>> It turns out that sgdisk 0.8.6 -i 2 /dev/vdb removes partitions and re-adds them on CentOS 7 with a 3.10.0-229.11.1.el7 kernel, in the same way partprobe does. [...]
>>
>> This leads me to the conclusion that the difference is in how the kernel reacts to these ioctls.
>
> I'm pretty sure it's not the kernel versions that matter here, but systemd versions.
> Those are all get-property ioctls, and I don't think sgdisk -i does anything with the partition table.
>
> What it probably does, though, is open the disk for write for some reason. When it closes it, udevd (the systemd-udevd process) picks that close up via inotify and issues the BLKRRPART ioctl, instructing the kernel to re-read the partition table. Technically, that's different from what partprobe does, but it still generates those udev events you are seeing in the monitor.
>
> AFAICT udevd started doing this in v214.

That explains everything indeed.

# strace -f -e open sgdisk -i 2 /dev/vdb
...
open("/dev/vdb", O_RDONLY) = 4
open("/dev/vdb", O_WRONLY|O_CREAT, 0644) = 4
open("/dev/vdb", O_RDONLY) = 4
Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: 7BBAA731-AA45-47B8-8661-B4FAA53C4162
First sector: 2048 (at 1024.0 KiB)
Last sector: 204800 (at 100.0 MiB)
Partition size: 202753 sectors (99.0 MiB)
Attribute flags:
Partition name: 'ceph journal'

# strace -f -e open blkid /dev/vdb2
...
open("/etc/blkid.conf", O_RDONLY) = 4
open("/dev/.blkid.tab", O_RDONLY) = 4
open("/dev/vdb2", O_RDONLY) = 4
open("/sys/dev/block/253:18", O_RDONLY) = 5
open("/sys/block/vdb/dev", O_RDONLY) = 6
open("/dev/.blkid.tab-hVvwJi", O_RDWR|O_CREAT|O_EXCL, 0600) = 4

blkid does not open the device for write, hence the different behavior. Switching from sgdisk to blkid fixes the issue. Nice catch!

> Thanks,
>
> Ilya

--
Loïc Dachary, Artisan Logiciel Libre
Weighted Priority Op Queue
I adjusted the algorithm from the Weighted Round Robin queue and resolved the SSD performance issue. Since it is different, I've renamed it so that it doesn't cause confusion later.

My tests are all showing a performance improvement of 3-17%. The enqueue and dequeue latencies are the same or just slightly better than the Prioritized Queue, but the distribution of ops is much fairer, especially in contention situations (more enqueues than dequeues). The new queue is always right on the expected distribution in all cases, even with highly skewed ops (higher priority ops have low costs/sizes and low priority ops have high costs/sizes).

I could probably get a little more performance out of it by using intrusive containers, but since I'm getting the same latency with better overall performance gains, I wanted to get this in and tested. I can create another PR later if intrusive containers provide more performance.

The PR is at https://github.com/ceph/ceph/pull/6964

I've closed the previous PR (https://github.com/ceph/ceph/pull/6781) as this one supersedes it. Any feedback is appreciated.
Thanks,
- Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
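For readers who have not opened the PR, here is a toy sketch of the general weighted-priority idea (my own illustration, not the implementation in the PR): each dequeue draws a priority class with probability proportional to its priority value, so higher-priority ops get a proportionally larger share of the bandwidth without starving lower-priority ones.

```python
import random
from collections import deque

class WeightedPriorityQueue:
    """Toy model of a weighted priority op queue (illustration only).

    Ops within a priority class stay FIFO; the class itself is chosen
    with probability proportional to its priority.
    """

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._queues = {}  # priority -> deque of ops

    def enqueue(self, priority, op):
        self._queues.setdefault(priority, deque()).append(op)

    def dequeue(self):
        if not self._queues:
            return None
        # Draw a priority class, weighted by the priority value itself.
        prios = list(self._queues)
        r = self._rng.uniform(0, sum(prios))
        for p in prios:
            r -= p
            if r <= 0:
                break
        op = self._queues[p].popleft()
        if not self._queues[p]:
            del self._queues[p]
        return op
```

With priorities 63 (client) and 1 (recovery), client ops are drawn roughly 63 times out of 64, which is the "expected distribution" behavior the message describes.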
Re: puzzling disappearance of /dev/sdc1
Nevermind, got it:

CHANGES WITH 214:

* As an experimental feature, udev now tries to lock the disk device node (flock(LOCK_SH|LOCK_NB)) while it executes events for the disk or any of its partitions. Applications like partitioning programs can lock the disk device node (flock(LOCK_EX)) and claim temporary device ownership that way; udev will entirely skip all event handling for this disk and its partitions. If the disk was opened for writing, the close will trigger a partition table rescan in udev's "watch" facility, and if needed synthesize "change" events for the disk and all its partitions. This is now unconditionally enabled, and if it turns out to cause major problems, we might turn it on only for specific devices, or might need to disable it entirely. Device Mapper devices are excluded from this logic.

On 18/12/2015 17:32, Loic Dachary wrote:
>>> AFAICT udevd started doing this in v214.
>
> Do you have a specific commit / changelog entry in mind? I'd like to add it to the commit message fixing the problem, for reference.
>
> Thanks!

--
Loïc Dachary, Artisan Logiciel Libre
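The locking protocol from that changelog entry can be exercised from any language via flock(2). A Python sketch follows; the helper names are hypothetical, but the lock modes match the changelog: udevd probes with LOCK_SH|LOCK_NB, so a partitioning tool holding LOCK_EX on the whole-disk node makes udevd skip event handling until the fd is closed.

```python
import fcntl
import os

def claim_device(path):
    """Take the exclusive BSD lock described in the udev v214 changelog.

    While we hold LOCK_EX on the whole-disk node, udevd's
    flock(LOCK_SH | LOCK_NB) attempt fails and it skips event handling
    for the disk and its partitions.  (Helper name is hypothetical.)
    """
    fd = os.open(path, os.O_RDONLY)
    fcntl.flock(fd, fcntl.LOCK_EX)
    return fd  # hold this fd for the duration of the partition changes

def release_device(fd):
    # Closing the fd drops the lock; if the device was opened for write
    # in the meantime, udevd's inotify watch then triggers the rescan.
    os.close(fd)
```

Since flock treats each open file description independently, the refusal can be observed even from a second fd in the same process.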
2016 Ceph Tech Talks
Hey cephers,

Before we all head off to various holiday shenanigans and befuddle our senses with rest, relaxation, and glorious meals of legend, I wanted to give you something to look forward to in 2016 in the form of Ceph Tech Talks!

http://ceph.com/ceph-tech-talks/

First on the docket in January is our rescheduled talk from earlier this year discussing a PostgreSQL setup on Ceph under Mesos/Aurora with Docker. That should be a great talk that hits a lot of the questions I am frequently asked about database workloads, Ceph, and containers all in one.

While I haven't solidified the specific speaker/date/time, our plans for February are to dig in to the imminent release of CephFS (hooray!) in Jewel. We'll take a look at what awesomeness is being delivered, and where CephFS is headed next.

March is wide open, so if you or someone you know would like to give a Ceph Tech Talk, I'd love to find a community volunteer to talk about a technical, Ceph-related topic for about an hour over videoconference. Please drop me a line if this interests you.

In April we will once again be visiting the OpenStack Developer Summit (this time in TX), as well as working to deliver a Ceph track like we did in Tokyo. My hope is to broadcast some of this content for consumption by remote participants. Keep an eye out!

If you have any questions about upcoming events or community endeavors, please feel free to drop me a line. Thanks!

--
Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com || http://community.redhat.com
@scuttlemonkey || @ceph
Best way to measure client and recovery I/O
I've been working with Sam Just today and we would like to get some performance data around client I/O and recovery I/O to test the new op queue I've been working on. I know that we can just set an OSD out/in and such, but there seems like there could be a lot of variation in the results, making it difficult to come to a good conclusion. We could just run the test many times, but I'd love to spend my time doing other things. Please let me know if you have any great ideas around this problem.

Thanks,
- Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
[PATCH 06/14] libceph: use list_for_each_entry_safe
Use list_for_each_entry_safe() instead of list_for_each_safe() to simplify the code.

Signed-off-by: Geliang Tang
---
 net/ceph/messenger.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 9981039..c664b7f 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -3361,9 +3361,8 @@ static void ceph_msg_free(struct ceph_msg *m)
 static void ceph_msg_release(struct kref *kref)
 {
 	struct ceph_msg *m = container_of(kref, struct ceph_msg, kref);
-	LIST_HEAD(data);
-	struct list_head *links;
-	struct list_head *next;
+	LIST_HEAD(head);
+	struct ceph_msg_data *data, *next;
 
 	dout("%s %p\n", __func__, m);
 	WARN_ON(!list_empty(&m->list_head));
@@ -3376,12 +3375,9 @@ static void ceph_msg_release(struct kref *kref)
 		m->middle = NULL;
 	}
 
-	list_splice_init(&m->data, &data);
-	list_for_each_safe(links, next, &data) {
-		struct ceph_msg_data *data;
-
-		data = list_entry(links, struct ceph_msg_data, links);
-		list_del_init(links);
+	list_splice_init(&m->data, &head);
+	list_for_each_entry_safe(data, next, &head, links) {
+		list_del_init(&data->links);
 		ceph_msg_data_destroy(data);
 	}
 	m->data_length = 0;
--
2.5.0
Re: Issue with Ceph File System and LIO
Eric,

Do you have iSCSI data digests on?

On 12/15/2015 12:08 AM, Eric Eastman wrote:
> I am testing Linux Target SCSI, LIO, with a Ceph File System backstore
> and I am seeing this error on my LIO gateway. I am using Ceph v9.2.0
> on a 4.4rc4 kernel, on Trusty, using a kernel mounted Ceph File
> System. A file on the Ceph File System is exported via iSCSI to a
> VMware ESXi 5.0 server, and I am seeing this error when doing a lot of
> I/O on the ESXi server. Is this a LIO or a Ceph issue?
>
> [Tue Dec 15 00:46:55 2015] [ cut here ]
> [Tue Dec 15 00:46:55 2015] WARNING: CPU: 0 PID: 1123421 at /home/kernel/COD/linux/fs/ceph/addr.c:125 ceph_set_page_dirty+0x230/0x240 [ceph]()
> [Tue Dec 15 00:46:55 2015] Modules linked in: iptable_filter ip_tables x_tables xfs rbd iscsi_target_mod vhost_scsi tcm_qla2xxx ib_srpt tcm_fc tcm_usb_gadget tcm_loop target_core_file target_core_iblock target_core_pscsi target_core_user target_core_mod ipmi_devintf vhost qla2xxx ib_cm ib_sa ib_mad ib_core ib_addr libfc scsi_transport_fc libcomposite udc_core uio configfs ipmi_ssif ttm drm_kms_helper gpio_ich drm i2c_algo_bit fb_sys_fops coretemp syscopyarea ipmi_si sysfillrect ipmi_msghandler sysimgblt kvm acpi_power_meter 8250_fintek irqbypass hpilo shpchp input_leds serio_raw lpc_ich i7core_edac edac_core mac_hid ceph libceph libcrc32c fscache bonding lp parport mlx4_en vxlan ip6_udp_tunnel udp_tunnel ptp pps_core hid_generic usbhid hid hpsa mlx4_core psmouse bnx2 scsi_transport_sas fjes [last unloaded: target_core_mod]
> [Tue Dec 15 00:46:55 2015] CPU: 0 PID: 1123421 Comm: iscsi_trx Tainted: G W I 4.4.0-040400rc4-generic #201512061930
> [Tue Dec 15 00:46:55 2015] Hardware name: HP ProLiant DL360 G6, BIOS P64 01/22/2015
> [Tue Dec 15 00:46:55 2015] fdc0ce43 880bf38c38c0 813c8ab4
> [Tue Dec 15 00:46:55 2015] 880bf38c38f8 8107d772 ea00127a8680
> [Tue Dec 15 00:46:55 2015] 8804e52c1448 8804e52c15b0 8804e52c10f0 0200
> [Tue Dec 15 00:46:55 2015] Call Trace:
> [Tue Dec 15 00:46:55 2015] [] dump_stack+0x44/0x60
> [Tue Dec 15 00:46:55 2015] [] warn_slowpath_common+0x82/0xc0
> [Tue Dec 15 00:46:55 2015] [] warn_slowpath_null+0x1a/0x20
> [Tue Dec 15 00:46:55 2015] [] ceph_set_page_dirty+0x230/0x240 [ceph]
> [Tue Dec 15 00:46:55 2015] [] ? pagecache_get_page+0x150/0x1c0
> [Tue Dec 15 00:46:55 2015] [] ? ceph_pool_perm_check+0x48/0x700 [ceph]
> [Tue Dec 15 00:46:55 2015] [] set_page_dirty+0x3d/0x70
> [Tue Dec 15 00:46:55 2015] [] ceph_write_end+0x5e/0x180 [ceph]
> [Tue Dec 15 00:46:55 2015] [] ? iov_iter_copy_from_user_atomic+0x156/0x220
> [Tue Dec 15 00:46:55 2015] [] generic_perform_write+0x114/0x1c0
> [Tue Dec 15 00:46:55 2015] [] ceph_write_iter+0xf8a/0x1050 [ceph]
> [Tue Dec 15 00:46:55 2015] [] ? ceph_put_cap_refs+0x143/0x320 [ceph]
> [Tue Dec 15 00:46:55 2015] [] ? check_preempt_wakeup+0xfa/0x220
> [Tue Dec 15 00:46:55 2015] [] ? zone_statistics+0x7c/0xa0
> [Tue Dec 15 00:46:55 2015] [] ? copy_page_to_iter+0x5e/0xa0
> [Tue Dec 15 00:46:55 2015] [] ? skb_copy_datagram_iter+0x122/0x250
> [Tue Dec 15 00:46:55 2015] [] vfs_iter_write+0x76/0xc0
> [Tue Dec 15 00:46:55 2015] [] fd_do_rw.isra.5+0xd8/0x1e0 [target_core_file]
> [Tue Dec 15 00:46:55 2015] [] fd_execute_rw+0xc5/0x2a0 [target_core_file]
> [Tue Dec 15 00:46:55 2015] [] sbc_execute_rw+0x22/0x30 [target_core_mod]
> [Tue Dec 15 00:46:55 2015] [] __target_execute_cmd+0x1f/0x70 [target_core_mod]
> [Tue Dec 15 00:46:55 2015] [] target_execute_cmd+0x195/0x2a0 [target_core_mod]
> [Tue Dec 15 00:46:55 2015] [] iscsit_execute_cmd+0x20a/0x270 [iscsi_target_mod]
> [Tue Dec 15 00:46:55 2015] [] iscsit_sequence_cmd+0xda/0x190 [iscsi_target_mod]
> [Tue Dec 15 00:46:55 2015] [] iscsi_target_rx_thread+0x51d/0xe30 [iscsi_target_mod]
> [Tue Dec 15 00:46:55 2015] [] ? __switch_to+0x1dc/0x5a0
> [Tue Dec 15 00:46:55 2015] [] ? iscsi_target_tx_thread+0x1e0/0x1e0 [iscsi_target_mod]
> [Tue Dec 15 00:46:55 2015] [] kthread+0xd8/0xf0
> [Tue Dec 15 00:46:55 2015] [] ? kthread_create_on_node+0x1a0/0x1a0
> [Tue Dec 15 00:46:55 2015] [] ret_from_fork+0x3f/0x70
> [Tue Dec 15 00:46:55 2015] [] ? kthread_create_on_node+0x1a0/0x1a0
> [Tue Dec 15 00:46:55 2015] ---[ end trace 4079437668c77cbb ]---
> [Tue Dec 15 00:47:45 2015] ABORT_TASK: Found referenced iSCSI task_tag: 95784927
> [Tue Dec 15 00:47:45 2015] ABORT_TASK: ref_tag: 95784927 already complete, skipping
>
> If it is a Ceph File System issue, let me know and I will open a bug.
>
> Thanks
>
> Eric
[PATCH] ceph: Avoid propagating an invalid page pointer
The variable pagep will still hold an invalid page pointer if ceph_update_writeable_page() fails. To fix this issue, assign the page to pagep only once ceph_update_writeable_page() has succeeded.

Signed-off-by: Minfei Huang
---
 fs/ceph/addr.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index b7d218a..6491079 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1149,7 +1149,6 @@ static int ceph_write_begin(struct file *file, struct address_space *mapping,
 	page = grab_cache_page_write_begin(mapping, index, 0);
 	if (!page)
 		return -ENOMEM;
-	*pagep = page;
 
 	dout("write_begin file %p inode %p page %p %d~%d\n", file, inode, page, (int)pos, (int)len);
--
2.6.3
Re: Best way to measure client and recovery I/O
> I've been working with Sam Just today and we would like to get some
> performance data around client I/O and recovery I/O to test the new op
> queue I've been working on. I know that we can just set an OSD out/in
> and such, but there seems like there could be a lot of variation in
> the results, making it difficult to come to a good conclusion. We could
> just run the test many times, but I'd love to spend my time doing
> other things.

CBT [1] can do failure simulations while pushing load against the cluster; here is a config to get you started:

https://gist.github.com/mmgaggle/471cd4227e961a243b22

The osds array in the recovery test portion is the list of osd ids that you want to mark out during the test. CBT requires a bit of setup, but there is a script that can do most of it on an rpm based system. Make sure that your cbt head node has keyless ssh to itself, the mons, clients, and osd hosts (including accepting host keys). Let me know if you need help setting it up!

[1] https://github.com/ceph/cbt

--
Kyle Bader
Re: Issue with Ceph File System and LIO
Hi Mike,

On the ESXi server both Header Digest and Data Digest are set to Prohibited.

Eric

On Fri, Dec 18, 2015 at 2:54 PM, Mike Christie wrote:
> Eric,
>
> Do you have iSCSI data digests on?
>
> On 12/15/2015 12:08 AM, Eric Eastman wrote:
>> I am testing Linux Target SCSI, LIO, with a Ceph File System backstore
>> and I am seeing this error on my LIO gateway. I am using Ceph v9.2.0
>> on a 4.4rc4 kernel, on Trusty, using a kernel mounted Ceph File
>> System. A file on the Ceph File System is exported via iSCSI to a
>> VMware ESXi 5.0 server, and I am seeing this error when doing a lot of
>> I/O on the ESXi server. Is this a LIO or a Ceph issue?
>>
>> [kernel trace quoted in the previous message]
>>
>> If it is a Ceph File System issue, let me know and I will open a bug.
>>
>> Thanks
>>
>> Eric