Re: puzzling disappearance of /dev/sdc1

2015-12-18 Thread Loic Dachary
Hi Ilya,

It turns out that "sgdisk -i 2 /dev/vdb" (sgdisk 0.8.6) removes partitions and
re-adds them on CentOS 7 with a 3.10.0-229.11.1.el7 kernel, in the same way
partprobe does. ceph-disk uses it intensively, which inevitably leads to races
where a device temporarily disappears. The same command (sgdisk 0.8.8) on
Ubuntu 14.04 with a 3.13.0-62-generic kernel only generates two udev change
events and does not remove / add partitions. The source code did not change in
a significant way between sgdisk 0.8.6 and 0.8.8, and the output of
"strace -e ioctl sgdisk -i 2 /dev/vdb" is identical in both environments:

ioctl(3, BLKGETSIZE, 20971520)          = 0
ioctl(3, BLKGETSIZE64, 10737418240)     = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
ioctl(3, BLKGETSIZE, 20971520)          = 0
ioctl(3, BLKGETSIZE64, 10737418240)     = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKGETSIZE, 20971520)          = 0
ioctl(3, BLKGETSIZE64, 10737418240)     = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0

This leads me to the conclusion that the difference is in how the kernel
reacts to these ioctls.

What do you think?

Cheers

On 17/12/2015 17:26, Ilya Dryomov wrote:
> On Thu, Dec 17, 2015 at 3:10 PM, Loic Dachary  wrote:
>> Hi Sage,
>>
>> On 17/12/2015 14:31, Sage Weil wrote:
>>> On Thu, 17 Dec 2015, Loic Dachary wrote:
>>>> Hi Ilya,
>>>>
>>>> This is another puzzling behavior (the log of all commands is at
>>>> http://tracker.ceph.com/issues/14094#note-4). In a nutshell, after a
>>>> series of sgdisk -i commands to examine various devices including
>>>> /dev/sdc1, the /dev/sdc1 file disappears (and I think it will show up
>>>> again although I don't have definitive proof of this).
>>>>
>>>> It looks like a side effect of a previous partprobe command, the only
>>>> command I can think of that removes / re-adds devices. I thought calling
>>>> udevadm settle after running partprobe would be enough to ensure
>>>> partprobe completed (and since it takes as long as 2min30s to return, I
>>>> would be shocked if it did not ;-).
> 
> Yeah, IIRC partprobe goes through every slot in the partition table,
> trying to first remove and then add the partition back.  But, I don't
> see any mention of partprobe in the log you referred to.
> 
> Should udevadm settle for a few vd* devices be taking that much time?
> I'd investigate that regardless of the issue at hand.
> 

>>>> Any idea? I am desperately trying to find a consistent behavior,
>>>> something reliable that we could use to say: "wait for the partition
>>>> table to be up to date in the kernel and all udev events generated by
>>>> the partition table update to complete".
>>>
>>> I wonder if the underlying issue is that we shouldn't be calling udevadm
>>> settle from something running from udev.  Instead, if a udev-triggered
>>> run of ceph-disk does something that changes the partitions, it
>>> should just exit and let udevadm run ceph-disk again on the new
>>> devices...?
> 
>>
>> Unless I missed something this is on CentOS 7, and ceph-disk is only called 
>> from udev as "ceph-disk trigger", which does nothing but asynchronously 
>> delegate the work to systemd. Therefore there is no udevadm settle from 
>> within udev (which would deadlock and time out every time... I hope ;-).
> 
> That's a sure lockup, until one of them times out.
> 
> How are you delegating to systemd?  Is it to avoid long-running udev
> events?  I'm probably missing something - udevadm settle wouldn't block
> on anything other than udev, so if you are shipping work off to
> somewhere else, udev can't be relied upon for waiting.
> 
> Thanks,
> 
> Ilya
> 

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: puzzling disappearance of /dev/sdc1

2015-12-18 Thread Ilya Dryomov
On Fri, Dec 18, 2015 at 1:38 PM, Loic Dachary  wrote:
> Hi Ilya,
>
> [... same analysis and ioctl trace as quoted above, snipped ...]
>
> This leads me to the conclusion that the difference is in how the kernel
> reacts to these ioctls.

I'm pretty sure it's not the kernel versions that matter here, but
systemd versions.  Those are all get-property ioctls, and I don't think
sgdisk -i does anything with the partition table.

What it probably does, though, is open the disk for write for some
reason.  When it closes it, udevd (the systemd-udevd process) picks that
close up via inotify and issues the BLKRRPART ioctl, instructing the
kernel to re-read the partition table.  Technically, that's different
from what partprobe does, but it still generates those udev events you
are seeing in the monitor.

AFAICT udevd started doing this in v214.
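
In outline, the watch behaves like this minimal sketch (an illustration
of the mechanism only, not the actual systemd source; /dev/vdb is an
example path):

/* Sketch: watch a disk node for close-after-write, then ask the
 * kernel to re-read the partition table, as udevd >= 214 does. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>           /* BLKRRPART */

int main(void)
{
        char buf[4096];
        int ifd = inotify_init();

        /* IN_CLOSE_WRITE fires when a writer closes the device node */
        inotify_add_watch(ifd, "/dev/vdb", IN_CLOSE_WRITE);

        if (read(ifd, buf, sizeof(buf)) > 0) {
                int dfd = open("/dev/vdb", O_RDONLY);

                /* the rescan removes and re-adds the partition devices,
                 * which is what produces the udev events */
                if (ioctl(dfd, BLKRRPART) < 0)
                        perror("BLKRRPART");
                close(dfd);
        }
        return 0;
}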

Thanks,

Ilya


Re: puzzling disappearance of /dev/sdc1

2015-12-18 Thread Loic Dachary


On 18/12/2015 16:31, Ilya Dryomov wrote:
> On Fri, Dec 18, 2015 at 1:38 PM, Loic Dachary  wrote:
>> Hi Ilya,
>>
>> [... quoted analysis and ioctl trace snipped; see above ...]
> 
> I'm pretty sure it's not the kernel versions that matter here, but
> systemd versions.  Those are all get-property ioctls, and I don't think
> sgdisk -i does anything with the partition table.
> 
> What it probably does, though, is open the disk for write for some
> reason.  When it closes it, udevd (the systemd-udevd process) picks that
> close up via inotify and issues the BLKRRPART ioctl, instructing the
> kernel to re-read the partition table.  Technically, that's different
> from what partprobe does, but it still generates those udev events you
> are seeing in the monitor.
> 
> AFAICT udevd started doing this in v214.

That explains everything indeed.

# strace -f -e open sgdisk -i 2 /dev/vdb
...
open("/dev/vdb", O_RDONLY)  = 4
open("/dev/vdb", O_WRONLY|O_CREAT, 0644) = 4
open("/dev/vdb", O_RDONLY)  = 4
Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: 7BBAA731-AA45-47B8-8661-B4FAA53C4162
First sector: 2048 (at 1024.0 KiB)
Last sector: 204800 (at 100.0 MiB)
Partition size: 202753 sectors (99.0 MiB)
Attribute flags: 
Partition name: 'ceph journal'

# strace -f -e open blkid /dev/vdb2
...
open("/etc/blkid.conf", O_RDONLY)   = 4
open("/dev/.blkid.tab", O_RDONLY)   = 4
open("/dev/vdb2", O_RDONLY) = 4
open("/sys/dev/block/253:18", O_RDONLY) = 5
open("/sys/block/vdb/dev", O_RDONLY)= 6
open("/dev/.blkid.tab-hVvwJi", O_RDWR|O_CREAT|O_EXCL, 0600) = 4

blkid does not open the device for write, hence the different behavior.
Replacing sgdisk with blkid fixes the issue.
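
The trigger can be reproduced in isolation (assuming udev >= 214 and a
scratch disk at /dev/vdb): opening the device for write and closing it
without writing a single byte is enough, as this minimal sketch shows.
Running "udevadm monitor" alongside displays the resulting events.

/* Open a block device for write and close it untouched: udevd's
 * inotify watch sees IN_CLOSE_WRITE and rescans the partition
 * table anyway. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/dev/vdb", O_WRONLY);    /* what sgdisk -i does */

        if (fd < 0) {
                perror("open");
                return 1;
        }
        close(fd);      /* IN_CLOSE_WRITE fires here */
        return 0;
}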

Nice catch!

> Thanks,
> 
> Ilya
> 

-- 
Loïc Dachary, Artisan Logiciel Libre





Weighted Priority Op Queue

2015-12-18 Thread Robert LeBlanc

I adjusted the algorithm from the Weighted Round Robin Queue and
resolved the SSD performance issue. Since it is different, I've
renamed it so that it doesn't cause confusion later.

My tests are all showing a performance improvement of 3-17%. The
enqueue and dequeue latencies are the same or slightly better than
with the Prioritized Queue, but the distribution of OPs is much fairer,
especially in contention situations (more enqueues than dequeues).
The new queue is always right on the expected distribution in all
cases, even with highly skewed ops (higher-priority ops with low
cost/size and lower-priority ops with high cost/size).
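
To illustrate the fairness property only (this toy is plain C and in no
way the actual C++ implementation in the PR), a credit-based scheme can
serve cost in proportion to weight even when per-op costs are skewed:

/* Toy weighted dequeue: each class earns credit equal to its weight
 * every round and dequeues ops while it can pay their cost, so the
 * cost served per class converges to the weight ratio (63:1 here)
 * regardless of per-op cost. */
#include <stdio.h>

struct class_q { int weight; long credit; long cost; };

int main(void)
{
        struct class_q q[2] = {
                { 63, 0, 1 },   /* high priority, cheap ops  */
                {  1, 0, 8 },   /* low priority, costly ops  */
        };
        long served[2] = { 0, 0 };

        for (int round = 0; round < 1000; round++) {
                for (int j = 0; j < 2; j++) {
                        q[j].credit += q[j].weight;        /* earn by weight */
                        while (q[j].credit >= q[j].cost) { /* pay per op */
                                q[j].credit -= q[j].cost;
                                served[j] += q[j].cost;
                        }
                }
        }
        /* prints roughly a 63:1 ratio of served cost */
        printf("high=%ld low=%ld\n", served[0], served[1]);
        return 0;
}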

I could probably get a little more performance out of it by using
intrusive containers, but since I'm getting the same latency with
better overall performance gains, I wanted to get this in and tested.
I can create another PR later if intrusive containers provide more
performance.

The PR is at https://github.com/ceph/ceph/pull/6964

I've closed the previous PR (https://github.com/ceph/ceph/pull/6781)
as this one supersedes it. Any feedback is appreciated.

Thanks,
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: puzzling disappearance of /dev/sdc1

2015-12-18 Thread Loic Dachary
Nevermind, got it:

CHANGES WITH 214:

* As an experimental feature, udev now tries to lock the
  disk device node (flock(LOCK_SH|LOCK_NB)) while it
  executes events for the disk or any of its partitions.
  Applications like partitioning programs can lock the
  disk device node (flock(LOCK_EX)) and claim temporary
  device ownership that way; udev will entirely skip all event
  handling for this disk and its partitions. If the disk
  was opened for writing, the close will trigger a partition
  table rescan in udev's "watch" facility, and if needed
  synthesize "change" events for the disk and all its partitions.
  This is now unconditionally enabled, and if it turns out to
  cause major problems, we might turn it on only for specific
  devices, or might need to disable it entirely. Device Mapper
  devices are excluded from this logic.
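
So a partitioning tool can keep udev out of the way while it works by
taking the exclusive lock the changelog describes. A minimal sketch of
that protocol (the device path is an example):

/* Hold flock(LOCK_EX) on the whole-disk node while changing the
 * partition table; udevd (>= 214) takes LOCK_SH before handling
 * events and skips the disk while the exclusive lock is held. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/dev/vdb", O_RDWR);

        if (fd < 0 || flock(fd, LOCK_EX) < 0) {
                perror("/dev/vdb");
                return 1;
        }

        /* ... modify the partition table here; udev stays out ... */

        flock(fd, LOCK_UN);
        close(fd);      /* per the changelog, the close-after-write
                         * still triggers one synthesized rescan */
        return 0;
}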


On 18/12/2015 17:32, Loic Dachary wrote:
> 
>>> AFAICT udevd started doing this in v214.
> 
> Do you have a specific commit / changelog entry in mind? I'd like to add it 
> as a reference to the commit message fixing the problem.
> 
> Thanks!
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre





2016 Ceph Tech Talks

2015-12-18 Thread Patrick McGarry
Hey cephers,

Before we all head off to various holiday shenanigans and befuddle our
senses with rest, relaxation, and glorious meals of legend, I wanted
to give you something to look forward to for 2016 in the form of Ceph
Tech Talks!

http://ceph.com/ceph-tech-talks/

First on the docket in January is our rescheduled talk from earlier
this year discussing a PostgreSQL setup on Ceph under Mesos/Aurora
with Docker. That should be a great talk that hits a lot of the
questions I am frequently asked about database workloads, ceph, and
containers all in one.

While I haven’t solidified the specific speaker/date/time, our plans
for February are to dig into the imminent release of CephFS (hooray!)
in Jewel. We’ll take a look at what awesomeness is being delivered,
and where CephFS is headed next.

March is wide open, so if you or someone you know would like to give a
Ceph Tech Talk, I’d love to find a community volunteer to talk about a
technical topic that is Ceph-related for about an hour over
videoconference. Please drop me a line if this is interesting to you.

In April we will once again be visiting the OpenStack Developer Summit
(this time in TX), as well as working to deliver a Ceph track like we
did in Tokyo. My hope is to broadcast some of this content for
consumption by remote participants. Keep an eye out!

If you have any questions about upcoming events or community endeavors
please feel free to drop me a line. Thanks!


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph


Best way to measure client and recovery I/O

2015-12-18 Thread Robert LeBlanc

I've been working with Sam Just today and we would like to get some
performance data around client I/O and recovery I/O to test the new Op
queue I've been working on. I know that we can just set an OSD out/in
and such, but it seems like there could be a lot of variation in the
results, making it difficult to come to a good conclusion. We could
just run the test many times, but I'd love to spend my time doing
other things.

Please let me know if you have any great ideas around this problem.

Thanks,
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


[PATCH 06/14] libceph: use list_for_each_entry_safe

2015-12-18 Thread Geliang Tang
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.

Signed-off-by: Geliang Tang 
---
 net/ceph/messenger.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 9981039..c664b7f 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -3361,9 +3361,8 @@ static void ceph_msg_free(struct ceph_msg *m)
 static void ceph_msg_release(struct kref *kref)
 {
 	struct ceph_msg *m = container_of(kref, struct ceph_msg, kref);
-	LIST_HEAD(data);
-	struct list_head *links;
-	struct list_head *next;
+	LIST_HEAD(head);
+	struct ceph_msg_data *data, *next;
 
 	dout("%s %p\n", __func__, m);
 	WARN_ON(!list_empty(&m->list_head));
@@ -3376,12 +3375,9 @@ static void ceph_msg_release(struct kref *kref)
 		m->middle = NULL;
 	}
 
-	list_splice_init(&m->data, &data);
-	list_for_each_safe(links, next, &data) {
-		struct ceph_msg_data *data;
-
-		data = list_entry(links, struct ceph_msg_data, links);
-		list_del_init(links);
+	list_splice_init(&m->data, &head);
+	list_for_each_entry_safe(data, next, &head, links) {
+		list_del_init(&data->links);
 		ceph_msg_data_destroy(data);
 	}
 	m->data_length = 0;
-- 
2.5.0




Re: Issue with Ceph File System and LIO

2015-12-18 Thread Mike Christie
Eric,

Do you have iSCSI data digests on?

On 12/15/2015 12:08 AM, Eric Eastman wrote:
> I am testing the Linux SCSI target, LIO, with a Ceph File System backstore
> and I am seeing this error on my LIO gateway.  I am using Ceph v9.2.0
> on a 4.4-rc4 kernel, on Trusty, using a kernel-mounted Ceph File
> System.  A file on the Ceph File System is exported via iSCSI to a
> VMware ESXi 5.0 server, and I am seeing this error when doing a lot of
> I/O on the ESXi server.  Is this a LIO or a Ceph issue?
> 
> [Tue Dec 15 00:46:55 2015] [ cut here ]
> [Tue Dec 15 00:46:55 2015] WARNING: CPU: 0 PID: 1123421 at
> /home/kernel/COD/linux/fs/ceph/addr.c:125
> ceph_set_page_dirty+0x230/0x240 [ceph]()
> [Tue Dec 15 00:46:55 2015] Modules linked in: iptable_filter ip_tables
> x_tables xfs rbd iscsi_target_mod vhost_scsi tcm_qla2xxx ib_srpt
> tcm_fc tcm_usb_gadget tcm_loop target_core_file target_core_iblock
> target_core_pscsi target_core_user target_core_mod ipmi_devintf vhost
> qla2xxx ib_cm ib_sa ib_mad ib_core ib_addr libfc scsi_transport_fc
> libcomposite udc_core uio configfs ipmi_ssif ttm drm_kms_helper
> gpio_ich drm i2c_algo_bit fb_sys_fops coretemp syscopyarea ipmi_si
> sysfillrect ipmi_msghandler sysimgblt kvm acpi_power_meter 8250_fintek
> irqbypass hpilo shpchp input_leds serio_raw lpc_ich i7core_edac
> edac_core mac_hid ceph libceph libcrc32c fscache bonding lp parport
> mlx4_en vxlan ip6_udp_tunnel udp_tunnel ptp pps_core hid_generic
> usbhid hid hpsa mlx4_core psmouse bnx2 scsi_transport_sas fjes [last
> unloaded: target_core_mod]
> [Tue Dec 15 00:46:55 2015] CPU: 0 PID: 1123421 Comm: iscsi_trx
> Tainted: GW I 4.4.0-040400rc4-generic #201512061930
> [Tue Dec 15 00:46:55 2015] Hardware name: HP ProLiant DL360 G6, BIOS
> P64 01/22/2015
> [Tue Dec 15 00:46:55 2015]   fdc0ce43
> 880bf38c38c0 813c8ab4
> [Tue Dec 15 00:46:55 2015]   880bf38c38f8
> 8107d772 ea00127a8680
> [Tue Dec 15 00:46:55 2015]  8804e52c1448 8804e52c15b0
> 8804e52c10f0 0200
> [Tue Dec 15 00:46:55 2015] Call Trace:
> [Tue Dec 15 00:46:55 2015]  [] dump_stack+0x44/0x60
> [Tue Dec 15 00:46:55 2015]  [] 
> warn_slowpath_common+0x82/0xc0
> [Tue Dec 15 00:46:55 2015]  [] warn_slowpath_null+0x1a/0x20
> [Tue Dec 15 00:46:55 2015]  []
> ceph_set_page_dirty+0x230/0x240 [ceph]
> [Tue Dec 15 00:46:55 2015]  [] ?
> pagecache_get_page+0x150/0x1c0
> [Tue Dec 15 00:46:55 2015]  [] ?
> ceph_pool_perm_check+0x48/0x700 [ceph]
> [Tue Dec 15 00:46:55 2015]  [] set_page_dirty+0x3d/0x70
> [Tue Dec 15 00:46:55 2015]  []
> ceph_write_end+0x5e/0x180 [ceph]
> [Tue Dec 15 00:46:55 2015]  [] ?
> iov_iter_copy_from_user_atomic+0x156/0x220
> [Tue Dec 15 00:46:55 2015]  []
> generic_perform_write+0x114/0x1c0
> [Tue Dec 15 00:46:55 2015]  []
> ceph_write_iter+0xf8a/0x1050 [ceph]
> [Tue Dec 15 00:46:55 2015]  [] ?
> ceph_put_cap_refs+0x143/0x320 [ceph]
> [Tue Dec 15 00:46:55 2015]  [] ?
> check_preempt_wakeup+0xfa/0x220
> [Tue Dec 15 00:46:55 2015]  [] ? zone_statistics+0x7c/0xa0
> [Tue Dec 15 00:46:55 2015]  [] ? copy_page_to_iter+0x5e/0xa0
> [Tue Dec 15 00:46:55 2015]  [] ?
> skb_copy_datagram_iter+0x122/0x250
> [Tue Dec 15 00:46:55 2015]  [] vfs_iter_write+0x76/0xc0
> [Tue Dec 15 00:46:55 2015]  []
> fd_do_rw.isra.5+0xd8/0x1e0 [target_core_file]
> [Tue Dec 15 00:46:55 2015]  []
> fd_execute_rw+0xc5/0x2a0 [target_core_file]
> [Tue Dec 15 00:46:55 2015]  []
> sbc_execute_rw+0x22/0x30 [target_core_mod]
> [Tue Dec 15 00:46:55 2015]  []
> __target_execute_cmd+0x1f/0x70 [target_core_mod]
> [Tue Dec 15 00:46:55 2015]  []
> target_execute_cmd+0x195/0x2a0 [target_core_mod]
> [Tue Dec 15 00:46:55 2015]  []
> iscsit_execute_cmd+0x20a/0x270 [iscsi_target_mod]
> [Tue Dec 15 00:46:55 2015]  []
> iscsit_sequence_cmd+0xda/0x190 [iscsi_target_mod]
> [Tue Dec 15 00:46:55 2015]  []
> iscsi_target_rx_thread+0x51d/0xe30 [iscsi_target_mod]
> [Tue Dec 15 00:46:55 2015]  [] ? __switch_to+0x1dc/0x5a0
> [Tue Dec 15 00:46:55 2015]  [] ?
> iscsi_target_tx_thread+0x1e0/0x1e0 [iscsi_target_mod]
> [Tue Dec 15 00:46:55 2015]  [] kthread+0xd8/0xf0
> [Tue Dec 15 00:46:55 2015]  [] ?
> kthread_create_on_node+0x1a0/0x1a0
> [Tue Dec 15 00:46:55 2015]  [] ret_from_fork+0x3f/0x70
> [Tue Dec 15 00:46:55 2015]  [] ?
> kthread_create_on_node+0x1a0/0x1a0
> [Tue Dec 15 00:46:55 2015] ---[ end trace 4079437668c77cbb ]---
> [Tue Dec 15 00:47:45 2015] ABORT_TASK: Found referenced iSCSI task_tag: 
> 95784927
> [Tue Dec 15 00:47:45 2015] ABORT_TASK: ref_tag: 95784927 already
> complete, skipping
> 
> If it is a Ceph File System issue, let me know and I will open a bug.
> 
> Thanks
> 
> Eric


[PATCH] ceph: Avoid to propagate the invalid page point

2015-12-18 Thread Minfei Huang
The variable pagep will still hold an invalid page pointer when
ceph_update_writeable_page() fails.

To fix this issue, assign the page to *pagep only once
ceph_update_writeable_page() has succeeded.

Signed-off-by: Minfei Huang 
---
 fs/ceph/addr.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index b7d218a..6491079 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1149,7 +1149,6 @@ static int ceph_write_begin(struct file *file, struct address_space *mapping,
page = grab_cache_page_write_begin(mapping, index, 0);
if (!page)
return -ENOMEM;
-   *pagep = page;
 
dout("write_begin file %p inode %p page %p %d~%d\n", file,
 inode, page, (int)pos, (int)len);
-- 
2.6.3
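
The point of the change, paraphrased (an illustrative fragment, not the
literal fs/ceph/addr.c; the release-on-failure detail is an assumption
about the surrounding code): the caller's *pagep should only be
populated once the page has been successfully prepared.

	/* illustrative flow only: publish the page to the caller
	 * on success, never on failure */
	page = grab_cache_page_write_begin(mapping, index, 0);
	if (!page)
		return -ENOMEM;

	r = ceph_update_writeable_page(file, pos, len, page);
	if (r < 0)
		page_cache_release(page);	/* drop ref on failure */
	else
		*pagep = page;			/* publish only on success */

	return r;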



Re: Best way to measure client and recovery I/O

2015-12-18 Thread Kyle Bader
> I've been working with Sam Just today and we would like to get some
> performance data around client I/O and recovery I/O to test the new Op
> queue I've been working on. I know that we can just set an OSD out/in
> and such, but it seems like there could be a lot of variation in the
> results, making it difficult to come to a good conclusion. We could
> just run the test many times, but I'd love to spend my time doing
> other things.

CBT [1] can do failure simulations while pushing load against the
cluster, here is a config to get you started:

https://gist.github.com/mmgaggle/471cd4227e961a243b22

The osds array in the recovery test portion is the list of osd ids
that you want to mark out during the test.

CBT requires a bit of setup, but there is a script that can do most of
it on an rpm-based system. Make sure that your cbt head node has
keyless ssh to itself, the mons, clients, and osd hosts (including
accepting host keys). Let me know if you need help setting it up!

[1] https://github.com/ceph/cbt

-- 

Kyle Bader


Re: Issue with Ceph File System and LIO

2015-12-18 Thread Eric Eastman
Hi Mike,

On the ESXi server both Header Digest and Data Digest are set to Prohibited.

Eric

On Fri, Dec 18, 2015 at 2:54 PM, Mike Christie  wrote:
> Eric,
>
> Do you have iSCSI data digests on?
>
> On 12/15/2015 12:08 AM, Eric Eastman wrote:
>> I am testing Linux Target SCSI, LIO, with a Ceph File System backstore
>> and I am seeing this error on my LIO gateway.  I am using Ceph v9.2.0
>> on a 4.4rc4 Kernel, on Trusty, using a kernel mounted Ceph File
>> System.  A file on the Ceph File System is exported via iSCSI to a
>> VMware ESXi 5.0 server, and I am seeing this error when doing a lot of
>> I/O on the ESXi server.   Is this a LIO or a Ceph issue?
>>
>> [Tue Dec 15 00:46:55 2015] [ cut here ]
>> [Tue Dec 15 00:46:55 2015] WARNING: CPU: 0 PID: 1123421 at
>> /home/kernel/COD/linux/fs/ceph/addr.c:125
>> ceph_set_page_dirty+0x230/0x240 [ceph]()
>> [Tue Dec 15 00:46:55 2015] Modules linked in: iptable_filter ip_tables
>> x_tables xfs rbd iscsi_target_mod vhost_scsi tcm_qla2xxx ib_srpt
>> tcm_fc tcm_usb_gadget tcm_loop target_core_file target_core_iblock
>> target_core_pscsi target_core_user target_core_mod ipmi_devintf vhost
>> qla2xxx ib_cm ib_sa ib_mad ib_core ib_addr libfc scsi_transport_fc
>> libcomposite udc_core uio configfs ipmi_ssif ttm drm_kms_helper
>> gpio_ich drm i2c_algo_bit fb_sys_fops coretemp syscopyarea ipmi_si
>> sysfillrect ipmi_msghandler sysimgblt kvm acpi_power_meter 8250_fintek
>> irqbypass hpilo shpchp input_leds serio_raw lpc_ich i7core_edac
>> edac_core mac_hid ceph libceph libcrc32c fscache bonding lp parport
>> mlx4_en vxlan ip6_udp_tunnel udp_tunnel ptp pps_core hid_generic
>> usbhid hid hpsa mlx4_core psmouse bnx2 scsi_transport_sas fjes [last
>> unloaded: target_core_mod]
>> [Tue Dec 15 00:46:55 2015] CPU: 0 PID: 1123421 Comm: iscsi_trx
>> Tainted: GW I 4.4.0-040400rc4-generic #201512061930
>> [Tue Dec 15 00:46:55 2015] Hardware name: HP ProLiant DL360 G6, BIOS
>> P64 01/22/2015
>> [Tue Dec 15 00:46:55 2015]   fdc0ce43
>> 880bf38c38c0 813c8ab4
>> [Tue Dec 15 00:46:55 2015]   880bf38c38f8
>> 8107d772 ea00127a8680
>> [Tue Dec 15 00:46:55 2015]  8804e52c1448 8804e52c15b0
>> 8804e52c10f0 0200
>> [Tue Dec 15 00:46:55 2015] Call Trace:
>> [Tue Dec 15 00:46:55 2015]  [] dump_stack+0x44/0x60
>> [Tue Dec 15 00:46:55 2015]  [] 
>> warn_slowpath_common+0x82/0xc0
>> [Tue Dec 15 00:46:55 2015]  [] warn_slowpath_null+0x1a/0x20
>> [Tue Dec 15 00:46:55 2015]  []
>> ceph_set_page_dirty+0x230/0x240 [ceph]
>> [Tue Dec 15 00:46:55 2015]  [] ?
>> pagecache_get_page+0x150/0x1c0
>> [Tue Dec 15 00:46:55 2015]  [] ?
>> ceph_pool_perm_check+0x48/0x700 [ceph]
>> [Tue Dec 15 00:46:55 2015]  [] set_page_dirty+0x3d/0x70
>> [Tue Dec 15 00:46:55 2015]  []
>> ceph_write_end+0x5e/0x180 [ceph]
>> [Tue Dec 15 00:46:55 2015]  [] ?
>> iov_iter_copy_from_user_atomic+0x156/0x220
>> [Tue Dec 15 00:46:55 2015]  []
>> generic_perform_write+0x114/0x1c0
>> [Tue Dec 15 00:46:55 2015]  []
>> ceph_write_iter+0xf8a/0x1050 [ceph]
>> [Tue Dec 15 00:46:55 2015]  [] ?
>> ceph_put_cap_refs+0x143/0x320 [ceph]
>> [Tue Dec 15 00:46:55 2015]  [] ?
>> check_preempt_wakeup+0xfa/0x220
>> [Tue Dec 15 00:46:55 2015]  [] ? zone_statistics+0x7c/0xa0
>> [Tue Dec 15 00:46:55 2015]  [] ? 
>> copy_page_to_iter+0x5e/0xa0
>> [Tue Dec 15 00:46:55 2015]  [] ?
>> skb_copy_datagram_iter+0x122/0x250
>> [Tue Dec 15 00:46:55 2015]  [] vfs_iter_write+0x76/0xc0
>> [Tue Dec 15 00:46:55 2015]  []
>> fd_do_rw.isra.5+0xd8/0x1e0 [target_core_file]
>> [Tue Dec 15 00:46:55 2015]  []
>> fd_execute_rw+0xc5/0x2a0 [target_core_file]
>> [Tue Dec 15 00:46:55 2015]  []
>> sbc_execute_rw+0x22/0x30 [target_core_mod]
>> [Tue Dec 15 00:46:55 2015]  []
>> __target_execute_cmd+0x1f/0x70 [target_core_mod]
>> [Tue Dec 15 00:46:55 2015]  []
>> target_execute_cmd+0x195/0x2a0 [target_core_mod]
>> [Tue Dec 15 00:46:55 2015]  []
>> iscsit_execute_cmd+0x20a/0x270 [iscsi_target_mod]
>> [Tue Dec 15 00:46:55 2015]  []
>> iscsit_sequence_cmd+0xda/0x190 [iscsi_target_mod]
>> [Tue Dec 15 00:46:55 2015]  []
>> iscsi_target_rx_thread+0x51d/0xe30 [iscsi_target_mod]
>> [Tue Dec 15 00:46:55 2015]  [] ? __switch_to+0x1dc/0x5a0
>> [Tue Dec 15 00:46:55 2015]  [] ?
>> iscsi_target_tx_thread+0x1e0/0x1e0 [iscsi_target_mod]
>> [Tue Dec 15 00:46:55 2015]  [] kthread+0xd8/0xf0
>> [Tue Dec 15 00:46:55 2015]  [] ?
>> kthread_create_on_node+0x1a0/0x1a0
>> [Tue Dec 15 00:46:55 2015]  [] ret_from_fork+0x3f/0x70
>> [Tue Dec 15 00:46:55 2015]  [] ?
>> kthread_create_on_node+0x1a0/0x1a0
>> [Tue Dec 15 00:46:55 2015] ---[ end trace 4079437668c77cbb ]---
>> [Tue Dec 15 00:47:45 2015] ABORT_TASK: Found referenced iSCSI task_tag: 
>> 95784927
>> [Tue Dec 15 00:47:45 2015] ABORT_TASK: ref_tag: 95784927 already
>> complete, skipping
>>
>> If it is a Ceph File System issue, let me know and I will open a bug.
>>
>> Thanks
>>
>> Eric
>> --
>> To unsubscribe from this