Re: puzzling disapearance of /dev/sdc1

2015-12-18 Thread Loic Dachary
Hi Ilya,

It turns out that sgdisk 0.8.6 -i 2 /dev/vdb removes partitions and re-adds 
them on CentOS 7 with a 3.10.0-229.11.1.el7 kernel, in the same way partprobe 
does. It is used intensively by ceph-disk and inevitably leads to races where a 
device temporarily disapears. The same command (sgdisk 0.8.8) on Ubuntu 14.04 
with a 3.13.0-62-generic kernel only generates two udev change events and does 
not remove / add partitions. The source code between sgdisk 0.8.6 and sgdisk 
0.8.8 did not change in a significant way and the output of strace -e ioctl 
sgdisk -i 2 /dev/vdb is identical in both environments.

ioctl(3, BLKGETSIZE, 20971520)  = 0
ioctl(3, BLKGETSIZE64, 10737418240) = 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
ioctl(3, BLKGETSIZE, 20971520)  = 0
ioctl(3, BLKGETSIZE64, 10737418240) = 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKGETSIZE, 20971520)  = 0
ioctl(3, BLKGETSIZE64, 10737418240) = 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0
ioctl(3, BLKSSZGET, 512)= 0

This leads me to the conclusion that the difference is in how the kernel reacts 
to these ioctl.

What do you think ? 

Cheers

On 17/12/2015 17:26, Ilya Dryomov wrote:
> On Thu, Dec 17, 2015 at 3:10 PM, Loic Dachary  wrote:
>> Hi Sage,
>>
>> On 17/12/2015 14:31, Sage Weil wrote:
>>> On Thu, 17 Dec 2015, Loic Dachary wrote:
 Hi Ilya,

 This is another puzzling behavior (the log of all commands is at
 http://tracker.ceph.com/issues/14094#note-4). in a nutshell, after a
 series of sgdisk -i commands to examine various devices including
 /dev/sdc1, the /dev/sdc1 file disappears (and I think it will showup
 again although I don't have a definitive proof of this).

 It looks like a side effect of a previous partprobe command, the only
 command I can think of that removes / re-adds devices. I thought calling
 udevadm settle after running partprobe would be enough to ensure
 partprobe completed (and since it takes as much as 2mn30 to return, I
 would be shocked if it does not ;-).
> 
> Yeah, IIRC partprobe goes through every slot in the partition table,
> trying to first remove and then add the partition back.  But, I don't
> see any mention of partprobe in the log you referred to.
> 
> Should udevadm settle for a few vd* devices be taking that much time?
> I'd investigate that regardless of the issue at hand.
> 

 Any idea ? I desperately try to find a consistent behavior, something
 reliable that we could use to say : "wait for the partition table to be
 up to date in the kernel and all udev events generated by the partition
 table update to complete".
>>>
>>> I wonder if the underlying issue is that we shouldn't be calling udevadm
>>> settle from something running from udev.  Instead, of a udev-triggered
>>> run of ceph-disk does something that changes the partitions, it
>>> should just exit and let udevadm run ceph-disk again on the new
>>> devices...?
> 
>>
>> Unless I missed something this is on CentOS 7 and ceph-disk is only called 
>> from udev as ceph-disk trigger which does nothing else but asynchronously 
>> delegate the work to systemd. Therefore there is no udevadm settle from 
>> within udev (which would deadlock and timeout every time... I hope ;-).
> 
> That's a sure lockup, until one of them times out.
> 
> How are you delegating to systemd?  Is it to avoid long-running udev
> events?  I'm probably missing something - udevadm settle wouldn't block
> on anything other than udev, so if you are shipping work off to
> somewhere else, udev can't be relied upon for waiting.
> 
> Thanks,
> 
> Ilya
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature


Re: puzzling disapearance of /dev/sdc1

2015-12-18 Thread Ilya Dryomov
On Fri, Dec 18, 2015 at 1:38 PM, Loic Dachary  wrote:
> Hi Ilya,
>
> It turns out that sgdisk 0.8.6 -i 2 /dev/vdb removes partitions and re-adds 
> them on CentOS 7 with a 3.10.0-229.11.1.el7 kernel, in the same way partprobe 
> does. It is used intensively by ceph-disk and inevitably leads to races where 
> a device temporarily disapears. The same command (sgdisk 0.8.8) on Ubuntu 
> 14.04 with a 3.13.0-62-generic kernel only generates two udev change events 
> and does not remove / add partitions. The source code between sgdisk 0.8.6 
> and sgdisk 0.8.8 did not change in a significant way and the output of strace 
> -e ioctl sgdisk -i 2 /dev/vdb is identical in both environments.
>
> ioctl(3, BLKGETSIZE, 20971520)  = 0
> ioctl(3, BLKGETSIZE64, 10737418240) = 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
> ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
> ioctl(3, BLKGETSIZE, 20971520)  = 0
> ioctl(3, BLKGETSIZE64, 10737418240) = 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKGETSIZE, 20971520)  = 0
> ioctl(3, BLKGETSIZE64, 10737418240) = 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
> ioctl(3, BLKSSZGET, 512)= 0
>
> This leads me to the conclusion that the difference is in how the kernel 
> reacts to these ioctl.

I'm pretty sure it's not the kernel versions that matter here, but
systemd versions.  Those are all get-property ioctls, and I don't think
sgdisk -i does anything with the partition table.

What it probably does though is it opens the disk for write for some
reason.  When it closes it, udevd (systemd-udevd process) picks that
close up via inotify and issues the BLKRRPART ioctl, instructing the
kernel to re-read the partition table.  Technically, that's different
from what partprobe does, but it still generates those udev events you
are seeing in the monitor.

AFAICT udevd started doing this in v214.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: puzzling disapearance of /dev/sdc1

2015-12-18 Thread Loic Dachary


On 18/12/2015 16:31, Ilya Dryomov wrote:
> On Fri, Dec 18, 2015 at 1:38 PM, Loic Dachary  wrote:
>> Hi Ilya,
>>
>> It turns out that sgdisk 0.8.6 -i 2 /dev/vdb removes partitions and re-adds 
>> them on CentOS 7 with a 3.10.0-229.11.1.el7 kernel, in the same way 
>> partprobe does. It is used intensively by ceph-disk and inevitably leads to 
>> races where a device temporarily disapears. The same command (sgdisk 0.8.8) 
>> on Ubuntu 14.04 with a 3.13.0-62-generic kernel only generates two udev 
>> change events and does not remove / add partitions. The source code between 
>> sgdisk 0.8.6 and sgdisk 0.8.8 did not change in a significant way and the 
>> output of strace -e ioctl sgdisk -i 2 /dev/vdb is identical in both 
>> environments.
>>
>> ioctl(3, BLKGETSIZE, 20971520)  = 0
>> ioctl(3, BLKGETSIZE64, 10737418240) = 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
>> ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
>> ioctl(3, BLKGETSIZE, 20971520)  = 0
>> ioctl(3, BLKGETSIZE64, 10737418240) = 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKGETSIZE, 20971520)  = 0
>> ioctl(3, BLKGETSIZE64, 10737418240) = 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>> ioctl(3, BLKSSZGET, 512)= 0
>>
>> This leads me to the conclusion that the difference is in how the kernel 
>> reacts to these ioctl.
> 
> I'm pretty sure it's not the kernel versions that matter here, but
> systemd versions.  Those are all get-property ioctls, and I don't think
> sgdisk -i does anything with the partition table.
> 
> What it probably does though is it opens the disk for write for some
> reason.  When it closes it, udevd (systemd-udevd process) picks that
> close up via inotify and issues the BLKRRPART ioctl, instructing the
> kernel to re-read the partition table.  Technically, that's different
> from what partprobe does, but it still generates those udev events you
> are seeing in the monitor.
> 
> AFAICT udevd started doing this in v214.

That explains everything indeed.

# strace -f -e open sgdisk -i 2 /dev/vdb
...
open("/dev/vdb", O_RDONLY)  = 4
open("/dev/vdb", O_WRONLY|O_CREAT, 0644) = 4
open("/dev/vdb", O_RDONLY)  = 4
Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: 7BBAA731-AA45-47B8-8661-B4FAA53C4162
First sector: 2048 (at 1024.0 KiB)
Last sector: 204800 (at 100.0 MiB)
Partition size: 202753 sectors (99.0 MiB)
Attribute flags: 
Partition name: 'ceph journal'

# strace -f -e open blkid /dev/vdb2
...
open("/etc/blkid.conf", O_RDONLY)   = 4
open("/dev/.blkid.tab", O_RDONLY)   = 4
open("/dev/vdb2", O_RDONLY) = 4
open("/sys/dev/block/253:18", O_RDONLY) = 5
open("/sys/block/vdb/dev", O_RDONLY)= 6
open("/dev/.blkid.tab-hVvwJi", O_RDWR|O_CREAT|O_EXCL, 0600) = 4

blkid does not open the device for write, hence the different behavior. 
Switching sgdisk in favor of blkid fixes the issue.

Nice catch !

> Thanks,
> 
> Ilya
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature


Re: puzzling disapearance of /dev/sdc1

2015-12-18 Thread Loic Dachary
Nevermind, got it:

CHANGES WITH 214:

* As an experimental feature, udev now tries to lock the
  disk device node (flock(LOCK_SH|LOCK_NB)) while it
  executes events for the disk or any of its partitions.
  Applications like partitioning programs can lock the
  disk device node (flock(LOCK_EX)) and claim temporary
  device ownership that way; udev will entirely skip all event
  handling for this disk and its partitions. If the disk
  was opened for writing, the close will trigger a partition
  table rescan in udev's "watch" facility, and if needed
  synthesize "change" events for the disk and all its partitions.
  This is now unconditionally enabled, and if it turns out to
  cause major problems, we might turn it on only for specific
  devices, or might need to disable it entirely. Device Mapper
  devices are excluded from this logic.


On 18/12/2015 17:32, Loic Dachary wrote:
> 
>>> AFAICT udevd started doing this in v214.
> 
> Do you have a specific commit / changelog entry in mind ? I'd like to add it 
> to the commit message fixing the problem reference.
> 
> Thanks !
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature


Re: puzzling disapearance of /dev/sdc1

2015-12-17 Thread Loic Dachary
Hi Sage,

On 17/12/2015 14:31, Sage Weil wrote:
> On Thu, 17 Dec 2015, Loic Dachary wrote:
>> Hi Ilya,
>>
>> This is another puzzling behavior (the log of all commands is at 
>> http://tracker.ceph.com/issues/14094#note-4). in a nutshell, after a 
>> series of sgdisk -i commands to examine various devices including 
>> /dev/sdc1, the /dev/sdc1 file disappears (and I think it will showup 
>> again although I don't have a definitive proof of this).
>>
>> It looks like a side effect of a previous partprobe command, the only 
>> command I can think of that removes / re-adds devices. I thought calling 
>> udevadm settle after running partprobe would be enough to ensure 
>> partprobe completed (and since it takes as much as 2mn30 to return, I 
>> would be shocked if it does not ;-).
>>
>> Any idea ? I desperately try to find a consistent behavior, something 
>> reliable that we could use to say : "wait for the partition table to be 
>> up to date in the kernel and all udev events generated by the partition 
>> table update to complete".
> 
> I wonder if the underlying issue is that we shouldn't be calling udevadm 
> settle from something running from udev.  Instead, of a udev-triggered 
> run of ceph-disk does something that changes the partitions, it 
> should just exit and let udevadm run ceph-disk again on the new 
> devices...?

Unless I missed something this is on CentOS 7 and ceph-disk is only called from 
udev as ceph-disk trigger which does nothing else but asynchronously delegate 
the work to systemd. Therefore there is no udevadm settle from within udev 
(which would deadlock and timeout every time... I hope ;-).

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature


Re: puzzling disapearance of /dev/sdc1

2015-12-17 Thread Sage Weil
On Thu, 17 Dec 2015, Loic Dachary wrote:
> Hi Ilya,
> 
> This is another puzzling behavior (the log of all commands is at 
> http://tracker.ceph.com/issues/14094#note-4). in a nutshell, after a 
> series of sgdisk -i commands to examine various devices including 
> /dev/sdc1, the /dev/sdc1 file disappears (and I think it will showup 
> again although I don't have a definitive proof of this).
> 
> It looks like a side effect of a previous partprobe command, the only 
> command I can think of that removes / re-adds devices. I thought calling 
> udevadm settle after running partprobe would be enough to ensure 
> partprobe completed (and since it takes as much as 2mn30 to return, I 
> would be shocked if it does not ;-).
> 
> Any idea ? I desperately try to find a consistent behavior, something 
> reliable that we could use to say : "wait for the partition table to be 
> up to date in the kernel and all udev events generated by the partition 
> table update to complete".

I wonder if the underlying issue is that we shouldn't be calling udevadm 
settle from something running from udev.  Instead, of a udev-triggered 
run of ceph-disk does something that changes the partitions, it 
should just exit and let udevadm run ceph-disk again on the new 
devices...?

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: puzzling disapearance of /dev/sdc1

2015-12-17 Thread Ilya Dryomov
On Thu, Dec 17, 2015 at 3:10 PM, Loic Dachary  wrote:
> Hi Sage,
>
> On 17/12/2015 14:31, Sage Weil wrote:
>> On Thu, 17 Dec 2015, Loic Dachary wrote:
>>> Hi Ilya,
>>>
>>> This is another puzzling behavior (the log of all commands is at
>>> http://tracker.ceph.com/issues/14094#note-4). in a nutshell, after a
>>> series of sgdisk -i commands to examine various devices including
>>> /dev/sdc1, the /dev/sdc1 file disappears (and I think it will showup
>>> again although I don't have a definitive proof of this).
>>>
>>> It looks like a side effect of a previous partprobe command, the only
>>> command I can think of that removes / re-adds devices. I thought calling
>>> udevadm settle after running partprobe would be enough to ensure
>>> partprobe completed (and since it takes as much as 2mn30 to return, I
>>> would be shocked if it does not ;-).

Yeah, IIRC partprobe goes through every slot in the partition table,
trying to first remove and then add the partition back.  But, I don't
see any mention of partprobe in the log you referred to.

Should udevadm settle for a few vd* devices be taking that much time?
I'd investigate that regardless of the issue at hand.

>>>
>>> Any idea ? I desperately try to find a consistent behavior, something
>>> reliable that we could use to say : "wait for the partition table to be
>>> up to date in the kernel and all udev events generated by the partition
>>> table update to complete".
>>
>> I wonder if the underlying issue is that we shouldn't be calling udevadm
>> settle from something running from udev.  Instead, of a udev-triggered
>> run of ceph-disk does something that changes the partitions, it
>> should just exit and let udevadm run ceph-disk again on the new
>> devices...?

>
> Unless I missed something this is on CentOS 7 and ceph-disk is only called 
> from udev as ceph-disk trigger which does nothing else but asynchronously 
> delegate the work to systemd. Therefore there is no udevadm settle from 
> within udev (which would deadlock and timeout every time... I hope ;-).

That's a sure lockup, until one of them times out.

How are you delegating to systemd?  Is it to avoid long-running udev
events?  I'm probably missing something - udevadm settle wouldn't block
on anything other than udev, so if you are shipping work off to
somewhere else, udev can't be relied upon for waiting.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


puzzling disapearance of /dev/sdc1

2015-12-17 Thread Loic Dachary
Hi Ilya,

This is another puzzling behavior (the log of all commands is at 
http://tracker.ceph.com/issues/14094#note-4). in a nutshell, after a series of 
sgdisk -i commands to examine various devices including /dev/sdc1, the 
/dev/sdc1 file disappears (and I think it will showup again although I don't 
have a definitive proof of this).

It looks like a side effect of a previous partprobe command, the only command I 
can think of that removes / re-adds devices. I thought calling udevadm settle 
after running partprobe would be enough to ensure partprobe completed (and 
since it takes as much as 2mn30 to return, I would be shocked if it does not 
;-).

Any idea ? I desperately try to find a consistent behavior, something reliable 
that we could use to say : "wait for the partition table to be up to date in 
the kernel and all udev events generated by the partition table update to 
complete". 

Cheers
-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature