Re: [EXT] Re: udev events for iscsi

2020-04-22 Thread The Lee-Man
On Tuesday, April 21, 2020 at 11:56:23 PM UTC-7, Uli wrote:
>
> >>> The Lee-Man wrote on 21.04.2020 at 20:44 in message
> <618_1587494664_5E9F3F08_618_445_1_7f583720-8a84-4872-8d1a-5cd284295c22@googlegroups.com>:
> > On Tuesday, April 21, 2020 at 12:31:24 AM UTC-7, Gionatan Danti wrote: 
> >> 
> >> [reposting, as the previous one seems to be lost] 
> >> 
> >> Hi all, 
> >> I have a question regarding udev events when using iscsi disks. 
> >> 
> >> By using "udevadm monitor" I can see that events are generated when I 
> >> login and logout from an iscsi portal/resource, creating/destroying the 
> >> relative links under /dev/ 
> >> 
> >> However, I can not see anything when the remote machine simply
> >> dies/reboots/disconnects: while "dmesg" shows the iscsi timeout
> >> expiring, I don't see anything about a removed disk (and the links
> >> under /dev/ remain unaltered, indeed). At the same time, when the
> >> remote machine and disk become available again, no reconnection
> >> events happen.
> >>
> >
> > Because of the design of iSCSI, there is no way for the initiator to
> > know the server has gone away. The only time an initiator might figure
> > this out is when it tries to communicate with the target.
>
> My knowledge of the SCSI stack is quite poor, but I think the last
> revisions of parallel SCSI (like Ultra 320 (or was it 160?)) had a concept
> of "domain validation". AFAIK the latter meant measuring the quality of
> the wires and adjusting the transfer speed.
> While basically SCSI assumes "the bus" won't go away magically, a future
> iSCSI standard might contain regular "bus checks" to trigger recovery
> actions if the "bus" (network transport connection) seems to be gone.
>
> > 
> > This assumes we are not using some sort of directory service, like iSNS,
> > which can send asynchronous notifications. But even then, the iSNS
> > server would have to somehow know that the target went down. If the
> > target crashed, that might be difficult to ascertain.
>
> To be picky: If the target went down (like a classical failing SCSI disk),
> it could issue some attention message, but when the transport went down, no
> such message can be received. So I think there's a difference between
> "target down" (device not present, device fails to respond) and "bus down"
> (no communication possible any more). In the second case no assumptions can
> be made about the health of the target device.
>
> >
> > So in the absence of some asynchronous notification, the initiator only
> > knows the target is not responding if it tries to talk to that target.
> >
> > Normally iscsid defaults to sending periodic NO-OPs to the target every
> > 5 seconds. So if the target goes away, the initiator usually notices,
> > even if no regular I/O is occurring.
>
> So the target went away, or the bus went down? 
>

The initiator does not know the difference. As you know, there are dozens 
of things (conservatively) that can go wrong, which is why I say the disk 
"goes away". It could be sleeping. It could be dead. The cable could be 
unplugged. The system could be rebooting. The switch could be down. The 
ACLs could have changed (which is how I simulate a target going away). 
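
A minimal sketch of how one might look at what the initiator itself records, assuming a Linux initiator whose iSCSI transport class exposes a `state` attribute per session under /sys/class/iscsi_session. The function name and the optional root-directory argument are illustrative only and are not part of open-iscsi. The output says only that a session has left the logged-in state, not why, which matches the point above.

```shell
#!/bin/sh
# List each iSCSI session and its current state, one per line.
# Assumption: sessions appear as <root>/class/iscsi_session/session*/state.
# The optional first argument (default /sys) exists only to allow testing
# against a fake directory tree.
list_iscsi_session_states() {
    root="${1:-/sys}"
    for dir in "$root"/class/iscsi_session/session*; do
        # Skip the unexpanded glob when no session exists.
        [ -r "$dir/state" ] || continue
        printf '%s %s\n' "${dir##*/}" "$(cat "$dir/state")"
    done
}
```

Calling `list_iscsi_session_states` with no argument reads the live sysfs tree and prints nothing when no session exists. The state is reported as a short keyword such as LOGGED_IN or FAILED, regardless of whether the target, the cable, or the switch caused the change.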

>
> > 
> > But this is where the error recovery gets tricky, because iscsi tries to
> > handle "lossy" connections. What if the server will be right back? Maybe
> > it's rebooting? Maybe the cable will be plugged back in? So iscsi keeps
> > trying to reconnect. As a matter of fact, if you stop iscsid and restart
> > it, it sees the failed connection and retries it -- forever, by default.
> > I actually added a configuration parameter called reopen_max, that can
> > limit the number of retries. But there was pushback on changing the
> > default value from 0, which is "retry forever".
> >
> > So what exactly do you think the system should do when a connection
> > "goes away"? How long does it have to be gone to be considered gone for
> > good? If the target comes back "later" should it get the same disc name?
> > Should we retry, and if so how much before we give up? I'm interested in
> > your views, since it seems like a non-trivial problem to me.
>
> IMHO a "bus down" is a critical event affecting _all_ devices on that bus, 
> not just a single target. Well, it might be some extra noise if those other 
> targets have no I/O outstanding, but it's better to know that the bus is 
> down before initiating a transfer rather than concluding seconds later that 
> the target seems unreachable for some reasons unknown. 
>

There are 3 error handling levels built into the iSCSI protocol. I think 
you'll need to change/augment the protocol to change this. They are 
ERL=[0|1|2]. Error level 0 is the default, and the only one supported by 
open-iscsi. That just means we end the connection and reconnect. ERL=1 adds 
digest error handling, and ERL=2 adds session 

Re: [EXT] Re: udev events for iscsi

2020-04-22 Thread Donald Williams
Hello

 Re: Errors: that's likely from a bad copy/paste.  I referenced the
source document I took that from.  That was done against an older RHEL
kernel.

 Don


On Wed, Apr 22, 2020 at 3:04 AM Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> >>> Donald Williams wrote on 21.04.2020 at 20:49 in message
> <30147_1587494977_5E9F4041_30147_801_1_CAK3e-EawwxYGb3Gw74+P-yBmrnE0ktOL=Fj1OT_Lq+czyz...@mail.gmail.com>:
> > Hello,
> >
> >  If the loss exceeds the timeout value, yes.  If the 'drive' doesn't come
> > back in 30 to 60 seconds it's not likely a transitory event like a cable
> > pull.
> >
> > NOOP-IN and NOOP-OUT are also known as KeepAlive.  That's when the
>
> Actually I think that's two different mechanisms: Keepalive just prevents
> the connection from being discarded (some firewalls like to do that), while
> the No_op actually is an end-to-end (almost at least) connection test.
>
> > connection is up but the target or initiator isn't responding.   If those
> > time out the connection will be dropped and a new connection attempt made.
>
> I think the original intention for SCSI timeouts was to conclude a device
> has failed if it does not respond within time (actually there are different
> timeouts depending on the operation (like the famous rewinding of a long
> tape)). Next step for the OS would be to block I/O to a seemingly failed
> device. Recent operating systems like Linux have the choice to remove the
> device logically, requiring it to re-appear before it can be used. In some
> cases it seems preferable to keep the device, because otherwise there
> could be a cascading effect like killing processes that have the device
> open (UNIX processes do not like it when opened devices suddenly disappear).
>
> Regards,
> Ulrich
>
> >
> >  Don
> >
> >

Antw: [EXT] Re: udev events for iscsi

2020-04-22 Thread Ulrich Windl
>>> Donald Williams wrote on 21.04.2020 at 20:49 in message
<30147_1587494977_5E9F4041_30147_801_1_CAK3e-EawwxYGb3Gw74+P-yBmrnE0ktOL=Fj1OT_Lq+czyz...@mail.gmail.com>:
> Hello,
> 
>  If the loss exceeds the timeout value, yes.  If the 'drive' doesn't come
> back in 30 to 60 seconds it's not likely a transitory event like a cable
> pull.
> 
> NOOP-IN and NOOP-OUT are also known as KeepAlive.  That's when the

Actually I think that's two different mechanisms: Keepalive just prevents the 
connection from being discarded (some firewalls like to do that), while the 
No_op actually is an end-to-end (almost at least) connection test.

> connection is up but the target or initiator isn't responding.   If those
> time out the connection will be dropped and a new connection attempt made.

I think the original intention for SCSI timeouts was to conclude a device has 
failed if it does not respond within time (actually there are different 
timeouts depending on the operation (like the famous rewinding of a long 
tape)). Next step for the OS would be to block I/O to a seemingly failed 
device. Recent operating systems like Linux have the choice to remove the 
device logically, requiring it to re-appear before it can be used. In some 
cases it seems preferable to keep the device, because otherwise there could be 
a cascading effect like killing processes that have the device open (UNIX 
processes do not like it when opened devices suddenly disappear).
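
As a rough worked example of how these timeouts stack up, here is a small sketch. It assumes the failure is silent, that it is detected only by the NO-OP ping, and that queued I/O is failed once a separate recovery timeout expires. The three inputs correspond to what I believe are the open-iscsi settings node.conn[0].timeo.noop_out_interval, node.conn[0].timeo.noop_out_timeout and node.session.timeo.replacement_timeout; both the names and the values used (5, 5, 120) are assumptions to check against your own iscsid.conf.

```shell
#!/bin/sh
# Worst-case seconds between a silent failure and the moment queued I/O
# is failed back to the SCSI layer, under the assumptions stated above.
#   $1 NO-OP interval, $2 NO-OP timeout, $3 replacement (recovery) timeout
worst_case_failure_delay() {
    interval="$1"
    noop_timeout="$2"
    replacement="$3"
    # The failure may happen right after a successful NO-OP, so wait one
    # full interval for the next ping, then its timeout, then recovery time.
    echo $(( interval + noop_timeout + replacement ))
}

# Assumed values only; check iscsid.conf for the real ones.
worst_case_failure_delay 5 5 120
```

With these inputs the script prints 130. During that window the device stays present and I/O is merely blocked, which is one way to read the "keep the device" choice described above.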

Regards,
Ulrich

> 
>  Don
> 
> 
>> --
>> You received this message because you are subscribed to the Google Groups
>> "open-iscsi" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to open-iscsi+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/open-iscsi/7f583720-8a84-4872-8d1a-5cd284295c22%40googlegroups.com
>

Antw: [EXT] Re: udev events for iscsi

2020-04-22 Thread Ulrich Windl
>>> The Lee-Man wrote on 21.04.2020 at 20:44 in message
<618_1587494664_5E9F3F08_618_445_1_7f583720-8a84-4872-8d1a-5cd284295c22@googlegroups.com>:
> On Tuesday, April 21, 2020 at 12:31:24 AM UTC-7, Gionatan Danti wrote:
>>
>> [reposting, as the previous one seems to be lost]
>>
>> Hi all,
>> I have a question regarding udev events when using iscsi disks.
>>
>> By using "udevadm monitor" I can see that events are generated when I 
>> login and logout from an iscsi portal/resource, creating/destroying the 
>> relative links under /dev/
>>
>> However, I can not see anything when the remote machine simply 
>> dies/reboots/disconnects: while "dmesg" shows the iscsi timeout expiring, I 
>> don't see anything about a removed disk (and the links under /dev/ remain 
>> unaltered, indeed). At the same time, when the remote machine and disk 
>> become available again, no reconnection events happen.
>>
> 
> Because of the design of iSCSI, there is no way for the initiator to know 
> the server has gone away. The only time an initiator might figure this out 
> is when it tries to communicate with the target.

My knowledge of the SCSI stack is quite poor, but I think the last revisions of 
parallel SCSI (like Ultra 320 (or was it 160?)) had a concept of "domain 
validation". AFAIK the latter meant measuring the quality of the wires and 
adjusting the transfer speed.
While basically SCSI assumes "the bus" won't go away magically, a future iSCSI 
standard might contain regular "bus checks" to trigger recovery actions if the 
"bus" (network transport connection) seems to be gone.

> 
> This assumes we are not using some sort of directory service, like iSNS, 
> which can send asynchronous notifications. But even then, the iSNS server 
> would have to somehow know that the target went down. If the target 
> crashed, that might be difficult to ascertain.

To be picky: If the target went down (like a classical failing SCSI disk), it 
could issue some attention message, but when the transport went down, no such 
message can be received. So I think there's a difference between "target down" 
(device not present, device fails to respond) and "bus down" (no communication 
possible any more). In the second case no assumptions can be made about the 
health of the target device.

> 
> So in the absence of some asynchronous notification, the initiator only 
> knows the target is not responding if it tries to talk to that target.
> 
> Normally iscsid defaults to sending periodic NO-OPs to the target every 5 
> seconds. So if the target goes away, the initiator usually notices, even if 
> no regular I/O is occurring.

So the target went away, or the bus went down?

> 
> But this is where the error recovery gets tricky, because iscsi tries to 
> handle "lossy" connections. What if the server will be right back? Maybe 
> it's rebooting? Maybe the cable will be plugged back in? So iscsi keeps 
> trying to reconnect. As a matter of fact, if you stop iscsid and restart 
> it, it sees the failed connection and retries it -- forever, by default. I 
> actually added a configuration parameter called reopen_max, that can limit 
> the number of retries. But there was pushback on changing the default value 
> from 0, which is "retry forever".
> 
> So what exactly do you think the system should do when a connection "goes 
> away"? How long does it have to be gone to be considered gone for good? If 
> the target comes back "later" should it get the same disc name? Should we 
> retry, and if so how much before we give up? I'm interested in your views, 
> since it seems like a non-trivial problem to me.

IMHO a "bus down" is a critical event affecting _all_ devices on that bus, not 
just a single target. Well, it might be some extra noise if those other targets 
have no I/O outstanding, but it's better to know that the bus is down before 
initiating a transfer rather than concluding seconds later that the target 
seems unreachable for some reasons unknown.

> 
>>
>> I can read here that, years ago, a patch was in progress to give better 
>> integration with udev when a device disconnects/reconnects. Did the patch 
>> get merged? Or does the one I described above remain the expected behavior? 
>> Can it be changed?
>>
> 
> So you're saying as soon as a bad connection is detected (perhaps by a 
> NOOP), the device should go away? 

Maybe the state should be similar to a device being in power-save mode: It's 
not accessible right now, but should be woken up ASAP. See my earlier comparison 
to NFS hard-mounts...

Regards,
Ulrich

> 
>>
>> Thanks.
>>
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "open-iscsi" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to open-iscsi+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/open-iscsi/7f583720-8a84-4872-8d1a-5cd284295c22%40googlegroups.com.




Antw: [EXT] Re: udev events for iscsi

2020-04-22 Thread Ulrich Windl
>>> Donald Williams wrote on 21.04.2020 at 18:06 in message
<29812_1587485183_5E9F19FE_29812_432_1_CAK3e-EbA-d6NeDETJ0EMHeAw3HGko_uCB_f6gsiqjmeeyz...@mail.gmail.com>:

[...]
> 
> The default setting for Linux is 30 seconds. This can be verified using the
> command:
> 
>  # for i in $(find /sys/devices/platform –name timeout ) ; do cat $i ; done
> 30 30

Two remarks on the command above:
1) the command contains an en-dash instead of a minus, so you get funny error
messages like this:
find: ‘–iname’: No such file or directory
find: ‘timeout’: No such file or directory

2) Even with the correct command, I get no matches here (SLES12)

However I see matches within
/sys/devices/pci* and /sys/class/firmware/timeout
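
For comparison, a hedged variant that avoids both problems: it uses only ASCII characters and does not depend on where the host adapter sits in the device tree. It assumes that each SCSI disk exposes its command timeout as block/<disk>/device/timeout under sysfs; the function name and the optional root argument are illustrative, not taken from the quoted document.

```shell
#!/bin/sh
# Print "<path> <seconds>" for every per-disk SCSI command timeout found.
# Assumption: the timeout lives at <root>/block/<disk>/device/timeout.
# The optional first argument (default /sys) exists only to allow testing
# against a fake directory tree.
show_scsi_timeouts() {
    root="${1:-/sys}"
    for f in "$root"/block/*/device/timeout; do
        # Skip the unexpanded glob and devices without a timeout file.
        [ -r "$f" ] || continue
        printf '%s %s\n' "$f" "$(cat "$f")"
    done
}
```

Running `show_scsi_timeouts` with no argument scans the live system. Block devices that have no such file (loop devices, for example) are simply skipped.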

[...]

Regards,
Ulrich



Aw: [EXT] Re: udev events for iscsi

2020-04-21 Thread Ulrich Windl
>>> 21.04.2020, 17:20 >>>

Wondering myself.

> On Apr 21, 2020, at 2:31 AM, Gionatan Danti wrote:
>
> [reposting, as the previous one seems to be lost]
>
> Hi all,
> I have a question regarding udev events when using iscsi disks.
>
> By using "udevadm monitor" I can see that events are generated when I
> login and logout from an iscsi portal/resource, creating/destroying the
> relative links under /dev/

So running "udevadm monitor" on the initiator, you can see when a block
device becomes available locally.

> However, I can not see anything when the remote machine simply
> dies/reboots/disconnects: while "dmesg" shows the iscsi timeout expiring,
> I don't see anything about a removed disk (and the links under /dev/
> remain unaltered, indeed). At the same time, when the remote machine and
> disk become available again, no reconnection events happen.

As someone who has had an inordinate amount of experience with the iSCSI
connection breaking (power outage, network switch dies, wrong Ethernet cable
pulled, the target server hardware crashes, ...) in the middle of
production: the more info the better. Udev event triggers would help.

I wonder exactly how XenServer handles this, as it seemed more resilient
itself. XenServer host initiators do something right to recover, and I
wonder how that compares to the normal iSCSI initiator. Unfortunately,
XenServer LVM-over-iSCSI does not pass the message along to its Linux
virtual drives and VMs in the same way as it does for Windows VMs. When the
target drives became available again, MS Windows virtual machines would
gracefully recover on their own. All Linux VM filesystems went read-only
and those VMs required a forced reboot; "mount remount" would not work.

> I can read here that, years ago, a patch was in progress to give better
> integration with udev when a device disconnects/reconnects. Did the patch
> get merged? Or does the one I described above remain the expected
> behavior? Can it be changed?
>
> Thanks.
