On Wed, Feb 04, 2009 at 04:33:13PM -0500, Konrad Rzeszutek wrote:
> In the 2.0-865 version when we received a ISCSI_ASYNC_MSG_REQUEST_LOGOUT we
> would logout, and then retry logging back in:
>
> - <28>Jul 28 20:15:40 iscsid: Target requests logout within 3 seconds for
>   connection
> - <28>Jul 28 20:15:45 iscsid: connection5:0 is operational after recovery
>   (2 attempts)
>
> And we would have a short hiccup (5 seconds) of the connection being gone.
>
> This as my understanding was a mechanism for the EqualLogic box to "move"
> (re-establishing allegiance) a session to a different port, hence allowing
> a load-balancing mechanism.
> In 2.0-869, the git commit 052d014485d2ce5bb7fa8dd0df875dafd1db77df changed
> this behavior so that we now actually logout and delete the session. No more
> retries.

The problem wasn't with iSCSI. It was with multipathd not handling device
mapper events. Specifically, after multipathd was started, any SCSI disks
added afterwards would not trigger multipathd to create a waitevent thread.
The waitevent thread listens for the kernel's offline/online events,
thoroughly checks what the kernel sees against what multipathd thinks, and,
if something is off, whacks multipathd into the right state.

For devices which did not have a kernel device mapper hardware handler
(hp_sw, rdac, etc.) and had only a single path, when the link experienced a
momentary blip with I/O on it, the path would be marked as failed _only_ by
the kernel. This event would _not_ be propagated to multipathd (because no
waitevent thread had been created). Multipathd would only run the path
checker, which would report PATH_UP (rightly so, as the path was only down
for a second or so). However, the device mapper path group would be marked
as failed, and any incoming I/O would be blocked (if queue_if_no_path was
set) or failed. The end result was that multipathd thought everything was
peachy while the kernel was failing (or queueing) the I/O to the multipath
device.

The bug exists in SLES10 SP2 and SLES11, but not in RHEL5 U3 (the line
resetting the state is gone - no commit data about why), nor upstream (a
different patch fixes this inadvertently).

The fix is quite easy: when we get a uevent for a new block device, we make
sure to start the waitevent thread if it has not been started already. Here
is the patch. I am going to post a patch tailored for upstream on the
device-mapper mailing list next week.
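To make the failure sequence concrete, here is a toy model of it in C. This
is a sketch only: `toy_path`, `link_blip`, and `path_checker` are
illustrative names, not multipathd's real data structures.

#include <assert.h>
#include <stdio.h>

/* Toy model of the divergence described above; not multipathd code. */
enum state { UP, FAILED };

struct toy_path {
	enum state kernel_state;   /* what the kernel dm side records */
	enum state daemon_state;   /* what multipathd believes        */
	int has_waitevent_thread;  /* was a waitevent thread started? */
};

/* A momentary link blip with I/O in flight: the kernel fails the path. */
static void link_blip(struct toy_path *p)
{
	p->kernel_state = FAILED;
	/* Only a waitevent thread would propagate this to the daemon. */
	if (p->has_waitevent_thread)
		p->daemon_state = FAILED;
}

/* The periodic path checker: the link is back up, so it reports PATH_UP. */
static void path_checker(struct toy_path *p)
{
	p->daemon_state = UP;  /* rightly so - the path itself is fine  */
	/* ...but nothing re-instates the failed dm path group.         */
}

int main(void)
{
	/* Disk added after multipathd started: no waitevent thread. */
	struct toy_path p = { UP, UP, 0 };

	link_blip(&p);
	path_checker(&p);

	/* The bug: daemon thinks everything is peachy, kernel still
	 * fails (or queues) the I/O. */
	assert(p.daemon_state == UP && p.kernel_state == FAILED);
	printf("daemon=UP, kernel=FAILED: states diverged\n");
	return 0;
}

With `has_waitevent_thread` set to 1, `link_blip()` propagates the failure
and the two views stay in sync - which is exactly what the patch below
restores for devices added after daemon startup.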
diff -uNpr multipath-tools-0.4.7.orig/multipathd/main.c multipath-tools-0.4.7/multipathd/main.c
--- multipath-tools-0.4.7.orig/multipathd/main.c	2009-02-06 14:15:20.000000000 -0500
+++ multipath-tools-0.4.7/multipathd/main.c	2009-02-06 14:27:22.000000000 -0500
@@ -345,6 +345,7 @@ ev_add_path (char * devname, struct vect
 	struct multipath * mpp;
 	struct path * pp;
 	char empty_buff[WWID_SIZE] = {0};
+	int start_waiter = 0;
 
 	pp = find_path_by_dev(vecs->pathvec, devname);
 
@@ -390,8 +391,11 @@ rescan:
 		mpp->action = ACT_RELOAD;
 	}
 	else {
-		if ((mpp = add_map_with_path(vecs, pp, 1)))
+		if ((mpp = add_map_with_path(vecs, pp, 1))) {
 			mpp->action = ACT_CREATE;
+			start_waiter = 1; /* We don't depend on ACT_CREATE, as domap
+					     will set it to ACT_NOTHING when complete. */
+		}
 		else
 			return 1; /* leave path added to pathvec */
 	}
@@ -432,7 +436,8 @@ rescan:
 
 	sync_map_state(mpp);
 
-	if (mpp->action == ACT_CREATE &&
+	if ((mpp->action == ACT_CREATE ||
+	     (mpp->action == ACT_NOTHING && start_waiter && !mpp->waiter)) &&
 	    start_waiter_thread(mpp, vecs))
 		goto out;
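One subtlety worth double-checking in the final condition: in C, `&&` binds
tighter than `||`, so `a || b && c` parses as `a || (b && c)`. Without
explicit parentheses grouping the two `mpp->action` tests, an ACT_CREATE map
would satisfy the condition without `start_waiter_thread()` ever being
reached. A standalone demonstration, nothing multipath-specific -
`fake_start_waiter` is a hypothetical stand-in that, like the real helper on
success, returns 0 and simply counts whether it was called:

#include <assert.h>
#include <stdio.h>

static int calls;

/* Hypothetical stand-in for start_waiter_thread(): returns 0 (success)
 * and records that it was actually invoked. */
static int fake_start_waiter(void)
{
	calls++;
	return 0;
}

int main(void)
{
	int is_create = 1, is_nothing_and_pending = 0;
	int hit;

	/* Ungrouped: parses as is_create ||
	 * (is_nothing_and_pending && fake_start_waiter()). */
	calls = 0;
	hit = (is_create || is_nothing_and_pending && fake_start_waiter());
	assert(hit == 1 && calls == 0); /* helper short-circuited away */

	/* Grouped: both action tests gate one call to the helper. */
	calls = 0;
	hit = ((is_create || is_nothing_and_pending) && fake_start_waiter());
	assert(hit == 0 && calls == 1); /* thread would now be started */

	printf("ungrouped skips the call; grouped makes it\n");
	return 0;
}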