Re: forcing a user

Phil Smith III Wed, 21 Dec 2005 04:49:40 -0800

"Huegel, Thomas" <[EMAIL PROTECTED]> wrote:
>Yes, of course development knows the ramifications of a VM IPL, but the
>problem has been around for forty(40) years. In my opinion VM is the best
>operating system there is period. But by now we should have the ability, for
>a permanently hung user, to use a 'FORCE UNCONDITIONAL' command to get rid
>of it or a 'LOGON UNCONDITIONAL' command to get the problem user back to an
>active state where it could be dealt with. It just seems like it is about
>time to fix it... I must apologize for having mentioned windows and VM in
>the same breath obviously there is no comparison.


Warning: Long post ahead!

And if 'twere that simple, IBM would have done it.  As someone who struggled 
with V/FORCE on SP and HPO for several years, then wrote V/FORCE-XA (-ESA) from 
scratch, I believe I'm qualified to weigh in on this topic.  

The problem isn't a poor design in VM.  Rather, it's a GOOD design decision: to 
leave a single virtual machine hung rather than crash the system.

Consider the case Richard Schuh mentions, where a deferred work count is 
non-zero.  It may be that the deferred work count is bogus, or it may be (as in 
the tape drive case) that the work WILL wake up at some point.

If CP were to let the virtual machine log off, then if the tape drive were 
readied, the CPEBK would pop and -- what?  System ABEND, most likely.  
(Probably due to a negative count, but again, that's deliberate -- at that 
point, all would be lost already, and you might as well take the ABEND as early 
as possible.)
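
To make that failure mode concrete, here is a toy model in Python. It is not CP code (CP is not written in Python, and every class, method, and field name below is invented for illustration). It only shows why a late completion against a machine that was torn down with work outstanding has nothing sensible to do except fail fast.

```python
class SystemAbend(Exception):
    """Stands in for a CP ABEND in this toy model."""


class VirtualMachine:
    """Hypothetical stand-in for a VMDBK; only the fields needed here."""

    def __init__(self, userid):
        self.userid = userid
        self.deferred_work = 0
        self.logged_on = True

    def start_io(self):
        # Work is now outstanding, e.g. waiting for a tape drive to be readied.
        self.deferred_work += 1

    def complete_io(self):
        # The deferred work "pops" and is charged back to the machine.
        self.deferred_work -= 1
        if self.deferred_work < 0:
            # The bookkeeping is already corrupt; stop as early as possible.
            raise SystemAbend(f"negative deferred work count for {self.userid}")

    def can_log_off(self):
        # The safe rule: refuse to log off while work is outstanding.
        return self.deferred_work == 0

    def unsafe_logoff(self):
        # The "just get rid of it" rule: wipe state regardless.
        self.deferred_work = 0
        self.logged_on = False


vm = VirtualMachine("TPFGUEST")   # made-up userid
vm.start_io()
hung = not vm.can_log_off()       # the safe rule leaves the machine hung

vm.unsafe_logoff()                # force it anyway
try:
    vm.complete_io()              # the tape drive is finally readied
    outcome = "ok"
except SystemAbend:
    outcome = "abend"
```

In this sketch the safe rule leaves one machine hung, while the unsafe rule turns a late completion into a system-wide failure.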

It's easy to say things like, "Well, put a userid in the CPEBK and verify it 
against the VMDBK, and ignore it if they don't match" -- but that opens up a 
can of worms wrt what ELSE is tied to that CPEBK.  And so forth (CPEBKs aren't 
the only culprits by far).  Anyway, the same userid might have the VMDBK again, 
in its new incarnation (yeah, you could use a TOD clock value, but the point 
is, it isn't simple.)  
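
A minimal sketch of that userid-versus-TOD point, again as a Python toy with invented names (the `Vmdbk` and `Cpebk` classes here carry only the two fields needed for the example, and a counter stands in for a TOD clock value):

```python
import itertools
from dataclasses import dataclass

_stamp = itertools.count(1)   # stands in for a TOD clock value


@dataclass
class Vmdbk:
    userid: str
    logon_stamp: int


@dataclass
class Cpebk:
    userid: str
    logon_stamp: int


directory = {}   # userid -> Vmdbk of the current incarnation


def logon(userid):
    block = Vmdbk(userid, next(_stamp))
    directory[userid] = block
    return block


def matches_by_userid(cpebk):
    # Naive check: is anyone logged on under this userid?
    return cpebk.userid in directory


def matches_by_stamp(cpebk):
    # Stricter check: is it the same incarnation that queued the work?
    owner = directory.get(cpebk.userid)
    return owner is not None and owner.logon_stamp == cpebk.logon_stamp


old = logon("PVM")
pending = Cpebk(old.userid, old.logon_stamp)   # work queued by the old machine
del directory["PVM"]                           # forced off with work pending
logon("PVM")                                   # same userid, new incarnation

userid_check = matches_by_userid(pending)   # accepts the stale block
stamp_check = matches_by_stamp(pending)     # rejects it
```

Even the stricter check only says "this block is stale"; it does nothing about whatever else is tied to that block, which is the can of worms.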

The Windows approach (I suspect; can't claim to have inside knowledge of the 
code!) would be to ignore things that don't make sense.  (This is how Windows 
applications get hung, I assume.)  The alternative would be a BSOD.  On a 
(mainly) single-user system, it makes sense to take this approach; similarly, 
on a VM system designed for 100K users, you hang the userid.

Having said that, V/FORCE-XA (-ESA) took what I thought and think was a logical 
3-step approach:
Step 0: look at the VMDBK and copy info, in case we crash the system ;-)
Step 1: Ask CP nicely to log the virtual machine off; set a timer for 15 
seconds and spin, waiting for the machine to log off; if it logs off, cancel 
the timer and declare victory
Step 2: If it does not, strip resources *as much as is safe*, again under timer 
control (the definition of "safe" evolved over time, of course; this included 
making minidisks R/O, DETACHing devices that had real resources associated, 
etc., and it did the devices believed least likely to cause problems first, 
i.e., tape drives last!)
Step 3: At the end of that stripping process, or when the stripping process 
hung, rename the VMDBK (including fixing up the hash table)
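
The steps above can be sketched as a small simulation. This is an illustration of the escalation logic only, not the V/FORCE implementation: the classes, the `risk` ordering, the `strippable` flag (which models a strip that hangs), and the fake clock are all assumptions made so the example runs on its own. The ABEND outcome is deliberately not modelled.

```python
import itertools
from dataclasses import dataclass, field


class FakeClock:
    """Deterministic clock so the sketch runs without real waiting."""

    def __init__(self):
        self.now = 0

    def tick(self):
        self.now += 1


@dataclass
class Resource:
    name: str
    risk: int                 # lower = believed less likely to cause trouble
    strippable: bool = True   # False models a strip that hangs
    stripped: bool = False


@dataclass
class Machine:
    userid: str
    resources: list = field(default_factory=list)
    honours_logoff: bool = False
    logoff_requested: bool = False

    def logged_off(self):
        return self.logoff_requested and self.honours_logoff


def wait_for(condition, clock, timeout):
    """Poll until condition() holds or `timeout` ticks have elapsed."""
    deadline = clock.now + timeout
    while clock.now < deadline:
        if condition():
            return True
        clock.tick()
    return condition()


def vforce(vm, clock, names, timeout=15):
    # Step 0: copy what we know, in case things go badly later.
    snapshot = (vm.userid, [r.name for r in vm.resources])

    # Step 1: ask nicely, then wait under timer control.
    vm.logoff_requested = True
    if wait_for(vm.logged_off, clock, timeout):
        return "logged off", snapshot

    # Step 2: strip resources, least risky first (tape drives last).
    complete = True
    for res in sorted(vm.resources, key=lambda r: r.risk):
        if not res.strippable:
            complete = False   # stripping hung; give up and go to step 3
            break
        res.stripped = True

    # Step 3: rename the machine so the original userid can log on again.
    vm.userid = next(names)
    return ("stripped of all" if complete else "stripped of most"), snapshot


names = (f"VFORCE{n:02d}" for n in itertools.count(1))
clock = FakeClock()
tape = Resource("tape drive", risk=9, strippable=False)
disk = Resource("minidisk", risk=1)
hung_vm = Machine("PVM", [tape, disk])
result, saved = vforce(hung_vm, clock, names)

clean_vm = Machine("GOODGUY", honours_logoff=True)
clean_result, _ = vforce(clean_vm, FakeClock(), names)
```

Here the hung machine ends up renamed with its minidisk stripped and its tape drive still attached, while the well-behaved machine simply logs off in step 1.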

Thus VFORCEing a userid wound up with one of several results:
1) The virtual machine logged off cleanly (in the simplest case, because it 
essentially just did a CP FORCE)
2) The virtual machine didn't log off, but was stripped of all resources and 
left as VFORCEnn
3) The virtual machine didn't log off, but was stripped of MOST resources and 
left as VFORCEnn
4) CP took an ABEND and you got to IPL anyway

Supporting V/FORCE-XA (-ESA) as CP evolved was a struggle on the knife-edge 
between cases (3) and (4): too aggressive and (4) occurred, too conservative 
and (3) was the case.

History tells the tale: V/FORCE became obsolete in the market because IBM did 
such a good job of reducing the incidence of hung users.  Of course they flare 
up from time to time, followed by an APAR that fixes the new case.

It has occurred to me (and others, including IBM, I'm 110% sure) that a 
RENAMEVM command that simply renamed a VMDBK would be one solution.  I assume 
that IBM have chosen to fix the actual problems rather than create a workaround 
that would assist in only some fraction of the cases: Richard, for example, 
likely wouldn't be satisfied with getting his TPF guest back but losing the 
tape drives etc. that the previous one was using (nor should he be).

I believe that some other changes were made to CP to allow, for example, LOCATE 
commands against a VMDBK that's in LOGOFF/FORCE PENDING.  Before that, once 
you'd said FORCE, you couldn't even find the sucker, which made it harder to 
look at with TRACK et al.

ObAnecdote: Back when VM/XA SP 1 was in ESP, I spent a week at Virginia Tech 
writing V/FORCE-XA.  First a colleague and I went down for a day to look at XA 
and think about what was even possible.  While we were there, PVM hung, and the 
VT folks asked if we wanted to tinker before they IPLed.  With some SWAG VMDBK 
changes, I was able to get the VMDBK logged off, but logging it back on didn't 
restore PVM operation, and we wound up IPLing anyway.  It wasn't until several 
weeks later that I discovered that at the time, HCP had a limit of 8 LDEV 
hosts, and one of the slots was reserved for PVM: I'd gotten the virtual 
machine logged off, but the slot was still marked as 'in use', so it couldn't 
run.  Boy, that was a fun adventure, though!  The V/FORCE-XA that came out of 
the week there was complete and, if I do say so myself, remarkably stable 
(especially compared to the SP/HPO version); while it had a number of fixes 
over the years, they were all to deal with changes to CP, not to fix flaws in 
the base implementation.  I'm pretty proud of that.

...phsiii
