"Huegel, Thomas" <[EMAIL PROTECTED]> wrote: >Yes, of course development knows the ramifications of a VM IPL, but the >problem has been around for forty(40) years. In my opinion VM is the best >operating system there is period. But by now we should have the ability, for >a permanently hung user, to use a 'FORCE UNCONDITIONAL' command to get rid >of it or a 'LOGON UNCONDITIONAL' command to get the problem user back to an >active state where it could be dealt with. It just seems like it is about >time to fix it... I must apologize for having mentioned windows and VM in >the same breath obviously there is no comparison.

Warning: Long post ahead!

And if 'twere that simple, IBM would have done it. As someone who struggled with V/FORCE on SP and HPO for several years, then wrote V/FORCE-XA (-ESA) from scratch, I believe I'm qualified to weigh in on this topic.

The problem isn't a poor design in VM. Rather, it's a GOOD design decision: to leave a single virtual machine hung rather than crash the system.

Consider the case Richard Schuh mentions, where a deferred work count is non-zero. It may be that the deferred work count is bogus, or it may be (as in the tape drive case) that the work WILL wake up at some point. If CP were to let the virtual machine log off, then if the tape drive were readied, the CPEBK would pop and -- what? System ABEND, most likely. (Probably due to a negative count, but again, that's deliberate -- at that point, all would be lost already, and you might as well take the ABEND as early as possible.)

It's easy to say things like, "Well, put a userid in the CPEBK and verify it against the VMDBK, and ignore it if they don't match" -- but that opens up a can of worms wrt what ELSE is tied to that CPEBK. And so forth (CPEBKs aren't the only culprits by far). Anyway, the same userid might have the VMDBK again, in its new incarnation (yeah, you could use a TOD clock value, but the point is, it isn't simple).

The Windows approach (I suspect; can't claim to have inside knowledge of the code!) would be to ignore things that don't make sense. (This is how Windows applications get hung, I assume.) The alternative would be a BSOD. On a (mainly) single-user system, it makes sense to take this approach; similarly, on a VM system designed for 100K users, you hang the userid.

Having said that, V/FORCE-XA (-ESA) took what I thought and think was a logical 3-step approach:

Step 0: Look at the VMDBK and copy info, in case we crash the system ;-)
Step 1: Ask CP nicely to log the virtual machine off; set a timer for 15 seconds and spin, waiting for the machine to log off; if it logs off, cancel the timer and declare victory
Step 2: If it does not, strip resources *as much as is safe*, again under timer control (the definition of "safe" evolved over time, of course; this included making minidisks R/O, DETACHing devices that had real resources associated, etc., and it did the devices believed least likely to cause problems first, i.e., tape drives last!)
Step 3: At the end of that stripping process, or when the stripping process hung, rename the VMDBK (including fixing up the hash table)

Thus VFORCEing a userid wound up with one of several results:

1) The virtual machine logged off cleanly (in the simplest case, because it essentially just did a CP FORCE)
2) The virtual machine didn't log off, but was stripped of all resources and left as VFORCEnn
3) The virtual machine didn't log off, but was stripped of MOST resources and left as VFORCEnn
4) CP took an ABEND and you got to IPL anyway

Supporting V/FORCE-XA (-ESA) as CP evolved was a struggle on the knife-edge between cases (3) and (4): too aggressive and (4) occurred, too conservative and (3) was the case.

History tells the tale: V/FORCE became obsolete in the market because IBM did such a good job of reducing the incidence of hung users. Of course they flare up from time to time, followed by an APAR that fixes the new case.

It has occurred to me (and others, including IBM, I'm 110% sure) that a RENAMEVM command that simply renamed a VMDBK would be one solution.
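
For readers who think better in code, the escalation described above (snapshot, ask nicely under a timer, strip the least risky resources first, then rename) can be sketched roughly as follows. This is only an illustrative toy in Python, not V/FORCE or CP code: the Guest class, the RISK ranking, the request_logoff and release callbacks, and the outcome strings are all hypothetical names made up for the sketch. Only the 15-second timer, "tape drives last", and the VFORCEnn rename come from the description above. Outcome (4), the ABEND, is deliberately not modelled.

```python
import time
from dataclasses import dataclass, field

# Hypothetical risk ranking: only "tape drives last" comes from the post;
# the relative order of the other kinds is an assumption for illustration.
RISK = {"minidisk": 0, "device": 1, "tape": 2}


@dataclass
class Guest:
    """Toy stand-in for a virtual machine's control block (not a real VMDBK)."""
    userid: str
    resources: list = field(default_factory=list)  # (kind, name) pairs
    logged_off: bool = False


def wait_for(predicate, timeout, poll=0.01):
    """Poll predicate until it is true or the timeout (seconds) expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll)
    return predicate()


def vforce(guest, request_logoff, release, timeout=15.0, seq=1):
    """Escalate from a polite logoff to stripping resources and renaming.

    request_logoff(guest) asks for a normal logoff. release(guest, res)
    returns False when a resource cannot be released; a real implementation
    would run each release under its own timer, which is simplified away here.
    """
    # Step 0: copy the interesting state before touching anything.
    snapshot = (guest.userid, tuple(guest.resources))

    # Step 1: ask nicely, then wait under timer control.
    request_logoff(guest)
    if wait_for(lambda: guest.logged_off, timeout):
        return "logged_off", snapshot

    # Step 2: strip resources, least risky first; stop at the first failure.
    for res in sorted(guest.resources, key=lambda r: RISK.get(r[0], 1)):
        if not release(guest, res):
            break
        guest.resources.remove(res)

    # Step 3: rename the hung guest so the original userid can be reused.
    guest.userid = f"VFORCE{seq:02d}"
    outcome = "stripped_most" if guest.resources else "stripped_all"
    return outcome, snapshot
```

The three return values correspond to results (1), (2) and (3) in the list above. The knife-edge between (3) and (4) lives entirely inside what the release callback is willing to attempt.
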
I assume that IBM have chosen to fix the actual problems rather than create a workaround that would assist in only some fraction of the cases: Richard, for example, likely wouldn't be satisfied with getting his TPF guest back but losing the tape drives etc. that the previous one was using (nor should he be).

I believe that some other changes were made to CP to allow, for example, LOCATE commands against a VMDBK that's in LOGOFF/FORCE PENDING. Before that, once you'd said FORCE, you couldn't even find the sucker, which made it harder to look at with TRACK et al.

ObAnecdote: Back when VM/XA SP 1 was in ESP, I spent a week at Virginia Tech writing V/FORCE-XA. First a colleague and I went down for a day to look at XA and think about what was even possible. While we were there, PVM hung, and the VT folks asked if we wanted to tinker before they IPLed. With some SWAG VMDBK changes, I was able to get the VMDBK logged off, but logging it back on didn't restore PVM operation, and we wound up IPLing anyway. It wasn't until several weeks later that I discovered that at the time, HCP had a limit of 8 LDEV hosts, and one of the slots was reserved for PVM: I'd gotten the virtual machine logged off, but the slot was still marked as 'in use', so it couldn't run. Boy, that was a fun adventure, though!

The V/FORCE-XA that came out of the week there was complete and, if I do say so myself, remarkably stable (especially compared to the SP/HPO version); while it had a number of fixes over the years, they were all to deal with changes to CP, not to fix flaws in the base implementation. I'm pretty proud of that.

...phsiii
