Historical details: the cluster was originally set up with GPFS
adminMode=allToAll, back in 2009.
All nodes can still ssh as root to any other node, but we changed the GPFS
cluster configuration to adminMode=central a few years ago.
Some day we may want to tighten it down, but not now.

Originally /var/mmfs/gen was copied into the diskless boot image, but this
became painful as we continued to add a few dozen new nodes every few months,
since each addition meant respinning the diskless images to update the SDR.

Nodes in the old cluster continue to work fine with xCAT 2.8.x, CentOS 6.7,
and GPFS 4.2.2-1.
New cluster nodes exhibit this problem with xCAT 2.13.1, rhels7.2, and
GPFS 4.2.2-1.

So, did xCAT formerly run postscripts in a pseudo-tty environment, and does it
no longer do so?
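
One way to check this from inside a postscript is to record whether standard
input is attached to a terminal. The sketch below is only a diagnostic under
that assumption: tty_status is a hypothetical helper name, not part of xCAT or
GPFS, and it does not show what mmsdrrestore itself checks.

```shell
#!/bin/sh
# Hypothetical diagnostic helper (not part of xCAT or GPFS):
# report whether stdin (file descriptor 0) is attached to a terminal.
tty_status() {
    if [ -t 0 ]; then
        echo "stdin is a terminal"
    else
        echo "stdin is not a terminal"
    fi
}

# Print the status for the current environment.
tty_status
```

Calling tty_status near the top of the sitespecific postscript on both the old
and new clusters, and comparing the output, would show whether the two xCAT
versions differ in this respect.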

The goal is to have each compute node be able to boot up diskless and rejoin
GPFS at will, without any manual intervention.
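
If the failure turns out to be a timing issue (for example, the restore
running before the ssh keys are fully in place), a retry wrapper in the
postscript is one possible workaround. This is a minimal sketch under that
assumption; retry is a hypothetical helper, and it would not help if
mmsdrrestore is refusing to run because there is no active terminal.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to <attempts> times,
# sleeping <delay> seconds between failed attempts.
# Usage: retry <attempts> <delay> <command> [args...]
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        # Success on any attempt ends the loop immediately.
        if "$@"; then
            return 0
        fi
        # Give up once the attempt limit has been reached.
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}
```

In the postscript this could wrap the existing command, e.g.
retry 5 30 /usr/lpp/mmfs/bin/mmsdrrestore -p ut003 -R /usr/bin/scp
where the attempt count and delay are arbitrary illustrative values.
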
> On Feb 7, 2017, at 2:02 PM, David D. Johnson <david_john...@brown.edu> wrote:
> 
> Drilling down deeper, seems to be two different situations.
> 
> On nodes without X11, the remoteshell script takes less than a second:
> -rw------- 1 root root 1231 Feb  7  2017 authorized_keys
> -rw------- 1 root root 1675 Feb  7  2017 id_rsa
> -rw------- 1 root root  410 Feb  7  2017 id_rsa.pub
> 7cb5ab60ff42ede791c823afd016997d  /root/.ssh/authorized_keys
> 13f430f0001adff42dc250f818eabbd1  /root/.ssh/id_rsa
> 3f5101404ac152d4aaea6c62f7eb6e30  /root/.ssh/id_rsa.pub
> 
> However, later in the script, while setting up to start GPFS, I get this
> message:
> 
> Install: recovering gpfs sdr
> Tue Feb  7 18:20:28 UTC 2017: mmsdrrestore: Processing node gpu002
> mmsdrrestore: Run the command from an active terminal or enable global 
> passwordless access.
> mmsdrrestore: Unable to retrieve GPFS cluster files from node 
> ut002.oscar.ccv.brown.edu
> mmsdrrestore: File /var/mmfs/ssl/stage/genkeyData1 not found.
>    Use mmauth genkey to recover the file, or to generate and commit a new key.
> mmsdrrestore: Unexpected error from updateMmfsEnvironment.  Return code: 1
> mmsdrrestore: Command failed. Examine previous error messages to determine 
> cause.
> 
> 
> If I copy and paste the command from the postscript file and run it from an
> ssh login, I get:
> [root@gpu002 xcat]# /usr/lpp/mmfs/bin/mmsdrrestore -p ut003 -R /usr/bin/scp
> Tue Feb  7 18:26:10 UTC 2017: mmsdrrestore: Processing node gpu002
> Warning: Permanently added 'ut002.oscar.ccv.brown.edu' (RSA) to the list of known hosts.
> mmsdrrestore: Node gpu002 successfully restored.
> 
> There is no difference in the /root/.ssh files before or after. Why does it 
> work by hand, but not from inside the script?
> 
> Found that on nodes with X11, the remoteshell script was taking 12 minutes to
> run to “completion”, and the result is a zero-length id_rsa.pub file.
> 
> -rw------- 1 root root 821 Feb  7 12:59 authorized_keys
> -rw------- 1 root root   0 Feb  7 13:09 id_rsa.pub
> -rw-r--r-- 1 root root 183 Feb  7 13:02 known_hosts
> 4cd344ed6d3721a283f442977862b981  /root/.ssh/authorized_keys
> d41d8cd98f00b204e9800998ecf8427e  /root/.ssh/id_rsa.pub
> a178f5a553c74d99590b2047d9517363  /root/.ssh/known_hosts
> 
> I thought it was NetworkManager, but it turns out it was firewalld.
> (chroot . systemctl disable firewalld )
> 
> — ddj
> 
>> On Feb 7, 2017, at 6:35 AM, David D Johnson <david_john...@brown.edu> wrote:
>> 
>> That was already the case (IP of mgt1 and IP of mgt[2] are the forwarders).
>> I don't believe it will forward requests within the zones for which it is
>> authoritative.
>> I ended up using tabdump to recreate the hosts and nodelist tables. Mostly 
>> good.
>> 
>> Now the problem of the day is fixing the SSH credentials so that all the 
>> diskless nodes booting off the
>> new frontend can get root access to all the nodes still booted off the old 
>> frontend.  Need this
>> especially for GPFS.  I've been trying to follow what's going on in the 
>> remoteshell postscript,
>> and I'm wondering if my "sitespecific" postscript is running before 
>> "remoteshell" has completed.
>> Is there a way to determine/force the order the postscripts are executed?  
>> Sitespecific comes after remoteshell both alphabetically and in the lsdef
>> output.
>> The basic problem is that mmsdrrestore fails during sitespecific, but works 
>> fine when I try it again later by hand.
>> 
>>  -- ddj
>> Dave Johnson
>> Brown University
>> 
>>> On Feb 7, 2017, at 4:32 AM, Er Tao Zhao <erta...@cn.ibm.com> wrote:
>>> 
>>> Hi, David
>>>  
>>> Will you pls try 'chdef -t site forwarders=<ip_of_mgt1>' and then 'makedns' 
>>> to use mgt1 as your remote DNS server.
>>> Pls feel free to let me know if there are any more issues.
>>>  
>>> Thx!
>>> Best Regards,
>>> -----------------------------------
>>> Zhao Er Tao
>>> 
>>> IBM China System and Technology Laboratory, Beijing
>>> Tel:(86-10)82450485
>>> Email: erta...@cn.ibm.com
>>> Address: 1/F, 28 Building,ZhongGuanCun Software Park,
>>> No.8 DongBeiWang West Road, Haidian District,
>>> Beijing, 100193, P.R.China
>>>  
>>>  
>>> ----- Original message -----
>>> From: "David D. Johnson" <david_john...@brown.edu>
>>> To: "xcat-user@lists.sourceforge.net" <xcat-user@lists.sourceforge.net>
>>> Cc:
>>> Subject: [xcat-user] upgrading xCAT onto new servers
>>> Date: Sat, Feb 4, 2017 3:04 AM
>>>  
>>> We’re upgrading cluster mgt node hardware and software at the same time, 
>>> going from 2.8.3 to 2.13.1,
>>> and from centos6.7 to rhels7.2.   I have the new frontend installed and 
>>> somewhat functional.
>>> Right now I need to clone the DNS / named setup from “mgt1”, which is still
>>> authoritative for the production cluster.
>>> I could just tabdump hosts and nodelist and do makedns on “mgt5”, or I’m 
>>> thinking there might be a way to make
>>> the new mgt5 a slave to the existing named running on mgt1.   Any 
>>> pros/cons?  What would you do?
>>> 
>>> Thanks,
>>> 
>>>  — ddj
>>> ------------------------------------------------------------------------------
>>> Check out the vibrant tech community on one of the world's most
>>> engaging tech sites, SlashDot.org <http://slashdot.org/>! 
>>> http://sdm.link/slashdot <http://sdm.link/slashdot>
>>> _______________________________________________
>>> xCAT-user mailing list
>>> xCAT-user@lists.sourceforge.net <mailto:xCAT-user@lists.sourceforge.net>
>>> https://lists.sourceforge.net/lists/listinfo/xcat-user 
>>> <https://lists.sourceforge.net/lists/listinfo/xcat-user>
>>>  
>>> 
>> 
> 
