Rick Romero wrote:
> Jim Dunham wrote:
>> Rick,
>>
>>> Jim Dunham wrote:
>>>> Rick,
>>>>
>>>>> I followed Jim Dunham's AVS & ZFS seamless guide on OpenSolaris
>>>>> 2008.11, and I'm running into a problem. Actually, I ran into a
>>>>> few problems, but this is where I'm really stuck :)
>>>>>
>>>>> Both nodes' /var/adm/ds.log show the same errors for each disk:
>>>>> Jan 19 15:37:08 librdc: SNDR: Could not open file
>>>>> sysvoltwo:/dev/rdsk/c4d0s0 on remote node
>>>>> Jan 19 15:37:09 sndr: SNDR: Could not open file
>>>>> sysvoltwo:/dev/rdsk/c5d0s0 on remote node
>>>>
>>>> SNDR is a client/server replication model, and thus all of AVS
>>>> must be running on both nodes involved in replication. This can
>>>> be verified by running "dscfgadm -i" and ensuring there are no
>>>> errors. If there are errors, "dscfgadm -d" (disable), followed
>>>> by "dscfgadm -e" (enable), should resolve them. Check
>>>> "dscfgadm -i" one more time.
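To spell that out as an actual sequence - a sketch only, to be run on each node (sysvolone and sysvoltwo are the hostnames in play here):

# dscfgadm -i                        # check the current state of the AVS services
# dscfgadm -d                        # disable all AVS services
# dscfgadm -e                        # re-enable them
# dscfgadm -i                        # verify everything comes back online, error free
# rpcinfo -T tcp sysvoltwo 100143    # from the peer, confirm SNDR (program 100143) answers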
>>> My dscfgadm -i appears to be good. I didn't post both nodes' log
>>> outputs because I didn't want this to get too big, but here are
>>> both dscfgadm -i outputs.
>>
>> That's OK.
>>
>>> sysvolone:~# dscfgadm -i
>>> SERVICE          STATE     ENABLED
>>> nws_scm          online    true
>>> nws_sv           online    true
>>> nws_ii           online    true
>>> nws_rdc          online    true
>>> nws_rdcsyncd     online    true
>>>
>>> Availability Suite Configuration:
>>> Local configuration database: valid
>>>
>>> sysvoltwo:~# dscfgadm -i
>>> SERVICE          STATE     ENABLED
>>> nws_scm          online    true
>>> nws_sv           online    true
>>> nws_ii           online    true
>>> nws_rdc          online    true
>>> nws_rdcsyncd     online    true
>>>
>>> Availability Suite Configuration:
>>> Local configuration database: valid
>>>
>>>>
>>>>> I ran rpcinfo -p on each node and they're identical:
>>>>
>>>> From rpcinfo(1M), the following command syntax is covered in the
>>>> AVS troubleshooting guide (819-6151-10):
>>>>
>>>> # rpcinfo -T tcp node1 100143
>>>>
>>>> rpcinfo -T transport host prognum [versnum]
>>>>
>>>> SNDR's program number is 100143.
>>>
>>> Yes, I did do this, and version 4 'failed' on both nodes (old
>>> documentation, I assumed):
>>
>> That is correct. Versions 5, 6, & 7 are the currently supported
>> versions for interoperability, and you were looking at outdated
>> docs.
>>
>> Version 5 = Solaris 2.6 & 7         - 32-bit data path, 32-bit kernel
>> Version 6 = Solaris 8 & 9           - 64-bit data path, 64-bit kernel,
>>                                       disk queues, multiple async
>>                                       flusher threads
>> Version 7 = Solaris 10, OpenSolaris - adds x86 / x64 replication
>>                                       support in addition to SPARC
>>
>> The place where this is defined is as follows:
>>
>> http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/avs/ns/rdc/rdc_prot.x#391
>>
>> All of the AVS source code:
>>
>> http://cvs.opensolaris.org/source/search?q=&path=%2Favs%2F&project=%2Fonnv
>>
>>> # rpcinfo -T tcp sysvolone 100143 4
>>> rpcinfo: RPC: Program/version mismatch; low version = 5, high
>>> version = 7
>>> program 100143 version 4 is not available
>>>
>>> But, as shown by the mismatch error, version 7 does work - on both
>>> nodes:
>>> # rpcinfo -T tcp sysvolone 100143 7
>>> program 100143 version 7 ready and waiting
>>>
>>>>
>>>>> rpcinfo -p sysvoltwo
>>>>>    program  vers  proto   port  service
>>>>>     100000     4    tcp    111  rpcbind
>>>>>     100000     3    tcp    111  rpcbind
>>>>>     100000     2    tcp    111  rpcbind
>>>>>     100000     4    udp    111  rpcbind
>>>>>     100000     3    udp    111  rpcbind
>>>>>     100000     2    udp    111  rpcbind
>>>>>     100229     1    tcp  62457  metad
>>>>>     100229     2    tcp  62457  metad
>>>>>     100143     5    tcp    121
>>>>>     100143     6    tcp    121
>>>>>     100143     7    tcp    121
>>>>>
>>>>> Originally, I couldn't connect with rpcinfo at all, and then I was
>>>>> missing port 121 on one node - but I've fixed those services and I
>>>>> turned off the 'local only' setting for the rpc/bind service.
>>>>
>>>> I am concerned about the above statement. There is never a need
>>>> for a system admin to use rpcinfo on behalf of AVS (SNDR). I am
>>>> therefore concerned you have made incompatible changes.
>>>
>>> There were two parts to this - and I probably did it backwards,
>>> because my dscfgadm was buggy (had to fix line 1020).
>>
>> Can you show me the actual change?
>
> Sure. On line 1020:
> Change: typeset svc=$1
> to:     typeset svc='$1'
>
> This was from:
> http://www.opensolaris.org/jive/thread.jspa?messageID=307817
>
>>
>>> I didn't fix it until afterwards. First I tried to connect to
>>> port 121 via telnet on each node, but one node didn't respond. I
>>> noticed the nws_rdc and nws_rdcsyncd services weren't running (as
>>> I tried to figure out what service bound to what port), so I
>>> manually added those with 'svcadm enable'. I was then able to
>>> connect to port 121, but it still wasn't working. I came across a
>>> thread that mentioned using rpcinfo -p to check the services, but
>>> they wouldn't respond, which led me to the 'local only' setting
>>> for the rpc/bind service. That's all I changed for that - local
>>> to public. I would think if I did anything wrong it would be the
>>> manual service enable.
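For reference, the 'local only' change described above is normally just an SMF property on rpc/bind, so it can be checked (and reverted) cleanly. A sketch, assuming the standard config/local_only property is what was changed:

# svcprop -p config/local_only svc:/network/rpc/bind:default     # 'true' means local requests only
# svccfg -s svc:/network/rpc/bind setprop config/local_only=false
# svcadm refresh svc:/network/rpc/bind:default

Remote rpcinfo queries need local_only set to false, which is likely why they wouldn't respond before the change.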
>>
>> This is my concern with 'dscfgadm': there should be no reason to
>> alter the shell script, and if there is, I will need to research
>> why there is a regression, create a CR, and get the change back
>> into the gate.
>
> Excellent.

I have an outstanding bug as to 'why' this typeset error happens:
http://defect.opensolaris.org/bz/show_bug.cgi?id=967

>
>>
>>>>> So this is where I'm stuck. I'm a Solaris newbie, and I'm
>>>>> finding it a little difficult because things like the AVS
>>>>> Troubleshooting guide just give commands to run - but I don't
>>>>> know what output I'm looking for.
>>>>
>>>> The encapsulation of AVS startup and shutdown into 'dscfgadm' is
>>>> an improvement over prior versions. If 'dscfgadm -i' comes back
>>>> with errors, one can run 'dscfgadm -i -x' to get a look inside
>>>> the script at which operations are failing.
>>>
>>> The script that comes with OpenSolaris is busted, and I didn't
>>> stumble across a fix until I already had everything else fixed. :(
>>> But now it definitely looks clean - right?
>>
>> The point at which you first detected the busted script is when
>> you should have sought help. Altering the script raises the
>> question of what is, and is not, working correctly.
>
> I don't disagree, but that seemed minor compared to the other
> OpenSolaris issues I had... static IP changes broken, GUI keyboard
> layout broken (I physically had no mouse initially), compilation
> issues with certain programs and all Perl modules, what the hell is
> this metadb thing - slices? what are those? Aw hell, OS used the
> entire Solaris partition for the root slice, etc., etc. :) Your
> blog, which is what I based my testing off of, doesn't even mention
> dscfgadm except in a comment.

My blog pre-dates OpenSolaris 2008.xx, which is where the typeset
defect exists. In other distributions based off of

> Granted, I realize you assume some existing familiarity with the
> product/environment, but your example was the only one I found that
> did what I was looking for. If I wanted software raid, I'd use ZFS,
> and if I wanted plain old replication, I'd use ZFS. Your example
> had the end result I was looking for - Active/Active replication
> between servers which did not need to be in the same office. I
> didn't just start this yesterday, it's been a journey :)
>
> I realize I learn awkwardly (I take something advanced and make it
> work), but I'm just explaining why I didn't stop with the dscfgadm
> issue. I have no problem wiping and reinstalling if you think I
> should do that.

No problem. Do realize that as one deploys new and evolving
technologies, there is a higher probability of defects. Given that,
it takes a Google search and, if nothing turns up, a bug entered for
the specific defect found.

>
>>
>>>>> The above output looks fine to me, but am I missing something
>>>>> else?
>>>>
>>>> There are two places, on either the SNDR primary or SNDR
>>>> secondary node, where error messages are logged on behalf of
>>>> AVS: /var/adm/messages and /var/svc/log/*nws_*
>>>
>>> Ahh, this is what I was looking for, I just didn't know where.
>>> I'm not sure it helps though; the nws_scm service is the only one
>>> with anything odd:
>>> [ Jan 19 15:27:54 Executing start method ("/lib/svc/method/svc-scm
>>> start"). ]
>>> scmadm: cache enable failed
>>> SDBC: Cache enable failed.
>>
>> The SDBC (Solaris Data Buffer Cache) is used by AVS to cache the
>> contents of bitmap volumes. Having this fail will prevent one side
>> (or the other) of an SNDR replica from running. This raises the
>> question of why "/lib/svc/method/svc-scm start" failed, but
>> 'dscfgadm -i' did not list it as such.
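To chase that down, this is where I would look - a rough sketch, and note that the exact SMF log file name under /var/svc/log may differ on your install:

# svcs -xv nws_scm                                    # SMF's explanation, if the service is unhealthy
# tail -50 /var/svc/log/system-nws_scm:default.log    # the log that captured "SDBC: Cache enable failed"
# egrep -i 'sdbc|sndr' /var/adm/messages              # kernel-side AVS messages
# svcadm restart nws_scm                              # only after the underlying cause is corrected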
> Ahhhh, OK. Both sides have that error, so it seems to me the manual
> service enable shouldn't have been the cause (as the other node was
> set up fine on its own). I see there are 4 packages, and I believe
> the reason the single node didn't work properly is because I
> installed SUNWii before anything else. That's because the package
> manager doesn't show descriptions until after an update, and I just
> started the AVS install before I did the update on that node (I
> couldn't recall what I did to get the descriptions - if it was a
> partial install or a full update). I searched for wii, then 'avail'
> to do the install, not a search for each package.
>
> Based on the following bug,
> http://defect.opensolaris.org/bz/show_bug.cgi?id=5115
> it looks like those 4 are all I need - correct? They are:
> SUNWii
> SUNWrdc
> SUNWscm
> SUNWspsv

Yes. The packages needed are those above, but they must be acquired
one at a time and installed in the following order:

SUNWscm
SUNWspsv
SUNWrdc
SUNWii

See: http://defect.opensolaris.org/bz/show_bug.cgi?id=5115

> This is my progress and documentation if you're interested:
> http://havokmon.blogspot.com/
>
> I've restarted the nws_scm services, and now did a sndradm -u on
> each node. Both nodes' ds.log are logging 'could not open file..'
> and end with 'Sync Ended', but the scm log doesn't have any cache
> errors this time. The other nws services haven't logged anything at
> all. I'm sure that clears things up :)
>
> Thanks,
> Rick
>>
>>> All the other services appear normal, and identical:
>>> [ Jan 19 15:27:44 Enabled. ]
>>> [ Jan 19 15:27:55 Executing start method ("/lib/svc/method/svc-sv
>>> start"). ]
>>> [ Jan 19 15:27:56 Method "start" exited with status 0. ]
>>>
>>> Thanks,
>>> Rick
>>
>> Jim Dunham
>> Engineering Manager
>> Storage Platform Software Group
>> Sun Microsystems, Inc.

- Jim
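P.S. For completeness, that install order as explicit commands - a sketch, assuming those four package names are what pkg(1) publishes for OpenSolaris 2008.11 (I have not re-run this exact sequence):

# pkg install SUNWscm
# pkg install SUNWspsv
# pkg install SUNWrdc
# pkg install SUNWii
# dscfgadm -e    # bring the AVS services up once all four packages are in place
# dscfgadm -i    # verify everything is online before enabling any SNDR sets

Once both nodes come up clean, 'sndradm -P' gives a quick summary of each configured set's state, and 'dsstat' should show progress while a sync is running - an easier check than tailing ds.log.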
