Hi Jim,
Thanks for your reply.
Jim Dunham wrote:
> Rick,
>
>> I followed Jim Dunham's AVS & ZFS seamless guide on OpenSolaris 2008.11,
>> and I'm running into a problem. Actually, I ran into a few problems,
>> but this is where I'm really stuck :)
>>
>> Both nodes /var/adm/ds.log show the same errors for each disk:
>> Jan 19 15:37:08 librdc: SNDR: Could not open file
>> sysvoltwo:/dev/rdsk/c4d0s0 on remote node
>> Jan 19 15:37:09 sndr: SNDR: Could not open file
>> sysvoltwo:/dev/rdsk/c5d0s0 on remote node
>
> SNDR is a client / server replication model, and thus all of AVS must
> be running on both nodes involved in replication. This can be verified
> by running "dscfgadm -i", and ensuring there are no errors. If there
> are errors, "dscfgadm -d" (disable), followed by "dscfgadm -e"
> (enable), should resolve all errors. Check "dscfgadm -i" one more time.
My dscfgadm -i output appears to be good. I didn't post the log output
from both nodes because I didn't want this to get too big, but here is
the dscfgadm -i output from each node.
sysvolone:~# dscfgadm -i
SERVICE STATE ENABLED
nws_scm online true
nws_sv online true
nws_ii online true
nws_rdc online true
nws_rdcsyncd online true
Availability Suite Configuration:
Local configuration database: valid
sysvoltwo:~# dscfgadm -i
SERVICE STATE ENABLED
nws_scm online true
nws_sv online true
nws_ii online true
nws_rdc online true
nws_rdcsyncd online true
Availability Suite Configuration:
Local configuration database: valid
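In case it helps anyone else comparing two nodes, this check could be
scripted. A minimal sketch, assuming the output keeps the three-column
SERVICE / STATE / ENABLED layout shown above; the function name
avs_not_online is my own and is not part of AVS:

```shell
# Hypothetical helper, not an AVS tool.
# Reads "dscfgadm -i" output on stdin and prints the name of every
# nws_* service whose STATE is not "online" or whose ENABLED is not "true".
# Assumes the SERVICE STATE ENABLED column layout shown in this thread.
avs_not_online() {
    awk '$1 ~ /^nws_/ && ($2 != "online" || $3 != "true") { print $1 }'
}
```

Running "dscfgadm -i | avs_not_online" on each node should print nothing
when every service is online and enabled.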
>
>> I ran rpcinfo -p on each node and they're identical:
>
> From rpcinfo(1M), the following command syntax is covered in the AVS
> troubleshooting guide (819-6151-10)
>
> # rpcinfo -T tcp node1 100143
>
> rpcinfo -T transport host prognum [versnum]
>
> SNDR's program number is 100143
Yes, I did this, and version 4 'failed' on both nodes (I assume the
documentation I was following is out of date):
# rpcinfo -T tcp sysvolone 100143 4
rpcinfo: RPC: Program/version mismatch; low version = 5, high version = 7
program 100143 version 4 is not available
But, as shown by the mismatch error, version 7 does work - on both nodes:
# rpcinfo -T tcp sysvolone 100143 7
program 100143 version 7 ready and waiting
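The supported version range can also be pulled out of the mismatch
message rather than read by eye. A rough sketch, assuming the message
keeps the "low version = N, high version = M" wording shown above; the
function name rpc_version_range is hypothetical:

```shell
# Hypothetical helper, not part of rpcinfo.
# Reads rpcinfo output on stdin and, if a version-mismatch line is present,
# prints "LOW HIGH", the version range the remote server reports.
# Prints nothing when no mismatch line is found.
rpc_version_range() {
    sed -n 's/.*low version = \([0-9][0-9]*\), high version = \([0-9][0-9]*\).*/\1 \2/p'
}
```

For example, "rpcinfo -T tcp sysvolone 100143 4 2>&1 | rpc_version_range"
would be expected to print "5 7" given the output above (the 2>&1 is
there in case the mismatch message is written to stderr).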
>
>> rpcinfo -p sysvoltwo
>> program vers proto port service
>> 100000 4 tcp 111 rpcbind
>> 100000 3 tcp 111 rpcbind
>> 100000 2 tcp 111 rpcbind
>> 100000 4 udp 111 rpcbind
>> 100000 3 udp 111 rpcbind
>> 100000 2 udp 111 rpcbind
>> 100229 1 tcp 62457 metad
>> 100229 2 tcp 62457 metad
>> 100143 5 tcp 121
>> 100143 6 tcp 121
>> 100143 7 tcp 121
>>
>> Originally, I couldn't connect with rpcinfo at all and then I was
>> missing port 121 on one node - but I've fixed those services and I
>> turned off the 'local only' setting for the rpc/bind service.
>
> I am concerned about the above statement. There is never a need for a
> system admin to use rpcinfo on behalf of AVS (SNDR). I am therefore
> concerned that you have made incompatible changes.
There were two parts to this, and I probably did it backwards because
my dscfginfo was buggy (I had to fix line 1020) and I didn't fix it
until afterwards. First I tried to connect to port 121 via telnet on
each node, but one node didn't respond. While trying to figure out
which service binds to which port, I noticed that the nws_rdc and
nws_rdcsyncd services weren't running, so I manually started them with
'svcadm enable'. I was then able to connect to port 121, but it still
wasn't working. I came across a thread that mentioned using rpcinfo -p
to check the services, but they wouldn't respond, which led me to the
'local only' setting for the rpc/bind service. That is the only thing I
changed there: local to public.
I would think that if I did anything wrong, it would be the manual
service enable.
>> So this is where I'm stuck. I'm a Solaris newbie, and I'm finding it a
>> little difficult because things like the AVS Troubleshooting guide just
>> give commands to run - but I don't know what output I'm looking for.
>
> The encapsulation of AVS startup and shutdown into 'dscfgadm', is an
> improvement over prior versions. If 'dscfgadm -i' does not come back
> without errors, one can run 'dscfgadm -i -x', to get a look inside the
> script as to what operations are failing.
The script that comes with OpenSolaris is busted, and I didn't stumble
across a fix until I already had everything else fixed. :(
But now it definitely looks clean - right?
>
>> The above output looks fine to me, but am I missing something else?
>
> There are two places, on either the SNDR primary or SNDR secondary
> node where error messages are logged on behalf of AVS. They are
> /var/adm/messages, and /var/svc/log/*nws_*
Ah, this is what I was looking for; I just didn't know where to look.
I'm not sure it helps, though: the nws_scm service is the only one with
anything odd in its log:
[ Jan 19 15:27:54 Executing start method ("/lib/svc/method/svc-scm start"). ]
scmadm: cache enable failed
SDBC: Cache enable failed.
All the other services appear normal, and identical:
[ Jan 19 15:27:44 Enabled. ]
[ Jan 19 15:27:55 Executing start method ("/lib/svc/method/svc-sv start"). ]
[ Jan 19 15:27:56 Method "start" exited with status 0. ]
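For anyone else hunting through these logs, a quick way to find which
ones mention a failure. This is only a crude sketch: it matches the
string "fail" case-insensitively, which is my own assumption about what
an AVS error looks like, and the function name scan_avs_logs is
hypothetical:

```shell
# Hypothetical helper, not an AVS tool.
# Takes log file paths as arguments and prints the names of the files
# that contain at least one line matching "fail" (case-insensitive).
# -i ignores case; -l lists matching file names instead of matching lines.
scan_avs_logs() {
    grep -il 'fail' "$@"
}
```

Something like "scan_avs_logs /var/svc/log/*nws_* /var/adm/messages"
would list only the logs worth opening.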
Thanks,
Rick
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss