Jim Dunham wrote:
> Rick,
>
>> Jim Dunham wrote:
>>> Rick,
>>>
>>>> I followed Jim Dunham's AVS & ZFS seamless guide on OpenSolaris 2008.11,
>>>> and I'm running into a problem. Actually, I ran into a few problems,
>>>> but this is where I'm really stuck :)
>>>>
>>>> Both nodes' /var/adm/ds.log show the same errors for each disk:
>>>> Jan 19 15:37:08 librdc: SNDR: Could not open file sysvoltwo:/dev/rdsk/c4d0s0 on remote node
>>>> Jan 19 15:37:09 sndr: SNDR: Could not open file sysvoltwo:/dev/rdsk/c5d0s0 on remote node
>>>
>>> SNDR is a client / server replication model, and thus all of AVS
>>> must be running on both nodes involved in replication. This can be
>>> verified by running "dscfgadm -i" and ensuring there are no errors.
>>> If there are errors, "dscfgadm -d" (disable), followed by "dscfgadm
>>> -e" (enable), should resolve all errors. Check "dscfgadm -i" one
>>> more time.
>> My dscfgadm -i appears to be good. I didn't post both nodes' log
>> outputs because I didn't want this to get too big, but here are both
>> dscfgadm -i outputs.
>
> That's OK
>
>> sysvolone:~# dscfgadm -i
>> SERVICE         STATE           ENABLED
>> nws_scm         online          true
>> nws_sv          online          true
>> nws_ii          online          true
>> nws_rdc         online          true
>> nws_rdcsyncd    online          true
>>
>> Availability Suite Configuration:
>> Local configuration database: valid
>>
>> sysvoltwo:~# dscfgadm -i
>> SERVICE         STATE           ENABLED
>> nws_scm         online          true
>> nws_sv          online          true
>> nws_ii          online          true
>> nws_rdc         online          true
>> nws_rdcsyncd    online          true
>>
>> Availability Suite Configuration:
>> Local configuration database: valid
>>
>>>
>>>> I ran rpcinfo -p on each node and they're identical:
>>>
>>> From rpcinfo(1M), the following command syntax is covered in the AVS
>>> troubleshooting guide (819-6151-10)
>>>
>>> # rpcinfo -T tcp node1 100143
>>>
>>> rpcinfo -T transport host prognum [versnum]
>>>
>>> SNDR's program number is 100143
>> Yes, I did do this, and version 4 'failed' on both nodes (old
>> documentation I assumed):
>
> That is correct. Versions 5, 6, & 7 are the currently supported
> versions for interoperability, and you were looking at outdated docs.
>
> Version 5 = Solaris 2.6 & 7 - 32-bit data path, 32-bit Kernel
> Version 6 = Solaris 8 & 9 - 64-bit data path, 64-bit Kernel, disk queues, multiple async.
>             flusher threads
> Version 7 = Solaris 10, OpenSolaris - Add x86 / x64 replication support in addition to SPARC
>
> The place where this is defined is as follows:
>
> http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/avs/ns/rdc/rdc_prot.x#391
>
> All of AVS source code
>
> http://cvs.opensolaris.org/source/search?q=&path=%2Favs%2F&project=%2Fonnv
>
>>
>> # rpcinfo -T tcp sysvolone 100143 4
>> rpcinfo: RPC: Program/version mismatch; low version = 5, high version = 7
>> program 100143 version 4 is not available
>>
>> But, as shown by the mismatch error, version 7 does work - on both
>> nodes:
>> # rpcinfo -T tcp sysvolone 100143 7
>> program 100143 version 7 ready and waiting
>>
>>>
>>>> rpcinfo -p sysvoltwo
>>>>    program vers proto   port  service
>>>>     100000    4   tcp    111  rpcbind
>>>>     100000    3   tcp    111  rpcbind
>>>>     100000    2   tcp    111  rpcbind
>>>>     100000    4   udp    111  rpcbind
>>>>     100000    3   udp    111  rpcbind
>>>>     100000    2   udp    111  rpcbind
>>>>     100229    1   tcp  62457  metad
>>>>     100229    2   tcp  62457  metad
>>>>     100143    5   tcp    121
>>>>     100143    6   tcp    121
>>>>     100143    7   tcp    121
>>>>
>>>> Originally, I couldn't connect with rpcinfo at all and then I was
>>>> missing port 121 on one node - but I've fixed those services and I
>>>> turned off the 'local only' setting for the rpc/bind service.
>>>
>>> I am concerned about the above statement. There is never a need for
>>> a system admin to use rpcinfo on behalf of AVS (SNDR). I am
>>> therefore concerned you have made incompatible changes.
>> There were two parts to this - and I probably did it backwards
>> because my dscfginfo was buggy (had to fix line 1020).
>
> Can you show me the actual change?

Sure. On line 1020:
Change:
typeset svc=$1
to:
typeset svc='$1'
This was from:
http://www.opensolaris.org/jive/thread.jspa?messageID=307817

>
>> I didn't fix it until afterwards. First I tried to connect to port
>> 121 via telnet on each node, but one node didn't respond. I noticed
>> the nws_rdc and nws_rdcsyncd services weren't running (as I tried to
>> figure out what service bound to what port). So I manually added
>> those with 'svcadm enable'. I was then able to connect to port 121,
>> but it still wasn't working. I came across a thread that mentioned
>> using rpcinfo -p to check the services, but they wouldn't respond,
>> which led me to the 'local' setting for the rpc/bind service. That's
>> all I changed for that. local to public. I would think if I did
>> anything wrong it would be the manual service enable.
>
> This is my concern with 'dscfgadm', as in there should be no reason to
> alter the shell script, and if so, I will need to research why there
> is a regression, create a CR and get the change back into the gate.

Excellent.

>
>>>> So this is where I'm stuck. I'm a Solaris newbie, and I'm finding it a
>>>> little difficult because things like the AVS Troubleshooting guide just
>>>> give commands to run - but I don't know what output I'm looking for.
>>>
>>> The encapsulation of AVS startup and shutdown into 'dscfgadm', is an
>>> improvement over prior versions. If 'dscfgadm -i' does not come back
>>> without errors, one can run 'dscfgadm -i -x', to get a look inside
>>> the script as to what operations are failing.
>>>
>> The script that comes with OpenSolaris is busted, and I didn't
>> stumble across a fix until I already had everything else fixed. :(
>> But now it definitely looks clean - right?
>
> When you first detect the busted 'script', is when you should have
> sought help. Altering the script, then raises the question as to what
> is, and is not working correctly.

I don't disagree, but that seemed minor compared to the other OpenSolaris issues I had...
static IP changes broken, GUI keyboard layout broken - I physically had no mouse initially, compilation issues with certain programs and all Perl modules, what the hell is this metadb thing - slices? what are those? Aw hell, OS used the entire Solaris partition for the root slice, etc, etc, :)

Your blog, which is what I based my testing off of, doesn't even mention dscfgadm except in a comment. Granted, I realize you assume some existing familiarity with the product/environment, but your example was the only one I found that did what I was looking for. If I wanted software raid, I'd use ZFS, and if I wanted plain old replication, I'd use ZFS. Your example had the end result I was looking for - Active/Active replication between servers which did not need to be in the same office.

I didn't just start this yesterday, it's been a journey :) I realize I learn awkwardly (I take something advanced and make it work), but I'm just explaining why I didn't stop with the dscfgadm issue.

I have no problem wiping and reinstalling if you think I should do that.

>
>>>> The above output looks fine to me, but am I missing something else?
>>>
>>> There are two places, on either the SNDR primary or SNDR secondary
>>> node, where error messages are logged on behalf of AVS. They are
>>> /var/adm/messages, and /var/svc/log/*nws_*
>> Ahh This is what I was looking for, I just didn't know where. I'm
>> not sure if it helps though, the nws-scm service is the only one with
>> anything odd:
>> [ Jan 19 15:27:54 Executing start method ("/lib/svc/method/svc-scm start"). ]
>> scmadm: cache enable failed
>> SDBC: Cache enable failed.
>
> The SDBC (Solaris Data Buffer Cache), is used by AVS to cache the
> contents of bitmap volumes. Having this fail, will prevent one side
> (or the other) of an SNDR replicate from running. This raises the
> question on why "/lib/svc/method/svc-scm start", failed, but "dscfgadm
> -i", did not list it as such?

ahhhh ok.
Both sides have that error, so it seems to me the manual service enable shouldn't have been the cause (as the other node was set up fine on its own).

I see there are 4 packages, and I believe the reason the single node didn't work properly is because I installed SUNWii before anything else. That's because the package manager doesn't show descriptions until after an update, and I just started the AVS install before I did the update on that node (I couldn't recall what I did to get the descriptions - if it was a partial install, or full update). I searched for wii, then 'avail' to do the install, not a search for each package.

Based on the following bug, http://defect.opensolaris.org/bz/show_bug.cgi?id=5115 , it looks like those 4 are all I need - correct?
They are:
SUNWii
SUNWrdc
SUNWscm
SUNWspsv

This is my progress and documentation if you're interested:
http://havokmon.blogspot.com/

I've restarted the nws_scm services, and now did a sndradm -u on each node. Both nodes' ds.log files are logging 'could not open file..' and end with 'Sync Ended', but the scm log doesn't have any cache errors this time. The other nws services haven't logged anything at all.

I'm sure that clears things up :)

Thanks,

Rick

>
>> All the other services appear normal, and identical:
>> [ Jan 19 15:27:44 Enabled. ]
>> [ Jan 19 15:27:55 Executing start method ("/lib/svc/method/svc-sv start"). ]
>> [ Jan 19 15:27:56 Method "start" exited with status 0. ]
>>
>> Thanks,
>>
>> Rick
>
> Jim Dunham
> Engineering Manager
> Storage Platform Software Group
> Sun Microsystems, Inc.
_______________________________________________
storage-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/storage-discuss
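
For anyone repeating the rpcinfo check discussed in this thread, here is a minimal sketch of a helper that pulls the registered SNDR versions out of "rpcinfo -p" output. The function name sndr_versions is hypothetical and not part of AVS. The only facts it relies on are from the thread: SNDR's RPC program number is 100143, and versions 5, 6 and 7 are the supported ones. It reads the listing on stdin, so it can be tried against saved output as well as a live node.

```shell
#!/bin/sh
# sndr_versions: hypothetical helper, not shipped with AVS.
# Reads `rpcinfo -p <host>` output on stdin and prints the versions that
# are registered over tcp for RPC program 100143 (SNDR), space-separated.
sndr_versions() {
    awk '$1 == 100143 && $3 == "tcp" {
             out = out (out == "" ? "" : " ") $2
         }
         END { print out }'
}

# Sample input: the sysvoltwo listing quoted earlier in this thread.
sample='   program vers proto   port  service
    100000    4   tcp    111  rpcbind
    100000    3   tcp    111  rpcbind
    100000    2   tcp    111  rpcbind
    100000    4   udp    111  rpcbind
    100000    3   udp    111  rpcbind
    100000    2   udp    111  rpcbind
    100229    1   tcp  62457  metad
    100229    2   tcp  62457  metad
    100143    5   tcp    121
    100143    6   tcp    121
    100143    7   tcp    121'

printf '%s\n' "$sample" | sndr_versions    # prints: 5 6 7
```

Against a live node the same function would be used as "rpcinfo -p sysvoltwo | sndr_versions". An empty result would suggest the SNDR service is not registered with rpcbind on that host, which matches the "missing port 121" symptom described above. It says nothing about whether the remote volumes can actually be opened.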
