Rick Romero wrote:
> Jim Dunham wrote:
>> Rick,
>>
>>> Jim Dunham wrote:
>>>> Rick,
>>>>
>>>>> I followed Jim Dunham's AVS & ZFS seamless guide on OpenSolaris
>>>>> 2008.11, and I'm running into a problem. Actually, I ran into a
>>>>> few problems, but this is where I'm really stuck :)
>>>>>
>>>>> Both nodes' /var/adm/ds.log show the same errors for each disk:
>>>>> Jan 19 15:37:08 librdc: SNDR: Could not open file
>>>>> sysvoltwo:/dev/rdsk/c4d0s0 on remote node
>>>>> Jan 19 15:37:09 sndr: SNDR: Could not open file
>>>>> sysvoltwo:/dev/rdsk/c5d0s0 on remote node
>>>>
>>>> SNDR is a client/server replication model, and thus all of AVS
>>>> must be running on both nodes involved in replication. This can
>>>> be verified by running "dscfgadm -i" and ensuring there are no
>>>> errors. If there are errors, "dscfgadm -d" (disable), followed
>>>> by "dscfgadm -e" (enable), should resolve them. Check
>>>> "dscfgadm -i" one more time.
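To spell that out as an actual sequence - a sketch only, to be run on each node (sysvolone and sysvoltwo are the hostnames in play here):

# dscfgadm -i                        # check the current state of the AVS services
# dscfgadm -d                        # disable all AVS services
# dscfgadm -e                        # re-enable them
# dscfgadm -i                        # verify everything comes back online, error free
# rpcinfo -T tcp sysvoltwo 100143    # from the peer, confirm SNDR (program 100143) answers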
>>> My dscfgadm -i appears to be good. I didn't post both nodes' log
>>> outputs because I didn't want this to get too big, but here are
>>> both dscfgadm -i outputs.
>>
>> That's OK.
>>
>>> sysvolone:~# dscfgadm -i
>>> SERVICE          STATE     ENABLED
>>> nws_scm          online    true
>>> nws_sv           online    true
>>> nws_ii           online    true
>>> nws_rdc          online    true
>>> nws_rdcsyncd     online    true
>>>
>>> Availability Suite Configuration:
>>> Local configuration database: valid
>>>
>>> sysvoltwo:~# dscfgadm -i
>>> SERVICE          STATE     ENABLED
>>> nws_scm          online    true
>>> nws_sv           online    true
>>> nws_ii           online    true
>>> nws_rdc          online    true
>>> nws_rdcsyncd     online    true
>>>
>>> Availability Suite Configuration:
>>> Local configuration database: valid
>>>
>>>>
>>>>> I ran rpcinfo -p on each node and they're identical:
>>>>
>>>> From rpcinfo(1M), the following command syntax is covered in the
>>>> AVS troubleshooting guide (819-6151-10):
>>>>
>>>> # rpcinfo -T tcp node1 100143
>>>>
>>>> rpcinfo -T transport host prognum [versnum]
>>>>
>>>> SNDR's program number is 100143.
>>>
>>> Yes, I did do this, and version 4 'failed' on both nodes (old
>>> documentation, I assumed):
>>
>> That is correct. Versions 5, 6, & 7 are the currently supported
>> versions for interoperability, and you were looking at outdated
>> docs.
>>
>> Version 5 = Solaris 2.6 & 7         - 32-bit data path, 32-bit kernel
>> Version 6 = Solaris 8 & 9           - 64-bit data path, 64-bit kernel,
>>                                       disk queues, multiple async
>>                                       flusher threads
>> Version 7 = Solaris 10, OpenSolaris - adds x86 / x64 replication
>>                                       support in addition to SPARC
>>
>> The place where this is defined is as follows:
>>
>> http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/avs/ns/rdc/rdc_prot.x#391
>>
>> All of the AVS source code:
>>
>> http://cvs.opensolaris.org/source/search?q=&path=%2Favs%2F&project=%2Fonnv
>>
>>> # rpcinfo -T tcp sysvolone 100143 4
>>> rpcinfo: RPC: Program/version mismatch; low version = 5, high
>>> version = 7
>>> program 100143 version 4 is not available
>>>
>>> But, as shown by the mismatch error, version 7 does work - on both
>>> nodes:
>>> # rpcinfo -T tcp sysvolone 100143 7
>>> program 100143 version 7 ready and waiting
>>>
>>>>
>>>>> rpcinfo -p sysvoltwo
>>>>>    program  vers  proto   port  service
>>>>>     100000     4    tcp    111  rpcbind
>>>>>     100000     3    tcp    111  rpcbind
>>>>>     100000     2    tcp    111  rpcbind
>>>>>     100000     4    udp    111  rpcbind
>>>>>     100000     3    udp    111  rpcbind
>>>>>     100000     2    udp    111  rpcbind
>>>>>     100229     1    tcp  62457  metad
>>>>>     100229     2    tcp  62457  metad
>>>>>     100143     5    tcp    121
>>>>>     100143     6    tcp    121
>>>>>     100143     7    tcp    121
>>>>>
>>>>> Originally, I couldn't connect with rpcinfo at all, and then I was
>>>>> missing port 121 on one node - but I've fixed those services and I
>>>>> turned off the 'local only' setting for the rpc/bind service.
>>>>
>>>> I am concerned about the above statement. There is never a need
>>>> for a system admin to use rpcinfo on behalf of AVS (SNDR). I am
>>>> therefore concerned you have made incompatible changes.
>>>
>>> There were two parts to this - and I probably did it backwards,
>>> because my dscfgadm was buggy (had to fix line 1020).
>>
>> Can you show me the actual change?
>
> Sure. On line 1020:
> Change: typeset svc=$1
> to:     typeset svc='$1'
>
> This was from:
> http://www.opensolaris.org/jive/thread.jspa?messageID=307817
>
>>
>>> I didn't fix it until afterwards. First I tried to connect to
>>> port 121 via telnet on each node, but one node didn't respond. I
>>> noticed the nws_rdc and nws_rdcsyncd services weren't running (as
>>> I tried to figure out what service bound to what port), so I
>>> manually added those with 'svcadm enable'. I was then able to
>>> connect to port 121, but it still wasn't working. I came across a
>>> thread that mentioned using rpcinfo -p to check the services, but
>>> they wouldn't respond, which led me to the 'local only' setting
>>> for the rpc/bind service. That's all I changed for that - local
>>> to public. I would think if I did anything wrong it would be the
>>> manual service enable.
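For reference, the 'local only' change described above is normally just an SMF property on rpc/bind, so it can be checked (and reverted) cleanly. A sketch, assuming the standard config/local_only property is what was changed:

# svcprop -p config/local_only svc:/network/rpc/bind:default     # 'true' means local requests only
# svccfg -s svc:/network/rpc/bind setprop config/local_only=false
# svcadm refresh svc:/network/rpc/bind:default

Remote rpcinfo queries need local_only set to false, which is likely why they wouldn't respond before the change.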
>>
>> This is my concern with 'dscfgadm': there should be no reason to
>> alter the shell script, and if there is, I will need to research
>> why there is a regression, create a CR, and get the change back
>> into the gate.
>
> Excellent.

I have an outstanding bug as to 'why' this typeset error happens:
http://defect.opensolaris.org/bz/show_bug.cgi?id=967

>
>>
>>>>> So this is where I'm stuck. I'm a Solaris newbie, and I'm
>>>>> finding it a little difficult because things like the AVS
>>>>> Troubleshooting guide just give commands to run - but I don't
>>>>> know what output I'm looking for.
>>>>
>>>> The encapsulation of AVS startup and shutdown into 'dscfgadm' is
>>>> an improvement over prior versions. If 'dscfgadm -i' comes back
>>>> with errors, one can run 'dscfgadm -i -x' to get a look inside
>>>> the script at which operations are failing.
>>>
>>> The script that comes with OpenSolaris is busted, and I didn't
>>> stumble across a fix until I already had everything else fixed. :(
>>> But now it definitely looks clean - right?
>>
>> The point at which you first detected the busted script is when
>> you should have sought help. Altering the script raises the
>> question of what is, and is not, working correctly.
>
> I don't disagree, but that seemed minor compared to the other
> OpenSolaris issues I had... static IP changes broken, GUI keyboard
> layout broken (I physically had no mouse initially), compilation
> issues with certain programs and all Perl modules, what the hell is
> this metadb thing - slices? what are those? Aw hell, OS used the
> entire Solaris partition for the root slice, etc., etc. :) Your
> blog, which is what I based my testing off of, doesn't even mention
> dscfgadm except in a comment.

My blog pre-dates OpenSolaris 2008.xx, which is where the typeset
defect exists. In other distributions based off of

> Granted, I realize you assume some existing familiarity with the
> product/environment, but your example was the only one I found that
> did what I was looking for. If I wanted software raid, I'd use ZFS,
> and if I wanted plain old replication, I'd use ZFS. Your example
> had the end result I was looking for - Active/Active replication
> between servers which did not need to be in the same office. I
> didn't just start this yesterday, it's been a journey :)
>
> I realize I learn awkwardly (I take something advanced and make it
> work), but I'm just explaining why I didn't stop with the dscfgadm
> issue. I have no problem wiping and reinstalling if you think I
> should do that.

No problem. Do realize that as one deploys new and evolving
technologies, there is a higher probability of defects. Given that,
it takes a Google search and, if nothing turns up, a bug entered for
the specific defect found.

>
>>
>>>>> The above output looks fine to me, but am I missing something
>>>>> else?
>>>>
>>>> There are two places, on either the SNDR primary or SNDR
>>>> secondary node, where error messages are logged on behalf of
>>>> AVS: /var/adm/messages and /var/svc/log/*nws_*
>>>
>>> Ahh, this is what I was looking for, I just didn't know where.
>>> I'm not sure it helps though; the nws_scm service is the only one
>>> with anything odd:
>>> [ Jan 19 15:27:54 Executing start method ("/lib/svc/method/svc-scm
>>> start"). ]
>>> scmadm: cache enable failed
>>> SDBC: Cache enable failed.
>>
>> The SDBC (Solaris Data Buffer Cache) is used by AVS to cache the
>> contents of bitmap volumes. Having this fail will prevent one side
>> (or the other) of an SNDR replica from running. This raises the
>> question of why "/lib/svc/method/svc-scm start" failed, but
>> 'dscfgadm -i' did not list it as such.
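To chase that down, this is where I would look - a rough sketch, and note that the exact SMF log file name under /var/svc/log may differ on your install:

# svcs -xv nws_scm                                    # SMF's explanation, if the service is unhealthy
# tail -50 /var/svc/log/system-nws_scm:default.log    # the log that captured "SDBC: Cache enable failed"
# egrep -i 'sdbc|sndr' /var/adm/messages              # kernel-side AVS messages
# svcadm restart nws_scm                              # only after the underlying cause is corrected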
> Ahhhh, OK. Both sides have that error, so it seems to me the manual
> service enable shouldn't have been the cause (as the other node was
> set up fine on its own). I see there are 4 packages, and I believe
> the reason the single node didn't work properly is because I
> installed SUNWii before anything else. That's because the package
> manager doesn't show descriptions until after an update, and I just
> started the AVS install before I did the update on that node (I
> couldn't recall what I did to get the descriptions - if it was a
> partial install or a full update). I searched for wii, then 'avail'
> to do the install, not a search for each package.
>
> Based on the following bug,
> http://defect.opensolaris.org/bz/show_bug.cgi?id=5115
> it looks like those 4 are all I need - correct? They are:
> SUNWii
> SUNWrdc
> SUNWscm
> SUNWspsv

Yes. The packages needed are those above, but they must be acquired
one at a time and installed in the following order:

SUNWscm
SUNWspsv
SUNWrdc
SUNWii

See: http://defect.opensolaris.org/bz/show_bug.cgi?id=5115

> This is my progress and documentation if you're interested:
> http://havokmon.blogspot.com/
>
> I've restarted the nws_scm services, and now did a sndradm -u on
> each node. Both nodes' ds.log are logging 'could not open file..'
> and end with 'Sync Ended', but the scm log doesn't have any cache
> errors this time. The other nws services haven't logged anything at
> all. I'm sure that clears things up :)
>
> Thanks,
> Rick
>>
>>> All the other services appear normal, and identical:
>>> [ Jan 19 15:27:44 Enabled. ]
>>> [ Jan 19 15:27:55 Executing start method ("/lib/svc/method/svc-sv
>>> start"). ]
>>> [ Jan 19 15:27:56 Method "start" exited with status 0. ]
>>>
>>> Thanks,
>>> Rick
>>
>> Jim Dunham
>> Engineering Manager
>> Storage Platform Software Group
>> Sun Microsystems, Inc.

- Jim
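P.S. For completeness, that install order as explicit commands - a sketch, assuming those four package names are what pkg(1) publishes for OpenSolaris 2008.11 (I have not re-run this exact sequence):

# pkg install SUNWscm
# pkg install SUNWspsv
# pkg install SUNWrdc
# pkg install SUNWii
# dscfgadm -e    # bring the AVS services up once all four packages are in place
# dscfgadm -i    # verify everything is online before enabling any SNDR sets

Once both nodes come up clean, 'sndradm -P' gives a quick summary of each configured set's state, and 'dsstat' should show progress while a sync is running - an easier check than tailing ds.log.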
