2007/10/15, Jim Dunham :

msl wrote:

> I'm here to find out how we can fix the problem of the AVS
> configuration being lost. I made some posts about that error here,
> and I got some replies from Jim Dunham about "/var/adm/messages",
> but I could not find anything relevant. So, I'm using AVS with
> Solaris 10 u3, and I really want to know if I have to install an
> OpenSolaris distro to help you guys with the solution. The AVS code
> is the one from opensolaris.org, so I think other people using AVS
> "should" be getting that error too (even when using OpenSolaris).

To be very clear, you are the "only" one who is losing their AVS
configuration on S10u3. No customers who have purchased AVS 4.0 and
run it on Solaris 10 Update 3, and no OpenSolaris consumers running
Nevada, are seeing this issue.

Therefore it is highly likely that something in your environment is
causing this problem.

The AVS configuration is fairly simple.

In a single node Solaris environment, the "file" /etc/dscfg_local
contains the "AVS" configuration. There are three ways to lose the
AVS configuration under this scenario:
1). Deleting the file "/etc/dscfg_local"
2). Corrupting the contents of "/etc/dscfg_local"
3). Some software issuing "dscfg -i "

Because AVS supports a two-phase commit protocol within itself, it
cannot corrupt its own dscfg database.
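To rule out cases 1 and 2 across a reboot, one can record a checksum of the dscfg database beforehand and compare it afterwards. A minimal sketch, assuming cksum(1) is available; the helper names and the /var/tmp location are made up for illustration:

```shell
#!/bin/sh
# Record a checksum of a dscfg database file so it can be verified
# later (e.g. after a reboot). Hypothetical helpers for illustration.

snapshot_cfg() {
    # $1 = config file, $2 = file in which to store the checksum
    cksum "$1" | awk '{print $1, $2}' > "$2"
}

verify_cfg() {
    # Prints MISSING, OK, or CHANGED depending on the file's state.
    if [ ! -f "$1" ]; then
        echo "MISSING"
        return 1
    fi
    now=$(cksum "$1" | awk '{print $1, $2}')
    saved=$(cat "$2")
    if [ "$now" = "$saved" ]; then echo "OK"; else echo "CHANGED"; fi
}

# Before reboot:  snapshot_cfg /etc/dscfg_local /var/tmp/dscfg_local.sum
# After reboot:   verify_cfg   /etc/dscfg_local /var/tmp/dscfg_local.sum
```

A "CHANGED" or "MISSING" result after reboot would point directly at case 1 or 2, while "OK" pushes the investigation toward case 3 or the cluster-specific failures below.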


OK, I'm replicating disks between two cluster nodes, but that, I think, is
irrelevant to AVS. Still, AVS is detecting that it is in a cluster
environment and is requesting a cluster configuration. I think I could just
use AVS as if these were two separate nodes. ...Sorry, but I don't "see"
where the cluster environment is relevant to AVS in my case.
Answering your questions:
1) The file "/etc/dscfg_local" is not deleted, and the command "dscfg -l" works.
2) See above. I saved the output of the command "dscfg -l" before the reboot
and after it, to make a diff... the outputs are the same. Besides, I think the
AVS software would know about corruption in that file, right?
3) That I can't answer...

In a multi node Sun Cluster environment there are the same scenarios
as above, plus the Sun Cluster part of the AVS configuration. There
are seven additional ways to lose the AVS configuration under this
scenario:
4). Deleting the file "/etc/dscfg_cluster"


That file is always there, always "without the *new line* at the end" :)

5). Changing the contents of /etc/dscfg_cluster


Same as above..

6). Corrupting the DID partition pointed to by the contents of /etc/dscfg_cluster


I was using a partition (s1) on a disk whose other slice (s0) was in use by a
ZFS pool. Now I'm using the same partition (s1), and the other slice (s0) is
not used for anything else.
And the problem is still there...
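One quick way to check for the overlapping-slice situation described above is to compare the sector ranges in the VTOC. A hedged sketch; check_overlap is a made-up helper that parses prtvtoc(1M)-style output fed on stdin:

```shell
# check_overlap: reads prtvtoc(1M)-style output on stdin and reports
# any pair of slices (other than the backup slice, 2) whose sector
# ranges overlap. Column layout assumed: slice tag flags first-sector
# sector-count last-sector. Illustrative helper, not an AVS tool.
check_overlap() {
    awk '
        $1 !~ /^\*/ && NF >= 6 && $1 != 2 {
            start[$1] = $4; end[$1] = $6
        }
        END {
            found = 0
            for (a in start)
                for (b in start)
                    if (a < b && start[a] <= end[b] && start[b] <= end[a]) {
                        printf "slices %s and %s overlap\n", a, b
                        found = 1
                    }
            if (!found) print "no overlap"
        }'
}
# Usage: prtvtoc /dev/rdsk/c2d0s2 | check_overlap
```

If s0 and s1 on the bitmap disk ever shared sectors, a write through the ZFS pool could silently clobber the dscfg data, which is exactly failure "6".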

7). Some software issuing "dscfg -C -i"

8). The DID partition is /dev/did/dsk/ds, not /dev/did/rdsk/ds


The partition is "/dev/did/rdsk/d2s1".
I did try to configure it the other way to see if I could... but the answer
is no! The AVS software complains about "dsk"...

9). The DID partition is not the same DID device on all nodes in the
Sun Cluster

# /usr/cluster/bin/scdidadm -L
2 node1:/dev/rdsk/c0t5006048449AF62A7d34 /dev/did/rdsk/d2
2 node2:/dev/rdsk/c0t5006048449AF62A7d34 /dev/did/rdsk/d2

10). On all nodes of the Sun Cluster, there needs to be a
process called "dscfglockd" running

# ps -eo args | grep dscfg
/usr/lib/dscfglockd -f /etc/dscfg_lockdb

On both nodes...
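A quick way to confirm hypothesis 10 on each node is to scan the process list for dscfglockd. A small illustrative helper (the function name is made up) that reads "ps -eo args" style output on stdin:

```shell
# lockd_running: reads "ps -eo args" style output on stdin and reports
# whether dscfglockd appears in it. Illustrative helper only.
lockd_running() {
    if grep -q 'dscfglockd'; then
        echo "dscfglockd running"
    else
        echo "dscfglockd NOT running"
    fi
}
# On each node: ps -eo args | lockd_running
```

If dscfglockd is absent on either node, the two-phase commit protection described below cannot work, and the cluster dscfg database is at risk.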

It is my opinion that failure "6" is the situation you are seeing. It
is likely caused by 8, 9 or 10.


If you say so, I believe it. :) But we need a way to see it... after a reboot,
this is the situation:

COMMAND: dscfg -l
# Consolidated Dataservice Configuration
# Do not edit out whitespace or dashes
# File created on: Tue Oct 16 16:02:52 2007
# Availability Suite - dscfg configuration database
#
# Storage Cache Manager - scmadm
# threads csiz wrtcache filpat reserved1 niobuf ntdaemon fwrthru nofwrthru
scm: 128 64 - - - - - - -
#
# Cache Hints - scmadm
# device wrthru nordcache
#
# Point-in-Time Copy - iiadm
# master shadow bitmap mode(D|I) [overflow] [device-group] [options] [group]
#
# Remote Mirror (internal) SetId
# setid [device-group]
setid: 4 setid-ctag
#
# Remote Mirror - sndradm
# p_host p_dev p_bmp s_host s_dev s_bmp protocol(ip/fcal_device) mode \
# [group] [device-group] [options] [diskq]
sndr: node1 /dev/rdsk/c2d0s0 /dev/rdsk/c2d0s1 node2 /dev/rdsk/c2d0s0 /dev/rdsk/c2d0s1 ip sync B2007 - setid=3; -
sndr: node1 /dev/rdsk/c3d0s0 /dev/rdsk/c3d0s1 node2 /dev/rdsk/c3d0s0 /dev/rdsk/c3d0s1 ip sync B2007 - setid=4; -
#
# Remote Mirror - Point-in-Time mapping
# SNDR-secondary II-shadow II-bitmap state [device-group]
#
# Bitmap filesystem to mount before other filesystems
# pathname_or_special_device [resource-group]
#
# Storage volumes - svadm
# pathname [mode] [device-group]
sv: /dev/rdsk/c2d0s0 - -
sv: /dev/rdsk/c2d0s1 - -
sv: /dev/rdsk/c3d0s0 - -
sv: /dev/rdsk/c3d0s1 - -
#
# Ncall Core
# nodeid [device-group]
#
# DsVol - volume usage
# volume [device-group] users
dsvol: /dev/rdsk/c2d0s0 - sndr
dsvol: /dev/rdsk/c2d0s1 - sndr
dsvol: /dev/rdsk/c3d0s0 - sndr
dsvol: /dev/rdsk/c3d0s1 - sndr

COMMAND: svadm
/dev/rdsk/c2d0s0
/dev/rdsk/c2d0s1
/dev/rdsk/c3d0s0
/dev/rdsk/c3d0s1

COMMAND: dsstat


COMMAND: sndradm -C local -P


Because AVS supports a two-phase commit protocol within itself, with the
assistance of "dscfglockd", it cannot corrupt its own dscfg database.


I still believe in you. :)


Jim

> So, how can we find a fix for this? After a reboot, the sndr
> information is lost, and I lose all the replication information...
> that's really "bad" behavior. I did some dtrace "scripts" to try
> to find the error, but Mr. Dunham said that that is "digging too
> deep". Any ideas?
> Thanks for your time!
>
> ps.:
> The dtrace post is here:
> http://www.posix.brte.com.br/blog/?p=79
> --
>
> This message posted from opensolaris.org
>
> _______________________________________________
> ha-clusters-discuss mailing list
> ha-clusters-discuss at opensolaris dot org
> http://mail.opensolaris.org/mailman/listinfo/ha-clusters-discuss

Jim Dunham
Solaris, Storage Software Group

Sun Microsystems, Inc.
1617 Southwood Drive
Nashua, NH 03063
Email: James dot Dunham at Sun dot COM
http://blogs.sun.com/avs




Thanks very much!!!

Leal
-- 
pOSix rules
 
 
This message posted from opensolaris.org
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss
