from:"Sunil Mushran"

Re: [Ocfs2-users] [ocfs2-users] RedHat 4 Update 2

2006-02-16 Thread Sunil Mushran

We will be releasing one by tomorrow.

Christophe JOBARD (GHH) wrote:

 Hi,

 Where can I get the RPMs of the OCFS2 software for the new Red Hat
 Enterprise 2.6.9-22.0.2 kernel (RH4 Update 2)?

 Many Thanks,

 Christophe JOBARD


 

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users
   


Re: [Ocfs2-users] Problem with configuration file.

2006-02-20 Thread Sunil Mushran

oops... it'll be fixed today.

Mathieu Avila wrote:
 Norbert Tretkowski wrote:

   
 * Mathieu Avila wrote:
  

 
 I must have missed something obvious, but i can't see what. Any
 ideas?


   
 You forgot indentation in the configuration file.

Norbert
  

 
 Thank you very much.

 I took the example file from the user's guide
 (http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_users_guide.pdf),
 in which the indentation is wrong.
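The indentation problem discussed in this thread can be caught before starting the cluster. The following is a minimal sketch in Python, assuming the stanza layout posted elsewhere on this list (a `node:` or `cluster:` header at column 0 followed by `key = value` lines) and assuming that every attribute line must begin with whitespace. The function name `find_unindented` is hypothetical and is not part of ocfs2-tools.

```python
def find_unindented(conf_text):
    """Return (line_number, line) pairs for attribute lines that lack indentation.

    Assumption: stanza headers ("node:", "cluster:") start at column 0 and
    every "key = value" line below them must start with a space or a tab.
    """
    problems = []
    for lineno, line in enumerate(conf_text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines separate stanzas
        is_header = line.rstrip().endswith(":") and "=" not in line
        if is_header:
            continue
        if "=" in line and not line[0].isspace():
            problems.append((lineno, line))
    return problems


good = "cluster:\n\tnode_count = 2\n\tname = ocfs2\n"
bad = "cluster:\nnode_count = 2\n\tname = ocfs2\n"
print(find_unindented(good))
print(find_unindented(bad))
```

The check only looks at leading whitespace; it does not validate key names or values.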

 --
 Mathieu Avila


Re: [Ocfs2-users] Add a new node to ocfs cluster

2006-03-13 Thread Sunil Mushran

o2cb_ctl -C -n hostname -t node -a cluster=clustername -a ip_address=ip -a ip_port=port -a number=nodenum -i

For example:
# o2cb_ctl -C -n node99 -t node -a cluster=clus5 -a ip_address=192.168.0.99 -a ip_port= -a number=99 -i

Refer to man o2cb_ctl for more details.
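For a silent, scripted setup, the command above can be assembled programmatically. This is only a sketch in Python: it uses exactly the options shown in this message, the helper name `add_node_command` is hypothetical, and the port value is a placeholder because the real port must come from your own cluster.conf. It presumably has to be run on each active node, as with the ocfs2console procedure quoted below.

```python
import shlex


def add_node_command(name, cluster, ip_address, ip_port, number):
    """Build the o2cb_ctl argument list for adding one node."""
    return [
        "o2cb_ctl", "-C", "-n", name, "-t", "node",
        "-a", f"cluster={cluster}",
        "-a", f"ip_address={ip_address}",
        "-a", f"ip_port={ip_port}",
        "-a", f"number={number}",
        "-i",
    ]


PORT = 12345  # placeholder only; use the ip_port from your cluster.conf
cmd = add_node_command("node99", "clus5", "192.168.0.99", PORT, 99)
print(shlex.join(cmd))
```

Keeping the arguments in a list avoids shell quoting problems if the list is later passed to `subprocess.run`.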

Vanaclocha Llorens, Jose Lorenzo wrote:
 Many thanks for your help, Sunil. One more question: my final intention is to
 do it in silent mode. Can the ocfs2console step be replaced with a sequence
 of commands?

 I mean, my final goal is to add a node to an existing Oracle RAC automatically.
 Do you think that is possible from the ocfs perspective?

 Best regards,


 Llorenç Vanaclocha 


 -Original Message-
 From: Sunil Mushran [mailto:[EMAIL PROTECTED] 
 Sent: Saturday, March 11, 2006 0:29
 To: Vanaclocha Llorens, Jose Lorenzo
 CC: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] Add a new node to ocfs cluster

 One can add nodes dynamically. Run ocfs2console on
 all existing nodes to add the new node. Note that adding on
 one node and propagating to the others will not work. One needs
 to add it on each active node via ocfs2console.

 Once added, the new node should show up in
 /config/cluster/clustername/node. Ensure you see the
 new node on all active nodes.

 Then copy the updated cluster.conf to the new node,
 start the cluster and mount.
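The "ensure you see the new node on all active nodes" step can be checked mechanically once the directory listings have been collected from each host. A small Python sketch, using the host names from this thread as sample data; the function name `nodes_missing` is hypothetical, and how the listings are gathered (ssh, a monitoring agent, and so on) is left open.

```python
def nodes_missing(expected_node, listings):
    """Return the hosts whose node directory listing lacks expected_node.

    listings maps a host name to the directory names seen under
    /config/cluster/<clustername>/node on that host.
    """
    return sorted(host for host, names in listings.items()
                  if expected_node not in names)


# Listings as reported in this thread before o2cb was restarted.
listings = {
    "raclab3": ["raclab3", "raclab4", "raclab5"],
    "raclab4": ["raclab4", "raclab5"],
    "raclab5": ["raclab4", "raclab5"],
}
print(nodes_missing("raclab3", listings))
```

An empty result means every host already knows about the new node, so the mount on the new node has a chance of succeeding.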

 Vanaclocha Llorens, Jose Lorenzo wrote:
   
 Hi everybody,

 My problem is that I want to add a new node to an existing RAC with
 ocfs2, without stopping the database.

 If I add a new node to an existing ocfs cluster, do I need to stop
 ocfs on the other nodes of the cluster?

 I've tried to do it without stopping ocfs on the other nodes, but I
 get the following error:

 --

 [EMAIL PROTECTED] ~]# mount.ocfs2 /dev/mapper/eva_d1 /u01/app/oracle

 mount.ocfs2: Transport endpoint is not connected while mounting 
 /dev/mapper/eva_d1 on /u01/app/oracle

 --

 I've three nodes: raclab3, raclab4 and raclab5. I'm trying to add 
 raclab3 to the existing cluster formed by raclab4 and raclab5.

 In the raclab3 I have the following folder:

 --

 [EMAIL PROTECTED] ~]# ls -l /config/cluster/ocfs2/node/

 total 0

 drwxr-xr-x 2 root root 0 Mar 10 11:53 raclab3

 drwxr-xr-x 2 root root 0 Mar 10 11:47 raclab4

 drwxr-xr-x 2 root root 0 Mar 10 11:47 raclab5

 --

 But in the raclab4 and raclab5 I have:

 --

 [EMAIL PROTECTED] ~]# ls -l /config/cluster/ocfs2/node/

 total 0

 drwxr-xr-x 2 root root 0 Mar 10 12:47 raclab4

 drwxr-xr-x 2 root root 0 Mar 10 12:47 raclab5

 --

 If I execute in both nodes:

 --

 [EMAIL PROTECTED] ~]# /etc/init.d/o2cb disable

 Writing O2CB configuration: OK

 [EMAIL PROTECTED] ~]# /etc/init.d/o2cb enable

 Writing O2CB configuration: OK

 Loading module configfs: OK

 Mounting configfs filesystem at /config: OK

 Loading module ocfs2_nodemanager: OK

 Loading module ocfs2_dlm: OK

 Loading module ocfs2_dlmfs: OK

 Mounting ocfs2_dlmfs filesystem at /dlm: OK

 Starting cluster ocfs2: OK

 --

 I get information of raclab3:

 --

 [EMAIL PROTECTED] ~]# ls -l /config/cluster/ocfs2/node/

 total 0

 drwxr-xr-x 2 root root 0 Mar 10 12:51 raclab3

 drwxr-xr-x 2 root root 0 Mar 10 12:51 raclab4

 drwxr-xr-x 2 root root 0 Mar 10 12:51 raclab5

 --

 And finally I can mount my file system in the raclab3 node.

 In summary, is it possible to add a new node without stopping ocfs on
 all the cluster nodes?

Re: [Ocfs2-users] nodes dont see eachother pls help!

2006-03-21 Thread Sunil Mushran

Is this a shared disk?

Do:
# echo stats | debugfs.ocfs2 -n /dev/sdX | grep UUID
on all nodes

Is the UUID the same?
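If the output of the command above is collected from every node, the comparison can be scripted. A Python sketch that mimics `grep UUID` and checks that all nodes report the same lines; the sample text is hypothetical and much shorter than real `debugfs.ocfs2` output, and the function names are made up for illustration.

```python
def uuid_lines(stats_output):
    """Keep only the lines mentioning UUID, like `grep UUID` would."""
    return [line.strip() for line in stats_output.splitlines() if "UUID" in line]


def same_volume(outputs_by_node):
    """True when every node reports identical, non-empty UUID lines."""
    seen = {tuple(uuid_lines(text)) for text in outputs_by_node.values()}
    return len(seen) == 1 and seen != {()}


# Hypothetical sample text; real output has more fields.
sample = {
    "node1": "Label: demo\nUUID: AAAA\n",
    "node2": "Label: demo\nUUID: BBBB\n",
}
print(same_volume(sample))
```

Differing UUIDs would suggest the nodes are not looking at the same shared device.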

Oneill wrote:
 Hi!

 I am working on an Oracle cluster but I cannot get further because the ocfs2
 nodes don't synchronize.
 I can create an ocfs2 filesystem on both machines if I want, but they do not
 see each other at all, and it is not a network error (unsecured Fedora Core 4
 boxes without firewall, security patches, etc.). All settings look correct; I
 generated cluster.conf many times, both with ocfs2console and manually,
 but it did not help.
 I read all the documentation on the ocfs2 page and really don't know why they
 don't work.

 As I said, there are 2 Fedora Core 4 boxes with the same configuration. I compiled the
 kernel with your ocfs2 patch (version: 2.6.14). All startup scripts and
 ocfs2console work perfectly and there are no errors in the logs. The 2 machines
 can ping each other, but I checked the traffic when I tried to set up the cluster
 and there isn't a single packet going to port .
 The disk partitions are the same on both sides. I tried to format and mount volumes
 on node1 then node2, and on node2 then node1; nothing happens on the other node...

 And one more thing: /proc/fs/ocfs2 doesn't exist! I don't know why; I can
 format and mount ocfs2 partitions locally.

 Help me pls!

 Thanks:

 Oneill


Re: [Ocfs2-users] Getting error when mounting shared OCFS2 device.

2006-03-30 Thread Sunil Mushran

/etc/hosts is not the problem.
Do:
/sbin/ifconfig
Do you see the VIP bound on the same interface as the
one used in cluster.conf?

Also, what does dmesg indicate on both nodes? The lower
node number will list the IP that is trying to connect to it.

Vaidya, Sachin wrote:

 Removed VIPs from hosts and restarted the cluster. But nothing 
 changed. Still cannot mount /dev/md0 on both nodes.
 Do I need to reboot servers after changing the /etc/hosts ? Any other 
 suggestions ?
 Thanks,

 Sachin Vaidya
 Infrastructure Management Senior Analyst
 Affiliated Computer Services


  -Original Message-
 From:   Sunil Mushran [mailto:[EMAIL PROTECTED]
 Sent:   Thursday, March 30, 2006 5:34 PM
 To: Vaidya, Sachin
 Cc: ''ocfs2-users@oss.oracle.com' '
 Subject: Re: [Ocfs2-users] Getting error when mounting shared OCFS2 device.

 Remove vip and mount on both. See if that helps.

 Vaidya, Sachin wrote:
 
  Hi,
  Tried both public and private ip addreses but still not able to mount
  device on both nodes.
  Here are my configuration details.
  hosts file : same on both nodes.
 
  127.0.0.1      localhost.localdomain   localhost
  172.18.11.12   acspittdw001            acspittdw001.servicemetrics.net
  172.18.22.1    priv-acspittdw001
  172.18.11.24   vip-acspittdw001
  172.18.11.13   acspittdw002            acspittdw002.servicemetrics.net
  172.18.22.2    priv-acspittdw002
  172.18.11.25   vip-acspittdw002
 
  The cluster.conf on both nodes looks the same:
  node:
      ip_port = 
      ip_address = 172.18.11.12
      number = 0
      name = acspittdw001
      cluster = ocfs2
 
  node:
      ip_port = 
      ip_address = 172.18.11.13
      number = 1
      name = acspittdw002
      cluster = ocfs2
 
  cluster:
      node_count = 2
      name = ocfs2
 
  Both nodes can ping each other on public and private ips.
  The mount command produces following error on node 2 when device is
  already mounted on node 1.
 
  [EMAIL PROTECTED] ~]#  mount -t ocfs2 /dev/md0 /crs1
  mount.ocfs2: Transport endpoint is not connected while mounting
  /dev/md0 on /crs1
  [EMAIL PROTECTED] ~]#
 
  dmesg show following messages
 
  SELinux: initialized (dev debugfs, type debugfs), uses genfs_contexts
  (5027,2):ocfs2_initialize_super:1354 max_slots for this device: 8
  (5027,2):ocfs2_fill_local_node_info:1031 I am node 1
  (4986,2):o2net_connect_expired:1446 ERROR: no connection established
  with node 0 after 10 seconds, giving up and returning errors.
 
  (5027,2):dlm_request_join:771 ERROR: status = -107
  (5027,2):dlm_try_to_join_domain:919 ERROR: status = -107
  (5027,2):dlm_join_domain:1164 ERROR: status = -107
  (5027,2):dlm_register_domain:1354 ERROR: status = -107
  (5027,2):ocfs2_dlm_init:1996 ERROR: status = -107
  (5027,2):ocfs2_mount_volume:1063 ERROR: status = -107
  ocfs2: Unmounting device (9,0) on (node 1)
  [EMAIL PROTECTED] ~]#
 
  Any idea why this is happening ?
  I can provide more details if needed.
  Any help will be greatly appreciated.
  Thanks in advance.
  - Sachin Vaidya.
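The `status = -107` values in the dmesg output above are negated errno codes, which is why the mount command reports "Transport endpoint is not connected". A short Python sketch for decoding such values on the machine where it runs; the helper name `decode_status` is hypothetical, and the numeric value 107 for ENOTCONN is specific to Linux.

```python
import errno
import os


def decode_status(status):
    """Translate a negative kernel-style status into (name, message)."""
    code = abs(status)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


# On Linux, 107 is ENOTCONN; other platforms number their errors differently.
print(decode_status(-107))
```

The same helper can be applied to other status values that appear in ocfs2 log lines.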
 
 
 
  -Original Message-
  From: Sunil Mushran
  To: Vaidya, Sachin
  Cc: 'ocfs2-users@oss.oracle.com'
  Sent: 3/29/2006 7:16 PM
  Subject: Re: [Ocfs2-users] Getting error when mounting
  shared OCFS2 device.
 
  Connection failure. Check dmesg.
 
  Mount triggers the heartbeat thread, which triggers o2net
  to make a connection to all heartbeating nodes. If this connection
  fails, the mount fails. (The larger node number initiates the connection
  to the lower node number.)
 
  The obvious error would be an incorrect IP address specified in cluster.conf.
  Error messages in /var/log/messages on both nodes will
  provide more clues.
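The connection rule in parentheses above tells you which node's log to read first: the higher-numbered node dials out and the lower-numbered node listens. A Python sketch of that rule only; the function name `connection_plan` is hypothetical and the code models nothing else about o2net.

```python
from itertools import combinations


def connection_plan(node_numbers):
    """Return (initiator, listener) pairs: the higher number dials the lower."""
    return [(max(a, b), min(a, b))
            for a, b in combinations(sorted(node_numbers), 2)]


print(connection_plan([0, 1]))
```

For the two-node cluster in this thread, node 1 is the initiator, so a wrong address for node 0 in node 1's cluster.conf would be one candidate cause.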
 
  Vaidya, Sachin wrote:
  
   Hi,
  
   I am using RHEL4 2.6.9-34.ELsmp with OCFS2 1.2.
  
   The h/w for this 2 node cluster is connected correctly.
  
   After loading ocfs2 on both nodes, the shared device could only be
   mounted on one node. When I try to mount same shared device on second
   node then I get following error.
  
   mount.ocfs2: Transport endpoint is not connected while mounting
   /dev/md0 on /crs1
  
   Any idea, why this is happening ?
  
   Any help will be highly appreciated.
  
   Thanks,
  
   Sachin Vaidya
  
  
  
  
  
 
  

Re: [Ocfs2-users] understanding self fencing with ocfs2

2006-05-02 Thread Sunil Mushran

In a 2 node setup, if node 0 or 1 crashes, the other node should survive.
The one issue encountered by many users was that while shutting down node 0,
node 1 would fence itself. That was because of the sequencing of
service shutdowns. We added the ocfs2 init script to handle shutdown
sequencing.

However, 1.0.2 is fairly old. We've made numerous fixes. Ideally one
should be on SP3. In fact, look for SuSE to make a new drop in the coming
weeks which will include the certified ocfs2 bits.

[EMAIL PROTECTED] wrote:
 hi list,

 having read the FAQ, I still have a problem understanding the
 self-fencing thing.
 the FAQ says:
 Q02   How does OCFS2's cluster services define a quorum?
 A02   ...
   A node has quorum when:
   * it sees an odd number of heartbeating nodes and has network
     connectivity to more than half of them.
   or
   * it sees an even number of heartbeating nodes and has network
     connectivity to at least half of them *and* has connectivity to
     the heartbeating node with the lowest node number.

 and

 Q03   What is fencing?
 A03   Fencing is the act of forcefully removing a node from a cluster.
   A node with OCFS2 mounted will fence itself when it realizes that it
   doesn't have quorum in a degraded cluster.
   ...

 with a two-node cluster with node numbers 0 and 1, I see the following
 problem.
 if the node with node number 0 crashes and neither heartbeats nor is
 reachable via LAN, we have:
 - an odd number of heartbeating nodes (1, the node with number 1) but
 - no network connectivity to more than half of them (the only other node
   isn't reachable anymore)
 so, as I see it, no quorum = self-fencing.

 as a result, we end up with no node at all. is this right (and is it
 meant that way) or is there any special algorithm in a two-node environment?
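The quorum rule quoted from the FAQ can be written out as a small function to test scenarios like the one above. This Python sketch follows the FAQ wording as quoted, under one labeled assumption: the local node counts itself both as heartbeating and as reachable. That assumption is not taken from the o2cb source; it is simply the reading that agrees with the reply that the surviving node of a two-node cluster should stay up. The function name `has_quorum` is hypothetical.

```python
def has_quorum(local, heartbeating, reachable):
    """Apply the quoted FAQ rule from the point of view of node `local`.

    heartbeating: node numbers seen heartbeating on disk.
    reachable: node numbers reachable over the network.
    Assumption: the local node always counts as heartbeating and reachable.
    """
    heartbeating = set(heartbeating) | {local}
    connected = (set(reachable) | {local}) & heartbeating
    total = len(heartbeating)
    if total % 2 == 1:
        # Odd count: connectivity to more than half is required.
        return 2 * len(connected) > total
    # Even count: at least half, plus the lowest-numbered heartbeating node.
    return 2 * len(connected) >= total and min(heartbeating) in connected


# Node 0 has crashed: node 1 sees only itself heartbeating.
print(has_quorum(1, {1}, set()))
# Network split while both still heartbeat: only node 0 keeps quorum.
print(has_quorum(0, {0, 1}, set()), has_quorum(1, {0, 1}, set()))
```

Under this reading, a crash of node 0 leaves node 1 with quorum, while a pure network split makes node 1 the one that fences.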

 our config is:
 two HP DL380 G4,
 SLES9 SP2 (no SP3, because it's not supported by EMC powerpath)
 Linux bmiam112 2.6.5-7.201-bigsmp #1 SMP Thu Aug 25 06:20:45 UTC 2005
 i686 i686 i386 GNU/Linux
 all OCFS modules version 1.0.2-SLES,
 ocfs2console-0.99.14-0.3
 ocfs2-tools-0.99.14-0.3
 each with two NICs in active-standby (bond0)

 thanks in advance and sorry if this is kind of a newbie question

 greetings
 thomas zimolong

 Bundesministerium des Inneren
 Referat Z 6 - Funktionsbereich Anwendungsentwicklung
 Alt-Moabit 101 D
 D-10559 Berlin
 Fon 01888 681 2383
 Fax 01888 681 5 2383
 mailto:[EMAIL PROTECTED]
 http://bmi.bund.de


Re: [Ocfs2-users] Node panic

2006-05-10 Thread Sunil Mushran

You may want to upgrade to 1.2.1. We have done fixes in this area.

Jim Erb wrote:
 Can anyone tell me what might be happening here.  I have a 3 node
 cluster running under RH AS 4 (2.6.9-22.0.1.ELsmp) with ocfs2 v.
 1.2.0-1.  I've recently implemented elevator=deadline in grub.conf to
 fix some previous panics, but now it seems this box goes down every few
 days with this panic:

 May 10 13:46:10 linux97 kernel: (29579,0):ocfs2_extend_file:784 ERROR:
 bug expression: i_size_read(inode) != (le64_to_cpu(fe->i_size) -
 *bytes_extended)
 May 10 13:46:10 linux97 kernel: (29579,0):ocfs2_extend_file:784 ERROR:
 Inode 3891726 i_size = 77801, dinode i_size = 79865, bytes_extended = 0,
 new_i_size = 77874
 May 10 13:46:10 linux97 kernel: [ cut here ]
 May 10 13:46:10 linux97 kernel: kernel BUG
 at /rpmbuild/jlbec/BUILD/ocfs2-1.2.0/fs/ocfs2/file.c:784!
 May 10 13:46:10 linux97 kernel: invalid operand:  [#1]
 May 10 13:46:10 linux97 kernel: SMP
 May 10 13:46:10 linux97 kernel: Modules linked in: nfs lockd
 hangcheck_timer md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core
 ocfs2(U) debugfs(U)
 ocfs2_dlmfs(U) ocfs2_dlm(U) ocfs2_nodemanager(U) configfs(U) sunrpc
 dm_mirror dm_mod emcphr(U) emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U)
 emcp(U) emcplib(U) button battery ac uhci_hcd ehci_hcd hw_random shpchp
 e1000 bonding(U) floppy sg ext3 jbd lpfc(U) scsi_transport_fc
 megaraid_mbox megaraid_mm sd_mod scsi_mod
 May 10 13:46:10 linux97 kernel: CPU:0
 May 10 13:46:10 linux97 kernel: EIP:0060:[f8f8c635]Tainted: P
 VLI
 May 10 13:46:10 linux97 kernel: EFLAGS: 00210292   (2.6.9-22.0.1.ELsmp)
 May 10 13:46:10 linux97 kernel: EIP is at ocfs2_extend_file+0x380/0xf25
 [ocfs2]
 May 10 13:46:10 linux97 kernel: eax: 0086   ebx:    ecx:
 ea42fe6c   edx: f8fb52b5
 May 10 13:46:10 linux97 kernel: esi: f4144e24   edi: ea42ff18   ebp:
 cc54   esp: ea42fea4
 May 10 13:46:10 linux97 kernel: ds: 007b   es: 007b   ss: 0068
 May 10 13:46:10 linux97 kernel: Process oracle (pid: 29579,
 threadinfo=ea42f000 task=e8bd11b0)
 May 10 13:46:10 linux97 kernel: Stack: f70d0380  
  f4144e24 f6ef4f00 ea42ff58 
 May 10 13:46:10 linux97 kernel: e53ff48c f7fbc200
 ea42ff68  ea42ff68  ea42ff68
 May 10 13:46:10 linux97 kernel: f4144e24 f8f9a4ee
 00013032  ea42ff18 00012fe9 
 May 10 13:46:10 linux97 kernel: Call Trace:
 May 10 13:46:10 linux97 kernel:  [f8f9a4ee]
 ocfs2_write_lock_maybe_extend+0x731/0xad5 [ocfs2]
 May 10 13:46:10 linux97 kernel:  [f8f8a684] ocfs2_file_write
 +0x11f/0x254 [ocfs2]
 May 10 13:46:10 linux97 kernel:  [c0159d24] vfs_write+0xb6/0xe2
 May 10 13:46:10 linux97 kernel:  [c0159dee] sys_write+0x3c/0x62
 May 10 13:46:10 linux97 kernel:  [c02d0fb7] syscall_call+0x7/0xb
 May 10 13:46:10 linux97 kernel: Code: b1 e0 fd ff ff ff b1 dc fd ff ff
 68 10 03 00 00 68 01 fb fa f8 ff 70 10 ff b2 94 00 00 00 68 b5 52 fb f8
 e8 f8 5b 19 c7 83 c4 3c 0f 0b 10 03 1c 50 fb f8 8b 5c 24 10 8b 83 54
 01 00 00 0f ae e8
 May 10 13:46:10 linux97 kernel:  0Fatal exception: panic in 5 seconds




Re: [Ocfs2-users] OCFS2 hangs system on reboot

2006-05-16 Thread Sunil Mushran

ocfs2-tools includes two init scripts, o2cb and ocfs2. Ensure the
scripts are active and running in the correct sequence. That is,
the startup sequence should be network, o2cb and then ocfs2. The shutdown
sequence is the reverse of that.
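The ordering requirement above can be expressed as a simple check against the order in which services actually start on a machine. A Python sketch; the service names are the three mentioned in this message, the helper names are hypothetical, and obtaining the observed order from the init system is left to the reader.

```python
STARTUP_ORDER = ["network", "o2cb", "ocfs2"]


def shutdown_order(startup):
    """Shutdown is the startup sequence reversed."""
    return list(reversed(startup))


def is_valid_startup(observed):
    """True when the three services appear in the required relative order.

    Raises ValueError if one of the services is missing from `observed`.
    """
    positions = [observed.index(name) for name in STARTUP_ORDER]
    return positions == sorted(positions)


print(shutdown_order(STARTUP_ORDER))
print(is_valid_startup(["network", "sshd", "o2cb", "ocfs2"]))
```

Other services may sit between the three; only their relative order matters here.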

[EMAIL PROTECTED] wrote:

 Has anyone experienced OCFS2 hanging the system on reboot? I'm running
 OCFS2 1.2.1-1 on RHEL 4 Update 3 64-bit. OCFS2 is up and running on 3
 nodes, with mounts. When I issue a shutdown -ry now command with
 OCFS2 filesystems still mounted, the system begins to shut down, then starts
 freaking out about not being able to communicate with other nodes in
 the cluster, starts panicking and fences itself. It hangs here and
 I have to cycle the server by hand. This is not a problem if I
 manually unmount the OCFS2 filesystems prior to rebooting. I've tried
 putting an unmount script in /etc/rc6.d but to no avail; whatever is
 happening happens before it gets to my unmount script.
 


Re: [Ocfs2-users] Disk-based DLM

2006-06-06 Thread Sunil Mushran

OCFS2 does not have a disk-based dlm. Net connectivity is a must.

Leonardo de Assis wrote:
 Hi,

 I have two machines that do not have a network connection. If my disk
 can be shared between them, is there a way to use a disk-based dlm or
 any other method that does not rely on network access?

 -- 
 Leonardo de Assis
 Computação - UFCG
 [EMAIL PROTECTED] mailto:[EMAIL PROTECTED]
 


Re: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?

2006-06-09 Thread Sunil Mushran

The hb failure is just the effect of the ios not completing within 12 secs.
The full oops trace gives the last 24 ops and their timings.

One solution is to double the hb timeout. Set
O2CB_HEARTBEAT_THRESHOLD=14
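To see what a given threshold means in seconds, the relation between the two can be computed directly. This Python sketch assumes the formula given in the OCFS2 FAQ as I recall it, timeout = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, which is consistent with the 12 second figure above if the default threshold is 7. Treat both the formula and the 2 second interval as assumptions to verify against the FAQ for your release; the function names are hypothetical.

```python
HEARTBEAT_INTERVAL_SECS = 2  # assumed interval between disk heartbeats


def timeout_secs(threshold):
    """Seconds of missed heartbeat I/O before a node is declared dead."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def threshold_for(timeout):
    """Smallest threshold whose timeout is at least `timeout` seconds."""
    return -(-timeout // HEARTBEAT_INTERVAL_SECS) + 1


print(timeout_secs(7), timeout_secs(14))
print(threshold_for(60))
```

Under this formula a threshold of 14 gives 26 seconds, slightly more than double the assumed 12 second default.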

Brian Long wrote:
 Hello,

 I have two nodes running the 2.6.9-22.0.2.ELsmp kernel and the OCFS2
 1.2.1 RPMs.  About once a week, one of the nodes crashes itself (self-
 fencing) and I get a full vmcore on my netdump server.  The netdump log
 file shows the shared filesystem LUN (/dev/dm-6) did not respond within
 12000ms.  I have not changed the default heartbeat values
 in /etc/sysconfig/o2cb.  There was no other IO ongoing when this
 happens, but they are HP Proliant servers running the Insight Manager
 agents.

 Why would the heartbeat fail roughly once a week?  Should I open a
 bugzilla and upload my netdump log file?

 Thanks.

 /Brian/
   


Re: [Ocfs2-users] bug in /etc/init.d/o2cb?

2006-06-14 Thread Sunil Mushran

Yes, we are missing that bit. File a bug on http://oss.oracle.com/bugzilla
component ocfs2-tools.

[EMAIL PROTECTED] wrote:
 hi,

 maybe this is not the place to file a bug, but
 I think there is one in /etc/init.d/o2cb.

 the script should be used to create the config file /etc/sysconfig/o2cb
 by calling it with o2cb configure, and the generated config file contains
 a note not to edit the file, but to use the script.

 alas, the script only asks for two parameters, the cluster name and
 whether to enable the cluster or not.

 the sometimes necessary modification of the heartbeat threshold cannot
 be made via the script, and, even worse, it is overwritten by write_config()
 when someone calls the script after the modification was made using some editor.

 so, maybe configure_ask() should contain a third loop to ask for that
 parameter too (or any parameter which can additionally be set).

 greets
 thomas zimolong

 Bundesministerium des Inneren
 Referat Z 6 - Funktionsbereich Anwendungsentwicklung
 Alt-Moabit 101 D
 D-10559 Berlin
 Fon 01888 681 2383
 Fax 01888 681 5 2383
 mailto:[EMAIL PROTECTED]
 http://bmi.bund.de


Re: [Ocfs2-users] Different versions of ocfs2 and Kernel

2006-06-14 Thread Sunil Mushran

I would not recommend that for 1.1.7 to 1.2.1. While neither
the on-disk format nor the messaging has changed since 1.0,
there have been other internal changes which could cause
problems.

The recommendation is documented in the faq... under Upgrading to 1.2.1.

Marco Friebe wrote:
 Thanks for your answers.

 It was only meant to be a temporary solution. While updating one node the
 others have to be available. No downtime for all nodes in the same time
 frame.

 ---

 Marco Friebe
 __
 Systemberater

 Robotron Datenbank-Software GmbH
 Stuttgarter Straße 29
 01189 Dresden

 Telefon: +49 (0) 351/4021 655
 Telefax: +49 (0) 351/4021 696
 Mailto:   [EMAIL PROTECTED]


 Web:   www.robotron.de
 -Original Message-
 From: Sunil Mushran [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, June 13, 2006 18:14
 To: Marco Friebe
 Cc: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] Different versions of ocfs2 and Kernel

 As we never test such mixed setups, we do not recommend it.
 Also, it is much easier to manage clusters when one has the same
 software on all the nodes.

 Marco Friebe wrote:
   
 Hello,

  

 is it possible to have different kernel and ocfs version on different 
 nodes?

  

 Here:

  

 ocfs2 v1.1.7  Kernel 2.6.5-7.244 (SuSE)

  

 and

  

 Ocfs2 v1.2.1 Kernel 2.6.5-7.257 (SuSE)

  

 

  

 Marco Friebe
 __
 Systemberater

 Robotron Datenbank-Software GmbH
 Stuttgarter Straße 29
 01189 Dresden

 Telefon: +49 (0) 351/4021 655
 Telefax: +49 (0) 351/4021 696
 Mailto:   [EMAIL PROTECTED] mailto:[EMAIL PROTECTED]
 Web:   www.robotron.de http://www.robotron.de

  

 


Re: [Ocfs2-users] change heartbeat threshold online?

2006-06-14 Thread Sunil Mushran

It's not a sysctl entry. It won't work that way.

Set the required value in /etc/sysconfig/o2cb
and restart the cluster. Do it on all nodes.
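Editing the value consistently on all nodes is easier with a small helper that rewrites one KEY=VALUE line and leaves the rest of the file untouched. A Python sketch operating on the file contents as a string; the function name `set_sysconfig_value` and the `EXAMPLE_OTHER` variable are hypothetical. Note that another thread in this archive reports that running the o2cb configure script afterwards may overwrite such manual edits.

```python
import re


def set_sysconfig_value(text, key, value):
    """Return text with KEY=value set, replacing an existing line or appending."""
    pattern = re.compile(rf"^{re.escape(key)}=.*$", flags=re.MULTILINE)
    new_line = f"{key}={value}"
    if pattern.search(text):
        # A function replacement avoids backslash interpretation in the value.
        return pattern.sub(lambda match: new_line, text)
    suffix = "" if text.endswith("\n") or not text else "\n"
    return f"{text}{suffix}{new_line}\n"


before = "EXAMPLE_OTHER=1\nO2CB_HEARTBEAT_THRESHOLD=7\n"
print(set_sysconfig_value(before, "O2CB_HEARTBEAT_THRESHOLD", 31))
```

After writing the result back to /etc/sysconfig/o2cb, the cluster still has to be restarted on every node for the new value to apply, as stated above.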

[EMAIL PROTECTED] wrote:
 hi,

 I'm just thinking about changing the heartbeat threshold of our cluster
 online by issuing
 # echo 31 > /proc/fs/ocfs2_nodemanager/hb_dead_threshold

 I thought I read that somewhere but cannot recall where, and I can't find
 it in the FAQ(?).

 So is this the way to do it, or is it stop and restart ocfs2/o2cb?

 greetz

 thomas zimolong

 Bundesministerium des Inneren
 Referat Z 6 - Funktionsbereich Anwendungsentwicklung
 Alt-Moabit 101 D
 D-10559 Berlin
 Fon 01888 681 2383
 Fax 01888 681 5 2383
 mailto:[EMAIL PROTECTED]
 http://bmi.bund.de


Re: [Ocfs2-users] kernel BUG at /rpmbuild/smushran/BUILD/ocfs2-1.2.1/fs/ocfs2/file.c:787!

2006-06-21 Thread Sunil Mushran

Check out http://oss.oracle.com/bugzilla/show_bug.cgi?id=723

Peter McMahon wrote:
 All

 still working on the use of OCFS2

 Yesterday, when we were running autoconfig for an Apps
 DB node in a RAC cluster the other node crashed

 extract from /var/log/messages...is below...

 If anyone can advise of what action we should take
 please do

 Thanks in advance

 Peter


 other info which may be relevant
 - o2cb_ctl version 1.2.1
 - Linux 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:54:53 EST
 2006 i686 i686 i386 GNU/Linux

 - o2cb status
 Module configfs: Loaded
 Filesystem configfs: Mounted
 Module ocfs2_nodemanager: Loaded
 Module ocfs2_dlm: Loaded
 Module ocfs2_dlmfs: Loaded
 Filesystem ocfs2_dlmfs: Mounted
 Checking cluster ocfs2: Online
 Checking heartbeat: Active

 - ocfs2 status
 Configured OCFS2 mountpoints:  /home /d01 /d02 /d03
 /d04 /CRS
 Active OCFS2 mountpoints:  /home /d01 /d02 /d03 /d04
 /CRS

 - O2CB_HEARTBEAT_THRESHOLD=60








 ===
 Jun 20 13:15:21 tudbsou01 kernel:
 (1659,1):ocfs2_extend_file:787 ERROR: bug expression:
 i_size_read(inode) != (le64_to_cpu(fe->i_size) -
 *bytes_extended)
 Jun 20 13:15:21 tudbsou01 kernel:
 (1659,1):ocfs2_extend_file:787 ERROR: Inode 16615168
 i_size = 225, dinode i_size = 3774, bytes_extended =
 0, new_i_size = 345
 Jun 20 13:15:21 tudbsou01 kernel: [ cut
 here ]
 Jun 20 13:15:21 tudbsou01 kernel: kernel BUG at
 /rpmbuild/smushran/BUILD/ocfs2-1.2.1/fs/ocfs2/file.c:787!
 Jun 20 13:15:21 tudbsou01 kernel: invalid operand:
  [#1]
 Jun 20 13:15:21 tudbsou01 kernel: SMP
 Jun 20 13:15:21 tudbsou01 kernel: Modules linked in:
 md5 ipv6 autofs4 i2c_dev i2c_core ocfs2(U) debugfs(U)
 ocfs2_dlmfs(U) ocfs2_dlm(U) ocfs2_nodemanager(U)
 configfs(U) sunrpc emcphr(U) emcpmpap(U) emcpmpaa(U)
 emcpmpc(U) emcpmp(U) emcp(U) emcplib(U) button battery
 ac joydev uhci_hcd shpchp tg3 sg st dm_snapshot
 dm_zero dm_mirror ext3 jbd dm_mod qla2300(U)
 qla2xxx(U) qla2xxx_conf(U) mptscsih mptsas mptspi
 mptfc mptscsi mptbase sd_mod scsi_mod
 Jun 20 13:15:21 tudbsou01 kernel: CPU:1
 Jun 20 13:15:21 tudbsou01 kernel: EIP:   
 0060:[f93ce081]Tainted: P  VLI
 Jun 20 13:15:21 tudbsou01 kernel: EFLAGS: 00010292  
 (2.6.9-34.ELsmp)
 Jun 20 13:15:21 tudbsou01 kernel: EIP is at
 ocfs2_extend_file+0x380/0xf25 [ocfs2]
 Jun 20 13:15:21 tudbsou01 kernel: eax: 0081   ebx:
    ecx: e71c8e6c   edx: f93f726f
 Jun 20 13:15:21 tudbsou01 kernel: esi: f3b0d624   edi:
 e71c8f18   ebp: f3a6b000   esp: e71c8ea4
 Jun 20 13:15:21 tudbsou01 kernel: ds: 007b   es: 007b 
  ss: 0068
 Jun 20 13:15:21 tudbsou01 kernel: Process racgmain
 (pid: 1659, threadinfo=e71c8000 task=f6eea930)
 Jun 20 13:15:21 tudbsou01 kernel: Stack: f6ffef40
    f3b0d624 f600ad80 e71c8f58
 
 Jun 20 13:15:21 tudbsou01 kernel:
 f3921458 f7e53a00 e71c8f68  e71c8f68 
 e71c8f68
 Jun 20 13:15:21 tudbsou01 kernel:
 f3b0d624 f93dc213 0159  e71c8f18 00e1
 
 Jun 20 13:15:21 tudbsou01 kernel: Call Trace:
 Jun 20 13:15:21 tudbsou01 kernel:  [f93dc213]
 ocfs2_write_lock_maybe_extend+0x731/0xad5 [ocfs2]
 Jun 20 13:15:21 tudbsou01 kernel:  [f93cc0d0]
 ocfs2_file_write+0x11f/0x254 [ocfs2]
 Jun 20 13:15:21 tudbsou01 kernel:  [c015a5e8]
 vfs_write+0xb6/0xe2
 Jun 20 13:15:21 tudbsou01 kernel:  [c015a6b2]
 sys_write+0x3c/0x62
 Jun 20 13:15:21 tudbsou01 kernel:  [c02d2657]
 syscall_call+0x7/0xb
 Jun 20 13:15:21 tudbsou01 kernel:  [c02d007b]
 schedule+0x32f/0x8d3
 Jun 20 13:15:21 tudbsou01 kernel: Code: b1 e0 fd ff ff
 ff b1 dc fd ff ff 68 13 03 00 00 68 b5 18 3f f9 ff 70
 10 ff b2 94 00 00 00 68 6f 72 3f f9 e8 bc 45 d5 c6 83
 c4 3c 0f 0b 13 03 d3 6f 3f f9 8b 5c 24 10 8b 83 54
 01 00 00 0f ae e8
 Jun 20 13:15:21 tudbsou01 kernel:  0Fatal exception:
 panic in 5 seconds
 Jun 20 13:25:25 tudbsou01 syslogd 1.4.1: restart.


   
 

Re: [Ocfs2-users] Error while Mounting

2006-06-26 Thread Sunil Mushran

Is it always the mount using node slot 1 that fails? If so, the JBD superblock
may be corrupted for that slot.

Grow the journal by, say, 1MB. That will reinitialize the JBD superblock for all
the slots. Either that or just reformat the device.

To see the size of the existing journal, do:
# echo ls -l // | debugfs.ocfs2 -n /dev/sda1 | grep journal
36  -rw-r--r--   1   0   0   67108864   21-Jun-2006 11:58   journal:
37  -rw-r--r--   1   0   0   67108864   21-Jun-2006 11:58   journal:0001
38  -rw-r--r--   1   0   0   67108864   21-Jun-2006 11:58   journal:0002
39  -rw-r--r--   1   0   0   67108864   21-Jun-2006 11:58   journal:0003

To grow the journal, do:
# tunefs.ocfs2 -Jsize=65M /dev/sdX
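The 65M in the command above follows from the listing: 67108864 bytes is 64 MB, and growing by 1 MB gives 65 MB. A Python sketch of that arithmetic, assuming the "M" suffix means mebibytes (1024 * 1024 bytes); the function name `grown_journal_arg` is hypothetical.

```python
MIB = 1024 * 1024


def grown_journal_arg(current_bytes, grow_by_mib=1):
    """Return a size string one step larger than the current journal."""
    current_mib = -(-current_bytes // MIB)  # round up to whole MiB
    return f"{current_mib + grow_by_mib}M"


print(grown_journal_arg(67108864))
```

The result is meant to be passed as the size value in the tunefs.ocfs2 invocation shown above.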

Zachary Williams wrote:
 I am attempting to set up a 2 node ocfs2 cluster. At this point, I
 have the latest 1.2.1 version of the tools on both nodes. They are
 not running identical kernels (one is 2.6.16.18,
 the other is 2.6.17.1); both are using the kernel's
 built-in OCFS2 modules, not modules built from source.

 I can mount my iscsi volume on either node individually, but when I 
 attempt to mount two nodes, I get the following error.  (To confirm, I 
 have 2 nodes setup in the config file, and the filesystem set to a 
 maximum of 4 nodes)

 The error is "JBD: no valid journal superblock found".

 I have searched high and low for this, but wasn't able to come up with 
 anything as to why I get this.  This error will occur on either node.

 (3509,0):o2net_set_nn_state:415 accepted connection from node bsp (num 1) at 10.1.1.11:
 (3575,0):ocfs2_initialize_super:1326 max_slots for this device: 4
 (3575,0):ocfs2_fill_local_node_info:1019 I am node 0
 (3575,0):__dlm_print_nodes:377 Nodes in my domain 
 (E09A0D90C8454749B81E9754438611B8):
 (3575,0):__dlm_print_nodes:381  node 0
 (3575,0):__dlm_print_nodes:381  node 1
 (3575,0):ocfs2_find_slot:267 taking node slot 1
 JBD: no valid journal superblock found
 (3575,0):ocfs2_journal_wipe:814 ERROR: status = -22
 (3575,0):ocfs2_check_volume:1581 ERROR: status = -22
 (3575,0):ocfs2_mount_volume:1087 ERROR: status = -22
 ocfs2: Unmounting device (8,16) on (node 0)
 (3577,0):o2net_set_nn_state:400 no longer connected to node bsp (num 1) at 10.1.1.11:
 


Re: [Ocfs2-users] out of memory?

2006-06-29 Thread Sunil Mushran

I would like the entire /proc/meminfo and /proc/slabinfo.
Dump it to a file every 1 min or so.
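One way to collect these snapshots is a small loop that appends both files to a log with a timestamp. A Python sketch under stated assumptions: the function names are hypothetical, the output path is up to you, and reading /proc/slabinfo may require root on some kernels (the sketch records the error instead of stopping).

```python
import time
from datetime import datetime


def snapshot(paths, out_path):
    """Append a timestamped copy of each file in paths to out_path."""
    with open(out_path, "a") as out:
        out.write(f"==== {datetime.now().isoformat()} ====\n")
        for path in paths:
            out.write(f"---- {path} ----\n")
            try:
                with open(path) as src:
                    out.write(src.read())
            except OSError as exc:
                out.write(f"(could not read: {exc})\n")


def collect(paths, out_path, interval_secs=60, rounds=None):
    """Take a snapshot every interval_secs; rounds=None runs until interrupted."""
    count = 0
    while rounds is None or count < rounds:
        snapshot(paths, out_path)
        count += 1
        if rounds is None or count < rounds:
            time.sleep(interval_secs)
```

Calling `collect(["/proc/meminfo", "/proc/slabinfo"], "/tmp/ocfs2-mem.log")` would then record one snapshot per minute until the process is stopped.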

What version of the kernel/ocfs2?

Paul Jimenez wrote:
 On Jun 29, 2006, at 8:22 AM, Brian Long wrote:

   
 On Wed, 2006-06-28 at 17:03 -0500, Paul Jimenez wrote:
 
 I'm getting out of memory errors trying to do 'rsync -av /foo /bar'
 where /foo is a local dir and /bar is an ocfs2 filesystem running on
 an ~ 6T ATA-over-Ethernet box.
   
 Paul,

 Can you also include some information about your /foo partition?
 Is it millions of little files or hundreds of large files? What is the
 RSS of rsync when you run out of memory?

 http://samba.anu.edu.au/rsync/FAQ.html#5
 http://lists.samba.org/archive/rsync/2002-July/003160.html

 

 /foo is ~ 4600 files each about 60GB for a total of ~259GB.

 Some output after or slightly-before it crashed:


 Every 2s: cat /proc/slabinfo | sort -rnk 2 | head   Thu Jun 29 11:58:01 2006

 buffer_head      754620 754632  52  72 1 : tunables 120 60 8 : slabdata 10481 10481 0
 bio              225600 225600 128  30 1 : tunables 120 60 8 : slabdata  7520  7520 0
 biovec-1         225593 225736  16 203 1 : tunables 120 60 8 : slabdata  1112  1112 0
 journal_head     175548 182448  52  72 1 : tunables 120 60 8 : slabdata  2530  2534 0
 aoe_bufs         112536 112554  48  78 1 : tunables 120 60 8 : slabdata  1443  1443 0
 radix_tree_node   41510  41510 276  14 1 : tunables  54 27 8 : slabdata  2965  2965 0
 sysfs_dir_cache    3644   3772  40  92 1 : tunables 120 60 8 : slabdata    41    41 0
 size-32            2938   4407  32 113 1 : tunables 120 60 8 : slabdata    39    39 0
 size-64            2354   2596  64  59 1 : tunables 120 60 8 : slabdata    44    44 0
 dentry_cache       2086   3090 128  30 1 : tunables 120 60 8 : slabdata   103   103 0


 Free swap: 16779608kB
 4718592 pages of RAM
 4489216 pages of HIGHMEM
 562809 reserved pages
 530215 pages shared
 0 pages swap cached
 136994 pages dirty
 61878 pages writeback
 142502 pages mapped
 29403 pages slab
 480 pages pagetables

 4718592 pages of RAM
 4489216 pages of HIGHMEM
 562809 reserved pages
 530215 pages shared
 0 pages swap cached
 136994 pages dirty
 61876 pages writeback
 142502 pages mapped
 29425 pages slab
 480 pages pagetables

 I don't think it's rsync running things oom; its memory consumption
 is filecount based and 4600 files just isn't that many.

 The tunables that I had in place from the AoE faq
 (http://www.coraid.com/support/linux/EtherDrive-2.6-HOWTO.html#toc5.18)
 this time were:

 vm.overcommit_memory=2
 vm.dirty_ratio=3
 vm.dirty_background_ratio=3
 vm.min_free_kbytes=5120

 Any help appreciated.

--pj


Re: [Ocfs2-users] out of memory?

2006-06-29 Thread Sunil Mushran

HighFree: 11877028 kB
LowFree:    391020 kB
HighFree: 11761892 kB
LowFree:    342380 kB
HighFree: 11654316 kB
LowFree:    315860 kB
HighFree: 11578756 kB
LowFree:    291928 kB
HighFree: 11490936 kB
LowFree:    264788 kB

That's at the end. I fail to see the enomem. Plenty of lowfree and highfree.
Some of the slabs do have high counts, but this is a big box.
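
To put the slab counts in proportion, the footprint of a cache can be estimated from a slabinfo line as num_slabs * pagesperslab * page size. The sketch below assumes the slabinfo 2.x column order (name, active_objs, num_objs, objsize, objperslab, pagesperslab, then the tunables and slabdata groups) and a 4 KiB page. The two sample lines are the buffer_head and bio entries from this thread with their wrapped columns separated again, and `slab_usage_kib` is a hypothetical helper name.

```python
PAGE_KIB = 4  # assumption: 4 KiB pages, as on a 32-bit x86 kernel


def slab_usage_kib(line):
    """Return (cache name, approximate KiB held by its slabs) for one slabinfo line."""
    fields = line.replace(":", " ").split()
    name = fields[0]
    pages_per_slab = int(fields[5])
    # The slabdata group is: active_slabs, num_slabs, sharedavail.
    num_slabs = int(fields[fields.index("slabdata") + 2])
    return name, num_slabs * pages_per_slab * PAGE_KIB


samples = [
    "buffer_head 754620 754632 52 72 1 : tunables 120 60 8 : slabdata 10481 10481 0",
    "bio 225600 225600 128 30 1 : tunables 120 60 8 : slabdata 7520 7520 0",
]
for sample in samples:
    name, kib = slab_usage_kib(sample)
    print(f"{name}: about {kib / 1024:.1f} MiB")
```

On these figures the largest cache, buffer_head, holds roughly 41 MiB, which is small next to the free memory reported above.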

What is crashing? Is the server oopsing? oom-kill?
Or, is the user-space process erroring out?

Paul Jimenez wrote:
 I have that complete file - from before rsync to the crash (~ 4MB) at  
 http://www.rgmadvisors.com/~pj/memslabinfo.

 Kernel is 2.6.16.7 vanilla, and the version of ocfs2 it came with.

--pj



Re: [Ocfs2-users] out of memory?

2006-07-05 Thread Sunil Mushran

Strange. The meminfo/slabinfo data does not match this.

The deal is if none of the components are leaking memory,
not much one can do other than limiting the lowmem consumption.
So, yes, try HIGHPTE. If 4G/4G was in mainline, I would have
suggested that too.

Else, maybe just limit the box to 8G (from 16G). Or, just upgrade
to a 64-bit box. :)
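
The lowmem squeeze can be read directly from the per-zone lines of the oom-killer report quoted in this message: the Normal zone has 3756kB free against a min watermark of 5028kB, while HighMem still has about 14GB free. A minimal sketch that flags such zones is shown here; the regular expression only covers the zone names and fields that appear in this particular report, and `zones_below_min` is a hypothetical helper name.

```python
import re

# Matches per-zone summary lines such as
# "Normal free:3756kB min:5028kB low:6284kB ...".
ZONE_RE = re.compile(r"\b(DMA32|DMA|Normal|HighMem) free:(\d+)kB min:(\d+)kB")


def zones_below_min(report):
    """Return (zone, free_kb, min_kb) for each zone whose free memory is under its min watermark."""
    flat = " ".join(report.split())  # undo the line wrapping added by mail clients
    return [(zone, int(free), int(min_mark))
            for zone, free, min_mark in ZONE_RE.findall(flat)
            if int(free) < int(min_mark)]


report = """
[4296647.185000] DMA free:3588kB min:88kB low:108kB high:132kB
[4296647.185000] DMA32 free:0kB min:0kB low:0kB high:0kB
[4296647.185000] Normal free:3756kB min:5028kB low:6284kB high:7540kB
[4296647.186000] HighMem free:14211892kB min:512kB low:6836kB high:13164kB
"""
print(zones_below_min(report))
```

Only the Normal zone is reported, which matches the suspicion that the box is running out of lowmem rather than out of memory overall.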

Paul Jimenez wrote:


 [4296647.18] oom-killer: gfp_mask=0xd0, order=0
 [4296647.181000]  [c014148b] out_of_memory+0xb4/0xd1
 [4296647.181000]  [c0142627] __alloc_pages+0x267/0x2fa
 [4296647.181000]  [c01426e4] __get_free_pages+0x2a/0x4e
 [4296647.181000]  [c016fcb7] __pollwait+0x86/0xc7
 [4296647.181000]  [c03de7d4] datagram_poll+0x2b/0xcf
 [4296647.181000]  [c04173f1] udp_poll+0x23/0xf7
 [4296647.181000]  [c03d7867] sock_poll+0x23/0x2b
 [4296647.181000]  [c0170075] do_select+0x29b/0x2f5
 [4296647.181000]  [c016fc31] __pollwait+0x0/0xc7
 [4296647.183000]  [c01702e1] core_sys_select+0x1ed/0x316
 [4296647.183000]  [c01704c7] sys_select+0xbd/0x18d
 [4296647.183000]  [c010221b] sys_sigreturn+0xcf/0xde
 [4296647.183000]  [c0102ccd] syscall_call+0x7/0xb
 [4296647.183000] Mem-info:
 [4296647.183000] DMA per-cpu:
 [4296647.183000] cpu 0 hot: high 0, batch 1 used:0
 [4296647.183000] cpu 0 cold: high 0, batch 1 used:0
 [4296647.184000] cpu 1 hot: high 0, batch 1 used:0
 [4296647.184000] cpu 1 cold: high 0, batch 1 used:0
 [4296647.184000] cpu 2 hot: high 0, batch 1 used:0
 [4296647.184000] cpu 2 cold: high 0, batch 1 used:0
 [4296647.184000] cpu 3 hot: high 0, batch 1 used:0
 [4296647.184000] cpu 3 cold: high 0, batch 1 used:0
 [4296647.184000] DMA32 per-cpu: empty
 [4296647.184000] Normal per-cpu:
 [4296647.184000] cpu 0 hot: high 186, batch 31 used:96
 [4296647.184000] cpu 0 cold: high 62, batch 15 used:54
 [4296647.184000] cpu 1 hot: high 186, batch 31 used:31
 [4296647.184000] cpu 1 cold: high 62, batch 15 used:52
 [4296647.184000] cpu 2 hot: high 186, batch 31 used:155
 [4296647.184000] cpu 2 cold: high 62, batch 15 used:47
 [4296647.184000] cpu 3 hot: high 186, batch 31 used:32
 [4296647.184000] cpu 3 cold: high 62, batch 15 used:7
 [4296647.184000] HighMem per-cpu:
 [4296647.184000] cpu 0 hot: high 186, batch 31 used:145
 [4296647.185000] cpu 0 cold: high 62, batch 15 used:12
 [4296647.185000] cpu 1 hot: high 186, batch 31 used:14
 [4296647.185000] cpu 1 cold: high 62, batch 15 used:1
 [4296647.185000] cpu 2 hot: high 186, batch 31 used:185
 [4296647.185000] cpu 2 cold: high 62, batch 15 used:5
 [4296647.185000] cpu 3 hot: high 186, batch 31 used:14
 [4296647.185000] cpu 3 cold: high 62, batch 15 used:4
 [4296647.185000] Free pages: 14219236kB (14211892kB HighMem)
 [4296647.185000] Active:2840 inactive:406695 dirty:78930 writeback:147046 unstable:0 free:3554809 slab:26149 mapped:2601 pagetables:102
 [4296647.185000] DMA free:3588kB min:88kB low:108kB high:132kB active:0kB inactive:0kB present:16384kB pages_scanned:6 all_unreclaimable? no
 [4296647.185000] lowmem_reserve[]: 0 0 880 18416
 [4296647.185000] DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
 [4296647.185000] lowmem_reserve[]: 0 0 880 18416
 [4296647.185000] Normal free:3756kB min:5028kB low:6284kB high:7540kB active:604kB inactive:324kB present:901120kB pages_scanned:414 all_unreclaimable? no
 [4296647.186000] lowmem_reserve[]: 0 0 0 140288
 [4296647.186000] HighMem free:14211892kB min:512kB low:6836kB high:13164kB active:10756kB inactive:1626456kB present:17956864kB pages_scanned:0 all_unreclaimable? no
 [4296647.186000] lowmem_reserve[]: 0 0 0 0
 [4296647.186000] DMA: 1*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3588kB
 [4296647.186000] DMA32: empty
 [4296647.186000] Normal: 1*4kB 1*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3756kB
 [4296647.186000] HighMem: 2015*4kB 3457*8kB 3245*16kB 3099*32kB 5194*64kB 5422*128kB 2960*256kB 1088*512kB 474*1024kB 116*2048kB 2676*4096kB = 14211892kB
 [4296647.186000] Swap cache: add 0, delete 0, find 0/0, race 0+0
 [4296647.186000] Free swap  = 16779884kB
 [4296647.186000] Total swap = 16779884kB
 [4296647.187000] Free swap:   16779884kB
 [4296647.288000] 4718592 pages of RAM
 [4296647.288000] 4489216 pages of HIGHMEM
 [4296647.289000] 562809 reserved pages
 [4296647.289000] 347365 pages shared
 [4296647.289000] 0 pages swap cached
 [4296647.289000] 78668 pages dirty
 [4296647.289000] 147126 pages writeback
 [4296647.289000] 2601 pages mapped
 [4296647.289000] 26149 pages slab
 [4296647.289000] 102 pages pagetables
 [4296647.289000] Out of Memory: Kill process 1304 (portmap) score 422 and children.
 [4296647.289000] Out of memory: Killed process 1304 (portmap).


 suggestions?  So I'm running out of lowmem?  will turning on HIGHPTE 
 be enough to fix this?

 --pj

 On Jun 29, 2006, at 5:02 PM, Sunil Mushran wrote:

 HighFree: 11877028 kB
 LowFree:391020 kB
 HighFree: 11761892 kB

Re: [Ocfs2-users] What is wrong

2006-07-06 Thread Sunil Mushran

Before you can mount, you have to ensure all the nodes
in the cluster access the same device.

# echo stats | debugfs.ocfs2 -n /dev/sdX | grep UUID

should return the same uuid from all nodes.

Once all nodes can see the same device, you can mount
it on all nodes. There are no passive node(s). The dlm ensures
only one node updates a particular metadata block at a time.
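
The UUID comparison can be automated once the grep output from each node has been collected. The sketch below assumes that output has already been gathered into a dict keyed by node name (how it is gathered, for example over ssh, is left out), and it compares the lines as plain text without assuming any particular UUID layout. `same_device` is a hypothetical helper name.

```python
def same_device(uuid_output_by_node):
    """Return (ok, missing): ok is True when every node printed the same UUID line,
    missing lists the nodes that printed nothing at all."""
    cleaned = {node: " ".join(text.split()).lower()
               for node, text in uuid_output_by_node.items()}
    missing = sorted(node for node, text in cleaned.items() if not text)
    distinct = {text for text in cleaned.values() if text}
    return (not missing and len(distinct) == 1), missing
```

A node that prints nothing is reported separately, since that points at a device visibility problem rather than at two different volumes.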

boka wrote:
 Hello,

 i have configuration made with
 slackware 10.2
 2.6.17.2
 ocfs2tools 1.2.1

 on two dell poweredge 852 machines with eonstore array with two scsi
 controllers. Array is divided in two logical volumes. First logical drive
 is connected to first node as sdb1, etc. I will use linuxha software for
 standby cluster.

 Cluster software, ocfs2tools, tells that it is working.

 First question
 Should the ocfs2 partition be mounted on all nodes?
 If yes, what determines the active node?

 Second question
 I have mounted the ocfs2 partition on node1, but node2 cannot see that it is
 mounted.

 node1:~# echo slotmap | debugfs.ocfs2 -n /dev/sdb1
 Slot#   Node#
 0   1

 node2:~# echo slotmap | debugfs.ocfs2 -n /dev/sdb1
 Slot#   Node#

 Any idea

 Third question
 I can not see traffic on interconnect devices.

 ps. sorry for my poor english

   


Re: [Ocfs2-users] Resizing OCFS2 Filesystems

2006-07-06 Thread Sunil Mushran

ocfs2-tools 1.2.2 will have the offline-extend feature.
Still in testing.

Karen Penman wrote:
 Hi All,

 Can anyone tell me if OCFS2 filesystems can be dynamically extended?  If not, 
 is this something that is likely to be available in the future?

 Thanks,

 Karen



Re: [Ocfs2-users] Unable to mount node2 mount.ocfs2: Transport endpoint is not connected while mounting /dev/sdb1 on /u02/oradata/orcl

2006-07-06 Thread Sunil Mushran

Check dmesg on both nodes.

The error indicates that the connect failed. Ensure the ip addresses
of all nodes in /etc/ocfs2/cluster.conf are correct. Also, that
the conf file is the same on all nodes.

Try pinging the other node on the configured interface:
# ping -I ethX node1
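
One way to confirm that the conf file is the same on all nodes is to compare checksums. The sketch below assumes each node's /etc/ocfs2/cluster.conf has already been copied to a local path (the copy step itself is not shown), and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def conf_digest(path):
    """SHA-256 of one cluster.conf copy."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def group_identical(copies):
    """copies maps node name -> local path of that node's cluster.conf.
    Returns the nodes grouped by identical content; a single group means all copies match."""
    groups = {}
    for node, path in copies.items():
        groups.setdefault(conf_digest(path), []).append(node)
    return sorted(sorted(nodes) for nodes in groups.values())
```

More than one group in the result identifies which nodes carry a different file.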

Akin Seigmund Walter-Johnson III wrote:
 I currently have the setup below. Both nodes can see the shared drive
 (confirmed with fdisk -l).
 However, I am unable to mount the shared device from node (2) after I
 mounted it from node (1).
 I get the following error:
 mount.ocfs2: Transport endpoint is not connected while mounting 
 /dev/sdb1 on /u02/oradata/orcl

 OS Red Hat uname -r
 -
 2.6.9-22.ELsmp

 OCFS version
 
 OCFS2 1.2.1 Fri Apr 21 12:21:12 PDT 2006 (build 
 bd2f25ba0af9677db3572e3ccd92f739)

 /sbin/lsmod |grep ocfs
 ocfs2 350660  1
 debugfs14216  2 ocfs2
 ocfs2_dlmfs27272  1
 ocfs2_dlm 183816  2 ocfs2,ocfs2_dlmfs
 ocfs2_nodemanager 154464  7 ocfs2,ocfs2_dlmfs,ocfs2_dlm
 configfs   28044  2 ocfs2_nodemanager
 jbd59481  2 ocfs2,ext3


 rpm  -qa | grep ocfs
 ocfs2console-1.2.1-1
 ocfs2-tools-1.2.1-1
 ocfs2-2.6.9-22.ELsmp-1.2.1-1


   


Re: [Ocfs2-users] [Ocfs2-announce] OCFS2 1.2.2 released

2006-07-18 Thread Sunil Mushran

ocfs2-tools 1.2.2 :)

Brian Long wrote:
 On Fri, 2006-06-30 at 16:10 -0700, Sunil Mushran wrote:
   
 All,

 We are pleased to announce the release of OCFS2 1.2.2.

 This release includes some recent fixes, including bugzilla#723 
 http://oss.oracle.com/bugzilla/show_bug.cgi?id=723.
 (Users running 1.2.1-3 are encouraged to upgrade to 1.2.2.)

 With this release, OCFS2 now detects nodes having different heartbeat
 timeout values (O2CB_HEARTBEAT_THRESHOLD). Check dmesg after mount(s)
 to look for errors suggesting the same. Multipath users are encouraged
 to refer to the FAQ for more on this parameter.

 Also, new with this release are the largesmp packages for the x86-64,
 IA64 and PPC64 architectures.
 

 In an earlier thread, you mentioned 1.2.2 was going to support offline
 resize / extend.  I do not see this mentioned in the FAQ or the Users
 Guide.  Is there any documentation on this new feature?

 Thanks.

 /Brian/
   


Re: [Ocfs2-users] OCFS2 and Snapshots

2006-07-20 Thread Sunil Mushran

OCFS2 relies on the uniqueness of the uuid to distinguish between
different volumes. One cannot mount two volumes having the same uuid
on the same node. In fact, one should not do that across the cluster
either, i.e., mount two different physical volumes having the same uuid.

If you have to mirror and mount, mount it on another node in a different
cluster. It could be a 1 node cluster too.

Andre Brinkmann wrote:
 Hello,

 I am trying to couple OCFS2 with a storage virtualization environment to 
 use features like mirroring and snapshots. Unfortunately it seems to be 
 impossible for ocfs2console (and for mount.ocfs2) to distinguish between 
 the original volume and its snapshot, and ocfs2 stops the mount process 
 with the following messages:

 Jul 20 17:23:54 sinalco kernel: (5028,1):ocfs2_initialize_super:1395 
 max_slots for this device: 4
 Jul 20 17:23:54 sinalco kernel: (5028,0):ocfs2_fill_super:642 ERROR: 
 Unable to create per-mount debugfs root.

 Is it possible to change the uuid and other relevant parameters?

 Best Regards

 André

   


Re: [Ocfs2-users] OCFS2 and Snapshots

2006-07-21 Thread Sunil Mushran

Cool.

Andre Brinkmann wrote:
 I hope this patch is in a better diff -u -p-format :-)

 Patch for the Makefile
 ===

 --- tunefs.ocfs2/Makefile   2006-04-21 23:40:29.0 +0200
 +++ tunefs.ocfs2_new/Makefile   2006-07-21 14:29:48.0 +0200
 @@ -36,6 +36,6 @@ OBJS = $(subst .c,.o,$(CFILES))
 DIST_FILES = $(CFILES) tunefs.ocfs2.8.in

 tunefs.ocfs2: $(OBJS) $(LIBOCFS2_DEPS) $(LIBO2DLM_DEPS) $(LIBO2CB_DEPS)
 -   $(LINK) $(LIBOCFS2_LIBS) $(LIBO2DLM_LIBS) $(LIBO2CB_LIBS) $(COM_ERR_LIBS)
 +   $(LINK) $(LIBOCFS2_LIBS) $(UUID_LIBS) $(LIBO2DLM_LIBS) $(LIBO2CB_LIBS) $(COM_ERR_LIBS)

 include $(TOPDIR)/Postamble.make


 Patch for tunefs-c
 

 --- tunefs.ocfs2/tunefs.c   2006-04-21 23:40:29.0 +0200
 +++ tunefs.ocfs2_new/tunefs.c   2006-07-21 14:25:19.0 +0200
 @@ -44,6 +44,7 @@
 #include <inttypes.h>
 #include <ctype.h>
 #include <signal.h>
 +#include <uuid/uuid.h>

 #include <ocfs2.h>
 #include <ocfs2_fs.h>
 @@ -70,6 +71,7 @@ typedef struct _ocfs2_tune_opts {
char *progname;
char *device;
int verbose;
 +int uuid;
int quiet;
int prompt;
time_t tune_time;
 @@ -84,7 +86,7 @@ static void usage(const char *progname)
 {
fprintf(stderr, "usage: %s [-N number-of-node-slots] "
"[-L volume-label]\n"
 -   "\t[-J journal-options] [-S volume-size] [-qvV] "
 +   "\t[-J journal-options] [-S volume-size] [-qvuV] "
"device\n",
progname);
exit(0);
 @@ -242,6 +244,7 @@ static void get_options(int argc, char *
{ "quiet", 0, 0, 'q' },
{ "version", 0, 0, 'V' },
{ "journal-options", 0, 0, 'J'},
 +{ "uuid-reset", 0, 0, 'u'},
{ "volume-size", 0, 0, 'S'},
{ 0, 0, 0, 0}
};
 @@ -254,7 +257,7 @@ static void get_options(int argc, char *
opts.prompt = 1;

while (1) {
 -   c = getopt_long(argc, argv, "L:N:J:S:vqVx", long_options,
 +   c = getopt_long(argc, argv, "L:N:J:S:vquVx", long_options,
NULL);

if (c == -1)
 @@ -303,6 +306,10 @@ static void get_options(int argc, char *
opts.vol_size = val;
break;

 +case 'u':
 +opts.uuid = 1;
 +break;
 +
case 'v':
opts.verbose = 1;
break;
 @@ -471,6 +478,38 @@ static void update_volume_label(ocfs2_fi
return ;
 }

 +
 +static void update_uuid (ocfs2_filesys *fs, int *changed)
 +{
 +unsigned char *uuid = OCFS2_RAW_SB(fs->fs_super)->s_uuid;
 +   size_t i, max = sizeof(OCFS2_RAW_SB(fs->fs_super)->s_uuid);
 +uuid_t uuid_new;
 +
 +/* print out old uuid of device */
 +printf ("Try to change uuid: \n");
 +   for(i = 0; i < max; i++)
 +   printf("%02x ", uuid[i]);
 +
 +   printf("\n");
 +
 +/* generate new uuid */
 +uuid_generate(uuid_new);
 +
 +   memset (OCFS2_RAW_SB(fs->fs_super)->s_uuid, 0, OCFS2_VOL_UUID_LEN);
 +   memcpy (OCFS2_RAW_SB(fs->fs_super)->s_uuid, uuid_new, OCFS2_VOL_UUID_LEN);
 +
 +/* print out new uuid */
 +printf ("New uuid: \n");
 +   for(i = 0; i < max; i++)
 +   printf("%02x ", uuid[i]);
 +
 +printf("\n");
 +
 +*changed = 1;
 +
 +   return ;
 +}
 +
 static errcode_t update_slots(ocfs2_filesys *fs, int *changed)
 {
errcode_t ret = 0;
 @@ -553,6 +592,7 @@ int main(int argc, char **argv)
errcode_t ret = 0;
ocfs2_filesys *fs = NULL;
int upd_label = 0;
 +int upd_uuid = 0;
int upd_slots = 0;
int upd_jrnls = 0;
int upd_vsize = 0;
 @@ -674,6 +714,10 @@ int main(int argc, char **argv)
   vol_size, opts.vol_size);
}

 +/* update unique serial number of device has been selected */
 +if (opts.uuid)
 +printf ( "Change unique serial number of device \n" );
 +
/* Abort? */
if (opts.prompt) {
printf("Proceed (y/N): ");
 @@ -690,6 +734,13 @@ int main(int argc, char **argv)
printf("Changed volume label\n");
}

 +/* update the unique serial number */
 +if (opts.uuid) {
 +update_uuid (fs, &upd_uuid);
 +if (upd_uuid)
 +printf ("Changed volume uuid \n");
 +}
 +
/* update number of slots */
if (opts.num_slots) {
ret = update_slots(fs, &upd_slots);
 @@ -726,7 +777,7 @@ int main(int argc, char **argv)
}

/* write superblock */
 -   if (upd_label || upd_slots || upd_vsize) {
 +   if (upd_label || upd_slots || upd_vsize || upd_uuid) {
block_signals(SIG_BLOCK);
ret = ocfs2_write_super(fs);
if (ret) {



 Sunil Mushran wrote:
 Please could you send it to me again in the diff -u -p format.

 Andre Brinkmann wrote:
 Sorry,

 here the patch as text:

 For the Makefile:

 39c39
  $(LINK) $(LIBOCFS2_LIBS) $(LIBO2DLM_LIBS) $(LIBO2CB_LIBS) 
 $(COM_ERR_LIBS)
 ---
   $(LINK) $(LIBOCFS2_LIBS) $(UUID_LIBS) $(LIBO2DLM_LIBS) 
 $(LIBO2CB_LIBS) $(COM_ERR_LIBS)


 For tunefs.ocfs2.c:

 46a47

Re: [Ocfs2-users] OCFS2: Could not start cluster stack

2006-07-27 Thread Sunil Mushran

Check the support guide on cluster start/stop in the doc section on 
http://oss.oracle.com/projects/ocfs2.

Vicki Luo wrote:
 I installed OCFS2 on RHEL4 with  ocfs2-2.6.9-22.ELsmp-1.2.2-1.i686.rpm. When 
 I start ocfs2console and click on Cluster, and then Configure Nodes, it 
 returns a dialog with the following message Could not start cluster stack. 
 This must be resolved before any OCFS2 filesystem can be mounted 
 Here is some information about my system:
 1. Uname -a
 Linux SDCHS40I030 2.6.9-22.ELsmp #1 SMP Mon Sep 19 18:32:14 EDT 2005 i686 
 i686 i386 GNU/Linux
  2. rpm -qa|grep ocfs2
 ocfs2-tools-1.2.1-1
 ocfs2console-1.2.1-1
 ocfs2-2.6.9-22.ELsmp-1.2.2-1

 3. rpm -qa | grep kernel
 kernel-smp-2.6.9-5.EL
 kernel-utils-2.4-13.1.48
 kernel-2.6.9-5.EL
 kernel-smp-2.6.9-22.EL

 I saw a few solution posted, for example: depmod -a , then try again. Or 
 SELINUX=disabled
 I did both of them , but I still got the error message.
 Can anybody help me? What else could be wrong?

 Thanks,
 Vicki



Re: [Ocfs2-users] Private Interconnect and self fencing

2006-07-28 Thread Sunil Mushran

Do you have a netdump server configured? If so, it'll have the details
of the hb timeout.

Jeffery P. Humes wrote:
 I have set it to 30 seconds, and the same thing still happens.

 (15,1):o2hb_write_timeout:164 ERROR: Heartbeat write timeout to device 
 etherd/e0.1p1 after 30000 milliseconds
 panic+0x3e/0x174(15,1):o2hb_stop_all_regions:1789 ERROR: stopping 
 heartbeat on all active regions.
 Kernel panic - not syncing: ocfs2 is very sorry to be fencing this 
 system by panicing

 [c01233de]  [f8cc826a] o2quo_disk_timeout+0x0/0x2 [ocfs2_nodemanager]
 [c01313f8] run_workqueue+0x7f/0xba [f8cc6b15] 
 o2hb_write_timeout+0x0/0x65 [ocfs2_nodemanager]
 [c0131be5] worker_thread+0x0/0x117 [c0131ccb] 
 worker_thread+0xe6/0x117
 [c011daa9] default_wake_function+0x0/0xc [c01344fd] 
 kthread+0x9d/0xc9
 [c0134460] kthread+0x0/0xc9 [c0102005] 
 kernel_thread_helper+0x5/0xb

 -JPH


 Sunil Mushran wrote:
 The 12 sec default is low. Bump it up to 30 secs or even higher. FAQ 
 has the details.
 The higher you set it to, the longer the brown-out time.
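
For reference, a small sketch of the arithmetic behind the 12 and 30 second figures. It assumes the relation given in the OCFS2 FAQ as best recalled, namely that the disk heartbeat runs every 2 seconds and the timeout is (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds; this should be checked against the FAQ for the installed release. The function names are hypothetical.

```python
import math

HEARTBEAT_INTERVAL_SECS = 2  # assumption taken from the FAQ, not from this thread


def timeout_for_threshold(threshold):
    """Seconds without a successful heartbeat write before the node fences."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def threshold_for_timeout(timeout_secs):
    """Smallest O2CB_HEARTBEAT_THRESHOLD giving at least timeout_secs."""
    return math.ceil(timeout_secs / HEARTBEAT_INTERVAL_SECS) + 1


print(timeout_for_threshold(7))   # threshold behind the 12 second default
print(threshold_for_timeout(30))  # threshold needed for a 30 second timeout
```

Under that assumption the 12000 millisecond message corresponds to a threshold of 7, and a 30 second timeout needs a threshold of 16 on every node.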

 Jeffery P. Humes wrote:
 I have an OCFS2 filesystem on a coraid AOE device.
 It mounts fine, but with heavy I/O the server self fences claiming a 
 write timeout:

 (16,2):o2hb_write_timeout:164 ERROR: Heartbeat write timeout to 
 device etherd/e0.1p1 after 12000 milliseconds
 (16,2):o2hb_stop_all_regions:1789 ERROR: stopping heartbeat on all 
 active regions.
 Kernel panic - not syncing: ocfs2 is very sorry to be fencing this 
 system by panicing

 It is my understanding that OCFS2 expects the only available heartbeat 
 to be on disk, on the same disk that I am writing to?

 Is there any way like with other clustering setups to setup a 
 different or even multiple heartbeats?  On a crossover between 
 servers, or on a private interface?
 Seems like putting it only on the disk, that may have heavy IO is 
 going to cause problems.

 Any advice on setting up the heartbeats would be greatly appreciated.

 Thanks,

 -JPH



Re: [Ocfs2-users] ocfs2_search_chain: Group Descriptor has bad signature

2006-07-31 Thread Sunil Mushran


What version of ocfs2 is on the nodes? Do modinfo ocfs2 on all nodes.

The version of OCFS2 shipped with SLES9 SP3 varies with kernel.
Are you using the modules shipped by suse or building them yourself?

Vladan Gunjic wrote:

I've got a strange issue with the following configuration:

Using Oracle 10gR2, having EMC CX500 with FC drives and 2 LUNs
configured (one RAID5, one RAID1/0). We have 5 node ocfs2 cluster (4
nodes are SLES9 SP3 64-bit, kernel 2.6.5-7.252-smp, one node is SLES9
SP3 32-bit, 2.6.5-7.257-bigsmp). On all machines latest available OCFS2
is installed (RPMs: ocfs2console-1.2.1-4.2, ocfs2-tools-1.2.1-4.2).
As we currently have Oracle 10gR2 on other 32-bit machines, we
wanted to migrate two such machines into Oracle RAC while using our new
SAN as the storage behind it. Therefore I made ocfs2 filesystems on the two
LUNs (from the 64-bit machines) and connected all five machines in an
OCFS2 cluster.
- The 32-bit machine mounts both LUNs (and acts as a standby for our
other existing production Oracle databases, unrelated to the 5 machines
described here).
- 2 64-bit machines are mounting one of the LUNs (RAID5) and they are
one of the two Oracle RACs.
- 2 more 64-bit machines are mounting one of the LUNs (RAID1/0) and they
are one of the two Oracle RACs.

As we want to avoid a long downtime for the switch, the idea is to use
the 32-bit standbys, convert them to 64-bit and use them under the 64-bit
Oracle RACs. We tested this scenario and it worked well.
Now we made the final layout of the SAN (more disks in LUNs, etc.) and
during the standby build one of the LUNs was suddenly mounted read-only
and I got the following in dmesg:

OCFS2: ERROR (device emcpowere1): ocfs2_search_chain: Group Descriptor #
0 has bad signature File system is now read-only due to the potential of
on-disk corruption. Please run fsck.ocfs2 once the file system is
unmounted.
(9727,3):ocfs2_claim_suballoc_bits:1157 ERROR: status = -5
(9727,3):ocfs2_claim_clusters:1392 ERROR: status = -5
(9727,3):ocfs2_local_alloc_new_window:852 ERROR: status = -5
(9727,3):ocfs2_local_alloc_slide_window:959 ERROR: status = -5
(9727,3):ocfs2_reserve_local_alloc_bits:515 ERROR: status = -5
(9727,3):ocfs2_reserve_clusters:592 ERROR: status = -5
(9727,3):ocfs2_extend_file:836 ERROR: status = -5
(9727,3):ocfs2_write_lock_maybe_extend:689 ERROR: status = -5
(9727,3):ocfs2_write_lock_maybe_extend:693 ERROR: Failed to extend inode
262690 from 0 to 512

After umounting and fsck I found a lot of errors:

Checking OCFS2 filesystem in /dev/emcpowere1:
  label:  NONE
  uuid:   19 a2 94 f5 91 5d 4c ca be 2f c2 51 21 65 6e 2c
  number of blocks:   175172744
  bytes per block:4096
  number of clusters: 21896593
  bytes per cluster:  32768
  max slots:  4
Pass 0a: Checking cluster allocation chains
[CHAIN_LINK_MAGIC] Chain 85 in allocator at inode 23 contains a
reference at depth 1 to block 84639744 which doesn't have a valid
checksum.  Truncate this chain? y
[CHAIN_BITS] Chain 85 in allocator inode 23 has 64716 bits marked free
out of 96768 total bits but the block groups in the chain have 206 free
out of 32256 total.  Fix this by updating the chain record? y
[CHAIN_LINK_MAGIC] Chain 113 in allocator at inode 23 contains a
reference at depth 2 to block 154570752 which doesn't have a valid
checksum.  Truncate this chain? y
[CHAIN_BITS] Chain 113 in allocator inode 23 has 64509 bits marked free
out of 96768 total bits but the block groups in the chain have 32254
free out of 64512 total.  Fix this by updating the chain record? y
[CHAIN_LINK_MAGIC] Chain 241 in allocator at inode 23 contains a
reference at depth 0 to block 62189568 which doesn't have a valid
checksum.  Truncate this chain? y
[CHAIN_BITS] Chain 241 in allocator inode 23 has 64510 bits marked free
out of 64512 total bits but the block groups in the chain have 0 free
out of 0 total.  Fix this by updating the chain record? y
[CHAIN_GROUP_BITS] Allocator inode 23 has 6215157 bits marked used out
of 21896593 total bits but the chains have 6215152 used out of 21735313
total.  Fix this by updating the inode counts? y
[CHAIN_I_CLUSTERS] Allocator inode 23 has 21735313 clusters represented
in its allocator chains but has an i_clusters value of 21896593. Fix
this by updating i_clusters? y
[CHAIN_I_SIZE] Allocator inode 23 has 21735313 clusters represented in
its allocator chain which accounts for 712222736384 total bytes, but its
i_size is 717507559424. Fix this by updating i_size? y
[GROUP_EXPECTED_DESC] Block 62189568 should be a group descriptor for
the bitmap chain allocator but it wasn't found in any chains.
Reinitialize it as a group desc and link it into the bitmap allocator?
y
[GROUP_EXPECTED_DESC] Block 84639744 should be a group descriptor for
the bitmap chain allocator but it wasn't found in any chains.
Reinitialize it as a group desc and link it into the bitmap allocator?
y
[GROUP_EXPECTED_DESC] Block 124895232 should be a group descriptor for
the bitmap chain allocator but it wasn't found in

Re: [Ocfs2-users] Question

2006-08-01 Thread Sunil Mushran


Just create a one node cluster.

However, if you were to mount two mirrored volumes on the same node,
you will have problems as detailed in this thread:
http://oss.oracle.com/pipermail/ocfs2-users/2006-July/000630.html

Thanks to Andre, the next drop of ocfs2-tools will have a fix for this
(ability to change the uuid).

J Angel Villegas wrote:

Hi everybody,

I am new in the ocfs2 technology,

I have a cluster installed, and I have EMC clones for backup purposes.

Now I want to mount the disk cloned by EMC on another machine (not in the
cluster). Do I need to make another cluster for this purpose? What are the
best practices to mount the disks (ocfs2 filesystems) on another machine?
Is it possible?

I think that the disk have the same information of the original disk with
the ocfs2 FS created..

Regards,

 




Re: [Ocfs2-users] re: question on adding a node to RAC cluster and o2cb

2006-08-01 Thread Sunil Mushran


When you added the new node using ocfs2console, did it show up in:
# ls /config/cluster/clustername/node/

I am assuming that it was added in /etc/ocfs2/cluster.conf.

Yes, the docs do not cover this as of now. I will update the
FAQ/user's guide with the info.

Peter Santos wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Folks,
I'm trying to find information about how to dynamically add
a 2nd node to a 1 node RAC cluster.  I'm posting this only after not 
getting the
details from my oracle tar via metalink.

My installation is Suse Enterprise 9 x86_64 (kernel 267).

Installing the single node was not a problem; what is not clear is how
to prepare the cluster.conf file and the ocr stuff to add a 2nd or
additional node. Obviously the 2nd node has to have all the ip
configurations in place and ssh has to be working, but at some point,
the /etc/ocfs2/cluster.conf file has to be modified and propagated and
the ocfs2 mount point has to be mounted on the additional nodes. This is
where we had problems.

Here is what we did.
1. We set up the 2nd node with all the proper network configuration and
   ssh equivalence.
2. We added a 2nd node to cluster.conf via ocfs2console and propagated
   that to the new node.
3. We tried to mount the ocfs2 mount point, but could not. It said
   something like transpoint end point not found
4. We then restarted the cluster on node1 and were able to mount the
   ocfs2 mount point and go on to add the 2nd node.

We are trying to identify the sequence of actions/procedures to add a 
2nd node at the o2cb/ocfs2 level.

Oracle support didn't have this level of detail, so I'm hoping someone 
knows how to do this without
shutting down the cluster on node1

thanks

-peter

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] re: question on adding a node to RAC cluster and o2cb

2006-08-01 Thread Sunil Mushran


The real error was the one you got when you were not able to add the new
node from node1. That is an ocfs2console problem. That it did not work
after you added the node from node2 and propagated is explainable.

When you get the third node, do the following:

1. On the existing two nodes, add the new node by hand by executing this
(on both):
# o2cb_ctl -C -i -n NODENAME -t node -a number=NODENUM -a ip_address=IPADDR \
    -a ip_port= -a cluster=CLUSTERNAME

2. By doing so, you are not only adding the node to /etc/ocfs2/cluster.conf
but also activating it (/config/cluster/CLUSTERNAME/node).

3. Either propagate or hand-copy the cluster.conf to the new node.

4. Start the cluster on the new node and then mount.
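
When the command from step 1 has to be run on several existing nodes, it can
help to assemble it in one place. The following is only a minimal Python
sketch, not part of ocfs2-tools: the function names and the sample values are
made up for illustration, and it uses only the o2cb_ctl options that appear in
step 1. The port is a required argument because its value is not given above.

```python
import shlex


def build_add_node_cmd(name, number, ip_address, ip_port, cluster):
    """Return the argv list for the o2cb_ctl call shown in step 1.

    Only options that appear in the message are used. Every value is
    supplied by the caller; nothing is defaulted or guessed.
    """
    if not str(ip_port).strip():
        raise ValueError("ip_port must be given explicitly")
    return [
        "o2cb_ctl", "-C", "-i",
        "-n", name,
        "-t", "node",
        "-a", f"number={number}",
        "-a", f"ip_address={ip_address}",
        "-a", f"ip_port={ip_port}",
        "-a", f"cluster={cluster}",
    ]


def as_shell(argv):
    """Quote an argv list so it can be pasted into a root shell."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Hypothetical values; "PORT" stands for the port your cluster uses.
    cmd = build_add_node_cmd("node3", 2, "192.0.2.13", "PORT", "mycluster")
    print(as_shell(cmd))
```

The printed command still has to be executed on each node that is already in
the cluster, as step 1 says; the sketch does not run anything itself.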

Peter Santos wrote:

I don't know what the entries looked like in /config/cluster/clustername/node/
when we tried this.

Now it does show both nodes, but we have since restarted the entire cluster in
order to get this to work. We are waiting to get another new machine to try it
again.

What I do remember is that initially we started up ocfs2console from node1 and
clicked add to add a 2nd node, and the tool complained (I can't remember the
exact error message now).

Then we tried to run ocfs2console from the new/2nd node and added both node1
and node2 to the configuration. Then we clicked propagate. This worked without
any error messages, but we were not able to mount the ocfs2 filesystem on node2
until we restarted the cluster on node1 (transport endpoint errors).

We will definitely try again on a 3rd node. I'm just not clear on what the
sequence of events should be.

thanks
peter



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: AW: [Ocfs2-users] ocfs2_search_chain: Group Descriptor has bad signature

2006-08-01 Thread Sunil Mushran


The ocfs2 version should be the same on all the nodes. Mixing nodes
with 1.1.8 and 1.2.1 will cause problems. We fixed a lot of issues
in 1.2.1. I'll write more once I have reread your previous email.
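
A quick way to spot mixed versions is to compare the version line of
modinfo ocfs2 from every node. The sketch below is illustrative Python only:
the host names are made up, and the sample lines are copied from the modinfo
output quoted further down. How the output is collected from each node is left
out.

```python
def ocfs2_versions(modinfo_by_host):
    """Map each host to the version token from its `modinfo ocfs2` output."""
    versions = {}
    for host, text in modinfo_by_host.items():
        for line in text.splitlines():
            key, sep, value = line.partition(":")
            tokens = value.split()
            if sep and key.strip() == "version" and tokens:
                # Keep only the first token, e.g. "1.2.1-SLES".
                versions[host] = tokens[0]
                break
    return versions


def is_mixed(versions):
    """True when the hosts do not all report the same version."""
    return len(set(versions.values())) > 1


if __name__ == "__main__":
    # Hypothetical host names; version lines as quoted in this thread.
    sample = {
        "hostA": "license:GPL\nversion:1.2.1-SLES AC2C92855997647E2A862F0\n",
        "hostB": "license:GPL\nversion:1.1.8-SLES E9BF6AA66857FAE88EF441B\n",
    }
    found = ocfs2_versions(sample)
    print(found, "mixed" if is_mixed(found) else "consistent")
```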

Vladan Gunjic wrote:

I'm using ocfs2 and all modules from Suse (SLES9), no self compilations.
Here are the details:

* 32-bit machine (writing to the ocfs2 partition/LUN, and where the corruption
  was reported):
Kernel:       2.6.5-7.257-bigsmp #1 SMP i686 i386 GNU/Linux
OCFS2 rpms:   ocfs2console-1.2.1-4.2
              ocfs2-tools-1.2.1-4.2
o2cb_ctl -V:  o2cb_ctl version 1.2.1
/etc/init.d/o2cb status:
Module configfs: Loaded
Filesystem configfs: Mounted
Module ocfs2_nodemanager: Loaded
Module ocfs2_dlm: Loaded
Module ocfs2_dlmfs: Loaded
Filesystem ocfs2_dlmfs: Mounted
Checking cluster dbrac: Online
Checking heartbeat: Active
/etc/init.d/ocfs2 status:
Configured OCFS2 mountpoints:  /mnt/emcpowera1 /mnt/emcpowere1
Active OCFS2 mountpoints:  /mnt/emcpowera1 /mnt/emcpowere1

* 2 identical 64-bit machines (that are supposed to use the data after the
  32-to-64-bit conversion):
Kernel:       2.6.5-7.257-smp #1 SMP x86_64 GNU/Linux
OCFS2 rpms:   ocfs2console-1.2.1-4.2
              ocfs2-tools-1.2.1-4.2
o2cb_ctl -V:  o2cb_ctl version 1.2.1
/etc/init.d/o2cb status:
Module configfs: Loaded
Filesystem configfs: Mounted
Module ocfs2_nodemanager: Loaded
Module ocfs2_dlm: Loaded
Module ocfs2_dlmfs: Loaded
Filesystem ocfs2_dlmfs: Mounted
Checking cluster dbrac: Online
Checking heartbeat: Active
/etc/init.d/ocfs2 status:
Configured OCFS2 mountpoints:  /mnt/emcpowerd1
Active OCFS2 mountpoints:  /mnt/emcpowerd1
(the other 2 64-bit machines have the other LUN from the 32-bit machine mounted)

modinfo on all 5 machines:

1. (32-bit)
license:      GPL
author:       Oracle
version:      1.2.1-SLES AC2C92855997647E2A862F0
description:  OCFS2 1.2.1-SLES Thu Apr 20 18:03:18 PDT 2006 (build sles)
depends:      ocfs2_nodemanager,ocfs2_dlm,jbd
supported:    yes
vermagic:     2.6.5-7.257-bigsmp SMP PENTIUMII REGPARM gcc-3.3

== The next 2 machines are mounting the LUN that was corrupted (they will be
one Oracle RAC):
2. (64-bit)
license:      GPL
author:       Oracle
version:      1.2.1-SLES AC2C92855997647E2A862F0
description:  OCFS2 1.2.1-SLES Thu Apr 20 18:03:18 PDT 2006 (build sles)
depends:      ocfs2_nodemanager,ocfs2_dlm,jbd
supported:    yes
vermagic:     2.6.5-7.257-smp SMP gcc-3.3

3. (64-bit)
license:      GPL
author:       Oracle
version:      1.2.1-SLES AC2C92855997647E2A862F0
description:  OCFS2 1.2.1-SLES Thu Apr 20 18:03:18 PDT 2006 (build sles)
depends:      ocfs2_nodemanager,ocfs2_dlm,jbd
supported:    yes
vermagic:     2.6.5-7.257-smp SMP gcc-3.3

== The next 2 machines are mounting the LUN that was NOT corrupted (they will
be another Oracle RAC):
4. (64-bit)
license:      GPL
author:       Oracle
version:      1.1.8-SLES E9BF6AA66857FAE88EF441B
description:  OCFS2 1.1.8-SLES Tue Dec 13 18:20:37 PST 2005 (build sles)
depends:      ocfs2_nodemanager,ocfs2_dlm,jbd
supported:    yes
vermagic:     2.6.5-7.252-smp SMP gcc-3.3

5. (64-bit)
license:      GPL
author:       Oracle
version:      1.1.8-SLES E9BF6AA66857FAE88EF441B
description:  OCFS2 1.1.8-SLES Tue Dec 13 18:20:37 PST 2005 (build sles)
depends:      ocfs2_nodemanager,ocfs2_dlm,jbd
supported:    yes
vermagic:     2.6.5-7.252-smp SMP gcc-3.3

Additionally, I noticed last night that when I briefly disabled the complete
network on all of those machines and then restored it, the last two machines
(older ocfs2 version) were confused and didn't rejoin the cluster until the
systems were rebooted.

So I guess the first step is to update the last two to ocfs2 version 1.2.1?
Although they were not directly involved in the corruption, maybe indirectly?

Thanks,
Vladan


-Original Message-
From: Sunil Mushran [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 1, 2006 04:29
To: Vladan Gunjic
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] ocfs2_search_chain: Group Descriptor has bad
signature

What version of ocfs2 is on the nodes? Do modinfo ocfs2 on all nodes.

The version of OCFS2 shipped with SLES9 SP3 varies with kernel.
Are you using the modules shipped by suse or building them yourself?

  


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] o2net: connect to node has been idle for 10 secs

2006-08-03 Thread Sunil Mushran


1. o2net talks TCP. It should be able to handle this.
2. If the cluster is active and the nodes are communicating,
the keepalive packet is rarely sent. It only sends the packet
if it does not hear from the other node for 5 secs.
3. Try the same with 1.2.3. (We made two important one-line fixes.)
4. If this does happen again, and you are interested, we
could always give you a drop that dumps the stack of
all the procs, to get a better feel for the situation.
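
For readers following the numbers, the gap between the "tmr" and "now" values
in the o2net_idle_timer line quoted below can be checked with a few lines of
Python. This is only arithmetic on the logged values; what each field means
inside o2net is not stated in this thread, so the sketch makes no claim about
that. The two constants are the ones quoted from tcp_internal.h.

```python
from decimal import Decimal

# Constants as quoted from ocfs2/cluster/tcp_internal.h in this thread.
O2NET_KEEPALIVE_DELAY_SECS = 5
O2NET_IDLE_TIMEOUT_SECS = 10


def idle_gap(tmr, now):
    """Difference between two epoch timestamps given as strings."""
    return Decimal(now) - Decimal(tmr)


if __name__ == "__main__":
    # Values copied from the o2net_idle_timer line in the report.
    gap = idle_gap("1154545576.798263", "1154545586.796978")
    print(f"now - tmr = {gap} s")  # 9.998715
    # Keepalive intervals that fit into one idle timeout.
    print(O2NET_IDLE_TIMEOUT_SECS // O2NET_KEEPALIVE_DELAY_SECS)  # 2
```

The gap is just under the 10 second idle timeout, which matches the
"has been idle for 10 seconds" message.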

Andy Phillips wrote:

Hello,

   Apologies for following up on myself.

in ocfs2/cluster/tcp_internal.h
#define O2NET_KEEPALIVE_DELAY_SECS  5
#define O2NET_IDLE_TIMEOUT_SECS 10


   Is this really sensible? Given a small variance in system clocks, and
assuming that o2net_sc_send_keep_req is the only thing keeping the
connection alive, the loss of a single keepalive packet could cause a
node to self-fence and reboot.

   Would
#define O2NET_KEEPALIVE_DELAY_SECS  5
#define O2NET_IDLE_TIMEOUT_SECS 20

   Cause any problems?

   Andy



On Thu, 2006-08-03 at 12:41 +0100, Andy Phillips wrote:
  

Hello,

   I've a two-node 10gR2 RAC cluster on a pair of Sun Opteron boxes:
Redhat AS 4.3, 2.6.9-34.0.1.ELsmp x86_64, ocfs2 1.2.2. RAC is using
ASM to talk to the data files, but we have 3 ocfs2 filesystems up
to share DBA files, and the usual bits and bobs.

   Things were fine until, on a mostly idle system, this happened out
of the blue:

Aug  2 19:06:27 fred kernel: o2net: connection to node barney (num 0) at
172.16.6.10: has been idle for 10 seconds, shutting it down.
Aug  2 19:06:27 fred kernel: (0,7):o2net_idle_timer:1309 here are some
times that might help debug the situation: (tmr 1154545576.798263 now
1154545586.796978 dr 1154545576.798238 adv
1154545576.798291:1154545576.798293 func (06aac8a1:1)
1154545566.800782:1154545566.800787)
Aug  2 19:06:27 fred kernel: o2net: no longer connected to node barney
(num 0) at 172.16.6.10:
Aug  2 19:08:33 fred kernel: (25,7):o2quo_make_decision:143 ERROR:
fencing this node because it is connected to
a half-quorum of 1 out of 2 nodes which doesn't include the lowest
active node 0
Aug  2 19:08:33 fred kernel: (25,7):o2hb_stop_all_regions:1908 ERROR:
stopping heartbeat on all active regions.

   And the node then halted. 

   Barney is node 0. The systems were idle. We've hammered the ocfs2
file systems, and set o2cb_heartbeat_threshold to 61. All is good and
stable under heavy I/O.

   The interconnect is a bonded interface, with two gig cards, each
connected (with flow control on) to two separate FESX424 switches.
The switches don't register any problems at this time, nor does Linux
register any interface issues.

   I'm looking at the source code at the moment, but nothing is leaping
out at me. Any ideas? Do the timer debug lines above mean anything to
anyone?

  Thanks
   Andy 



   
  




___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Re: Problems with OCFS2 and Oracle 10g

2006-08-04 Thread Sunil Mushran


ocfs2 requires a shared disk. As in, all nodes must be able to concurrently
read/write to the device.

sorapak Last wrote:

Yes. my disk is an IDE. Would it cause the problems?

Thanks
Sorapak



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
  


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] o2net: connect to node has been idle for 10 secs

2006-08-07 Thread Sunil Mushran


Alexei_Roudnev wrote:

In my case, after spending a few days on it, I found that my HugeTLB setting
(in Oracle) caused a long kernel loop, which forced OCFSv2 to reboot because
the connection was lost.
  

I am keen to hear more about this. Please could you elaborate.


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Installing ocfs2-tools from source?

2006-08-14 Thread Sunil Mushran


Do make rpm instead.

Change Copyright to License in the spec file first, and then run make rpm.

I built the following for fc5/x86.
http://oss.oracle.com/~smushran/.fc5-rpms/

Eric Adair wrote:

building on fedora core 5, kernel 2.6.16.-1.2133.FC5smp

Everything builds fine, but I can't find a means to make install.

Obviously, I'm noob-ing up the place here. What am I missing? 
( ocfs2-tools-1.2.1 tarball is being used).



-Eric

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] OCFS- EMC Issue

2006-08-14 Thread Sunil Mushran


# cd /tmp
# wget http://oss.oracle.com/~smushran/.debug/stat_sysdir.sh
# ./stat_sysdir -d sdX sys.out

Email me the output.

amit pansare wrote:



I’ve an issue related to Oracle 10g RAC.
I’ve 2 node cluster each being Dell 2850 Server with RHEL 4.0
I’ve EMC CX300 SAN storage with following partitions

/orasoft 10 Gb OCFS2 File system
/oracrs 2 Gb OCFS2 File system
/orabackup 100 Gb OCFS2 File system

The datafiles are on ASM which is not directly visible in OS.
I’ve common Oracle Home installed in /orasoft/db_1 which is shared by 
both nodes in cluster.

I recently ran into an issue related to the EMC storage.
The /orasoft partition shows 1.4 GB of space available in df, but
whenever I try to create a file on this partition I get the error
"No space left on device". I’m unable to start any service for the same
reason, while I am able to use this partition from the other node in the
cluster.


Can anyone help me with this storage issue ??

Regards ,
Amit Pansare
DBA
Net Magic Solutions



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] cfq scheduler?

2006-08-14 Thread Sunil Mushran


U4 has the fix.

We've already tested U2 (and U3) plus the fix internally, so we don't feel
the need to rerun the same test.

Brian Long wrote:

Has anyone at Oracle tested the RHEL 4.4 beta or GA kernel to verify the
cfq scheduler is fixed wrt. OCFS2?  Or will that testing only begin now
that U4 is GA?

http://oss.oracle.com/bugzilla/show_bug.cgi?id=671

Thanks for any info.

/Brian/
  


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] re: Process to change cluster.conf IPS ?

2006-08-16 Thread Sunil Mushran


# o2cb_ctl -H -t node -n node_name -a ip_address=NEW_IP_ADD
o2cb_ctl: Node changes not yet supported

The man page is missing -t node, but adding it will still not help you.
Currently o2cb_ctl only allows new nodes to be added dynamically. It cannot
update existing nodes.

So, edit /etc/ocfs2/cluster.conf and change the IP address.
Copy it to all nodes before restarting the cluster on all nodes.

# cat /config/cluster/clus/node/nodename/ip_address
should show the updated value after the cluster is up.
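
If the edit has to be repeated for several nodes, it can be scripted. The
following is a rough Python sketch and not an ocfs2 tool: the function name is
made up, the addresses and names in the sample are placeholders, and the
stanza layout (an unindented "node:" header followed by indented key = value
lines) is an assumption about cluster.conf that should be checked against a
real file before use.

```python
def set_node_ip(conf_text, node_name, new_ip):
    """Return conf_text with ip_address replaced in the stanza of node_name."""
    lines = conf_text.splitlines()
    # A stanza starts at an unindented "something:" line.
    headers = [i for i, line in enumerate(lines)
               if line.strip().endswith(":") and not line[0].isspace()]
    changed = False
    for start, end in zip(headers, headers[1:] + [len(lines)]):
        if lines[start].strip() != "node:":
            continue
        fields = {}
        for i in range(start + 1, end):
            key, sep, value = lines[i].partition("=")
            if sep:
                fields[key.strip()] = (i, value.strip())
        if fields.get("name", (None, None))[1] != node_name:
            continue
        if "ip_address" not in fields:
            continue
        idx = fields["ip_address"][0]
        # Preserve the original indentation of the line being replaced.
        indent = lines[idx][: len(lines[idx]) - len(lines[idx].lstrip())]
        lines[idx] = f"{indent}ip_address = {new_ip}"
        changed = True
    if not changed:
        raise KeyError(f"no node stanza named {node_name!r}")
    return "\n".join(lines) + "\n"


# Made-up sample in the assumed layout; all values are placeholders.
SAMPLE = (
    "node:\n\tip_port = 7777\n\tip_address = 192.0.2.1\n"
    "\tnumber = 0\n\tname = node1\n\tcluster = mycluster\n\n"
    "node:\n\tip_port = 7777\n\tip_address = 192.0.2.2\n"
    "\tnumber = 1\n\tname = node2\n\tcluster = mycluster\n\n"
    "cluster:\n\tnode_count = 2\n\tname = mycluster\n"
)

if __name__ == "__main__":
    print(set_node_ip(SAMPLE, "node2", "198.51.100.2"))
```

The result still has to be copied to every node and the cluster restarted, as
described above; the sketch only rewrites the text.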

Peter Santos wrote:

Sunil,
The link you pointed me to just says to stop the cluster, modify the
cluster.conf and restart the cluster. Is that all that needs to be done?
The reason I ask is that I found this URL,
http://manpage.willempen.org/8/o2cb_ctl, with the man pages for o2cb_ctl, and
it says that you can do this: o2cb_ctl -H -n node_name -a
ip_address=NEW_IP_ADDRESS. However, every time I tried it, I kept getting the
error "invalid attribute".

Can you please just confirm the proper way? I just want to be sure that
editing the cluster.conf file is enough to update all the proper locations
where that IP may exist.


-peter




Sunil Mushran wrote:
  

http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#CONFIGURE


Peter Santos wrote:

Folks,
I have a simple 2-node 10gR2 RAC cluster. Each node has a public, a private
and a virtual IP. We moved the network to a different subnet and now I need to
figure out how to make the changes visible to ocfs2 and its services,
including making the changes to cluster.conf.

I suspect that simply changing the IP addresses in cluster.conf is not enough?
My cluster.conf files have the nodes' physical IP addresses. I'm not even sure
if they should have the private interconnect IP or not, but that is a
different issue.

Can someone point me to the correct procedure?



TIA
-peter




___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] OCFS2 over DRBDv8

2006-08-17 Thread Sunil Mushran


As far as ocfs2 is concerned, bio_add_page() is failing. The one thing that
springs to mind is that o2hb sets bio->bi_sector in 512-byte units and not
in the block size.

Kilian CAVALOTTI wrote:

Hi all,

I'm new to OCFS2, but not so new to DRBD. I'd like to use the new
primary/primary feature of DRBDv8 to create a shared storage space and
concurrently access it from multiple clients, using OCFS2.

I configured two hosts with DRBD, allowed two primaries, and successfully
made each partition primary.

# cat /proc/drbd
version: 8.0pre4 (api:84/proto:82)
SVN Revision: 2375M build by [EMAIL PROTECTED], 2006-08-17 15:54:17
 0: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate r---
    ns:0 nr:1398278 dw:1398278 dr:98 al:0 bm:1895 lo:0 pe:0 ua:0 ap:0
        resync: used:0/7 hits:86007 misses:1381 starving:0 dirty:0 changed:1381
        act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0
 1: cs:Unconfigured

I tried to format the volume with a traditional filesystem, and
successfully mounted it on both nodes.

I then tried with ocfs2. On the first node, mkfs and mount went without a
hitch, but on the second one, I systematically get an error when I try to
do anything on the volume (fsck'ing, starting ocfs2-heartbeat, mounting,
etc.). dmesg shows the following:


drbd0: role( Secondary -> Primary )
drbd0: Writing meta data super block now.
(6672,0):o2hb_setup_one_bio:290 ERROR: Error adding page to bio i = 1,
vec_len = 4096, len = 0, start = 0
(6672,0):o2hb_read_slots:385 ERROR: status = -5
(6672,0):o2hb_populate_slot_data:1279 ERROR: status = -5
(6672,0):o2hb_region_dev_write:1379 ERROR: status = -5


It seems that the heartbeat process can't write to the device, for an
unknown reason:

open("/sys/kernel/config/cluster", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 4
fstat(4, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
fcntl(4, F_SETFD, FD_CLOEXEC)   = 0
getdents64(4, /* 3 entries */, 4096)= 88
getdents64(4, /* 0 entries */, 4096)= 0
close(4)= 0
mkdir("/sys/kernel/config/cluster/ocfs2_cluster/heartbeat/D6F76726AFE4472CBF0650A1FEF09135", 0755) = 0
open("/sys/kernel/config/cluster/ocfs2_cluster/heartbeat/D6F76726AFE4472CBF0650A1FEF09135/block_bytes", O_WRONLY) = 4
write(4, "512", 3)  = 3
close(4)= 0
open("/sys/kernel/config/cluster/ocfs2_cluster/heartbeat/D6F76726AFE4472CBF0650A1FEF09135/start_block", O_WRONLY) = 4
write(4, "2176", 4) = 4
close(4)= 0
open("/sys/kernel/config/cluster/ocfs2_cluster/heartbeat/D6F76726AFE4472CBF0650A1FEF09135/blocks", O_WRONLY) = 4
write(4, "255", 3)  = 3
close(4)= 0
open("/dev/drbd0", O_RDWR)  = 4
open("/sys/kernel/config/cluster/ocfs2_cluster/heartbeat/D6F76726AFE4472CBF0650A1FEF09135/dev", O_WRONLY) = 5
write(5, "4", 1)= -1 EIO (Input/output error)
close(5)= 0
close(4)= 0
rmdir("/sys/kernel/config/cluster/ocfs2_cluster/heartbeat/D6F76726AFE4472CBF0650A1FEF09135") = 0
semop(0, 0x7fff930bfe30, 1) = 0
close(3)= 0
write(2, "mkfs.ocfs2", 10mkfs.ocfs2)  = 10
write(2, ": ", 2: )   = 2
write(2, "I/O error on channel", 20I/O error on channel)= 20
write(2, " ", 1 )= 1
write(2, "while initializing the dlm", 26while initializing the dlm) = 26
write(2, "\r\n", 2

I can't figure out whether it's a DRBD- or an OCFS2-related issue, and I'd
be grateful for any enlightenment.

BTW, I use amd64, the Debian-provided 2.6.17 kernel, drbd8-module-source
8.0pre4-1 (I tried SVN trunk too), and ocfs2-tools 1.2.1-1.


Thanks in advance, 
  


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Wrong dm device used

2006-08-29 Thread Sunil Mushran


Well, mounted.ocfs2 is dumb... as in, it just scans /proc/partitions.
We have to teach it new tricks. :)
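
Until the tool gets smarter, the listing can be post-processed to see which
device nodes carry the same volume. The Python sketch below is only an
illustration: the function name is made up, and the rows are taken from the
mounted.ocfs2 -d output quoted below, with the columns separated by spaces.

```python
from collections import defaultdict

# Rows from the quoted `mounted.ocfs2 -d` output, columns split by spaces.
ROWS = """\
/dev/dm-6 ocfs2 c1a56afe-3d4b-4b88-919c-b9454b1ec708 cache
/dev/dm-7 ocfs2 c1a56afe-3d4b-4b88-919c-b9454b1ec708 cache
/dev/dm-8 ocfs2 0663bfeb-60ad-400a-8c1a-61156772eebc photos
/dev/dm-14 ocfs2 e2533760-1c3f-4f7a-886f-8769e73f1088 photos
/dev/dm-15 ocfs2 e2533760-1c3f-4f7a-886f-8769e73f1088 photos
"""


def devices_by_uuid(text):
    """Group device names by the UUID column (device, fs, uuid, label)."""
    groups = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[1] == "ocfs2":
            groups[parts[2]].append(parts[0])
    return dict(groups)


if __name__ == "__main__":
    for uuid, devices in devices_by_uuid(ROWS).items():
        print(uuid, ", ".join(devices))
```

In this listing /dev/dm-8 is the only device whose UUID is not shared with
another node, and it is also the one reported as corrupted by -f.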

Fabio Corazza wrote:

Hi there,
 I've just setup an EVMS cluster with Heartbeat 2.0.7 and OCFS2.

Everything seems to be working fine except this:

[EMAIL PROTECTED] photos]# mounted.ocfs2 -d
Device      FS     UUID                                  Label
/dev/dm-6   ocfs2  c1a56afe-3d4b-4b88-919c-b9454b1ec708  cache
/dev/dm-7   ocfs2  c1a56afe-3d4b-4b88-919c-b9454b1ec708  cache
/dev/dm-8   ocfs2  0663bfeb-60ad-400a-8c1a-61156772eebc  photos
/dev/dm-14  ocfs2  e2533760-1c3f-4f7a-886f-8769e73f1088  photos
/dev/dm-15  ocfs2  e2533760-1c3f-4f7a-886f-8769e73f1088  photos

[EMAIL PROTECTED] photos]# mounted.ocfs2 -f
Device      FS     Nodes
/dev/dm-6   ocfs2  mybbook-as01, mybbook-as02
/dev/dm-7   ocfs2  mybbook-as01, mybbook-as02
/dev/dm-8   ocfs2  Unknown: OCFS2 directory corrupted
/dev/dm-14  ocfs2  mybbook-as01, mybbook-as02
/dev/dm-15  ocfs2  mybbook-as01, mybbook-as02
[EMAIL PROTECTED] photos]#

The same in the other node.


I tried to reboot, to run dmsetup delete_all restart evms... nothing
happens. That dm-8 still hangs there. Everything else is fine... what
could it be? The filesystems seem to work correctly.



Regards,

  


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Wrong dm device used

2006-08-30 Thread Sunil Mushran

If you are dealing with creating/removing lots of small files, a large
journal will help. Currently the only way to size it is trial and error.
We'll look into making this easier, but right now there is no other way.

Pick the largest value of all the subcomponents for the hb timeout.
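
As a worked example of that rule, the Python sketch below takes the largest
timeout in the I/O stack and converts it to a threshold. The conversion used,
threshold = timeout / 2 + 1, is how I recall the FAQ formula of this period;
treat it as an assumption and check the FAQ for your release. The component
names in the sample are labels made up for illustration; the 180 is the
ConnFailTimeout value mentioned in the question.

```python
def heartbeat_threshold(timeouts_secs):
    """Threshold for the largest timeout, assuming timeout / 2 + 1."""
    worst = max(timeouts_secs)
    return worst // 2 + 1


def timeout_for_threshold(threshold):
    """Inverse of the assumed formula: seconds covered by a threshold."""
    return (threshold - 1) * 2


if __name__ == "__main__":
    # Hypothetical component list; only the 180 comes from the thread.
    stack = {"iscsi ConnFailTimeout": 180, "other layer (made up)": 60}
    print(heartbeat_threshold(stack.values()))  # 91
    print(timeout_for_threshold(61))            # 120
```

Under the same assumption, a threshold of 7 corresponds to 12 seconds.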

Fabio Corazza wrote:

Sorry for my laziness. I have just read the mkfs.ocfs2 man page and
answered some of my questions myself.

If you can still give me some hints about the block-size and
cluster-size values for the filesystem that I'm going to create, I'd
appreciate it. Also, I'm a little bit curious about the journal size,
and how and why it should be tuned.

I have another question. The FAQ states that I should set
O2CB_HEARTBEAT_THRESHOLD to a value calculated with a specific formula
from the I/O layer timeout. Where can I look to obtain that value?

I'm using the iSCSI Linux initiator with the parameter
ConnFailTimeout=180. I don't know if this has something to do with the I/O
layer timeout. Also, I'm using multipath-tools.


Thanks,
Fabio

Fabio Corazza wrote:
  

OK, so basically the filesystem keeps relying on the EVMS devices even if
ocfs2console or ocfs2-tools detect other devices. Please confirm that this
is correct.

Also, regarding the options I'm given when creating an ocfs2 volume, which
do you suggest for a volume that _only_ stores a LOT of small files (images,
maximum size 3MB each) and a lot of directories? I will have 2 nodes doing
r/w and a third node that will just read (it is the backup server).

[-b block-size] [-C cluster-size] [-N number-of-node-slots] [-T
filesystem-type] [-L volume-label] [-J journal-options] [-HFqvV] device
[blocks-count]

Basically: block-size, cluster-size.

Also, what does number-of-node-slots mean? The maximum number of nodes the
filesystem can be accessed from? I've seen that this defaults to 4. Can
this be expanded after the filesystem is created, or does it have to be
decided up front?

Also, what about journal-options?


Thanks for your attention, highly appreciated.


Fabio


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] self fencing and system panicproblem afterforced reboot

2006-09-15 Thread Sunil Mushran


Yes, we are working on it. :)

Alexei_Roudnev wrote:

It all comes down to the same thing: OCFSv2 needs a 'single node' mount
mode, so that a sysadmin is able to mount a volume despite media errors and
without a working cluster.

(Of course, such a mount should show many warnings before going through.)


- Original Message - 
From: Holger Brueckner [EMAIL PROTECTED]

To: Sunil Mushran [EMAIL PROTECTED]
Cc: ocfs2-users@oss.oracle.com
Sent: Friday, September 15, 2006 1:20 AM
Subject: Re: [Ocfs2-users] self fencing and system panicproblem afterforced
reboot


  

i guess i found the solution. while dumping some files with debugfs, it
suddenly stopped working and could not be killed. and guess what, media
error on the drive :-/. funny that a filesystem check succeeds.

anyway thx a lot to those who responded.

holger

On Thu, 2006-09-14 at 11:03 -0700, Sunil Mushran wrote:


Not sure why a power outage should cause this.

Do you have the full stack of the oops? It will show the times taken
in the last 24 operations in the hb thread. That should tell us as to
what is up.

Holger Brueckner wrote:
  

i just discovered the ls, cd, dump and rdump commands in debugfs.ocfs2.
they work fine :-). nevertheless i would really like to know why mounting
and accessing the volume is not possible anymore.

but thanks for the hint pieter

holger brueckner

On Thu, 2006-09-14 at 14:30 +0200, Pieter Viljoen - MWEB wrote:



Hi Holger

Maybe you should try the fscat tools
(http://oss.oracle.com/projects/fscat/), which have fsls (to list) and
fscp (to copy) working directly from the device.

I have not tried it yet, so good luck!


Pieter Viljoen


-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Holger
Brueckner
Sent: Thursday, September 14, 2006 14:17
To: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] self fencing and system panic problem
afterforced reboot

side note: setting HEARTBEAT_THRESHOLD to 30 did not help either.

could it be that the synchronization between the daemons does not work?
(e.g. daemons think the fs is mounted on some nodes and try to synchronize,
but actually the fs isn't mounted on any node?)

i'm rather clueless now. finding a way to access the data and copy it to
the non shared partitions would help me a lot.

thx

holger brueckner


On Thu, 2006-09-14 at 13:47 +0200, Holger Brueckner wrote:

  

hello,

i'm running ocfs2 to provide a shared disk throughout a xen cluster.
this setup was working fine until today, when there was a power outage
and all xen nodes were forcefully shut down. whenever i try to
mount/access the ocfs2 partition the system panics and reboots:

darks:~# fsck.ocfs2 -y -f /dev/sda4
(617,0):__dlm_print_nodes:377 Nodes in my domain
(5BA3969FC2714FFEAD66033486242B58):
(617,0):__dlm_print_nodes:381  node 0
Checking OCFS2 filesystem in /dev/sda4:
  label:              NONE
  uuid:               5b a3 96 9f c2 71 4f fe ad 66 03 34 86 24 2b 58
  number of blocks:   35983584
  bytes per block:    4096
  number of clusters: 4497948
  bytes per cluster:  32768
  max slots:          4

/dev/sda4 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
[CLUSTER_ALLOC_BIT] Cluster 295771 is marked in the global cluster
bitmap but it isn't in use.  Clear its bit in the bitmap? y
[CLUSTER_ALLOC_BIT] Cluster 2456870 is marked in the global cluster
bitmap but it isn't in use.  Clear its bit in the bitmap? y
[CLUSTER_ALLOC_BIT] Cluster 2683096 is marked in the global cluster
bitmap but it isn't in use.  Clear its bit in the bitmap? y
Pass 2: Checking directory entries.
Pass 3: Checking directory connectivity.
Pass 4a: checking for orphaned inodes
Pass 4b: Checking inodes link counts.
All passes succeeded.
darks:~# mount /data
(622,0):ocfs2_initialize_super:1326 max_slots for this device: 4
(622,0):ocfs2_fill_local_node_info:1019 I am node 0
(622,0):__dlm_print_nodes:377 Nodes in my domain
(5BA3969FC2714FFEAD66033486242B58):
(622,0):__dlm_print_nodes:381  node 0
(622,0):ocfs2_find_slot:261 slot 2 is already allocated to this node!
(622,0):ocfs2_find_slot:267 taking node slot 2
(622,0):ocfs2_check_volume:1586 File system was not unmounted cleanly,
recovering volume.
kjournald starting.  Commit interval 5 seconds
ocfs2: Mounting device (8,4) on (node 0, slot 2) with ordered data mode.
(630,0):ocfs2_replay_journal:1181 Recovering node 2 from slot 0 on
device (8,4)
darks:~# (4,0):o2hb_write_timeout:164 ERROR: Heartbeat write timeout to
device sda4 after 12000 milliseconds
(4,0):o2hb_stop_all_regions:1789 ERROR: stopping heartbeat on all active
regions

Re: [Ocfs2-users] ocfs2 - disk usage inconsistencies

2006-09-20 Thread Sunil Mushran


Another node or that node itself.

As far as the file size goes, ls -l does not give the on-disk size.
Run stat on the inode number of each unlinked file and look at the Clusters
value.
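
In the orphan_dir listing quoted below, each entry name appears to be the
inode number from the first column written in hexadecimal (for example
002e9296 next to 3052182). Assuming that holds in general, the inode numbers
to pass to stat can be recovered from the names alone. The Python sketch
below is only an illustration and the function name is made up.

```python
def orphan_inode_numbers(names):
    """Map orphan_dir entry names (hex strings) to decimal inode numbers."""
    return {name: int(name, 16) for name in names}


if __name__ == "__main__":
    # Two entry names taken from the listing quoted in this thread.
    for name, ino in orphan_inode_numbers(["002e9296", "0041a286"]).items():
        print(name, ino)
```

The two sample names decode to 3052182 and 4301446, which match the inode
column of the listing.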

Matthew Flusche wrote:

There has been a lot of file system activity recently.

I have files in orphan_dir: and orphan_dir:0002.  But that doesn't
seem to account for the 17 GB missing.  The truncate logs seem clean.
So having files in orphan_dir: is telling me that the node in slot 0
deleted files and another node(s) still has the file open, correct?

Matt

debugfs: ls -l //orphan_dir:
16        drwxr-xr-x  13    0    0  774144  10-Sep-2006 00:08  .
10        drwxr-xr-x   6    0    0    4096   2-May-2006 16:11  ..
3052182   drwxrwxrwx   0  501  500    4096  19-Jul-2006 17:50  002e9296
8234094   drwxrwxrwx   0  501  500    4096  19-Jul-2006 17:50  007da46e
13063783  drwxrwxrwx   0  501  500    4096  19-Jul-2006 17:50  00c75667
7869995   drwxrwxrwx   0  501  500    4096  22-Aug-2006 13:27  0078162b
3741473   drwxrwxrwx   0  501  500    4096  22-Aug-2006 13:29  00391721
3351057   drwxrwxrwx   0  501  500    4096  19-Jul-2006 17:50  00332211
7842503   drwxrwxrwx   0  501  500    4096  19-Jul-2006 17:50  0077aac7
2056493   drwxrwxrwx   0  501  500    4096   5-Sep-2006 08:53  001f612d
7861894   drwxrwxrwx   0  501  500    4096   5-Sep-2006 08:53  0077f686
1487817   drwxrwxrwx   0  501  500    4096   5-Sep-2006 08:53  0016b3c9
1702439   drwxrwxrwx   0  501  500    4096   5-Sep-2006 08:53  0019fa27

debugfs: ls -l //orphan_dir:0002
18        drwxr-xr-x   2    0    0   94208   5-Jul-2006 17:13  .
10        drwxr-xr-x   6    0    0    4096   2-May-2006 16:11  ..
4301446   -rw-r--r--   0  503  500       0  12-Aug-2006 10:40  0041a286

-Original Message-
From: Sunil Mushran [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 20, 2006 12:32 PM

To: Matthew Flusche
Cc: Ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] ocfs2 - disk usage inconsistencies

Did you remove some large files recently? If so, check the orphan_dir
and truncate_log for all the slots.

1. Start debugfs:
# debugfs.ocfs2 /dev/sdX

2. List system directory:
  ls -l //

3. List files in all orphan_dir(s):
  ls -l //orphan_dir:

If there are files, it means some process in the cluster is still using
them.

4. stat all truncate_log(s):
  stat //truncate_log:

I will be surprised if you see any bits here. If there are, do
sync;sync;sync; on the appropriate node.

5. You can find the appropriate node by dumping the slotmap:
  slotmap
Find the slot-to-nodenum mapping. Do the sync on that node.

For this and more, refer to the on-disk format support guide.
http://oss.oracle.com/projects/ocfs2/dist/documentation/03-disk_format.pdf
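
Steps 2 to 5 repeat the same requests once per slot, so they can be
generated rather than typed. The sketch below only builds the request
strings; it assumes the per-slot system files carry a four-digit slot
suffix (as in orphan_dir:0002), and the function name and the slot count
of 4 are placeholders.

```python
def slot_requests(max_slots):
    """Build the debugfs.ocfs2 requests from the steps above for every slot.

    Assumption: per-slot system files are named with a zero-padded,
    four-digit slot number, e.g. orphan_dir:0002.
    """
    requests = ["ls -l //"]
    for slot in range(max_slots):
        suffix = f"{slot:04d}"
        requests.append(f"ls -l //orphan_dir:{suffix}")
        requests.append(f"stat //truncate_log:{suffix}")
    requests.append("slotmap")
    return requests


if __name__ == "__main__":
    # Print one request per line; each can be typed at the debugfs: prompt.
    for request in slot_requests(4):
        print(request)
```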

Matthew Flusche wrote:
  

Hi all.

I have a 50 GB OCFS2 file system. I'm currently using ~26GB of space 
but df is reporting 43 GB used. Any ideas how to find out where the 
missing 17GB is at?


The file system was formatted with a 16K cluster  4K block size.

Thanks,

Matt





  


Re: [Ocfs2-users] Use of OCFS2 file systems.

2006-09-29 Thread Sunil Mushran


Yes.

Bill Wells wrote:

All,
  Can someone comment on whether it is recommended to use the OCFS2 
file system for the admin directories of a RAC database.  
Specifically, for bdump, udump, cdump, etc.

This is being considered on RHEL4-U4 with 10gR2 on a 3 node cluster.

Thanks much,
Bill Wells
--

*Bill Wells*, OCP 7,8i,9i,10g
/Principal Service Delivery Engineer
Advanced Customer Services - GEH/
Cell#   (919) 624-6300
Office# (919) 846-8426

email: [EMAIL PROTECTED]

The statements and opinions expressed here are my own and do not 
necessarily represent those of Oracle Corporation.





Re: [Ocfs2-users] Re: FW: Use of OCFS2 file systems.

2006-10-04 Thread Sunil Mushran


File a bug on bugzilla (oss.oracle.com/bugzilla) with the full oops trace
and any other information that seems relevant.

Galan Merchan, Martin wrote:


Hello,

I’m working with OCFS2 on Red Hat Advanced Server 4 Patch 3 and I had
kernel panics too. I use OCFS2 only for RAC archive logs and RMAN backups.


Well, I’m testing one solution and it seems to be fine:

In /etc/ocfs2/cluster.conf I have replaced the public IPs with the
heartbeat IPs (parameter ip_address), but kept the node names.


Does anyone know this solution, and has anyone tested it under failures?

Regards from Spain,

*_MARTÍN_*




Re: [Ocfs2-users] Resizing mountpoint in ocfs2

2006-10-05 Thread Sunil Mushran


Yes, the last patch to add this feature is in review. We will release
this as part of ocfs2-tools 1.2.2.

Kerr-Sheppard, Stephen wrote:


Has anyone had to resize a mountpoint in ocfs2? In ocfs version 1 it
was a case of unmounting and using the resizeocfs command. Is this
still the same for ocfs2?


Thanks

Stephen

*Stephen Kerr-Sheppard*

T +44(0)1908 257469
F +44(0)1908 692791
E [EMAIL PROTECTED]
W http://www.imserv.invensys.com
IMServ Europe Ltd
Scorpio Rockingham Drive Linford Wood Milton Keynes MK14 6LY
Registered in England and Wales No.2749624
Registered Address: Invensys Portland House Stag Place London SW1E 5BF


Re: [Ocfs2-users] 2 Node cluster, and nodes OS hang

2006-10-06 Thread Sunil Mushran

tcpdump -i eth1 -C 10 -W 15 -s 1 -Sw /tmp/`hostname -s`_tcpdump.log -ttt 'port '


Do this on both nodes before mounting on the second node. Ping me with 
the path to the logs.


[EMAIL PROTECTED] wrote:

Hello All,

I have a NAS that I would like to use ocfs2 on. Currently there are three partitions made 
on it, I have included a fdisk listing below. I have created a 2 node cluster. I perform 
a basic mount, mount -t ocfs2 /dev/ndas-00500435:0p3 /media/nas3. When I do 
this on the 2nd node..  after the 5 sec it takes to mount, the 2nd node will completely 
hang after 5 secs. I did see some local iptables blocks from the 2nd node. so I disabled 
the firewall on both nodes, and that did not help.

I am using opensuse 10.1, kernel 2.6.16.21-0.25-default. I am fine with 
troubleshooting the cluster not working, but completely hanging my system ?

[ocfs2console]
Version: 0.90
Label: Nas3
UUID: Long #.
Maximum Nodes: 4
Cluster Size: 16K
Block Size: 4k

[/etc/ocfs2/cluster.conf]
node:
	ip_port = 
	ip_address = 192.168.123.198
	number = 0
	name = desk1
	cluster = ocfs2

node:
	ip_port = 
	ip_address = 192.168.123.199
	number = 1
	name = desk2
	cluster = ocfs2

cluster:
	node_count = 2
	name = ocfs2

[fdisk]
Disk /dev/ndas-00500435:0: 164.6 GB, 164694458368 bytes
255 heads, 63 sectors/track, 20022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

                 Device Boot    Start      End     Blocks   Id  System
/dev/ndas-00500425:0p1              1     1305   10482381    c  W95 FAT32 (LBA)
/dev/ndas-00500425:0p2           1306    11397   81063990   83  Linux
/dev/ndas-00500425:0p3          11398    20022   69280312+  83  Linux



Re: [Ocfs2-users] Getting Started with ocfs2

2006-10-11 Thread Sunil Mushran


Martin J. Evans wrote:

fine but on selecting cluster/configure nodes I still get a dialogue
saying "Could not query the state of the cluster stack. This must be
resolved before any OCFS2 filesystem can be mounted."


Could be because the script is installed as o2cb and not o2cb.init.


Fedora Core release 5 (Bordeaux)

with ocfs2 tools installed by downloading the source and the usual
configure and make (version ocfs2-tools-1.2.1.tar.gz) because the rpms I
saw seemed ages (years) out of date.

ocfs2-tools 1.2.1 was the last one released. We are working on releasing 
tools 1.2.2.



I'm new to this so I may be missing a lot but following the instructions
in the user guide did not get me to this point:

1. the ocfs2console does not run without setting my PYTHONPATH first - I
don't know why.
2. the ocfs2console does not seem to create the /etc/ocfs2/cluster.conf
(for me anyway).
3. if you install the ocfs2 tools from the rpm it creates a minor
cluster.conf which ocfs2console does not seem to like since you can't
add nodes. If you install from the source code you don't even get a
cluster.conf.

I am willing to accept I've missed things but what were they? Does
everyone go through this?
  

I would hope not.


Re: [Ocfs2-users] out of memory... doing heavy IO on ocfs2 is wasting (low) memory?!

2006-10-11 Thread Sunil Mushran


Still in testing. It is a larger patch than normal and thus requires
more time/effort. Once we are comfortable with it, we will look into
releasing the patch for others to test before releasing 1.2.4.

Jonah H. Harris wrote:

What's the status on this?  I've researched Bugzilla, SVN, and the
lists and haven't seen any mention of it being fixed yet.

Kurt or Sunil, do you have a patch available that I could try?
Otherwise, what's the Bugzilla ID so I can follow its progress?  Any
help you can give would be appreciated.  Thanks!

-Jonah


[Ocfs2-users] disk heartbeat timeout poll

2006-10-11 Thread Sunil Mushran


Thanks for all the replies in the previous usage poll.

One of the chief concerns expressed was the (very) low default disk
heartbeat timeout setting. Well, we want to bump it up but to what?

Here are some questions, the answers to which will help us determine
that value.

1. What is your disk heartbeat timeout? If you are unsure,
cat /etc/sysconfig/o2cb.

2. What is your shared disk setup like? Fiber Channel, iscsi, AoE, etc.
Provide as much detail as you can.

3. Are you using some sort of multipathing? If so, provide details.

4. What is the cluster used for? Oracle database, mailserver, etc.

5. How many nodes in your cluster?

6. Any other relevant information?
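
For anyone answering question 1, here is a small sketch of how the value
stored in /etc/sysconfig/o2cb relates to a timeout in seconds. It assumes
the O2CB_HEARTBEAT_THRESHOLD variable and the two-second heartbeat
interval described in the OCFS2 FAQ; the function names are illustrative
only.

```python
# Assumption: one heartbeat iteration every 2 seconds, and
# threshold = timeout / 2 + 1, as described in the OCFS2 FAQ.
HEARTBEAT_INTERVAL_SECONDS = 2


def timeout_seconds(threshold):
    """Disk heartbeat timeout implied by an O2CB_HEARTBEAT_THRESHOLD value."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECONDS


def threshold_for(timeout):
    """O2CB_HEARTBEAT_THRESHOLD needed for a timeout given in seconds."""
    return timeout // HEARTBEAT_INTERVAL_SECONDS + 1


if __name__ == "__main__":
    print(timeout_seconds(7))   # threshold 7 corresponds to 12 seconds
    print(threshold_for(60))    # a 60 second timeout needs threshold 31
```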

Again, feel free to mail me directly.

Thanks
Sunil


Re: [Ocfs2-users] SUSE Patches

2006-10-20 Thread Sunil Mushran


Ping Novell. They issue interim PTF SLES kernels with the required fix(es)
to help users tide over until the formal release.

Needless to add, you need to have Novell Support.

Andy Kipp wrote:

Hello all,

I am running SLES9 with the latest kernel patches (2.6.5-7.282-bigsmp)
and ocfs2 version (1.2.1-SLES), I was wondering if anyone had any info
on if the latest ocfs2 will be included in SLES anytime soon. Or how to
get the patches from Novell. Thanks in advance! 


- Andy


Andy Kipp
Network Administrator
Velcro USA Inc.
406 Brown Ave. 
Manchester, NH 03103

Phone: (603) 222-4844
Email: [EMAIL PROTECTED]


Re: [Ocfs2-users] RHEL 4 hotfix RPMs?

2006-10-23 Thread Sunil Mushran


# ./configure --with-kernel=/usr/src/kernels/2.6.9-42.X.EL-smp-i686/
# make rhel4_2.6.9-42.X.EL_rpm

The rpms will be in the rpmdir as specified in ~/.rpmmacros.
~$ cat .rpmmacros
%_topdir    /rpmbuild/user
%_tmppath   /rpmbuild/user/tmp
%_sourcedir /rpmbuild/user/SOURCES
%_specdir   /rpmbuild/user/SPECS
%_srcrpmdir /rpmbuild/user/SRPMS
%_rpmdir    /rpmbuild/user/RPMS
%_builddir  /rpmbuild/user/BUILD

Brian Long wrote:

Would it be possible to post the src.rpm used to build the RHEL 4 binary
RPMs for OCFS2 kernel modules?  Or could you explain how to easily build
the ocfs2 kernel modules for a Red Hat hotfix kernel?

I downloaded and extracted the ocfs2-1.2.3 tarball and ran ./configure
with the defaults. It found the hotfix -devel RPM installed.  When I run
make, it compiles ocfs2 properly, but it does not create RPMs for me
to install.

I found the ocfs2.spec-generic in the vendor/rhel4 directory, but how
can I easily use it to build RPMs for my hotfix kernel?

Thanks.

/Brian/
  



Re: [Ocfs2-users] 1.2.2 dump issue

2006-10-25 Thread Sunil Mushran


As the ocfs2 home page suggests, when building 1.2.x against mainline
2.6.14 and above, specify GENERIC_DELETE_INODE_NOT_TRUNCATES=1.

Peter Larsen wrote:

I'm running 1.2.2 here - compiled from source, and while I can read
files, trying to delete a file on my OCFS2 volume produces the following:

[EMAIL PROTECTED] orcl]# rm testing
rm: remove regular file `testing'? yes
Segmentation fault
[EMAIL PROTECTED] orcl]#
Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000] [ cut here ]

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000] kernel BUG at fs/inode.c:253!

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000] invalid opcode:  [#1]

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000] CPU:0

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000] EIP is at clear_inode+0x27/0x142

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000] eax: 09c4   ebx: f5b56bdc   ecx:
   edx: f5b56d28

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000] esi: fc6d115b   edi: f5b56bdc   ebp:
f5ac2f58   esp: f5ac2ebc

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000] ds: 007b   es: 007b   ss: 0068

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000] Process rm (pid: 5583,
threadinfo=f5ac2000 task=f77eaab0)

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000] Stack: 0fc6d115b f5b56bdc fc6d115b
fc6d11a5 0001   

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000]f1eb9240 00200246 f5b56bdc
0001 f58f8c9c   

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000] f5b56bdc fc6d115b
c016ea67 f5b56bdc f5b56980 fc6d269d f228a6b8

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000] Call Trace:

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000]  [fc6d115b]
ocfs2_delete_inode+0x0/0x56b [ocfs2]

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000]  [fc6d115b]
ocfs2_delete_inode+0x0/0x56b [ocfs2]

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000]  [fc6d11a5]
ocfs2_delete_inode+0x4a/0x56b [ocfs2]

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000]  [fc6d115b]
ocfs2_delete_inode+0x0/0x56b [ocfs2]

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000]  [c016ea67]
generic_delete_inode+0x9e/0x13d

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000]  [fc6d269d]
ocfs2_drop_inode+0x5d/0x195 [ocfs2]

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000]  [c0306f0b]
__mutex_unlock_slowpath+0x93/0x200

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000]  [c01c87b1] 
_atomic_dec_and_lock+0xd/0x3c


Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000]  [c016ecae] iput+0x53/0x67

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000]  [c0165dca] do_unlinkat+0xba/0xf6

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000]  [c01041de] do_IRQ+0x53/0x85

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000]  [c0156434] filp_close+0x33/0x60

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000]  [c0102c03] sysenter_past_esp+0x54/0x75

Message from [EMAIL PROTECTED] at Tue Oct 24 18:45:17 2006 ...
ora02 kernel: [17180140.004000] Code: c0 01 5b c3 56 ba f9 00 00 00 53
89 c3 b8 cf 41 32 c0 83 ec 04 e8 21 99 fa ff 89 d8 e8 50 ad fe ff 8b 83
28 01 00 00 85 c0 74 08 0f 0b fd 00 cf 41 32 c0 8b 83 ac 01 00 00 a8
10 75 08 0f 0b ff

This is the module information:
[EMAIL PROTECTED] orcl]# modinfo ocfs2
filename:   /lib/modules/2.6.16.9/extra/ocfs2/ocfs2.ko
author: Oracle
license:GPL
description:OCFS2 1.2.2 Tue Jul  4 22:18:34 EDT 2006 (build
89ef9a0a0785d11d426e7842446d505b)
version:1.2.2
vermagic:   2.6.16.9 PENTIUM4 REGPARM 4KSTACKS gcc-3.4
depends:ocfs2_nodemanager,ocfs2_dlm,jbd
srcversion: E4F740AE9E8176169DAB864

I see a 1.2.3 version has been released, and I'll try to see if that
makes any difference. But in the meantime, is this a known issue?

Btw, I have not received any messages on this list for a long time.

Re: [Ocfs2-users] lvm2 not cluster aware - okay, so how should Istripe my LUNs?

2006-10-25 Thread Sunil Mushran


Fabio Corazza wrote:

Last but not least.. a question for Sunil if he's gonna read this.. when
OCFS2 will support data-on-inode would we need to reformat the file
systems or will the new module be compatible with the 1.4 on-disk data?
  
I am envisioning a compat flag to be added on existing volumes using
tunefs.ocfs2. But it is hard to state anything with surety about a
feature that is not yet implemented. :)



Thanks team for the new OCFS2 tools by the way, now we can grow our file
systems. Yet a step forward.
  

One step at a time.



Re: [Ocfs2-users] OCFS2 Fencing and Locking MSA500 Array: Help

2006-10-25 Thread Sunil Mushran


Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7000250
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7000250

That's where the problem begins. The cciss driver is unable to complete
the I/Os, maybe due to a bus reset. Ping HP or whoever your contact is
for the MSA500.


You may get more information if you setup a netconsole server to catch the
stack dumps.

Deaderick, David (EDS) wrote:

I have a RedHat Enterprise Linux 4.0 two node cluster on HP ProLiant
ML350 Servers connected to an HP MSA500 with HP 532 SCSI adapters (cciss
driver).
The following list includes critical component versions:
ocfs2console-1.2.1-1                      Mon 28 Aug 2006 05:39:20 PM EDT
ocfs2-2.6.9-42.0.2.ELsmp-1.2.3-1          Mon 28 Aug 2006 05:39:19 PM EDT
ocfs2-2.6.9-42.0.2.ELhugemem-1.2.3-1      Mon 28 Aug 2006 05:39:18 PM EDT
ocfs2-2.6.9-42.0.2.EL-1.2.3-1             Mon 28 Aug 2006 05:39:17 PM EDT
ocfs2-tools-1.2.1-1                       Mon 28 Aug 2006 05:39:15 PM EDT
oracleasmlib-2.0.2-1                      Mon 28 Aug 2006 05:37:51 PM EDT
oracleasm-2.6.9-42.0.2.ELhugemem-2.0.3-1  Mon 28 Aug 2006 05:37:49 PM EDT
oracleasm-2.6.9-42.0.2.EL-2.0.3-1         Mon 28 Aug 2006 05:37:47 PM EDT
oracleasm-2.6.9-42.0.2.ELsmp-2.0.3-1      Mon 28 Aug 2006 05:37:45 PM EDT
oracleasm-support-2.0.3-1                 Mon 28 Aug 2006 05:37:44 PM EDT
kernel-hugemem-2.6.9-42.0.2.EL            Mon 28 Aug 2006 05:25:32 PM EDT
kernel-doc-2.6.9-42.0.2.EL                Mon 28 Aug 2006 05:25:29 PM EDT
kernel-hugemem-devel-2.6.9-42.0.2.EL      Mon 28 Aug 2006 05:25:07 PM EDT
kernel-smp-devel-2.6.9-42.0.2.EL          Mon 28 Aug 2006 05:21:45 PM EDT
kernel-smp-2.6.9-42.0.2.EL                Mon 28 Aug 2006 05:20:51 PM EDT
kernel-utils-2.4-13.1.83                  Mon 28 Aug 2006 05:20:48 PM EDT
kernel-devel-2.6.9-42.0.2.EL              Mon 28 Aug 2006 04:42:48 PM EDT
kernel-2.6.9-42.0.2.EL                    Mon 28 Aug 2006 04:42:37 PM EDT

Whenever a heavy load is on the I/O system (i.e. database full backups
using RMAN), the servers fence, reboot and cannot reconnect with the
MSA500.
We must power the servers and the MSA500 off and restart.

Where can I start troubleshooting this?

/var/log/messages: (Node 2)

Oct 11 05:16:56 vhaispora02 kernel: o2net: connection to node
vhaispora01 (num 0) at 192.168.1.1: has been idle for 10 seconds,
shutting it down.
Oct 11 05:16:56 vhaispora02 kernel: (0,0):o2net_idle_timer:1309 here are
some times that might help debug the situation: (tmr 1160558206.560358
now 1160558216.558300 dr 1160558206.560323 adv
1160558206.560375:1160558206.560379 func (0d6da305:504)
1160552001.561116:1160552001.561125)
Oct 11 05:16:56 vhaispora02 kernel: o2net: no longer connected to node
vhaispora01 (num 0) at 192.168.1.1:
Oct 11 05:16:59 vhaispora02 kernel: cciss0: unsolicited abort f7010e90
Oct 11 05:16:59 vhaispora02 kernel: cciss0: retrying f7010e90
.
.
.
Oct 11 05:17:18 vhaispora02 kernel: cciss0: f7010550 retried too many
times
Oct 11 05:17:18 vhaispora02 kernel: cciss0: unsolicited abort f70107a0
Oct 11 05:17:18 vhaispora02 kernel: cciss0: f70107a0 retried too many
times
Oct 11 05:17:18 vhaispora02 kernel: cciss0: unsolicited abort f70109f0
Oct 11 10:35:57 vhaispora02 syslogd 1.4.1: restart.
Oct 11 10:35:57 vhaispora02 syslog: syslogd startup succeeded
Oct 11 10:35:57 vhaispora02 kernel: klogd 1.4.1, log source = /proc/kmsg
started.
Oct 11 10:35:57 vhaispora02 kernel: Linux version 2.6.9-42.0.2.ELsmp
([EMAIL PROTECTED]) (gcc version 3.4.6 20060404
(Red Hat 3.4.6-3)) #1 SMP Thu Aug 17 18:00:32 EDT 2006

/var/log/messages (Node 1)
Oct 11 05:10:01 vhaispora01 crond(pam_unix)[14577]: session closed for
user root
Oct 11 05:14:25 vhaispora01 ntpd[3243]: synchronized to 10.4.31.254,
stratum 2
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7000250
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7000250
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f70004a0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f70004a0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f70006f0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f70006f0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7000940
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7000940
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7000b90
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7000b90
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7000de0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7000de0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7001030
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7001030
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7001280
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7001280
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f70014d0
Oct 11 05:15:28

Re: [Ocfs2-users] BUG: unable to handle kernel NULL pointer dereference

2006-10-27 Thread Sunil Mushran

Please file a bugzilla with the details provided. It is easier to manage
bugs that way.

Thanks

Christian Schlittchen wrote:

Thanks to synchronous writes on the log files I finally managed to get
a log of the regular panics we experience.

The setup is as follows: Three blades (IBM HS20) accessing a shared storage
on a fibre channel connected storage server (IBM DS4300). The storage is
used as a central mailstorage for about 35000 users, so it is pretty heavy
duty storage wise.

blade01 crashes every few days with a kernel panic. Unfortunately all
watchdogs we tried fail to reboot the machine, and setting
/proc/sys/kernel/panic and /proc/sys/kernel/panic_on_oops to non-zero
values doesn't help either. The machine still responds to pings, but
to nothing else. Even more unfortunately the file system on the other
blades starts to hang sometime after blade01 crashes.

Logging /proc/slabinfo showed a steady increase of the size-256 and size-32
number of objects and we thought the crashes might have something to do
with it. We then did a nightly umount/mount which reduced the values a
bit and which does seem to reduce the frequency of crashes slightly.

Nevertheless today we had a crash with rather low values of size-256 and
size-32:

From /proc/slabinfo, timestamped, a few seconds before the crash:

2006-10-27-06:20:01 size-256   92187 169605    256   15    1 : tunables  120   60    8 : slabdata  11307  11307      0
2006-10-27-06:20:01 size-32    94037 534942     32  113    1 : tunables  120   60    8 : slabdata   4734   4734      0
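
To put such numbers in perspective, the sketch below turns one timestamped
slabinfo line into a byte count. It assumes the 2.6-era column order
(name, active_objs, num_objs, objsize, objperslab, pagesperslab, then the
tunables and slabdata groups); the function name and the returned keys are
illustrative only.

```python
def parse_slab_line(line):
    """Parse one /proc/slabinfo data line, optionally prefixed by a timestamp.

    Assumes the column order described above; the groups are separated
    by " : " so colons inside a timestamp are left alone.
    """
    head, _tunables, slabdata = line.split(" : ")
    # The last six fields before the first " : " are the cache columns.
    name, active_objs, num_objs, objsize, _per_slab, _pages = head.split()[-6:]
    _label, active_slabs, num_slabs, _shared = slabdata.split()
    return {
        "name": name,
        "active_objs": int(active_objs),
        "num_objs": int(num_objs),
        "objsize": int(objsize),
        "num_slabs": int(num_slabs),
        "bytes_allocated": int(num_objs) * int(objsize),
    }


if __name__ == "__main__":
    sample = ("2006-10-27-06:20:01 size-256 92187 169605 256 15 1 "
              ": tunables 120 60 8 : slabdata 11307 11307 0")
    info = parse_slab_line(sample)
    print(info["name"], info["bytes_allocated"] // (1024 * 1024), "MiB")
```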

The kern.log shows:

Oct 27 06:20:11 blade01 kernel: BUG: unable to handle kernel NULL pointer 
dereference at virtual address 0004
Oct 27 06:20:11 blade01 kernel:  printing eip:
Oct 27 06:20:11 blade01 kernel: f92b9431
Oct 27 06:20:11 blade01 kernel: *pde = 
Oct 27 06:20:11 blade01 kernel: Oops: 0002 [#1]
Oct 27 06:20:11 blade01 kernel: SMP 
Oct 27 06:20:11 blade01 kernel: Modules linked in: i6300esb ocfs2 xt_state ip_conntrack xt_limit ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager md_mod dm_snapshot dm_mirror dm_mod mptctl qla2xxx i2c_i801 firmware_class i2c_core scsi_transport_fc rtc

Oct 27 06:20:11 blade01 kernel: CPU:1
Oct 27 06:20:11 blade01 kernel: EIP:0060:[f92b9431]Not tainted VLI
Oct 27 06:20:11 blade01 kernel: EFLAGS: 00010286   (2.6.18 #1) 
Oct 27 06:20:11 blade01 kernel: EIP is at dlm_add_migration_mle+0x1f6/0x30a [ocfs2_dlm]

Oct 27 06:20:11 blade01 kernel: eax:    ebx: d61e4c00   ecx: c4ce5988   
edx: 
Oct 27 06:20:11 blade01 kernel: esi: f7531de4   edi: c4ce5980   ebp: e1873080   
esp: f7531d6c
Oct 27 06:20:11 blade01 kernel: ds: 007b   es: 007b   ss: 0068
Oct 27 06:20:11 blade01 kernel: Process o2net (pid: 1698, ti=f753 
task=c215b560 task.ti=f753)
Oct 27 06:20:11 blade01 kernel: Stack:  c0327a2c f7531d88 e6805a80 f7531e6c 0048 0040 d61e4c00 
Oct 27 06:20:11 blade01 kernel:d899a020  0001  0102  d899a021 004d 
Oct 27 06:20:11 blade01 kernel:c4ce5980  d61e4c00 fff4 f92bb927 f7531de4 d899a020 001f 
Oct 27 06:20:11 blade01 kernel: Call Trace:

Oct 27 06:20:11 blade01 kernel:  [c0327a2c] sock_recvmsg+0xe9/0x10b
Oct 27 06:20:11 blade01 kernel:  [f92bb927] 
dlm_migrate_request_handler+0x17b/0x231 [ocfs2_dlm]
Oct 27 06:20:11 blade01 kernel:  [f9256762] o2net_process_message+0x46e/0x626 
[ocfs2_nodemanager]
Oct 27 06:20:11 blade01 kernel:  [c0120312] __do_softirq+0x73/0xdf
Oct 27 06:20:11 blade01 kernel:  [f9256057] o2net_recv_tcp_msg+0x6b/0x7e 
[ocfs2_nodemanager]
Oct 27 06:20:11 blade01 kernel:  [c0114142] find_busiest_group+0x129/0x4f9
Oct 27 06:20:11 blade01 kernel:  [f925819e] o2net_rx_until_empty+0x1e6/0x6b9 
[ocfs2_nodemanager]
Oct 27 06:20:11 blade01 kernel:  [c011619f] __wake_up+0x32/0x43
Oct 27 06:20:11 blade01 kernel:  [c012af5b] run_workqueue+0x73/0xe1
Oct 27 06:20:11 blade01 kernel:  [f9257fb8] o2net_rx_until_empty+0x0/0x6b9 
[ocfs2_nodemanager]
Oct 27 06:20:11 blade01 kernel:  [c012b710] worker_thread+0x143/0x15f
Oct 27 06:20:11 blade01 kernel:  [c011563d] default_wake_function+0x0/0x15
Oct 27 06:20:11 blade01 kernel:  [c012b5cd] worker_thread+0x0/0x15f
Oct 27 06:20:11 blade01 kernel:  [c012e151] kthread+0xfc/0x100
Oct 27 06:20:11 blade01 kernel:  [c012e055] kthread+0x0/0x100
Oct 27 06:20:11 blade01 kernel:  [c0100d95] kernel_thread_helper+0x5/0xb
Oct 27 06:20:11 blade01 kernel: Code: 98 0a 00 00 c7 44 24 0c 62 81 2c f9 89 54 24 08 89 44 24 04 c7 04 24 80 06 2d f9 e8 85 29 e6 c6 e9 57 fe ff ff 8b 57 08 8b 41 04 89 42 04 89 10 89 4f 08 89 49 04 eb 9c f7 05 a0 2b 26 f9 00 09 
Oct 27 06:20:11 blade01 kernel: EIP: [f92b9431] dlm_add_migration_mle+0x1f6/0x30a [ocfs2_dlm] SS:ESP 0068:f7531d6c


This is with a vanilla 2.6.18 kernel from kernel.org. There were no
suspicious messages in the logs before the crash.


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com

Re: [Ocfs2-users] Unexpected reboot / crash

2006-10-27 Thread Sunil Mushran

The first issue could be because you don't have ocfs2-tools 1.2.2. The
earlier version was missing a line in the ocfs2 init script.

Rafal Maliszewski wrote:


Hi guys

I installed ocfs2 on 4 nodes (Red Hat 4u3) on shared FC devices (EMC
storage).


So I've noticed several problems:

1. When I restart the first node, it shuts down so slowly that I have to
power off the machine. What is the problem: heartbeat timeout, dlm?


2. What will happen when I unplug the network cable (for the
interconnect)? What should I do: unmount the ocfs2 volume on each node,
stop the ocfs2 service?


regards




Re: [Ocfs2-users] Interesting Error

2006-10-30 Thread Sunil Mushran


Which version of OCFS2?
Did you run fsck.ocfs2 -f on that device?

Do:
# echo "stat <6518860>" | debugfs.ocfs2 -n /dev/sdX > /tmp/ext.out
Email ext.out.

Andy Kipp wrote:

Anybody have any idea what this error involves? Or how to resolve it?


Oct 30 05:11:24 groupwise-1-mht kernel:
(8494,0):ocfs2_extent_map_find_leaf:287 ERROR: status = -53
Oct 30 05:11:24 groupwise-1-mht kernel: OCFS2: ERROR (device dm-0):
ocfs2_extent_map_find_leaf: Extent 29 at e_blkno 1973744 of inode
6518860 goes past ip_clusters of 441
Oct 30 05:11:24 groupwise-1-mht kernel:
Oct 30 05:11:24 groupwise-1-mht kernel: File system is now read-only
due to the potential of on-disk corruption. Please run fsck.ocfs2 once
the file system is unmounted.
Oct 30 05:11:24 groupwise-1-mht kernel:
(8494,0):ocfs2_extent_map_lookup_read:383 ERROR: status = -53
Oct 30 05:11:24 groupwise-1-mht kernel:
(8494,0):ocfs2_extent_map_get_blocks:858 ERROR: status = -53
Oct 30 05:11:24 groupwise-1-mht kernel: (8494,0):ocfs2_get_block:171
ERROR: Error -53 from get_blocks(0xf3c2d608, 0, 1, 0, NULL)

- Andy


Andy Kipp
Network Administrator
Velcro USA Inc.
406 Brown Ave. 
Manchester, NH 03103

Phone: (603) 222-4844
Email: [EMAIL PROTECTED]


Re: [Ocfs2-users] ocfs2 error messages

2006-10-31 Thread Sunil Mushran


Are you using NFS by any chance? I am looking into bug#790
that also encounters the same error (ESTALE).

Matthew Flusche wrote:


I received the following error messages in the system logs.  Is this
anything to be concerned with?

kernel: (4074,0):ocfs2_populate_inode:234 ERROR: Invalid dinode:
i_ino=1293597, i_blkno=1293597, signature = INODE01, flags = 0x0
kernel: (4074,0):ocfs2_read_locked_inode:389 ERROR: populate inode
failed! i_blkno=1293597, i_ino=1293597
kernel: (4074,0):ocfs2_iget:131 ERROR: status = -116
kernel: (4074,0):ocfs2_iget:141 ERROR: status = -116
kernel: (4074,0):ocfs2_get_dentry:63 ERROR: status = -116
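
The status values in these kernel messages are negated Linux errno numbers, which is how -116 maps to the ESTALE mentioned above. A minimal sketch for decoding them follows. The lookup table assumes the generic Linux errno numbering and lists only values that appear in this archive; the helper name decode_status is made up for illustration and is not part of any ocfs2 tool.

```python
import re

# Assumption: generic Linux errno numbering (asm-generic); only the values
# that show up in these threads are listed.
LINUX_ERRNO = {
    5: "EIO",         # I/O error
    53: "EBADR",      # Invalid request descriptor
    107: "ENOTCONN",  # Transport endpoint is not connected
    116: "ESTALE",    # Stale file handle
}

# Matches "status = -116" as well as "Error -53" / "IO Error -5".
STATUS_RE = re.compile(r"(?:status\s*=\s*|Error\s+)-(\d+)")


def decode_status(line):
    """Return (number, name) for the first negative status in a log line, or None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    number = int(match.group(1))
    return number, LINUX_ERRNO.get(number, "unknown")


if __name__ == "__main__":
    print(decode_status("kernel: (4074,0):ocfs2_iget:131 ERROR: status = -116"))
```

Unlisted numbers come back as "unknown" rather than being guessed.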

 

This is a three node cluster, no other error messages on any of the
other nodes.

System Information
RHEL 4U4 2.6.9-42.0.2 kernel
ocfs2console-1.2.1-1
ocfs2-tools-debuginfo-1.2.1-1
ocfs2-2.6.9-42.0.2.ELhugemem-1.2.3-1
ocfs2-tools-1.2.1-1

Thanks,

Matt



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Interesting Error

2006-10-31 Thread Sunil Mushran


Replace sdX with the device on which the ocfs2 fs exists. You can use
mount | grep ocfs2 to find that volume.

If the inode on disk is good, one explanation for the issue could be the
lvb bug which was fixed in 1.2.2. Ping Novell to get a PTF kernel with
ocfs2 1.2.3.

Andy Kipp wrote:

> Which version of OCFS2?

ocfs2 1.2.1 (sles)
ocfs2-tools 1.2.1 (sles)

> Did you run fsck.ocfs2 -f on that device?

Not yet. I wanted to see what the error was about before I take down a
production machine to do the fsck.

> Do:
> # echo "stat <6518860>" | debugfs.ocfs2 -n /dev/sdX > /tmp/ext.out
> Email ext.out.

This keeps returning saying that the device cannot be found. I have
tried running it with the following devices, with consideration for
multipathing:
/dev/sdb
/dev/sdd
/dev/dm-0
/dev/disk/by-name/vol_groupwise_data

Am I missing the obvious?
Thanks for your help.



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] ocfs2 error messages

2006-10-31 Thread Sunil Mushran


So it is bug#790. It just may be a case of unnecessary error messages
for you. I am still investigating it.

Matthew Flusche wrote:

Yes, one of the clustered file systems is shared with nfs.

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Ocfs2 and low memory

2006-10-31 Thread Sunil Mushran


To monitor ocfs2 memory usage, do:

# cat /proc/slabinfo | egrep 'ocfs|dlm|size-256 |size-32 '
ocfs2_lock            16    226     16  226    1 : tunables  120   60    0 : slabdata      1      1      0
ocfs2_inode_cache     22     24   1152    3    1 : tunables   24   12    0 : slabdata      8      8      0
ocfs2_uptodate        28    119     32  119    1 : tunables  120   60    0 : slabdata      1      1      0
ocfs2_em_ent          10    183     64   61    1 : tunables  120   60    0 : slabdata      3      3      0
dlmfs_inode_cache      1      5    768    5    1 : tunables   54   27    0 : slabdata      1      1      0
dlm_mle_cache          0      0    384   10    1 : tunables   54   27    0 : slabdata      0      0      0
size-256           40245  40245    256   15    1 : tunables  120   60    0 : slabdata   2683   2683      0
size-32            41650  41650     32  119    1 : tunables  120   60    0 : slabdata    350    350      0


# cat /proc/fs/ocfs2_dlm/*/stat
local=26, remote=0, unknown=0
local=39963, remote=6, unknown=0

size-256/32 are generic slab caches but are also used by ocfs2dlm.
The ocfs2dlm impact on it can be detected by the second cat which lists
the number of locally mastered (local) locks which are currently not
freed until the volume is umounted. (The patch-fix is being tested.)
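
The two checks above can also be scripted. The sketch below is only an approximation: it estimates each cache's footprint as num_objs * objsize, which ignores per-slab overhead, and it assumes the slabinfo 2.x column order (name, active_objs, num_objs, objsize, ...). The function names are illustrative. The sample rows are the size-256 and ocfs2_inode_cache lines from the output above with the fields written out separately.

```python
def slab_usage(slabinfo_text):
    """Return {cache_name: approx_bytes} for the ocfs2/dlm related slab caches."""
    usage = {}
    for line in slabinfo_text.splitlines():
        fields = line.split()
        # Skip the version and column-header lines, whose second field is not a number.
        if len(fields) < 4 or not fields[1].isdigit():
            continue
        name = fields[0]
        if "ocfs" in name or "dlm" in name or name in ("size-256", "size-32"):
            num_objs, objsize = int(fields[2]), int(fields[3])
            usage[name] = num_objs * objsize
    return usage


def dlm_lock_counts(stat_line):
    """Parse a line such as 'local=39963, remote=6, unknown=0' into a dict of ints."""
    counts = {}
    for item in stat_line.split(","):
        key, value = item.split("=")
        counts[key.strip()] = int(value)
    return counts


if __name__ == "__main__":
    sample = (
        "size-256 40245 40245 256 15 1 : tunables 120 60 0 : slabdata 2683 2683 0\n"
        "ocfs2_inode_cache 22 24 1152 3 1 : tunables 24 12 0 : slabdata 8 8 0\n"
    )
    for name, size in sorted(slab_usage(sample).items()):
        print(name, size)
    print(dlm_lock_counts("local=39963, remote=6, unknown=0"))
```

On a live node the text would come from reading /proc/slabinfo and the /proc/fs/ocfs2_dlm/*/stat files instead of the sample strings.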

Rafal Maliszewski wrote:

Hi guys

I have a 4-node cluster with OCFS2.
From time to time Red Hat fires the oom-killer to kill a process using a
large amount of memory (for example tomcat).

My friends suppose that ocfs2 is consuming the memory.

So I have a question:
How can I check how much memory is occupied by ocfs processes?
Is it low memory?

regards




___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Newbie questions -- is OCFS2 what I even want?

2006-11-03 Thread Sunil Mushran


You are probably looking for a distributed file system. Check
out afs and/or v9fs.

Thad Beier wrote:

Dear Sirs and Madams,

I run a small visual effects production company, Hammerhead Productions.

We'd like to have an easily extensible inexpensive relatively
high-performance storage network using open-source components.  I was
hoping that OCFS2 would be that system.

I have a half-dozen 2 TB fileservers I'd like the rest of the network to
see as a single 12 TB disk, with the aggregate throughput of the six
servers serving some 50 compute nodes on the network.

Is this what OCFS2 is for, or not?  My guess is that it isn't, but I'm
having a hard time parsing the documentation.  If it isn't what OCFS2 is
for, what am I looking for?

Thanks,

Thad Beier

--
Thad Beier
Hammerhead Productions
[EMAIL PROTECTED]


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] about 2 nodes enviroment and metalink note 394827.1

2006-11-09 Thread Sunil Mushran


I would imagine you are using RHEL4. If so, upgrade the ocfs2-tools
to 1.2.2. The previous version of the ocfs2 init script did not always
umount ocfs2 volumes on clean shutdowns leading to this problem.

[EMAIL PROTECTED] wrote:

Hi to all:

In a 2-node environment I've 'suffered' the 'reboot 1st node hangs 2nd
one' problem, as described in metalink note 394827.1.

The note says that this occurs when the interconnect fails. I understand
that if the interconnect fails, the idea is that node 1 stays up and
running and node 2 'kills' itself to avoid split-brain.

When the 1st node reboots (planned reboot), does ocfs2 think that the
interconnect has failed? If so, the cluster is condemned to die, because
node 1 is rebooting and node 2 kills itself, isn't it?

In a well-known 2-node environment, is there no message along the lines
of '2nd node, I'm rebooting, don't panic and stay tuned'? :)

Any tip to avoid this behaviour?

I think that one way (not optimal in any way) could be adding another
node, only for ocfs2, to help the second node think that it is in the
majority group when the 1st node reboots...


Regards and TIA

D. 





___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] OCFS2 Block / Clustersize with Oracle 10gR2

2006-11-09 Thread Sunil Mushran


http://www.dominicgiles.com/swingbench.html  ??

Brian Long wrote:

The DBA wrote a patented Java-based application which stress tests the
Oracle IO subsystem.  We use this to benchmark our IO subsystems
(compare SAN to NAS, etc).  This same benchmark is showing a maximum
sustained throughput of 3,400 IO/sec while the same benchmark with the
same data will max out at 7K+ IO/sec on RAW.

I'll grab the iostat data which we've kept over time and try to make
some sense of it before posting anything additional.

Thanks.

/Brian/

On Thu, 2006-11-09 at 10:20 -0800, Sunil Mushran wrote:
  

Why are you looking at iops and not the io thruput?

What is the actual io thruput? Please could you share some iostat
numbers with us. In all our tests, we've seen very little difference
in the actual io thruput between raw and ocfs2.

Clustersize will mainly affect the alloc/dealloc performance. It has very
little role to play in io performance. If anything, it could help coalesce
requests to reduce number of ios (read cdbs) required to do the task.
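
To make the iops versus thruput distinction concrete: throughput is the number of ios per second multiplied by the average request size, so fewer, larger requests can move about as much data as more, smaller ones. A small sketch of that arithmetic follows. The IO/sec figures are the ones from this thread, but the request sizes are hypothetical and are not measurements from Brian's systems.

```python
def throughput_mb_per_s(iops, avg_request_kb):
    """Throughput in MB/s from I/O operations per second and average request size in KB."""
    return iops * avg_request_kb / 1024.0


if __name__ == "__main__":
    # IO/sec values come from the thread; the 16 KB and 8 KB sizes are made up.
    print(throughput_mb_per_s(3400, 16))
    print(throughput_mb_per_s(7000, 8))
```

This is why the iostat numbers (request sizes and MB/s, not only IO/sec) are needed before comparing raw and ocfs2.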

Brian Long wrote:


Hello,

I followed the user's guide recommendation of 4K block size and 128K
cluster size.  I have 8 32GB OCFS2 filesystems mounted on two nodes.
The DBA has created a large tablespace with 4GB data files on each
filesystem.

The performance is only getting 3,400 IO/sec read/write combined.  If I
re-use the LUNs and give the DBA 4GB raw partitions, he can get over
7,000 IO/sec read/write combined (single-node) and over 11,000 IO/sec on
two nodes.

What's my next step to improve performance of OCFS2?  Since the DBA is
using 4GB datafiles, should I increase the cluster size to the max 1MB?

Thanks for any hints.

/Brian/
  
  


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] frozen ocfs2 filesystem under heavy webserver load

2006-11-13 Thread Sunil Mushran

None of these locks are busy. So they should not be the cause of the 
problem.


Start with the version of ocfs2. Also, which kernel?
What does top say? Is some process spinning?

Also, what does this stresstest entail?

Stephan Hendl wrote:

Hi,

I use a cluster of 4 nodes with ocfs2 as a webserver cluster. During a stress test,
after a couple of minutes the webserver processes are idling but the system
load is extremely high (about 150...200) while the waits are very low. After that I cannot
interrupt the webserver processes anymore, and in some directories an ls -ls
does not return, so it seems that the file system has a problem. Only a reset of the
server solves the problem ;-((

In the debug mode I can find the following lines as an example:

Lockres: D0ccda0fcc1c07e  Mode: Exclusive
Flags: Initialized Attached
RO Holders: 0  EX Holders: 0
Pending Action: None  Pending Unlock Action: None
Requested Mode: Exclusive  Blocking Mode: Invalid

Lockres: M0ccda0fcc1c07e  Mode: Exclusive
Flags: Initialized Attached
RO Holders: 0  EX Holders: 0
Pending Action: None  Pending Unlock Action: None
Requested Mode: Exclusive  Blocking Mode: Invalid

Lockres: D0ccd9ffcc1c07d  Mode: Exclusive
Flags: Initialized Attached
RO Holders: 0  EX Holders: 0
Pending Action: None  Pending Unlock Action: None
Requested Mode: Exclusive  Blocking Mode: Invalid

Lockres: M0ccd9ffcc1c07d  Mode: Exclusive
Flags: Initialized Attached
RO Holders: 0  EX Holders: 0
Pending Action: None  Pending Unlock Action: None
Requested Mode: Exclusive  Blocking Mode: Invalid
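
A dump like the one above can be screened mechanically for busy locks. The sketch below is a rough filter, not the tool's own logic. It assumes a lock counts as busy when it has any holders, a pending action other than None, or a flag literally named Busy; the function names are illustrative.

```python
import re


def parse_lockres(dump):
    """Split an fs_locks-style dump into one dict per 'Lockres:' stanza."""
    locks = []
    for stanza in re.split(r"\n\s*\n", dump.strip()):
        name = re.search(r"Lockres:\s*(\S+)", stanza)
        if name is None:
            continue
        flags = re.search(r"Flags:\s*(.*)", stanza)
        locks.append({
            "name": name.group(1),
            "flags": flags.group(1).split() if flags else [],
            "ro_holders": int(re.search(r"RO Holders:\s*(\d+)", stanza).group(1)),
            "ex_holders": int(re.search(r"EX Holders:\s*(\d+)", stanza).group(1)),
            "pending": re.search(r"Pending Action:\s*(\S+)", stanza).group(1),
            "pending_unlock": re.search(r"Pending Unlock Action:\s*(\S+)", stanza).group(1),
        })
    return locks


def is_busy(lock):
    """Assumed definition of busy: holders, a pending action, or a Busy flag."""
    return (
        lock["ro_holders"] > 0
        or lock["ex_holders"] > 0
        or lock["pending"] != "None"
        or lock["pending_unlock"] != "None"
        or "Busy" in lock["flags"]
    )


if __name__ == "__main__":
    sample = """Lockres: D0ccda0fcc1c07e  Mode: Exclusive
Flags: Initialized Attached
RO Holders: 0  EX Holders: 0
Pending Action: None  Pending Unlock Action: None
Requested Mode: Exclusive  Blocking Mode: Invalid
"""
    for lock in parse_lockres(sample):
        print(lock["name"], "busy" if is_busy(lock) else "idle")
```

With the four stanzas quoted here, every lock comes out idle under that definition.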

Could it be that two servers want to write to the same file and under heavy
load the cluster processes cannot handle this?

Regards and thanks,
Stephan
  


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Ocfs2 errors on 3 node cluster

2006-11-14 Thread Sunil Mushran


It will be easier if you file a bug on oss.oracle.com/bugzilla with all
the details, such as the messages files from all nodes.

Why are you using 1.2.1? 1.2.3 has been out for a few months now.

Randy Ramsdell wrote:

Hi,

Maybe someone could elaborate on these recurring ocfs2 errors that
always result in a reboot of 1 or more systems.

Our setup:

3 node cluster
Ocfs2 v. 1.2.1
OpenSuse 10.1
SAN storage uses Iscsi for disk access.


Cluster settings:

# O2CB_ENABELED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=ocfs2

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=60
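
For reference, the threshold is a count of heartbeat iterations, not seconds. A minimal sketch of the conversion, assuming the 2-second heartbeat interval and the rule of thumb threshold = (timeout in seconds / 2) + 1 from the OCFS2 documentation; the function names are illustrative.

```python
HEARTBEAT_INTERVAL_SECS = 2  # assumption: o2hb writes its disk heartbeat every 2 seconds


def timeout_secs(threshold):
    """Seconds of missed disk heartbeats before a node is declared dead."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def threshold_for(timeout):
    """O2CB_HEARTBEAT_THRESHOLD value for a desired timeout in seconds."""
    return timeout // HEARTBEAT_INTERVAL_SECS + 1


if __name__ == "__main__":
    print(timeout_secs(60))    # the setting shown here
    print(threshold_for(120))  # threshold for a two-minute timeout
```

Under these assumptions a threshold of 60 corresponds to 118 seconds.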


Kernel line parameters:

elevator=deadline panic=5
I have tested with and without the deadline elevator to see if it helps.


The messages we receive are simply this:


Node 0

Nov  4 10:54:10 atl02010305 kernel: o2net: connection to node
atl02010310 (num 0) at 192.168.3.110: has been idle for 10 seconds,
shutting it down.
Nov  4 10:54:10 atl02010305 kernel: (0,0):o2net_idle_timer:1309 here are
some times that might help debug the situation: (tmr 1162655640.698739
now 1162655650.695937 dr 1162655640.698734 adv
1162655640.698739:1162655640.698739 func (ca3835ec:504)
1162654980.779007:1162654980.779011)
Nov  4 10:54:10 atl02010305 kernel: o2net: no longer connected to node
atl02010310 (num 0) at 192.168.3.110:
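
The timestamps in the o2net_idle_timer line can be used to confirm how long the connection sat idle. A minimal sketch, assuming tmr is the time the idle timer was last armed and now is the time it fired (the other fields are left uninterpreted); the function name is illustrative.

```python
import re
from decimal import Decimal

IDLE_RE = re.compile(r"tmr\s+(\d+\.\d+)\s+now\s+(\d+\.\d+)")


def idle_seconds(message):
    """Return now - tmr from an o2net_idle_timer message, or None if not found."""
    # The message is often wrapped across lines, so collapse whitespace first.
    match = IDLE_RE.search(" ".join(message.split()))
    if match is None:
        return None
    return Decimal(match.group(2)) - Decimal(match.group(1))


if __name__ == "__main__":
    sample = """some times that might help debug the situation: (tmr 1162655640.698739
now 1162655650.695937 dr 1162655640.698734 adv
1162655640.698739:1162655640.698739 func (ca3835ec:504)
1162654980.779007:1162654980.779011)"""
    print(idle_seconds(sample))
```

For the node 0 message the difference is just under 10 seconds, matching the "idle for 10 seconds" line.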


And the complementary

Node 1
Nov  4 10:54:11 atl02010310 kernel: o2net: connection to node
atl02010305 (num 1) at 192.168.3.105: has been idle for 10 seconds,
shutting it down.
Nov  4 10:54:11 atl02010310 kernel: (32479,0):o2net_idle_timer:1309 here
are some times that might help debug the situation: (tmr
1162655640.698521 now 1162655650.701661 dr 1162655650.695829 adv
1162655640.698525:1162655640.698525 func (ca3835ec:505)
1162654980.778881:1162654980.778886)
Nov  4 10:54:11 atl02010310 kernel: o2net: no longer connected to node
atl02010305 (num 1) at 192.168.3.105:


This showed up shortly after and repeated for hours:


Nov  4 11:00:00 atl02010310 kernel:
(32540,1):dlm_send_remote_convert_request:398 ERROR: status = -107
Nov  4 11:00:00 atl02010310 kernel:
(32540,1):dlm_wait_for_node_death:371 32E007178FA24E87B45ECDDE6F7D5D52:
waiting 5000ms for notification of death of node 1
Nov  4 11:00:04 atl02010310 sshd[5242]: Accepted publickey for nagios
from 192.168.3.102 port 44292 ssh2


Node 3

saw nothing.


So I wonder why neither node rebooted from a kernel panic? Or what
happened, in general.

Weren't they supposed to fence etc..?

Randy Ramsdell


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Bad magic number in inode

2006-11-15 Thread Sunil Mushran


The quick detect just looks for the superblock which is in the third
block of the device. The full detect looks up the superblock and then
the system directory. In your case it fails to locate the latter.

This is one of the quirks when using an unpartitioned disk and later
partitioning it. The partitioning does not clear all the header blocks,
which include the fs superblock.

Long story short... use any binary editor (bvi), open the volume
and search for the OCFSV2 string. Change that signature.
Hint: It will be the first string in the third block. If you had formatted
with 4k blocksize, it will be in the block starting at 8K; for 2k it will be
in the block starting at 4K... you get the drift.
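
The offset arithmetic in the hint can be checked before touching anything with an editor. The sketch below only reads; it does not change the signature. It assumes the signature sits at the very start of the third block, as described above. The function names are illustrative, and the demo runs against a scratch file rather than a real device.

```python
import os
import tempfile

SIGNATURE = b"OCFSV2"


def superblock_offset(block_size):
    """Byte offset of the third block (block index 2) for a given block size."""
    return 2 * block_size


def has_signature(path, block_size):
    """Read-only check for the OCFSV2 string at the start of the third block."""
    with open(path, "rb") as volume:
        volume.seek(superblock_offset(block_size))
        return volume.read(len(SIGNATURE)) == SIGNATURE


def make_demo_image(block_size):
    """Create a scratch file with the signature in place, for demonstration only."""
    handle, path = tempfile.mkstemp()
    with os.fdopen(handle, "wb") as image:
        image.write(b"\0" * superblock_offset(block_size))
        image.write(SIGNATURE)
    return path


if __name__ == "__main__":
    demo = make_demo_image(4096)
    try:
        print(superblock_offset(4096), has_signature(demo, 4096))
        print(superblock_offset(2048), has_signature(demo, 2048))
    finally:
        os.remove(demo)
```

Pointing has_signature at the whole-disk device (read access is enough) would show whether the stale signature is still where the quick detect looks for it.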

Marcel Savelkoul wrote:

Hi,

Pretty new here with SAN's and Oracle RAC.
I had set up everything but because I wanted to enlarge one of the
disks used I removed everything and started over and stumble upon the
following:

There is one disk defined on the SAN. This is /dev/sda.
I haven't done fdisk yet so I also don't have /dev/sda1 yet.

But if I now do the mounted.ocfs2 command I see the following:

# mounted.ocfs2 -d
Device     FS     UUID                                  Label
/dev/sda   ocfs2  6fe56000-97e6-4ea1-a302-29a8213c6e04  oradb

# mounted.ocfs2 -f
Device     FS     Nodes
/dev/sda   ocfs2  Unknown: Bad magic number in inode

This device is also listed when I check it with the ocfs2console.

I tried to open it with debug but then I get this:

# mount -t debugfs debugfs /debug
# echo fs_locks | debugfs.ocfs2 /dev/sda > /tmp/fslocks
debugfs.ocfs2 1.2.2
Could not open debug state for 6FE5600097E64EA1A30229A8213C6E04.
Perhaps that OCFS2 file system is not mounted?

If I now use fdisk to create a partition on the device /dev/sda I see the
following with the mounted.ocfs2 command:

# mounted.ocfs2 -f
Device     FS     Nodes
/dev/sdc   ocfs2  Unknown: Bad magic number in inode
/dev/sdc1  ocfs2  Not mounted

Now the /dev/sda isn't listed anymore in the ocfs2console but the /dev/sda1
is and after actually mounting the /dev/sdc1 I see:

# mounted.ocfs2 -f
Device     FS     Nodes
/dev/sda   ocfs2  Unknown: Bad magic number in inode
/dev/sda1  ocfs2  pub.host.com

Why do I keep seeing the /dev/sda device?

Best Regards,

Marcel

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem

2006-11-15 Thread Sunil Mushran


Again, create a bug on oss.oracle.com/bugzilla and upload
the messages files from both nodes. It is hard to state anything
with incomplete information.

[EMAIL PROTECTED] wrote:

I decided to rebuild this from scratch today and got the same result.

Two-node cluster; both boxes remain connected to the shared storage
throughout the tests.

I unplug the network connection from node 0 and get e1000 driver "Tx Unit
Hang" messages on the node 0 console.
The node 1 console displays "o2net_idle_timer:1309 here are some times to
help debug the situation" followed by additional output.
Node 1 sits for a while and eventually displays "o2quo_make_decision:143
error: fencing this node because it is connected to a half-quorum of one of
two nodes which doesn't include the lowest active node 0".
Node 0 replays node 1's journal; too bad it still isn't on the network.

This is in node 1's /var/log/messages after the reboot:

Nov 14 23:55:56 FTP02 kernel: o2net: connection to node FTP01.mydomain.net
(num 0) at 10.xxx.0.45: has been idle for 10 seconds, shutting it down.
Nov 14 23:55:56 FTP02 kernel: (0,0):o2net_idle_timer:1309 here are some
times that might help debug the situation: (tmr 1163570146.656474 now
1163570156.655334 dr 1163570146.656446 adv
1163570146.656476:1163570146.656478 func (3a33f0f8:505)
1163570057.403947:1163570057.403950)
Nov 14 23:55:56 FTP02 kernel: o2net: no longer connected to node
FTP01.mydomain.net (num 0) at 10.xxx.0.45:

I'm confused by this.  Shouldn't node 0 have eventually rebooted since it
lost network connectivity and node 1 replayed node 0's journal and kept
going?  As it is right now we are left with no IP reachable box.

If I do this same test but unplug node 1 instead of node 0, it works as it
should. Node 1 will fence and node 0 will replay the journal and stay
online.

Any input is greatly appreciated.

Thanks,

Colin Farley
Network Administrator
E-Care Contact Center Services
Phone:(204) 940-6244
Fax:(204) 940-7394


   
From: Sunil Mushran [EMAIL PROTECTED]
Sent: 11/13/2006 08:23 PM
To: [EMAIL PROTECTED]
cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem




Considering o2net only cares whether it is connected to the other node
or not, it should not make a difference whether one unplugs node 0 or
node 1.
The result should be the same. Node 1 should fence in both cases.
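
The rule behind "Node 1 should fence in both cases" can be sketched as follows. This is a simplified model of the decision described in the o2quo log message quoted in this thread, not the kernel code. It assumes each node knows the set of nodes it sees heartbeating on disk and the subset it can still reach over the network, both including itself.

```python
def should_fence(heartbeating, connected):
    """Decide whether this node fences itself.

    heartbeating: set of node numbers seen heartbeating on disk (including self).
    connected: set of node numbers reachable over the network (including self).
    """
    if not heartbeating:
        return False
    total = len(heartbeating)
    reachable = len(connected & heartbeating)
    lowest_reachable = min(heartbeating) in connected
    if total % 2 == 1:
        # Odd node count: survive only with a strict majority.
        return reachable < (total + 1) // 2
    half = total // 2
    if reachable < half:
        return True
    # Exactly half: the tie goes to the side holding the lowest node number.
    return reachable == half and not lowest_reachable


if __name__ == "__main__":
    # Two-node cluster with the interconnect down, whichever cable was pulled.
    print("node 0 fences:", should_fence({0, 1}, {0}))
    print("node 1 fences:", should_fence({0, 1}, {1}))
```

In a two-node cluster each side sees exactly half of the heartbeating nodes, so the tie-break always favors node 0 regardless of which end was unplugged.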

Do you see messages indicating that the node(s) have lost connectivity?
If so, could you share them.

It would be easiest if you could file a bug on oss.oracle.com/bugzilla with
the messages file and listing the course of events... as in, unplugged
cable on node 0 at time x, etc.

[EMAIL PROTECTED] wrote:
  

I'm testing a 2 node cluster in a VMWare ESX environment for use as a high
availability FTP server to support a CRM application.  Both nodes run
Unbreakable 2.0 x86_64.  They access a 300GB OCFS2 volume on an RDM LUN on
an HP EVA.  All disk connectivity is fine and haven't seen any problems
there.  The problem comes when doing some IP failover testing.  The IP
failover is done using UCARP so to test failover I tried unplugging one
nodes virtual network cable to see what happens.

If I unplug node 1 everything is fine, node 1 eventually panics and reboots
while node 0 chugs along fine.  The problem comes when unplugging node 0.
When node 0 loses network connectivity it does not panic and eventually
node 1 panics and reboots.  Is there a reason why the lower node does not
panic if it loses network connectivity?

Heartbeat thresholds are the same on each node at 31 and both nodes are set
to reboot on panic, node0 just never panics.  All software installed are
versions that come with Unbreakable 2.0.

I didn't do the config on these boxes so the first thing I'm going to do on
Tuesday when I work on this is rebuild both nodes from scratch but I
figured I would ask first to see if it was an easy question for someone on
the list to answer

Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem

2006-11-15 Thread Sunil Mushran


You are missing his point. He is not saying that fencing is the problem.
He is asking as to why the behavior differs between unplugging node 0
and node 1.

Alexei_Roudnev wrote:

It is not a bug; it is all by design.

The problem is that OCFSv2:
- can't support more than 1 interconnect link, so you always risk losing
the interconnect; in addition, to make things worse, it doesn't support a
serial interconnect;
- can't find a quorum in a 2-node configuration (it's not an ocfsv2 problem
but a general concern with any 2-node cluster), so all nodes lose quorum if
the network connection is lost;
- doesn't analyze FS activity, and reboots all nodes without quorum, except
node0, when the network connection is lost.

It can't be improved without supporting multiple interconnects + better
decisions about fencing (there is no use in fencing a node if it has no
outstanding IO on the cluster file system).

This is a well-known problem with OCFSv2. One solution is to add a 3rd node
and use interface bonding (be sure that the interface convergence time is
less than the o2cb timeout).


Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem

2006-11-15 Thread Sunil Mushran


Again, read his email.

Alexei_Roudnev wrote:

The behavior is not different: if you break the node1-node0 connection,
node1 will self-reboot in the current design.
It doesn't matter what exactly you unplug: the socket on node1, the socket
on node2, or the inter-switch connection if one is used.

Add node-3 and everything will change.

Re: [Ocfs2-users] re: o2hb_write_timeout:270 ERROR: Heartbeat write timeout

2006-11-22 Thread Sunil Mushran


As ocfs2 heartbeats on the same device, unplugging a different device on the
storage should not affect ocfs2 as long as the ios are completing. But the
logs indicate otherwise. HB ios are erroring out.

The o2net message is the tcp connect message. We will be providing a way
to configure that too.

Peter Santos wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Sunil,

after trying to chase this down, I think one of our sa's might have restarted
the storage without notifying anyone.

Similarly, today a disk that was not in use was re-initialized and caused
everything to come down. I don't know if this is an issue with ocfs2 or
(old_storage + our sa doing this incorrectly).

The idea was to re-initialize a disk that was not being used (sdc) and not
have it affect the ocfs2 storage (sdb).

After the re-initialization completed, I noticed that all 3 nodes weren't
working and this was what I found on dbo3

===
Nov 21 11:40:36 dbo3 kernel: o2net: connection to node dbo2 (num 1) at 192.168.134.141: has been idle for 10 seconds, shutting it down.
Nov 21 11:40:36 dbo3 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1164127226.293816 now 1164127236.291931 dr 1164127226.293797 adv 1164127226.293818:1164127226.293819 func (a77953f3:2) 1164124426.747626:1164124426.747628)
Nov 21 11:40:36 dbo3 kernel: o2net: no longer connected to node dbo2 (num 1) at 192.168.134.141:

Nov 21 11:41:11 dbo3 kernel: SCSI error : 1 0 0 0 return code = 0x1
Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 591502543
Nov 21 11:41:11 dbo3 kernel: SCSI error : 1 0 0 0 return code = 0x1
...
Nov 21 11:41:11 dbo3 kernel: SCSI error : 1 0 0 0 return code = 0x1
Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 591502568
Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
Nov 21 11:41:11 dbo3 kernel: SCSI error : 1 0 0 0 return code = 0x1
Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 1983
Nov 21 11:41:11 dbo3 kernel: (6614,0):o2hb_bio_end_io:332 ERROR: IO Error -5
Nov 21 11:41:11 dbo3 kernel: SCSI error : 1 0 0 0 return code = 0x1
Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 3921780
Nov 21 11:41:11 dbo3 kernel: (6614,0):o2hb_bio_end_io:332 ERROR: IO Error -5
Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
...
Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
Nov 21 11:41:11 dbo3 su: pam_unix2: session finished for user oracle, service su
Nov 21 11:41:11 dbo3 logger: Oracle CSSD failure 134.
Nov 21 11:45:07 dbo3 syslogd 1.4.1: restart.

I'm curious about the message
o2net: connection to node dbo2 (num 1) at 192.168.134.141: has been idle for 10 seconds, shutting it down.

I have increased my O2CB_HEARTBEAT_THRESHOLD to 61, but where is this message
getting 10 seconds from?
Also, this message is displayed because dbo2 was not able to check into the
heartbeat filesystem, right?
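
For what it's worth, the 10 seconds can be read back out of the o2net_idle_timer line quoted above. A minimal sketch in Python, assuming (the thread does not say so) that "tmr" is the time the idle timer was last armed and "now" is the time it fired; the function name is made up for illustration:

```python
import re
from decimal import Decimal

# The o2net_idle_timer line from the dbo3 log above, joined onto one line.
LINE = ("(0,1):o2net_idle_timer:1310 here are some times that might help debug "
        "the situation: (tmr 1164127226.293816 now 1164127236.291931 "
        "dr 1164127226.293797 adv 1164127226.293818:1164127226.293819 "
        "func (a77953f3:2) 1164124426.747626:1164124426.747628)")


def idle_gap_seconds(line):
    """Return now - tmr from an o2net_idle_timer log line, in seconds."""
    m = re.search(r"tmr (\d+\.\d+) now (\d+\.\d+)", line)
    if m is None:
        raise ValueError("no tmr/now fields found")
    tmr, now = (Decimal(x) for x in m.groups())  # Decimal keeps the microseconds exact
    return now - tmr


print(idle_gap_seconds(LINE))  # -> 9.998115
```

So the gap is the network idle timeout (about 10 secs), which is a separate knob from O2CB_HEARTBEAT_THRESHOLD.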

- -peter





Sunil Mushran wrote:
  

On nodes db01 and db03 hb timed-out at 17:12:49. However, the nodes
did not fully panic. As in, the network was shutdown but the hb thread
was still going strong for some reason.

Within 10 secs of that, by 17:12:59, db02 detected loss of network
connectivity with both nodes db01 and db03. However, it was still
seeing the nodes hb on disk and assumed that they were alive. As per
quorum rules, it panicked.

So the question is: what was happening on nodes db01 and db03 after 17:12:49?

Peter Santos wrote:
Folks,

I'm trying to piece together what happened during a recent event where our
3 node RAC cluster had problems. It appears that all 3 nodes restarted ..
which is likely to occur if all 3 nodes cannot communicate with the shared
ocfs2 storage.

I did find out from our SA, that this happened during the time he was
replacing a failed drive on the storage and the storage was in a degraded
mode.  I'm trying to understand if the 3 nodes had a difficult time accessing
the shared ocfs2 volume or was it a tcp connectivity issue. There is nobody
currently using the cluster .. so it should have been idle from a user
perspective.


prompt# cat /etc/fstab | grep ocfs2

/dev/sdb1  /ocfs2   ocfs2  _netdev,datavolume,nointr  0 0
/dev/sdb2  /backups ocfs2  _netdev,datavolume,nointr  0 0

we have 2 ocfs2 volumes.. one is for the voting and ocr files, while
the other is to be used as a
shared storage for backups of archivelog files etc.


/var/log/messages


NODE1 (dbo1

Re: [Ocfs2-users] Oracle 9i RAC on OCFS2

2006-11-27 Thread Sunil Mushran


Refer to CDSL (Context Dependent Symbolic Links) in the OCFS2 user's guide.

Marcel Savelkoul wrote:

Hi,

I'm setting up a 2-node Oracle 9i RAC on OCFS2.
But I have some problems with understanding how the shared Oracle_Home 
is being used.


For instance there is the *$ORACLE_HOME/oracm/admin/cmcfg.ora* file. 
The $ORACLE_HOME is on
the SAN so the 2 nodes have access to this file. How will the settings 
then be done per node?


Is there a guide/tutorial/howto for installing Oracle 9i RAC on OCFS2 
with shared Oracle_Home's?


Or is it better to just install it without shared Oracle_Home's?

Best Regards,

Marcel




Re: [Ocfs2-users] OCFS2 and berkeley database files

2006-12-05 Thread Sunil Mushran


You are on a very old release of OCFS2. The OCFS2 homepage and FAQ both
list a SLES9 kernel version newer than the one you are using.

But that may not be the reason for the error. My bet is that bdb is
attempting to create a shared writeable mmap that ocfs2 1.2 does not support.

[EMAIL PROTECTED] wrote:

Hello Forum,

when trying to install a mailserver on a shared SAN Device with ocfs2, I
realized some strange problems with Berkeley .db files:

Trying to install OpenLDAP:
Dec  1 16:16:56 server_1 slapd[18497]: bdb_db_init: Initializing BDB database
Dec  1 16:16:56 server_1 slapd[18498]: bdb(ou=users,dc=mycomp,dc=de): mmap: Invalid argument
Dec  1 16:16:56 server_1 slapd[18498]: bdb_db_open: dbenv_open failed: Invalid argument (22)
Dec  1 16:16:56 server_1 slapd[18498]: backend_startup: bi_db_open failed! (22)
Dec  1 16:16:56 server_1 slapd[18498]: bdb(ou=users,dc=mycomp,dc=de): DB_ENV->lock_id_free interface requires an environment configured for the locking subsystem
Dec  1 16:16:56 server_1 slapd[18498]: bdb(ou=users,dc=mycomp,dc=de): txn_checkpoint interface requires an environment configured for the transaction subsystem
Dec  1 16:16:56 server_1 slapd[18498]: bdb_db_destroy: txn_checkpoint failed: Invalid argument (22)
Dec  1 16:17:50 server_1 slapd[18525]: bdb_db_init: Initializing BDB database
Dec  1 16:17:50 server_1 slapd[18526]: bdb(ou=users,dc=mycomp,dc=de): mmap: Invalid argument
Dec  1 16:17:50 server_1 slapd[18526]: bdb_db_open: dbenv_open failed: Invalid argument (22)
Dec  1 16:17:50 server_1 slapd[18526]: backend_startup: bi_db_open failed! (22)
Dec  1 16:17:50 server_1 slapd[18526]: bdb(ou=users,dc=mycomp,dc=de): DB_ENV->lock_id_free interface requires an environment configured for the locking subsystem
Dec  1 16:17:50 server_1 slapd[18526]: bdb(ou=users,dc=mycomp,dc=de): txn_checkpoint interface requires an environment configured for the transaction subsystem



.

When installing on a standard ext3 - no problems.

When installing cyrus (initial start after creation of cyrus admin account):

Dec  5 04:15:01 server_1 ctl_mboxlist[29435]: IOERROR:
mapping /var/lib/imap/mailboxes.db file: Invalid argument
Dec  5 14:14:43 server_1 ctl_cyrusdb[32679]: IOERROR:
mapping /communication/var/lib/imap/mailboxes.db file: Invalid argument
Dec  5 14:14:43 server_1 idled[32680]: IOERROR:
mapping /communication/var/lib/imap/mailboxes.db file: Invalid argument
Dec  5 14:14:43 server_1 idled[32680]: failed to
mmap /communication/var/lib/imap/mailboxes.db file
Dec  5 14:14:43 server_1 pop3[32684]: IOERROR:
mapping /communication/var/lib/imap/mailboxes.db file: Invalid argument
Dec  5 14:14:43 server_1 pop3[32684]: Fatal error: failed to
mmap /communication/var/lib/imap/mailboxes.db file
Dec  5 14:14:43 server_1 pop3s[32685]: IOERROR:
mapping /communication/var/lib/imap/mailboxes.db file: Invalid argument
Dec  5 14:14:43 server_1 pop3s[32685]: Fatal error: failed to
mmap /communication/var/lib/imap/mailboxes.db file
Dec  5 14:14:43 server_1 pop3s[32691]: IOERROR:
mapping /communication/var/lib/imap/mailboxes.db file: Invalid argument
Dec  5 14:14:43 server_1 pop3s[32691]: Fatal error: failed to
mmap /communication/var/lib/imap/mailboxes.db file
Dec  5 14:14:43 server_1 pop3[32690]: IOERROR:
mapping /communication/var/lib/imap/mailboxes.db file: Invalid argument
Dec  5 14:14:43 server_1 pop3[32690]: Fatal error: failed to
mmap /communication/var/lib/imap/mailboxes.db file
Dec  5 14:14:43 server_1 pop3[32693]: IOERROR:
mapping /communication/var/lib/imap/mailboxes.db file: Invalid argument
Dec  5 14:14:43 server_1 pop3[32693]: Fatal error: failed to
mmap /communication/var/lib/imap/mailboxes.db file
Dec  5 14:14:43 server_1 pop3s[32694]: IOERROR:
mapping /communication/var/lib/imap/mailboxes.db file: Invalid argument
Dec  5 14:14:43 server_1 pop3s[32694]: Fatal error: failed to
mmap /communication/var/lib/imap/mailboxes.db file
Dec  5 14:14:43 server_1 pop3s[32695]: IOERROR:
mapping /communication/var/lib/imap/mailboxes.db file: Invalid argument
Dec  5 14:14:43 server_1 pop3s[32695]: Fatal error: failed to
mmap /communication/var/lib/imap/mailboxes.db file
Dec  5 14:14:43 server_1 pop3[32697]: IOERROR:
mapping /communication/var/lib/imap/mailboxes.db file: Invalid argument
Dec  5 14:14:43 server_1 pop3[32697]: Fatal error: failed to
mmap /communication/var/lib/imap/mailboxes.db file
Dec  5 14:14:43 server_1 pop3s[32698]: IOERROR:
mapping /communication/var/lib/imap/mailboxes.db file: Invalid argument
Dec  5 14:14:43 server_1 pop3s[32698]: Fatal error: failed to
mmap /communication/var/lib/imap/mailboxes.db file
Dec  5 14:14:43 server_1 pop3s[32701]: IOERROR:
mapping /communication/var/lib/imap/mailboxes.db file: Invalid argument
Dec  5 14:14:43 server_1 pop3s[32701]: Fatal error: failed to
mmap /communication/var/lib/imap/mailboxes.db file
Dec  5 14:14:43 server_1 sieve[32686]: IOERROR:
mapping /communication/var/lib/imap/mailboxes.db file: Invalid argument
Dec  5 14:14:43 server_1 sieve[32692]:

Re: [Ocfs2-users] Oracle Application Server 10.1.2.0.2 Install on OCFS2

2006-12-06 Thread Sunil Mushran


strace apache. That may provide us with some clues.

[EMAIL PROTECTED] wrote:

Hello all,
 
Has anyone installed Oracle Application Server 10.1.2.0.2 
Infrastructure tier including the preseeded 10.1.0.4 database (High 
Availability option otherwise known as a cold failover cluster) on 
OCFS2 where the ocfs2 device is only mounted on one node at a time?  I 
am trying to emulate Red Hat Cluster Suite on OCFS2 in an 
active/passive mode.
 
The software installs ok and the database runs.  However, it appears 
that Apache fails to start and no logs or error output is generated.  
A local disk installation is successful.
 
We are running Oracle's Enterprise Linux x86 with 1.2.3 ocfs2 module 
(included with release).
 
For Sunil and team, the Oracle Support SR is 6019530.994 if you care 
for further background.
 
Any help is appreciated.
 
Best regards,

Tony Orlando
MFG Systems
[EMAIL PROTECTED] mailto:[EMAIL PROTECTED] (work)
[EMAIL PROTECTED] mailto:[EMAIL PROTECTED] 
(home)



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
  


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] OCFS2 and berkeley database files

2006-12-06 Thread Sunil Mushran


ocfs2 supports private mmap r/w and shared mmap readonly.
Shared mmap writeable is the only piece missing. We should have that by 1.4.
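
To see which of those three cases a given mount accepts, something like the following could be run against a directory on it. This is only a sketch (Unix only; the function name and the scratch file name are made up), not an official test. Going by the statement above, on ocfs2 1.2 one would expect the "shared writable" entry to come back with an error name (the slapd log earlier in the thread shows "Invalid argument", i.e. EINVAL) and the other two to report "ok".

```python
import errno
import mmap
import os

# (label, mmap flags, protection bits) for the three cases named above.
MODES = (
    ("private read/write", mmap.MAP_PRIVATE, mmap.PROT_READ | mmap.PROT_WRITE),
    ("shared read-only", mmap.MAP_SHARED, mmap.PROT_READ),
    ("shared writable", mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE),
)


def probe_mmap_modes(directory):
    """Try each mapping mode on a scratch file; return {label: "ok" or errno name}."""
    path = os.path.join(directory, "mmap-probe.tmp")  # hypothetical scratch file
    results = {}
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # The mapped range has to exist in the file, so write one page first.
        os.write(fd, b"\0" * mmap.PAGESIZE)
        for label, flags, prot in MODES:
            try:
                mmap.mmap(fd, mmap.PAGESIZE, flags=flags, prot=prot).close()
                results[label] = "ok"
            except OSError as exc:
                results[label] = errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)
        os.unlink(path)
    return results
```

Calling probe_mmap_modes("/ocfs2") (the mount point is just an example) returns a small dict that can be compared between ext3 and ocfs2.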

Alexei_Roudnev wrote:

There was a clear answer as to WHY it did not work on OCFSv2:

- BerkeleyDB and LDAP use mmap on the files;
- OCFSv2 doesn't implement it (because it is not possible to do such mapping
in a cluster FS)
- So they don't work on OCFSv2

Am I correct?

- Original Message - 
From: [EMAIL PROTECTED]

To: Michael Wood [EMAIL PROTECTED]
Cc: ocfs2-users@oss.oracle.com
Sent: Wednesday, December 06, 2006 2:33 AM
Subject: Re: [Ocfs2-users] OCFS2 and berkeley database files


On Wed, 6.12.2006, 09:35, Michael Wood wrote:

  

Berkeley DB does have problems with certain filesystems (e.g.
NFS) so maybe this is a similar issue.  (Just a wild guess.)



Yes, I found the Oracle FAQ. So it is a structural problem with a simple
DB and then trying to open it more than once


  

OpenLDAP supports various different backends.  Maybe one of the
others will work better?



Perhaps, but my guess is that the functionality of a cluster filesystem
always involves access to every file, even if it is not connected to a
service. So I think it will not work if these files are not able to
handle more than one access at a time.

  

Cyrus also supports various different database types for its
databases.  Maybe try one of the others.  I have never used either
OpenLDAP or Cyrus on ocfs2, so I can't guarantee anything :)



Same as above

  


Server is a DELL Poweredge with EMC CLARiiON SAN
System SuSE SLES9 SP3+
-  2.6.5-7.244-smp Kernel with (... modinfo ocfs2):
license:        GPL
author:         Oracle
version:        1.1.7-SLES 5AF01E6455FC04917FE52FB
description:    OCFS2 1.1.7-SLES Tue Nov  1 14:45:27 PST 2005 (build sles)
depends:        ocfs2_nodemanager,ocfs2_dlm,jbd
supported:      yes
vermagic:       2.6.5-7.244-smp SMP gcc-3.3

  

[snip]


2.6.5-7.244 was released with SLES9 SP3.  You might want to
patch your box to 2.6.5-7.282 (which I think is the latest.)

The 7.282 version of the SLES9 SP3 kernel comes with ocfs2
version 1.2.1-SLES.



Yes, but this EMC multipath software installs some kernelmodules
(binaries, no sources) and they only match with the kernels in the list.
So I have to wait for the next release :(((

  

On the ocfs2 homepage it says:


SuSE Linux Enterprise Server 9: OCFS2 1.2.1 is bundled
with the SLES9 SP3 (2.6.5-7.257 and later) release. SLES9 users must
upgrade to the latest SP3 kernel (2.6.5-7.257 or later) and install
ocfs2-tools and ocfs2console packages. For more on OCFS2 bundling with
SLES9, refer to the Novell SLES9 section in the FAQ




See above

  

--
Michael Wood [EMAIL PROTECTED]








Re: [Ocfs2-users] re: is it possible for the o2cb stack to monitor multiple clusternames on the same box

2006-12-20 Thread Sunil Mushran


To expand on it, a cluster is just a grouping of nodes. The nodes
only actively work together when two or more mount the same
volume and thus join a common domain.

Say you add two test nodes in that cluster but ensure that they
do not mount the volumes being used by the rac cluster, they
will never be part of that domain.

Sunil Mushran wrote:

Currently it supports only one cluster.

Peter Santos wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Folks,
When I installed ocfs2 the first time and setup oracle to work with it, the
clustername defaulted to ocfs2.   We are testing adding new nodes, but we
would like to add new nodes to the o2cb cluster in a different clustername.

Do I need to do anything to keep that separate on the filesystem.  I just
want to make sure that when a user is testing adding/deleting nodes from a
cluster, it's not the ocfs2 because that's the one I'm using for RAC.

- -peter


Re: [Ocfs2-users] Problem installing OCFS 1.2.3

2007-01-04 Thread Sunil Mushran


depmod -a ?

Lin Shen (lshen) wrote:

Switched the kernel to 2.6.9-42.Elsmp, still got the same error.



[EMAIL PROTECTED] Desktop]# uname -a
Linux cfs2 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:27:17 EDT 2006 i686 i686
i386 GNU/Linux

  

-Original Message-
From: Sunil Mushran [mailto:[EMAIL PROTECTED] 
Sent: Thursday, January 04, 2007 12:08 PM

To: Lin Shen (lshen)
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problem installing OCFS 1.2.3

Refer to the FAQ. The module's kernel version should match 
the kernel version.
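
A quick way to check that, sketched in Python. It assumes the kernel-module package name has the shape ocfs2-KERNELVERSION-OCFS2VERSION-RELEASE.ARCH.rpm, which is what the package names quoted in this thread look like; the function names are made up, and it is not meant for the ocfs2-tools or ocfs2console packages.

```python
import platform
import re

# Assumed layout: ocfs2-<kernel version>-<ocfs2 version>-<release>.<arch>.rpm
_RPM_RE = re.compile(
    r"^ocfs2-(?P<kernel>.+)-(?P<version>[^-]+)-(?P<release>[^-]+)"
    r"\.(?P<arch>[^.]+)\.rpm$"
)


def kernel_of_module_rpm(rpm_name):
    """Return the kernel version string embedded in a module package name."""
    m = _RPM_RE.match(rpm_name)
    if m is None:
        raise ValueError("unrecognised package name: %r" % rpm_name)
    return m.group("kernel")


def module_matches_kernel(rpm_name, running_kernel=None):
    """True if the package was built for exactly the given (or running) kernel."""
    if running_kernel is None:
        running_kernel = platform.release()  # same string as `uname -r`
    return kernel_of_module_rpm(rpm_name) == running_kernel


rpm = "ocfs2-2.6.9-42.ELsmp-1.2.3-1.i686.rpm"
print(module_matches_kernel(rpm, "2.6.9-42.7.ELsmp"))  # -> False
print(module_matches_kernel(rpm, "2.6.9-42.ELsmp"))    # -> True
```

With the values from the original post in this thread (package built for 2.6.9-42.ELsmp, box running 2.6.9-42.7.ELsmp) the comparison fails, which is the mismatch being pointed out.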


Lin Shen (lshen) wrote:


Hi,

The kernel I'm using is: 


[EMAIL PROTECTED] Desktop]# uname -a
Linux cfs2 2.6.9-42.7.ELsmp #1 SMP Tue Sep 5 18:29:39 EDT 2006 i686 i686
i386 GNU/Linux

So I installed ocfs2-2.6.9-42.ELsmp-1.2.3-1.i686.rpm.

While trying to load the modules, I got the following error:

[EMAIL PROTECTED] Desktop]# /etc/init.d/o2cb load
Loading module configfs: Unable to load module configfs
Failed

Any ideas?

Thanks
lin


Re: [Ocfs2-users] Problem installing OCFS 1.2.3

2007-01-04 Thread Sunil Mushran


Lin Shen (lshen) wrote:

That worked, thanks.

I'm a newbie to OCFS2, so bear with me if some of my questions sound too
trivial. I couldn't find the answers either in FAQ or User's Guide.

1. Does OCFS2 need gnbd (or similar thing) to work with off-the-shelf PC
boxes like GFS does? If I have two nodes A and B, each has a partition.
How do I include both partitions into a single instance of OCFS2 file
system?

you need a shared disk. you can use iscsi.

2. Does OCFS2 support cross-node RAID?

no

3. How easy is it to port OCFS2 onto a different transport protocol such
as TIPC?

o2net code is pretty well contained and isolated. while we have discussed
tipc, not sure if we ever gave it a serious look.

lin  

  

-Original Message-
From: Sunil Mushran [mailto:[EMAIL PROTECTED] 
Sent: Thursday, January 04, 2007 1:21 PM

To: Lin Shen (lshen)
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problem installing OCFS 1.2.3

depmod -a ?

Lin Shen (lshen) wrote:


Switched the kernel to 2.6.9-42.Elsmp, still got the same error.



[EMAIL PROTECTED] Desktop]# uname -a
Linux cfs2 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:27:17 EDT 2006 i686 i686
i386 GNU/Linux

  
  

-Original Message-
From: Sunil Mushran [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 04, 2007 12:08 PM
To: Lin Shen (lshen)
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problem installing OCFS 1.2.3

Refer to the FAQ. The module's kernel version should match the kernel
version.

Lin Shen (lshen) wrote:

Hi,

The kernel I'm using is:

[EMAIL PROTECTED] Desktop]# uname -a
Linux cfs2 2.6.9-42.7.ELsmp #1 SMP Tue Sep 5 18:29:39 EDT 2006 i686 i686
i386 GNU/Linux

So I installed ocfs2-2.6.9-42.ELsmp-1.2.3-1.i686.rpm.

While trying to load the modules, I got the following error:

[EMAIL PROTECTED] Desktop]# /etc/init.d/o2cb load
Loading module configfs: Unable to load module configfs
Failed

Any ideas?

Thanks
lin


Re: [Ocfs2-users] Problem installing OCFS 1.2.3

2007-01-04 Thread Sunil Mushran


theoretically yes... but for practical usage go with at least iscsi

Lin Shen (lshen) wrote:

So w/o shared disk, is it possible to make OCFS2 to work by utilizing
GNBD or etc?

lin 

  

-Original Message-
From: Sunil Mushran [mailto:[EMAIL PROTECTED] 
Sent: Thursday, January 04, 2007 2:48 PM

To: Lin Shen (lshen)
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problem installing OCFS 1.2.3

Lin Shen (lshen) wrote:


That worked, thanks.

I'm a newbie to OCFS2, so bear with me if some of my questions sound
too trivial. I couldn't find the answers either in FAQ or User's Guide.

1. Does OCFS2 need gnbd (or similar thing) to work with off-the-shelf
PC boxes like GFS does? If I have two nodes A and B, each has a partition.
How do I include both partitions into a single instance of OCFS2 file
system?

you need a shared disk. you can use iscsi.

2. Does OCFS2 support cross-node RAID?

no

3. How easy is it to port OCFS2 onto a different transport protocol
such as TIPC?

o2net code is pretty well contained and isolated. while we
have discussed tipc, not sure if we ever gave it a serious look.



lin  

  
  

-Original Message-
From: Sunil Mushran [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 04, 2007 1:21 PM
To: Lin Shen (lshen)
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problem installing OCFS 1.2.3

depmod -a ?

Lin Shen (lshen) wrote:



Switched the kernel to 2.6.9-42.Elsmp, still got the same error.



[EMAIL PROTECTED] Desktop]# uname -a
Linux cfs2 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:27:17 EDT 2006 i686
i686
i386 GNU/Linux

  
  
  

-Original Message-
From: Sunil Mushran [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 04, 2007 12:08 PM
To: Lin Shen (lshen)
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problem installing OCFS 1.2.3

Refer to the FAQ. The module's kernel version should match the kernel
version.

Lin Shen (lshen) wrote:

Hi,

The kernel I'm using is:

[EMAIL PROTECTED] Desktop]# uname -a
Linux cfs2 2.6.9-42.7.ELsmp #1 SMP Tue Sep 5 18:29:39 EDT 2006 i686 i686
i386 GNU/Linux

So I installed ocfs2-2.6.9-42.ELsmp-1.2.3-1.i686.rpm.

While trying to load the modules, I got the following error:

[EMAIL PROTECTED] Desktop]# /etc/init.d/o2cb load
Loading module configfs: Unable to load module configfs
Failed

Any ideas?

Thanks
lin


Re: [Ocfs2-users] update on o2net_idle_timer

2007-01-04 Thread Sunil Mushran


That and also we've seen similar issues with Broadcom TG3 drivers. We use
Intel E1000 mostly and thus did not experience the same issue.

As far as the configurable net timeouts go, the patch was added into
mainline on Dec 4th. So it will be available with ocfs2 1.4. We are still
seeing if we have the bandwidth to backport it to 1.2.

http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=history;f=fs/ocfs2/cluster/tcp.c;h=ae4ff4a6636b23759522994898a95c148a4401f1;hb=HEAD

commit 828ae6afbef03bfe107a4a8cc38798419d6a2765
Author: Andrew Beekhof [EMAIL PROTECTED]
Date:   Mon Dec 4 14:04:55 2006 +0100

   [patch 3/3] OCFS2 Configurable timeouts - Protocol changes

   Modify the OCFS2 handshake to ensure essential timeouts are configured
   identically on all nodes.

   Only allow changes when there are no connected peers

   Improves the logic in o2net_advance_rx() which broke now that
   sizeof(struct o2net_handshake) is greater than sizeof(struct o2net_msg)

   Included is the field for userspace-heartbeat timeout to avoid the need for
   further protocol changes.

   Uses a global spinlock to ensure the decisions to update configfs entries
   are made on the correct value.  The region covered by the spinlock when
   incrementing the counter is much larger as this is the more critical case.


   Small cleanup contributed by Adrian Bunk [EMAIL PROTECTED]

   Signed-off-by: Andrew Beekhof [EMAIL PROTECTED]
   Signed-off-by: Mark Fasheh [EMAIL PROTECTED]

commit b5dd80304da482d77b2320e1a01a189e656b9770
Author: Jeff Mahoney [EMAIL PROTECTED]
Date:   Mon Dec 4 14:04:54 2006 +0100

   [patch 2/3] OCFS2 Configurable timeouts

   Allow configuration of OCFS2 timeouts from userspace via configfs

   Signed-off-by: Andrew Beekhof [EMAIL PROTECTED]
   Signed-off-by: Mark Fasheh [EMAIL PROTECTED]

Andy Phillips wrote:

Hello,

   I've made some progress with the o2net_idle_timer issue. Various
people seem to occasionally report instability and faults where the
following message is generated;

(From Andrew Brunton)
Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1310 connection to
node argon1.crewe.ukfuels.co.uk (num 0) at 10.1.1.110: has been idle
for 10 seconds, shutting it down.

(From Peter Santos)
Nov 21 11:40:36 dbo3 kernel: o2net: connection to node dbo2 (num 1) at
192.168.134.141: has been idle for 10 seconds, shutting it down.

And from me;
Aug  2 19:06:27 fred kernel: o2net: connection to node barney (num 0) at
172.16.6.10: has been idle for 10 seconds, shutting it down.

I've tried unsuccessfully to replicate the issue on my testbed
environment. The problem stems from the o2net layer function
'o2net_idle_timer' firing after not receiving a valid packet for
O2NET_IDLE_TIMEOUT_SECS, which is defined to be 10 seconds in
ocfs2-1.2.3/fs/ocfs2/cluster/tcp_internal.h. This then causes the rest
of the code to fall over in a heap, once the underlying socket goes.

It turns out that it's very likely not a bug in ocfs2.

This code is doing what it's supposed to do. Others will (and have)
argued that the network timeout is too low - see any and all posts by
Alexei to this list. Leaving that aside, or indeed the idea that the
network layer should make an attempt at reconnecting before killing the
entire machine, I'll focus on the causes we've found here of this
problem which are not spanning tree related.


One common thread is that people finding this are on EM64T or Opteron
based systems. There are various bugs reported against RedHat Linux (and
probably SuSE as well) for the kernels before RHAS 4.4. 


e.g. page 16 of this document - lost ticks Message Under Stress With
Non Uniform Memory Access Enabled on AMD Processor-Based Systems
http://support.dell.com/support/edocs/software/osrhel4/en/INT/HJ834A00.pdf

Or oracle bug 4593892 referenced in;
http://www.oracle.com/technology/tech/linux/validated-configurations/html/vc_dell6850-rhel4-cx500-1_1.html

We were also seeing messages of the form;

Dec 18 10:35:44 gs2dwdb02 kernel: warning: many lost ticks.
Dec 18 10:35:44 gs2dwdb02 kernel: Your time source seems to be instable
or some driver is hogging interupts
(sic)

Our problem seems to have been at least partially down to dodgy AMI
megaraid firmware for the system disks. We were getting messages from
the megaraid driver module on the console, which correlated with dropped
packets as logged by Oracle RAC's cssd.log. 


So given the above numa and driver/hardware errors its likely that ocfs2
was going for periods as long as 10 seconds without receiving a packet,
and failing accordingly.

Ocfs2 was hit the worst, as it has the finest trigger on lost packets.
The heartbeat failure times for rac are over 60 seconds. The o2cb
heartbeat is set to 61 for us, which is about 120 seconds IIRC, which is
fine for interruptions to the SAN/multipathing failover failures. 


We're planning an upgrade to 4.4 which apparently has fixed several of
these bugs, and would recommend others with

Re: [Ocfs2-users] Kernel panic - not syncing: ocfs2 is very sorry

2007-01-05 Thread Sunil Mushran


Lot of ink has been spilled on this subject. ;)

Check out the heartbeat section in the FAQ. One easy solution is to
increase the hb timeout to 60 secs...
O2CB_HEARTBEAT_THRESHOLD = 31

We are leaning towards making that number the default in the 1.4 release.
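
The arithmetic behind that threshold, as a small sketch. It assumes the relation from the FAQ's heartbeat section as I read it, timeout = (threshold - 1) * 2 secs, i.e. one disk heartbeat every 2 seconds; the function and constant names are made up.

```python
HB_INTERVAL_SECS = 2  # assumed gap between disk heartbeats


def hb_timeout_secs(threshold):
    """Seconds without a heartbeat before a node is declared dead."""
    return (threshold - 1) * HB_INTERVAL_SECS


def hb_threshold_for(timeout_secs):
    """O2CB_HEARTBEAT_THRESHOLD value giving the requested timeout."""
    return timeout_secs // HB_INTERVAL_SECS + 1


print(hb_timeout_secs(31))   # -> 60
print(hb_threshold_for(60))  # -> 31
print(hb_timeout_secs(61))   # -> 120
```

That agrees with the numbers seen on this list: 31 gives 60 secs, and a threshold of 61 gives about 120 secs.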

George Liu wrote:

Both systems crash with the following message on their consoles,

Index 19: took 0 ms to do submit_bio for read
Index 20: took  ms to do waiting for read completion
(6,0):o2hb_stop_all_regions:1908 ERROR: stopping heartbeat on all active
regions.
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
system by panicing

kernel: Linux  2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:28:02 EDT 2006
i686 i686 i386 GNU/Linux
other : ocfs2-2.6.9-42.0.3.ELsmp-1.2.3-1.i686.rpm 
ocfs2-tools-1.2.2-1.i386.rpm   ocfs2console-1.2.2-1.i386.rpm





Re: [Ocfs2-users] mount error

2007-01-09 Thread Sunil Mushran


You are using two different versions of ocfs2 on the two nodes.
Different enough that they are not network compatible.
It is working as designed.

Consulente3 wrote:

Hi,
I'm new to ocfs2, and in my test's environment, i have:

2 node, becks and vaix

becks can mount ocfs2 fs, but vaix can't.

When vaix try to mount fs, it raise this error:

vaix:/# mount -t ocfs2 /dev/etherd/e2.0 /ocfs2/

mount.ocfs2: Transport endpoint is not connected while mounting
/dev/etherd/e2.0 on /ocfs2/

in becks syslog:

Jan 10 00:57:38 localhost kernel: (23628,0):o2net_check_handshake:1107
node vaix (num 1) at 10.1.7.151: advertised net protocol version 4
but 1 is required, disconnecting
Jan 10 00:57:44 localhost kernel: (23628,0):o2net_connect_expired:1444
ERROR: no connection established with node 1 after 10 seconds, giving up
and returning errors.

becks and vaix are two different linux versions (debian and redhat),
whith different kernel and ocfs2tools

thanks


Re: [Ocfs2-users] OCFS2 crash

2007-01-16 Thread Sunil Mushran


Looks to be running out of lowmem.

# date
# cat /proc/meminfo
# cat /proc/slabinfo

Run a script that dumps the above every 1 to 5 mins. That should
help explain the cause.
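
One possible shape for such a script, sketched in Python rather than shell. The function names, the log path in the usage note and the 60 second default are all just illustrative choices; /proc/slabinfo may need root to read, so unreadable files are noted in the output instead of aborting the loop.

```python
import time
from datetime import datetime

SOURCES = ("/proc/meminfo", "/proc/slabinfo")


def snapshot(paths=SOURCES):
    """Return one timestamped dump of the given files as text."""
    parts = ["==== " + datetime.now().isoformat(timespec="seconds") + " ===="]
    for path in paths:
        parts.append("---- " + path + " ----")
        try:
            with open(path) as fh:
                parts.append(fh.read().rstrip())
        except OSError as exc:
            parts.append("(unreadable: %s)" % exc)
    return "\n".join(parts) + "\n"


def run(logfile, interval_secs=60, count=None, paths=SOURCES):
    """Append a snapshot to logfile every interval_secs; count=None means forever."""
    done = 0
    while count is None or done < count:
        # Reopen each time so the data is on disk even if the box dies mid-run.
        with open(logfile, "a") as out:
            out.write(snapshot(paths))
        done += 1
        if count is None or done < count:
            time.sleep(interval_secs)
```

Something like run("/var/tmp/ocfs2-mem.log", interval_secs=60) left running until the next crash should show how the lowmem and slab numbers move beforehand.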

Brian Sieler wrote:

Using 2-node clustered file system on DELL/EMC SAN/RHEL
2.6.9-34.0.2.ELsmp x86_64.

Config:

O2CB_HEARTBEAT_THRESHOLD=30

Kernel param: elevator=deadline (per FAQ)

These log items appear and the server crashes. Has happened twice now
at three week intervals, each time during a heavy IO operation:

Jan 15 16:08:29 db100 kernel: (3898,6):o2hb_setup_one_bio:371 ERROR: Could not alloc slots BIO!
Jan 15 16:08:29 db100 kernel: (3898,6):o2hb_read_slots:507 ERROR: status = -12
Jan 15 16:08:29 db100 kernel: (3898,6):o2hb_do_disk_heartbeat:973 ERROR: status = -12
Jan 15 16:08:29 db100 kernel: (3898,6):o2hb_setup_one_bio:371 ERROR: Could not alloc slots BIO!
Jan 15 16:08:29 db100 kernel: (3898,6):o2hb_read_slots:507 ERROR: status = -12
Jan 15 16:08:29 db100 kernel: (3898,6):o2hb_do_disk_heartbeat:973 ERROR: status


Can't find much on any of these errors…what is 507 ERROR status = -12?
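
On that question: the 507 looks like the source line number in the function:line prefix, and the status appears to be a negated errno value, as is conventional in the kernel. A small sketch for decoding such values (the function name is made up; errno numbering as on Linux):

```python
import errno
import os


def decode_status(status):
    """Map a kernel-style negative status to (errno name, description)."""
    code = abs(status)
    return errno.errorcode.get(code, "unknown"), os.strerror(code)


for status in (-12, -5):
    name, text = decode_status(status)
    print(status, name, text)
```

-12 comes out as ENOMEM, which fits the lowmem diagnosis in the reply at the top of this message, and -5 is EIO.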

Any help appreciated




[Ocfs2-users] ocfs2-1.2.4 RC2 released

2007-01-17 Thread Sunil Mushran


All,

http://oss.oracle.com/~smushran/.ocfs2-1.2.4-0.2/

The final 1.2.4 should look very close to this drop. We still have one
slippery issue open that we are working on. But, other than that, this
drop is looking good.

The list of patches added post 1.2.4-0.1 is as follows:

r2948: fs - Allow direct I/O read past end of file
r2950: fs - Don't print errors when following symlinks
r2951: dlm - Fixes race between migrate and dirty
r2952: dlm - dlmunlock waits for migration to complete before unlocking
r2953: dlm - migrate lockres handler looks for its lock on all queues
r2954: fs - Directory c/mtime update fixes
r2955: fs - Cleanup ocfs2_iget() errors
r2956: dlm - Flush dlm workqueue before starting to migrate
r2957: dlm - Drop inflight refmap even if no locks found on the lockres
r2958: dlm - dlm dispatch was stopping too early
r2959: dlm - wake up sleepers on the lockres waitqueue
r2960: dlm - Silence a failed lock convert
r2961: dlm - Dump dlm work queue entries for debugging
r2962: dlm - Cookies in locks not being printed correctly in error messages
r2963: o2net - Added post handler callable function in o2net message handler
r2964: dlm - Calling post handler function in assert master handler


Thanks
OCFS2 Team


Re: [Ocfs2-users] ocfs2 keeps fencing all my nodes

2007-01-18 Thread Sunil Mushran

1. In SLES10, /config has been moved to /sys/kernel/config. That's how it
is on mainline.

2. To monitor heartbeat do:
# watch -d -n2 debugfs.ocfs2 -R hb /dev/sdX
This command will work if you have ocfs2-tools 1.2.2. (Not sure whether
SLES10 ships with 1.2.2 or 1.2.1.) If 1.2.1, do:
# watch -d -n2 echo \hb\ | debugfs.ocfs2 -n /dev/sdX | grep -v 
\  \


3. Configure netconsole to catch any oops stack trace.

4. From the looks of it the issue is related to the disk hb timeout.
Check the FAQ on increasing it to 60 secs from a default of 14 secs.
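
For the arithmetic behind point 4, here is a small sketch. It assumes the
relation given in the OCFS2 FAQ, timeout = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2
seconds, with one heartbeat every 2 seconds; the function names are made up
for illustration, so check the FAQ for your release before relying on it.

```shell
# Threshold needed for a desired disk heartbeat timeout in seconds,
# assuming threshold = timeout / 2 + 1 (one heartbeat every 2 seconds).
hb_threshold_for_timeout() {
    echo $(( $1 / 2 + 1 ))
}

# Inverse: timeout in seconds implied by a given threshold.
hb_timeout_for_threshold() {
    echo $(( ($1 - 1) * 2 ))
}

hb_threshold_for_timeout 60   # prints 31
hb_timeout_for_threshold 31   # prints 60
```

Under that assumption a 60 second timeout corresponds to
O2CB_HEARTBEAT_THRESHOLD=31, and the value has to match on every node.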

John Lange wrote:

I have a 4 node SLES 10 cluster with all nodes attached to a SAN via
fiber.

The SAN has a EVMS volume formatted with ocfs2. Below is my ocfs2.conf.

I can mount the volume on any single node but as soon as I mount it on
the second node, it fences one of the nodes. There is never more than
one node active at a time.

When I check the status of the nodes (quickly before they get fenced)
the status shows they are heartbeating.

# /etc/init.d/o2cb status
Module configfs: Loaded
Filesystem configfs: Mounted
Module ocfs2_nodemanager: Loaded
Module ocfs2_dlm: Loaded
Module ocfs2_dlmfs: Loaded
Filesystem ocfs2_dlmfs: Mounted
Checking O2CB cluster ocfs2: Online
Checking O2CB heartbeat: Active

 


Here are the logs from 2 machines (NOTE that these are the logs from 2
machines at the same time, as they were captured via remote syslog on a
3rd machine) of what happens when the node vs2 is already
running, and node vs3 joins the cluster (mounts the ocfs2 file system).
In this instance vs3 gets fenced.

Jan 18 14:52:41 vs2 kernel: o2net: accepted connection from node vs3 (num 2) at 10.1.1.13:
Jan 18 14:52:41 vs3 kernel: o2net: connected to node vs2 (num 1) at 10.1.1.12:
Jan 18 14:52:45 vs3 kernel: OCFS2 1.2.3-SLES Thu Aug 17 11:38:33 PDT 2006 (build sles)
Jan 18 14:52:45 vs2 kernel: ocfs2_dlm: Node 2 joins domain 89FC5CB6C98B43B998AB8492874EA6CA
Jan 18 14:52:45 vs2 kernel: ocfs2_dlm: Nodes in domain (89FC5CB6C98B43B998AB8492874EA6CA): 1 2
Jan 18 14:52:45 vs3 kernel: ocfs2_dlm: Nodes in domain (89FC5CB6C98B43B998AB8492874EA6CA): 1 2
Jan 18 14:52:45 vs3 kernel: kjournald starting.  Commit interval 5 seconds
Jan 18 14:52:45 vs3 kernel: ocfs2: Mounting device (253,13) on (node 2, slot 0)
Jan 18 14:52:45 vs3 udevd-event[5542]: run_program: ressize 256 too short
Jan 18 14:52:51 vs2 kernel: o2net: connection to node vs3 (num 2) at 10.1.1.13: has been idle for 10 seconds, shutting it down.
Jan 18 14:52:51 vs2 kernel: (0,0):o2net_idle_timer:1314 here are some times that might help debug the situation: (tmr 1169153561.99906 now 1169153571.93951 dr 1169153566.98030 adv 1169153566.98039:1169153566.98040 func (09ab0f3c:504) 1169153565.211482:1169153565.211485)
Jan 18 14:52:51 vs3 kernel: o2net: no longer connected to node vs2 (num 1) at 10.1.1.12:
Jan 18 14:52:51 vs2 kernel: o2net: no longer connected to node vs3 (num 2) at 10.1.1.13:

==

I previously had configured ocfs2 for userspace heartbeating but
couldn't get that running so I reconfigured for disk based. Could that
now be the cause of this problem?

Where do the nodes write the heartbeats? I see nothing on the ocfs2
system.

Also, I have no /config directory that is mentioned in the docs. Is that
normal?

Here is /etc/ocfs2/cluster.conf

node:
ip_port = 
ip_address = 10.1.1.11
number = 0
name = vs1
cluster = ocfs2

node:
ip_port = 
ip_address = 10.1.1.12
number = 1
name = vs2
cluster = ocfs2

node:
ip_port = 
ip_address = 10.1.1.13
number = 2
name = vs3
cluster = ocfs2

node:
ip_port = 
ip_address = 10.1.1.14
number = 3
name = vs4
cluster = ocfs2

cluster:
node_count = 4
name = ocfs2


Regards,

Any tips on how I can go about diagnosing this problem?

Thanks,
John Lange




Re: [Ocfs2-users] ocfs2_cdsl_follow_link errors

2007-01-22 Thread Sunil Mushran


#define EACCES 13 /* Permission denied */
The messages are harmless. Patch to silence them has already been checked
into the 1.2 repo and mainline git.
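
The status values in these kernel messages are negated Linux errno codes.
As a rough illustration, the helper below (a hypothetical name, covering
only the two values that come up on this list) maps them back to their
errno names:

```shell
# Map the negative status codes seen in these logs to errno names.
# Only the two values discussed on the list are covered.
decode_status() {
    case "${1#-}" in
        12) echo "ENOMEM (Out of memory)" ;;
        13) echo "EACCES (Permission denied)" ;;
        *)  echo "unknown (check errno.h)" ;;
    esac
}

decode_status -12
decode_status -13
```

So status = -13 here is a permission failure while following a symlink,
whereas status = -12 would point at a memory allocation failure.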

Matthew Flusche wrote:


I’m seeing the following errors in my two node cluster. Is this 
anything to be concerned with?


Host information:

RedHat AS 4U4 2.6.9-42.0.2.ELsmp (x86_64)

ocfs2-2.6.9-42.0.2.ELsmp-1.2.3-1

ocfs2console-1.2.2-1

ocfs2-tools-debuginfo-1.2.2-1

ocfs2-tools-1.2.2-1

Jan 22 02:51:44 host1 kernel: (3666,1):ocfs2_cdsl_follow_link:372 
ERROR: status = -13


Jan 22 02:51:44 host1 kernel: (3666,1):ocfs2_follow_link:410 ERROR: 
status = -13


Jan 22 02:51:44 host1 kernel: (3666,1):ocfs2_cdsl_follow_link:372 
ERROR: status = -13


Jan 22 02:51:44 host1 kernel: (3666,1):ocfs2_follow_link:410 ERROR: 
status = -13


Jan 22 02:51:44 host1 kernel: (3666,1):ocfs2_cdsl_follow_link:372 
ERROR: status = -13


Jan 22 02:51:44 host1 kernel: (3666,1):ocfs2_follow_link:410 ERROR: 
status = -13


Jan 22 03:22:31 host1 kernel: (3666,3):ocfs2_cdsl_follow_link:372 
ERROR: status = -13


Jan 22 03:22:31 host1 kernel: (3666,3):ocfs2_follow_link:410 ERROR: 
status = -13


Jan 22 03:22:31 host1 kernel: (3666,3):ocfs2_cdsl_follow_link:372 
ERROR: status = -13


Jan 22 03:22:31 host1 kernel: (3666,3):ocfs2_follow_link:410 ERROR: 
status = -13


Jan 22 03:22:31 host1 kernel: (3666,3):ocfs2_cdsl_follow_link:372 
ERROR: status = -13


Jan 22 03:22:31 host1 kernel: (3666,3):ocfs2_follow_link:410 ERROR: 
status = -13


Jan 22 05:06:04 host1 kernel: (18268,0):ocfs2_cdsl_follow_link:372 
ERROR: status = -13


Jan 22 05:06:04 host1 kernel: (18268,0):ocfs2_follow_link:410 ERROR: 
status = -13


Jan 22 05:06:04 host1 kernel: (18268,0):ocfs2_cdsl_follow_link:372 
ERROR: status = -13


Jan 22 05:06:04 host1 kernel: (18268,0):ocfs2_follow_link:410 ERROR: 
status = -13


Jan 22 05:06:04 host1 kernel: (18268,0):ocfs2_cdsl_follow_link:372 
ERROR: status = -13


Jan 22 05:06:04 host1 kernel: (18268,0):ocfs2_follow_link:410 ERROR: 
status = -13


Jan 22 06:04:27 host1 kernel: (24727,1):ocfs2_cdsl_follow_link:372 
ERROR: status = -13


Jan 22 06:04:27 host1 kernel: (24727,1):ocfs2_follow_link:410 ERROR: 
status = -13


Jan 22 06:04:27 host1 kernel: (24727,1):ocfs2_cdsl_follow_link:372 
ERROR: status = -13


Jan 22 06:04:27 host1 kernel: (24727,1):ocfs2_follow_link:410 ERROR: 
status = -13


Jan 22 06:04:27 host1 kernel: (24727,1):ocfs2_cdsl_follow_link:372 
ERROR: status = -13


Jan 22 06:04:27 host1 kernel: (24727,1):ocfs2_follow_link:410 ERROR: 
status = -13


Regards,

Matt




Re: [Ocfs2-users] kernel panic - not syncing

2007-01-22 Thread Sunil Mushran


o2net timeout cannot cause the o2hb panic. The two are totally
different. From the outputs, I would guess o2hb is timing out, but
I cannot say for sure until I see the full logs.
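
One way to rule out a threshold that was configured but never took effect
is to compare the two values directly. The sketch below assumes the
configured value lives in /etc/sysconfig/o2cb and that the proc file holds
a bare number; the function name is made up, and both paths can be
overridden.

```shell
# Compare the configured O2CB_HEARTBEAT_THRESHOLD with the value the
# running cluster stack reports. Paths are parameters so the check can
# be pointed at copies of the files.
check_hb_threshold() {
    conf="${1:-/etc/sysconfig/o2cb}"
    live="${2:-/proc/fs/ocfs2_nodemanager/hb_dead_threshold}"
    # Take the last assignment in the file, as a sourced file would.
    configured="$(sed -n 's/^O2CB_HEARTBEAT_THRESHOLD=//p' "$conf" | tail -n 1)"
    effective="$(cat "$live")"
    if [ "$configured" = "$effective" ]; then
        echo "OK: threshold $effective is in effect"
    else
        echo "MISMATCH: configured=$configured effective=$effective"
        return 1
    fi
}
```

A mismatch usually means the cluster stack was not restarted after the
config change, or that the o2cb init script ignores the setting.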

Andy Phillips wrote:
It's worth pointing out that the o2net idle timer is triggering on the
network heartbeat, which is 10 seconds in the current 1.2.x series.

O2CB_HEARTBEAT_THRESHOLD has no effect on this, because it's another part
of the code which causes the problem.

see ocfs2-1.2.3/fs/ocfs2/cluster/tcp_internal.h
#define O2NET_IDLE_TIMEOUT_SECS 10

Andy


On Mon, 2007-01-22 at 09:29 -0800, Srinivas Eeda wrote:
  

The problem appears to be that IO is taking more time than the effective
O2CB_HEARTBEAT_THRESHOLD. Your configured value of 31 doesn't seem to be
in effect?

Index 6: took 1995 ms to do msleep
Index 17: took 1996 ms to do msleep

Index 22: took 10001 ms to do waiting for read completion.

Can you please cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold and verify. 


Thanks,
--Srini.




Consulente3 wrote:

Hi all, 


My test environment is composed of 2 servers running CentOS 4.4;
the nodes export storage with aoe6-43 + vblade-14.

kernel-2.6.9-42.0.3.EL
ocfs2-tools-1.2.2-1
ocfs2console-1.2.2-1
ocfs2-2.6.9-42.0.3.EL-1.2.3-1

/dev/etherd/e2.0 on /ocfs2 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/etherd/e3.0 on /ocfs2_nfs type ocfs2 (rw,_netdev,heartbeat=local)

Device            FS     Nodes
/dev/etherd/e2.0  ocfs2  ocfs2, becks
/dev/etherd/e3.0  ocfs2  ocfs2, becks

Device            FS     UUID                                  Label
/dev/etherd/e2.0  ocfs2  b24cc18d-af89-4980-a75e-a87530b1b878  test1
/dev/etherd/e3.0  ocfs2  101a92fd-b83b-4294-8bfc-fbaa069c3239  nfs4

O2CB_HEARTBEAT_THRESHOLD=31

When I run a stress test:

Index 4: took 0 ms to do checking slots
Index 5: took 2 ms to do waiting for write completion
Index 6: took 1995 ms to do msleep
Index 7: took 0 ms to do allocating bios for read
Index 8: took 0 ms to do bio alloc read
Index 9: took 0 ms to do bio add page read
Index 10: took 0 ms to do submit_bio for read
Index 11: took 2 ms to do waiting for read completion
Index 12: took 0 ms to do bio alloc write
Index 13: took 0 ms to do bio add page write
Index 14: took 0 ms to do submit_bio for write
Index 15: took 0 ms to do checking slots
Index 16: took 1 ms to do waiting for write completion
Index 17: took 1996 ms to do msleep
Index 18: took 0 ms to do allocating bios for read
Index 19: took 0 ms to do bio allo read
Index 20: took 0 ms to do bio add page read
Index 21: took 0 ms to do submit_bio for read
Index 22: took 10001 ms to do waiting for read completion
(3,0):o2hb_stop_all_regions:1908 ERROR: stopping heartbeat on all active
regions.
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
system by panicing


6o2net: connection to node ocfs2 (num 2) at 10.1.7.107:777 has been
idle for 10 seconds, shutting it down
(3,0): o2net_idle_timer:1309 here are some times that might help debug
the situation:
(tmr: 1169487957.71650 now 1169487967.69569 dr 1169487962.3 adv
1169487957.71671:1159487957.71674
func 83bce37b2:505) 1169487901.984644:1169487901.984676)

The kernel panic always occurs on the same node, and the other node
keeps responding.

thanks!
 



Re: [Ocfs2-users] kernel panic - not syncing

2007-01-22 Thread Sunil Mushran


I understand that. But that's not what the user experienced in this case.
One node ran into the o2hb timeout (and panic) that caused the o2net
message on the other node.

These are two separate issues. FWIW, I am trying to get the o2net config
backported to the 1.2 tree.
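
To keep the two issues apart when reading a log, it can help to label
each line by the subsystem that emitted it. A rough sketch, with the
function name and labels being purely illustrative: lines mentioning an
o2hb_ function relate to the disk heartbeat (governed by
O2CB_HEARTBEAT_THRESHOLD), lines mentioning o2net relate to the network
idle timer (O2NET_IDLE_TIMEOUT_SECS).

```shell
# Label each kernel log line read from stdin by the subsystem that
# emitted it: o2hb (disk heartbeat) or o2net (network idle timer).
classify_ocfs2_log() {
    while IFS= read -r line; do
        case "$line" in
            *o2hb_*)  echo "disk-heartbeat: $line" ;;
            *o2net*)  echo "network: $line" ;;
            *)        echo "other: $line" ;;
        esac
    done
}

# Typical use on a saved log:
# classify_ocfs2_log < /var/log/messages | grep -v '^other: '
```

Comparing the timestamps of the first disk-heartbeat line and the first
network line on each node shows which timeout fired first.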

Andy Phillips wrote:
With respect, Sunil,

the observed problems I see normally go like this:

- o2net timeout - socket closes.
Aug  2 19:06:27 fred kernel: o2net: connection to node barney (num 0) at
172.16.6.10: has been idle for 10 seconds, shutting it down.
Aug  2 19:06:27 fred kernel: (0,7):o2net_idle_timer:1309 here are some
times that might help debug the situation: (tmr 1154545576.798263 now

- Upper layers realise they have no connection, and panic the box.
  
Aug  2 19:06:27 fred kernel: o2net: no longer connected to node barney (num 0) at 172.16.6.10:
Aug  2 19:08:33 fred kernel: (25,7):o2quo_make_decision:143 ERROR: fencing this node because it is connected to a half-quorum of 1 out of 2 nodes which doesn't include the lowest active node 0

Irrespective of that, the o2net message observed comes from the
value of O2NET_HEARTBEAT_TIMEOUT, not the o2cb heartbeat.
The code that is probably giving you that error message is the function
o2net_idle_timer, which is referenced in your error message and lives in
ocfs2-1.2.3/fs/ocfs2/cluster/tcp.c:

printk(KERN_INFO "o2net: connection to " SC_NODEF_FMT " has been idle for 10 "
       "seconds, shutting it down.\n", SC_NODEF_ARGS(sc));
mlog(ML_NOTICE, "here are some times that might help debug the "
     "situation: (tmr %ld.%ld now %ld.%ld dr %ld.%ld adv "
     "%ld.%ld:%ld.%ld func (%08x:%u) %ld.%ld:%ld.%ld)\n",
     sc->sc_tv_timer.tv_sec, sc->sc_tv_timer.tv_usec,
     now.tv_sec, now.tv_usec,
     sc->sc_tv_data_ready.tv_sec, sc->sc_tv_data_ready.tv_usec,
     sc->sc_tv_advance_start.tv_sec, sc->sc_tv_advance_start.tv_usec,
     sc->sc_tv_advance_stop.tv_sec, sc->sc_tv_advance_stop.tv_usec,
     sc->sc_msg_key, sc->sc_msg_type,
     sc->sc_tv_func_start.tv_sec, sc->sc_tv_func_start.tv_usec,
     sc->sc_tv_func_stop.tv_sec, sc->sc_tv_func_stop.tv_usec);

The original post only posted that error message, but the other error
messages usually follow. If I'm wrong, please email me directly and help
sort out my understanding. 


Andy


Re: [Ocfs2-users] ocfs2 kernel bug in Fedora Core 4 update kernel

2007-01-23 Thread Sunil Mushran

This was the lvb issue that was fixed long ago. In the 1.2 tree, it was 
fixed in 1.2.2.

2.6.18 should definitely have the fix for this.

davide rossetti wrote:

OS: Fedora Core release 4 (Stentz)
KERNEL: Linux rack1.ape 2.6.17-1.2142_FC4smp #1 SMP Tue Jul 11 
22:57:02 EDT 2006 i686 i686 i386 GNU/Linux

CLUSTER: 11 Linux kernels, mixed environment FC4,FC5,FC6
SAN: FC Infortrend storage, QLogic 16-port FC switch, FC adapter LSI FC929X



(21224,1):ocfs2_truncate_file:242 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
(21224,1):ocfs2_truncate_file:242 ERROR: Inode 1029752381, inode i_size = 582 != di i_size = 690, i_flags = 0x1
[ cut here ]
kernel BUG at fs/ocfs2/file.c:242!
invalid opcode:  [#1]
SMP
last sysfs file: /class/vc/vcs12/dev
Modules linked in: nfs nfsd exportfs lockd nfs_acl ipv6 autofs4 ocfs2 
rfcomm l2cap bluetooth ocfs2_dlmfs ocfs2
_dlm ocfs2_nodemanager configfs sunrpc video button battery ac 
uhci_hcd e7xxx_edac edac_mc i2c_i801 i2c_core t
g3 e100 mii floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod mptfc 
scsi_transport_fc mptscsih mptbase sd_m

od scsi_mod
CPU:1
EIP:0060:[f8bc5ebe]Not tainted VLI
EFLAGS: 00010286   (2.6.17-1.2142_FC4smp #1)
EIP is at ocfs2_setattr+0x6a8/0x1000 [ocfs2]
eax: 0073   ebx:    ecx: 0246   edx: 0246
esi: 02b2   edi:    ebp: c7fe   esp: efc31e30
ds: 007b   es: 007b   ss: 0068
Process cp (pid: 21224, threadinfo=efc31000 task=f7e37730)
Stack:  efc31ea0 0001 f0cbf9c0 f5bf2000 f0cbfba0 0001 
c9be8a60
   f5bf2000    c9be8a60  f0cbfa78 
efc31ea0
   f0cbf9c0 c047e75e e6875774 0122 0068 45b55d27  
e6875774

Call Trace:
 c047e75e notify_change+0x164/0x300  c0465388 do_truncate+0x54/0x6c
 c047372f may_open+0x1a8/0x1fc  c04755ad open_namei+0x24b/0x5c3
 c04666a6 do_filp_open+0x1c/0x31  c04667ad do_sys_open+0x3c/0xa9
 c0466847 sys_open+0x16/0x18  c0403d2f syscall_call+0x7/0xb
Code: fc ff ff ff b1 c0 fc ff ff 68 f2 00 00 00 68 8b 54 be f8 ff 70 
10 8b 00 ff b0 b4 00 00 00 68 52 9e be f8
 e8 1f e4 85 c7 83 c4 30 0f 0b f2 00 21 99 be f8 8b 4d 24 39 4c 24 
28 8b 55 20 0f 82 c6

EIP: [f8bc5ebe] ocfs2_setattr+0x6a8/0x1000 [ocfs2] SS:ESP 0068:efc31e30
 BUG: cp/21224, lock held at task exit time!
 [f0cbfa44] {inode_init_once}
.. held by:cp:21224 [f7e37730, 119]
... acquired at:   do_truncate+0x4b/0x6c
(2535,0):o2net_set_nn_state:415 accepted connection from node rack10 (num 11) at 10.0.2.30:
(2535,0):__dlm_print_nodes:377 Nodes in my domain (41AE1AA4C5534E50A93784D2AD94A94D):

(2535,0):__dlm_print_nodes:381  node 1
(2535,0):__dlm_print_nodes:381  node 2
(2535,0):__dlm_print_nodes:381  node 3
(2535,0):__dlm_print_nodes:381  node 4
(2535,0):__dlm_print_nodes:381  node 5
(2535,0):__dlm_print_nodes:381  node 6
(2535,0):__dlm_print_nodes:381  node 7
(2535,0):__dlm_print_nodes:381  node 8
(2535,0):__dlm_print_nodes:381  node 9
(2535,0):__dlm_print_nodes:381  node 10
(2535,0):__dlm_print_nodes:381  node 11
(795,0):o2net_idle_timer:1284 connection to node rack6 (num 8) at 10.0.2.26: has been idle for 10 seconds, shutting it down.

--
[EMAIL PROTECTED] mailto:[EMAIL PROTECTED] 
ICQ:290677265 SKYPE:d.rossetti




Re: [Ocfs2-users] ocfs2 kernel bug in Fedora Core 4 update kernel

2007-01-23 Thread Sunil Mushran


Not really. The mainline tree is labeled 1.3.x because it is the tree
we add new features to. But bug fixes are applied to both 1.2 and
1.3 separately, so it is hard to tell by the version# alone.

This is the fix in the git tree:

commit 4b1af774451bbc8440719e3fe441934a337c3b63
Author: Kurt Hackel [EMAIL PROTECTED]
Date:   Mon Jun 26 15:17:47 2006 -0700

   ocfs2: Fix lvb corruption

   Properly ignore LVB flags during a PR downconvert. This avoids an
   illegal lvb update.

   Signed-off-by: Kurt Hackel [EMAIL PROTECTED]
   Signed-off-by: Mark Fasheh [EMAIL PROTECTED]

davide rossetti wrote:



On 1/23/07, *Sunil Mushran* [EMAIL PROTECTED] 
mailto:[EMAIL PROTECTED] wrote:


This was the lvb issue that was fixed long ago. In the 1.2 tree,
it was
fixed in 1.2.2.
2.6.18 should definitely have the fix for this.


it seems it's even more recent:

/var/log/messages.4:Dec 27 19:40:40 rack1 kernel: OCFS2 Node Manager 1.3.3
/var/log/messages.4:Dec 27 19:40:40 rack1 kernel: OCFS2 DLM 1.3.3
/var/log/messages.4:Dec 27 19:40:40 rack1 kernel: OCFS2 DLMFS 1.3.3
/var/log/messages.4:Dec 27 19:40:40 rack1 kernel: OCFS2 User DLM 
kernel interface loaded
/var/log/messages.4:Dec 27 19:40:40 rack1 kernel: SELinux: initialized 
(dev ocfs2_dlmfs, type ocfs2_dlmfs), not configured for labeling

/var/log/messages.4:Dec 27 19:40:44 rack1 kernel: OCFS2 1.3.3


--
[EMAIL PROTECTED] mailto:[EMAIL PROTECTED] 
ICQ:290677265 SKYPE:d.rossetti 



Re: [Ocfs2-users] unable to configure O2CB_HEARTBEAT_THRESHOLD

2007-01-24 Thread Sunil Mushran


The o2cb script fix is in ocfs2-tools 1.2.2 released Oct 2006.
Ping SUSE for the update.

[EMAIL PROTECTED] wrote:


Using SuSE SP2 Linux running V1.0.8 of OCFS2 and the tools/console 
that comes with SP2 distribution.


I am unable to set the O2CB_HEARTBEAT_THRESHOLD parameter in the
/etc/sysconfig/o2cb file.

In doing so, running o2cb configure overwrites it.
Also noted that o2cb doesn't reference the parameter, even in its 
template file for sysconfig.
How can I get this set? It has left our cluster in a hopeless 
situation and I'm reluctant to install SuSE SP3 as this will mean a 
full install of asmlibs and other software rebuilds.


Phil Broughton




Re: [Ocfs2-users] ocfs2 kernel bug in Fedora Core 4 update kernel

2007-01-24 Thread Sunil Mushran


This is not a fs issue. As in, the file itself should be all right. This is a dlm issue.
The fs is asking the dlm to free the lock and the dlm is stuck. How many
nodes do you have? We've fixed a bunch of dlm bugs since what you appear
to be running.

davide rossetti wrote:

I rebooted the two faulty nodes.
Now I can no longer access the file which was involved in the crash:
/mi11/simma/ghmc/m24/JOB.log

using the faq document, I'm trying to check the situation:

Lockres: M003d60c63da894d788  Mode: No Lock
Flags: Initialized Attached Busy
RO Holders: 0  EX Holders: 0
Pending Action: Convert  Pending Unlock Action: None

sh-3.00# echo locate D003d60c63da894d788 | 
/sbin/debugfs.ocfs2 /dev/sdc1

debugfs.ocfs2 1.2.2
debugfs: 1029752381  /mi11/simma/ghmc/m24/JOB.log

On another shell, I have a stuck un-killable process:
theboss 14:25 (5) ~ls /storage/disk1/mi11/simma/ghmc/m24/

rossetti  7930  0.0  0.0  62324   872 pts/13   D+   14:26   0:00 ls 
--color=tty -F /storage/disk1/mi11/simma/ghmc/m24/



How should I proceed to unlock the file and/or remove it ???

--
[EMAIL PROTECTED] mailto:[EMAIL PROTECTED] 
ICQ:290677265 SKYPE:d.rossetti 



[Ocfs2-users] OCFS2 1.2.4-2 released

2007-02-02 Thread Sunil Mushran


All,

We are pleased to announce the release of OCFS2 1.2.4-2.

This release addresses the lowmem consumption issue that has plagued
many users. It also addresses a few races in the dlm relating to
lockres migration.

The complete list of changes post 1.2.3 is available here:
http://oss.oracle.com/projects/ocfs2/news/article_10.html

Please note that we did have to update the network protocol in the 1.2.4
release and thus cannot support rolling upgrade from 1.2.3 or earlier to 
1.2.4.

We are well aware that this causes problems for many users and we would not
have made the change if we did not think it necessary. We apologize for
the inconvenience.

The packages for Oracle's EL4 will be available on the ULN site sometime
early next week.

Novell has already incorporated all the patches bundled in this release in
their SLES10 SP1 code tree. Please contact Novell for the release schedule.

Packages for Red Hat's RHEL4 are available in the OCFS2 download area for
all ten kernels (starting 2.6.9-22.EL) and six architectures. Follow the
links to download the appropriate package. (Refer to the FAQ if you are
unclear as to which package you need to install.)

As always, we look forward to hearing from you on the
ocfs2-users@oss.oracle.com mailing list.

The OCFS2 Team

OCFS2: http://oss.oracle.com/projects/ocfs2
TOOLS: http://oss.oracle.com/projects/ocfs2-tools/
FAQ:   
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html
GUIDE: 
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_users_guide.pdf



Re: [Ocfs2-users] OCFS2 mount problem

2007-02-05 Thread Sunil Mushran


It could be that the device name is not the same across the two nodes.

Do:
# mounted.ocfs2 -d
on both nodes. Match the device using the uuid. As in, you
should see a device with the same uuid on both nodes. If not,
then the device is not shared.

If you do see the device on both nodes but with differing names, you
could look into mounting by label.
# mount -t ocfs2 -L label /c1
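
If the output of mounted.ocfs2 -d is saved to a file on each node, the
comparison can be done mechanically. This is only a sketch: the function
name is made up, and it assumes the column layout mounted.ocfs2 prints
(a header line, then Device, FS, UUID and Label separated by whitespace).

```shell
# Print the UUIDs that appear in both saved `mounted.ocfs2 -d` outputs.
# $1, $2: files holding the output captured on node 1 and node 2.
# Assumes a header line followed by whitespace-separated columns
# "Device FS UUID Label".
shared_uuids() {
    awk 'NR == FNR { if (FNR > 1 && $3 != "") seen[$3] = 1; next }
         FNR > 1 && ($3 in seen) { print $3 }' "$1" "$2"
}

# Typical use, after copying node2's output over to node1:
# shared_uuids node1.txt node2.txt
```

An empty result means no volume is visible on both nodes, that is, the
device is not shared.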

aibolit 66 wrote:

Hello everybody!

I'm using RHEL4 U4, kernel-2.6.9-42.0.8.EL and ocfs2-tools-1.2.2-1.
Trying to set up OCFS2 on 2 nodes following this guide http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_users_guide.pdf: 


The problem is that after creating cluster.conf via ocfs2console and
propagating it to the second node, I can format /dev/sda5 as ocfs2, but I can
mount it only on node1. Mounting on node2 fails with the following error:

[EMAIL PROTECTED] ~]# mount.ocfs2 /dev/sda5 /cl
ocfs2_hb_ctl: Bad magic number in superblock while reading uuid
mount.ocfs2: Error when attempting to run /sbin/ocfs2_hb_ctl: Operation not 
permitted

I think I'm doing something wrong, because when I configure the nodes via
ocfs2console and when I format /dev/sda5, absolutely nothing
happens between node1 and node2. No network traffic at all... Here is my
cluster.conf, which exists on both nodes:

node:
ip_port = 
ip_address = 89.XXX.134.24
number = 0
name = node1.Y.ru
cluster = ocfs2

node:
ip_port = 
ip_address = 89.XXX.134.25
number = 1
name = node2.Y.ru
cluster = ocfs2

cluster:
node_count = 2
name = ocfs2



Any help will be very appreciated.


Re: [Ocfs2-users] OCFS2 mount problem

2007-02-05 Thread Sunil Mushran


The device needs to be shared. As in, both nodes need to be able
to see the same device concurrently.

Refer to iscsi, fiber channel, aoe, etc.

aibolit 66 wrote:

-Original Message-
From: Sunil Mushran [EMAIL PROTECTED]
To: aibolit 66 [EMAIL PROTECTED]
Date: Mon, 05 Feb 2007 12:46:26 -0800
Subject: Re: [Ocfs2-users] OCFS2 mount problem

  

It could be that the device name is not the same across the two nodes.

Do:
# mounted.ocfs2 -d
on both nodes. Match the device using the uuid. As in, you
should see a device with the same uuid on both nodes. If not,
then the device is not shared.

If you do see the device on both nodes but with differing names, you
could look into mounting by label.
# mount -t ocfs2 -L label /c1





[EMAIL PROTECTED] ~]# mounted.ocfs2 -d
Device     FS     UUID                                  Label
/dev/sda5  ocfs2  6a0fbcc8-675f-42c6-9bda-5513feb98a05  oracle


[EMAIL PROTECTED] ~]# mounted.ocfs2 -d
Device     FS     UUID                                  Label

I didn't format /dev/sda5 on the second node, as the guide says.



P.S. Sorry for previous dup, thought that your server didn't accept message 
from mail.ru :(
  



Re: [Ocfs2-users] OCFS2 1.2.4-2 released

2007-02-06 Thread Sunil Mushran


That's the source.

Randy Ramsdell wrote:

Mark Fasheh wrote:
  

On Tue, Feb 06, 2007 at 10:18:51AM -0500, Randy Ramsdell wrote:
  


Is source available?

  

http://oss.oracle.com/projects/ocfs2/dist/files/source/v1.2/ocfs2-1.2.4.tar.gz
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]
  



I have this but thought it wasn't patched. Is it patched with the -2
fixes?

Randy


Re: [Ocfs2-users] ocfs2-tools-1.2.2 compile.

2007-02-06 Thread Sunil Mushran


The following patch will address this issue. The fix will be provided
with the next tools release.


Index: libocfs2/include/ocfs2.h
===
--- libocfs2/include/ocfs2.h(revision 1269)
+++ libocfs2/include/ocfs2.h(revision 1270)
@@ -48,6 +48,9 @@

#include byteorder.h

+#if !defined(offsetof)
+#   define offsetof(type,memb) ((size_t)&((type*)0)->memb)
+#endif

#if OCFS2_FLAT_INCLUDES
#include o2dlm.h


Randy Ramsdell wrote:

Hi,

The ocfs2 package compiled perfectly, but tools did not.

The test setup is using opensuse10.1 - updates applied

For ocfs2-tools-1.2.2:


In file included from include/ocfs2.h:60,
from alloc.c:32:
include/ocfs2_fs.h: In function ‘ocfs2_fast_symlink_chars’:
include/ocfs2_fs.h:566: warning: implicit declaration of function ‘offsetof’
include/ocfs2_fs.h:566: error: expected expression before ‘struct’
include/ocfs2_fs.h: In function ‘ocfs2_extent_recs_per_inode’:
include/ocfs2_fs.h:574: error: expected expression before ‘struct’
include/ocfs2_fs.h: In function ‘ocfs2_chain_recs_per_inode’:
include/ocfs2_fs.h:584: error: expected expression before ‘struct’
include/ocfs2_fs.h: In function ‘ocfs2_extent_recs_per_eb’:
include/ocfs2_fs.h:594: error: expected expression before ‘struct’
include/ocfs2_fs.h: In function ‘ocfs2_local_alloc_size’:
include/ocfs2_fs.h:604: error: expected expression before ‘struct’
include/ocfs2_fs.h: In function ‘ocfs2_group_bitmap_size’:
include/ocfs2_fs.h:614: error: expected expression before ‘struct’
include/ocfs2_fs.h: In function ‘ocfs2_truncate_recs_per_inode’:
include/ocfs2_fs.h:624: error: expected expression before ‘struct’
alloc.c: In function ‘ocfs2_init_inode’:
alloc.c:143: warning: pointer targets in passing argument 1 of ‘strcpy’
differ in signedness
alloc.c: In function ‘ocfs2_init_eb’:
alloc.c:184: warning: pointer targets in passing argument 1 of ‘strcpy’
differ in signedness
make[1]: *** [alloc.o] Error 1
make[1]: Leaving directory `/root/src/ocfs2-tools-1.2.2/libocfs2'
make: *** [libocfs2] Error 2


Anyone know how to resolve this?


Randy Ramsdell
Foreclosure.com


Re: [Ocfs2-users] 1.3.3 mount problem

2007-02-07 Thread Sunil Mushran


The datavolume code is not in mainline. But you should
be able to get Oracle RDBMS to work with it. Ensure the
init.ora parameter filesystemio_options is set to directIO.

Ivo Maya wrote:

Hi,

I need to mount ocfs2 with the datavolume option on
openSUSE 10.2 machines.
ocfs2 is version 1.3.3; does it not support the
datavolume option?

I know it's not a supported distro (RHEL, Enterprise
Linux, etc) but I want to test this specific distro.
I tried to compile the 1.2.4 version but have some
problems.

Does that mean that ocfs2 is not supported on all 2.6
kernels?
Why is it part of the kernel if you can't use all
options normally available on the official distros?

Tx
Ivo


 


Re: [Ocfs2-users] 1.2.4 symbols

2007-02-09 Thread Sunil Mushran


What does dmesg say?

Randy Ramsdell wrote:

Hi,

Everything compiled correctly for the ocfs2 package, but so far the
modules will not load, failing with the well-known module symbol error.

FATAL: Error inserting ocfs2
(/lib/modules/2.6.16.27-0.6-smp/kernel/fs/ocfs2/ocfs2.ko): Unknown
symbol in module, or unknown parameter (see dmesg)


Okay not sure what is up here, any suggestions? BTW, this is the correct
module location and I manually ran depmod.

thanks,
randy

