Re: [Lustre-discuss] Large Corosync/Pacemaker clusters

2012-11-06 Thread Marco Passerini
Hi,

I'm also setting up a highly available Lustre system: I configured failover 
pairs for the OSSes and MDSes, redundant Corosync rings (two separate rings, 
one on IB and one on Ethernet), and STONITH is enabled.

The current configuration seems to work fine; however, yesterday we 
ran into a problem when 4 OSSes were rebooted by STONITH. I suspect 
that Corosync missed a heartbeat because of a kernel/corosync hang 
rather than a network problem. I will try the renice solution you 
proposed.

I have been thinking that I could increase the token timeout value in 
/etc/corosync/corosync.conf to ride out short hiccups. Did you specify a 
value for this parameter, or did you leave it at the default of 1000 ms?
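For reference, the token timeout is set in the totem section of 
corosync.conf; a minimal sketch, with a purely illustrative value:

    totem {
        # Timeout in milliseconds before a lost token triggers a membership
        # reconfiguration; raising it tolerates short kernel/scheduler stalls.
        token: 10000
    }

If I read corosync.conf(5) correctly, the consensus timeout defaults to 
1.2 * token, so it generally needs to scale up along with the token value.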

Marco



On 2012-10-31 03:43, Hall, Shawn wrote:
 Thanks for the replies.  We've worked on the HA setup and have it at a
 satisfactory point where we can put it into production.  We broke it
 into an MDS pair and 4 groups of 4 OSS nodes.  From our perspective, it's
 actually easier to manage groups of 4 than groups of 2, since it's half
 as many configurations to keep track of.

 After splitting the cluster into 5 pieces, it has become much more
 responsive and stable.  It's more difficult to manage than one large
 cluster, but the stability is obviously worth it.  We've been performing
 heavy load testing and have not been able to break the cluster.  We
 did a few more things to get to this point:

 - Lowered the nice value of the corosync process to make it more
 responsive under load and prevent a node from getting kicked out due to
 unresponsiveness.
 - Increased vm.min_free_kbytes to give TCP/IP w/ jumbo frames room to
 move around.  Without this, certain nodes would hit low-memory issues
 related to networking and would get stonithed due to unresponsiveness
 (example commands are sketched below).
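 A rough sketch of what changes like these look like from the shell (the
 -10 priority and the 262144 kB threshold are illustrative placeholders,
 not a recommendation):

     # Raise corosync's scheduling priority so membership/token traffic
     # keeps flowing while the OSS is saturated with I/O.
     renice -n -10 -p $(pidof corosync)

     # Keep a larger reserve of free memory so jumbo-frame network
     # allocations don't stall; persist via /etc/sysctl.conf as well.
     sysctl -w vm.min_free_kbytes=262144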

 Thanks,
 Shawn

 -Original Message-
 From: Charles Taylor [mailto:tay...@hpc.ufl.edu]
 Sent: Wednesday, October 24, 2012 3:33 PM
 To: Hall, Shawn
 Cc: lustre-discuss@lists.lustre.org
 Subject: Re: [Lustre-discuss] Large Corosync/Pacemaker clusters


 FWIW, we are running HA Lustre using corosync/pacemaker.  We broke our
 OSSs and MDSs out into individual HA *pairs*.  We thought about other
 configurations, but this was our first step into corosync/pacemaker, so we
 decided to keep it as simple as possible.  It seems to work well.  I'm
 not sure I would attempt what you are doing, though it may be perfectly
 fine.  When HA is a requirement, it probably makes sense to avoid
 pushing the limits of what works.

 Doesn't really help you much other than to provide a data point with
 regard to what other sites are doing.

 Good luck and report back.

 Charlie Taylor
 UF HPC Center

 On Oct 19, 2012, at 12:52 PM, Hall, Shawn wrote:

 Hi,

 We're setting up fairly large Lustre 2.1.2 filesystems, each with 18
 nodes and 159 resources all in one Corosync/Pacemaker cluster, as
 suggested by our vendor.  We're getting mixed messages between our vendor
 and others on how large a Corosync/Pacemaker cluster will work well.

 1.   Are there Lustre Corosync/Pacemaker clusters out there of
 this size or larger?
 2.   If so, what tuning needed to be done to get it to work well?
 3.   Should we be looking more seriously into splitting this
 Corosync/Pacemaker cluster into pairs or sets of 4 nodes?

 Right now, our current configuration takes a long time to start/stop
 all resources (~30-45 mins), and failing back OSTs puts a heavy load on
 the cib process on every node in the cluster.  Under heavy I/O load,
 many of the nodes will show as unclean/offline and many OST resources
 will show as inactive in crm status, despite the fact that every single
 MDT and OST is still mounted in the appropriate place.  We are running 2
 corosync rings, each on a private 1 GbE network.  We have a bonded 10
 GbE network for LNET.

 Thanks,
 Shawn

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] The latest updates around Lustre and open source file systems from OpenSFS and EOFS

2012-11-06 Thread Norman Morse
Are you headed to Salt Lake City? SC is always the main HPC conference of the
year. OpenSFS and EOFS will be at SC12, Nov 12-15, talking open source
file systems at booth 2101. We've had a really busy year, with some
great progress around Lustre development in particular! Some important
new participants have joined as well. Lots of great momentum, so come by and
find out the latest!

At SC12, we've got great talks at our booth, starting Monday evening at 7:30,
and our popular popcorn and beer reception will be Tuesday evening
from 4 to 6. We've also got talks at the booth on Tuesday and Wednesday
covering the Lustre roadmap, Lustre WAN issues, Sequoia topics, and tons
more. OpenSFS/EOFS participants DataDirect Networks, Indiana University,
Lawrence Livermore National Laboratory (LLNL), NetApp, University of
Florida, Whamcloud/Intel, and Xyratex will all be talking open source file
systems.

We'll have an 'unofficial' Birds of a Feather (BOF) session Wednesday
evening the 14th.  Please stop by our booth for time, location and agenda.

The updates from the OpenSFS Community Development Working Group (CDWG) and
the OpenSFS Technical Working Group (TWG) in particular will be of interest
to open source file system technologists and users.

Mon, 7:30pm, Pam Hamilton, OpenSFS - OpenSFS Community Development Working
Group – Bringing the Lustre Community Together

Mon, 8:00pm, Dave Dillow, OpenSFS - OpenSFS Technical Working Group (TWG)
2012 Goals and Accomplishments

Full OpenSFS/EOFS Participant talks schedule:
http://www.opensfs.org/events-2/supercomputing2012

Come by and talk open source file systems!

With best regards,

Norm (OpenSFS), Hugo (EOFS)
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Large Corosync/Pacemaker clusters

2012-11-06 Thread Hall, Shawn
Hi,

Our vendor actually has several of the parameters in corosync.conf
increased by default, and we have not touched them.  These are:

Token: 1
Retransmits_before_loss: 25
Consensus: 12000
Join: 1000
Merge: 400
Downcheck: 2000

We also have secauth turned off; according to the corosync.conf manpage,
enabling it can consume 75% of your CPU cycles and cut bandwidth by a
third.  I'm not sure whether these parameters are still necessary now that
we have split our cluster up, but they haven't seemed to hurt anything
either.
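
For anyone wiring this into a config file, here is roughly how settings of
this kind sit in the totem section of corosync.conf. This is only a sketch:
the directive names are the corosync.conf(5) spellings that appear to
correspond to the list above, the token value and the ring addresses are
placeholders, and the two interface blocks just illustrate a redundant-ring
layout like the one Marco described:

    totem {
        version: 2
        secauth: off                              # skip HMAC/encryption overhead
        rrp_mode: passive                         # use both redundant rings
        token: 10000                              # ms to wait for the token (placeholder)
        token_retransmits_before_loss_const: 25
        consensus: 12000                          # must be larger than token
        join: 1000
        merge: 400
        downcheck: 2000

        interface {                               # ring 0, e.g. the IPoIB network
            ringnumber: 0
            bindnetaddr: 10.0.0.0
            mcastaddr: 226.94.1.1
            mcastport: 5405
        }
        interface {                               # ring 1, e.g. the private GbE network
            ringnumber: 1
            bindnetaddr: 10.0.1.0
            mcastaddr: 226.94.1.2
            mcastport: 5405
        }
    }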

Hope this helps,
Shawn

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss