That was our first thought, as we had motherboard-controller-based RAID 0 in place. We have since rebuilt all nodes as JBOD, using the recommended ext4 partition creation and mount parameters. So far so good.

On Tue, Jan 11, 2011 at 4:24 PM, Ted Dunning <[email protected]> wrote:

> I have seen this also with evil disk controllers on the edge of dying.
>
> On Tue, Jan 11, 2011 at 12:10 PM, Wayne <[email protected]> wrote:
>
> > Thanks a lot for the heads up on this. We have only seen this once, but
> > if we start seeing it more we will definitely try to go back to a
> > previous version. We are using 1.6u23. Are you using the Sun JVM? We
> > were previously working with Cassandra and found OpenJDK 1.6u17 to be
> > a lot better for other reasons (CMF).
> >
> > Thanks.
> >
> > On Tue, Jan 11, 2011 at 12:22 PM, Brent Halsey <[email protected]> wrote:
> >
> > > Which JDK are you using? We've had similar problems with jdk1.6u22 on
> > > Ubuntu 10.04 in Amazon EC2. Nodes would lock up for 20-40+ minutes.
> > >
> > > We haven't done any conclusive tests yet, but we haven't seen the same
> > > problems after down rev'ing to jdk1.6u16.
> > >
> > > -brent
> > >
> > > On Mon, Jan 10, 2011 at 12:59 PM, Wayne <[email protected]> wrote:
> > > > Last night a node went AWOL and got stuck at a permanent 50% CPU
> > > > wait time. Its load also climbed steadily to 400 before we noticed
> > > > and had to hard reboot. Besides that, all other Ganglia metrics
> > > > flat-lined. Is this some sort of bizarre kernel problem? We are
> > > > using XFS with standard settings. I have seen a few postings talk
> > > > about bizarre problems like this. Can XFS be blamed, or is it more
> > > > kernel related? Is there a posting somewhere suggesting the best
> > > > file system settings? Are there recommended settings for CentOS
> > > > 5.5? We have a 10-node cluster we have been pounding for weeks, and
> > > > we can't seem to keep all ten nodes up for a 24-hour period. I am
> > > > hoping there is a lower-level problem causing much of it.
> > > >
> > > > Thanks.
