On Sun, Sep 15, 2013 at 08:04:27PM +0200, Dave Cottlehuber wrote:
> NIF scheduler issues could be a reasonable suspect;
>
> heart: Fri Sep 13 20:59:36 2013: heart-beat time-out, no activity for
> 15 seconds
>
> 15 seconds is a *long* time however.
>
> 1.4.0 needs 14B04 or higher I think due to one of our dependencies, so
> I'd suggest reverting back to that & seeing if you are having any
> other issues.
>
> Also, probably unrelated, why is kernel polling disabled?

Honestly, on my Gentoo boxes I just use the ebuild. I have no idea why
kernel polling is false; it is whatever the default is in the ebuild, I
guess. I have no clue whether kpoll should be enabled, so I'm trusting
the default.

Since my last email, I reverted to Erlang R15B03 and it is still
crashing, with the same issues and symptoms. I can reliably make it
crash within about 10 minutes by firing up the two replication jobs and
running a data processing job. So that's something at least!

> And also likely unrelated, what sort of boxes are these running on,
> and are your baseline performance / throughput metrics holding up?

Well, the box that is failing is a dual-chip, quad-core Intel Xeon E5420
(so 8 cores total), with a measly 8 GB of RAM (it looked good when I
built the machine years ago...). I forget the details of the disks, but
it is writing to a 3ware hardware RAID array. Otherwise, Linux version
3.6.8-gentoo, gcc version 4.5.4. I haven't done a global update in
probably the last 3 months or so, but the machine is reasonably up to
date.

As to your last question about baseline performance metrics: I'm a
researcher, and I've set this up to spread my work across several
machines. So my baseline performance metric is binary: works or doesn't
work. A long time ago I was crushing a single couch server and clogging
my network, so I moved to this model, where each processing box has its
own couch and couch syncs the results. I don't really measure
throughput, as my bottleneck is the data processing step. This is a good
system when it works.

Tomorrow I will try loading up another server in the middle of an
a<->b<->c type replication, with the same databases, and see whether it
is something in my current "b" machine's configuration, or whether I can
always get CouchDB to crash. After that I will try downgrading to
R14B04+, although there isn't an ebuild for it in Gentoo's portage
anymore.

Thanks for the replies.

Regards,
James

> On 15 September 2013 15:59, Robert Newson wrote:
> > But, again, R15 is also new enough to have scheduler problems, if that
> > turns out to be your problem then this change should also fail the same
> > way. I trust R14B01 through extensive punishment, and recommend it.
> >
> > B.
> >
> > On 15 Sep 2013, at 04:14, James Marca wrote:
> >
> >> eacce
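
For anyone wanting to try the a<->b<->c replication test described above, here is a minimal sketch of how the jobs might be started through CouchDB's `_replicate` endpoint. It assumes each "<->" means a pair of continuous replications, one in each direction; the message does not say how the two replication jobs on "b" are actually configured. The host names, the database name, and the helper names `replication_docs` and `start_replication` are all hypothetical.

```python
import json
import urllib.request


def replication_docs(chain, db):
    """Build one continuous replication request per direction for each
    adjacent pair of servers in `chain` (a<->b and b<->c for [a, b, c]).

    Assumption: every "<->" is two one-way continuous replications.
    """
    docs = []
    for left, right in zip(chain, chain[1:]):
        for source, target in ((left, right), (right, left)):
            docs.append({
                "source": f"{source}/{db}",
                "target": f"{target}/{db}",
                "continuous": True,
            })
    return docs


def start_replication(server, doc):
    """POST one replication request to `server`'s _replicate endpoint
    and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{server}/_replicate",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With hypothetical hosts, `replication_docs(["http://a.example:5984", "http://b.example:5984", "http://c.example:5984"], "results")` yields four request bodies, and each could be passed to `start_replication("http://b.example:5984", doc)` to run every job on the middle machine. Running them all on "b" is a choice made for this sketch, not something the thread confirms.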
