Thanks Mike! I filed https://issues.apache.org/jira/browse/COUCHDB-2231 and linked your gist in there as a possible solution.
Adam

On Apr 30, 2014, at 1:12 AM, Mike Marino <[email protected]> wrote:

> Hi Marty,
>
> It's difficult for me to tell why couchdb is not stopping when you use
> your init script, but we had a similar issue that I fixed by patching
> the couchdb startup script ("executable"). The issue was that the
> 'shepherd' program was respawning couch after a requested shutdown.
>
> This was discussed on the list a while ago and I sent our fix out, but
> I don't think it was ever integrated. Anyway, here's the gist (for
> 1.3, though I think the file has remained the same in the newer
> versions):
>
> https://gist.github.com/7601778
>
> Cheers,
> Mike
>
> On 30.04.2014 at 06:52, Marty Hu <[email protected]> wrote:
>
> Okay, after doing a bit more work this is what I found out:
>
> 1. When I start couchdb on a fresh server, it appears to run correctly.
>
> 2. However, the conventional "sudo service couchdb stop" does not
> actually stop couchdb correctly. I know this because I can still kill
> the couchdb processes afterwards with
> ps -U couchdb -o pid= | xargs kill -9
>
> 3. We use chef for configuration, so at a set interval it will queue up
> a "sudo service couchdb restart", which will try to stop the process
> (the process won't stop) and then start a new process (this process
> will actually try to start). However, the second process will not be
> able to bind to the port (the first process never got killed and still
> holds it), so it will throw the error.
>
> I imagine that this is a configuration issue (and so not really a fault
> of you guys), but I welcome any tips about how to deal with this short
> of changing the init script to be a messy killer.
>
> On Tue, Apr 29, 2014 at 6:54 PM, Adam Kocoloski <[email protected]> wrote:
>
> > Hi Marty, the mailing list stripped out the attachments except for
> > spike.txt.
> >
> > I don't know if they're the cause of the load spikes that you see,
> > but the eaddrinuse errors are not normal.
> > They can be caused by another process listening on the same port as
> > CouchDB. Fairly peculiar stuff.
> >
> > The timeout trying to open the splits-v0.1.7 at 21:23 does line up
> > with your report that the system was heavily loaded at the time, but
> > there's really not too much to go on here.
> >
> > Regards, Adam
> >
> > On Apr 29, 2014, at 7:46 PM, Marty Hu <[email protected]> wrote:
> >
> > Thanks for the follow-up.
> >
> > I've attached nagios graphs (load, disk, and ping) of one such event,
> > which occurred at 2:24pm (after the drop in disk) according to my
> > nagios emails. I've also attached database logs (with some
> > client-specific queries removed). The error was fixed around 2:30pm.
> > Notably, the log files are in GMT.
> >
> > Unfortunately I don't have any graphs for the event other than what's
> > on nagios.
> >
> > Are the connection errors with CouchDB normal? We get them
> > continuously (around every minute) even during normal operation with
> > the DB not crashing.
> >
> > On Tue, Apr 29, 2014 at 2:34 AM, Alexander Shorin <[email protected]>
> > wrote:
> >
> > Hi Marty,
> >
> > thanks for following up! I see your problem, but here is what we
> > would need:
> >
> > 1. CouchDB stats graphs and your system disk, network and memory
> > ones. If you cannot share them in public, feel free to send them to
> > me privately. We need to see whether they are related. For instance,
> > high memory usage may be caused by uploading a large number of big
> > files: you'll easily notice that by comparing the CouchDB, network
> > and memory graphs for the spike period.
> >
> > 2. CouchDB log entries for the spike event. Graphs can only show that
> > something is going wrong, and we could only guess (we would probably
> > guess right, but without much precision) what exactly is going wrong.
> > Logs will help us find the actual requests that cause the memory
> > spike.
> >
> > After that we can start to think about the problem. For instance, if
> > the spikes happen due to large attachment uploads, there is not much
> > to do.
> > On the other hand, the query server may easily eat quite a big chunk
> > of memory. We'll easily notice that by monitoring the /_active_tasks
> > resource (if the problem is in views) or by looking through the logs
> > for the spike period. And this case can be fixed.
> >
> > Not sure which tools you're using for monitoring and drawing graphs,
> > but take a look at the following projects:
> > - https://github.com/gws/munin-plugin-couchdb - Munin plugin for
> > CouchDB monitoring. Sadly, it doesn't handle system metrics for the
> > CouchDB process - I'll only add this during this week, but make sure
> > you have a similar plugin for your monitoring system.
> > - https://github.com/etsy/skyline - anomaly detector. spikes are so
> > - https://github.com/etsy/oculus - metrics correlation tool. It would
> > be very easy to compare multiple graphs for the anomaly period with
> > it.
> >
> > --
> > ,,,^..^,,,
> >
> > On Tue, Apr 29, 2014 at 8:15 AM, Marty Hu <[email protected]> wrote:
> >
> > We've been running CouchDB v1.5.0 on AWS and it's been working fine.
> > Recently AWS came out with new prices for their new m3 instances so
> > we switched our CouchDB instance to use an m3.large. We have a
> > relatively small database with < 10GB of data in it.
> >
> > Our steady state metrics for it are system loads of 0.2 and memory
> > usage of 5% or so. However, we noticed that every few hours (3-4
> > times per day) we get a huge spike that pushes our load to 1.5 or so
> > and memory usage to close to 100%.
> >
> > We don't run any cronjobs that involve the database and our traffic
> > flow is about the same over the day. We do run a continuous
> > replication from one database on the west coast to another on the
> > east coast.
> >
> > This has been stumping me for a bit - any ideas?
> >
> > <spike.txt>
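
For the eaddrinuse errors discussed in this thread, one way to confirm that an old couchdb process never exited is to check whether something is still listening on the port before a restart. The following is only a minimal sketch, not part of anyone's fix in the thread: the helper name port_in_use is hypothetical, and it assumes CouchDB is listening on its default port 5984 on localhost, which the thread does not state.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if a TCP connection to host:port succeeds.

    A successful connect means some process is listening there, which is
    the situation that makes a second CouchDB fail with eaddrinuse.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 on success and an errno value otherwise.
        return s.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Assumption: 5984 is CouchDB's default port; adjust for your config.
    if port_in_use(5984):
        print("port 5984 is still held; the old process probably did not stop")
    else:
        print("port 5984 is free")
```

Running this right after "sudo service couchdb stop" would show whether the stop actually released the port. It only tells you that the port is held, not which process holds it.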
