https://bugzilla.wikimedia.org/show_bug.cgi?id=68444
Bug ID: 68444
Summary: WMFLabs: Diamond not running / won't start
Product: Wikimedia Labs
Version: unspecified
Hardware: All
OS: All
Status: NEW
Severity: critical
Priority: Unprioritized
Component: Infrastructure
Assignee: [email protected]
Reporter: [email protected]
CC: [email protected], [email protected],
[email protected], [email protected]
Web browser: ---
Mobile Platform: ---
In Graphite, every metric for the cvn instances appears constant (CPU,
memory, time since the last Puppet run, everything).
For example:
http://graphite.wmflabs.org/render/?width=578&height=289&from=00%3A00_20140723&until=23%3A45_20140723&hideLegend=false&target=cvn.*.cpu.total.user.value
On one of the instances (e.g. cvn-dev.eqiad.wmflabs), the diamond log
directory has been idle for the past 16 days:
$ l /var/log/diamond/
total 77M
drwxr-xr-x 2 diamond root 4.0K Jul 7 14:04 ./
drwxr-xr-x 16 root root 4.0K Jul 23 06:45 ../
-rw-r--r-- 1 diamond nogroup 6.6M Jul 7 15:42 archive.log
-rw-r--r-- 1 diamond nogroup 11M Jul 2 23:59 archive.log.2014-07-02
-rw-r--r-- 1 diamond nogroup 11M Jul 3 23:59 archive.log.2014-07-03
-rw-r--r-- 1 diamond nogroup 11M Jul 4 23:59 archive.log.2014-07-04
-rw-r--r-- 1 diamond nogroup 10M Jul 5 23:59 archive.log.2014-07-05
-rw-r--r-- 1 diamond nogroup 11M Jul 6 23:59 archive.log.2014-07-06
-rw-r--r-- 1 diamond nogroup 1.4M Jul 8 17:03 diamond.log
-rw-r--r-- 1 diamond nogroup 19M Jul 7 14:03 diamond.log.2014-07-06
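The idle time can also be measured rather than read off the listing. This is
only a rough sketch, assuming GNU find and GNU date (as shipped on Ubuntu);
the function name newest_age_days is made up for illustration:

```shell
# Print the age, in whole days, of the most recently modified regular file
# directly inside the given directory. Returns 1 if there are no files.
# Assumes GNU find (-printf '%T@' prints the mtime as epoch seconds).
newest_age_days() {
    dir=$1
    newest=$(find "$dir" -maxdepth 1 -type f -printf '%T@\n' | sort -n | tail -n 1)
    [ -n "$newest" ] || return 1
    now=$(date +%s)
    # Drop the fractional part of the timestamp before doing integer math.
    echo $(( (now - ${newest%.*}) / 86400 ))
}
```

Called as `newest_age_days /var/log/diamond`, it prints a single number of
days, which could be compared against a threshold in a cron check.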
And there is no diamond process running:
$ ps -u diamond f
(empty)
$ ps aux | grep diamond | grep -v grep
(empty)
$ service diamond status
diamond stop/waiting
$ service diamond start
start: Rejected send message, 1 matched rules; type="method_call",
sender=":1.49" (uid=2008 pid=6910 comm="start diamond ")
interface="com.ubuntu.Upstart0_6.Job" member="Start" error name="(unset)"
requested_reply="0" destination="com.ubuntu.Upstart" (uid=0 pid=1
comm="/sbin/init")
$ service diamond status
diamond stop/waiting
Graphite continues to register data from the instance (the last known value,
repeated). That looks like a bug in the aggregator, because the instance
hasn't been producing any values for over 16 days.
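A flat-lined metric like this could be caught automatically. The sketch below
assumes the render output is available as CSV lines of the form
series,timestamp,value (for instance via format=csv on the render URL, if this
Graphite version supports it) and that series names contain no commas; the
function name flat_series is hypothetical:

```shell
# Read CSV lines "series,timestamp,value" on stdin and print the name of
# every series whose non-empty values are all numerically identical.
# Note: a series with only one datapoint is also reported as flat.
flat_series() {
    awk -F, '
        $3 == "" { next }                              # skip missing datapoints
        !($1 in first) { first[$1] = $3; next }        # remember first value
        ($3 + 0) != (first[$1] + 0) { varied[$1] = 1 } # value changed
        END { for (s in first) if (!(s in varied)) print s }
    ' | sort
}
```

Any series name it prints has not changed over the requested window, which
for CPU or memory over a full day is a strong hint that the collector is down.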
Aside from the aggregator feeding Graphite stale values (which made it hard
to notice that the instance was down), the real problem is that the diamond
process won't start.
Puppet is running fine (no errors), and the drives are fine too:
$ df -h
Filesystem                                    Size Used Avail Use% Mounted on
/dev/vda1                                     7.6G 1.3G  5.9G  19% /
udev                                          2.0G  12K  2.0G   1% /dev
tmpfs                                         396M 288K  396M   1% /run
none                                          5.0M    0  5.0M   0% /run/lock
none                                          2.0G    0  2.0G   0% /run/shm
/dev/vda2                                     1.9G 525M  1.3G  29% /var
labstore.svc.eqiad.wmnet:/dumps               9.1T 9.1T     0 100% /public/dumps
labstore.svc.eqiad.wmnet:/project/cvn/project  30T  17T   14T  57% /data/project
labstore.svc.eqiad.wmnet:/project/cvn/home     30T  17T   14T  57% /home
labstore.svc.eqiad.wmnet:/scratch             7.3T 2.6T  4.7T  36% /data/scratch
labstore.svc.eqiad.wmnet:/keys                960M  39M  921M   5% /public/keys
labstore.svc.eqiad.wmnet:/backups              20T 3.0G   20T   1% /public/backups
/dev/mapper/vd-second--local--disk             29G 172M   27G   1% /srv