https://bugzilla.wikimedia.org/show_bug.cgi?id=68444

            Bug ID: 68444
           Summary: WMFLabs: Diamond not running / won't start
           Product: Wikimedia Labs
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: critical
          Priority: Unprioritized
         Component: Infrastructure
          Assignee: [email protected]
          Reporter: [email protected]
                CC: [email protected], [email protected],
                    [email protected], [email protected]
       Web browser: ---
   Mobile Platform: ---

Looking at Graphite, the values for the cvn instances all appear constant (cpu,
memory, time since last puppet run, everything).

For example:
http://graphite.wmflabs.org/render/?width=578&height=289&from=00%3A00_20140723&until=23%3A45_20140723&hideLegend=false&target=cvn.*.cpu.total.user.value

Checking an instance locally (e.g. cvn-dev.eqiad.wmflabs), I see that the
diamond log directory has been idle for the past 16 days:

$ l /var/log/diamond/
total 77M
drwxr-xr-x  2 diamond root    4.0K Jul  7 14:04 ./
drwxr-xr-x 16 root    root    4.0K Jul 23 06:45 ../
-rw-r--r--  1 diamond nogroup 6.6M Jul  7 15:42 archive.log
-rw-r--r--  1 diamond nogroup  11M Jul  2 23:59 archive.log.2014-07-02
-rw-r--r--  1 diamond nogroup  11M Jul  3 23:59 archive.log.2014-07-03
-rw-r--r--  1 diamond nogroup  11M Jul  4 23:59 archive.log.2014-07-04
-rw-r--r--  1 diamond nogroup  10M Jul  5 23:59 archive.log.2014-07-05
-rw-r--r--  1 diamond nogroup  11M Jul  6 23:59 archive.log.2014-07-06
-rw-r--r--  1 diamond nogroup 1.4M Jul  8 17:03 diamond.log
-rw-r--r--  1 diamond nogroup  19M Jul  7 14:03 diamond.log.2014-07-06
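
The same staleness check could be scripted rather than read off the listing.
A minimal sketch, assuming GNU coreutils (stat -c, date +%s) as shipped on
Ubuntu; the helper name log_age_days is hypothetical and not part of diamond:

```shell
# Hypothetical helper: print how many whole days ago a file was last modified.
# Assumes GNU stat; runs in a subshell so its variables do not leak.
log_age_days() (
    file=$1
    now=$(date +%s)                       # current time, seconds since epoch
    mtime=$(stat -c %Y "$file") || exit 1 # last modification time of the file
    echo $(( (now - mtime) / 86400 ))
)
```

For example, `log_age_days /var/log/diamond/diamond.log` would print the age
of the current diamond log in days.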

And there is no diamond process running:

$ ps -u diamond f
(empty)

$ ps aux | grep diamond | grep -v grep
(empty)

$ service diamond status
diamond stop/waiting

$ service diamond start
start: Rejected send message, 1 matched rules; type="method_call",
sender=":1.49" (uid=2008 pid=6910 comm="start diamond ")
interface="com.ubuntu.Upstart0_6.Job" member="Start" error name="(unset)"
requested_reply="0" destination="com.ubuntu.Upstart" (uid=0 pid=1
comm="/sbin/init")

$ service diamond status
diamond stop/waiting



Graphite continues to register data from the instance (the last known value
repeated). That seems like a bug in the aggregator, because the instance hasn't
been producing any values for over 16 days.

And of course, aside from Graphite being lied to by the aggregator (which made
it hard to notice that diamond was down), the diamond process itself won't
start.
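
Until the aggregator is fixed, the "last known value repeated" symptom could be
caught from the Graphite side. A rough sketch, assuming the render output is
fetched with format=csv (lines of the form target,timestamp,value) and that
metric names contain no commas; the function name flat_series is hypothetical:

```shell
# Hypothetical helper: read Graphite CSV (target,timestamp,value) on stdin
# and print the name of every series whose non-null values are all identical.
flat_series() {
    awk -F, '
        $3 == "" { next }                        # skip null datapoints
        !($1 in first) { first[$1] = $3; next }  # remember first value per series
        $3 != first[$1] { varied[$1] = 1 }       # any change marks the series as live
        END {
            for (t in first)
                if (!(t in varied)) print t
        }
    '
}
```

The output order is unspecified, and a series with a single datapoint is
reported as flat, so this is only meaningful over a window of several samples.
A genuinely constant metric would also be listed, so a hit is a hint to check
the instance rather than proof that diamond is down.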

Puppet is running fine (no errors), and the drives are fine too:
$ df -h
Filesystem                                     Size  Used Avail Use% Mounted on
/dev/vda1                                      7.6G  1.3G  5.9G  19% /
udev                                           2.0G   12K  2.0G   1% /dev
tmpfs                                          396M  288K  396M   1% /run
none                                           5.0M     0  5.0M   0% /run/lock
none                                           2.0G     0  2.0G   0% /run/shm
/dev/vda2                                      1.9G  525M  1.3G  29% /var
labstore.svc.eqiad.wmnet:/dumps                9.1T  9.1T     0 100% /public/dumps
labstore.svc.eqiad.wmnet:/project/cvn/project   30T   17T   14T  57% /data/project
labstore.svc.eqiad.wmnet:/project/cvn/home      30T   17T   14T  57% /home
labstore.svc.eqiad.wmnet:/scratch              7.3T  2.6T  4.7T  36% /data/scratch
labstore.svc.eqiad.wmnet:/keys                 960M   39M  921M   5% /public/keys
labstore.svc.eqiad.wmnet:/backups               20T  3.0G   20T   1% /public/backups
/dev/mapper/vd-second--local--disk              29G  172M   27G   1% /srv
