I've lamented for a while that although swift/statsd provide a wealth of 
information, it comes in a somewhat difficult-to-use format.  Specifically, you 
have to connect to a socket and listen for messages.  Furthermore, if you're 
listening, nobody else can.  I do realize there is a mechanism to send the data 
to graphite, but what if I'm not a graphite user OR want to look at the data at 
a finer granularity than is being sent to graphite?

What I've put together, and would love to get some feedback on, is a tool I'm 
calling 'statsdtee'.  You configure statsd to send to the port statsdtee 
listens on (configurable of course), and statsdtee then processes the data 
locally AND tees it out another socket, making it possible to forward the data 
on to graphite and still allow local processing.
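
To make the tee idea concrete, here is a minimal sketch in Python.  It is not 
the statsdtee code: it assumes the records arrive as UDP datagrams, and the 
names tee_once, run and process are made up for illustration.

```python
import socket


def tee_once(listen_sock, forward_sock, forward_addr, process):
    """Receive one datagram, process it locally, then forward it unchanged."""
    data, _sender = listen_sock.recvfrom(65535)
    process(data)                            # local processing (e.g. counting)
    forward_sock.sendto(data, forward_addr)  # the "tee": pass the bytes along
    return data


def run(listen_addr, forward_addr, process):
    """Loop forever, teeing every datagram from listen_addr to forward_addr."""
    listen_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    listen_sock.bind(listen_addr)
    forward_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        tee_once(listen_sock, forward_sock, forward_addr, process)
```

Something like run(("127.0.0.1", 8125), ("127.0.0.1", 8126), print) would 
listen on one port and forward to the other; both addresses are hypothetical 
and would come from the conf file.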

Local processing consists of calculating rolling counters and writing them to a 
file that looks much like most /proc entries, such as this:

$ cat /tmp/statsdtee
V1.0 1403633349.159516
accaudt 0 0 0
accreap 0 0 0 0 0 0 0 0 0
accrepl 0 0 2100 0 0 0 1391 682 0 2100
accsrvr 1 0 0 0 0 2072 0
conaudt 0 0 0
conrepl 0 0 2892 0 0 0 1997 1107 0 2892
consrvr 2700 0 0 1 1 992 0
consync 541036 0 11 0 0
conupdt 0 17 17889
objaudt 0 0
objexpr 0 0
objrepl 0 0 0 0
objsrvr 117190 16325 0 43068 9 996 5 0 6904
objupdt 0 0 0 1704 0

In this format we're looking at data for the account, container and object 
services.  There is a similar one for proxy.  Each line carries a name because 
what to report is configurable in a conf file down to the granularity of a 
single line, making it possible to report less information, though I'm not 
sure whether anyone would really do that.

To make this mechanism really simple and avoid using internal timers, I'm 
simply looking at the time of each record and, every time the value of the 
second changes, writing out the current counters.  I could change it to every 
10th of a second but am thinking that really isn't necessary.  I could also 
drive it off a timer interrupt, but again I'm not sure that would really buy 
you anything.
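
A rough sketch of the "write when the second changes" trigger follows.  The 
function names and the dict-of-lists layout for the counters are assumptions 
for illustration, not the actual implementation; the output follows the 
"V1.0 <timestamp>" layout shown above.

```python
def format_snapshot(timestamp, counters):
    """Render the counters in the /proc-like layout: a version line, then rows."""
    lines = ["V1.0 %.6f" % timestamp]
    for name in sorted(counters):
        values = " ".join(str(v) for v in counters[name])
        lines.append("%s %s" % (name, values))
    return "\n".join(lines) + "\n"


def make_second_change_trigger(on_flush):
    """Return a per-record callback that flushes when the whole second changes."""
    last_second = None

    def on_record(timestamp, counters):
        nonlocal last_second
        second = int(timestamp)
        # The first record only establishes the starting second.
        changed = last_second is not None and second != last_second
        if changed:
            on_flush(format_snapshot(timestamp, counters))
        last_second = second
        return changed

    return on_record
```

One consequence of driving the output from record times rather than a timer 
is that nothing gets written during a second in which no records arrive.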

My peeve with /proc is that you never know what each field means, so there is 
a second format in which headers are included.  It looks like this:

$ cat /tmp/statsdtee
V1.0 1403633339.410722
#       errs pass fail
accaudt 0 0 0
#       errs cfail cdel cremain cposs_remain ofail odel oremain oposs_remain
accreap 0 0 0 0 0 0 0 0 0
#       diff diff_cap nochg hasmat rsync rem_merge attmpt fail remov succ
accrepl 0 0 2100 0 0 0 1391 682 0 2100
#       put get post del head repl errs
accsrvr 1 0 0 0 0 2069 0
#       errs pass fail
conaudt 0 0 0
#       diff diff_cap nochg hasmat rsync rem_merge attmpt fail remov succ
conrepl 0 0 2793 0 0 0 1934 1083 0 2793
#       put get post del head repl errs
consrvr 2700 0 0 1 1 976 0
#       skip fail sync del put
consync 536193 0 11 0 0
#       succ fail no_chg
conupdt 0 17 17889
#       quar errs
objaudt 0 0
#       obj errs
objexpr 0 0
#       part_del part_upd suff_hashes suff_sync
objrepl 0 0 0 0
#       put get post del head repl errs quar async_pend
objsrvr 117190 16325 0 43068 9 996 5 0 6904
#       errs quar succ fail unlk
objupdt 0 0 0 1704 0

The important thing to remember about rolling counters is that as many people 
as wish can read them simultaneously and be assured nobody is stepping on 
anybody else, since the counters never get zeroed!  You simply read a sample, 
wait a while and read another.  The difference between the two is the change 
in the counters over that interval, and anyone can use any interval they choose.
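
As an illustration of how a reader might consume the file, here is a small 
Python sketch that parses two samples and subtracts them.  The helper names 
parse_sample and delta are made up; the two samples are trimmed copies of the 
listings earlier in this message.

```python
def parse_sample(text):
    """Parse one snapshot into (timestamp, {name: [counter values]})."""
    timestamp = None
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and the optional header lines
        if timestamp is None:
            timestamp = float(fields[1])  # first data line: "V1.0 <epoch secs>"
            continue
        counters[fields[0]] = [int(v) for v in fields[1:]]
    return timestamp, counters


def delta(old, new):
    """Return (elapsed seconds, per-counter change) between two samples."""
    t0, c0 = old
    t1, c1 = new
    changes = {
        name: [b - a for a, b in zip(c0[name], c1[name])]
        for name in c1 if name in c0
    }
    return t1 - t0, changes


first = parse_sample("""V1.0 1403633339.410722
#       put get post del head repl errs
consrvr 2700 0 0 1 1 976 0
#       skip fail sync del put
consync 536193 0 11 0 0
""")
second = parse_sample("""V1.0 1403633349.159516
consrvr 2700 0 0 1 1 992 0
consync 541036 0 11 0 0
""")
elapsed, changes = delta(first, second)
print("%.2f seconds" % elapsed, changes)
```

Over the roughly 9.75 seconds between those two samples, the consrvr repl 
counter grew by 16 and the consync skip counter by 4843.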

So how useful do people think this is?  Personally I think it's very useful...

The next step is how to calculate the numbers I'm reporting.  While statsd 
reports a lot of timing information, none of that really fits this model as all 
I want are counts.  So when I see a GET timing record, I count it as 1 GET.  
Seems to work so far.  Is this a legitimate thing to be doing?  It feels right, 
and from the preliminary testing I've been doing it seems pretty accurate.
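
Counting one event per timing record could be sketched like this in Python.  
It assumes the usual statsd line format "<name>:<value>|<type>" with "ms" 
marking a timer, ignores sample rates, and uses metric names that are only 
illustrative; count_timing is a made-up helper name.

```python
from collections import Counter


def count_timing(line, counts):
    """Add 1 to counts for each timer record; ignore every other record type."""
    name, _, rest = line.partition(":")
    parts = rest.split("|")
    if len(parts) < 2 or parts[1] != "ms":
        return  # not a timer record
    suffix = ".timing"
    if name.endswith(suffix):
        # The elapsed time is discarded; only the occurrence is counted.
        counts[name[:-len(suffix)]] += 1


counts = Counter()
for record in [
    "object-server.GET.timing:12.5|ms",   # illustrative metric names
    "object-server.GET.timing:3.1|ms",
    "object-server.PUT.timing:40.0|ms",
    "object-server.async_pendings:1|c",   # a counter, so it is skipped here
]:
    count_timing(record, counts)
print(dict(counts))
```

If the sender applies a sample rate, each record would stand for more than 
one event, so a plain +1 would undercount; this sketch does not handle that.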

One thing I've found missing is more detailed error information.  For example, 
I can tell how many errors there were but I can't tell how many of each type 
there were.  Is this something that can easily be added?  I've found in our 
environment that when there's an increase in the number of errors on a 
particular server, knowing the type can be quite useful.

While I'm not currently counting everything, such as device-specific data, 
which would significantly increase the volume of output, I think I have covered 
quite a lot in my model.

Comments?

-mark
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
