Re: [openstack-dev] [ceilometer] [swift] Improving ceilometer.objectstore.swift_middleware

2014-07-31 Thread Seger, Mark (Cloud Services)
As a curiosity, are there any ballpark numbers around the volume of
notifications ceilometer can handle? Thinking about large-scale swift
deployments, I expect to see thousands or even tens of thousands of events per
second, and that's with today's technology. Looking longer term I wouldn't be
surprised to see hundreds of thousands or even millions per second. And that's
just swift. I'd expect other services not yet invented to have their own
firehoses of data to contribute.

-mark

-Original Message-
From: Julien Danjou [mailto:jul...@danjou.info] 
Sent: Thursday, July 31, 2014 5:24 AM
To: Chris Dent
Cc: OpenStack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [ceilometer] [swift] Improving 
ceilometer.objectstore.swift_middleware

On Wed, Jul 30 2014, Chris Dent wrote:

 What are other options? Of those above, which are best or most realistic?

I'm just thinking out loud and haven't pushed the idea very far, but I wonder
whether we should try to use the oslo.messaging notifier middleware for that.
It would be more standard (it's the one usable on all HTTP pipelines), rely on
notifications, and generate events; after all, HTTP requests are events.
Then it'd be up to Ceilometer to handle those notifications like it does for
the rest of OpenStack.

--
Julien Danjou
/* Free Software hacker
   http://julien.danjou.info */


[openstack-dev] [swift] - question about statsd messages and 404 errors

2014-07-25 Thread Seger, Mark (Cloud Services)
I'm trying to track object server GET errors using statsd and I'm not seeing
them. The test I'm doing is simply a GET on a non-existent object. As expected,
a 404 is returned and the object server log records it. However, statsd implies
the request succeeded because no errors were reported. A read of the admin
guide does clearly say the GET timing includes failed GETs, but my question
then becomes: how is one to tell there was a failure? Should there be another
type of message that DOES report errors? Or how about including these in the
'object-server.GET.errors.timing' message?

Since the server I'm testing on is running all services, you get to see them
together, but if I were looking at a standalone object server I'd never know:

account-server.HEAD.timing:1.85513496399|ms
proxy-server.account.HEAD.204.timing:21.3139057159|ms
proxy-server.account.HEAD.204.xfer:0|c
proxy-server.container.HEAD.204.timing:6.98900222778|ms
proxy-server.container.HEAD.204.xfer:0|c
account-server.HEAD.timing:1.72400474548|ms
proxy-server.account.HEAD.204.timing:19.4480419159|ms
proxy-server.account.HEAD.204.xfer:0|c
object-server.GET.timing:0.359058380127|ms
object-server.GET.timing:0.255107879639|ms
proxy-server.object.GET.404.first-byte.timing:7.84802436829|ms
proxy-server.object.GET.404.timing:8.13698768616|ms
proxy-server.object.GET.404.xfer:70|c
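
For anyone wanting to pull failure counts out of the stream today, here is a
minimal sketch (not part of swift; the port and metric pattern are my own
assumptions) of a listener that tallies proxy-side GETs by status code, since
the object-server metrics alone give no hint that the request was a 404:

import re
import socket
from collections import Counter

# assumed statsd target; point log_statsd_host/log_statsd_port at this address
STATSD_ADDR = ('127.0.0.1', 8125)
SAMPLE_RE = re.compile(r'^proxy-server\.object\.GET\.(\d{3})\.timing:')

counts = Counter()
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(STATSD_ADDR)
while True:
    data, _ = sock.recvfrom(4096)
    for line in data.decode('utf-8', 'replace').splitlines():
        m = SAMPLE_RE.match(line)
        if m:
            counts[m.group(1)] += 1   # e.g. counts['404'] += 1
            print(dict(counts))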

-mark


[openstack-dev] [swift] Providing a potentially more open interface to statsd statistics

2014-06-24 Thread Seger, Mark (Cloud Services)
I've lamented for a while that while swift/statsd provide a wealth of
information, it's in a somewhat difficult-to-use format. Specifically, you have
to connect to a socket and listen for messages, and furthermore, if you're
listening, nobody else can. I do realize there is a mechanism to send the data
to graphite, but what if I'm not a graphite user OR want to look at the data at
a finer granularity than is being sent to graphite?

What I've put together, and would love to get some feedback on, is a tool I'm
calling 'statsdtee'. The name comes from the fact that you can configure
swift's statsd output to send to the port statsdtee listens on (configurable,
of course), and statsdtee will then process the data locally AND tee it out
another socket, making it possible to forward the data on to graphite while
still allowing local processing.
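
In rough form, the tee loop is nothing more than this (a sketch of the idea,
not the actual tool; the addresses and names here are made up):

import socket

LISTEN = ('127.0.0.1', 8125)    # point swift's log_statsd_port here
FORWARD = ('127.0.0.1', 8135)   # downstream consumer, e.g. a graphite relay

def handle_locally(packet):
    # local processing goes here (the rolling counters described below)
    pass

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(LISTEN)
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
while True:
    packet, _ = rx.recvfrom(4096)
    tx.sendto(packet, FORWARD)    # tee: forward the packet untouched
    handle_locally(packet)        # and process a local copy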

Local processing consists of calculating rolling counters and writing them to a 
file that looks much like most /proc entries, such as this:

$cat /tmp/statsdtee
V1.0 1403633349.159516
accaudt 0 0 0
accreap 0 0 0 0 0 0 0 0 0
accrepl 0 0 2100 0 0 0 1391 682 0 2100
accsrvr 1 0 0 0 0 2072 0
conaudt 0 0 0
conrepl 0 0 2892 0 0 0 1997 1107 0 2892
consrvr 2700 0 0 1 1 992 0
consync 541036 0 11 0 0
conupdt 0 17 17889
objaudt 0 0
objexpr 0 0
objrepl 0 0 0 0
objsrvr 117190 16325 0 43068 9 996 5 0 6904
objupdt 0 0 0 1704 0

In this format we're looking at data for the account, container and object
services. There is a similar one for the proxy. The reason for the names on
each line is that what to report is configurable in a conf file, down to the
granularity of a single line, thereby making it possible to report less
information, though I'm not sure anyone would really do that.

To make this mechanism really simple and avoid using internal timers, I simply
look at the time of each record and, every time the value of the second
changes, write out the current counters. I could change it to every 10th of a
second but I'm thinking that really isn't necessary. I could also drive it off
a timer interrupt, but again I'm not sure that would really buy you anything.
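
A sketch of that flush rule (illustrative only, not the actual statsdtee code;
the file path and layout follow the example above):

import time

last_second = None

def maybe_flush(counters, path='/tmp/statsdtee'):
    # counters maps a line name (e.g. 'objsrvr') to its list of values
    global last_second
    now = time.time()
    if last_second is not None and int(now) != last_second:
        with open(path, 'w') as f:
            f.write('V1.0 %.6f\n' % now)
            for name, values in sorted(counters.items()):
                f.write('%s %s\n' % (name, ' '.join(str(v) for v in values)))
    last_second = int(now)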

My peeve with /proc is that you never know what each field means, so there is a
second format in which headers are included; it looks like this:

$ cat /tmp/statsdtee
V1.0 140369.410722
#   errs pass fail
accaudt 0 0 0
#   errs cfail cdel cremain cposs_remain ofail odel oremain oposs_remain
accreap 0 0 0 0 0 0 0 0 0
#   diff diff_cap nochg hasmat rsync rem_merge attmpt fail remov succ
accrepl 0 0 2100 0 0 0 1391 682 0 2100
#   put get post del head repl errs
accsrvr 1 0 0 0 0 2069 0
#   errs pass fail
conaudt 0 0 0
#   diff diff_cap nochg hasmat rsync rem_merge attmpt fail remov succ
conrepl 0 0 2793 0 0 0 1934 1083 0 2793
#   put get post del head repl errs
consrvr 2700 0 0 1 1 976 0
#   skip fail sync del put
consync 536193 0 11 0 0
#   succ fail no_chg
conupdt 0 17 17889
#   quar errs
objaudt 0 0
#   obj errs
objexpr 0 0
#   part_del part_upd suff_hashes suff_sync
objrepl 0 0 0 0
#   put get post del head repl errs quar async_pend
objsrvr 117190 16325 0 43068 9 996 5 0 6904
#   errs quar succ fail unlk
objupdt 0 0 0 1704 0

The important thing to remember about rolling counters is that as many people
as wish can read them simultaneously and be assured nobody is stepping on
anyone else, since the counters never get zeroed! You simply read a sample,
wait a while and read another. The result is the change in the counters over
that interval, and anyone can use any interval they choose.
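
To make that concrete, a small illustrative reader (not part of statsdtee; the
path and field layout are taken from the examples above) that turns two samples
into per-interval deltas:

import time

def read_counters(path='/tmp/statsdtee'):
    counters = {}
    with open(path) as f:
        for line in f:
            if line.startswith(('V', '#')):
                continue              # skip the version stamp and header lines
            fields = line.split()
            counters[fields[0]] = [int(v) for v in fields[1:]]
    return counters

before = read_counters()
time.sleep(10)                        # any interval the reader likes
after = read_counters()
deltas = dict((name, [a - b for a, b in zip(after[name], before[name])])
              for name in after if name in before)
print(deltas)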

So, how useful do people think this is? Personally I think it's very useful...

The next step is how to calculate the numbers I'm reporting. While statsd
reports a lot of timing information, none of that really fits this model, as
all I want are counts. So when I see a GET timing record, I count it as 1 GET.
That seems to work so far. Is this a legitimate thing to be doing? It feels
right, and from the preliminary testing I've been doing it seems pretty
accurate.
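
As an illustration of that mapping (again just a sketch; the bucket names are
assumed to mirror the objsrvr/consrvr lines above):

OPS = ('PUT', 'GET', 'POST', 'DELETE', 'HEAD', 'REPLICATE')

def count_sample(metric_name, counters):
    # 'object-server.GET.timing' -> counters['object-server']['GET'] += 1
    parts = metric_name.split('.')
    if len(parts) >= 3 and parts[-1] == 'timing' and parts[1] in OPS:
        service, op = parts[0], parts[1]
        bucket = counters.setdefault(service, dict.fromkeys(OPS, 0))
        bucket[op] += 1               # one timing record == one operation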

One thing I've found missing is more detailed error information. For example, I
can tell how many errors there were but not how many of each type. Is this
something that could easily be added? In our environment, when there's an
increase in the number of errors on a particular server, knowing the type can
be quite useful.

While I'm not currently counting everything, such as device-specific data,
which would significantly increase the volume of output, I think I have covered
quite a lot in my model.

Comments?

-mark