On Sun, Dec 13, 2009 at 3:53 PM, Peter Tribble <[email protected]> wrote:
> On Thu, Dec 10, 2009 at 5:29 AM, Mike Gerdts <[email protected]> wrote:
>>
>> Or rrdtool consumes the data instantly, but the raw data is kept
>> around for a bit.
>
> You're heading a little further than I was originally. I was originally only
> looking at the very bottom layer of the stack - just dumping enough raw
> data both regularly enough and sufficiently completely that a range of
> higher-level tools had something to chew on.
>
> My experience here is that munging data into rrd is relatively expensive,
> at least on the scale we're looking at here. I suspect that for rrd collection
> you would have to identify the subset of statistics of interest, and just keep
> those. Or are you suggesting we rrd everything? (That won't work for any
> meaningful definition of everything: just consider the I/O statistics for NFS
> mounts in an environment with an active automounter.) And if just a
> subset, can we identify that?
I have a centralized management box that gets performance data streamed
from hundreds of hosts. That data is dropped into rrd files using the
RRDs::update() method from an ancient version of rrdtool. Each update
has between 2 and 10 data values. That is, an rrdupdate of a vmstat rrd
file will have most of the columns from vmstat's output. One line of
vmstat corresponds to one RRDs::update(). The processes that dump this
data into the files are also doing a bit of other processing (mostly
sanity checking data coming from the network - regular expressions are
probably the most costly).

This host is currently processing about 20,000 RRD updates per minute.
This consumes about 85% of a single strand of a 1.2 GHz T2000, including
the time spent in sanity checking, processing consolidation functions,
etc. That is, 20,000 RRD updates per minute (about 100,000 data values
per minute) consumes 2.6% of a three-year-old machine.

How many data values do you think need to be stored per minute? What is
the acceptable performance hit of measuring performance?

My experience suggests parsing text is much more expensive than pulling
data out of rrd databases. Sufficiently partitioned relational databases
can be similarly efficient for updates and retrieval.

Just to be clear, I don't think that rrd is the answer for everything,
but it is certainly one of the most efficient data formats I have found.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
_______________________________________________
sysadmin-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/sysadmin-discuss
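
The overhead figures in the message above can be sanity-checked with a few lines of arithmetic. The sketch below is only illustrative: the strand count of 32 (8 cores x 4 threads) is an assumption about the T2000 configuration that the message does not state, and the function names are made up for this example. Under that assumption, 85% of one strand works out to a bit under 2.7% of the whole machine, which is in line with the roughly 2.6% quoted.

```python
# Back-of-the-envelope check of the overhead figures quoted in the message.
# Assumption (not stated in the message): the T2000 exposes 32 hardware
# strands (8 cores x 4 threads per core).
STRANDS = 32                  # assumed strand count
STRAND_UTILIZATION = 0.85     # "about 85% of a single strand"
UPDATES_PER_MINUTE = 20_000   # "about 20,000 RRD updates per minute"
VALUES_PER_MINUTE = 100_000   # "about 100,000 data values per minute"


def machine_fraction(strand_utilization: float, strands: int) -> float:
    """Fraction of the whole machine used by one partly busy strand."""
    return strand_utilization / strands


def cpu_ms_per_update(strand_utilization: float, updates_per_minute: int) -> float:
    """CPU time on one strand spent per update, in milliseconds."""
    busy_seconds_per_minute = strand_utilization * 60
    return busy_seconds_per_minute / updates_per_minute * 1000


if __name__ == "__main__":
    frac = machine_fraction(STRAND_UTILIZATION, STRANDS)
    per_update = cpu_ms_per_update(STRAND_UTILIZATION, UPDATES_PER_MINUTE)
    values_per_update = VALUES_PER_MINUTE / UPDATES_PER_MINUTE
    print(f"share of whole machine: {frac:.2%}")
    print(f"CPU per update: {per_update:.2f} ms")
    print(f"average values per update: {values_per_update:.0f}")
```

The per-update cost includes the sanity checking and consolidation work mentioned in the message, so it is an upper bound on what the rrd update itself costs.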
