That's very helpful. Thank you! On a quick read-through I have a question I didn't see answered:
Can I run the stats gatherer on a machine running one or more tahoe
nodes? Is there a way to reconfigure the port (:3456) to avoid a
collision with an existing node? (I know that I can reassign the other
node's local port, but I wanted to ask this question here.)

Thank you for putting this documentation up!

----
- Think carefully.
- Contra mundum - "Against the world" (St. Athanasius)
- Credo ut intelligam - "I believe that I may know" (St. Augustine of Hippo)

On Tue, Dec 22, 2009 at 10:27 PM, Brian Warner <[email protected]> wrote:

> Jody Harris wrote:
> > I have looked all over the site, I've searched, I cannot find the
> > documentation on setting up a stats gatherer.
>
> Sure. I just wrote up a lengthy note on Tahoe stats, the gatherer, and
> a few other topics, which (as of 30 seconds ago) now lives in the
> source tree in docs/stats.txt . I've attached a copy here. Please let
> me know if this answers your questions: if not, I'd like to update
> docs/stats.txt with additional information.
>
> cheers,
>  -Brian
>
> = Tahoe Statistics =
>
> Each Tahoe node collects and publishes statistics about its operations
> as it runs. These include counters of how many files have been
> uploaded and downloaded, CPU usage information, performance numbers
> like latency of storage server operations, and available disk space.
>
> The easiest way to see the stats for any given node is to use the web
> interface. From the main "Welcome Page", follow the "Operational
> Statistics" link inside the small "This Client" box. If the welcome
> page lives at http://localhost:3456/, then the statistics page will
> live at http://localhost:3456/statistics . This presents a summary of
> the stats block, along with a copy of the raw counters. To obtain just
> the raw counters (in JSON format), use /statistics?t=json instead.
>
> = Statistics Categories =
>
> The stats dictionary contains two keys: 'counters' and 'stats'.
> 'counters' are strictly counters: they are reset to zero when the node
> is started, and grow upwards. 'stats' are non-incrementing values,
> used to measure the current state of various systems. Some stats are
> actually booleans, expressed as '1' for true and '0' for false
> (internal restrictions require all stats values to be numbers).
>
> Under both the 'counters' and 'stats' dictionaries, each individual
> stat has a key with a dot-separated name, breaking them up into groups
> like 'cpu_monitor' and 'storage_server'.
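>
> For illustration, here is a minimal sketch of fetching and inspecting
> these values (the sketch uses Python 3's urllib and json modules; the
> URL assumes a node webapi on the default port):
>
>   import json
>   from urllib.request import urlopen
>
>   # fetch the raw stats dictionary from the node's webapi
>   data = json.load(urlopen("http://localhost:3456/statistics?t=json"))
>   counters, stats = data["counters"], data["stats"]
>
>   # each key is a dot-separated name, e.g. the 'storage_server' group
>   print(counters.get("storage_server.allocate", 0))
>   print(stats.get("node.uptime"))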
>
> The currently available stats (as of release 1.6.0 or so) are
> described here:
>
> counters.storage_server.*: this group counts inbound storage-server
>            operations. They are not provided by client-only nodes
>            which have been configured to not run a storage server
>            (with [storage]enabled=false in tahoe.cfg)
>  allocate, write, close, abort: these are for immutable file uploads.
>            'allocate' is incremented when a client asks if it can
>            upload a share to the server. 'write' is incremented for
>            each chunk of data written. 'close' is incremented when the
>            share is finished. 'abort' is incremented if the client
>            abandons the upload.
>  get, read: these are for immutable file downloads. 'get' is
>            incremented when a client asks if the server has a specific
>            share. 'read' is incremented for each chunk of data read.
>  readv, writev: these are for mutable file creation, publish, and
>            retrieve. 'readv' is incremented each time a client reads
>            part of a mutable share. 'writev' is incremented each time
>            a client sends a modification request.
>  add-lease, renew, cancel: these are for share lease modifications.
>            'add-lease' is incremented when an 'add-lease' operation is
>            performed (which either adds a new lease or renews an
>            existing lease). 'renew' is for the 'renew-lease' operation
>            (which can only be used to renew an existing one). 'cancel'
>            is used for the 'cancel-lease' operation.
>  bytes_freed: this counts how many bytes were freed when a
>            'cancel-lease' operation removed the last lease from a
>            share and the share was thus deleted.
>  bytes_added: this counts how many bytes were consumed by immutable
>            share uploads. It is incremented at the same time as the
>            'close' counter.
>
> stats.storage_server.*:
>  allocated: this counts how many bytes are currently 'allocated',
>            which tracks the space that will eventually be consumed by
>            immutable share upload operations. The stat is increased as
>            soon as the upload begins (at the same time the 'allocate'
>            counter is incremented), and goes back to zero when the
>            'close' or 'abort' message is received (at which point the
>            'disk_used' stat should be incremented by the same amount).
>  disk_total
>  disk_used
>  disk_free_for_root
>  disk_free_for_nonroot
>  disk_avail
>  reserved_space: these all reflect disk-space usage policies and
>            status. 'disk_total' is the total size of the disk where
>            the storage server's BASEDIR/storage/shares directory
>            lives, as reported by /bin/df or equivalent. 'disk_used',
>            'disk_free_for_root', and 'disk_free_for_nonroot' show
>            related information. 'reserved_space' reports the
>            reservation configured by the tahoe.cfg
>            [storage]reserved_space value. 'disk_avail' reports the
>            remaining disk space available for the Tahoe server after
>            subtracting reserved_space from disk_free_for_nonroot. All
>            values are in bytes (a short sketch of these relationships
>            follows this list).
>  accepting_immutable_shares: this is '1' if the storage server is
>            currently accepting uploads of immutable shares. It may be
>            '0' if a server is disabled by configuration, or if the
>            disk is full (i.e. disk_free_for_nonroot is less than
>            reserved_space).
>  total_bucket_count: this counts the number of 'buckets' (i.e. unique
>            storage-index values) currently managed by the storage
>            server. It indicates roughly how many files are managed by
>            the server.
>  latencies.*.*: these stats keep track of local disk latencies for
>            storage-server operations. A number of percentile values
>            are tracked for many operations. For example,
>            'storage_server.latencies.readv.50_0_percentile' records
>            the median response time for a 'readv' request. All values
>            are in seconds. These are recorded by the storage server,
>            starting from the time the request arrives
>            (post-deserialization) and ending when the response begins
>            serialization. As such, they are mostly useful for
>            measuring disk speeds. The operations tracked are the same
>            as the counters.storage_server.* counter values (allocate,
>            write, close, get, read, add-lease, renew, cancel, readv,
>            writev). The percentile values tracked are: mean,
>            01_0_percentile, 10_0_percentile, 50_0_percentile,
>            90_0_percentile, 95_0_percentile, 99_0_percentile,
>            99_9_percentile. (The last value, the 99.9 percentile,
>            means that 999 out of the last 1000 operations were faster
>            than the given number; it is the same threshold used by
>            Amazon's internal SLA, according to the Dynamo paper.)
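>
> As a rough illustration, continuing the sketch above, these
> storage-server values can be read from the same 'stats' dictionary
> (the key names are as listed; the disk_avail relationship is the one
> described under reserved_space):
>
>   median = stats["storage_server.latencies.readv.50_0_percentile"]
>   print("median readv latency: %.3f seconds" % median)
>
>   # disk_avail should be the free-for-nonroot space minus the
>   # configured reservation (floored at zero), all in bytes
>   expected = max(stats["storage_server.disk_free_for_nonroot"]
>                  - stats["storage_server.reserved_space"], 0)
>   print("disk_avail:", stats["storage_server.disk_avail"],
>         "expected:", expected)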
>
> counters.uploader.files_uploaded
> counters.uploader.bytes_uploaded
> counters.downloader.files_downloaded
> counters.downloader.bytes_downloaded
>
> These count client activity: a Tahoe client will increment these when
> it uploads or downloads an immutable file. 'files_uploaded' is
> incremented by one for each operation, while 'bytes_uploaded' is
> incremented by the size of the file.
>
> counters.mutable.files_published
> counters.mutable.bytes_published
> counters.mutable.files_retrieved
> counters.mutable.bytes_retrieved
>
> These count client activity for mutable files. 'published' is the act
> of changing an existing mutable file (or creating a brand-new mutable
> file). 'retrieved' is the act of reading its current contents.
>
> counters.chk_upload_helper.*
>
> These count activity of the "Helper", which receives ciphertext from
> clients and performs erasure-coding and share upload for files that
> are not already in the grid. The code which implements these counters
> is in src/allmydata/immutable/offloaded.py .
>
>  upload_requests: incremented each time a client asks to upload a file
>  upload_already_present: incremented when the file is already in the
>            grid
>  upload_need_upload: incremented when the file is not already in the
>            grid
>  resumes: incremented when the helper already has partial ciphertext
>            for the requested upload, indicating that the client is
>            resuming an earlier upload
>  fetched_bytes: this counts how many bytes of ciphertext have been
>            fetched from uploading clients
>  encoded_bytes: this counts how many bytes of ciphertext have been
>            encoded and turned into successfully-uploaded shares. If no
>            uploads have failed or been abandoned, encoded_bytes should
>            eventually equal fetched_bytes (see the sketch after this
>            list).
>
> stats.chk_upload_helper.*
>
> These also track Helper activity:
>
>  active_uploads: how many files are currently being uploaded. 0 when
>            idle.
>  incoming_count: how many cache files are present in the incoming/
>            directory, which holds ciphertext files that are still
>            being fetched from the client
>  incoming_size: total size of cache files in the incoming/ directory
>  incoming_size_old: total size of 'old' cache files (more than 48
>            hours)
>  encoding_count: how many cache files are present in the encoding/
>            directory, which holds ciphertext files that are being
>            encoded and uploaded
>  encoding_size: total size of cache files in the encoding/ directory
>  encoding_size_old: total size of 'old' cache files (more than 48
>            hours)
>
> stats.node.uptime: how many seconds since the node process was started
>
> stats.cpu_monitor.*:
>  .1min_avg, .5min_avg, .15min_avg: estimate of what percentage of
>            system CPU time was consumed by the node process, over the
>            given time interval. Expressed as a float, 0.0 for 0%, 1.0
>            for 100%
>  .total: estimate of the total number of CPU seconds consumed by the
>            node since the process was started. Ticket #472 indicates
>            that .total may sometimes be negative due to wraparound of
>            the kernel's counter.
>
> stats.load_monitor.*:
>  When enabled, the "load monitor" continually schedules a one-second
>  callback, and measures how late the response is. This estimates
>  system load (if the system is idle, the response should be on time).
>  This is only enabled if a stats-gatherer is configured.
>
>  .avg_load: average "load" value (seconds late) over the last minute
>  .max_load: maximum "load" value over the last minute
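>
> As another rough illustration, a node's health can be summarized from
> these values (same 'counters' and 'stats' dictionaries as in the
> sketch above; the encoded/fetched comparison is the relationship
> described under the helper counters):
>
>   uptime = stats["node.uptime"]
>   cpu = stats.get("cpu_monitor.15min_avg", 0.0)
>   print("up %.0f seconds, CPU %.1f%%" % (uptime, cpu * 100.0))
>
>   # for a Helper node, encoded_bytes should catch up to fetched_bytes
>   fetched = counters.get("chk_upload_helper.fetched_bytes", 0)
>   encoded = counters.get("chk_upload_helper.encoded_bytes", 0)
>   if fetched:
>       print("helper progress: %d%%" % (100 * encoded // fetched))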
>
> = Running a Tahoe Stats-Gatherer Service =
>
> The "stats-gatherer" is a simple daemon that periodically collects
> stats from several tahoe nodes. It could be useful, e.g., in a
> production environment, where you want to monitor dozens of storage
> servers from a central management host.
>
> The stats gatherer listens on a network port using the same Foolscap
> connection library that Tahoe clients use to connect to storage
> servers. Tahoe nodes can be configured to connect to the stats
> gatherer and publish their stats on a periodic basis. (In fact, what
> happens is that nodes connect to the gatherer and offer it a second
> FURL which points back to the node's "stats port", which the gatherer
> then uses to pull stats on a periodic basis. The initial connection is
> flipped to allow the nodes to live behind NAT boxes, as long as the
> stats-gatherer has a reachable IP address.)
>
> The stats-gatherer is created in the same fashion as regular tahoe
> client nodes and introducer nodes. Choose a base directory for the
> gatherer to live in (but do not create the directory). Then run:
>
>  tahoe create-stats-gatherer $BASEDIR
>
> and start it with "tahoe start $BASEDIR". Once running, the gatherer
> will write a FURL into $BASEDIR/stats_gatherer.furl .
>
> To configure a Tahoe client/server node to contact the stats gatherer,
> copy this FURL into the node's tahoe.cfg file, in a section named
> "[client]", under a key named "stats_gatherer.furl", like so:
>
>  [client]
>  stats_gatherer.furl = pb://[email protected]:49997/wxycb4kaexzskubjnauxeoptympyf45y
>
> or simply copy the stats_gatherer.furl file into the node's base
> directory (next to the tahoe.cfg file): it will be interpreted in the
> same way.
>
> Once running, the stats gatherer will create a standard python
> "pickle" file in $BASEDIR/stats.pickle . Once a minute, the gatherer
> will pull stats information from every connected node and write them
> into the pickle. The pickle will contain a dictionary, in which node
> identifiers (known as "tubid" strings) are the keys, and the values
> are a dict with 'timestamp', 'nickname', and 'stats' keys.
> d[tubid]['stats'] will contain the stats dictionary as made available
> at http://localhost:3456/statistics?t=json . The pickle file will only
> contain the most recent update from each node.
>
> Other tools can be built to examine these stats and render them into
> something useful. For example, a tool could sum the
> 'storage_server.disk_avail' values from all servers to compute a
> total-disk-available number for the entire grid (however, the "disk
> watcher" daemon, in misc/spacetime/, is better suited for this
> specific task).
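>
> As a rough sketch of such a tool (standard-library pickle, run from
> the gatherer's base directory; the file layout is as described above,
> with the node's full stats dictionary nested under each entry's
> 'stats' key):
>
>   import pickle
>
>   with open("stats.pickle", "rb") as f:
>       gathered = pickle.load(f)
>
>   total_avail = 0
>   for tubid, entry in gathered.items():
>       # each entry has 'timestamp', 'nickname', and 'stats' keys
>       node_stats = entry["stats"]["stats"]
>       total_avail += node_stats.get("storage_server.disk_avail", 0)
>   print("grid-wide disk_avail: %d bytes" % total_avail)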
>
> = Using Munin To Graph Stats Values =
>
> The misc/munin/ directory contains various plugins to graph stats for
> Tahoe nodes. They are intended for use with the Munin
> system-management tool, which typically polls target systems every 5
> minutes and produces a web page with graphs of various things over
> multiple time scales (last hour, last month, last year).
>
> Most of the plugins are designed to pull stats from a single Tahoe
> node, and are configured with the
> http://localhost:3456/statistics?t=json URL. The "tahoe_stats" plugin
> is designed to read from the pickle file created by the
> stats-gatherer. Some are to be used with the disk watcher, and a few
> (like tahoe_nodememory) are designed to watch the node processes
> directly (and must therefore run on the same host as the target node).
>
> Please see the docstrings at the beginning of each plugin for details,
> and the "tahoe-conf" file for notes about configuration and installing
> these plugins into a Munin environment.
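>
> A minimal plugin of the single-node style might look like the
> following sketch (not one of the shipped plugins; Munin invokes a
> plugin with "config" to get graph metadata, and with no argument to
> get values):
>
>   #!/usr/bin/env python
>   import json, sys
>   from urllib.request import urlopen
>
>   URL = "http://localhost:3456/statistics?t=json"
>
>   if len(sys.argv) > 1 and sys.argv[1] == "config":
>       print("graph_title Tahoe files uploaded")
>       print("files.label files_uploaded")
>   else:
>       data = json.load(urlopen(URL))
>       print("files.value %d"
>             % data["counters"].get("uploader.files_uploaded", 0))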
_______________________________________________
tahoe-dev mailing list
[email protected]
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
