That's very helpful. Thank you! On a quick read-through I have a question I didn't see answered:
Can I run the stats gatherer on a machine running one or more tahoe
nodes? Is there a way to reconfigure the port (:3456) to avoid a
collision with an existing node? (I know that I can reassign the other
node's local port, but I wanted to ask this question here.)

Thank you for putting this documentation up!

----
- Think carefully.
- Contra mundum - "Against the world" (St. Athanasius)
- Credo ut intelligam - "I believe that I may know" (St. Augustine of Hippo)

On Tue, Dec 22, 2009 at 10:27 PM, Brian Warner <[email protected]> wrote:

> Jody Harris wrote:
> > I have looked all over the site, I've searched, I cannot find the
> > documentation on setting up a stats gatherer.
>
> Sure. I just wrote up a lengthy note on Tahoe stats, the gatherer, and
> a few other topics, which (as of 30 seconds ago) now lives in the
> source tree in docs/stats.txt . I've attached a copy here. Please let
> me know if this answers your questions: if not, I'd like to update
> docs/stats.txt with additional information.
>
> cheers,
>  -Brian
>
> = Tahoe Statistics =
>
> Each Tahoe node collects and publishes statistics about its operations
> as it runs. These include counters of how many files have been
> uploaded and downloaded, CPU usage information, performance numbers
> like latency of storage server operations, and available disk space.
>
> The easiest way to see the stats for any given node is to use the web
> interface. From the main "Welcome Page", follow the "Operational
> Statistics" link inside the small "This Client" box. If the welcome
> page lives at http://localhost:3456/, then the statistics page will
> live at http://localhost:3456/statistics . This presents a summary of
> the stats block, along with a copy of the raw counters. To obtain just
> the raw counters (in JSON format), use /statistics?t=json instead.
>
> = Statistics Categories =
>
> The stats dictionary contains two keys: 'counters' and 'stats'.
> 'counters' are strictly counters: they are reset to zero when the node
> is started, and grow upwards. 'stats' are non-incrementing values,
> used to measure the current state of various systems. Some stats are
> actually booleans, expressed as '1' for true and '0' for false
> (internal restrictions require all stats values to be numbers).
>
> Under both the 'counters' and 'stats' dictionaries, each individual
> stat has a key with a dot-separated name, breaking them up into groups
> like 'cpu_monitor' and 'storage_server'.
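>
> For illustration, here is a minimal sketch of fetching and inspecting
> these values (the sketch uses Python 3's urllib and json modules; the
> URL assumes a node webapi on the default port):
>
>   import json
>   from urllib.request import urlopen
>
>   # fetch the raw stats dictionary from the node's webapi
>   data = json.load(urlopen("http://localhost:3456/statistics?t=json"))
>   counters, stats = data["counters"], data["stats"]
>
>   # each key is a dot-separated name, e.g. the 'storage_server' group
>   print(counters.get("storage_server.allocate", 0))
>   print(stats.get("node.uptime"))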
>
> The currently available stats (as of release 1.6.0 or so) are
> described here:
>
> counters.storage_server.*: this group counts inbound storage-server
>            operations. They are not provided by client-only nodes
>            which have been configured to not run a storage server
>            (with [storage]enabled=false in tahoe.cfg)
>  allocate, write, close, abort: these are for immutable file uploads.
>            'allocate' is incremented when a client asks if it can
>            upload a share to the server. 'write' is incremented for
>            each chunk of data written. 'close' is incremented when the
>            share is finished. 'abort' is incremented if the client
>            abandons the upload.
>  get, read: these are for immutable file downloads. 'get' is
>            incremented when a client asks if the server has a specific
>            share. 'read' is incremented for each chunk of data read.
>  readv, writev: these are for mutable file creation, publish, and
>            retrieve. 'readv' is incremented each time a client reads
>            part of a mutable share. 'writev' is incremented each time
>            a client sends a modification request.
>  add-lease, renew, cancel: these are for share lease modifications.
>            'add-lease' is incremented when an 'add-lease' operation is
>            performed (which either adds a new lease or renews an
>            existing lease). 'renew' is for the 'renew-lease' operation
>            (which can only be used to renew an existing one). 'cancel'
>            is used for the 'cancel-lease' operation.
>  bytes_freed: this counts how many bytes were freed when a
>            'cancel-lease' operation removed the last lease from a
>            share and the share was thus deleted.
>  bytes_added: this counts how many bytes were consumed by immutable
>            share uploads. It is incremented at the same time as the
>            'close' counter.
>
> stats.storage_server.*:
>  allocated: this counts how many bytes are currently 'allocated',
>            which tracks the space that will eventually be consumed by
>            immutable share upload operations. The stat is increased as
>            soon as the upload begins (at the same time the 'allocate'
>            counter is incremented), and goes back to zero when the
>            'close' or 'abort' message is received (at which point the
>            'disk_used' stat should be incremented by the same amount).
>  disk_total
>  disk_used
>  disk_free_for_root
>  disk_free_for_nonroot
>  disk_avail
>  reserved_space: these all reflect disk-space usage policies and
>            status. 'disk_total' is the total size of the disk where
>            the storage server's BASEDIR/storage/shares directory
>            lives, as reported by /bin/df or equivalent. 'disk_used',
>            'disk_free_for_root', and 'disk_free_for_nonroot' show
>            related information. 'reserved_space' reports the
>            reservation configured by the tahoe.cfg
>            [storage]reserved_space value. 'disk_avail' reports the
>            remaining disk space available for the Tahoe server after
>            subtracting reserved_space from disk_free_for_nonroot. All
>            values are in bytes (a short sketch of these relationships
>            follows this list).
>  accepting_immutable_shares: this is '1' if the storage server is
>            currently accepting uploads of immutable shares. It may be
>            '0' if a server is disabled by configuration, or if the
>            disk is full (i.e. disk_free_for_nonroot is less than
>            reserved_space).
>  total_bucket_count: this counts the number of 'buckets' (i.e. unique
>            storage-index values) currently managed by the storage
>            server. It indicates roughly how many files are managed by
>            the server.
>  latencies.*.*: these stats keep track of local disk latencies for
>            storage-server operations. A number of percentile values
>            are tracked for many operations. For example,
>            'storage_server.latencies.readv.50_0_percentile' records
>            the median response time for a 'readv' request. All values
>            are in seconds. These are recorded by the storage server,
>            starting from the time the request arrives
>            (post-deserialization) and ending when the response begins
>            serialization. As such, they are mostly useful for
>            measuring disk speeds. The operations tracked are the same
>            as the counters.storage_server.* counter values (allocate,
>            write, close, get, read, add-lease, renew, cancel, readv,
>            writev). The percentile values tracked are: mean,
>            01_0_percentile, 10_0_percentile, 50_0_percentile,
>            90_0_percentile, 95_0_percentile, 99_0_percentile,
>            99_9_percentile. (The last value, the 99.9 percentile,
>            means that 999 out of the last 1000 operations were faster
>            than the given number; it is the same threshold used by
>            Amazon's internal SLA, according to the Dynamo paper.)
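>
> As a rough illustration, continuing the sketch above, these
> storage-server values can be read from the same 'stats' dictionary
> (the key names are as listed; the disk_avail relationship is the one
> described under reserved_space):
>
>   median = stats["storage_server.latencies.readv.50_0_percentile"]
>   print("median readv latency: %.3f seconds" % median)
>
>   # disk_avail should be the free-for-nonroot space minus the
>   # configured reservation (floored at zero), all in bytes
>   expected = max(stats["storage_server.disk_free_for_nonroot"]
>                  - stats["storage_server.reserved_space"], 0)
>   print("disk_avail:", stats["storage_server.disk_avail"],
>         "expected:", expected)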
>
> counters.uploader.files_uploaded
> counters.uploader.bytes_uploaded
> counters.downloader.files_downloaded
> counters.downloader.bytes_downloaded
>
> These count client activity: a Tahoe client will increment these when
> it uploads or downloads an immutable file. 'files_uploaded' is
> incremented by one for each operation, while 'bytes_uploaded' is
> incremented by the size of the file.
>
> counters.mutable.files_published
> counters.mutable.bytes_published
> counters.mutable.files_retrieved
> counters.mutable.bytes_retrieved
>
> These count client activity for mutable files. 'published' is the act
> of changing an existing mutable file (or creating a brand-new mutable
> file). 'retrieved' is the act of reading its current contents.
>
> counters.chk_upload_helper.*
>
> These count activity of the "Helper", which receives ciphertext from
> clients and performs erasure-coding and share upload for files that
> are not already in the grid. The code which implements these counters
> is in src/allmydata/immutable/offloaded.py .
>
>  upload_requests: incremented each time a client asks to upload a file
>  upload_already_present: incremented when the file is already in the
>            grid
>  upload_need_upload: incremented when the file is not already in the
>            grid
>  resumes: incremented when the helper already has partial ciphertext
>            for the requested upload, indicating that the client is
>            resuming an earlier upload
>  fetched_bytes: this counts how many bytes of ciphertext have been
>            fetched from uploading clients
>  encoded_bytes: this counts how many bytes of ciphertext have been
>            encoded and turned into successfully-uploaded shares. If no
>            uploads have failed or been abandoned, encoded_bytes should
>            eventually equal fetched_bytes (see the sketch after this
>            list).
>
> stats.chk_upload_helper.*
>
> These also track Helper activity:
>
>  active_uploads: how many files are currently being uploaded. 0 when
>            idle.
>  incoming_count: how many cache files are present in the incoming/
>            directory, which holds ciphertext files that are still
>            being fetched from the client
>  incoming_size: total size of cache files in the incoming/ directory
>  incoming_size_old: total size of 'old' cache files (more than 48
>            hours)
>  encoding_count: how many cache files are present in the encoding/
>            directory, which holds ciphertext files that are being
>            encoded and uploaded
>  encoding_size: total size of cache files in the encoding/ directory
>  encoding_size_old: total size of 'old' cache files (more than 48
>            hours)
>
> stats.node.uptime: how many seconds since the node process was started
>
> stats.cpu_monitor.*:
>  .1min_avg, .5min_avg, .15min_avg: estimate of what percentage of
>            system CPU time was consumed by the node process, over the
>            given time interval. Expressed as a float, 0.0 for 0%, 1.0
>            for 100%
>  .total: estimate of the total number of CPU seconds consumed by the
>            node since the process was started. Ticket #472 indicates
>            that .total may sometimes be negative due to wraparound of
>            the kernel's counter.
>
> stats.load_monitor.*:
>  When enabled, the "load monitor" continually schedules a one-second
>  callback, and measures how late the response is. This estimates
>  system load (if the system is idle, the response should be on time).
>  This is only enabled if a stats-gatherer is configured.
>
>  .avg_load: average "load" value (seconds late) over the last minute
>  .max_load: maximum "load" value over the last minute
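>
> As another rough illustration, a node's health can be summarized from
> these values (same 'counters' and 'stats' dictionaries as in the
> sketch above; the encoded/fetched comparison is the relationship
> described under the helper counters):
>
>   uptime = stats["node.uptime"]
>   cpu = stats.get("cpu_monitor.15min_avg", 0.0)
>   print("up %.0f seconds, CPU %.1f%%" % (uptime, cpu * 100.0))
>
>   # for a Helper node, encoded_bytes should catch up to fetched_bytes
>   fetched = counters.get("chk_upload_helper.fetched_bytes", 0)
>   encoded = counters.get("chk_upload_helper.encoded_bytes", 0)
>   if fetched:
>       print("helper progress: %d%%" % (100 * encoded // fetched))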
>
> = Running a Tahoe Stats-Gatherer Service =
>
> The "stats-gatherer" is a simple daemon that periodically collects
> stats from several tahoe nodes. It could be useful, e.g., in a
> production environment, where you want to monitor dozens of storage
> servers from a central management host.
>
> The stats gatherer listens on a network port using the same Foolscap
> connection library that Tahoe clients use to connect to storage
> servers. Tahoe nodes can be configured to connect to the stats
> gatherer and publish their stats on a periodic basis. (In fact, what
> happens is that nodes connect to the gatherer and offer it a second
> FURL which points back to the node's "stats port", which the gatherer
> then uses to pull stats on a periodic basis. The initial connection is
> flipped to allow the nodes to live behind NAT boxes, as long as the
> stats-gatherer has a reachable IP address.)
>
> The stats-gatherer is created in the same fashion as regular tahoe
> client nodes and introducer nodes. Choose a base directory for the
> gatherer to live in (but do not create the directory). Then run:
>
>  tahoe create-stats-gatherer $BASEDIR
>
> and start it with "tahoe start $BASEDIR". Once running, the gatherer
> will write a FURL into $BASEDIR/stats_gatherer.furl .
>
> To configure a Tahoe client/server node to contact the stats gatherer,
> copy this FURL into the node's tahoe.cfg file, in a section named
> "[client]", under a key named "stats_gatherer.furl", like so:
>
>  [client]
>  stats_gatherer.furl = pb://[email protected]:49997/wxycb4kaexzskubjnauxeoptympyf45y
>
> or simply copy the stats_gatherer.furl file into the node's base
> directory (next to the tahoe.cfg file): it will be interpreted in the
> same way.
>
> Once running, the stats gatherer will create a standard python
> "pickle" file in $BASEDIR/stats.pickle . Once a minute, the gatherer
> will pull stats information from every connected node and write them
> into the pickle. The pickle will contain a dictionary, in which node
> identifiers (known as "tubid" strings) are the keys, and the values
> are a dict with 'timestamp', 'nickname', and 'stats' keys.
> d[tubid]['stats'] will contain the stats dictionary as made available
> at http://localhost:3456/statistics?t=json . The pickle file will only
> contain the most recent update from each node.
>
> Other tools can be built to examine these stats and render them into
> something useful. For example, a tool could sum the
> 'storage_server.disk_avail' values from all servers to compute a
> total-disk-available number for the entire grid (however, the "disk
> watcher" daemon, in misc/spacetime/, is better suited for this
> specific task).
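>
> As a rough sketch of such a tool (standard-library pickle, run from
> the gatherer's base directory; the file layout is as described above,
> with the node's full stats dictionary nested under each entry's
> 'stats' key):
>
>   import pickle
>
>   with open("stats.pickle", "rb") as f:
>       gathered = pickle.load(f)
>
>   total_avail = 0
>   for tubid, entry in gathered.items():
>       # each entry has 'timestamp', 'nickname', and 'stats' keys
>       node_stats = entry["stats"]["stats"]
>       total_avail += node_stats.get("storage_server.disk_avail", 0)
>   print("grid-wide disk_avail: %d bytes" % total_avail)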
>
> = Using Munin To Graph Stats Values =
>
> The misc/munin/ directory contains various plugins to graph stats for
> Tahoe nodes. They are intended for use with the Munin
> system-management tool, which typically polls target systems every 5
> minutes and produces a web page with graphs of various things over
> multiple time scales (last hour, last month, last year).
>
> Most of the plugins are designed to pull stats from a single Tahoe
> node, and are configured with the
> http://localhost:3456/statistics?t=json URL. The "tahoe_stats" plugin
> is designed to read from the pickle file created by the
> stats-gatherer. Some are to be used with the disk watcher, and a few
> (like tahoe_nodememory) are designed to watch the node processes
> directly (and must therefore run on the same host as the target node).
>
> Please see the docstrings at the beginning of each plugin for details,
> and the "tahoe-conf" file for notes about configuration and installing
> these plugins into a Munin environment.
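>
> A minimal plugin of the single-node style might look like the
> following sketch (not one of the shipped plugins; Munin invokes a
> plugin with "config" to get graph metadata, and with no argument to
> get values):
>
>   #!/usr/bin/env python
>   import json, sys
>   from urllib.request import urlopen
>
>   URL = "http://localhost:3456/statistics?t=json"
>
>   if len(sys.argv) > 1 and sys.argv[1] == "config":
>       print("graph_title Tahoe files uploaded")
>       print("files.label files_uploaded")
>   else:
>       data = json.load(urlopen(URL))
>       print("files.value %d"
>             % data["counters"].get("uploader.files_uploaded", 0))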
_______________________________________________
tahoe-dev mailing list
[email protected]
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
