I'd suggest storage-related metrics. Storage device failure is a big deal: storage is where the valuable data sits, and a device failure impacts everything.

I'd suggest your goal be to identify which SSDs, HDDs and servers are reliable, and which are unreliable.

There are tools for this, for example S.M.A.R.T.: https://en.wikipedia.org/wiki/S.M.A.R.T.

At start of day, discover:
- the server, server type and server thermals. The server's MAC address is your unique server database attribute.
- each HDD/SSD and its WWID. Every storage device (HDD or SSD) has a world wide ID (WWID), a unique ID like a MAC address. This is your unique HDD/SSD database attribute.
- the HDD/SSD manufacturer, drive type, total drive capacity, and drive capacity consumed/free.

Then monitor:
- device uptime
- device failures and device failure rates (e.g. drive 322 was installed 2018-Jan-20 and failed 2018-Mar-20). Get the idea?
- if possible, what caused a device failure (sometimes possible if you capture the errors leading up to the failure)

Not all HDDs are the same; HDD failure rates vary wildly. Not all SSDs are the same either; SSD failure rates and failure causes vary wildly. (Hint: pay attention to SSD failures as compared to HDD failures, and to how frequently or infrequently SSD failures are related to write endurance.)

(Hint: not all server systems are the same. HDDs don't do well in high-heat or high-vibration environments. Capture system thermals and system types, and you will soon see a pattern showing which server systems are well designed and which are poorly designed.)

It's useful to know which HDDs/SSDs fail a lot or fail a little, and which server systems contribute to HDD failures and which ones don't.

Just my two cents.

On Fri, Apr 6, 2018 at 4:13 AM, Mark Bonetti <mark.bonetti.sc...@gmail.com> wrote:
> Hi,
> I'm building a monitoring system for HBase and want to set up default
> alerts (threshold or anomaly) on 2-3 key metrics everyone who uses HBase
> typically wants to alert on, but I don't yet have production-grade
> experience with HBase.
>
> Importantly, alert rules have to be generally useful, so can't be on
> metrics whose values vary wildly based on the size of deployment.
>
> In other words, which metrics would be most significant indicators that
> something went wrong with your HBase?
>
> I thought the best place to find experienced HBase users, who would find
> answering this question trivial, would be here.
>
> Thanks very much,
> Mark
>

--
hubb...@hubbertsmith.com | 385 321 0757 | LinkedIN <http://tinyurl.com/7v5eu2p>
Linkedin Learning: Storage Foundations
Cert Prep: SNCP Foundations S10-110 <https://www.linkedin.com/learning/cert-prep-sncp-foundations-s10-110/storage-and-business-and-career-path>
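
P.S. The device tracking suggested in the reply above could be sketched roughly like this. It is only an illustration: the record layout, the field names, the WWID and MAC values, and the "failures per drive-year" measure are all assumptions made for the example, not something taken from a particular tool. It keys each drive on its WWID, keys each server on its MAC address, and turns install and failure dates into a per-model failure rate.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, Optional


@dataclass
class Drive:
    # All field names are hypothetical; adapt them to your own inventory.
    wwid: str                     # unique key for the drive
    server_mac: str               # unique key for the server holding it
    model: str                    # manufacturer / model string
    kind: str                     # "HDD" or "SSD"
    installed: date
    failed: Optional[date] = None  # None while the drive is still healthy


def days_in_service(drive: Drive, today: date) -> int:
    """Days from install until failure, or until today if still running."""
    end = drive.failed if drive.failed is not None else today
    return (end - drive.installed).days


def failures_per_drive_year(drives: Iterable[Drive], today: date) -> Dict[str, float]:
    """Per model: number of failures divided by total drive-years observed."""
    days = defaultdict(int)
    failures = defaultdict(int)
    for d in drives:
        days[d.model] += days_in_service(d, today)
        if d.failed is not None:
            failures[d.model] += 1
    return {m: failures[m] / (days[m] / 365.0) for m in days if days[m] > 0}


# Made-up inventory; the first entry mirrors "drive 322" from the reply.
today = date(2018, 4, 6)
drives = [
    Drive("wwid-322", "mac-01", "model-A", "HDD", date(2018, 1, 20), date(2018, 3, 20)),
    Drive("wwid-323", "mac-01", "model-A", "HDD", date(2018, 1, 20)),
]
rates = failures_per_drive_year(drives, today)
```

Grouping on server_mac instead of model would give the same kind of rate per server system, which is where thermal or vibration problems would be expected to show up.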