Re: Which monitoring metrics to alert on?

Hubbert Smith Fri, 06 Apr 2018 08:26:29 -0700

I'd suggest storage-related metrics. Storage device failure is a big deal:
storage is where the valuable data sits, and a device failure impacts
everything.
I suggest your goal be to identify which SSDs, HDDs, and servers are reliable
and which are unreliable.

There are tools for this, such as S.M.A.R.T.: https://en.wikipedia.org/wiki/S.M.A.R.T.
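
A minimal sketch of turning a S.M.A.R.T. report into warnings. It assumes the
report has already been parsed from the JSON that recent smartmontools
versions print with `smartctl --json -a <device>`; the field names used here
(`smart_status`, `ata_smart_attributes`, `table`, `raw`) are my reading of
that layout and should be checked against real output. The watched attribute
IDs and the "raw value above zero" rule are a common heuristic, not something
this thread prescribes.

```python
from typing import Any, Dict, List

# Assumption: ATA attribute IDs that are commonly watched as early failure signs.
WATCHED_ATTRIBUTE_IDS = {
    5: "reallocated sectors",
    187: "reported uncorrectable errors",
    197: "pending sectors",
    198: "offline uncorrectable sectors",
}


def smart_warnings(report: Dict[str, Any]) -> List[str]:
    """Return human-readable warnings for one parsed S.M.A.R.T. report."""
    warnings: List[str] = []

    # The overall health verdict reported by the drive itself.
    if report.get("smart_status", {}).get("passed") is False:
        warnings.append("overall SMART health check failed")

    # Per-attribute check: any non-zero raw count on a watched attribute.
    table = report.get("ata_smart_attributes", {}).get("table", [])
    for attr in table:
        label = WATCHED_ATTRIBUTE_IDS.get(attr.get("id"))
        raw = attr.get("raw", {}).get("value", 0)
        if label is not None and raw > 0:
            warnings.append(f"{label}: raw value {raw}")
    return warnings
```

Storing these warnings with a timestamp next to each drive record is one way
to keep the errors that lead up to a failure.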

Start of day: discover each server, its server type, and its thermals, and
for each HDD/SSD discover its WWID.
MAC address - this is your unique server database attribute.
WWID - each storage device (HDD or SSD) has a World Wide ID (WWID), a
unique ID like a MAC address. This is your unique HDD/SSD database attribute.
Also capture the HDD/SSD manufacturer, drive type, total drive capacity, and
drive capacity consumed/free.
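
The inventory described above could be modeled roughly like this. It is only
a sketch: the class names, field names, and the `normalize_mac` helper are
hypothetical, and how you actually discover MAC addresses, WWIDs, and
thermals depends on your platform.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Drive:
    wwid: str            # unique key for the drive record
    kind: str            # "HDD" or "SSD"
    manufacturer: str
    model: str
    capacity_bytes: int  # total capacity
    used_bytes: int      # capacity consumed

    @property
    def free_bytes(self) -> int:
        return self.capacity_bytes - self.used_bytes


@dataclass
class Server:
    mac: str                              # unique key for the server record
    server_type: str
    inlet_temp_c: Optional[float] = None  # hypothetical thermal reading
    drives: Dict[str, Drive] = field(default_factory=dict)

    def add_drive(self, drive: Drive) -> None:
        # Key each drive by WWID so re-discovery updates the same record.
        self.drives[drive.wwid] = drive


def normalize_mac(mac: str) -> str:
    """Lower-case a MAC address and use ':' separators so keys compare equal."""
    return mac.strip().lower().replace("-", ":")
```

Keying servers by MAC address and drives by WWID means a drive that moves
between servers keeps its own history.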

Monitor: device uptime and device failures, and track device failure rates
(e.g. drive 322 was installed 2018-Jan-20 and failed 2018-Mar-20). Get the idea?
If possible, capture what caused a device failure (sometimes possible if you
capture the errors leading up to the failure).
(Not all HDDs are the same; HDD failure rates vary wildly. Not all SSDs are
the same either; SSD failure rates and failure causes vary wildly.)
(Hint: pay attention to SSD failures as compared to HDD failures, and pay
attention to how frequently or infrequently SSD failures are related to write
endurance.)
(Hint: not all server systems are the same. HDDs don't do well in high-heat
or high-vibration environments. Capture system thermals and system types, and
you will soon see a pattern showing well-designed and poorly designed
server systems.)
It's useful to know which HDDs/SSDs fail a lot or a little, and which
server systems contribute to HDD failures and which ones don't.
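
One way to compare drive models or server types is failures per drive-year
(often called an annualized failure rate). That metric is my choice for this
sketch, not something prescribed above, and the record fields and function
names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, Iterable, Optional


@dataclass
class DriveHistory:
    drive_id: str
    model: str
    server_type: str
    installed: date
    failed: Optional[date] = None  # None means the drive is still in service


def days_in_service(h: DriveHistory, today: date) -> int:
    """Days from install until failure, or until today if still running."""
    end = h.failed if h.failed is not None else today
    return (end - h.installed).days


def annualized_failure_rate(
    histories: Iterable[DriveHistory],
    key: Callable[[DriveHistory], str],
    today: date,
) -> Dict[str, float]:
    """Failures per drive-year, grouped by key(h) (e.g. model or server type)."""
    failures: Dict[str, int] = defaultdict(int)
    drive_days: Dict[str, int] = defaultdict(int)
    for h in histories:
        group = key(h)
        drive_days[group] += days_in_service(h, today)
        if h.failed is not None:
            failures[group] += 1
    # Skip groups with no accumulated service time to avoid dividing by zero.
    return {
        group: failures[group] / (days / 365.0)
        for group, days in drive_days.items()
        if days > 0
    }
```

Calling it with `key=lambda h: h.model` compares drive models, and
`key=lambda h: h.server_type` compares server systems. For the example drive
above, `days_in_service` gives 59 days between 2018-01-20 and 2018-03-20.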

just my two cents

On Fri, Apr 6, 2018 at 4:13 AM, Mark Bonetti <[email protected]>
wrote:

> Hi,
> I'm building a monitoring system for HBase and want to set up default
> alerts (threshold or anomaly) on 2-3 key metrics everyone who uses HBase
> typically wants to alert on, but I don't yet have production-grade
> experience with HBase.
>
> Importantly, alert rules have to be generally useful, so can't be on
> metrics whose values vary wildly based on the size of deployment.
>
> In other words, which metrics would be most significant indicators that
> something went wrong with your HBase?
>
> I thought the best place to find experienced HBase users, who would find
> answering this question trivial, would be here.
>
> Thanks very much,
> Mark
>

-- 
[email protected] | 385 321 0757  |   LinkedIN
<http://tinyurl.com/7v5eu2p>
Linkedin Learning: Storage Foundations Cert Prep: SNCP Foundations S10-110
<https://www.linkedin.com/learning/cert-prep-sncp-foundations-s10-110/storage-and-business-and-career-path>
