There seem to be two main kinds of performance metrics: ones that measure trouble and ones that measure resources. I’ll call the first kind “alarm” metrics and the second kind “gauge” metrics. Alarm metrics are important, but I value gauge metrics more. Both are essential to an effective monitoring and alerting strategy.
Alarms are great for troubleshooting: they indicate that it’s time to react to something. Their names contain words like timeouts, alerts, and errors, but also words like waits or queue length. And they tend to be spiky. For example, consider a common alarm metric: SQL Server’s Blocked Process Report (BPR). The report provides actionable information, but only after a concurrency issue has been detected. Trouble can strike quickly, and SQL Server can go from generating zero BPR events per second to dozens or hundreds. Alarm metrics look like this:
Now contrast that with a gauge metric. Gauge metrics often change value gradually and allow earlier interventions because they provide a larger window of opportunity to make corrections.
If you pick a decent threshold value, then all gauges can generate alerts (just like alarms do!). As they approach trouble, gauges can look like this:
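To make the idea concrete, here is a minimal sketch (not from the original post) of turning a gauge into an alert by picking a threshold. The function name, the sample values, and the “three consecutive samples” rule are all my own illustrative assumptions:

```python
def gauge_alerts(samples, threshold, consecutive=3):
    """Yield the index at which a gauge has sat at or above the chosen
    threshold for `consecutive` samples in a row (a hypothetical rule
    to avoid alerting on a single momentary blip)."""
    run = 0
    for i, value in enumerate(samples):
        run = run + 1 if value >= threshold else 0
        if run == consecutive:
            yield i

# A slowly climbing gauge, e.g. % Processor Time sampled once a minute:
cpu = [40, 45, 52, 61, 78, 83, 88, 91, 90, 93]
print(list(gauge_alerts(cpu, threshold=80)))  # → [7]
```

Because the gauge climbs gradually, the alert fires while there is still headroom, which is exactly the larger window of opportunity described above.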
And the best kind of gauge metrics are the kind that have their own natural threshold. Think about measuring the amount of free disk space or available memory. Trouble occurs when those values hit zero, and those gauges look like this:
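As a sketch of a natural-threshold gauge, the standard library’s `shutil.disk_usage` reports free bytes, a value that trends toward zero. The helper names and the 10 GB cutoff below are my own illustrative assumptions:

```python
import shutil

def disk_gauge(path="/"):
    """Return the gauge value: bytes free on the volume holding `path`."""
    return shutil.disk_usage(path).free  # named tuple: (total, used, free)

def low_disk(path="/", min_free_bytes=10 * 2**30):
    """Alert well before the natural threshold (zero bytes free) is hit.
    The 10 GB default is an arbitrary example cutoff."""
    return disk_gauge(path) < min_free_bytes
```

The gauge needs no invented threshold to be meaningful: zero free bytes is trouble by definition, and any alert cutoff is just a safety margin above it.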
To further explain what I mean, here are some alarm metrics paired with related gauge metrics:
|Alarm metric|Gauge metric|
|---|---|
|Avg. Disk Read Queue Length|Disk Reads/sec|
|Processor Queue Length|% Processor Time|
|Buffer Cache Hit Ratio|Page lookups/sec|
|“You are running low on disk space”|“10.3 GB free of 119 GB”|
|Number of shoppers waiting at checkout|Number of shoppers arriving per hour|
|Number of cars travelling slower than the speed limit|Number of cars per hour|
|Number of rings of power tossed into Mount Doom|Ring distance to Mount Doom|
Hat tip to Daryl McMillan. Our conversations led directly to this post.