Monitoring Alerts SOP: How to Avoid a Massive Sea of Alerts

MSPs are drowning in a massive sea of alerts, the volume of which can be absolutely deafening.  And it's getting worse.  As technology advances, we can see and control more endpoints than ever.  And, as the frequency of and damage from hacking increase, we continue to react with more alerts, not knowing where attackers will break through the defenses next.

It's not uncommon to see thousands of alert tickets created every day.  Unless you're properly staffed, most of these alerts are ignored and written off as noise.  Yeah, sure...one of them could be an embarrassment to the MSP (or even worse, a barrage of attacks), but we'll never know, as they're just lost in the noise.

When we talk about alert tickets, we're no longer limited to RMM-generated tickets in the default Monitoring Alert queue.  Many MSPs are creating different queues for different alerts, such as BCDR, endpoint security, and Auvik network monitoring.  My guess is the reason for splitting them out is that different alerts carry different levels of risk.  We're becoming numb to the RMM noise (having seen it for years), but the new alerts seem to be more important.  I suspect that in time these, too, will be treated with the same level of disregard.

So, what does it take to be properly staffed?  One FTE per 64 Managed Service customers.  This comes from spending 5-6 minutes per day per customer checking alerts, which adds up to roughly 30 minutes per week per customer, or an hour for every two customers.
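The arithmetic behind that ratio can be sketched in a few lines of Python.  The 30 minutes per customer per week and the 64-customer ratio come from the text above; the function names are hypothetical and used only for illustration.

```python
# Staffing arithmetic for alert checking (illustrative sketch).
MINUTES_PER_CUSTOMER_PER_WEEK = 30  # from the text: 5-6 minutes per day
CUSTOMERS_PER_FTE = 64              # from the text


def weekly_alert_hours(customers: int) -> float:
    """Hours per week spent checking alerts for this many customers."""
    return customers * MINUTES_PER_CUSTOMER_PER_WEEK / 60


def fte_share(customers: int) -> float:
    """Fraction of one alert-checking FTE these customers consume."""
    return customers / CUSTOMERS_PER_FTE


if __name__ == "__main__":
    print(weekly_alert_hours(64))  # 32.0
    print(fte_share(32))           # 0.5
```

At 64 customers, the checks alone come to 32 hours a week, which is why the ratio works out to roughly one dedicated tech.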

When we see customers with a high alert ticket volume (5,000-8,000 tickets per day), we strongly recommend hiring someone who knows the RMM and the other alert generators to script the monitoring software to drive down or self-heal as many tickets as possible.  It's worth the FTE's time to leverage Autotask Live Reports to look for recurring alert issues, where a Root Cause Analysis (RCA) can lead you to resolve the issue once and for all.
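Live Reports is the tool recommended above.  As a rough illustration of the same idea outside Autotask, the sketch below counts how often the same alert recurs on the same device in an exported list of alert tickets.  The field names (customer, device, alert), the threshold, and the sample rows are all hypothetical; this is not an Autotask export format.

```python
from collections import Counter


def recurring_alerts(tickets, min_count=3):
    """Return (key, count) pairs for alerts seen at least min_count times.

    Each ticket is a dict with hypothetical keys: customer, device, alert.
    Results are ordered from most to least frequent.
    """
    counts = Counter((t["customer"], t["device"], t["alert"]) for t in tickets)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Made-up sample rows for illustration only.
    sample = (
        [{"customer": "Acme", "device": "SRV01", "alert": "Disk space low"}] * 4
        + [{"customer": "Acme", "device": "PC07", "alert": "Service stopped"}]
    )
    for key, n in recurring_alerts(sample):
        print(n, key)
```

Anything that shows up repeatedly in a list like this is a candidate for an RCA or a self-healing script rather than another manual check.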

MSPs constantly ask us, “What’s the best way to track monitoring alert time?”  Here, we’re talking about the checking of the alerts only.  If a monitoring alert actually requires remediation, we recommend (and I think most MSPs already do this) moving the ticket to the Triage queue and letting the Service Coordinator fully triage the ticket and move it into the proper workflow and assignment.

Where to track the time spent checking monitoring alerts is not an easy question.  There are four options:

1)    A daily or weekly recurring ticket per customer

2)    Not tracking the time at all

3)    Tracking the time as regular time

4)    Tracking the time in a zero account (your own MSP) daily recurring ticket

A daily or weekly recurring ticket per customer

If you create a daily monitoring alert ticket for each customer, it will take longer to document, add the time entry, and close the ticket than it does to actually check the alerts. 

Not tracking the time at all

This is not a good idea.  Not only does it negatively impact the profitability report of the Managed Service offering, it also leaves the MSP open to liability if there is an issue and the MSP is accused of ignoring alerts.  Then again, if you’re not checking the monitoring alerts in the first place, this would be your only option.

Tracking the time as regular time

Tracking customer-facing work as regular time has two downsides:  One is the lack of documentation, and the other is losing track of the profitability cost in the sea of other non-billable time.

Tracking the time in a daily recurring ticket in the zero account

Advanced Global recommends a daily recurring ticket in the MSP’s own account (the zero account).  This way, at the end of the monitoring alert checks or at lunch, whichever comes first, a single time entry can document which customers’ alerts were checked.  Finding this time and reporting on it becomes fairly easy, whether for profitability, liability, or recurring incidents that need RCA.
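As an illustration of what that single time entry might capture, the sketch below builds a summary note from the morning's checks.  The structure (customer name, alerts reviewed, tickets moved to Triage) and the function name are assumptions made for the example; nothing above prescribes a note format.

```python
from datetime import date


def build_time_entry_note(check_date, checks):
    """Build one summary note for the daily zero-account ticket.

    checks: list of (customer, alerts_reviewed, moved_to_triage) tuples.
    This layout is hypothetical, chosen only to show the idea.
    """
    lines = [f"Monitoring alert checks for {check_date.isoformat()}"]
    for customer, reviewed, triaged in checks:
        lines.append(f"- {customer}: {reviewed} reviewed, {triaged} moved to Triage")
    lines.append(f"Customers checked: {len(checks)}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up customers and counts for illustration only.
    print(build_time_entry_note(date.today(), [("Acme", 12, 1), ("Beta", 3, 0)]))
```

One note per day, listing every customer checked, keeps the documentation in a single place that is easy to report on later.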

Now, right about the time you’re thinking, “Great, we'll just add this to the Service Coordinator’s plate,” you need to think again.  There are two issues with this idea:

1)    Alerts mostly come through overnight, and first thing in the morning is the busiest time (yes, we've checked the data to validate what your gut says) for both alert checking and Triage, so there’s not enough bandwidth for the Service Coordinator to do both.

2)    It takes a tech to check alerts, and most Service Coordinators are relational-type people, not technical ones.

Widget:

To make it easier to monitor and check Monitoring Alert tickets, you can just use a widget.  It’s as simple as copying and renaming the Ready to Engage widget, or any other Ticket List widget, and changing the Ticket Type filter from “not equal to” to “equal to” Monitoring Alert.  You’ll see how many active Monitoring Alerts are in the system, and you can sort them by first in.  Expand the widget to see ALL Monitoring Alert tickets and start working them chronologically.  Any that require full attention get moved to the Triage queue to swim with the rest of the issues/requests.
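The widget's logic amounts to a filter and a sort.  Here is a minimal sketch of that logic in Python, assuming tickets are plain dicts with hypothetical keys (type, status, created) and that a status of "Complete" marks a closed ticket.  It models what the widget shows and is not Autotask code.

```python
from datetime import datetime


def monitoring_alert_queue(tickets):
    """Active Monitoring Alert tickets, oldest first (first in, first worked).

    Keys and the "Complete" status value are assumptions for this sketch.
    """
    alerts = [
        t for t in tickets
        if t["type"] == "Monitoring Alert" and t["status"] != "Complete"
    ]
    return sorted(alerts, key=lambda t: t["created"])


if __name__ == "__main__":
    # Made-up tickets for illustration only.
    sample = [
        {"id": 1, "type": "Monitoring Alert", "status": "New",
         "created": datetime(2024, 1, 2, 6, 30)},
        {"id": 2, "type": "Service Request", "status": "New",
         "created": datetime(2024, 1, 2, 5, 0)},
        {"id": 3, "type": "Monitoring Alert", "status": "New",
         "created": datetime(2024, 1, 2, 2, 15)},
    ]
    for ticket in monitoring_alert_queue(sample):
        print(ticket["id"], ticket["created"])
```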

The Monitoring Alert widget can be added to any dashboard with fewer than 12 widgets.  (The Autotask limit is 12 widgets per dashboard.)  If the dashboard is full, look for the least valuable widget and either delete it, move it to a secondary dashboard, or look for a spot where it can be a sub-widget in another widget.

I wish I’d known about all this in college.  I know in the business classes we bought and sold a lot of widgets, but I had no idea what we were talking about.  I still don’t know how you buy or sell a widget, unless it’s by contacting us at AGMSPC and asking us to build you one, which in most cases is less than 30 minutes of work, and we do it in a free 30-minute coaching call.  Feel free to schedule one here.

In summary, AGMSPC recommends:

1)    Create/copy a widget

2)    Check all monitoring alerts

3)    Reduce the monitoring alert noise with scripting and RCA

4)    Staff properly (one FTE tech for every 64 Managed Service customers)

5)    Move issues needing remediation to Triage

6)    Fight the good fight and bring order to chaos

-Steve and Co