IT Monitoring: Fix the Problem, Not the Alert

Proactive IT monitoring should tell you what is genuinely happening on your network, not simply generate noise that teams learn to ignore. When we migrated our monitoring estate from PRTG to Zabbix, we expected a straightforward transition. What we got was a mirror held up to every device we manage.

Zabbix’s standard templates were significantly more detailed than what we had been running. Items that PRTG had monitored through simplified probes were now being interrogated across every interface, every service, every status indicator. The result was immediate: between 260 and 320 alerts appeared across our client estate. For a moment, it looked like something had gone seriously wrong.

It hadn’t. What had happened was that we could suddenly see everything, and “everything” included a long list of issues that had been sitting quietly in the background, unresolved but unmonitored.

The temptation that every MSP understands

The easiest response to a noisy monitoring dashboard is suppression. Disable the alert. Put the device into maintenance mode. Delete the monitoring component. Move on to the next urgent thing, because there is always a next urgent thing.

It’s understandable. Interfaces left administratively enabled because nothing is plugged into them feel like low priority. A licence showing as expired on a service the client might not even use feels like clutter. Error rates on virtual interfaces feel like noise. So teams suppress and exclude, and over time the monitoring system tells you less and less about what is actually happening on your network.

The problem is that businesses change. Policies that made sense eighteen months ago get forgotten. Devices put into maintenance mode for a supplier investigation never come out because nobody remembers to check. And then a client phones asking why they weren’t told about something, and the honest answer is that the alert was turned off three years ago for a reason nobody can recall.
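
One way to stop a suppression from outliving its reason is to record each one with a reason and a review date, then flag anything overdue. The sketch below is a minimal illustration in Python, not a description of any particular tool: the `Suppression` record, its field names, the host names, and the 90-day default are all assumptions made for the example, and it deliberately does not call the Zabbix API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Suppression:
    """One disabled alert or paused host (hypothetical record shape)."""
    host: str
    reason: str                    # empty string if nobody wrote the reason down
    created: datetime
    review_by: Optional[datetime]  # None means no review date was ever set


def stale_suppressions(items: List[Suppression], now: datetime,
                       max_age: timedelta = timedelta(days=90)) -> List[Suppression]:
    """Return the suppressions that a person should look at again."""
    stale = []
    for s in items:
        overdue = s.review_by is not None and s.review_by < now
        too_old = s.review_by is None and (now - s.created) > max_age
        undocumented = not s.reason.strip()
        if overdue or too_old or undocumented:
            stale.append(s)
    return stale


if __name__ == "__main__":
    # Illustrative data only; these hosts and dates are made up.
    now = datetime(2025, 1, 1)
    items = [
        Suppression("fw-a", "supplier investigating", datetime(2021, 11, 1), None),
        Suppression("sw-b", "planned upgrade", datetime(2024, 12, 20), datetime(2025, 1, 15)),
    ]
    for s in stale_suppressions(items, now):
        print(f"{s.host}: suppressed since {s.created:%Y-%m-%d}, reason: {s.reason or 'none recorded'}")
```

The design choice is that a suppression with no review date is not treated as permanent: once it passes the age limit it comes back for review, which is exactly the case of the device that never left maintenance mode.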

Root cause resolution: what we did instead

We made the decision to resolve every single alert at the source rather than suppress it on the monitoring side. The team divided the work based on expertise: Robin handled the MikroTik router fleet, where around 90 devices needed Engine ID configuration. Jacques worked through the firewall alerts, investigating each one to determine whether the issue was genuine. Rudie covered servers and Zabbix process items. Then there was a final concentrated push where everyone worked together to clear the remaining alerts.

The specifics varied. Firewall interfaces that were active but unused got admin shut, which isn’t just tidier monitoring; it’s a genuine managed cyber security improvement. If an interface is administratively disabled, someone can’t simply walk into a building, plug a laptop into an open port, and start probing the network. For compliance frameworks that require unused ports to be disabled, this moved clients from theoretical policy to actual enforcement.
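
A check for "active but unused" interfaces can be sketched as a filter over SNMP interface status. The status codes below come from the standard IF-MIB (`ifAdminStatus` and `ifOperStatus`, where 1 means up and 2 means down). Everything else is an assumption for illustration: the `Interface` record, the `days_since_link` field, the 30-day threshold, and the device and port names. The output is a list for a person to investigate, not something to shut automatically.

```python
from dataclasses import dataclass
from typing import List

# IF-MIB status values (RFC 2863): 1 = up, 2 = down
ADMIN_UP = 1
OPER_UP = 1


@dataclass
class Interface:
    """One polled interface (hypothetical record shape)."""
    device: str
    name: str
    admin_status: int     # ifAdminStatus
    oper_status: int      # ifOperStatus
    days_since_link: int  # assumed field: days since the link was last seen up


def shutdown_candidates(interfaces: List[Interface],
                        min_idle_days: int = 30) -> List[Interface]:
    """Interfaces that are enabled, have no link, and have been idle for a while."""
    return [
        i for i in interfaces
        if i.admin_status == ADMIN_UP
        and i.oper_status != OPER_UP
        and i.days_since_link >= min_idle_days
    ]


if __name__ == "__main__":
    # Illustrative data only.
    ports = [
        Interface("fw-01", "port3", admin_status=1, oper_status=2, days_since_link=120),
        Interface("fw-01", "port4", admin_status=1, oper_status=1, days_since_link=0),
        Interface("fw-01", "port5", admin_status=2, oper_status=2, days_since_link=400),
    ]
    for p in shutdown_candidates(ports):
        print(f"{p.device} {p.name}: enabled but idle for {p.days_since_link} days")
```

The idle threshold is there so that a port which was unplugged yesterday is not treated the same as one that has been empty for months.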

Licences flagged as expired or unused prompted us to go back to clients and ask whether those services should actually be active. In several cases, they shouldn’t have been, meaning the client had a capability sitting dormant that they’d forgotten about. Monitoring that had been configured using ping got switched to SNMP where it made more sense, giving us richer data. Devices still running SNMP v2 got identified and queued for the v3 encryption upgrade that had been partially rolled out.

Not everything needed fixing in the traditional sense. TACACS tunnels that show as down on backup links, for example, are behaving exactly as designed: they only come up when the primary fails over. Error rates on virtual interfaces turned out to be normal behaviour when users connect and disconnect. We stopped monitoring error rates on virtual interfaces but kept that monitoring active on physical fibre connections, where error rates genuinely indicate a problem developing on a link.
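
The rule of monitoring error rates on physical links but not on virtual interfaces can be written as a small policy function. This is a sketch under stated assumptions rather than our actual Zabbix configuration. The interface type numbers are from the IANA ifType registry, but the choice of which types count as virtual, the 0.1% threshold, and the function names are all hypothetical. Note that Ethernet over fibre normally reports as `ethernetCsmacd` (6).

```python
# IANA ifType values treated as virtual in this sketch (an assumed selection)
VIRTUAL_IF_TYPES = {
    24,   # softwareLoopback
    53,   # propVirtual
    131,  # tunnel
    135,  # l2vlan
}


def error_rate(errors: int, packets: int) -> float:
    """Fraction of packets in error; zero when no traffic was seen."""
    return errors / packets if packets else 0.0


def should_alert(if_type: int, errors: int, packets: int,
                 threshold: float = 0.001) -> bool:
    """Alert on error rates only for interfaces that are not virtual."""
    if if_type in VIRTUAL_IF_TYPES:
        # Errors here tend to reflect users connecting and disconnecting.
        return False
    return error_rate(errors, packets) > threshold


if __name__ == "__main__":
    # Illustrative counters only.
    print(should_alert(if_type=6, errors=50, packets=10_000))    # physical link
    print(should_alert(if_type=131, errors=50, packets=10_000))  # tunnel
```

Keeping the exclusion as an explicit, commented rule means the reason is recorded alongside the decision, instead of existing only as an alert that someone once switched off.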

What the clean monitoring baseline actually looks like

After the concentrated effort (roughly one day of coordinated work across the team), the monitoring estate went from over 300 alerts to six. Those six are known issues with active investigations underway. Five hosts are currently paused, each because a supplier is actively working on something and we’re waiting for resolution.

[Figure: IT monitoring dashboard after root cause resolution, showing the reduction from over 300 alerts to a clean baseline of 6 genuine alerts]

Out of over 2,000 monitored items across the entire client estate, only a handful required intervention. That’s actually reassuring, because it validates that our deployment standards are working correctly. The cleanup didn’t reveal systemic problems with how we build environments; it revealed the accumulated small decisions that every busy team makes when something isn’t urgent enough to fix right now.

The difference is that now, when the dashboard shows an alert, it means something. The team’s daily experience has fundamentally changed from filtering signal out of noise to responding to genuine operational health information. And we’ve committed to running this exercise annually, making sure that the monitoring estate stays honest rather than slowly drifting back into comfortable suppression.

The proactive IT monitoring philosophy underneath

We’re not here to simply monitor networks and react when things go down. What we’re here to do is look at the overall health of the services we offer to our clients and make sure those services are genuinely providing value, not just generating dashboards that look busy but hide the things that actually matter.

When every alert on the dashboard represents a genuine issue rather than accumulated noise, monitoring becomes what it was always meant to be: an honest view of operational health that drives real decisions.

This approach to root cause resolution is central to how our threat detection and response services operate — fixing the underlying issue rather than masking the symptom. If your monitoring tools are generating more noise than insight, it might be time to speak with our team about what a clean baseline could look like for your organisation.
