When the Data Centre Has a Bad Night: What Happens in the Hours Nobody Sees

May 22, 2026


Most of the work that keeps a business running is invisible. The monitoring that catches something at 2am before anyone has called. The engineer who picks up a fault in a trace route and follows it somewhere the client would never have looked. The team that spends a weekend on a problem the client did not cause and will barely know happened.

This is a story about one of those weekends.

Saturday morning

The call came in the early hours of Saturday. Multiple services offline. The data centre hosting a significant portion of a large financial services client’s infrastructure had scheduled power maintenance. Something in that maintenance had gone wrong.

The first step was confirming what we could and could not see. Our managed connectivity services into the client’s primary environment were up. The routers were responding. The cabinets were fine. That told us the problem was not our infrastructure and not the client’s. Something upstream of both had failed.

What we could not yet see was why users could not authenticate. VPN access was down. People trying to log in were being rejected, not because the network was broken, but because the servers that checked their credentials were offline. The domain controllers, the authentication backbone of the entire environment, sat on a storage cluster that had lost its identifiers when the power issue hit. Every one of them was corrupted or offline, and all of them were on the same host.
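The triage described so far amounts to walking the dependency chain from the bottom up and stopping at the first layer that fails. What follows is a minimal sketch of that idea in Python, not the tooling used on the night. The layer names and the hard-coded results are assumptions chosen to mirror the incident; a real check would probe live routers, VPN gateways and domain controllers.

```python
from typing import Callable, Optional

# A layer is a (name, check) pair. Layers are listed from the lowest
# dependency upwards, so the first failure is the one to investigate.
Layer = tuple[str, Callable[[], bool]]


def first_failing_layer(layers: list[Layer]) -> Optional[str]:
    """Return the name of the first layer whose check fails, or None if all pass."""
    for name, check in layers:
        if not check():
            return name
    return None


# Hypothetical stand-in results mirroring the incident. Each lambda
# replaces what would be a real health check against live equipment.
incident: list[Layer] = [
    ("managed connectivity", lambda: True),
    ("routers", lambda: True),
    ("cabinets", lambda: True),
    ("domain controllers", lambda: False),
    ("vpn authentication", lambda: False),
]

print(first_failing_layer(incident))
```

Ordering the checks by dependency is the point of the sketch: VPN authentication also fails here, but it is reported as a symptom of the layer beneath it rather than as the fault.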

The client’s own IT team could not confirm this because they could not get onto the VPN to investigate. A long back-and-forth followed: they asked us to check equipment, we replied that the equipment was completely offline, and each answer led to a request for further investigation. Eventually, people were sent physically to the data centre.


Saturday afternoon and Sunday

The storage cluster that had failed could not be restored through software. Every attempt to bring the domain controllers back from that storage failed. The client’s team spent all of Saturday trying. On Sunday, they decided to rebuild the domain controllers from scratch.

In the meantime, our team had found a way to restore basic access. By pointing authentication to a different domain controller, we were able to get VPN access working for users, even without the primary servers. It was not a full solution, but it meant the environment was not completely dark.
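The idea behind that workaround can be sketched in a few lines of Python: if the preferred domain controller cannot be reached, ask the next one. This is an illustration under stated assumptions, not a record of what was configured. On the night the change was a manual repointing, and the controller names, the `try_auth` callback and the `fake_auth` stub below are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def authenticate_with_fallback(
    user: str,
    controllers: Sequence[str],
    try_auth: Callable[[str, str], bool],
) -> Optional[tuple[str, bool]]:
    """Ask each controller in order.

    Returns (controller, accepted) from the first controller that answers,
    or None if none of them can be reached.
    """
    for dc in controllers:
        try:
            # A reachable controller's answer is final, whether it
            # accepts or rejects the user.
            return dc, try_auth(dc, user)
        except ConnectionError:
            continue  # unreachable: move on to the next controller
    return None


def fake_auth(dc: str, user: str) -> bool:
    """Hypothetical stub: the primary is offline, the secondary knows only alice."""
    if dc == "dc-primary":
        raise ConnectionError("dc-primary is offline")
    return user == "alice"


print(authenticate_with_fallback("alice", ["dc-primary", "dc-secondary"], fake_auth))
```

The sketch separates "could not reach the controller" from "the controller rejected the user". Only the first case moves on to another controller, which matches the situation in the incident: the credentials were fine, but the servers checking them were offline.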

Rebuilding domain controllers is not a quick process. It took the whole of Sunday.

At ten o’clock on Sunday night, the call came through that the rebuild was complete. Could we help get Fortinet Single Sign-On (FSSO) authentication working so that users could access internal resources and the internet through the firewall?

We spent until half past two in the morning working through it. There was a permissions problem with the service account being used to install the authentication agent, one that had existed in the old environment through a workaround nobody had documented. The new domain controllers did not have that workaround. Jacques identified the issue through a Fortinet knowledge base article, walked the client’s team through creating the correct permissions, and confirmed the fix was working.

What this kind of situation actually demonstrates

The value of having an MSP that genuinely knows your environment is most visible when something goes wrong that you did not cause and cannot control. A standard ISP relationship is transactional: you report a problem, they investigate their own infrastructure, they tell you when it is fixed. The troubleshooting happens on their side, through their processes, on their timeline. The knowledge of your specific configuration, your dependencies, your authentication architecture, is not something they carry.

We were able to move through this weekend’s events as quickly as we did because we understood the environment. We knew the authentication chain. We knew where the domain controllers sat. We knew what FSSO configuration should look like, because we had built it. When a fault appeared, we could test against a mental model of the correct state rather than starting from scratch.

Proactive monitoring mattered too, not as a magic solution to an infrastructure failure we did not cause, but as a tool for quickly ruling out the things we could rule out. Within the first hour, we knew that the connectivity and firewalls were not the issue. That narrowed the investigation considerably and prevented a great deal of wasted time pointing fingers at the wrong layer. This is what the Trusted Response Centre model is built for: not replacing client IT teams, but working alongside them with knowledge of the environment that a transactional support relationship cannot carry.

The wider point

Value in managed IT is often invisible in normal weeks. A well-managed environment simply works. Services are available. Security updates are applied. Monitoring runs in the background. Clients experience the absence of problems, which is not the same as recognising the work that produces that absence.

Incidents like this one are not where any MSP wants to demonstrate its worth. But they are where the difference between embedded expertise and a service desk is clearest. Reporting, dashboards, and monthly summaries help make normal weeks visible. But when a carrier has a bad night, or a data centre has a maintenance event with consequences nobody planned for, the question is simple: is there a team that knows your environment well enough to get you through it?

On this weekend, for this client, the answer was yes.

If you are a business that relies on a single provider to manage your connectivity, your security, and your access infrastructure, the question worth asking is whether that provider knows your environment the way this story describes. Talk to our team about what embedded managed IT support looks like in practice.
Nicholas Broderick
