Managed IT Support: What Happens at 2am

What Happens in the Hours Nobody Sees: Dissecting a Data Centre Failure and Identity Infrastructure Rebuild

Incident Response Engineering • Identity & Access Governance

Strategic Summary: The true value of an embedded Managed Service Provider (MSP) is realised during unexpected upstream infrastructure failures. When a power maintenance window at a third-party data centre went wrong and corrupted the primary authentication cluster of a major financial services organisation, the client’s internal IT team was locked out of their own environment. Nicholas Broderick details how the Si Futures Trusted Response Centre isolated the fault, rerouted emergency authentication pathways, and spent a weekend working alongside the client to rebuild domain controllers and troubleshoot legacy firewall integration issues.

Most of the specialised engineering work that maintains corporate operational continuity is invisible to end users. It is the automated threshold monitor that catches a route flap or a storage degradation at 2:00 AM before a single support ticket is logged. It is the network specialist who spots an anomaly in a traceroute and follows it across upstream layers where an in-house team would never have the visibility to look. It is an infrastructure team dedicating their entire weekend to resolving a systemic crash that the client did not trigger and may never see in full. This case study details one of those incidents.

Saturday 2:00 AM: The Anatomy of an Identity Blackout

The emergency alerts hit our operations centre in the early hours of Saturday morning: multiple critical production environments had gone dark. A major tier-standard data centre facility, which housed a significant portion of a large financial services client’s infrastructure, had begun a scheduled power maintenance window. During that power cycle, a critical hardware or switching failure occurred inside the facility.

Our engineering team’s immediate priority was telemetry validation: confirming exactly which parts of the network fabric were still functional. Our managed connectivity services feeding the client’s primary environment were active, edge routers were responding to health pings, and our colocation cabinets were fully powered. This telemetry provided immediate clarity: the failure was not in our network nodes or the client’s dedicated hardware. Something further upstream in the data centre infrastructure had failed.
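The triage described above can be sketched as an ordered walk through the layers between the network edge and the identity back end. This is a minimal illustration only: the `LayerCheck` type, the `first_failing_layer` function, and the layer names and results in `snapshot` are assumptions made for the example, not the monitoring tooling used during the incident.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LayerCheck:
    """One layer of the path, ordered from the network edge inwards."""
    name: str
    probe: Callable[[], bool]  # returns True when the layer is healthy


def first_failing_layer(checks: list[LayerCheck]) -> Optional[str]:
    """Walk the layers in order and return the first one whose probe fails.

    Returns None when every layer reports healthy.
    """
    for check in checks:
        if not check.probe():
            return check.name
    return None


# Hypothetical snapshot mirroring the incident: connectivity, edge routers
# and cabinet power were healthy, while the authentication servers were not.
snapshot = {
    "managed connectivity": True,
    "edge routers": True,
    "colocation cabinet power": True,
    "authentication servers": False,
}

# Bind each stored result into a probe; real probes would query live telemetry.
checks = [LayerCheck(name, lambda ok=ok: ok) for name, ok in snapshot.items()]
print(first_failing_layer(checks))
```

Ordering the checks from the edge inwards means that a healthy result at each step rules out everything before it, which is what allowed the investigation to move past the connectivity layer quickly.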

What remained deeply concerning was that users could no longer authenticate from any endpoint. Corporate VPN access had completely collapsed. Inbound authorisation requests were being rejected, not because the network links were broken, but because the back-end servers responsible for validating user identities were down. The client’s Active Directory domain controllers, which form the authentication backbone of the entire organisation, resided on a legacy storage cluster that had lost its logical volume identifiers when the power failed. Every domain controller sat on the same physical host, and every one of them was corrupted or offline.
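The underlying weakness, every domain controller depending on the same host and the same storage, is the kind of shared failure domain that a simple placement audit can surface. The sketch below is hypothetical: the function name, the server names, and the placement dimensions are invented for illustration, and a real audit would pull this data from an inventory or hypervisor API.

```python
def single_points_of_failure(placements: dict[str, dict[str, str]]) -> dict[str, str]:
    """Find dimensions (host, storage, ...) that every server shares.

    `placements` maps a server name to {dimension: value}. The result maps
    each dimension to the single value all servers depend on, meaning one
    failure there would take out every server at once.
    """
    if not placements:
        return {}
    # Only compare dimensions that are recorded for every server.
    common = set.intersection(*(set(p) for p in placements.values()))
    shared = {}
    for dimension in sorted(common):
        values = {p[dimension] for p in placements.values()}
        if len(values) == 1:
            shared[dimension] = values.pop()
    return shared


# Hypothetical placement resembling the incident: both controllers on one
# physical host, backed by the same legacy storage cluster.
incident_layout = {
    "dc-01": {"host": "host-a", "storage": "legacy-cluster"},
    "dc-02": {"host": "host-a", "storage": "legacy-cluster"},
}
print(single_points_of_failure(incident_layout))
```

An empty result means no single host or storage system is shared by every controller; any entry returned is a dependency worth separating.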

Because the authentication backbone was down, the client’s internal infrastructure engineers could not establish a secure VPN connection to audit their own systems. This triggered a slow, manual communication loop: their team requested remote-hands hardware checks, while our engineers provided telemetry showing the servers were completely dark. Ultimately, the client had to send engineers to the data centre in person.


Figure 1: Si Futures Trusted Response Centre engineers validating network telemetry and firewall states during a critical data centre power failure event.

Saturday Afternoon through Sunday: The Reconstruction Phase

By Saturday afternoon, it became clear that the corrupted storage cluster could not be recovered with standard software tools. Every attempt to mount the data stores or boot the domain controllers from the damaged SAN blocks failed. The client’s engineering group spent all of Saturday trying to recover the arrays. By Sunday morning, they had to make a tough tactical decision: abandon the corrupted volumes and rebuild the corporate Active Directory infrastructure entirely from scratch.

While the client’s team focused on rebuilding the directories, the Si Futures engineering team found a way to restore baseline corporate access. By reconfiguring the network edge and pointing the authentication requests to an alternative, isolated domain controller, we restored core VPN access for essential business functions. While this was not a full solution, it kept the enterprise environment from going completely dark during the system rebuild.
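The selection logic behind that kind of reroute can be sketched as an ordered preference list with a health probe. The source does not describe how the edge was actually reconfigured, so everything here is an assumption made for illustration: the `pick_auth_target` function, the server names, and the injected `is_healthy` probe. It shows only the decision, not any firewall or VPN configuration.

```python
from typing import Callable, Optional, Sequence


def pick_auth_target(
    candidates: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate, in preference order, that passes the probe.

    Returns None when no candidate is healthy, so the caller can fail
    closed rather than send requests to a dead server.
    """
    for server in candidates:
        if is_healthy(server):
            return server
    return None


# Hypothetical preference order: the primary controllers first, then an
# isolated controller kept outside the affected storage cluster.
preference = ["dc-primary-1", "dc-primary-2", "dc-isolated"]
reachable = {"dc-isolated"}  # stand-in for a real reachability probe

print(pick_auth_target(preference, lambda name: name in reachable))
```

Passing the probe in as a function keeps the decision separate from how health is measured, so the same logic works with a port check, a directory bind test, or a monitoring feed.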

Rebuilding corporate directory structures, restoring object permissions, and reconciling structural metadata is a slow process. The reconstruction effort ran through the whole of Sunday. At 10:00 PM on Sunday, the client confirmed the directory rebuild was complete and asked our engineers to integrate Fortinet Single Sign-On (FSSO) authentication so users could safely reach internal applications and the internet through the security perimeter.

Isolating the Legacy Permissions Defect

Our team worked through the night alongside the client’s engineers, until 2:30 AM on Monday, to systematically resolve a complex integration block:

The Symptoms: The newly deployed Active Directory agents repeatedly failed to report sync-state tracking data back to the core FortiGate firewalls.
The Root Cause: A permissions defect affected the Active Directory service account designated to run the authentication agent. In the old environment, this had been worked around years earlier with an undocumented fix, and that fix was missing from the new servers.
The Resolution: Our engineer, Jacques, identified the specific permission structure by working through the Fortinet knowledge base. He guided the client through applying the exact security permissions required, which cleared the block and confirmed user authentication across the environment.
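An undocumented workaround is only a problem once nobody remembers it exists. One way to guard against that is to record the grants a service account needs as a baseline and compare a rebuilt environment against it. The sketch below assumes such a baseline exists; the permission names are deliberate placeholders, because the source does not name the permissions involved, and the real set should come from the vendor's documentation.

```python
def missing_grants(required: set[str], granted: set[str]) -> list[str]:
    """Return the grants in the documented baseline that the account lacks."""
    return sorted(required - granted)


# Placeholder names only: the actual permissions are not stated in the source.
documented_baseline = {
    "placeholder-permission-a",
    "placeholder-permission-b",
    "placeholder-permission-c",
}
rebuilt_account = {
    "placeholder-permission-a",
    "placeholder-permission-c",
}

for grant in missing_grants(documented_baseline, rebuilt_account):
    print(f"missing on rebuilt service account: {grant}")
```

Run after a rebuild, a check like this turns a late-night investigation into a short list of grants to apply.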

What This Incident Demonstrates About Co-Managed IT

The true value of an MSP with an intimate, architectural understanding of your environment becomes apparent during unexpected upstream failures that you did not cause and cannot control. A standard, commoditised internet provider relationship is purely transactional: you log an outage, they run a diagnostic script on their own interfaces, and they notify you when their links are back within standard parameters. Their troubleshooting stops at their physical demarcation point, following rigid timelines and automated scripts. They have no understanding of your custom configurations, software dependencies, or authentication paths.

Our team moved through this data centre incident efficiently because we understood the client’s environment. We knew the authentication chain, the location of the domain controllers, and exactly how the FSSO architecture had been built because we engineered it. When the authentication failure occurred, our team tested against a clear mental model of the verified operational state, rather than wasting hours diagnosing the environment from scratch.

Proactive monitoring was critical here, not as a magic fix for an upstream data centre power failure, but as a triage tool to quickly isolate variables. Within the first hour, we verified that the connectivity links and perimeter firewalls were fully operational. This narrowed the focus of the investigation and kept engineers from spending time on the wrong infrastructure layers. This is the operational purpose of our Trusted Response Centre model: working as an integrated extension of internal IT teams, equipped with deep systems knowledge that simple service desks cannot replicate.

The Reality of Managed Service Value

The core work of managed IT operations is often completely invisible during normal business weeks. A well-engineered environment simply works. Network pathways remain stable, security updates deploy in the background, and monitoring agents scan for performance drops. Clients experience a consistent absence of problems, which can obscure the continuous engineering required to maintain that stability.

Critical incidents are never the preferred scenario to demonstrate technical capability. However, they are the exact moments where the difference between an integrated technology partner and a detached service desk becomes clear. While automated reporting and monthly performance dashboards help quantify regular operations, major upstream failures present a simple question: Does your technology provider know your environment well enough to guide you through a disaster?

For this financial services organisation, when it mattered most, the answer was yes.
