Humans, Engineering Shifts, Required Investment, and Commitment for Operational Security

By Ron Brash

New secure connectivity guidance describes a greenfield target architecture, but most OT environments are brownfield reality. True resilience isn't achieved through technology alone. Human expertise, manual operating capability, physical engineering controls, and sustained investment are equally critical. Without these foundations, digital security layers risk becoming expensive new failure modes.

Overview

I wanted to offer some field observations, review the NCSC and Joint Agency guidance, and compare it with Patrick Miller’s analysis. To everyone’s credit, this is great work and a conversation stronger than much of what has been published, and it appropriately acknowledges the contributors.

The gritty reality of implementation is different, and technology is not a silver bullet. We need to recognize that human and business factors are essential to resilience, but they can also inhibit it (e.g., when we don’t have the resources, or management does not hear the calls to invest). Similarly, even a new facility is often several eras of technology kludged together, and existing facilities, even one year old, are already beginning their “brownfield” journey (cue Radagast?).

So onwards and upwards.

Quick Summary: "End-State Utopia" vs. "Brownfield Reality" 

  • The NCSC/Joint Agency guidance is essentially a "Target Architecture" document. It describes how a nuclear plant or a critical pipeline should be built if we were starting from scratch today, with unlimited budget and a staff of PhDs. It prescribes a world of outbound-only connections, protocol validation, and hardened intermediaries (a minimal sketch of that pattern follows this summary). It treats the OT environment as a fortress that must be architecturally distinct from the IT chaos.

  • The Ampyx Cyber analysis is a “translation” for the operator. It correctly identifies that this guidance is not a checklist but rather an indictment of current practices, and it calls out a not-so-shocking admission: "most brownfield OT networks violate nearly every assumption in this guidance." It acknowledges the terrifying gap between what is asked (cryptographic protocol validation) and what exists (a PLC from 1998 that crashes if you scan it too hard).
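
To make the target architecture a little more concrete, below is a minimal sketch of the outbound-only, protocol-validating intermediary pattern the guidance describes. It is an illustration only, written in Python under assumptions I am inventing here (the tag names, engineering limits, and the stand-in "push" function are hypothetical), not a reference implementation of the guidance:

```python
# Minimal sketch of an "outbound-only, protocol-validating" intermediary.
# Tag names, engineering limits, and the upstream push are hypothetical
# placeholders, not anything prescribed by the guidance itself.

from dataclasses import dataclass
import time

# Engineering limits per tag -- the allowlist IS the security control here.
TAG_LIMITS = {
    "boiler_1_pressure_kpa": (0.0, 2500.0),
    "boiler_1_temp_c":       (-10.0, 450.0),
}

@dataclass
class Reading:
    tag: str
    value: float
    ts: float

def validate(reading: Reading) -> bool:
    """Accept only known tags, in-range values, and plausible timestamps."""
    limits = TAG_LIMITS.get(reading.tag)
    if limits is None:
        return False                      # unknown tag: drop it
    lo, hi = limits
    if not (lo <= reading.value <= hi):
        return False                      # outside engineering range: drop it
    if abs(time.time() - reading.ts) > 300:
        return False                      # stale or future-dated: drop it
    return True

def push_outbound(reading: Reading) -> None:
    """Stand-in for an outbound-initiated push to a DMZ historian/broker."""
    print(f"FORWARD {reading.tag}={reading.value}")

if __name__ == "__main__":
    samples = [
        Reading("boiler_1_pressure_kpa", 1800.0, time.time()),   # ok
        Reading("boiler_1_pressure_kpa", 99999.0, time.time()),  # out of range
        Reading("unknown_tag", 1.0, time.time()),                # not allowlisted
    ]
    for r in samples:
        if validate(r):
            push_outbound(r)
        else:
            print(f"DROP    {r.tag}={r.value}")
```

The shape is the point: data leaves the OT zone only via outbound-initiated, validated pushes. None of this helps with the brownfield device that cannot talk to such an intermediary in the first place, which is exactly the gap the rest of this piece is about.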

Effectively, the NCSC guidance focuses entirely on Technological Resilience (how to keep the packets flowing securely) but largely ignores Operational Resilience in the absence of technology. IMHO, Operational Resilience is a combination of Humans, Process, and Technology (and a fourth: organizational and governmental commitment to funding retrofits, technology evolution, and economic sovereignty). This piece focuses in particular on Principles 1 (briefly), 2, 6, and 8.

P1. Balance the Risks and Opportunities - Often Requires Spending Money to Make Money

Principle 1 asserts that connectivity and security should be driven by a "documented business case" where risks are weighed against opportunities. It assumes a rational equilibrium where safety and efficiency are equal stakeholders.  But is that the reality? 

The Reality

"Balance" is a Euphemism for "Profit" In a shareholder-driven environment, "balancing risk and opportunity" is rarely an engineering calculation; it is a financial one. 

  • The Efficiency Bias - The "opportunity" is almost always immediate OpEx reduction (e.g., remote access to eliminate a $500 truck roll). The "risk" is a theoretical, low-probability catastrophic event. Human nature and corporate incentives prioritize the immediate tangible gain over the hypothetical disaster. We are not "balancing" risks; we are discounting them to juice quarterly efficiency metrics. Or we are cutting spending and letting gear drift further and further past end-of-life (EoL) while waiting for retirement payouts or chasing silly KPIs (the "someone else's problem" (SEP) syndrome).

  • The Skills Deficit - The principle assumes the "Senior Risk Owner" understands the process well enough to make this judgment. These decisions are often made by leaders several layers removed, who lack the engineering knowledge, cyber knowledge, and context to truly enable and secure operations. Not to mention corporate resource trimming and a lack of available hands for emergencies, maintenance, and constant migration/evolution tasks. Leadership often does things contrary to what reasonable, business-aware experts would want, and those with the best skill sets are often the first to leave... and after round upon round of unpopular, profit-focused decisions that are unfriendly to people and to the local economy... who wants to work there, or secure it? (Or who wouldn't be tempted to let ransomware in for a price?)

Engineering Focus vs. Cutting Corners (or “let’s come back to it later”) 

  • Erosion of Resilience - Proper engineering resilience (IMHO, craftspersonship) requires "slack": redundancy, manual overrides, and physical buffers. These are expensive and inefficient. When we "balance" based on business efficiency, we classify these safety margins as "waste" and strip them out. Prioritizing today’s utmost efficiency can be “penny wise, pound foolish” - imagine bolting on a pile of AI and IoT to make a buck when the basics are not in place, without ever considering the cyber total cost of ownership (cyber-TCO). This is straight out of the Consequence-Informed Engineering and ISA-84 school of thought – we need more of it!

  • The Training Gap - We train operators for "Optimized Production" (using the connectivity), not for "Degraded Survival" (when the connectivity kills the process). By justifying connectivity through "business value," we implicitly de-value the manual, human competency required to save the plant when the digital layer betrays us. 

Hot take

Principle 1 fails because it treats Safety as a variable that can be traded for Efficiency. From a sovereignty and resilience perspective, certain engineering controls must be non-negotiable constraints, immune to the fluctuations of the "business case." 

P2. Limit the Exposure of Your Connectivity - The "Fortress" Illusion 

Principle 2 advises operators to "Limit the exposure of your connectivity," advocating for removing direct internet access, utilizing DMZs, and reducing the "discoverability" of OT assets. It assumes that if the adversary cannot scan the asset from the public web, the asset is protected. 

The Reality

Exposure is Contractual, Not Just Architectural

In the modern industrial ecosystem, "exposure" is often mandated by the supply chain, rendering architectural controls moot.

  • SLA Extortion: We are seeing a shift where OEMs (Original Equipment Manufacturers) and SIs (Systems Integrators) require persistent, high-bandwidth "call-home" connectivity for predictive maintenance and licensing. If you "limit exposure" by cutting these links, you void the warranty or degrade the asset's performance. The "Business Reality" is that we are trading sovereignty for support contracts. 

  • The "Shadow Connectivity" Economy: When we rigidly limit legitimate connectivity without understanding the operator's workflow, human nature takes over. A frustrated maintenance engineer will install an undocumented 4G modem or tether a cell phone to the HMI to download a manual or patch a system. By making connectivity "hard," we drive it underground, creating a hidden exposure that is invisible to the SOC but wide open to the adversary.  (We recently did a pre-selection pentest for an O&G provider and we identified highly concerning third-party risk in its wireless communication chain: RF device-> gateway -> server -> historian). 

Engineering Focus vs. The "Hidden" Perimeter 

  • Transitive Exposure: The guidance treats the "OT Boundary" as the edge of the organization. The actual edge is the vendor's network. If your trusted integrator is compromised (a la SolarWinds or Kaseya), your "limited exposure" connection becomes a direct pipeline for malware. We are trusting the hygiene of a third party's network, which is often less secure than our own. 

  • Intrinsic Beaconing: We focus on blocking inbound traffic but fail to address the "decades of software dependency" inside the network that is hardcoded to beacon out. You cannot "limit exposure" when the software stack itself - often unpatchable firmware - is designed to aggressively seek the internet for NTP, DNS, or licensing checks (a simple egress-audit sketch follows this list). Then get ready for device-to-cloud plant-floor devices… good luck.

  • Cloud (and Someone Else’s): The guidance treats the OT boundary as something in your control, but what about devices talking to a SCADA system in the cloud? Device deployment and key management from the cloud? If you are developing products, or using a product that is not hosted locally (even if it is “your” AWS VPC), where is this boundary? Zones and conduits? Management interfaces? It is an area of increasing interest, and some of the hottest-selling products appear to be moving to a SaaS model (cue historians).
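
As a small illustration of how you might at least see the intrinsic beaconing mentioned above, here is a sketch (Python; the OT subnet, allowlist, and sample flow records are hypothetical placeholders) that checks OT-originated flows against an egress allowlist and flags everything else - typically the NTP, DNS, and licensing call-homes nobody documented. Treat it as an audit aid under stated assumptions, not a product:

```python
# Sketch: flag outbound "beaconing" from an OT zone against an egress allowlist.
# The subnet, allowlist entries, and sample flows are hypothetical; in practice
# you would parse flow logs or firewall exports from the OT boundary.

import ipaddress
from collections import Counter

OT_ZONE = ipaddress.ip_network("10.20.0.0/16")       # assumed OT address space

# Destinations we have consciously decided to allow out of the zone.
EGRESS_ALLOWLIST = {
    ("10.30.1.5", 123),    # internal NTP server in the DMZ
    ("10.30.1.6", 53),     # internal DNS forwarder
}

# (src_ip, dst_ip, dst_port) tuples -- normally parsed from flow logs.
sample_flows = [
    ("10.20.4.11", "10.30.1.5", 123),
    ("10.20.4.11", "52.94.10.2", 443),     # licensing call-home to the internet
    ("10.20.7.3",  "8.8.8.8", 53),         # hardcoded public DNS
]

def unexpected_egress(flows):
    """Count OT-originated flows that are not on the egress allowlist."""
    hits = Counter()
    for src, dst, port in flows:
        if ipaddress.ip_address(src) not in OT_ZONE:
            continue                        # only audit OT-originated traffic
        if (dst, port) not in EGRESS_ALLOWLIST:
            hits[(src, dst, port)] += 1
    return hits

if __name__ == "__main__":
    for (src, dst, port), count in unexpected_egress(sample_flows).items():
        print(f"REVIEW: {src} -> {dst}:{port} ({count} flows) not on egress allowlist")
```

Every line it prints is a dependency you will have to negotiate with a vendor, not just a port you can close.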

Hot take

Principle 2 fails because it defines "exposure" as a network state (open ports) rather than a dependency state. If you limit exposure too harshly, your own people and your suppliers will work around it… or obscure their access. Comically, direct-to-cloud designs with no local buffering or operational capability also conflict with limiting exposure and with resilience (and we know how often AWS East goes down).

P6. Limit the Impact of Compromise - The Missing Dimensions of "Impact Limitation" 

Principle 6 focuses on "Limiting the impact of compromise" primarily through network micro-segmentation and zone restrictions. It posits that by chopping the network into isolated digital cells, we can trap an adversary in a single segment, preventing them from reaching safety-critical systems. It also implies that resilience and impact limitation are network-only problems, when they certainly are NOT.

The Reality

Complexity Breeds Operational Blindness

In practice, aggressive segmentation often fragments Operational Visibility more effectively than it fragments the adversary.

  • The "Emergency Bypass" Effect: When a crisis occurs and data stops flowing due to a strict firewall rule, the first human reaction is to "drop the shields" to restore visibility. In the pursuit of limiting theoretical impact, we create actual fragility. To keep the plant running, administrators frequently insert "Any-Any" rules, rendering the segmentation architecture a costly piece of "compliance theater." 

  • The Insider/Supply Chain Loophole: Segmentation assumes the threat is moving laterally across the network. It fails to address the "vertical compromise," where the attack arrives via a legitimate, trusted vendor update, or where a supplier fails to update a given component in the firmware. You cannot segment a PLC from its own code, or from the fact that your supplier has a copy of your original logic file somewhere (FYI: many copies are missing the comments, which can only be found in the master). If the chain is poisoned, the "impact" is already inside the perimeter, behind the firewall.
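
A quick way to spot the "compliance theater" failure mode described in the first bullet above is simply to audit exported rulesets for over-broad entries. The sketch below is a hypothetical, vendor-neutral illustration in Python (real exports from any given firewall would need their own parsing first):

```python
# Sketch: flag over-broad ("any-any") allow rules in a simplified firewall export.
# The rule structure below is a hypothetical normalized format; real exports
# (whatever the vendor) would need vendor-specific parsing before this step.

rules = [
    {"name": "historian-replication", "src": "10.20.5.0/24", "dst": "10.30.1.10",
     "service": "tcp/1433", "action": "allow"},
    {"name": "TEMP-outage-bridge",    "src": "any", "dst": "any",
     "service": "any", "action": "allow"},          # the "emergency bypass" that never got removed
    {"name": "deny-all",              "src": "any", "dst": "any",
     "service": "any", "action": "deny"},
]

def overly_broad(rule) -> bool:
    """An allow rule is suspect if two or more of src/dst/service are 'any'."""
    if rule["action"] != "allow":
        return False
    wildcards = sum(rule[field] == "any" for field in ("src", "dst", "service"))
    return wildcards >= 2

if __name__ == "__main__":
    for rule in rules:
        if overly_broad(rule):
            print(f"REVIEW: rule '{rule['name']}' is effectively any-any "
                  f"({rule['src']} -> {rule['dst']} {rule['service']})")
```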

Engineering Focus vs. The "Network" Solution 

  • Physical Consequence: The guidance treats "impact" as a data problem (e.g., loss of confidentiality/integrity). In OT, "impact" is physical (e.g., a turbine over-speeding). A network firewall cannot stop a compromised controller from obeying a valid (but malicious) command to destroy physical equipment. We also need to engineer out those possibilities with immutable, and potentially analog, controls - one should not be able to execute a dangerous activity if it is not “possible” in a well-validated state machine (a minimal sketch follows this list).

  • Digital vs. Mechanical Safety: True impact limitation relies on Non-Programmable Failsafes - mechanical governors, pressure relief valves, and hard-wired interlocks. By over-focusing on digital containment, we divert budget from the only layer that actually limits the catastrophe: the physical engineering that ignores the bits and obeys the physics. 

  • Embed Humans and Engineer Out the Risk: Have the bodies there to save the day, combined with the right safety culture and engineers who know how to handle erroneous conditions and have adequate visibility from an engineering perspective. Certainly that can include network or endpoint visibility, but it is largely humans, knowledge, training, and process that mitigate the disaster (as documented in many instances).
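
To illustrate the "well-validated state machine" idea from the Physical Consequence bullet, here is a minimal sketch (Python; the states, commands, and interlock checks are hypothetical) in which a command executes only if it is a legal transition from the current state and the physical interlocks agree - a valid-looking but dangerous command is simply not executable. The real enforcement belongs in non-programmable or safety-rated layers, of course; the sketch only shows the logic:

```python
# Sketch: only allow commands that are legal transitions in a validated
# state machine AND permitted by (simulated) physical interlocks.
# States, commands, and interlock readings are hypothetical illustrations.

ALLOWED_TRANSITIONS = {
    ("stopped", "start"): "running",
    ("running", "stop"):  "stopped",
    ("running", "trip"):  "tripped",
    ("tripped", "reset"): "stopped",
    # Deliberately NO ("running", "raise_speed_max") entry: the dangerous
    # action is not representable, no matter who asks for it.
}

def interlocks_ok(command: str, speed_rpm: float) -> bool:
    """Hard constraint that ignores who sent the command."""
    if command == "start" and speed_rpm > 0:
        return False          # cannot start while the shaft is already turning
    return True

def execute(state: str, command: str, speed_rpm: float) -> str:
    next_state = ALLOWED_TRANSITIONS.get((state, command))
    if next_state is None:
        print(f"REJECTED: '{command}' is not a valid transition from '{state}'")
        return state
    if not interlocks_ok(command, speed_rpm):
        print(f"REJECTED: interlock blocks '{command}'")
        return state
    print(f"OK: {state} -> {next_state}")
    return next_state

if __name__ == "__main__":
    state = "stopped"
    state = execute(state, "start", speed_rpm=0.0)        # accepted
    state = execute(state, "raise_speed_max", 3600.0)      # rejected: not modeled
    state = execute(state, "stop", 3600.0)                 # accepted
```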

Hot take

Principle 6 fails because it attempts to solve a Physical Safety problem with a typical cyber/technology solution. "Limiting Impact" is not about preventing packets from moving; it is about ensuring that even when the network is totally compromised, the physical process has a mechanical and human "will to survive" that the digital adversary cannot override. 

P8. The Unaddressed Resilience Gap: The "Lost Art" of Manual Ops 

The guidance (Principle 8) speaks of an "Isolation Plan", or the ability to sever the OT network from the internet to survive a nation-state attack. 

The unasked question

Once you isolate the network, can the humans actually run the plant? 

Over my career, I have consistently seen a trend where we replace mechanical knowledge with digital abstraction. To be clear: I’m not saying cloud is bad, nor IoT, nor new methods of operating a plant, but we are building frameworks for systems that we have forgotten how to operate analog-style.

  • The "Glasshole" Effect: If the "Secure Connectivity" stack fails - if the firewalls lock down or the "protocol validator" rejects a legitimate command because of a firmware bug - do the operators have the muscle memory to turn valves manually? Or are they staring at blank screens, helpless? 

  • Resilience is not a Firewall: True resilience isn't a stronger DMZ; it's the ability to maintain Safe State or Production when the DMZ is gone. The guidance pushes for complex digital intermediaries to protect the process, but in doing so, it increases our dependence on that digital layer. 

Human Nature: The "Unicorn" Fallacy 

The guidance implicitly assumes a workforce that does not exist. 

  • Skills Shortage: In reality, most water utilities are run by a small team where the "OT Security Person" is also the SCADA engineer and the person who drives out to fix a valve. In my experience that is true, but even some of the largest critical infrastructure providers have shaved their operational teams down to skeleton crews and are not hiring (subtle note: it’s not a quantity vs. quality game, and we had better act before AI neuters our ability to upskill junior resources, or we will be in an even worse situation than the impending waves of retirement already promise).

  • Cognitive Overload: By demanding "protocol validation" and "continuous monitoring of anomalies," we are asking overwhelmed operators to interpret complex security signals. In a crisis, human nature is to bypass security to restore operations. If the "Hardened OT Boundary" prevents a technician from fixing a boiler remotely at 3 AM, they will bridge a cell hotspot to the PLC, rendering the millions of dollars in controls useless. (I 100% recognize there is an opportunity for AI here – but that will be a very forward-leaning, modern facility with high-quality documentation and several high-maturity investments already in place... the barrier to entry will be high, and the returns poor for investors expecting a large market and quick sales.)

The "Less is More" Argument

If you are budget-strapped, adding technology often decreases security:

  • A complex network stack that is misconfigured because you can't afford a consultant is less secure than a simple air gap and a physically locked door. Perhaps that is a bit more old school and requires a human to check on it, but it is physically constrained so that the impact is limited if it were hacked - perhaps that is the smarter choice?

  • More Humans > More Tech: Instead of buying an AI-driven "Anomaly Detection" box that generates 10,000 false positives a day, hire two junior engineers to actually walk the floor, audit the assets, and learn how the physical process works. A human who understands the physics of the plant is the ultimate intrusion detection system, and in the near future they will also need to be able to understand the data coming at them (as well as detect errors and AI hallucinations).

Hot take

This kind of “limiting impact” avoids the elephant in the room: the business reality of “running the wheels off” vs. the “shiny box” technology phenomenon (i.e., run to failure). The guidance suggests that if regulation is weak, operators should implement "compensating controls." 

  • Shareholder Value: The reality of the boardroom is that if the regulation doesn't force the expense, the expense is "efficiency loss." We are seeing a "Run to Failure" mentality in critical infrastructure. The guidance asks for "End-State Architecture," but the business model is "Minimum Viable Compliance." 

  • The AI Savior Delusion: Executives are hoping AI will solve the "Skills Gap." They want a tool that "auto-remediates" threats. In OT, "auto-remediation" means "auto-shutdown" or "auto-accident." We are attempting to use AI (which we don't understand) to secure legacy code (which we've forgotten) to protect physical processes (which we no longer manually control), all complicated further by company acquisitions, proprietary logging, a lack of proper documentation, etc. AI is a tool and can have great value - IF you have the basics in place and are ready (probably fewer than 10% of companies, IMHO).

The Final Question: Do We Understand Our Own Creations? 

The NCSC guidance asks us to "verify protocols" and "harden boundaries." But do we truly understand the technology (and inherently the business) we are trying to make more resilient? 

  • Technical Understanding: We are wrapping layers of "Secure Connectivity" around "Black Box" controllers. We are building a vault door on a cardboard box.  Systems of systems are more than just tech. 

  • Business Understanding: We view technology as an asset on a balance sheet. It is a liability. Every line of code, every smart sensor, and every firewall rule is a "debt" that requires maintenance and eventual transition (if you own a car and it is racking up mileage or maintenance costs, what do you do? You start thinking quickly about buying a new one so you can keep doing whatever the car enables).

  • Environmental Exposure Understanding: We do not adequately consider emerging severe weather trends. We are not planning to weather-harden what we have, nor preparing to relocate or redistribute facilities in advance of “things getting bad.” Between drones and nature, there is a high likelihood of destroyed buildings, forced shutdowns, or revenue-losing events that disable technology.

Conclusion

The guidance is technically correct but utopian and operationally idealistic. It sets expectations that cannot be met without restoring emphasis on maintenance, manual skill, and practical engineering. If we continue to favor digital convenience over human capability and system sustainment, we will simply engineer new and more expensive modes of failure.

Investing in people and modernizing infrastructure is not optional. It’s the only realistic way to build resilient systems that can withstand both technological and human driven disruptions. 

Moving to secure OT means prescribing a level of digital hygiene that requires a cultural pivot, not just a technical one. Until we value sustainment over innovation, and manual competency over digital convenience, we are simply building more expensive ways to fail. I don’t think investing in humans as part of the process, and re-investing while modernizing our critical infrastructure, is a bad idea. This effort is not a make-work project. It is much more, and I suspect the long-term value will be immense (plus people will have work, a purpose, and a way to contribute to society locally!); especially with global warming, environmental, and other non-cyber events (shout out to Andy Bochman here).

 
