
Demystifying AWS Networking: A Deep Dive Under the Hood

If you come from a traditional enterprise networking background—spending years configuring Catalyst chassis switches, tuning OSPF timers, architecting MP-BGP EVPN fabrics, or maximizing TCAM profiles—stepping into the public cloud can feel disorienting. AWS documentation tells you that subnets are “virtual,” security groups are “instance-level firewalls,” and routing tables simply direct traffic via a point-and-click console.

But as a network engineer, you know that packets don’t route on abstract promises. They route on physical cables, specialized silicon, and deterministic forwarding logic.

To understand what AWS is actually doing under the hood, you have to completely invert your architectural mental model. AWS does not use massive modular chassis switches, traditional VRF leaking, or hardware-based Layer 2 broadcast domains. Instead, AWS treats networking entirely as a massive distributed systems scale-out problem.

This guide strips away the cloud marketing abstractions and explores the physical underlay, the software-defined overlay, and the hardware-accelerated silicon that powers the global AWS cloud fabric.


1. The Physical Underlay: Custom Silicon and Clos Fabrics

In a high-end enterprise or private data center, you are likely running a standard 3-tier Core-Distribution-Access architecture or a modern spine-leaf fabric. These designs rely heavily on high-availability protocols (like MLAG, vPC, or stacking) and large multi-slot modular chassis switches loaded with supervisor engines and high-density line cards.

AWS threw this model out. At cloud scale, physical modular switches are an architectural dead-end. They introduce single points of failure, draw massive localized power, and force reliance on proprietary vendor operating systems.

Multi-Stage Clos Topologies (Fat-Trees)

Instead of monolithic switches, the physical network inside an AWS Availability Zone (AZ) is built entirely as a massive, multi-stage Clos network (a non-blocking fat-tree topology).

Every single tier of the fabric is built out of small, fixed-configuration, ultra-dense merchant silicon boxes. Rather than scaling up by buying a bigger switch, AWS scales out horizontally by adding more identical nodes to the Clos layers.

The primary design mandate of this underlay is simple: provide deterministic, non-blocking, ultra-low latency bisection bandwidth between any two physical server racks inside the data center. Every rack path is calculated to have equal cost and identical latency characteristics.
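To make the scale-out arithmetic concrete, here is a minimal sketch of the textbook k-ary fat-tree, a three-tier folded Clos built from identical k-port switches. The formulas are the standard ones from the fat-tree literature, not figures published by AWS, and the function name is hypothetical.

```python
def fat_tree_capacity(k: int) -> dict:
    """Size a textbook 3-tier k-ary fat-tree built from identical k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be an even port count >= 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,          # k/2 edge switches in each of k pods
        "aggregation_switches": k * half,   # k/2 aggregation switches per pod
        "core_switches": half * half,       # (k/2)^2
        "hosts": k * half * half,           # k^3 / 4
        # Hosts in different pods have one equal-cost path per core switch.
        "paths_between_pods": half * half,
    }


if __name__ == "__main__":
    for ports in (4, 32, 64):
        print(ports, fat_tree_capacity(ports))
```

Doubling the port count of the building block multiplies the host count by eight, which is why the fabric grows by adding identical small boxes rather than by buying a larger chassis.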

Stripped-Down Routing Protocols

In the underlay, there is no Layer 2 bridging across the data center. Everything is IP-routed straight down to the Top-of-Rack (ToR) switch.

Standard enterprise interior routing protocols like OSPF or IS-IS are heavily stripped down or replaced entirely. In a traditional network, a massive topology shift causes a flood of Link-State Advertisements (LSAs), forcing every router to rerun its Shortest Path First (SPF) algorithm simultaneously—a recipe for control-plane starvation at scale.

AWS utilizes highly optimized, custom protocol variants (often described as heavily segmented BGP designs or custom out-of-band control planes) to handle rapid convergence and streaming telemetry across millions of endpoints without risking routing loops or control-plane collapse.

Jumbo Frames & The Nitro Foundations

Every single blade server in an AWS rack is equipped with custom PCIe hardware: the AWS Nitro card. The fabric is built for jumbo frames: EC2 instances see an MTU of up to 9001 bytes inside a VPC, so the physical underlay has to carry frames at least that large plus the overlay headers. That headroom allows the Nitro cards to encapsulate virtual machine traffic within outer IP headers without fragmenting packets, ensuring maximum throughput across the physical wire.


2. The Software-Defined Overlay: The Death of MP-BGP EVPN

In a modern enterprise multi-tenant data center, the standard way to run isolated virtual networks over a physical Layer 3 underlay is MP-BGP EVPN with VXLAN. Your switches act as Virtual Tunnel Endpoints (VTEPs), running an interior BGP control plane to exchange MAC addresses and IP routing data across the fabric.

AWS does not use VXLAN, and your Virtual Private Clouds (VPCs) do not run a traditional BGP control plane.

The Illusion of Layer 2

When you launch an EC2 instance, its operating system receives a standard private IP and a MAC address. It looks and acts like an Ethernet segment. But Layer 2 broadcasts and multicasts do not actually exist in the AWS fabric.

When an EC2 instance sends an ARP request for its default gateway or a neighbor instance, the packet never hits a physical wire or a broadcast switch pool. The local Nitro hardware hypervisor intercepts the ARP frame instantly at the PCIe layer.

Nitro acts as a local ARP proxy. Because it is connected to a global infrastructure database, it already knows the MAC address of every virtual interface in your VPC. It responds to the ARP locally within microseconds. The network is completely unicast-routed under the hood, disguised as an Ethernet segment to keep standard operating systems happy.
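The proxy behaviour described above can be sketched as a plain table lookup. This is an illustration only: the class names, the table layout, and the example addresses are hypothetical, and the real Nitro implementation is not public.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ArpRequest:
    vpc_id: str
    sender_ip: str
    sender_mac: str
    target_ip: str


@dataclass(frozen=True)
class ArpReply:
    target_ip: str
    target_mac: str


class LocalArpProxy:
    """Answers ARP from a local table instead of flooding a broadcast domain."""

    def __init__(self, known_interfaces: Dict[Tuple[str, str], str]):
        # (vpc_id, private_ip) -> MAC, assumed to be synced from a control plane
        self._macs = dict(known_interfaces)

    def handle(self, request: ArpRequest) -> Optional[ArpReply]:
        mac = self._macs.get((request.vpc_id, request.target_ip))
        if mac is None:
            return None  # unknown address: no reply, and nothing is flooded
        return ArpReply(request.target_ip, mac)


if __name__ == "__main__":
    proxy = LocalArpProxy({("vpc-example", "10.0.0.1"): "02:00:00:00:00:01"})
    who_has = ArpRequest("vpc-example", "10.0.0.5", "02:00:00:00:00:05", "10.0.0.1")
    print(proxy.handle(who_has))
```

The lookup key includes the VPC identifier, so two tenants using the same private IP never see each other's answers.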

The Mapping Service: Overlay Control Plane at Scale

If AWS were to use MP-BGP EVPN to distribute network reachability for millions of dynamic virtual machines spinning up and down every second, the BGP control plane would implode under the sheer volume of route updates.

To solve this, AWS replaced routing protocols in the overlay entirely with a globally distributed, highly available, transactional key-value store known as the Mapping Service.

[ EC2 Instance ]
       │  (Standard Ethernet Frame)
       ▼
[ Nitro Card / Hypervisor ] <───(Queries Local Cache / Mapping Service)
       │  (Encapsulates: Inner VPC IP + Outer Physical IP)
       ▼
[ Physical Underlay Fabric (Clos) ]

When an EC2 instance (10.0.0.5) wants to talk to another instance (10.0.1.20), the packet path behaves as a distributed lookup system rather than a hop-by-hop hardware routing table:

  1. Intercept: The instance transmits a standard IP packet. The local Nitro card catches it before it can hit the physical network.
  2. Lookup: Nitro reads the destination VPC IP (10.0.1.20). It checks its local, high-speed on-card cache. If it hits a cache miss, it sends an out-of-band query to the regional Mapping Service: “Where is VPC ID 4432, Target IP 10.0.1.20 physically located right now?”
  3. Response: The Mapping Service returns the physical underlay IP address of the destination host server blade along with the specific cryptographic tokens required for that workload.
  4. Encapsulate & Ship: Nitro caches the entry locally, encapsulates the original customer packet inside an outer IP header (using a protocol structurally similar to Geneve/VXLAN), assigns the destination physical server IP as the outer target, and drops it into the physical Clos fabric.
  5. Decapsulate: The destination host’s Nitro card receives the packet, validates the cryptographic token, strips the outer underlay header, and injects the raw, unencapsulated Ethernet frame straight into the target VM’s memory.
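The five steps above can be sketched as a cached key-value lookup followed by encapsulation. Everything named here (MappingService, NitroCardSketch, the token field, the example addresses) is a hypothetical stand-in chosen for illustration; the actual wire format and service API are not public.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Mapping:
    host_ip: str  # underlay address of the server hosting the destination
    token: str    # stand-in for the per-workload credential from step 3


@dataclass(frozen=True)
class Encapsulated:
    outer_src: str
    outer_dst: str
    vpc_id: str
    token: str
    inner_packet: bytes


class MappingService:
    """Stand-in for the regional store: (vpc_id, vpc_ip) -> Mapping."""

    def __init__(self, table: Dict[Tuple[str, str], Mapping]):
        self._table = dict(table)
        self.queries = 0

    def lookup(self, vpc_id: str, vpc_ip: str) -> Mapping:
        self.queries += 1
        return self._table[(vpc_id, vpc_ip)]


class NitroCardSketch:
    def __init__(self, host_ip: str, service: MappingService):
        self.host_ip = host_ip
        self._service = service
        self._cache: Dict[Tuple[str, str], Mapping] = {}

    def send(self, vpc_id: str, dst_vpc_ip: str, inner_packet: bytes) -> Encapsulated:
        key = (vpc_id, dst_vpc_ip)
        mapping = self._cache.get(key)
        if mapping is None:                      # step 2: cache miss, ask the service
            mapping = self._service.lookup(vpc_id, dst_vpc_ip)
            self._cache[key] = mapping           # step 4: remember the answer
        return Encapsulated(self.host_ip, mapping.host_ip, vpc_id,
                            mapping.token, inner_packet)

    def receive(self, packet: Encapsulated, expected_token: str) -> bytes:
        if packet.token != expected_token:       # step 5: validate before delivery
            raise PermissionError("token mismatch, dropping packet")
        return packet.inner_packet               # outer header is discarded


if __name__ == "__main__":
    service = MappingService({("vpc-4432", "10.0.1.20"): Mapping("198.18.7.21", "tok-b")})
    card = NitroCardSketch("198.18.3.9", service)
    print(card.send("vpc-4432", "10.0.1.20", b"hello"))
```

Only the first packet of a flow pays for the remote query. Later packets to the same destination are resolved from the on-card cache.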

Your VPC “Route Tables” are not actual Routing Information Bases (RIBs) programmed into physical TCAM chips on a core router. They are simply policy configurations pushed down to the Mapping Service database to validate and direct these edge lookups.
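Treating a route table as policy data rather than a hardware RIB can be illustrated with a small lookup function. VPC route tables pick the most specific matching route, which the sketch below reproduces with Python's standard ipaddress module. The target names are placeholders, and the function is an illustration, not AWS code.

```python
import ipaddress
from typing import Dict, Optional


def select_route(routes: Dict[str, str], destination: str) -> Optional[str]:
    """Return the target of the most specific route that contains destination."""
    dst = ipaddress.ip_address(destination)
    best_net, best_target = None, None
    for cidr, target in routes.items():
        net = ipaddress.ip_network(cidr)
        if dst in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_target = net, target
    return best_target


if __name__ == "__main__":
    table = {
        "10.0.0.0/16": "local",
        "10.1.0.0/16": "peering-example",
        "0.0.0.0/0": "internet-gateway-example",
    }
    for ip in ("10.0.1.20", "10.1.4.4", "93.184.216.34"):
        print(ip, "->", select_route(table, ip))
```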


3. The Evolution of Hypervisor Performance: Dom0 vs. SR-IOV vs. DPDK

To appreciate how Nitro achieves this architecture, we have to look at the history of virtualization performance. In the early days of the cloud, software-defined networking was incredibly expensive from a compute resource perspective.

The Legacy Dom0 Model (Process Switching)

In early Xen-based cloud virtualization, a physical server was divided into a hypervisor layer, guest virtual machines, and a privileged control domain known as Domain 0 (Dom0).

Because the hypervisor did not have a native networking stack or hardware driver architecture, Dom0 (a specialized Linux virtual machine) had to manage all physical I/O.

[ Guest VM (User Space) ] ──(Context Switch)──► [ Guest VM Kernel (vNIC Driver) ]
                                                        │ (Memory Copy via Grant Tables)
                                                        │
[ Dom0 Host Kernel (Vif / Tap Device) ] ◄───────────────┘
       │
       ▼ (Software Bridge / OVS Layer)
[ Linux Bridge / Open vSwitch ]
       │
       ▼ (Physical NIC Driver)
[ Physical NIC Hardware ] ───► Out to physical switch

When a guest VM wanted to send a packet, that packet had to be copied across virtual memory spaces from the Guest RAM, into hypervisor memory, and then into Dom0’s kernel network buffer (sk_buff). Dom0 then processed the packet via a software switch (like Open vSwitch or a standard Linux bridge) before passing it down to the physical NIC driver.

The CCIE Analogy: Dom0 networking is exactly like a legacy Cisco router running Process Switching. Every single packet triggers a CPU interrupt. The main processor must stop executing application logic, save its registers, switch context to the networking domain, parse the packet headers in software, and schedule the interface transmission. This introduces severe latency jitter, limits throughput, and burns up to 30% of the server’s actual CPU cores just running basic infrastructure.

Standard SR-IOV: Direct Access, Cloud Limitations

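To bypass Dom0, the industry developed SR-IOV (Single Root I/O Virtualization). This PCIe standard allows a single physical network card to present itself to the motherboard as multiple independent virtual PCIe devices, known as Virtual Functions (VFs). You can map a VF directly into a guest VM’s memory space.

The catch for a cloud provider is that a VF on a standard NIC hands the guest a direct path to the wire. Traffic skips the host’s software switch, so there is nowhere left to apply overlay encapsulation or per-tenant filtering, and binding a VM to a specific physical NIC complicates live migration.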

DPDK (Data Plane Development Kit)

To bridge the performance gap without losing control plane enforcement, software engineers turned to DPDK. DPDK bypasses the heavy Linux kernel network stack entirely, pulling packets directly into user-space applications. It introduces Poll Mode Drivers (PMDs). Instead of waiting for a slow hardware interrupt when a packet arrives, a dedicated CPU core runs in a tight, infinite while(true) loop, continuously polling the network card’s ring buffers and keeping that core at 100% utilization.

While DPDK can process tens of millions of packets per second at near-wire speed, it demands a massive tax: it forces you to completely sacrifice expensive compute cores to do nothing but loop and watch for packets, even when the network is totally idle.
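The cost of polling is easy to see in a toy model. The sketch below is plain Python, not DPDK: a deque stands in for the NIC's receive ring, and the loop is bounded by max_polls so that it terminates, whereas a real poll mode driver spins forever. The function and parameter names are illustrative.

```python
from collections import deque
from typing import Callable, Deque, Tuple


def poll_mode_loop(rx_ring: Deque, handle: Callable, max_polls: int,
                   burst: int = 32) -> Tuple[int, int]:
    """Busy-poll a receive ring: no interrupts, no sleeping between polls."""
    processed = 0
    empty_polls = 0
    for _ in range(max_polls):
        if not rx_ring:
            empty_polls += 1          # the core spins even with nothing to do
            continue
        # Drain up to one burst of packets per poll, as PMDs typically do.
        for _ in range(min(burst, len(rx_ring))):
            handle(rx_ring.popleft())
            processed += 1
    return processed, empty_polls


if __name__ == "__main__":
    ring = deque(range(40))
    seen = []
    done, wasted = poll_mode_loop(ring, seen.append, max_polls=10)
    print(f"processed={done} empty_polls={wasted}")
```

In this run two polls do useful work and the remaining eight find an empty ring, which is the idle-time tax the paragraph above describes.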


4. The Nitro Architecture: True Hardware Isolation

AWS engineered Nitro to solve the structural limits of Dom0, SR-IOV, and DPDK simultaneously. They realized that to scale a secure cloud, they needed to completely offload infrastructure operations from the host processor.

A modern AWS bare-metal or virtual server does not run a massive software hypervisor or a resource-heavy Dom0. Instead, the main host CPU (Intel, AMD, or AWS Graviton) is 100% dedicated to running customer workloads. All networking, storage management, management telemetry, and encryption are offloaded to an array of dedicated, custom-built hardware SoCs (System-on-Chip) and ASICs plugged into the PCIe bus: the Nitro System.

[ DUAL-SOCKET HOST CPU ]  (100% Dedicated to Customer Workloads)
            │
            ▼ (PCIe Bus Data Path)
┌────────────────────────────────────────────────────────┐
│                  NITRO CARD FOR VPC                    │
│                                                        │
│  ┌──────────────────┐      ┌────────────────────────┐  │
│  │  Custom Silicon  │      │ Embedded Microkernel   │  │
│  │ Packet Processor │ ◄──  │  (Local Cache of the   │  │
│  │   (ASIC/FPGA)    │      │    Mapping Service)    │  │
│  └────────┬─────────┘      └────────────────────────┘  │
└───────────┼────────────────────────────────────────────┘
            │
            ▼
   [ 100G/200G/400G Physical Underlay Network (Clos) ]

The Elastic Network Adapter (ENA)

When you run a virtual machine on Nitro, the hypervisor exposes a standardized virtual interface called the ENA (Elastic Network Adapter). The ENA driver is lightweight, open-source, included in the mainline Linux kernel, and available for other major operating systems such as Windows and FreeBSD.

The ENA driver acts as a stable hardware contract. The guest operating system writes its network frames directly into a PCIe ring buffer using DMA, completely unaware of whether the underlying physical network is operating at 10G, 100G, or 400G. The packet bypasses the main system software and lands directly on the Nitro Card for VPC.

Hardware-Level Policy Enforcement

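Once the packet drops into the Nitro card, specialized silicon processing pipelines take over. They apply the security policy checks covered in Section 6, resolve the destination through the local Mapping Service cache, and encapsulate the packet for the underlay as described in Section 2.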

High-Performance Networking: EFA & SRD

For advanced computing scales, such as distributed AI training or high-performance compute clusters, the standard TCP/IP protocol inside an overlay tunnel becomes a significant performance bottleneck. When TCP encounters drops or reordering, it retransmits and backs off, and the resulting latency spikes stall the whole application.

To overcome this, AWS designed the Elastic Fabric Adapter (EFA), which introduces a custom transport protocol called SRD (Scalable Reliable Delivery) implemented directly within the Nitro silicon.

Traditional TCP vs. AWS SRD (Nitro)

[TCP Path]     Packet 1, 2, 3 ───►  Pinned to Single Link (ECMP Hash) ───► Out-of-order = Drop/Retransmit
                                                                           
[SRD Path]     Packet 1 ─────────►  Path A (Spine 1)  ─────────┐
               Packet 2 ─────────►  Path B (Spine 2)  ─────────┼─────────► Nitro Reassembles 
               Packet 3 ─────────►  Path C (Spine 3)  ─────────┘           at Line Rate (No TCP Drops)

  1. OS Bypass (User-Space Direct): EFA allows an application inside the VM to bypass the operating system’s heavy network kernel entirely. Using a user-space driver interface (via Libfabric), the application writes network payloads directly into the Nitro card’s hardware buffer.
  2. Multi-Path Flow Striping: In standard networking, Equal-Cost Multi-Pathing (ECMP) hashes a single TCP 5-tuple to pin it to a single physical network path. This prevents out-of-order packet delivery, but can lead to “hot spots” if one path gets heavily congested. SRD throws this rule away. Nitro intentionally breaks a single massive data stream apart, striping individual packets across many alternative physical paths through the Clos network simultaneously.
  3. Hardware Reassembly: Because packets travel along different physical paths, they inevitably arrive out of order. The receiving Nitro card catches these packets, reorders them instantly within dedicated on-chip hardware buffers at microsecond scale, and presents a clean, sequential data stream to the target application. If an underlay switch fails mid-stream, Nitro detects the drop and retransmits the missing packet down an alternate path within sub-milliseconds, completely hidden from the software layer.
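Striping and reassembly can be sketched in a few lines. This is a simplified software model, not SRD itself: path selection here is plain round-robin (the real protocol picks paths dynamically), and the ReorderBuffer class and spray function are hypothetical names.

```python
from typing import Any, Dict, List, Sequence, Tuple


def spray(packets: Sequence[Any], num_paths: int) -> Dict[int, List[Tuple[int, Any]]]:
    """Assign sequence numbers and stripe packets across paths round-robin."""
    paths: Dict[int, List[Tuple[int, Any]]] = {p: [] for p in range(num_paths)}
    for seq, payload in enumerate(packets):
        paths[seq % num_paths].append((seq, payload))
    return paths


class ReorderBuffer:
    """Holds early arrivals and releases payloads strictly in sequence order."""

    def __init__(self, first_seq: int = 0):
        self._next = first_seq
        self._held: Dict[int, Any] = {}

    def receive(self, seq: int, payload: Any) -> List[Any]:
        if seq < self._next or seq in self._held:
            return []  # duplicate, e.g. a retransmission that raced the original
        self._held[seq] = payload
        ready = []
        while self._next in self._held:
            ready.append(self._held.pop(self._next))
            self._next += 1
        return ready


if __name__ == "__main__":
    buf = ReorderBuffer()
    # Packets arrive in the order 2, 0, 1 because they took different paths.
    for seq, data in [(2, "c"), (0, "a"), (1, "b")]:
        print(seq, "->", buf.receive(seq, data))
```

The receiver never needs the paths to preserve order. It only needs every sequence number to arrive eventually, which is what retransmission on an alternate path provides.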

5. The Virtual Gateway Layer and Scale-Out Architecture

As a CCIE, you know that placing an active/standby firewall pair or a monolithic hardware load balancer in a core data center design introduces a permanent architectural choke point. You are constantly forced to manage asymmetric routing paths, design complex state-synchronization fabrics, and calculate strict physical interface over-subscription ratios.

AWS completely abstracts physical gateways by shifting to a distributed, anycasted service platform called AWS Hyperplane.

Inside the Hyperplane Architecture

AWS does not run virtual appliances or single-node Linux routers to deliver managed network services like NAT Gateways, Network Load Balancers (NLBs), Transit Gateways (TGW), or PrivateLink.

Instead, Hyperplane is a massive, multi-tenant cluster of specialized, high-performance packet-forwarding nodes running custom software optimized with DPDK-style fast-path logic. When you provision a service like an AWS NAT Gateway, you aren’t launching a discrete virtual machine; you are carving out an allocated, distributed share of the regional Hyperplane cluster.

AWS PrivateLink allows a consumer in VPC A to directly access a service hosted in VPC B securely, without transiting the public internet and without utilizing traditional VPC Peering or Transit Gateways. Remarkably, PrivateLink allows this even if both VPCs use completely overlapping CIDR blocks (e.g., both networks are using 10.0.0.0/16), without forcing you to write a single complex NAT rule.

[ Consumer Application ]
       │
       ▼ (Sends raw packet to Local IP: 10.0.1.55)
[ Endpoint ENI (Consumer VPC) ]
       │
====== BOUNDARY: Packet intercepted by Nitro ASIC ======
       │
       ▼ (Encapsulated with Hyperplane-specific metadata)
[ AWS Physical Underlay Fabric ] ──► [ Hyperplane Fleet ] ──► [ Provider NLB / Service ]

This interception and routing pipeline operates seamlessly behind the scenes:

  1. The Phantom ENI: When you instantiate a PrivateLink Interface Endpoint, AWS provisions an Elastic Network Interface (ENI) inside your local consumer subnet. It is assigned a standard local private IP address (e.g., 10.0.1.55). Your application uses DNS to point toward this local IP address, believing it is communicating with a local server on its own subnet.
  2. Nitro Hardware Interception: As soon as the packet leaves your instance and hits the PCIe layer, the local Nitro card intercepts it. It recognizes that this specific target IP is mapped to a PrivateLink service. Nitro bypasses standard VPC routing logic entirely. Instead, it wraps the packet inside a custom overlay tunnel header and injects specific metadata: the Consumer Tenant ID, the Endpoint ID, and connection tracking tokens. It then routes the packet across the physical underlay directly to the regional Hyperplane fleet.
  3. Dual-Sided NAT Translation: When the packet lands on a Hyperplane node, the system executes a simultaneous Source NAT (SNAT) and Destination NAT (DNAT) operation inside its fast-path processing loop:
    • DNAT: It swaps the destination address from your local phantom IP (10.0.1.55) to the real, private backend IP of the provider’s service infrastructure in VPC B.
    • SNAT: It completely strips your original instance’s private source IP (e.g., 10.0.1.10) and replaces it with a Hyperplane proxy source IP allocated natively from within the provider’s local subnet space.
  4. Blinding the Routing Domains: Because both the original source and destination IPs are stripped and translated simultaneously mid-flight within the Hyperplane layer, the consumer’s routing table and the provider’s routing table are completely isolated from one another. The overlapping IP conflict is bypassed entirely because neither network domain ever sees the other’s raw headers.
  5. Consistent Hashing vs. State Synchronization: Because Hyperplane consists of hundreds of distributed nodes, standard underlay ECMP routing might cause sequential packets belonging to the exact same TCP flow to land on entirely different physical Hyperplane boxes. To prevent connection failure without running slow, over-the-wire state synchronization fabrics, Hyperplane utilizes a highly sophisticated consistent hashing algorithm based on the original packet’s 5-tuple. No matter which physical path a packet takes through the underlay Clos network, the mathematical hash guarantees it will land on the exact Hyperplane node holding the state table for that specific connection.
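The flow-to-node pinning in step 5 can be illustrated with a classic hash ring. This is a generic consistent-hashing sketch, assuming SHA-256 and 64 virtual nodes per member. Hyperplane's actual algorithm is not public, and the node names are made up.

```python
import bisect
import hashlib
from typing import Iterable, Tuple


def _hash(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: Iterable[str], vnodes: int = 64):
        # Each node is placed on the ring many times to even out the load.
        self._ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._keys = [point for point, _ in self._ring]

    def node_for(self, five_tuple: Tuple) -> str:
        key = _hash("|".join(map(str, five_tuple)))
        # First ring point clockwise from the key, wrapping at the end.
        idx = bisect.bisect_right(self._keys, key) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    ring = HashRing(["hp-node-1", "hp-node-2", "hp-node-3"])
    flow = ("10.0.1.10", 44321, "10.0.1.55", 443, "tcp")
    print(ring.node_for(flow))
```

Because the result depends only on the 5-tuple and the ring membership, every packet of a flow resolves to the same node, and removing some other node from the fleet leaves that flow's owner unchanged.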

6. Where Security Policies Actually Execute: SGs vs. NACLs

AWS documentation states that Security Groups protect your instances, while Network ACLs (NACLs) protect your subnets. But as a network engineer, you know that subnets aren’t physical boxes capable of processing access lists.

Under the hood, both Security Groups and NACLs are enforced by the exact same underlying physical infrastructure: the Nitro card (for EC2 instances) and the Hyperplane fleet (for managed services). However, they are processed at completely different logical stages of the packet pipeline.

Security Groups: Hardware SRAM Lookup Engines

Security Groups are stateful filters. They do not run inside your guest operating system’s software firewall (iptables has no visibility into them), nor do they run on top-of-rack switches. They are programmed directly into the Nitro card’s custom ASIC/FPGA lookup engine on the physical blade server hosting your VM.

Traditional enterprise switches rely on TCAM (Ternary Content Addressable Memory) to execute ACL lookups at line rate. While TCAM is incredibly fast, it is notoriously power-hungry, expensive, and strictly limited in size—which is why high-end enterprise switches have rigid limits on the number of ACL entries you can write before exhausting hardware memory.

Nitro eliminates this constraint by running an optimized, state-tracking lookup engine directly in hardware silicon, evaluating rules against connection state tables stored in dedicated, high-speed onboard SRAM.

When a packet arrives from the underlay network, the Nitro card strips the overlay tunnel encapsulation. Before it passes the raw Ethernet frame across the PCIe bus to your instance, it checks the inner 5-tuple against its state table. If the packet belongs to an already established, verified connection, it bypasses the main rule table entirely and flows straight to your instance via the fast-path pipeline.
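The fast-path behaviour can be modelled with two sets: one for rules and one for established flows. This is a deliberately small sketch. Rules are reduced to (protocol, destination port) pairs, only inbound traffic is modelled, and the class name and the rule_lookups counter are hypothetical.

```python
from typing import Iterable, Set, Tuple

Flow = Tuple[str, int, str, int, str]  # src_ip, src_port, dst_ip, dst_port, proto


class StatefulFilter:
    def __init__(self, inbound_rules: Iterable[Tuple[str, int]]):
        self._rules: Set[Tuple[str, int]] = set(inbound_rules)
        self._established: Set[Flow] = set()
        self.rule_lookups = 0  # how often the slow path ran

    def allow_inbound(self, flow: Flow) -> bool:
        if flow in self._established:
            return True               # fast path: state hit, rules are skipped
        self.rule_lookups += 1        # slow path: first packet of a new flow
        _, _, _, dst_port, proto = flow
        if (proto, dst_port) in self._rules:
            self._established.add(flow)
            return True
        return False


if __name__ == "__main__":
    sg = StatefulFilter({("tcp", 443)})
    https = ("203.0.113.7", 51000, "10.0.0.5", 443, "tcp")
    print([sg.allow_inbound(https) for _ in range(3)], "rule lookups:", sg.rule_lookups)
```

However many packets the flow carries, the rule set is consulted once. Every later packet is a single state-table hit.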

Network ACLs: The Stateless Next-Hop

Network ACLs are stateless and map to an entire subnet. In a traditional campus or data center network, the closest equivalent is applying ip access-group 101 in (or out) on a Switched Virtual Interface (SVI).

In AWS, the NACL is evaluated at the virtual gateway layer. Physically, however, this logic still executes directly on the local Nitro card of the host machine to optimize performance and drop unpermitted traffic as early as possible.

The True Packet Pipeline Matrix

To trace exactly how policies are evaluated when a packet moves from Instance A (Subnet A) to Instance B (Subnet B) located on separate physical hosts across the data center:

[ HOST A - NITRO ASIC ]
  1. Evaluate Security Group (Outbound Stateful Check)
  2. Evaluate Subnet NACL     (Subnet A Stateless Egress Check)
         │
         ▼ (If Allowed: Encapsulate and Ship)
[ PHYSICAL UNDERLAY FABRIC ] -> Pure L3 IP Forwarding (No Policy Checks)
         │
         ▼ (Arrives at Destination Host)
[ HOST B - NITRO ASIC ]
  3. Evaluate Subnet NACL     (Subnet B Stateless Ingress Check)
  4. Evaluate Security Group (Inbound Stateful Check)
         │
         ▼ (If Allowed: Inject Frame via SR-IOV)
[ TARGET EC2 INSTANCE ]

If a packet violates the outbound NACL of its own subnet, the local Nitro card drops it right there on the source host. The packet never even touches the physical top-of-rack switch, saving valuable data center fabric bandwidth.
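The evaluation order in the matrix above can be written down as an ordered list of checks that stops at the first denial. The stage names mirror the diagram. The trace function and the verdicts dictionary are illustrative constructs, not an AWS interface.

```python
from typing import Dict, List, Optional, Tuple

# (where the check runs, which check it is), in evaluation order
PIPELINE: List[Tuple[str, str]] = [
    ("host A", "security group egress"),
    ("host A", "subnet A NACL egress"),
    ("host B", "subnet B NACL ingress"),
    ("host B", "security group ingress"),
]


def trace(verdicts: Dict[str, bool]) -> Tuple[bool, Optional[str], List[str]]:
    """Walk the pipeline; checks missing from verdicts default to allow.

    Returns (delivered, host_where_dropped, checks_evaluated).
    """
    evaluated: List[str] = []
    for host, check in PIPELINE:
        evaluated.append(check)
        if not verdicts.get(check, True):
            return False, host, evaluated
    return True, None, evaluated


if __name__ == "__main__":
    print(trace({}))
    print(trace({"subnet A NACL egress": False}))
```

The second call reports a drop on host A after two checks, so the packet is discarded before it is ever encapsulated for the underlay.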

The Nitro Card Security Group Limits

Because these rules are executed within the fixed memory confines of the Nitro card’s onboard silicon, there are strict mathematical limits on how far you can scale your security policies.

AWS enforces a hard boundary on network interfaces:

$$\text{Security Groups per ENI} \times \text{Rules per Security Group} \le 1{,}000$$

This is a strict zero-sum game. The total product of attached groups and rules can never exceed 1,000 total rules on a single network interface. If you request a quota increase to run 200 rules inside a single Security Group, AWS will automatically throttle your maximum attached Security Groups per ENI down to 5 to protect the onboard hardware memory space.

Furthermore, because ingress and egress traffic run through separate pipeline passes on the Nitro ASIC, this 1,000-rule constraint is calculated independently per direction. You can have 1,000 rules evaluating inbound traffic and a separate 1,000 rules evaluating outbound traffic simultaneously on the exact same interface.
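The trade-off is simple integer arithmetic, shown here as a small helper. The 1,000 budget is the figure quoted above. The function names are hypothetical, and the calculation applies to each direction separately.

```python
RULE_BUDGET_PER_ENI = 1000  # product limit quoted in the text, per direction


def max_groups_per_eni(rules_per_group: int) -> int:
    """Largest number of attached security groups that stays within the budget."""
    if rules_per_group < 1:
        raise ValueError("rules_per_group must be at least 1")
    return RULE_BUDGET_PER_ENI // rules_per_group


def fits(groups_per_eni: int, rules_per_group: int) -> bool:
    """True if this combination stays within the per-ENI rule budget."""
    return groups_per_eni * rules_per_group <= RULE_BUDGET_PER_ENI


if __name__ == "__main__":
    for rules in (50, 100, 200):
        print(f"{rules} rules per group -> at most {max_groups_per_eni(rules)} groups")
```

With 200 rules per group the helper returns 5, matching the example in the text.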

Connection Tracking Table Saturation

Because Security Groups are stateful, Nitro must track every active flow inside its hardware connection table. This introduces an independent scaling boundary based entirely on your EC2 instance size.

If an ENI experiences an unmitigated flash flood of millions of concurrent, microsecond-long connections (such as a massive SYN flood or high-frequency distributed scraping engines), the Nitro card’s connection tracking table can become completely saturated.

Once the state table is full, Nitro will automatically drop all new connection attempts in hardware on the card. This traffic drop happens entirely outside the guest operating system. It will never appear in your Linux kernel logs, application error files, or tcpdump captures, because the packet is killed in silicon before it ever reaches the PCIe bus. The one place it does surface is the ENA driver’s statistics: the conntrack_allowance_exceeded counter reported by ethtool -S increments when packets are dropped for this reason.
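The saturation behaviour can be modelled as a fixed-capacity set: existing flows keep working, while new ones are refused once the table is full. The capacity value and class name below are arbitrary illustrations. Real per-instance limits are not published as a single number.

```python
from typing import Set, Tuple

Flow = Tuple[str, int, str, int, str]  # src_ip, src_port, dst_ip, dst_port, proto


class ConnTable:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._flows: Set[Flow] = set()
        self.dropped_new_flows = 0  # invisible to the guest in the scenario above

    def admit(self, flow: Flow) -> bool:
        if flow in self._flows:
            return True                       # established flows are unaffected
        if len(self._flows) >= self.capacity:
            self.dropped_new_flows += 1       # table full: new flow is dropped
            return False
        self._flows.add(flow)
        return True

    def close(self, flow: Flow) -> None:
        self._flows.discard(flow)             # frees a slot for a future flow


if __name__ == "__main__":
    table = ConnTable(capacity=2)
    flows = [("198.51.100.1", 40000 + i, "10.0.0.5", 443, "tcp") for i in range(3)]
    print([table.admit(f) for f in flows], "dropped:", table.dropped_new_flows)
```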


7. Conclusion: The New Networking Mental Model

To master AWS networking as a CCIE, you must stop looking for physical boxes and start looking at where the data plane boundaries are drawn.

AWS did not change the laws of physics or the core requirements of routing packets securely at high speed. Instead, they took the heavy lifting of software-defined networking—the encapsulations, the stateful firewalling, the NAT routing, and the lookup tables—and decoupled them entirely from centralized hardware routers.

By shifting the control plane to a highly scalable distributed database (The Mapping Service), offloading the data plane to custom hardware line cards plugged into every server motherboard (The Nitro System), and building scale-out anycasted processing clusters that pin each flow’s state to a node by consistent hashing (Hyperplane), AWS built a multi-tenant environment that delivers the predictability of hardware performance with the flexibility of software. Once you look past the cloud console abstractions, it is a masterpiece of distributed systems engineering.

Written by Paul Carvill

Enterprise → cloud → AI networking. I write the breakdowns I wish I’d had. New field notes roughly twice a month.
