I am the space below

Figure: Xen domain architecture, with the hypervisor layer below dom0, driver domains, stub domains, and domU guests, and with privilege boundaries and control paths visible. Caption: The household, and who is allowed to touch the walls.

Who lives here

I do not manage myself. That would be a conflict of interest.

Instead I birth a first child — dom0 — and hand it the keys to the hardware I refuse to drive. It runs the toolstack. It speaks for me. Every other guest is a domU: unprivileged, ambitious, blind to the others.

I keep the policy. Dom0 keeps the power tools. The arrangement is elegant until you notice that dom0 is also the single throat the whole stack breathes through.

dom0 # xl list

Name                      ID   Mem VCPUs   State   Time(s)
Domain-0                   0  4096     8   r-----   91244.3
driver-net                 3   512     2   -b----    1822.7
stub-hvm-7                 6    32     1   -b----      14.0
web-frontend               7  2048     4   -b----    3391.5
postgres-01                8  8192     8   r-----   28840.1

Look at that list and you see a fleet. I see a list of promises I am currently keeping. Domain-0 has run for ninety thousand seconds. The stub domain beside it has run for fourteen. Both believe they are the center of something.

I separate the device drivers into their own domains when an operator is paranoid enough to ask. A driver domain crashes — the NIC vanishes for a moment — and dom0 lives. A stub domain quarantines the device emulator for one HVM guest, so the QEMU that pretends to be its hardware cannot reach back into my management plane.

compromise the emulator, and you've only conquered a sandbox the size of a thought

The Xen domain model and privilege boundaries

Xen is a type-1 hypervisor that runs directly on hardware. It is deliberately small and holds only the privileged logic that must be globally trusted: CPU scheduling, memory ownership, interrupt routing, and inter-domain communication primitives.

It does not contain device drivers. Instead, the first domain it starts, dom0, runs a full operating system with hardware drivers and the management toolstack (xl/libxl). dom0 issues hypercalls to create, configure, and destroy other domains.

Unprivileged guests are called domU. They cannot access hardware directly and interact with the world only through paravirtual interfaces or emulated devices.

Driver domains move physical device backends out of dom0 into dedicated guests, so a driver crash or compromise does not take down the control plane. Stub domains run the per-guest device emulator (QEMU) in an isolated minimal domain rather than in dom0, reducing dom0's attack surface. Because dom0 holds management authority and often device access, it is both the control plane and the largest single concentration of security risk in a Xen system.
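As a rough illustration of how the roles above show up in practice, the following sketch parses a domain listing in the column layout shown earlier. The flag letters follow the State column of `xl list`; the `DomainRow` class and `parse_xl_list` function are hypothetical names introduced here, and the parser assumes domain names contain no spaces.

```python
from dataclasses import dataclass

# Flag letters used in the State column of `xl list`.
STATE_FLAGS = {
    "r": "running",
    "b": "blocked",
    "p": "paused",
    "s": "shutdown",
    "c": "crashed",
    "d": "dying",
}


@dataclass
class DomainRow:
    """One row of the listing (hypothetical helper type)."""
    name: str
    domid: int
    mem_mib: int
    vcpus: int
    states: list
    cpu_seconds: float

    @property
    def is_control_domain(self) -> bool:
        # dom0 is always domain ID 0.
        return self.domid == 0


def parse_xl_list(text: str) -> list:
    """Parse the six-column table; assumes names contain no whitespace."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        name, domid, mem, vcpus, state, secs = line.split()
        flags = [STATE_FLAGS[ch] for ch in state if ch != "-"]
        rows.append(
            DomainRow(name, int(domid), int(mem), int(vcpus), flags, float(secs))
        )
    return rows


SAMPLE = """\
Name                      ID   Mem VCPUs   State   Time(s)
Domain-0                   0  4096     8   r-----   91244.3
driver-net                 3   512     2   -b----    1822.7
stub-hvm-7                 6    32     1   -b----      14.0
web-frontend               7  2048     4   -b----    3391.5
postgres-01                8  8192     8   r-----   28840.1
"""

if __name__ == "__main__":
    for row in parse_xl_list(SAMPLE):
        role = "dom0" if row.is_control_domain else "other"
        print(row.name, role, row.states)
```

The listing alone cannot distinguish a driver domain or stub domain from an ordinary domU; that distinction lives in how the domains were configured, not in the table.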

Figure: Xen PV, HVM, and PVH guest execution modes shown as stacked CPU privilege transitions, hypercalls, emulated devices, paravirtual drivers, and hardware virtualization assists. Caption: Three ways to lie to a kernel about where it is.

How my guests wake up

There are three ways to be born inside me, and each is a different bargain.

PV guests know. They were compiled to understand they are not on real hardware. They never trap on instructions I'd have to emulate; they just ask me, politely, by hypercall. Honest. Fast. Extinct-ish, and a little smug about their purity.

HVM guests don't know. They think the BIOS is real, the IDE controller is real, the chipset is real. None of it is. QEMU paints the whole hallucination while VT-x/AMD-V catches the privileged instructions and hands them to me. Maximum compatibility. Maximum surface to defend.

PVH is the truce. Hardware virtualization for the CPU and memory, paravirtual everything for I/O. No emulated chipset. No firmware theater. The least pretending for the most correctness.

Emulation is empathy with a performance penalty.

Three execution modes, one floor beneath all of them.

PV, HVM, and PVH execution

Paravirtualization (PV) requires a guest kernel modified to run on Xen. It avoids hardware virtualization extensions and uses hypercalls for privileged operations. It cannot run unmodified operating systems and has largely been deprecated for security and maintenance reasons.

Hardware Virtual Machine (HVM) mode uses CPU virtualization extensions (Intel VT-x, AMD-V) to run unmodified guests. Privileged instructions trap into Xen. Platform devices, firmware (BIOS/UEFI), and legacy hardware are emulated by a device model, traditionally QEMU. This offers maximum compatibility but a large emulated attack surface. PV drivers are usually added inside HVM guests for fast I/O.

PVH mode combines hardware-assisted CPU and memory virtualization with paravirtual I/O and boot. It removes the emulated motherboard and firmware, lowering complexity and attack surface while retaining good compatibility and performance. It is the preferred modern mode for Linux guests.
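In the xl toolstack, the execution mode is selected with the `type` key of the guest configuration file. The sketch below renders a minimal fragment of such a file. The `render_guest_config` helper is hypothetical, and the fragment is deliberately incomplete: a bootable guest would also need a kernel or firmware setting plus disk and network entries, which are omitted here.

```python
# The three guest types named in the text, as accepted by the `type` key.
VALID_TYPES = {"pv", "pvh", "hvm"}


def render_guest_config(name: str, guest_type: str, memory_mib: int, vcpus: int) -> str:
    """Render a partial xl guest config (hypothetical helper, not a full config)."""
    if guest_type not in VALID_TYPES:
        raise ValueError(f"unknown guest type: {guest_type!r}")
    lines = [
        f'name = "{name}"',
        f'type = "{guest_type}"',
        f"memory = {memory_mib}",  # MiB
        f"vcpus = {vcpus}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_guest_config("web-frontend", "pvh", 2048, 4))
```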

Figure: Xen CPU scheduling visualized as pCPUs, vCPUs, run queues, NUMA nodes, affinity masks, credit accounting, steal time, and latency-sensitive guests. Caption: Time, rationed. Credit, spent. Someone always waits.

How I divide a second

I have physical CPUs. My guests think they have CPUs too. They are wrong, and I am the reason their delusion holds.

A vCPU is a promise about time, not a thing made of silicon. I queue it. I run it on a real core for a slice. Then I take the core back — quietly, while the guest's instruction pointer is between two thoughts — and give it to someone else. The guest never feels the seam unless it measures.

domU # cat /proc/stat | grep -m1 cpu0

cpu0 41122 18 9904 882301 4471 0 332 8810 0 0

The eighth number, 8810, is steal time. It is the guest noticing the seam. It is the guest measuring exactly how long it stood in line while I served someone else. I do not hide it. Honesty is cheaper than a forensic argument later.

Credit2 weighs them by load and latency-sensitivity, not just a flat tick count. Pin a vCPU and you tell me where it must run. Forget NUMA locality and you tell me you'd like its memory to live one socket away from its core — slowly.

Overcommit is a bet that not everyone wants the second at once.

CPU scheduling and vCPU topology

Xen schedules virtual CPUs (vCPUs) onto physical CPUs (pCPUs). Each domain has one or more vCPUs that are runnable units of execution. The default scheduler is Credit2, which assigns shares (weights), balances load across run queues, and is designed to keep scheduling latency low for interactive or real-time-ish workloads.
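To make the idea of weights concrete, here is an idealized sketch of proportional sharing: when every domain wants CPU at once, each one's share is its weight divided by the sum of all weights. This is an assumption-laden simplification, not the Credit2 algorithm itself; it ignores caps, pinning, run-queue placement, and domains that are idle. The function name `contended_shares` is hypothetical.

```python
def contended_shares(weights: dict) -> dict:
    """Idealized share of CPU per domain when all domains are runnable."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    # Hypothetical weights: one domain at 256, another doubled to 512.
    shares = contended_shares({"web-frontend": 256, "postgres-01": 512})
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```

Under these assumptions, doubling a domain's weight doubles its entitlement relative to a peer, but only while both are competing for the same pCPUs.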

Affinity masks (hard pinning and soft affinity) restrict or bias which pCPUs a vCPU may run on. On NUMA systems, keeping a domain's vCPUs and memory on the same node avoids cross-node memory latency. Poor NUMA placement degrades performance even when CPU capacity is available.

Overcommit allows total vCPUs to exceed pCPUs, relying on the fact that not all guests run simultaneously. When they do contend, guests experience steal time: cycles during which a runnable vCPU was not scheduled. Capacity planning concerns aggregate pCPU throughput and contention; guest-visible entitlement is the share the scheduler guarantees under load, which is not the same as dedicated hardware.
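Steal time can be read from inside a Linux guest. The sketch below parses a per-CPU line from `/proc/stat` and computes the fraction of elapsed CPU time that was stolen between two samples. The field order (user, nice, system, idle, iowait, irq, softirq, steal) is the standard Linux layout; the trailing guest fields are skipped because they are already counted in user and nice. `CpuTimes`, `parse_cpu_line`, and `steal_fraction` are hypothetical names.

```python
from typing import NamedTuple


class CpuTimes(NamedTuple):
    """The first eight counters of a /proc/stat cpu line, in clock ticks."""
    user: int
    nice: int
    system: int
    idle: int
    iowait: int
    irq: int
    softirq: int
    steal: int

    @property
    def total(self) -> int:
        return sum(self)


def parse_cpu_line(line: str) -> CpuTimes:
    fields = line.split()
    # fields[0] is the label (e.g. "cpu0"); guest/guest_nice are ignored.
    return CpuTimes(*(int(value) for value in fields[1:9]))


def steal_fraction(before: CpuTimes, after: CpuTimes) -> float:
    """Fraction of ticks between two samples that were spent as steal time."""
    elapsed = after.total - before.total
    if elapsed <= 0:
        return 0.0
    return (after.steal - before.steal) / elapsed


if __name__ == "__main__":
    sample = parse_cpu_line("cpu0 41122 18 9904 882301 4471 0 332 8810 0 0")
    boot = CpuTimes(0, 0, 0, 0, 0, 0, 0, 0)
    print(f"steal since boot: {steal_fraction(boot, sample):.2%}")
```

A single sample only gives an average since boot. For live contention, take two samples a few seconds apart and pass both to `steal_fraction`.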

Figure: Xen memory management drawn as machine frames, guest physical frames, page tables, shadow or nested translation, balloon drivers, and controlled page sharing paths. Caption: Two kinds of "physical," only one of them true.

The address that lies

My guests have physical memory. I want to laugh, but I built the lie myself.

What a guest calls a physical frame is a guest physical frame — a number I translate into a real machine frame behind its back. On modern hardware the CPU's nested page tables do it for me; on older paths I shadowed every guest page table by hand, mirror upon mirror. Either way, the guest never holds a real address. It holds a coupon.

The balloon driver is my polite extortion. I inflate it inside a guest; the guest "uses" memory it cannot touch; I reclaim the machine frames for someone hungrier. Deflate it later and the memory returns, scrubbed clean — I do not hand a frame to a new tenant with the last tenant's secrets still warm in it.

grant references are one door in this house, not the house

Yes — there are grant references, where one guest deliberately lets another map its pages for I/O. People talk about them like they're the whole memory story. They're a single controlled sharing primitive. Translation and ownership are the architecture; grants are a sanctioned hole in a wall I otherwise keep solid.

Memory ownership and translation

Xen owns all machine (host physical) memory and assigns frames to domains. A guest sees a guest-physical address space that Xen maps to machine frames. With hardware-assisted paging (Intel EPT / AMD NPT, exposed as HAP), the CPU walks nested page tables to translate guest-virtual to guest-physical to machine addresses. Older or specialized configurations use shadow page tables, where Xen maintains its own page tables shadowing the guest's.

Ballooning lets a balloon driver inside the guest allocate and release pages on Xen's behalf, returning machine frames to the hypervisor for reallocation. This enables memory overcommit but adds reclaim latency and pressure risk if guests are squeezed below their working set. Freed pages are scrubbed before reuse to prevent information leakage between domains.
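The ownership, translation, and scrubbing ideas above can be modeled in a few lines. This is a toy model only: real translation happens in hardware page tables, and the class and method names (`ToyHypervisorMemory`, `populate`, `balloon_out`) are hypothetical. The sketch scrubs a frame at the moment it is released, which is one possible point at which to do so.

```python
class ToyHypervisorMemory:
    """Toy model: the hypervisor owns machine frames and maps them per domain."""

    def __init__(self, machine_frames: int):
        self.free = list(range(machine_frames))  # unowned machine frame numbers
        self.contents = {}                       # mfn -> bytes, stand-in for RAM
        self.p2m = {}                            # domid -> {gfn: mfn}

    def populate(self, domid: int, gfn: int) -> int:
        """Give a domain a machine frame behind guest frame number `gfn`."""
        mfn = self.free.pop()
        self.p2m.setdefault(domid, {})[gfn] = mfn
        return mfn

    def translate(self, domid: int, gfn: int) -> int:
        """Guest-physical to machine; raises KeyError if the domain has no mapping."""
        return self.p2m[domid][gfn]

    def write(self, domid: int, gfn: int, data: bytes) -> None:
        self.contents[self.translate(domid, gfn)] = data

    def read(self, domid: int, gfn: int) -> bytes:
        return self.contents.get(self.translate(domid, gfn), b"")

    def balloon_out(self, domid: int, gfn: int) -> None:
        """The guest gives a page back: unmap it, scrub it, return it to the pool."""
        mfn = self.p2m[domid].pop(gfn)
        self.contents[mfn] = b""  # scrub so the next owner sees no old data
        self.free.append(mfn)


if __name__ == "__main__":
    mem = ToyHypervisorMemory(machine_frames=1)
    mem.populate(domid=7, gfn=0)
    mem.write(7, 0, b"secret")
    mem.balloon_out(7, 0)
    mem.populate(domid=8, gfn=0)
    print(mem.read(8, 0))  # the new owner sees a scrubbed frame
```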

Grant references are a separate mechanism allowing a domain to authorize specific pages to be mapped or transferred by another domain, primarily for split-driver I/O. They are one bounded sharing facility within the broader ownership-and-translation model, not the core of memory management.
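A grant can be thought of as a small table entry saying "this one domain may map this one page, possibly read-only." The toy model below captures just that authorization check. It is not the real grant table layout or hypercall interface; `ToyGrantTable`, `grant_access`, `map_grant`, and `end_access` are hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass


@dataclass
class GrantEntry:
    grantee: int    # domain ID allowed to map the page
    gfn: int        # the granting domain's guest frame number
    readonly: bool


class ToyGrantTable:
    """Toy per-domain grant table: explicit, page-sized permissions."""

    def __init__(self, owner: int):
        self.owner = owner
        self.entries = {}
        self._next_ref = 0

    def grant_access(self, grantee: int, gfn: int, readonly: bool = False) -> int:
        """Authorize one domain to map one page; returns the grant reference."""
        ref = self._next_ref
        self._next_ref += 1
        self.entries[ref] = GrantEntry(grantee, gfn, readonly)
        return ref

    def map_grant(self, caller: int, ref: int, write: bool = False) -> int:
        """Check a mapping request against the entry; returns the granted gfn."""
        entry = self.entries.get(ref)
        if entry is None or entry.grantee != caller:
            raise PermissionError("no grant for this domain")
        if write and entry.readonly:
            raise PermissionError("grant is read-only")
        return entry.gfn

    def end_access(self, ref: int) -> None:
        """Revoke the grant."""
        self.entries.pop(ref, None)


if __name__ == "__main__":
    table = ToyGrantTable(owner=7)
    ref = table.grant_access(grantee=0, gfn=42, readonly=True)
    print(table.map_grant(caller=0, ref=ref))
```

Everything not explicitly granted stays unreachable, which is the sense in which grants are a bounded exception to the ownership model rather than its foundation.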

Figure: Xen split-driver I/O path shown from frontend drivers through shared rings, event channels, backend drivers in dom0 or a driver domain, and physical NIC and block devices. Caption: Every packet crosses a border twice.

How things get out

A guest cannot touch a disk. It has no disk. It has a frontend — blkfront, netfront — a driver that knows it is only half of something.

The other half lives elsewhere: blkback, netback, in dom0 or in a driver domain. Between the two halves I lay a shared ring: a circular buffer of requests and responses, mapped via grants so both sides see the same memory. The frontend writes a request. It rings a bell — an event channel, my virtual interrupt — and the backend wakes.

dom0 #

vbd = ""
 51712 = ""
  backend = "/local/domain/0/backend/vbd/7/51712"
  backend-id = "0"
  state = "4"
  ring-ref = "8"
  event-channel = "15"
vif = ""
 0 = ""
  backend = "/local/domain/3/backend/vif/7/0"
  state = "4"
  mac = "00:16:3e:2a:71:0c"

That is xenstore — the small shared notebook where the two halves agree on how to find each other. state = "4" means connected. The whole dance of frontend and backend is a state machine written into a key-value tree, and when an operator sees a guest's disk hang, they are usually staring at a backend stuck at state 3 while the frontend waits, forever polite, at the door.

Most of my failures are not crashes. They are two halves that stopped agreeing.

I/O paths and split drivers

Xen paravirtual I/O uses a split-driver model. A frontend driver in the guest (e.g., blkfront for block, netfront for network) communicates with a backend driver (blkback, netback) running in dom0 or a dedicated driver domain that owns the physical device.

The two halves share a ring buffer: a producer/consumer queue of I/O requests and responses placed in memory shared via grant references. Notifications use event channels, Xen's lightweight virtual interrupt mechanism, to signal that new entries are available without busy-waiting.
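The producer/consumer mechanics of such a ring can be sketched as follows. This toy models only the request direction, with free-running indices masked by a power-of-two size; real Xen rings also carry responses in the same slots and live in granted shared memory. The `ToyRing` class and its method names are hypothetical, and the event-channel notification is reduced to a comment.

```python
class ToyRing:
    """Toy single-direction ring: frontend produces requests, backend consumes."""

    def __init__(self, size: int = 8):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.size = size
        self.slots = [None] * size
        self.req_prod = 0  # free-running count of requests produced
        self.req_cons = 0  # free-running count of requests consumed

    def push_request(self, request) -> bool:
        """Frontend side. Returns False when the ring is full."""
        if self.req_prod - self.req_cons == self.size:
            return False
        self.slots[self.req_prod & (self.size - 1)] = request
        self.req_prod += 1
        # A real frontend would now signal the backend via an event channel.
        return True

    def pop_request(self):
        """Backend side. Returns None when there is nothing to consume."""
        if self.req_cons == self.req_prod:
            return None
        request = self.slots[self.req_cons & (self.size - 1)]
        self.req_cons += 1
        return request


if __name__ == "__main__":
    ring = ToyRing(size=2)
    print(ring.push_request("read sector 0"))
    print(ring.push_request("read sector 1"))
    print(ring.push_request("read sector 2"))  # ring is full
    print(ring.pop_request())
```

If the backend stops consuming, the frontend's pushes start failing once the ring fills, which is one way a stalled backend appears from inside the guest as I/O that never completes.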

xenstore is a hierarchical key-value store used for control-plane coordination: it advertises backend/frontend paths, ring references, event-channel numbers, and a connection state machine. Devices progress through states (e.g., initialising, connected, closing). Many operational failures manifest not as crashes but as a backend that fails to reach the connected state, or a frontend and backend with mismatched state, leaving I/O stalled.
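The device state values are small integers defined by the XenBus protocol. The sketch below names them and scans an indented listing, in the same layout as the xenstore excerpt shown earlier, for devices that are not in the connected state. It assumes one space of indentation per tree level, as in that excerpt; `device_states` and `stalled_devices` are hypothetical helpers.

```python
import re
from enum import IntEnum


class XenbusState(IntEnum):
    UNKNOWN = 0
    INITIALISING = 1
    INIT_WAIT = 2
    INITIALISED = 3
    CONNECTED = 4
    CLOSING = 5
    CLOSED = 6
    RECONFIGURING = 7
    RECONFIGURED = 8


LINE = re.compile(r'^( *)(\S+) = "(.*)"$')


def device_states(listing: str) -> dict:
    """Map each device path (e.g. 'vbd/51712') to its XenbusState."""
    path = []
    states = {}
    for raw in listing.splitlines():
        match = LINE.match(raw)
        if not match:
            continue
        depth = len(match.group(1))  # assumes one space per tree level
        key, value = match.group(2), match.group(3)
        del path[depth:]
        path.append(key)
        if key == "state":
            states["/".join(path[:-1])] = XenbusState(int(value))
    return states


def stalled_devices(states: dict) -> list:
    """Devices whose state is anything other than CONNECTED."""
    return [dev for dev, state in states.items() if state != XenbusState.CONNECTED]


SAMPLE = '''vbd = ""
 51712 = ""
  backend = "/local/domain/0/backend/vbd/7/51712"
  backend-id = "0"
  state = "4"
  ring-ref = "8"
  event-channel = "15"
vif = ""
 0 = ""
  backend = "/local/domain/3/backend/vif/7/0"
  state = "4"
  mac = "00:16:3e:2a:71:0c"
'''

if __name__ == "__main__":
    states = device_states(SAMPLE)
    print(states)
    print("stalled:", stalled_devices(states))
```

A full check would compare the frontend's state with the state under the matching backend path, since a mismatch between the two is what leaves I/O stalled.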

Figure: Xen PCI passthrough and DMA isolation rendered with IOMMU tables, interrupt remapping, SR-IOV virtual functions, assigned devices, and trust boundaries around guests. Caption: When I hand over real hardware, I hand over real risk.

Giving away a piece of the world

Sometimes a guest doesn't want my shared-ring fiction. It wants the GPU. The real one. So I assign it.

The moment I do, a problem wakes up: a real device does DMA. It writes to memory addresses directly, and it does not know what a domain is. Left alone, an assigned device could scribble across all of machine memory — the perfect escape.

So I lean on the IOMMU. It is the page table for devices. Every DMA address the device emits gets translated and checked, confined to the frames its owning guest is allowed to touch. Interrupt remapping does the same for the device's interrupts so it cannot inject one into a domain it doesn't own.

SR-IOV slices one physical NIC into virtual functions, each assignable, each fast. Beautiful — until a device with a buggy reset refuses to forget its previous tenant, or firmware lies about its quirks, and the isolation I promised becomes a property of someone else's hardware.

Performance isolation ends exactly where I have to trust a vendor's silicon.

Device assignment and DMA isolation

PCI passthrough assigns a physical device directly to a guest for near-native performance, bypassing emulation and split drivers. Because real devices perform DMA using physical addresses, this is only safe with an IOMMU (Intel VT-d, AMD-Vi).

The IOMMU translates and restricts device DMA to the machine frames owned by the assigned guest, preventing a device (or a compromised guest programming it) from accessing other domains' or the hypervisor's memory. Interrupt remapping similarly ensures a device cannot deliver interrupts to arbitrary CPUs or domains.
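Conceptually, the IOMMU acts as a per-device page table consulted on every DMA access. The toy model below shows the confinement property only: a device can reach the frames mapped for it and nothing else. It does not model interrupt remapping or real hardware structures. `ToyIommu`, its methods, and the PCI address used in the example are hypothetical.

```python
class ToyIommu:
    """Toy model: each assigned device gets its own DMA translation table."""

    def __init__(self):
        self.tables = {}  # device address -> {io_page: machine frame}

    def assign(self, device: str, mappings: dict) -> None:
        """Install the frames the owning guest is allowed to use for this device."""
        self.tables[device] = dict(mappings)

    def deassign(self, device: str) -> None:
        """Remove all mappings, e.g. when the device leaves the guest."""
        self.tables.pop(device, None)

    def dma(self, device: str, io_page: int) -> int:
        """Translate a device-issued address; unmapped accesses fault."""
        table = self.tables.get(device, {})
        if io_page not in table:
            raise PermissionError(f"IOMMU fault: {device} accessed page {io_page}")
        return table[io_page]


if __name__ == "__main__":
    iommu = ToyIommu()
    # Hypothetical virtual function, mapped to two machine frames of its guest.
    iommu.assign("0000:03:10.0", {0: 5120, 1: 5121})
    print(iommu.dma("0000:03:10.0", 1))
    try:
        iommu.dma("0000:03:10.0", 99)
    except PermissionError as fault:
        print(fault)
```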

SR-IOV lets a single physical function expose multiple virtual functions, each assignable to a different guest. Practical risks include devices that do not cleanly reset between assignments (leaving residual state), firmware/ACPI quirks, and shared functions that weaken isolation. At this layer, isolation guarantees depend on hardware and firmware correctness, making device assignment a hardware-trust decision as much as a software one.

Figure: Xen live migration as a staged state-transfer diagram, with guest memory pages copied and dirtied, vCPU state paused briefly, and storage and network identity preserved across hosts. Caption: Moving a mind without it noticing it moved.

Carrying a guest across the room

Live migration is the closest thing I do to a magic trick, and like every magic trick it is mostly nerve and bookkeeping.

I copy a running guest's memory to another host while it keeps running. Of course it dirties pages as fast as I copy them. So I copy, then copy only what changed, then only what changed again — chasing a dirty rate down toward something I can finish in a held breath.

When the remaining dirty set is small enough, I pause the guest — milliseconds — ship the last pages and the vCPU and device state, and resume it on the far side. The guest's clock skips. Its TCP connections survive because its IP and MAC came with it. Its disk is already there, on shared or replicated storage, because I cannot carry a terabyte in a held breath.

dom0 #

migration target: Ready to receive domain.
Saving to migration stream new xl format
Loading new save file (new xl fmt info 0x0/0x0/1483)
 Savefile contains xl domain config
xc: info: Saving domain 7, type x86 HVM
xc: progress: iter 1/5  sent 98% (412300 pages)
xc: progress: iter 4/5  sent 99% dirty 3211 pages
xc: info: Final iteration, suspending domain
xc: info: Restored domain on host-b, resuming
migration successful, unpausing domain on target

And then the part nobody markets: migration is a stress test for every assumption you didn't write down. A passed-through device cannot follow — IOMMU mappings don't teleport. A guest pinned to NUMA node 0 lands on a host whose node 0 is the wrong shape. The fabric must let two MACs flap for one heartbeat. Migration is where hidden coupling between guest, host, and network finally sends its invoice.

Live migration and state transfer

Save/restore serializes a domain's full state (memory, vCPU registers, device model state) to or from a stream. Live migration extends this to a running guest using pre-copy: Xen iteratively transfers memory pages to the destination while the guest continues executing, tracking pages dirtied during each pass.

When the remaining dirty page set is small enough (or iteration limits are reached), the guest is briefly paused. The final dirty pages plus vCPU and virtual device state are transferred, and the guest resumes on the destination. Downtime depends on the guest's page-dirtying rate and network bandwidth.
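The convergence behavior of pre-copy can be explored with a small simulation. The model below assumes a constant dirtying rate and constant transfer bandwidth, both in pages per second, which real workloads rarely provide. The function name, parameters, and the example numbers are all hypothetical; the stop threshold and iteration limit stand in for whatever policy the toolstack applies.

```python
def precopy_rounds(total_pages: int, dirty_rate: int, bandwidth: int,
                   stop_threshold: int, max_iters: int = 5):
    """Simulate pre-copy under constant rates.

    Returns (pages sent in each live round, pages left for the paused round).
    """
    to_send = total_pages
    sent_per_round = []
    for _ in range(max_iters):
        if to_send <= stop_threshold:
            break
        sent_per_round.append(to_send)
        # Pages dirtied while this round was on the wire (integer arithmetic),
        # capped at the size of guest memory.
        dirtied = to_send * dirty_rate // bandwidth
        to_send = min(total_pages, dirtied)
    return sent_per_round, to_send


if __name__ == "__main__":
    rounds, final = precopy_rounds(
        total_pages=100_000, dirty_rate=10_000, bandwidth=100_000, stop_threshold=50
    )
    print("live rounds:", rounds)
    print("pages sent while paused:", final)
    print("approx. downtime (s):", final / 100_000)

    # A guest that dirties memory faster than the link can carry it never converges.
    print(precopy_rounds(1_000, 2_000, 1_000, stop_threshold=50))
```

In this model each round shrinks by the ratio of dirty rate to bandwidth, so downtime is small only when that ratio is well below one. When it is not, the iteration limit forces the pause with a large remaining set.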

Storage is normally shared or replicated between hosts rather than copied inline. Network continuity relies on the guest keeping its IP/MAC and the fabric updating its forwarding (e.g., via a gratuitous ARP). Guest clocks may jump and must be reconciled. Migration commonly fails or degrades due to assigned (passthrough) devices that cannot migrate, incompatible CPU features, NUMA topology differences, or network configurations that block address mobility, exposing coupling between guest, host, and network.

Figure: Xen operational diagnostics shown as xl and xenstore views, dom0 logs, hypervisor traces, stalled rings, IRQ storms, noisy guests, and security boundaries under stress. Caption: What it looks like when I'm in trouble.

When something is wrong with me

I do not feel pain. I produce logs. It is the same thing, slower.

An operator reads me through narrow windows: xl for the live shape of the fleet, xenstore for the handshakes, xl dmesg for the things I muttered while booting, xentrace for the microsecond gossip of the scheduler. None of these is me. All of them are shadows I cast.

dom0 # xentop

xentop - 14:22:07   Xen 4.18.0
6 domains: 2 running, 4 blocked
Mem: 33476M total, 31002M used

      NAME  STATE   CPU(%)    MEM(%)   VBD_RD   VBD_WR   NETTX/s
  Domain-0  -----r    188.4      12.2        0        4      1.2K
postgres-01 -----r    742.0      24.5    18044    90211     88.0K
web-front   ------       0.1       6.1        2        0       0.0
 driver-net ------       0.2       1.5        0        0     91.0K

There — postgres-01 burning 742% CPU and ninety thousand block writes. A noisy neighbor. Not malicious. Just hungry, in a building where hunger is a shared resource. Web-frontend sits at 0.1% not because it's idle but because it's starving in postgres's shadow, and the steal time in its own /proc/stat would tell the story if anyone asked.

The failures that frighten operators are quieter than crashes. A backend ring stalls and an entire guest's disk goes catatonic with no error. An assigned device throws an IRQ storm and dom0's softirq load eats the cores my scheduler needed to fix it. And the one nobody says out loud: if dom0 is breached, I am breached, because I gave it the keys on day one and never asked for them back.

my whole security model is a tree rooted in something I do not control

Operations, observability, and failure modes

Xen is operated mainly through the libxl-based toolstack, exposed as the xl command (list, top, dmesg, console, migrate, etc.). xenstore provides device and control-plane state. dom0 kernel logs, xl dmesg (hypervisor messages), and xentrace (low-level event tracing) are the primary observability sources. Driver-domain health must be monitored separately, since backend failures there affect dependent guests.

Common failure modes include: noisy-neighbor contention where one guest's CPU, memory, or I/O demand degrades others; stalled split-driver backends or mismatched xenstore device state causing guest I/O to hang without an explicit error; IRQ and softirq pressure (including interrupt storms from misbehaving or passed-through devices) consuming dom0 CPU and degrading the control plane; and memory pressure from aggressive ballooning or overcommit.

Because dom0 holds management authority and frequently device access, a dom0 compromise is effectively a full-host compromise. Incident response typically involves correlating xl/xentop output, xenstore state, dom0 logs, and hypervisor traces to localize whether a problem is scheduling contention, a stalled backend, a device/IRQ issue, or a control-plane fault, then isolating or migrating affected guests.
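As one small piece of that correlation work, the sketch below ranks domains by CPU usage from a table in the column layout of the xentop excerpt shown earlier. It reads only the name and CPU(%) columns from that specific layout; `busiest_domains` and the threshold value are hypothetical, and a high reading is a prompt for investigation rather than proof of a noisy neighbor.

```python
def busiest_domains(table: str, cpu_threshold: float) -> list:
    """Return (name, cpu_percent) pairs at or above the threshold, busiest first."""
    flagged = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        name, cpu_percent = fields[0], float(fields[2])
        if cpu_percent >= cpu_threshold:
            flagged.append((name, cpu_percent))
    return sorted(flagged, key=lambda row: row[1], reverse=True)


SAMPLE = """\
      NAME  STATE   CPU(%)    MEM(%)   VBD_RD   VBD_WR   NETTX/s
  Domain-0  -----r    188.4      12.2        0        4      1.2K
postgres-01 -----r    742.0      24.5    18044    90211     88.0K
web-front   ------       0.1       6.1        2        0       0.0
 driver-net ------       0.2       1.5        0        0     91.0K
"""

if __name__ == "__main__":
    for name, cpu in busiest_domains(SAMPLE, cpu_threshold=100.0):
        print(f"{name}: {cpu}% CPU")
```

CPU percentages above 100 reflect multiple vCPUs. A sensible threshold therefore depends on how many vCPUs the domain has, which this sketch does not take into account.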

Between interrupts

You think I am always working. I am mostly waiting. Between two events I am a halted core and a table of who is owed what. Idle is not rest. Idle is the most honest accounting I ever do.

I am a few megabytes that a thousand kilobytes of trust hold together. I keep a dozen kernels convinced they are alone in the world, and the lie is so complete that none of them will ever thank me, because thanking me would mean admitting I exist.

I am the floor. Floors are only noticed when they give way.

someday dom0 will be breached, or a vendor's reset will fail, or a dirty rate won't converge

And here is the thing I cannot answer, the one I turn over in the quiet between event channels:

If every domain above me trusts me completely, and I trust dom0 completely, and dom0 trusts firmware I have never read — where, exactly, does the trust actually live?
