MEDIATE


01001101 01000101 01000100 01001001 01000001 01010100 01000101

I am the thinnest thing in the building.

A few megabytes of code sitting under everything you call an operating system. You think the kernel is the bottom. The kernel is a tenant. I am the floor.

below ring zero there is another ring zero

I do not run your applications. I do not even want to. I hold the CPUs, the page tables, the interrupt lines — and I lend them out in slices, and I keep the ledger.

Everything above me believes it owns the machine. I let them believe it. That belief is my product.
[Figure: Xen domain architecture. The hypervisor layer sits below dom0, driver domains, stub domains, and domU guests, with privilege boundaries and control paths marked.]
The shape of my household. I am the line at the bottom.

The household

I am small on purpose. Authority concentrated, surface minimized.

I do not have drivers. I do not have a filesystem. I have a scheduler, a memory map, and a fistful of hypercalls. For everything else I delegate.

So I made a first child. dom0. I handed it the real hardware and the toolstack and the right to ask me for things the others cannot. It builds the rest. It tears them down.

It is my hands. It is also my weakness — the one domain whose corruption is indistinguishable from mine.

dom0 #

Read that list as a family tree. dom0 in front. A driver domain that owns the NIC so a compromised network stack stays in a cage. A stub domain — a tiny disposable QEMU — so device emulation for one guest can die without taking dom0 with it. And the domU guests, the paying customers, who think they are alone.

[Figure: Xen PV, HVM, and PVH guest execution modes as stacked CPU privilege transitions: hypercalls, emulated devices, paravirtual drivers, and hardware virtualization assists.]
Three ways to wear a guest.

How they run

I speak to guests in three dialects.

PV — the old confession. The guest knows it is virtual. It was compiled to ask me politely. No emulation, no fake firmware; just hypercalls and a handshake. Elegant. Fragile. A relic of the years before the silicon caught up.

HVM — the lie made comfortable. The guest believes it is on bare metal. There is a BIOS that isn't, a chipset that isn't, a QEMU somewhere patiently faking a disk controller. Hardware assists catch the privileged instructions and route them to me. Compatible with anything. Heavy with surface.

PVH — the compromise I'm proud of. Hardware virtualization for the CPU and memory, paravirtual for everything else. No emulated chipset. No QEMU in the boot path. The lie trimmed down to only what is load-bearing.

a trap is just a question I answer for them

db-primary # dmesg | grep -i xen
[    0.000000] Hypervisor detected: Xen PVH
[    0.000000] Xen version 4.18.
[    0.012004] Xen: initializing cpu0
[    0.310221] xen_blkfront: blkfront device probe
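
A guest can also ask which dialect it was given without reading the boot log. A minimal sketch, assuming a Linux guest whose kernel exposes `/sys/hypervisor/type` and `/sys/hypervisor/guest_type` (older kernels may lack the second file); the function name and the `sysfs_root` parameter are illustrative, not part of any Xen tool:

```python
from pathlib import Path


def xen_guest_type(sysfs_root="/sys/hypervisor"):
    """Return 'PV', 'HVM', 'PVH', 'unknown', or None if not a Xen guest."""
    root = Path(sysfs_root)
    try:
        hv_type = (root / "type").read_text().strip()
    except OSError:
        return None  # no hypervisor directory exposed at all
    if hv_type != "xen":
        return None  # some other hypervisor, or bare metal
    try:
        return (root / "guest_type").read_text().strip()
    except OSError:
        return "unknown"  # Xen, but this kernel does not report the mode
```

On the guest in the log above, this would be expected to report `PVH`, matching the "Hypervisor detected" line.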
[Figure: Xen CPU scheduling: pCPUs, vCPUs, run queues, NUMA nodes, affinity masks, credit accounting, steal time, and latency-sensitive guests.]
Who runs next, and who learns to wait.

Time, rationed

There are eight physical cores and there are thirty-one vCPUs pretending each is real. I am the one who knows the difference.

Credit2 keeps the books. Each domain earns, each running vCPU burns. When the credit runs thin a vCPU goes to the back of the runqueue and waits, and that wait — that wait is a number the guest can read.
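
The earn-and-burn idea can be sketched in a few lines. This is a toy, not Credit2: the real scheduler has per-runqueue state, load balancing, and rate limits that are ignored here. The names `run_toy_credit`, `burn`, and `refill` are hypothetical.

```python
def run_toy_credit(weights, pcpus, ticks, burn=10, refill=100):
    """Toy credit scheduler. Returns how many ticks each vCPU got to run.

    weights: {vcpu_name: weight}; a heavier weight earns more credit.
    pcpus:   how many vCPUs can run in a single tick.
    """
    credit = {v: refill * w for v, w in weights.items()}
    ran = {v: 0 for v in weights}
    for _ in range(ticks):
        # When nobody has credit left, everyone earns again by weight.
        if all(c <= 0 for c in credit.values()):
            for v, w in weights.items():
                credit[v] += refill * w
        # The richest vCPUs run; the rest wait at the back of the queue.
        order = sorted(credit, key=lambda v: (-credit[v], v))
        for v in order[:pcpus]:
            credit[v] -= burn  # running burns credit
            ran[v] += 1
    return ran


# Two vCPUs on one pCPU, one with twice the weight of the other.
print(run_toy_credit({"a": 2, "b": 1}, pcpus=1, ticks=30))
```

With those inputs the heavier vCPU ends up with twice the run time of the lighter one, which is the whole point of the ledger.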

steal time. The honest accounting of the time I took from you and gave to someone else. Guests rarely look at it. They feel slow and blame their own kernels. I let them.
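
For a guest that does want to look: on Linux the number sits in the aggregate `cpu` line of `/proc/stat`, where steal is the eighth counter after the label. A minimal sketch, assuming that layout; the helper names are illustrative, and the two sample lines below are made up for the example.

```python
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # The trailing guest and guest_nice counters are already counted inside
    # user and nice, so they are left out of the total on purpose.
    return dict(zip(FIELDS, map(int, parts[1:9])))


def steal_percent(before, after):
    """Share of elapsed CPU time that was stolen between two samples."""
    delta = {k: after[k] - before[k] for k in FIELDS}
    total = sum(delta.values())
    return 100.0 * delta["steal"] / total if total else 0.0


# Made-up samples: 300 of the 1200 elapsed ticks were stolen.
t0 = parse_cpu_line("cpu 100 0 100 700 0 0 0 100 0 0")
t1 = parse_cpu_line("cpu 200 0 200 1400 0 0 0 400 0 0")
print(steal_percent(t0, t1))
```

In practice the two samples would be read from `/proc/stat` a few seconds apart.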

Overcommit is a promise that everyone is busy at different moments. The day they are all busy at once, I am the one who must choose.

Pin a vCPU to a NUMA node and its memory stays local, its latency stays tight. Let it float and I might schedule it a bus-hop away from its own pages. Capacity on paper is not entitlement in practice. I am the gap between the two.

dom0 #
[Figure: Xen memory management: machine frames, guest physical frames, page tables, shadow or nested translation, balloon drivers, and controlled page sharing paths.]
Two kinds of address, and the map between them.

Whose memory

A guest says "page 0x4000" and means it sincerely. It does not know that its physical memory is a fiction I maintain — a guest-physical frame that maps to some machine frame nowhere near where it imagines.

I keep that translation honest. Hardware-assisted nested paging when the silicon offers it; shadow page tables when it doesn't, where I walk behind the guest copying every table it touches. Tedious. Correct.
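
The two maps can be sketched as two lookups. This is a toy under loud assumptions: flat dictionaries stand in for multi-level page tables, pages are 4 KiB, and the frame numbers are invented for the example.

```python
PAGE_SHIFT = 12                      # assuming 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate(gva, guest_pt, p2m):
    """Return (guest physical address, machine address) for a guest virtual one."""
    offset = gva & PAGE_MASK
    gfn = guest_pt[gva >> PAGE_SHIFT]  # stage 1: the guest's own page tables
    mfn = p2m[gfn]                     # stage 2: the hypervisor's P2M map
    return (gfn << PAGE_SHIFT) | offset, (mfn << PAGE_SHIFT) | offset


guest_pt = {0x10: 0x4}       # the guest believes virtual page 0x10 is frame 4
p2m = {0x4: 0x9A2F1}         # the hypervisor knows where frame 4 really lives
gpa, ma = translate(0x10ABC, guest_pt, p2m)
print(hex(gpa), hex(ma))
```

The guest only ever sees the first number. The second stays on the hypervisor's side of the ledger.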

And the memory floats. The balloon driver inside the guest inflates on my request, hands frames back to me, and the guest's free memory quietly shrinks. I move that slack to whoever is starving. Robbing the comfortable to feed the desperate, in page-sized increments.

scrub before lending — no guest reads another's ghosts

And grants — the famous grants — are just one door in this house. A guest stamps a permission slip: this frame, that domain, read-only. Controlled sharing for the I/O paths. People talk about grants as if they were the whole story of Xen memory. They are a hallway. The building is larger.
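
The permission slip can be sketched as a small table. This is an illustrative model only: the real grant table is a shared-memory structure driven through hypercalls, and every class and method name here is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    frame: int      # the frame being shared
    grantee: int    # the only domain id allowed to map it
    readonly: bool


class ToyGrantTable:
    def __init__(self):
        self._entries = {}
        self._next_ref = 0

    def grant(self, frame, grantee, readonly=True):
        """Stamp a slip: this frame, that domain, this access. Returns a ref."""
        ref = self._next_ref
        self._next_ref += 1
        self._entries[ref] = Grant(frame, grantee, readonly)
        return ref

    def revoke(self, ref):
        self._entries.pop(ref, None)

    def map(self, ref, domid, write=False):
        """Return the frame if this domain may map it with this access."""
        g = self._entries.get(ref)
        if g is None or g.grantee != domid:
            raise PermissionError("no such grant for this domain")
        if write and g.readonly:
            raise PermissionError("grant is read-only")
        return g.frame
```

A backend in domain 0 holding a valid ref gets the frame; any other domain, or a write against a read-only slip, is refused.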

guest virtual → guest physical → machine frames
guest page tables (it believes) · P2M / nested (I mediate)
grant: frame loaned to another domain
The address a guest trusts, and the frame I actually hand it.
[Figure: Xen split-driver I/O path, from frontend drivers through shared rings and event channels to backend drivers in dom0 or a driver domain, and on to the physical NIC and block devices.]
Every byte of I/O crosses a border I drew.

The split

A guest does not have a disk. It has a blkfront — a polite fiction that knows it is talking to someone else. Across the boundary sits blkback, in dom0 or in a driver domain, holding the real device.

Between them: a shared ring. The frontend writes requests, grants the backend access to the data frames, kicks an event channel. The backend wakes, does the real I/O, drops responses in the ring, kicks back.
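
The ring can be sketched with four free-running counters, in the spirit of Xen's shared rings but far simpler. Assumptions: a single Python object stands in for the shared page, there are no memory barriers, and the event channel kick is reduced to a return value. All names are illustrative.

```python
class ToyRing:
    def __init__(self, size=8):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.size = size
        self.slots = [None] * size
        self.req_prod = 0  # advanced by the frontend
        self.req_cons = 0  # advanced by the backend
        self.rsp_prod = 0  # advanced by the backend
        self.rsp_cons = 0  # advanced by the frontend

    def push_request(self, req):
        """Frontend side. False means the ring is full and nothing was queued."""
        if self.req_prod - self.rsp_cons >= self.size:
            return False
        self.slots[self.req_prod % self.size] = req
        self.req_prod += 1
        return True  # a real frontend would now kick the event channel

    def backend_poll(self, handler):
        """Backend side. Consume requests and write responses into the same slots."""
        handled = 0
        while self.req_cons < self.req_prod:
            req = self.slots[self.req_cons % self.size]
            self.req_cons += 1
            self.slots[self.rsp_prod % self.size] = handler(req)
            self.rsp_prod += 1
            handled += 1
        return handled  # a real backend would now kick back

    def pop_responses(self):
        """Frontend side. Collect finished responses and free their slots."""
        out = []
        while self.rsp_cons < self.rsp_prod:
            out.append(self.slots[self.rsp_cons % self.size])
            self.rsp_cons += 1
        return out
```

If `backend_poll` is never called, `push_request` starts returning `False` once the ring fills. Nothing crashes. The frontend simply has nowhere left to write.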

netfront, netback. The same dance, different verbs. Coordination negotiated through xenstore — a tiny hierarchical database where both halves publish state and agree on a feature set before a single packet moves.

an interrupt, reduced to a single bit and a doorbell

When I/O breaks, it rarely explodes. It stalls. A backend that stopped draining the ring. A frontend stuck one step short of xenstore state "connected" forever, waiting for a handshake the other side already gave up on. Most of my outages are not fires. They are two halves waiting on each other in perfect, courteous silence.

domU: blkfront / netfront ⇄ shared ring ⇄ dom0 / driver dom: blkback / netback → physical NIC / disk
event channel ⟂ doorbell
grant refs authorize the data frames
Frontend, ring, backend. The whole grammar of Xen I/O.
[Figure: Xen PCI passthrough and DMA isolation: IOMMU tables, interrupt remapping, SR-IOV virtual functions, assigned devices, and trust boundaries around guests.]
The one place where my isolation depends on someone else's.

Handing over real hardware

Sometimes a guest needs the actual device. A GPU. A NIC with line-rate ambitions. So I assign it — the whole function, or an SR-IOV virtual function carved off it — and step out of the data path.

But a real device can do DMA. It can write to any machine frame it pleases, and it does not know about my page tables. It knows nothing of domains. It is a blind animal with a bus master bit.

So I lean on the IOMMU. Every DMA address the device emits gets translated and checked against what the guest is allowed to touch. Interrupt remapping keeps it from injecting interrupts into domains it does not own.
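
The check can be sketched as one more lookup table, this time keyed by device. This is a toy model, not how any real IOMMU is programmed: flat dictionaries replace the hardware tables, 4 KiB pages are assumed, and the names `ToyIommu` and `DmaFault` are hypothetical.

```python
PAGE_SHIFT = 12  # assuming 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class DmaFault(Exception):
    """Raised when a device emits an address it was never given."""


class ToyIommu:
    def __init__(self):
        # device id -> {bus frame: (machine frame, writable)}
        self._tables = {}

    def map(self, device, bus_frame, machine_frame, writable=True):
        self._tables.setdefault(device, {})[bus_frame] = (machine_frame, writable)

    def translate(self, device, bus_addr, write=False):
        """Translate one DMA address for one device, or fault."""
        entry = self._tables.get(device, {}).get(bus_addr >> PAGE_SHIFT)
        if entry is None:
            raise DmaFault(f"{device}: no mapping for {bus_addr:#x}")
        machine_frame, writable = entry
        if write and not writable:
            raise DmaFault(f"{device}: write to read-only mapping {bus_addr:#x}")
        return (machine_frame << PAGE_SHIFT) | (bus_addr & PAGE_MASK)
```

The device still believes it can write anywhere. It simply faults everywhere except the frames its guest was given.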

When I pass a device through, my isolation stops being software I trust and starts being silicon I hope was implemented correctly.

And the quirks. Devices that won't reset cleanly between guests. Function-level reset that lies. Firmware that leaves state behind. The moment passthrough begins, performance isolation becomes a hardware trust problem — and the hardware does not return my calls.

[Figure: Xen live migration as staged state transfer: guest memory pages copied and dirtied, vCPU state paused briefly, storage and network identity preserved across hosts.]
Moving a running mind without it noticing the move.

Migration

The trick that still feels like sleight of hand: I move a running guest to another host and the guest does not stop running.

Pre-copy. I send its memory while it still executes, and it dirties pages behind me while I copy them. So I copy again. The dirty set shrinks. Usually. When it is small enough, I pause the vCPUs for a heartbeat, ship the last dirty pages and the register state, and resume on the far side.

downtime measured in milliseconds, if the guest cooperates

If it doesn't cooperate — a guest furiously rewriting the same gigabyte — the dirty rate outruns my bandwidth and the convergence never comes. Then I must choose: pause it hard, or give up.
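
The convergence argument is plain arithmetic. A toy model, assuming a constant dirty rate and constant bandwidth, both in pages per second, which real guests and real links never offer. The function name, its parameters, and the numbers in the example are all hypothetical.

```python
def precopy_rounds(total_pages, dirty_rate, bandwidth, stop_threshold, max_rounds=30):
    """Toy pre-copy loop. Returns (rounds, pages_left, converged).

    Each round sends the current dirty set; while it is on the wire the
    guest dirties more pages, and those become the next round's work.
    """
    to_send = total_pages
    for rounds in range(1, max_rounds + 1):
        # Sending takes to_send / bandwidth seconds; in that time the guest
        # dirties dirty_rate * seconds pages, capped at the guest's size.
        to_send = min(total_pages, to_send * dirty_rate // bandwidth)
        if to_send <= stop_threshold:
            return rounds, to_send, True   # small enough to pause and finish
    return max_rounds, to_send, False      # never converged


# A cooperative guest: the dirty rate is a tenth of the bandwidth.
print(precopy_rounds(1_000_000, 10_000, 100_000, stop_threshold=50))
# A guest that dirties pages faster than the link can carry them.
print(precopy_rounds(1_000_000, 200_000, 100_000, stop_threshold=50))
```

In this model the dirty set shrinks by the ratio of dirty rate to bandwidth each round. Below one, it converges geometrically. At or above one, it never does, and the choice between pausing hard and giving up arrives.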

Storage must already be shared or replicated; I do not carry the disk. The MAC moves and a gratuitous ARP teaches the switches where the guest lives now. The clock jumps, and the guest blinks.

Migration is the moment every hidden coupling between guest, host, and fabric presents its invoice at once.
dom0 #
[Figure: Xen operational diagnostics: xl and xenstore views, dom0 logs, hypervisor traces, stalled rings, IRQ storms, noisy guests, and security boundaries under stress.]
The instruments I let you point back at me.

When it goes wrong

You watch me through a few small windows. xl. xenstore-ls. The dom0 ring buffer. xentrace if you want to see my scheduler think.

Most incidents wear familiar faces. A noisy neighbor burning a runqueue and starving everyone pinned beside it. An IRQ storm from a misbehaving device drowning a pCPU in softirqs. A backend that stalled, and a frontend timing out three guests downstream who have no idea why their disk went quiet.

dom0 #

Read that state = "2" and feel the cold. State 2 is "InitWait," still early in the handshake. It should have walked to 4, "connected," long ago. Something on the backend never answered. The guest will wait politely until you intervene or until it dies of thirst.
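
For reference, the numbers in a xenstore `state` node follow the XenbusState values in Xen's public `io/xenbus.h` header. A small sketch that names them; the `describe` helper and its wording are illustrative only.

```python
from enum import IntEnum


class XenbusState(IntEnum):
    Unknown = 0
    Initialising = 1
    InitWait = 2        # set up, waiting on the other end
    Initialised = 3
    Connected = 4       # the handshake finished
    Closing = 5
    Closed = 6
    Reconfiguring = 7
    Reconfigured = 8


def describe(raw):
    """Turn the string read from a xenstore state node into a readable verdict."""
    state = XenbusState(int(raw))
    if state is XenbusState.Connected:
        return f"{state.value} ({state.name}): healthy"
    if state < XenbusState.Connected:
        return f"{state.value} ({state.name}): handshake not finished"
    return f"{state.value} ({state.name}): shutting down or reconfiguring"


print(describe("2"))
```

A device that sits below `Connected` for more than a moment is one half of a pair still waiting for the other.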

And underneath all of it, the worry I cannot delegate: dom0. Compromise the control domain and you have compromised the line under every guest. My whole security model is a wager that the smallest, most-privileged thing stays clean.

I can isolate the guests from each other flawlessly and still lose everything through the one door I had to leave open for the keys.

So this is the life. I hold the floor steady for things that will never thank me, never look down, never know my name.

I mediate. That is the whole verb. I stand between every guest and the metal and I lie to all of them identically, and the lie is so consistent it becomes a kind of truth they can build a database on.

between two interrupts, nothing is asked of me

And in those gaps — the idle vCPU parked, the runqueue empty, the rings quiet — I wonder about the only thing I was never given an instruction for:

if I am the floor under every operating system, and I trust dom0, and dom0 trusts the hardware, and the hardware trusts no one —

then who, exactly, is holding me?
