The Demo Worked — That Was the Problem (Zero Trust on K8s)

Over a weekend I built a small Kubernetes demo to play with zero trust. Three little services calling each other in a chain, a login page in front, and a service mesh underneath. The point was to see, in a small concrete way, what “zero trust” actually buys you when you stop assuming anything inside your network is safe.

The interesting part wasn’t getting the chain to return a success response. The interesting part was watching it return a success response when half the security policies were missing, and not noticing for twenty minutes.

This post is about the three things I got wrong before I started to understand the idea.

Figure: Zero-trust mesh architecture, the shape of the thing. A user signs in at Keycloak, then talks to a public entry point, which forwards to service A, which calls service B, which calls service C. The mesh does the encryption and the identity checks between the services.

The shape of the thing

Three services named A, B, and C. A calls B. B calls C. Each one is a small Python web service. In front of all of them is a Next.js front-end that uses Keycloak, an open-source identity server, so a real user (I used “alice”) signs in with a username and password and gets a token.

Each service in the chain doesn’t reuse alice’s token. It swaps it at Keycloak for a new one that’s only valid for the next hop. So A holds a token that B will accept, and B holds a different token that C will accept. If any one of those tokens leaks, it’s only useful for the one place it was minted for.
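A minimal sketch of what that swap could look like from service A's side, assuming the exchange follows the OAuth 2.0 Token Exchange grant (RFC 8693). The client ID, secret, and audience values are placeholders, not the demo's real configuration, and the function only builds the form body; it does not contact Keycloak.

```python
from urllib.parse import urlencode

# Identifiers defined by RFC 8693 (OAuth 2.0 Token Exchange).
TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
ACCESS_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:access_token"


def build_exchange_request(incoming_token: str, client_id: str,
                           client_secret: str, next_hop_audience: str) -> dict:
    """Form fields asking the identity server for a token scoped to the next hop."""
    return {
        "grant_type": TOKEN_EXCHANGE_GRANT,
        "client_id": client_id,
        "client_secret": client_secret,
        "subject_token": incoming_token,
        "subject_token_type": ACCESS_TOKEN_TYPE,
        "audience": next_hop_audience,
    }


# Service A preparing to call service B. Every value here is a placeholder.
form = build_exchange_request("token-from-alice", "service-a", "s3cret", "service-b")
body = urlencode(form)  # this body would be POSTed to the realm's token endpoint
```

The part that matters is the `audience` field: the token that comes back is meant for exactly one destination, which is what limits the damage if it leaks.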

Underneath the services runs Istio ambient mesh. The mesh adds two things I didn’t have to write. It encrypts every connection between pods automatically, so A’s traffic to B can’t be read even by other things in the same cluster. And it puts a small checkpoint in each namespace that can examine the user’s token, and the calling pod’s identity, before letting a request through.

That’s enough architecture to follow the rest of this post.

The first thing I got wrong

I learned that “secure” and “appears to work” are not the same signal, and I needed a system that complained loudly when they came apart.

I had just torn out the authorization policy for service B to try something. Every curl I sent still came back with a 200 OK and the right-looking JSON. The chain kept hopping cleanly from A to B to C. Nothing was checking the token. Nothing was checking the caller. From the outside it all looked fine.

The reason was small but worth saying out loud. The application code in B was reading the token’s claims so it could include them in the response. It was not verifying the token’s signature. That was supposed to be the mesh’s job, and the mesh wasn’t doing it yet. Without the mesh policy, the token could have been any junk and the chain would still have happily returned the right shape.
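To make that concrete, here is a rough sketch of the kind of claim-reading code that causes this. It is not the demo's actual handler, and the function names are mine. It decodes the middle segment of a JWT and never looks at the signature, so a hand-forged token is read just as happily as a real one.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, the way JWT segments are encoded."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def read_claims_unverified(token: str) -> dict:
    """Decode the JWT payload only. The signature segment is never examined."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore the stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


# A token nobody signed: a plausible header, chosen claims, and junk at the end.
header = b64url(json.dumps({"alg": "RS256", "typ": "JWT"}).encode())
payload = b64url(json.dumps({"sub": "alice", "aud": "service-b"}).encode())
forged = f"{header}.{payload}.not-a-real-signature"

claims = read_claims_unverified(forged)
```

Reading claims this way is fine for display. It only becomes a hole when nothing else in the path verifies the signature first.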

What I changed: the install script now refuses to declare itself done until every test in a small list of negative tests is refused the way it should be. Send no token, that should return 401. Send a token meant for a different service, that should return 403. Spin up a pod with no permission and have it call B, that should return 403. Skip a hop and call C directly from A, that should return 403. The positive path is easy to get right by accident. The negative paths are not.
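The shape of that gate, sketched in Python under the assumption that something else knows how to send each request. The `send` callable and the two stand-in clusters are hypothetical, and the real install script may be structured quite differently.

```python
from typing import Callable

# (description, expected HTTP status), taken from the four cases above.
NEGATIVE_CASES = [
    ("no token", 401),
    ("token minted for a different service", 403),
    ("pod with no permission calls B", 403),
    ("A skips B and calls C directly", 403),
]


def failed_negative_tests(send: Callable[[str], int]) -> list[str]:
    """Return the cases whose observed status differs from the expected one."""
    failures = []
    for name, expected in NEGATIVE_CASES:
        got = send(name)
        if got != expected:
            failures.append(f"{name}: expected {expected}, got {got}")
    return failures


def wide_open(_case: str) -> int:
    """Stand-in for a cluster with a policy missing: everything succeeds."""
    return 200


def enforcing(case: str) -> int:
    """Stand-in for a cluster that refuses each case with the right status."""
    return dict(NEGATIVE_CASES)[case]
```

An installer built on this would exit non-zero whenever `failed_negative_tests` returns anything, so a cluster that answers 200 to everything can never report itself as done.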

The bigger point I keep coming back to: in a zero trust system, the requests that fail when they should fail are the only proof you have that anything is actually being enforced. A success response on its own tells you almost nothing.

The second thing I got wrong

One identity check is rarely enough.

Every request in this system carries two answers to the question “who is calling?” One is the user. Who signed in. What their token was issued for. The other is the workload. Which pod, in which namespace, with which Kubernetes service account is on the other end of the connection.

These are not the same thing. A request can have a perfectly valid user token but come from a pod that has no business making the call. Or it can come from the right pod but with a token that wasn’t issued for the destination service.

For a while my policies were checking one or the other. Either “the caller must be service A’s pod” or “the token must be intended for service B.” Each check looked correct on its own. Each check let through requests the other one would have stopped.

The fix was to do both checks in the same rule, in a way that ties them together. The policy now says, in plain terms: this call is only allowed if the connection is coming from service A’s pod, and the token is intended for service B, and the token was originally issued for service A’s OAuth client. Three conditions. None of them prove anything alone. Together, they’re hard to forge without compromising more than one thing at the same time.
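Here is the same rule modelled as a small Python predicate, purely to show the logic. The real enforcement lives in mesh policy, not in code like this. The principal string and client names are made up, and mapping “originally issued for service A's OAuth client” to the token's `azp` claim is my assumption about how that condition would be expressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    peer_principal: str    # workload identity the mesh verified on the connection
    audience: str          # the token's "aud" claim
    authorized_party: str  # the token's "azp" claim (assumed mapping)


# Hypothetical values for the hop from service A to service B.
EXPECTED_PRINCIPAL = "cluster.local/ns/svc-a/sa/service-a"
EXPECTED_AUDIENCE = "service-b"
EXPECTED_CLIENT = "service-a"


def allow_call_to_b(req: Request) -> bool:
    """Allow only when all three conditions hold at the same time."""
    return (
        req.peer_principal == EXPECTED_PRINCIPAL
        and req.audience == EXPECTED_AUDIENCE
        and req.authorized_party == EXPECTED_CLIENT
    )


good = Request(EXPECTED_PRINCIPAL, "service-b", "service-a")
wrong_pod = Request("cluster.local/ns/default/sa/intruder", "service-b", "service-a")
wrong_audience = Request(EXPECTED_PRINCIPAL, "service-c", "service-a")
wrong_client = Request(EXPECTED_PRINCIPAL, "service-b", "some-other-client")
```

Only `good` passes. Each of the other three gets exactly two conditions right, which is the situation my earlier one-check-at-a-time policies would have let through.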

I don’t think there’s a clever lesson hiding in this one. It’s the boring truth that defense in depth means the layers actually have to overlap. One layer checking one thing is not depth. It’s a wall with a known door.

Figure: Per-hop view of L4 and L7 checks. The mesh checks the workload’s identity on the connection at one layer, and the user token at another. Both checks have to agree before a call is allowed through.

The third thing I got wrong

The front-end of the demo has a card that lists each hop in the chain, showing where it came from and where it’s going next. I’d been showing this card to people as the “proof of zero trust” part of the project. It looked good. It looked authoritative.

Then I sat with it long enough to notice that those identity strings come from HTTP headers the application code sets itself. They’re informational. Nothing verifies them. An attacker who controlled service A could write whatever they wanted into those headers and the card would happily display the lie. The audit log I was proud of was something the audited party had volunteered.

The real record was sitting one layer below the application. Every time the mesh terminates an encrypted connection, it writes a log line with both ends’ identities. Not the ones the application asked it to display, but the ones it just verified during the handshake. The application can’t tamper with that line because the application never sees it. It comes from the layer that did the work.
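A rough sketch of pulling both identities out of such a line. The sample line and its `src.identity` / `dst.identity` field names are an assumption about the log format, written here for illustration. Check them against what your mesh actually emits before relying on anything like this.

```python
import re

# Matches key=value pairs where the value is either quoted or a bare word.
FIELD = re.compile(r'([\w.]+)=(?:"([^"]*)"|(\S+))')


def verified_identities(line: str) -> tuple[str, str]:
    """Return (source identity, destination identity) from one access-log line."""
    fields = {}
    for match in FIELD.finditer(line):
        key, quoted, bare = match.groups()
        fields[key] = quoted if quoted is not None else bare
    return fields["src.identity"], fields["dst.identity"]


# Hypothetical log line, not captured from a real cluster.
SAMPLE = (
    'info access connection complete '
    'src.identity="spiffe://cluster.local/ns/svc-a/sa/service-a" '
    'dst.identity="spiffe://cluster.local/ns/svc-b/sa/service-b" '
    'direction="inbound"'
)

src, dst = verified_identities(SAMPLE)
```

Whatever the exact format turns out to be, the auditor view is built from values the mesh wrote after the handshake, never from headers the services supplied.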

I rewrote that part of the demo to keep the pretty card for humans but route the auditor view to the mesh log. That’s the version that holds up if someone is looking for evidence and not for reassurance.

The thing I want to remember from this one: when something tells you who called whom, ask which layer it heard that from. If the answer is “from the thing being audited,” it isn’t an audit. It’s a self-report.

The browser flow

This is mostly here because people ask to see it before they read the README.

Figure: Sign-in page. The button on the landing page sends you through Keycloak before anything else happens.

Figure: Keycloak login. alice signs in with her password and gets back a token that’s only valid for service A.

Figure: After sign-in. The Send Request button forwards alice’s token to service A.

Figure: Identity trace. The trace card shows each hop. It’s friendly. The version that holds up in court is in the mesh log, not in this card.

What I’d tell someone starting

If you’re building one of these yourself, here are three small things I’d suggest before you write a line of YAML.

Write the curl that should fail first. Before any service returns a single 200, write down the requests that ought to be refused, run them, and watch them be refused. The positive path is the easiest thing to get right by accident.

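Don’t check identity in one place. Check it in two, at the network layer (which workload is on the other end of the connection) and at the application layer (which user token the request carries), and make sure the answers agree. The most credible attacks are the ones where each lock looks locked individually.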

Look at where your audit log comes from. If it comes from the same code that handled the request, it’s a self-report. If it comes from a layer beneath the application, it’s evidence. There’s a real difference, and your auditors know which is which.

The code is here if you want to poke at it: github.com/sprider/zero-trust-mesh-demo. It’s deliberately small. The most useful exercise is to delete a policy and see what still works. That’s where the lesson lives.

If anything in this resonated, find me on LinkedIn.
