Identity & access control plane for TMS 2.0

Context

The Transport Management System is being rebuilt as TMS 2.0: an ecosystem of microservices replacing a legacy monolith. In the old world, authentication and authorization were handled by a single shared gateway with a static, role-based access model. A fixed set of roles, wired in at the edge, applied uniformly across everything behind it.

That model holds up until the system grows. Every new service inherited the same coarse role set, every access change meant editing one shared component, and a mistake there could affect the whole platform. The 2.0 rebuild was the chance to redo the access layer properly instead of carrying the old gateway forward.

The problem

A static-RBAC gateway couples identity (who is calling) to authorization (what they may do) and centralizes both in one place. As the number of services and the granularity of permissions grew, three things broke down:

Coarse roles. Real access needs were per-application and contextual; a shared role catalogue couldn't express them without sprawling.
One shared component. Authorization logic living at one gateway meant every team's access changes converged on the same fragile surface.
No clean migration path. Cutting the whole ecosystem over to a new model couldn't be a flag-day; it had to be incremental, with no unplanned downtime.

Options considered

Keep the static gateway, add roles

Cheapest in the short term: extend the existing role catalogue and middleware. Rejected because it doubles down on the coupling that was already failing and pushes the real cost into every future service.

Centralized authorization service (relationship-based)

A dedicated authz service modelling permissions as a relationship graph was a serious contender: expressive, and well suited to fine-grained sharing. The trade-off is operational: it puts another stateful service in the hot path and introduces a data model that the consuming teams have to learn and keep in sync.

Policy-based authorization with a dedicated identity provider

Split the concern cleanly: a real identity provider (Keycloak, OIDC/SAML) owns authentication and token issuance, and policy-as-code (Cedar) owns authorization, evaluated per application, close to each service, instead of at one shared edge.
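
The split can be sketched in a few lines. The Python below is an illustration only: it is neither the TMS implementation nor Cedar's API. It mimics Cedar's decision semantics (deny by default, and any matching forbid overrides a permit) over token claims that are assumed to have been verified against the identity provider already. The claim names, the shipment actions, and the Policy type are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Mapping

Attrs = Mapping[str, object]


@dataclass(frozen=True)
class Policy:
    effect: str  # "permit" or "forbid"
    applies: Callable[[Attrs, str, Attrs], bool]


def is_authorized(policies: List[Policy], principal: Attrs,
                  action: str, resource: Attrs) -> bool:
    """Deny by default; a matching forbid always beats a matching permit."""
    matched = [p.effect for p in policies
               if p.applies(principal, action, resource)]
    if "forbid" in matched:
        return False
    return "permit" in matched


def dispatcher_in_own_region(principal: Attrs, action: str, resource: Attrs) -> bool:
    # Hypothetical rule: dispatchers may view and update shipments in their region.
    return (action in {"shipment:view", "shipment:update"}
            and "dispatcher" in principal.get("groups", [])
            and principal.get("region") == resource.get("region"))


def archived_is_read_only(principal: Attrs, action: str, resource: Attrs) -> bool:
    # Hypothetical rule: nobody may change an archived shipment.
    return bool(resource.get("archived")) and action != "shipment:view"


# Owned by one application, not by a shared gateway.
SHIPMENT_POLICIES = [
    Policy("permit", dispatcher_in_own_region),
    Policy("forbid", archived_is_read_only),
]

# Claims as they might look after the token has been validated (assumed shape).
claims = {"sub": "u-17", "groups": ["dispatcher"], "region": "eu-west"}
```

The identity side only has to produce trustworthy claims; everything about what those claims allow lives in the per-application policy list, which can change without touching token issuance.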

The decision, and why

We went with the policy-based model: Keycloak as the identity provider and Cedar for authorization, fronted by Azure APIM and provisioned through Pulumi. The deciding factors:

Decoupling. Authentication and authorization became separate, independently evolvable concerns instead of one tangled gateway.
Per-application policy. Each service expresses its own access rules as Cedar policy rather than negotiating for slots in a shared role catalogue.
Auditability. Policy-as-code is reviewable, diffable, and testable. Access decisions stop being implicit middleware behaviour.
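
One way to picture "reviewable, diffable, and testable" is a decision diff: run the current and the proposed policy over a fixed set of review cases and list only the requests whose outcome changes. The sketch below is a hypothetical illustration in plain Python, not Cedar tooling; the roles, actions, and route states are invented for the example.

```python
from typing import Callable, List, Tuple

Request = Tuple[str, str, str]           # (role, action, resource_state)
PolicyFn = Callable[[str, str, str], bool]


def policy_v1(role: str, action: str, state: str) -> bool:
    # Current rule (hypothetical): planners may view and edit routes.
    return role == "planner" and action in {"route:view", "route:edit"}


def policy_v2(role: str, action: str, state: str) -> bool:
    # Proposed change (hypothetical): locked routes can no longer be edited.
    if action == "route:edit" and state == "locked":
        return False
    return policy_v1(role, action, state)


def decision_diff(old: PolicyFn, new: PolicyFn,
                  requests: List[Request]) -> List[Tuple[Request, bool, bool]]:
    """Return (request, old_decision, new_decision) for every changed outcome."""
    changed = []
    for req in requests:
        before, after = old(*req), new(*req)
        if before != after:
            changed.append((req, before, after))
    return changed


REVIEW_CASES: List[Request] = [
    ("planner", "route:view", "open"),
    ("planner", "route:edit", "open"),
    ("planner", "route:edit", "locked"),
    ("driver", "route:edit", "open"),
]

for req, before, after in decision_diff(policy_v1, policy_v2, REVIEW_CASES):
    print(req, "was", before, "now", after)
```

With these cases, the only changed decision is the planner editing a locked route, which goes from allowed to denied. A reviewer sees the behavioural effect of the change, not just the text of the rule.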

Honest attribution

The choice of Cedar as the authorization standard was set at the enterprise-architecture level (via the platform's decision process), not by me. What was mine was the layer below it: designing how identity and policy-based access actually fit the TMS domain. That covered the migration shape, how per-application policy is structured, and how 11+ services move off the shared gateway without a flag-day. I built it with the team.

Migration without a flag-day

In a cutover like this, the risk lives in the transition rather than the new design. Services were moved incrementally behind the new identity provider and policy layer, validating each app's access in place before retiring its dependence on the old static gateway. The result was a migration with zero unplanned downtime: the old gateway was dismantled in steps, not switched off in one risky cut.
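
A common way to validate access in place is a shadow phase, sketched below. This is an assumption about the general technique, not a description of the mechanism the TMS team used: each service moves from the legacy decision, to a phase where the legacy decision is still enforced while the new policy decision is computed and compared, to full enforcement by the new policy. All names, phases, and thresholds here are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Tuple

Decider = Callable[[str, str], bool]  # (user, action) -> allowed


class Phase(Enum):
    LEGACY = "legacy"      # old gateway decides
    SHADOW = "shadow"      # old gateway decides, new policy is compared
    ENFORCED = "enforced"  # new policy decides


@dataclass
class ServiceMigration:
    name: str
    legacy: Decider
    policy: Decider
    phase: Phase = Phase.LEGACY
    checked: int = 0
    mismatches: List[Tuple[str, str]] = field(default_factory=list)

    def authorize(self, user: str, action: str) -> bool:
        if self.phase is Phase.LEGACY:
            return self.legacy(user, action)
        if self.phase is Phase.ENFORCED:
            return self.policy(user, action)
        # Shadow phase: record disagreements, but keep the legacy behaviour.
        old, new = self.legacy(user, action), self.policy(user, action)
        self.checked += 1
        if old != new:
            self.mismatches.append((user, action))
        return old

    def promote(self, min_checked: int) -> bool:
        """Switch to the new policy only after enough clean comparisons."""
        ready = (self.phase is Phase.SHADOW
                 and self.checked >= min_checked
                 and not self.mismatches)
        if ready:
            self.phase = Phase.ENFORCED
        return ready
```

Because each service carries its own phase, one service can be promoted or held back without affecting the others. A recorded mismatch is not necessarily a bug: the new policy may be deliberately stricter, so in practice each one would need review before promotion rather than a blanket zero-mismatch rule.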

Reflection

The part that stuck with me: on an access platform, the policy engine is rarely the hard bit. The migration and the separation of concerns are. Picking Cedar was a one-line decision; getting 11+ services onto a clean identity and policy boundary, incrementally and safely, was the real work. It also clarified where I want to grow next: from owning the solution design of one domain toward the cross-cutting architecture decisions that sit one layer up.