
Building repeatable AWS environments for a multi-service platform under delivery pressure

How I used Terraform, ECS/Fargate, Cognito, and reusable AWS modules to make dev, staging, and production repeatable under delivery pressure.

There is a version of platform work where the product is already settled, the infrastructure team gets a clean runway, and nobody is waiting on a deployment. That was not my situation.

When I came in, the team I was working with already had an early multi-service version of the product taking shape. My responsibility was to get it onto AWS in a way that could support dev, staging, and production without turning every release into manual console work.

The goal was not just to put the app on AWS. It was to make environments reproducible enough that releases, staging reviews, and launch-driven traffic tests did not depend on one person remembering the right manual steps.

Getting the apps deployed early mattered beyond engineering convenience. It gave non-technical teammates a way to validate workflows and design decisions in a real environment, and it let us start scalability testing before the product was fully polished.

The architecture had to solve four practical problems: run several services reliably, protect staging and dev from public access, keep shared data services stable, and make new environments easy to recreate.

Terraform became the infrastructure deployment interface and source of truth for that AWS environment model. I used ECS/Fargate to run services with autoscaling, an Application Load Balancer (ALB) to route traffic, Cognito to gate non-production environments, RDS PostgreSQL and Redis for state, Amazon MQ for RabbitMQ for async workloads, and CloudFront where frontend delivery benefited from a CDN. The hard part was not choosing the services. It was making them repeatable, configurable, and boring to deploy under delivery pressure.

This was not a best-practices-from-day-one platform. It was a production-facing setup built under time pressure, with deliberate shortcuts and a clear sense of what those shortcuts would cost later.

At a glance

  • Deployed three persistent environments (dev, staging, production) from one Terraform codebase, with temporary preview deployments layered onto dev when needed
  • Structured each persistent environment in three layers: access and control, data and shared state, and workload deployments
  • Kept the platform ready for regional traffic tests tied to launches or marketing pushes by reusing the same module composition
  • Cut environment setup down from manual AWS-console work to a short Terraform flow, while keeping the remaining one-time manual steps small enough to automate later
  • Gated non-production access at the load balancer so staging could be shared with invited external stakeholders without being publicly reachable

The system

The scenario below is a stand-in. The actual product was different, but the infrastructure pressure was very similar, so I used a representative scenario for the article.

My part of the problem was making sure the system could land cleanly on AWS, be reachable through the right domains, stay protected in non-production, and deploy without bespoke console work every time.

The product was built around digital asset ownership and secondary market activity. The backend was made up of several distinct services:

  • an order and checkout service responsible for processing purchases and coordinating asset delivery
  • a product catalog API exposing available inventory and metadata
  • a payment webhook listener receiving inbound events from an external payment processor
  • a reconciliation worker polling external APIs to detect discrepancies between internal records and external state
  • a background job processor handling async workloads like achievement verification, notification dispatch, and queue fanout
  • an internal admin panel used by the operations team

These services communicated through a combination of direct API calls and a message broker. The frontend was a Next.js app with CDN-backed static delivery. The backend services ran in containers.

flowchart TD
    FE[Frontend - Next.js + CDN]
    ADMIN[Admin Panel]
    CATALOG[Product Catalog API]
    ORDER[Order + Checkout Service]
    WEBHOOK[Payment Webhook Listener]
    RECON[Reconciliation Worker]
    BG[Background Job Processor]
    MQ[Message Broker - RabbitMQ]
    DB[(PostgreSQL)]

    FE --> ORDER
    FE --> CATALOG
    ADMIN --> ORDER
    ADMIN --> CATALOG
    ORDER --> MQ
    WEBHOOK --> MQ
    MQ --> BG
    MQ --> RECON
    BG --> DB
    ORDER --> DB
    CATALOG --> DB

All of this had to work across multiple environments without a dedicated platform team maintaining it full-time.

This was a platform foundation rather than a full self-service internal developer platform. The goal was repeatable environments, safer delivery paths, and cleaner handoff points while the product and team were still taking shape.

What the platform had to optimize for

Before making any infrastructure decisions, I worked through what “good” actually meant for this system. There were five requirements that shaped nearly every choice:

1. Fast environment creation. Dev, staging, and production all needed to exist, and we wanted the option of per-PR preview deployments on top of dev when they were useful. One-time setup was unavoidable — connecting Terraform Cloud as the state backend and populating the initial workspace variables took some configuration. But once that foundation was in place, bringing up a new environment came down to a small Terraform flow, ideally just terraform apply. If every new environment required a full day of manual console work, preview deployments would stop being useful.

2. Regional readiness. We wanted the option to test traffic from another region when a launch, partner rollout, or marketing campaign called for it. That should not require rebuilding the infrastructure model from scratch — it should mostly be a matter of reusing the same modules with region-specific inputs.

3. Non-production environments must be private. Sharing a staging URL with an external stakeholder is useful. Sharing a staging URL that anyone can visit without credentials is not acceptable. Non-production environments needed to be gated behind authentication that was easy to manage without giving out AWS console access.

4. External collaborator access must be scoped. External parties occasionally needed access to staging environments for testing and review. That access should be invitation-based, scoped to what they actually needed, and not involve them receiving any AWS credentials.

5. Handoff readiness. The system needed to be operable by any reasonably experienced engineer without needing an extensive briefing. If I was the only one who understood how environments were created or how deployments worked, that was a fragility, not a feature.

The environment model

Across dev, staging, and production, I thought about the persistent platform in three layers. Branch previews were a special-case extension on top of dev, not a fourth long-lived environment.

flowchart TD
    L1["Layer 1<br/>Access + control<br/>IAM · Cognito · DNS · certificates"]
    L2["Layer 2<br/>Data + shared state<br/>RDS · Redis · RabbitMQ · ECR"]
    L3["Layer 3<br/>Workload deployments<br/>VPC · ALB · ECS/Fargate · Lambda"]
    BR["Optional dev-only add-on<br/>Branch preview deployments"]

    L1 --> L2
    L2 --> L3
    L3 -.-> BR

Layer 1: Access and control. IAM roles and policies, Cognito user pools, DNS, and certificates. This was the layer that made the environments reachable and manageable.

Layer 2: Data and shared state. RDS PostgreSQL, Redis, RabbitMQ, and the image repositories the deployment pipeline published to. This layer held the durable pieces the workloads depended on.

Layer 3: Workload deployments. VPC networking, the ALB, ECS/Fargate services, Lambda background tasks, and optional CDN behavior for frontend deployments. This was where most day-to-day rollout work happened.

Under the hood, the Terraform code split some of this into separate workspaces so state stayed smaller and changes stayed scoped. The conceptual model was three layers, even if the workspace layout was a bit more granular for operational safety.

Regional readiness as a design-in, not a patch-in

Multi-region support was not an immediate requirement, but I knew it might come. Rather than building single-region shortcuts that would need to be pulled apart later, the module composition was parameterized from the start: provider region, VPC CIDR, subnet ranges, and service-specific endpoint URLs all flow through as variables.

The first deployment was in us-east-1. When we wanted the option to support traffic tests from a second region around marketing pushes, the work was mostly copying the module call, swapping provider and regional inputs, and applying the change. The point was not that multi-region was free. It was that it did not require redesign.

This was not active-active multi-region or automated disaster recovery. It was regional reproducibility: the ability to stand up the same platform shape in another AWS region without redesigning the modules.
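As a minimal sketch of what that reuse could look like, assuming a hypothetical platform module at ./modules/platform with inputs named region, vpc_cidr, and subnet_cidrs (these names and the CIDR ranges are placeholders, not the real codebase):

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

# Region used for campaign-driven traffic tests (assumed input name).
variable "secondary_region" {
  type = string
}

# Default provider: the first deployment region.
provider "aws" {
  region = "us-east-1"
}

# Aliased provider for the second region.
provider "aws" {
  alias  = "secondary"
  region = var.secondary_region
}

module "platform_primary" {
  source = "./modules/platform" # hypothetical module path

  region       = "us-east-1"
  vpc_cidr     = "10.0.0.0/16"                  # placeholder range
  subnet_cidrs = ["10.0.0.0/24", "10.0.1.0/24"] # placeholder ranges
}

# Same module, different provider and regional inputs.
module "platform_secondary" {
  source = "./modules/platform"

  providers = {
    aws = aws.secondary
  }

  region       = var.secondary_region
  vpc_cidr     = "10.1.0.0/16"
  subnet_cidrs = ["10.1.0.0/24", "10.1.1.0/24"]
}
```

Keeping the two VPC ranges non-overlapping costs nothing at this stage and leaves room to connect the regions later if that ever becomes necessary.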

Access and organizational boundaries

Access control was one of the places where moving fast created the most durable problems. I tried to draw clean boundaries even when the implementation inside those boundaries was imperfect.

The deployment identity. The CI/CD pipeline ran as a deployment IAM user with long-lived access keys passed as repository secrets. This was not the ideal setup — OIDC-based role assumption would have eliminated long-lived keys entirely — but it was faster to configure. Early on, I gave that identity broader access across ECS, ECR, ALB, CloudFront, Lambda, and S3 than it probably needed so we could get systems deployed and validated quickly. Once the services were up and the workflows were being exercised, that access was much easier to tighten with better scoping.

The ECS task role. The IAM role available to application containers at runtime was separate from the deployment user and scoped independently. It granted only the AWS permissions the service actually needed. The task execution role stayed separate and was used by ECS/Fargate for platform-level actions like pulling images and writing logs.
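The split between the two roles can be sketched roughly as follows. The variable names and the S3 read permission are illustrative assumptions; the real task permissions depended on each service.

```hcl
variable "service_name" { type = string }
variable "assets_bucket_arn" { type = string } # hypothetical resource the app reads

# Both roles are assumed by ECS tasks, so they share a trust policy.
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# Execution role: used by ECS/Fargate itself to pull images and write logs.
resource "aws_iam_role" "task_execution" {
  name               = "${var.service_name}-execution"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

resource "aws_iam_role_policy_attachment" "task_execution_managed" {
  role       = aws_iam_role.task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Task role: what the application code can do at runtime.
resource "aws_iam_role" "task" {
  name               = "${var.service_name}-task"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

# Example of a narrow runtime permission (assumed for illustration).
data "aws_iam_policy_document" "task_permissions" {
  statement {
    actions   = ["s3:GetObject"]
    resources = ["${var.assets_bucket_arn}/*"]
  }
}

resource "aws_iam_role_policy" "task" {
  name   = "${var.service_name}-runtime"
  role   = aws_iam_role.task.id
  policy = data.aws_iam_policy_document.task_permissions.json
}
```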

Non-production access via Cognito. Every non-production environment had an ALB listener rule that required Cognito authentication before the request reached the application. Unauthenticated requests were redirected to the Cognito hosted login page. Users in the Cognito pool were added by invitation. External stakeholders got Cognito credentials, not AWS credentials. That let us share staging links with invited people without turning the environment into a public URL.

The full flow for a user hitting a protected environment for the first time:

flowchart TD
    U[User / Browser] --> R1[Request hits ALB]
    R1 --> CHECK{Auth cookie present?}
    CHECK -- No --> LOGIN[Cognito hosted login]
    LOGIN --> CODE[Return with auth code]
    CODE --> TOKEN[ALB exchanges code for tokens]
    TOKEN --> COOKIE[ALB sets session cookie]
    COOKIE --> RETRY[Browser retries original URL]
    RETRY --> APP[Request forwarded to application]
    CHECK -- Yes --> APP

This was handled entirely at the ALB layer. The application received no unauthenticated traffic through the intended ALB entry point and had no knowledge of the Cognito check — the module wired it up at the infrastructure level.
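A listener rule of this kind might look roughly like the sketch below. All variable names are assumptions, including the enable_cognito flag name. The authenticate-cognito action is only available on HTTPS listeners, so the listener ARN is assumed to point at one.

```hcl
variable "https_listener_arn" { type = string }
variable "listener_priority" { type = number }
variable "hostname" { type = string }
variable "target_group_arn" { type = string }
variable "enable_cognito" { type = bool }
variable "cognito_user_pool_arn" { type = string }
variable "cognito_user_pool_client_id" { type = string }
variable "cognito_user_pool_domain" { type = string }

resource "aws_lb_listener_rule" "service" {
  listener_arn = var.https_listener_arn
  priority     = var.listener_priority

  # Emitted only when the flag is on: the ALB runs the login flow
  # and sets the session cookie before anything is forwarded.
  dynamic "action" {
    for_each = var.enable_cognito ? [1] : []

    content {
      type  = "authenticate-cognito"
      order = 1

      authenticate_cognito {
        user_pool_arn              = var.cognito_user_pool_arn
        user_pool_client_id        = var.cognito_user_pool_client_id
        user_pool_domain           = var.cognito_user_pool_domain
        on_unauthenticated_request = "authenticate"
      }
    }
  }

  # Always present: forward to the service's target group.
  action {
    type             = "forward"
    order            = 2
    target_group_arn = var.target_group_arn
  }

  condition {
    host_header {
      values = [var.hostname]
    }
  }
}
```

With the flag set to false, the rule collapses to a plain forward action, which matches how production was served.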

Production had no Cognito gate. The production ALB served traffic directly to the appropriate target groups based on path and host rules. Cognito was for access control, not application authentication — application authentication was handled by the application itself.

The principle was simple: access should be something we granted on purpose and could audit later, not something people got by default because nobody had tightened it up yet. At that stage, the implementation did not fully live up to that principle, but the model pointed in the right direction.

Making environments repeatable with IaC

The core of the repeatability story was the reusable Fargate module. The goal was to make it configurable enough that adding or changing a service never required touching shared infrastructure.

It took as input:

  • the service name and container image tag
  • a subdomain to register under the environment’s base domain — the Route53 record is created automatically as part of the module
  • CPU and memory allocation
  • ALB listener priority
  • environment variables for the container
  • health check path and grace period
  • a boolean Cognito flag — when true, the ALB listener rule requires authentication before forwarding the request
  • a CDN flag (enable_cdn) — when true, the module provisions CloudFront and points Route53 there instead of directly at the ALB; when false, traffic goes straight to the ALB; this is what lets the same deployment code handle frontend apps and backend APIs without splitting into separate infrastructure paths
  • auto-scaling bounds: minimum task count, CPU utilization target, and memory utilization target — the module creates target-tracking policies for CPU and memory so each service scales against its own load profile
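To make the auto-scaling inputs concrete, here is a rough sketch of target-tracking policies for CPU and memory on an ECS service. The variable names are assumptions, and max_task_count in particular is a hypothetical input that is not in the list above.

```hcl
variable "cluster_name" { type = string }
variable "service_name" { type = string }
variable "min_task_count" { type = number }
variable "max_task_count" { type = number } # assumed upper bound
variable "cpu_target_percent" { type = number }
variable "memory_target_percent" { type = number }

# Registers the service's desired count as a scalable dimension.
resource "aws_appautoscaling_target" "service" {
  service_namespace  = "ecs"
  scalable_dimension = "ecs:service:DesiredCount"
  resource_id        = "service/${var.cluster_name}/${var.service_name}"
  min_capacity       = var.min_task_count
  max_capacity       = var.max_task_count
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "${var.service_name}-cpu"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.service.service_namespace
  scalable_dimension = aws_appautoscaling_target.service.scalable_dimension
  resource_id        = aws_appautoscaling_target.service.resource_id

  target_tracking_scaling_policy_configuration {
    target_value = var.cpu_target_percent

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}

resource "aws_appautoscaling_policy" "memory" {
  name               = "${var.service_name}-memory"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.service.service_namespace
  scalable_dimension = aws_appautoscaling_target.service.scalable_dimension
  resource_id        = aws_appautoscaling_target.service.resource_id

  target_tracking_scaling_policy_configuration {
    target_value = var.memory_target_percent

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageMemoryUtilization"
    }
  }
}
```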

Given those inputs, a single module call provisioned the full networking and compute stack for that service:

flowchart TD
    INPUTS["Module inputs<br/>service · image · subdomain<br/>auth/CDN flags · scaling targets"]
    EDGE["DNS + optional CDN"]
    ROUTING["ALB rule + optional Cognito gate"]
    SERVICE["Fargate service + target group"]
    OPS["Logs · security group · autoscaling"]

    INPUTS --> EDGE
    EDGE --> ROUTING
    ROUTING --> SERVICE
    SERVICE --> OPS

One module call could give me a subdomain, routing, optional access protection, a running service, scaling policies, and the operational plumbing around it.
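A call to such a module might look something like this. Apart from enable_cdn, every input name, the module path, and all values are hypothetical stand-ins for the inputs described above.

```hcl
variable "image_tag" { type = string }

module "catalog_api" {
  source = "./modules/fargate-service" # hypothetical module path

  service_name = "catalog-api"
  image_tag    = var.image_tag
  subdomain    = "catalog" # registered under the environment's base domain

  cpu    = 256
  memory = 512

  listener_priority = 110

  environment = {
    LOG_LEVEL = "info"
  }

  health_check_path         = "/healthz"
  health_check_grace_period = 60

  enable_cognito = true  # gate behind Cognito in non-production
  enable_cdn     = false # backend API: DNS points straight at the ALB

  min_task_count        = 1
  cpu_target_percent    = 60
  memory_target_percent = 70
}
```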

Nothing in that stack was hardcoded to a specific environment. The subdomain changed, the image tag changed, and the Cognito and CDN flags flipped per environment and service type — everything else was derived from those inputs.
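One way the CDN flag could steer DNS is sketched below. The variable names are assumptions; inside a real module the CloudFront and ALB values would more likely come from the resources themselves than from variables.

```hcl
variable "enable_cdn" { type = bool }
variable "hosted_zone_id" { type = string }
variable "subdomain" { type = string }
variable "base_domain" { type = string }
variable "alb_dns_name" { type = string }
variable "alb_zone_id" { type = string }
variable "cloudfront_domain_name" { type = string }
variable "cloudfront_zone_id" { type = string }

locals {
  # Pick the alias target based on the flag.
  alias_name    = var.enable_cdn ? var.cloudfront_domain_name : var.alb_dns_name
  alias_zone_id = var.enable_cdn ? var.cloudfront_zone_id : var.alb_zone_id
}

resource "aws_route53_record" "service" {
  zone_id = var.hosted_zone_id
  name    = "${var.subdomain}.${var.base_domain}"
  type    = "A"

  alias {
    name    = local.alias_name
    zone_id = local.alias_zone_id

    # CloudFront alias targets do not support target health evaluation.
    evaluate_target_health = !var.enable_cdn
  }
}
```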

The same deployment pattern was reused across services and environments. Dev and production differed mostly in their variable inputs: smaller task sizes (CPU and memory) in dev, different image tags, different domain names. The shape of the infrastructure stayed consistent. This mattered because many infrastructure issues found in dev applied to production too, and fixes to the module applied everywhere.

Preview deployments were a dev-only extension rather than a separate permanent environment. When we needed them, extra Terraform layered a temporary web deployment on top of dev using a tag derived from the PR, created a protected subdomain, and cleaned it up when the PR was merged or the feature no longer needed preview URLs.
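A preview layer along these lines could be expressed as a conditional module call. This is only a sketch: the module path, input names, and the pr-<number> naming scheme are assumptions, not the actual implementation.

```hcl
# Empty string means "no preview requested" (assumed convention).
variable "preview_pr" {
  type    = string
  default = ""
}

module "web_preview" {
  source = "./modules/fargate-service" # hypothetical module path
  count  = var.preview_pr == "" ? 0 : 1

  service_name = "web-pr-${var.preview_pr}"
  image_tag    = "pr-${var.preview_pr}" # tag derived from the PR
  subdomain    = "pr-${var.preview_pr}"

  enable_cognito = true # previews stay behind the non-production gate
  enable_cdn     = false
}
```

Applying again with preview_pr left empty removes the preview resources, which is one way to get the cleanup behavior described above.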

Once Terraform Cloud was connected and the initial workspace variables were populated, day-to-day changes became a short Terraform flow and new environment bring-up was mostly terraform apply against the right workspace. I still had a small number of one-time manual steps, but the goal was to keep that list short and obvious, then automate or simplify more of it later.

Delivery pressure and the decisions it shaped

I want to be direct about the shortcuts I took and why I took them. Not as a disclaimer, but because understanding the tradeoffs is more useful than pretending they were not there.

Secrets as environment variables. Rather than pulling secrets from AWS Secrets Manager at container startup, secrets were passed as Terraform variables and injected as ECS task environment variables. This was a common shortcut in container deployments and also one of the more consequential ones. Secrets ended up in Terraform state, were visible through the ECS console to anyone with the right access, and tied secret rotation to deployments. It worked, but it created coupling that should not have been there.

Long-lived IAM access keys. OIDC-based role assumption for GitHub Actions was strictly better: no long-lived credentials, per-workflow role assumption, and a smaller blast radius if a workflow was compromised. Long-lived keys were faster to set up, so that was what shipped first.
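For comparison, the OIDC alternative could be sketched as an IAM role that GitHub Actions assumes through web identity federation. This assumes the GitHub OIDC provider already exists in the account and that deployments run from the main branch; the variable and role names are hypothetical.

```hcl
variable "github_oidc_provider_arn" { type = string } # existing IAM OIDC provider
variable "github_repository" { type = string }        # "owner/repo"

data "aws_iam_policy_document" "github_actions_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.github_oidc_provider_arn]
    }

    # Token must be issued for AWS STS.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only workflows from this repository's main branch may assume the role.
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:${var.github_repository}:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_role" "deploy" {
  name                 = "github-actions-deploy" # hypothetical name
  assume_role_policy   = data.aws_iam_policy_document.github_actions_assume.json
  max_session_duration = 3600 # credentials expire after at most one hour
}
```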

Single-AZ message broker. Amazon MQ was deployed in a single availability zone on a single instance. For the RabbitMQ engine, the multi-AZ option is a cluster deployment, which spreads broker nodes across availability zones so the broker stays available if a node or zone fails. For the workload and traffic pattern at the time, single-AZ was a reasonable cost/risk trade, but it was still a single point of failure the team would need to account for during an AZ incident.
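In Terraform, this trade comes down to the broker's deployment_mode. The sketch below assumes hypothetical variable names; for the RabbitMQ engine, the multi-AZ value the AWS provider accepts is CLUSTER_MULTI_AZ.

```hcl
variable "environment" { type = string }
variable "rabbitmq_engine_version" { type = string }
variable "broker_instance_type" { type = string }
variable "private_subnet_ids" { type = list(string) }
variable "broker_multi_az" { type = bool }
variable "broker_username" { type = string }

variable "broker_password" {
  type      = string
  sensitive = true # still ends up in state, like the other secrets here
}

resource "aws_mq_broker" "rabbitmq" {
  broker_name        = "platform-${var.environment}"
  engine_type        = "RabbitMQ"
  engine_version     = var.rabbitmq_engine_version
  host_instance_type = var.broker_instance_type

  # Single instance is cheaper; the cluster mode survives a node or AZ failure.
  deployment_mode = var.broker_multi_az ? "CLUSTER_MULTI_AZ" : "SINGLE_INSTANCE"
  subnet_ids      = var.broker_multi_az ? var.private_subnet_ids : [var.private_subnet_ids[0]]

  publicly_accessible        = false
  auto_minor_version_upgrade = true

  user {
    username = var.broker_username
    password = var.broker_password
  }
}
```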

Terraform state tracking. A local Terraform state file was checked into version control during early development. Remote state in Terraform Cloud was the intended target and was established early, but the transition period is worth naming. State files contain resource identifiers and sometimes sensitive values that should not be in git history.
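Pointing a configuration at Terraform Cloud takes only a few lines (the cloud block requires Terraform 1.1 or later). The organization and workspace names below are placeholders.

```hcl
terraform {
  cloud {
    organization = "example-org" # placeholder

    workspaces {
      name = "platform-dev-workloads" # placeholder
    }
  }
}
```

After adding this block, running terraform init offers to migrate existing local state into the workspace. The local state file can then be removed from the repository, although earlier commits will still contain it.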

Monitoring sufficient, not purpose-built. The platform used CloudWatch for logs and basic metrics. It was enough to detect and diagnose incidents. It was not purpose-built for cross-service correlation or efficient dashboarding. That informed later changes.

None of these were mysterious failures. They were the predictable result of shipping under pressure. The value in naming them was that the team could fix them in order instead of waiting for an incident to force the issue.

What this enabled

By the time the platform was in regular use, the operational situation was:

  • any engineer could follow a short documented Terraform flow instead of clicking through the AWS console
  • preview deployments could be layered onto dev when needed and removed cleanly afterwards
  • deployed versions of the apps gave non-technical teammates a way to validate workflows and design decisions early, while we started scalability testing before the product was fully polished
  • the team could share staging URLs with external parties without making those environments publicly reachable
  • adding a new service to the platform was usually a module call and a set of environment variables — no bespoke infrastructure path each time
  • the same Terraform model could be reused to make another region ready for campaign-driven traffic tests
  • the deployment pipeline built Docker images once and promoted the same artifact through dev, staging, and production rather than rebuilding per environment

The regional expansion story was the clearest validation of the model. When another region needed to be made ready, it was an infrastructure exercise rather than a redesign project.

Next planned hardening steps

Once the platform was in regular use, the next hardening steps were clear. These items did not all need to be handled by me personally; the important part was that the risks were named, sequenced, and ready for whoever owned the next phase.

Secrets management. The plan was to move secrets into AWS Secrets Manager, grant the ECS task execution role read access to the relevant secret ARNs, and update the task definition to reference secret ARNs rather than receiving plaintext values as Terraform-managed environment variables.

Terraform Cloud could protect sensitive workspace variables from casual exposure in the UI, but values can still land in plan or state depending on how they are used. The safer target was to keep secret values in Secrets Manager, tightly control state access, and reference secrets from ECS task definitions. That would remove plaintext secrets from Terraform-managed ECS environment variables and centralize rotation, though ECS tasks consuming secrets as environment variables would still need to be restarted or redeployed to pick up rotated values.
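The target state could look roughly like this: the task definition references a secret by ARN and the execution role is allowed to read it. Variable names, the DATABASE_URL example, and the task sizes are illustrative assumptions.

```hcl
variable "service_name" { type = string }
variable "image" { type = string }
variable "execution_role_arn" { type = string }
variable "execution_role_name" { type = string }
variable "task_role_arn" { type = string }
variable "database_url_secret_arn" { type = string } # Secrets Manager secret ARN

resource "aws_ecs_task_definition" "service" {
  family                   = var.service_name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512
  execution_role_arn       = var.execution_role_arn
  task_role_arn            = var.task_role_arn

  container_definitions = jsonencode([
    {
      name      = var.service_name
      image     = var.image
      essential = true

      # Non-sensitive configuration stays as plain environment variables.
      environment = [
        { name = "LOG_LEVEL", value = "info" }
      ]

      # Only the ARN lives in Terraform; ECS resolves the value at task start.
      secrets = [
        { name = "DATABASE_URL", valueFrom = var.database_url_secret_arn }
      ]
    }
  ])
}

# The execution role, not the task role, fetches secrets for injection.
data "aws_iam_policy_document" "read_secrets" {
  statement {
    actions   = ["secretsmanager:GetSecretValue"]
    resources = [var.database_url_secret_arn]
  }
}

resource "aws_iam_role_policy" "execution_read_secrets" {
  name   = "${var.service_name}-read-secrets"
  role   = var.execution_role_name
  policy = data.aws_iam_policy_document.read_secrets.json
}
```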

Observability layer. The planned direction was to keep moving from CloudWatch-native tooling toward Grafana for log aggregation and dashboards. CloudWatch was fine when everything was in AWS, but a vendor-agnostic observability layer was easier to carry across providers, easier to query, and easier to share with people who did not have AWS console access.

Cost visibility in the CI/CD pipeline. Infracost already helped with manual cost estimates by running against a plan output to get a before-and-after cost delta before merging infrastructure changes. The next obvious step was integrating it into GitHub Actions so every PR touching infrastructure carried a cost estimate in the workflow.

Network isolation. Another planned hardening step was moving containers out of public subnets, where tasks ran with assign_public_ip = true, and into private subnets with a NAT gateway for outbound traffic plus VPC endpoints for AWS service access. That would remove the direct internet reachability of container network interfaces and reduce the blast radius of a compromised container.
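A sketch of that change, with hypothetical variable names, covers the service's network configuration and two of the VPC endpoints involved:

```hcl
variable "region" { type = string }
variable "vpc_id" { type = string }
variable "private_subnet_ids" { type = list(string) }
variable "private_route_table_ids" { type = list(string) }
variable "service_security_group_id" { type = string }
variable "endpoint_security_group_id" { type = string }
variable "service_name" { type = string }
variable "cluster_arn" { type = string }
variable "task_definition_arn" { type = string }

resource "aws_ecs_service" "service" {
  name            = var.service_name
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [var.service_security_group_id]
    assign_public_ip = false # tasks are no longer directly reachable
  }
}

# Gateway endpoint: S3 traffic (including ECR image layers) skips the NAT.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}

# Interface endpoint for the ECR Docker registry.
resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [var.endpoint_security_group_id]
  private_dns_enabled = true
}
```

Pulling images entirely over endpoints would also need an ecr.api interface endpoint, and CloudWatch Logs has its own interface endpoint. Without those, that traffic still goes out through the NAT gateway.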

IAM least privilege. Tightening broad early permissions into specific ARN patterns was incremental work that paired well with IAM Access Analyzer, which can generate policies from the API activity recorded in CloudTrail. The plan was to run the pipeline, review the generated policies, and tighten in stages rather than trying to enumerate every permission upfront.
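As an illustration of what a tightened policy might look like for the deployment identity, the sketch below scopes image pushes to one repository and service updates to one ECS service. The variable names are assumptions, and a real pipeline would likely need more actions than shown here.

```hcl
variable "deploy_user_name" { type = string }
variable "ecr_repository_arn" { type = string }
variable "ecs_service_arn" { type = string }

data "aws_iam_policy_document" "deploy_scoped" {
  statement {
    sid       = "EcrLogin"
    actions   = ["ecr:GetAuthorizationToken"]
    resources = ["*"] # this action does not support resource-level scoping
  }

  statement {
    sid = "PushImages"
    actions = [
      "ecr:BatchCheckLayerAvailability",
      "ecr:InitiateLayerUpload",
      "ecr:UploadLayerPart",
      "ecr:CompleteLayerUpload",
      "ecr:PutImage",
    ]
    resources = [var.ecr_repository_arn]
  }

  statement {
    sid       = "RollService"
    actions   = ["ecs:UpdateService", "ecs:DescribeServices"]
    resources = [var.ecs_service_arn]
  }
}

resource "aws_iam_user_policy" "deploy_scoped" {
  name   = "deploy-scoped"
  user   = var.deploy_user_name
  policy = data.aws_iam_policy_document.deploy_scoped.json
}
```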

Formal promotion and rollback flows. Promoting from staging to production was still a manual image retag and a Terraform apply, and rolling back was the same in reverse. Formalizing both as workflow_dispatch GitHub Actions jobs with explicit image tag and environment inputs would be a low-risk way to remove manual deployment pressure and make the process more auditable.

By that point, the platform was functional, repeatable, and operable by engineers who did not build it. The next steps were specific, bounded improvements rather than a rebuild. That was the goal from the start.