From Custom Resource to Certificate: Engineering a Kubernetes Operator for AWS ACM
Why Operators Matter
Building a Kubernetes operator is more than automation. It’s codifying domain knowledge.
Every organization has operational practices that live in runbooks, Slack threads, and the heads of senior engineers. “When you create an ACM certificate, remember to also create the Route53 validation record. Wait for it to propagate. Check the status. If it fails, check the DNS. If it’s still pending after an hour, something’s wrong.” Although this example is trivial and mostly AWS-specific, this kind of knowledge accumulates over years. It leaves when people leave, and without formal documentation it is rarely written down or validated.
An operator captures this knowledge in code. The state machine encodes the expected lifecycle. The retry logic encodes the timing expectations. The validation webhook encodes the constraints learned from past mistakes. The metrics encode what’s worth monitoring.
The process of building an operator forces you to articulate what you actually do. You’ll discover gaps: “What happens if the DNS record already exists?” “How long should we wait before declaring failure?” “What if someone deletes the certificate while it’s attached to a load balancer?” These questions have answers scattered across your team. Writing an operator consolidates them. It reveals weak spots and missing steps with gruesome efficiency. It makes you experience the pain, but in a controlled setting. That pain would be much worse in an actual emergency, or when the last engineer who knows how the system operates leaves the company.
The result is operational excellence made concrete. New team members read the state machine and understand the lifecycle. On-call engineers look at the metrics and know what’s healthy. The validation webhook prevents the mistakes that used to cause incidents.
If you’re already running Kubernetes, operators extend the platform in a native way. Users get resources that behave like any other Kubernetes object: declarative, reconciled, observable. The operational complexity is absorbed by the operator, not pushed to users. You will be doing actual platform engineering instead of just keeping a fancy runtime running. You will make life much easier for your development teams and empower them.
This post walks through the patterns we found necessary to make that work in production.
Part 1: The Problem and Vision
The Pain of Manual Certificate Management
AWS ACM certificates require DNS validation. You request a certificate, ACM gives you a CNAME record to create, and once it detects that record, it issues the certificate. Simple in theory.
In practice, development teams need to:
- Request the certificate in ACM (console, CLI, or Terraform)
- Copy the validation CNAME details
- Create the record in Route53 (or wait for someone with access to do it)
- Wait for validation to complete
- Grab the certificate ARN and wire it into their load balancer configuration
This workflow requires an understanding of both AWS IAM permissions and DNS. Usually it also requires knowing the organization: who is responsible for each piece of the underlying infrastructure, which Terraform modules are in use, and so on. Most teams have neither the access nor the desire to deal with any of it. They want HTTPS on their service. Everything else is friction.
Terraform modules exist for this, but they shift the problem rather than solve it. Teams still need to write HCL, understand the module inputs, run terraform apply, and manage state. For organizations running Kubernetes, this means developers constantly context-switch between kubectl and terraform, and more importantly between their code and the platform where it should be running.
What We’re Building
We want an all-in-Kubernetes environment. Development teams deploy manifests. That’s it. No AWS console, no Terraform runs, no DNS tickets.
The pattern: custom resources backed by operators that handle cloud provider interactions. Developers declare intent in YAML, operators reconcile that intent against AWS. We do not want to expose developers to the “how to do it”; instead we want them to think “I want this” while the operator does the heavy lifting, with domain knowledge codified into its logic.
ACM certificate management is our first implementation of this model. A developer creates a Cert resource:
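A minimal manifest might look like the following; the API group and version are illustrative assumptions, but the spec fields match the ones described in Part 2:

```yaml
apiVersion: certs.example.com/v1   # illustrative group/version
kind: Cert
metadata:
  name: my-service
  namespace: my-team
spec:
  serviceName: my-service
  environment: prod
```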
The operator:
- Requests an ACM certificate for my-service-prod.k8s.example.com
- Creates the Route53 validation record
- Waits for ACM to issue the certificate
- Exposes the certificate ARN in the resource status
The developer runs kubectl get cert my-service and gets the ARN when ready. No AWS knowledge required.
|
|
Who This Is For
This post is for platform engineers building internal platforms on Kubernetes. If you’re considering building operators that wrap cloud provider APIs, this covers the patterns we found necessary for production use:
- State machines for async workflows
- Rate limiting to stay within API quotas
- Caching to reduce API calls
- Metrics with bounded cardinality
- Validation webhooks to catch errors early
We’ll walk through the design decisions, show the code, and explain what worked and what we’d reconsider.
Part 2: API Design – The Cert Custom Resource
Designing for Developer Experience
The API surface determines adoption. If developers need to understand ACM internals to use the resource, we’ve failed. We want to hide complexity, not push it onto developers.
The CertSpec has two required fields:
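A sketch of the spec type with only the two required fields; the kubebuilder markers are illustrative:

```go
// CertSpec defines the desired state of a Cert resource (sketch).
type CertSpec struct {
	// ServiceName is combined with Environment and the zone name to
	// generate the FQDN: {serviceName}-{environment}.{zoneName}.
	// +kubebuilder:validation:Required
	ServiceName string `json:"serviceName"`

	// Environment, e.g. "dev", "staging", "prod".
	// +kubebuilder:validation:Required
	Environment string `json:"environment"`
}
```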
From these, the operator generates the FQDN: {serviceName}-{environment}.{zoneName}. A service called api in prod becomes api-prod.k8s.example.com. No DNS knowledge required.
For cases where auto-generation doesn’t fit, there’s an optional override:
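For example (the exact spec field casing is an assumption based on the prose):

```yaml
spec:
  serviceName: my-service
  environment: prod
  # Overrides the auto-generated {serviceName}-{environment}.{zoneName} FQDN.
  domainName: api.example.com
  subjectAlternativeNames:
    - www.api.example.com
```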
The DeleteOnRemoval default is deliberately false. Certificates attached to load balancers shouldn’t disappear because someone ran kubectl delete, and removing a certificate that is in use causes an API error on AWS. The operator also checks whether a certificate is in use before deletion: if it’s attached to an ALB or CloudFront distribution, the delete is blocked.
The status exposes everything a developer needs:
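A sketch of the status type; the field names follow what the rest of this post references, and the expiry timestamp is an assumption (metav1 is k8s.io/apimachinery/pkg/apis/meta/v1):

```go
// CertStatus reflects the observed state of the certificate (sketch).
type CertStatus struct {
	// State is one of Pending, Created, Validated, Ready, Failed.
	State string `json:"state,omitempty"`

	// DomainName is the generated (or overridden) FQDN.
	DomainName string `json:"domainName,omitempty"`

	// CertificateARN is set once ACM has issued the certificate.
	CertificateARN string `json:"certificateArn,omitempty"`

	// Message carries human-readable detail, e.g. why a Cert is Failed.
	Message string `json:"message,omitempty"`

	// NotAfter is the certificate expiry time (assumed; feeds the
	// expiry metrics described in Part 5).
	NotAfter *metav1.Time `json:"notAfter,omitempty"`
}
```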
Custom printer columns make kubectl get cert useful:
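With kubebuilder these are printcolumn markers on the Cert type; the column set here is illustrative, with JSON paths matching the status sketch above:

```go
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="State",type="string",JSONPath=".status.state"
// +kubebuilder:printcolumn:name="Domain",type="string",JSONPath=".status.domainName"
// +kubebuilder:printcolumn:name="ARN",type="string",JSONPath=".status.certificateArn",priority=1
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
type Cert struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   CertSpec   `json:"spec,omitempty"`
	Status CertStatus `json:"status,omitempty"`
}
```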
Multi-Zone DNS Support
Organizations don’t always have just one DNS zone. They have prod.example.com, staging.example.com, internal.example.com. Running a separate operator instance per zone doesn’t scale.
The operator supports multiple zones via a zone registry. Configuration is a comma-separated list:
|
|
Zone resolution follows a priority order:
- Explicit in spec: If spec.dnsZone.id or spec.dnsZone.name is set, use that zone
- Auto-detect from FQDN: Longest suffix match against registered zones
- Default: First zone in the configuration
The auto-detection handles most cases. Given FQDN api.staging.example.com and zones example.com and staging.example.com, the algorithm picks staging.example.com over example.com. More specific wins.
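A sketch of the longest-suffix match, with the zone registry reduced to a slice of zone names:

```go
import "strings"

// resolveZone picks the registered zone whose name is the longest suffix
// of the FQDN, so staging.example.com wins over example.com for
// api.staging.example.com. Returns false if no zone matches.
func resolveZone(fqdn string, zones []string) (string, bool) {
	best := ""
	for _, zone := range zones {
		if (fqdn == zone || strings.HasSuffix(fqdn, "."+zone)) && len(zone) > len(best) {
			best = zone
		}
	}
	return best, best != ""
}
```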
One constraint: all Subject Alternative Names must resolve to the same zone as the primary domain. DNS validation records for all domains are created in one hosted zone. Cross-zone SANs would require validation records in multiple zones, which adds complexity we chose to defer. The webhook rejects these requests with a clear error.
The Validation Webhook
Validation webhooks catch errors before they reach the AWS API. Without one, users create a Cert, wait for reconciliation, and discover the error in the status message minutes later. With a webhook, kubectl apply fails immediately with an actionable message. We see fewer errors overall, and the errors we do encounter are more likely to be real problems that need platform team investigation.
The webhook validates:
Field format: serviceName and environment must be RFC 1123 labels (lowercase alphanumeric, hyphens allowed, 63 chars max). This ensures the generated FQDN is valid DNS.
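A sketch of the label check (k8s.io/apimachinery’s validation package offers IsDNS1123Label if you’d rather not maintain the regular expression yourself):

```go
import (
	"fmt"
	"regexp"
)

// rfc1123Label matches lowercase alphanumerics and hyphens, starting and
// ending with an alphanumeric character.
var rfc1123Label = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

func validateLabel(field, value string) error {
	if len(value) == 0 || len(value) > 63 {
		return fmt.Errorf("%s must be 1-63 characters, got %d", field, len(value))
	}
	if !rfc1123Label.MatchString(value) {
		return fmt.Errorf("%s must be a valid RFC 1123 label (lowercase alphanumeric and '-')", field)
	}
	return nil
}
```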
Domain validity: If domainName is specified, it must be a valid DNS name. Same for each SAN.
SAN count: ACM allows up to 100 SANs. The webhook enforces this limit.
Zone consistency: All domains (primary + SANs) must resolve to the same zone.
Duplicate detection: Two Cert resources requesting the same FQDN would create a conflict in ACM. The webhook queries an index on status.domainName to detect duplicates before admission.
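A sketch of both halves: the index registration at manager setup and the lookup in the webhook. certsv1 stands in for the operator’s API package:

```go
// At manager setup: index Certs by the domain recorded in their status.
if err := mgr.GetFieldIndexer().IndexField(ctx, &certsv1.Cert{}, "status.domainName",
	func(obj client.Object) []string {
		cert := obj.(*certsv1.Cert)
		if cert.Status.DomainName == "" {
			return nil
		}
		return []string{cert.Status.DomainName}
	}); err != nil {
	return err
}

// In the webhook: reject admission if another Cert already owns this FQDN.
var existing certsv1.CertList
if err := c.List(ctx, &existing, client.MatchingFields{"status.domainName": fqdn}); err != nil {
	return err
}
if len(existing.Items) > 0 {
	return fmt.Errorf("a Cert for %s already exists: %s/%s",
		fqdn, existing.Items[0].Namespace, existing.Items[0].Name)
}
```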
The webhook runs in fail-closed mode by default. If the webhook itself errors (can’t reach API server, internal bug), the admission is rejected. This prioritizes safety over availability. For clusters where availability matters more, --webhook-fail-closed=false switches to warn-and-allow.
Part 3: The State Machine – Lifecycle Management
Why a State Machine?
ACM certificate provisioning is asynchronous. You request a certificate, create DNS records, and wait. The wait can be seconds or minutes depending on DNS propagation and ACM’s internal processing.
Without explicit state tracking, debugging becomes guesswork. “Why isn’t my certificate ready?” requires digging through logs to figure out where in the process things are stuck. With a state machine, kubectl get cert shows exactly where each certificate is in its lifecycle.
The state machine also enables differentiated retry behavior. A certificate waiting for DNS propagation needs different requeue timing than one waiting for initial ACM response.
The Five States
|
|
| State | Description | Requeue Interval |
|---|---|---|
| Pending | Initial state. No AWS resources created yet. | 30s with backoff |
| Created | ACM certificate requested. DNS validation record needs creation. | 1m with backoff |
| Validated | DNS validation succeeded. Waiting for ACM to issue. | 5m fixed |
| Ready | Certificate issued. ARN available in status. | 1h (expiry monitoring) |
| Failed | Permanent failure. Validation timeout, revocation, or unrecoverable error. | 5m (allows retry after manual fix) |
The reconciler checks the current state and executes the appropriate action:
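Inside Reconcile this becomes a plain switch on the status state; the handler and constant names here are illustrative:

```go
// Dispatch on the current state; each handler performs one step and
// returns the requeue behaviour appropriate for that state.
switch cert.Status.State {
case certsv1.StatePending:
	// Request the ACM certificate, record the ARN, move to Created.
	return r.handlePending(ctx, cert)
case certsv1.StateCreated:
	// Create the Route53 validation record, move to Validated.
	return r.handleCreated(ctx, cert)
case certsv1.StateValidated:
	// Poll ACM until the certificate is issued, then move to Ready.
	return r.handleValidated(ctx, cert)
case certsv1.StateReady:
	// Nothing to do; requeue on a long interval for expiry monitoring.
	return ctrl.Result{RequeueAfter: time.Hour}, nil
case certsv1.StateFailed:
	// Leave time for a manual fix, then allow another attempt.
	return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
default:
	// New resource with no state yet: initialise it to Pending.
	return r.setState(ctx, cert, certsv1.StatePending)
}
```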
Exponential Backoff with Jitter
Fixed requeue intervals cause thundering herd problems. If 100 certificates are created simultaneously, they all hit Pending at t=0, all request ACM certificates at t=30s, all check status at t=90s. The AWS API gets hammered in synchronized bursts.
Exponential backoff with jitter spreads the load:
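A sketch of the calculation, assuming a doubling schedule and roughly ±10% jitter:

```go
import (
	"math/rand"
	"time"
)

// backoffWithJitter doubles the base interval per attempt, caps it at max,
// and adds up to ±10% random jitter so synchronized resources drift apart.
func backoffWithJitter(base, max time.Duration, attempt int) time.Duration {
	if attempt > 16 { // avoid shift overflow; well past the cap anyway
		attempt = 16
	}
	d := base << uint(attempt) // base * 2^attempt
	if d > max {
		d = max
	}
	jitter := time.Duration(rand.Int63n(int64(d)/5+1)) - d/10 // [-10%, +10%]
	return d + jitter
}
```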
With a doubling schedule, the Pending state (base=30s, max=5m) retries at roughly 30s, 60s, 120s, 240s, and then caps at 5m, each value shifted slightly by jitter.
The jitter ensures that even certificates created at the exact same moment will drift apart in their retry timing.
Attempt counts are tracked via annotation and reset on state transition:
|
|
The Three Return Types
Controller-runtime reconcilers return (ctrl.Result, error). How you use these determines retry behavior:
1. Expected wait (transient state)
Use when waiting for an external process: DNS propagation, ACM issuance. This is not an error—it’s expected behavior. The controller requeues after the specified duration.
2. Unexpected error
Use for actual failures: AWS API errors, permission issues, network problems. Controller-runtime applies its own exponential backoff. The error is logged and the request is requeued.
3. Permanent failure
Use when recovery requires manual intervention: certificate revoked, validation expired. Set state to Failed, update the message, and return with a requeue. The requeue allows automatic recovery if someone fixes the underlying issue.
The distinction matters. Returning an error for expected waits pollutes logs and triggers unnecessary backoff. Returning a requeue for actual errors hides problems.
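Side by side, the three shapes look like this inside a handler (error wrapping and the status helpers are illustrative):

```go
// 1. Expected wait: not an error, just check again later.
return ctrl.Result{RequeueAfter: time.Minute}, nil

// 2. Unexpected error: return it and let controller-runtime apply its
//    own exponential backoff.
return ctrl.Result{}, fmt.Errorf("describing certificate %s: %w", arn, err)

// 3. Permanent failure: record it in status and requeue slowly so a
//    manual fix is eventually picked up.
cert.Status.State = certsv1.StateFailed
cert.Status.Message = "DNS validation expired; recreate the record to retry"
if err := r.Status().Update(ctx, cert); err != nil {
	return ctrl.Result{}, err
}
return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
```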
Part 4: AWS API Layer – The Decorator Pattern
The Challenge
AWS APIs have rate limits. ACM allows roughly 10 requests per second. Route53 allows 5 requests per second. Exceed these and you get throttled—requests fail with ThrottlingException.
A naive operator hammers AWS on every reconciliation. With 100 certificates reconciling every minute, you’re making hundreds of API calls. Add concurrent reconciliation and you’ll hit limits quickly.
We need:
- Rate limiting to stay within quotas
- Caching to avoid redundant lookups
- Retry logic for transient throttling
- Testability for unit tests
Putting all of this in one client creates a mess. The decorator pattern keeps concerns separated.
The Decorator Stack
Each layer wraps the one below, adding specific behavior:
|
|
The interface is simple:
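A sketch of that interface; the method set and the CertificateDetails/CertificateSummary types are illustrative, not the operator’s actual code:

```go
// CertificateAPI is the narrow surface the reconciler needs.
// Every decorator implements this same interface.
type CertificateAPI interface {
	RequestCertificate(ctx context.Context, domain string, sans []string) (arn string, err error)
	DescribeCertificate(ctx context.Context, arn string) (*CertificateDetails, error)
	DeleteCertificate(ctx context.Context, arn string) error
	ListCertificates(ctx context.Context) ([]CertificateSummary, error)
}
```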
Composition happens at startup:
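A sketch of the composition, in one possible ordering and with illustrative constructor names:

```go
// Innermost: the real AWS client. Each wrapper adds exactly one concern.
var api CertificateAPI = NewACMClient(awsCfg)
api = NewRateLimitedClient(api, acmLimiter)
api = NewCachingClient(api, cache)
api = NewRetryingClient(api, retryPolicy)
```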
For tests, inject a mock directly:
|
|
Token Bucket Rate Limiting
Token bucket is straightforward: a bucket holds tokens, each request consumes one, tokens refill at a fixed rate. If the bucket is empty, requests wait. This is the same logic behind the burst capability of several AWS services (EBS volumes, T-class EC2 instances, and so on).
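golang.org/x/time/rate implements exactly this; a sketch of a call site that blocks until a token is available:

```go
import (
	"context"

	"golang.org/x/time/rate"
)

// 5 requests/second sustained, with a burst of up to 10.
var acmLimiter = rate.NewLimiter(rate.Limit(5), 10)

// withRateLimit blocks until a token is available (or the context is
// cancelled), then runs the AWS call.
func withRateLimit(ctx context.Context, call func(context.Context) error) error {
	if err := acmLimiter.Wait(ctx); err != nil {
		return err
	}
	return call(ctx)
}
```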
We rate limit per hosted zone. A single limiter for all Route53 calls is wrong; it would throttle zone A because of traffic to zone B. Per-zone limiters are created lazily:
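A sketch of the lazy per-zone map, guarded by a mutex:

```go
import (
	"sync"

	"golang.org/x/time/rate"
)

type zoneLimiters struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	rps      rate.Limit
	burst    int
}

func newZoneLimiters(rps rate.Limit, burst int) *zoneLimiters {
	return &zoneLimiters{limiters: map[string]*rate.Limiter{}, rps: rps, burst: burst}
}

// forZone returns the limiter for a hosted zone, creating it on first use.
func (z *zoneLimiters) forZone(zoneID string) *rate.Limiter {
	z.mu.Lock()
	defer z.mu.Unlock()
	l, ok := z.limiters[zoneID]
	if !ok {
		l = rate.NewLimiter(z.rps, z.burst)
		z.limiters[zoneID] = l
	}
	return l
}
```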
This solution is admittedly a little overengineered: it’s extremely rare to have more than one or two certificate requests in flight at the same time. The default configuration uses 50% of the AWS limits as a safety margin:
|
|
Cache Design
The cache reduces API calls for certificate lookups. Without caching, every reconciliation calls DescribeCertificate, and we want to avoid hitting AWS APIs where possible. With caching, repeated lookups hit memory.
Structure:
|
|
Dual indexing matters. Sometimes we have the FQDN (initial lookup), sometimes we have the ARN (from status). Both should hit cache:
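A sketch of the structure, reusing the (assumed) CertificateDetails type from the client interface; both maps point at the same entry:

```go
import (
	"sync"
	"time"
)

type cacheEntry struct {
	details    *CertificateDetails
	fqdn       string
	arn        string
	expiresAt  time.Time // TTL expiry
	lastAccess time.Time // used for LRU eviction
}

type certCache struct {
	mu      sync.Mutex
	byFQDN  map[string]*cacheEntry
	byARN   map[string]*cacheEntry
	ttl     time.Duration
	maxSize int
}

// GetByARN returns a cached entry if it exists and has not expired.
// GetByFQDN (not shown) does the same against the other index.
func (c *certCache) GetByARN(arn string) (*CertificateDetails, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.byARN[arn]
	if !ok || time.Now().After(e.expiresAt) {
		return nil, false
	}
	e.lastAccess = time.Now()
	return e.details, true
}
```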
LRU eviction keeps memory bounded. When maxSize is reached, the least recently accessed entry is removed.
Throttling Detection and Retry
Even with rate limiting, throttling can occur—other processes may share the API quota, or AWS may have internal limits we’re not aware of.
Detection checks for known error codes:
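A sketch of the check using the smithy-go APIError interface from the AWS SDK for Go v2; the exact set of codes to treat as throttling is a judgment call:

```go
import (
	"errors"

	"github.com/aws/smithy-go"
)

// isThrottlingError reports whether an AWS error indicates throttling.
func isThrottlingError(err error) bool {
	var apiErr smithy.APIError
	if !errors.As(err, &apiErr) {
		return false
	}
	switch apiErr.ErrorCode() {
	case "Throttling", "ThrottlingException", "TooManyRequestsException",
		"RequestLimitExceeded", "PriorRequestNotComplete":
		return true
	}
	return false
}
```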
Retry uses exponential backoff:
|
|
Part 5: Observability – Metrics Done Right
The Cardinality Challenge
The obvious metric for certificate expiry:
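With client_golang, the naive version is a gauge labelled by domain (the metric name is illustrative), one time series per certificate:

```go
import "github.com/prometheus/client_golang/prometheus"

// One time series per certificate: cardinality grows with every Cert.
var certExpiryDays = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "cert_expiry_days",
		Help: "Days until the certificate expires.",
	},
	[]string{"domain"},
)
```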
Works fine with 50 certificates. With 5,000, Prometheus struggles. Each unique label combination creates a time series. High cardinality means high memory usage, slow queries, and expensive storage.
The tension: per-certificate metrics are actionable (you know exactly which cert is expiring), but unbounded cardinality is unsustainable. High cardinality is a very common problem with Prometheus, and we try to address it before it ever becomes one.
Dual-Layer Metrics Strategy
We use two layers:
Layer 1: Aggregate buckets (fixed cardinality)
All certificates grouped by expiry window:
|
|
Six time series regardless of certificate count. Useful for dashboards showing overall health.
Layer 2: Per-certificate (bounded cardinality)
Individual tracking only for certificates that need attention:
|
|
A certificate is tracked individually only if:
- It expires within 90 days (configurable)
- Total tracked count is under 1,000 (configurable)
Overflow is counted:
|
|
A non-zero overflow counter signals that the threshold or limit needs adjustment.
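A sketch of both layers plus the overflow counter, registered with controller-runtime’s metrics registry; metric names and window labels are illustrative:

```go
import (
	"github.com/prometheus/client_golang/prometheus"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
	// Layer 1: fixed cardinality, one series per expiry window.
	certsByExpiryWindow = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "certs_by_expiry_window",
			Help: "Number of certificates grouped by time-to-expiry window.",
		},
		[]string{"window"}, // e.g. "7d", "30d", "90d", "180d", "365d", "365d+"
	)

	// Layer 2: per-certificate series, only for certs close to expiry
	// and only up to a configurable cap.
	certExpiryDaysTracked = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "cert_expiry_days_tracked",
			Help: "Days until expiry for individually tracked certificates.",
		},
		[]string{"domain"},
	)

	// Certificates that qualified for tracking but were dropped because
	// the cap was reached.
	certTrackingOverflow = prometheus.NewCounter(
		prometheus.CounterOpts{
			Name: "cert_tracking_overflow_total",
			Help: "Certificates not individually tracked due to the cardinality cap.",
		},
	)
)

func init() {
	ctrlmetrics.Registry.MustRegister(certsByExpiryWindow, certExpiryDaysTracked, certTrackingOverflow)
}
```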
The Full Metrics Suite
Beyond expiry, we track:
Certificate state distribution:
|
|
Reconciliation performance:
|
|
AWS API health:
|
|
Rate limiter state:
|
|
Cache effectiveness:
|
|
Pre-Built Alerting Rules
The operator ships with PrometheusRule definitions:
|
|
We use Grafana for visualization.
Part 6: Production Hardening
Security Considerations
Error sanitization: AWS error messages often contain account IDs and ARNs. These shouldn’t appear in Kubernetes status fields or logs that users can access.
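A sketch of a sanitizer covering the two things that leak most often, ARNs and 12-digit account IDs:

```go
import "regexp"

var (
	arnPattern       = regexp.MustCompile(`arn:aws[a-zA-Z-]*:[^\s"]+`)
	accountIDPattern = regexp.MustCompile(`\b\d{12}\b`)
)

// sanitizeError strips ARNs and account IDs from messages before they
// are written to status fields or events.
func sanitizeError(msg string) string {
	msg = arnPattern.ReplaceAllString(msg, "[REDACTED-ARN]")
	msg = accountIDPattern.ReplaceAllString(msg, "[REDACTED-ACCOUNT]")
	return msg
}
```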
RBAC minimization: The operator’s ServiceAccount only gets access to Cert CRDs, Events (create), and Leases (leader election). No access to Secrets, ConfigMaps, or other resources.
No credentials in code: AWS credentials come from IRSA (IAM Roles for Service Accounts) on EKS, or environment variables. Never hardcoded, never in ConfigMaps.
Webhook TLS: The validation webhook requires TLS. We recommend cert-manager for automatic certificate management. Self-signed works but requires manual rotation.
High Availability
For production, run multiple replicas with leader election:
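With controller-runtime, leader election is a manager option; a sketch with an illustrative election ID and namespace:

```go
// In main(), when constructing the manager.
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	LeaderElection:          true,
	LeaderElectionID:        "cert-operator.example.com",
	LeaderElectionNamespace: "cert-operator-system",
})
if err != nil {
	return err
}
```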
Only the leader reconciles. If the leader dies, another replica takes over. Leader election uses the standard Kubernetes Lease mechanism.
PodDisruptionBudget prevents simultaneous termination during node maintenance:
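A minimal PodDisruptionBudget, with the selector matching whatever labels your Deployment uses:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cert-operator
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: cert-operator
```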
Certificate Protection
Multiple layers prevent accidental certificate deletion:
1. DeleteOnRemoval defaults to false
Deleting the Cert CR doesn’t delete the ACM certificate by default. Users must explicitly opt in:
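The opt-in is a single spec field (casing assumed from the prose):

```yaml
spec:
  serviceName: my-service
  environment: prod
  deleteOnRemoval: true   # explicit opt-in; the default is false
```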
2. In-use detection
Even with deleteOnRemoval: true, the operator checks if the certificate is attached to an ALB, NLB, or CloudFront distribution:
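ACM reports attachments in the certificate details; a sketch using the SDK’s InUseBy field:

```go
import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/service/acm"
)

// certificateInUse reports whether the certificate is attached to any
// AWS resource (ALB/NLB listeners, CloudFront distributions, ...).
func certificateInUse(ctx context.Context, client *acm.Client, arn string) (bool, []string, error) {
	out, err := client.DescribeCertificate(ctx, &acm.DescribeCertificateInput{
		CertificateArn: &arn,
	})
	if err != nil {
		return false, nil, fmt.Errorf("describing certificate: %w", err)
	}
	if out.Certificate == nil {
		return false, nil, nil
	}
	inUseBy := out.Certificate.InUseBy
	return len(inUseBy) > 0, inUseBy, nil
}
```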
If in use, deletion is blocked with a clear message.
3. Finalizer
The operator adds a finalizer to every Cert:
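A sketch of the usual finalizer dance with controller-runtime’s helpers; the finalizer name and cleanup helper are illustrative:

```go
// Inside Reconcile. controllerutil is
// sigs.k8s.io/controller-runtime/pkg/controller/controllerutil.
const certFinalizer = "certs.example.com/finalizer"

if cert.ObjectMeta.DeletionTimestamp.IsZero() {
	// Not being deleted: make sure our finalizer is present.
	if !controllerutil.ContainsFinalizer(cert, certFinalizer) {
		controllerutil.AddFinalizer(cert, certFinalizer)
		if err := r.Update(ctx, cert); err != nil {
			return ctrl.Result{}, err
		}
	}
} else {
	// Being deleted: clean up AWS resources, then release the object.
	if controllerutil.ContainsFinalizer(cert, certFinalizer) {
		if err := r.cleanupAWSResources(ctx, cert); err != nil {
			return ctrl.Result{}, err
		}
		controllerutil.RemoveFinalizer(cert, certFinalizer)
		if err := r.Update(ctx, cert); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}
```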
This ensures the controller gets a chance to run cleanup logic before the resource is removed from etcd.
Configuration Knobs
The operator exposes configuration for tuning:
| Flag | Default | Description |
|---|---|---|
| --max-concurrent-reconciles | 3 | Parallel reconciliation limit |
| --acm-rate-limit | 5.0 | ACM requests/second |
| --route53-rate-limit | 3.0 | Route53 requests/second/zone |
| --cache-ttl | 5m | Certificate cache TTL |
| --cache-max-size | 1000 | Maximum cached certificates |
| --aws-default-timeout | 30s | AWS API call timeout |
| --metrics-expiry-threshold | 90 | Days threshold for per-cert metrics |
Recommendations by scale:
| Deployment Size | Concurrent Reconciles | Rate Limits |
|---|---|---|
| Small (<100 certs) | 3 | Default |
| Medium (100-500) | 5 | Default |
| Large (>500) | 5-10 | Default, monitor throttling |
Part 7: Lessons Learned and Patterns
What Worked Well
State machine over ad-hoc logic: Every debugging session starts with “what state is it in?” Having explicit states made troubleshooting straightforward. Users report issues as “stuck in Created state” rather than “it’s not working.”
Decorator pattern for AWS client: When we needed to add caching, it was one new wrapper. No changes to existing code. When we needed per-zone rate limiting, same pattern. Each feature is isolated and testable.
Per-zone rate limiting: Early versions had a single Route53 rate limiter. Multi-zone deployments hit throttling because traffic to zone A consumed quota needed for zone B. Per-zone limiters solved this without changing the rate limiting logic.
Cardinality-aware metrics from day one: We’ve seen operators with unbounded labels cause huge Prometheus memory usage in production. Building the dual-layer strategy upfront avoided this.
Validation webhook: Catching errors at admission rather than during reconciliation improved user experience significantly. “Your request was rejected” is better than “check the status in 5 minutes to see what went wrong.”
What We’d Do Differently
Consider controller-runtime’s built-in rate limiting earlier: We implemented custom backoff before fully understanding what controller-runtime provides. There’s overlap. The custom backoff is still useful for per-state tuning, but we could have started simpler.
Circuit breaker pattern: Currently, if AWS is having issues, we keep retrying with backoff. A circuit breaker would fail fast after repeated failures and probe periodically for recovery. This would reduce load during AWS outages.
More structured logging from the start: We added structured logging keys later. Starting with a consistent logging schema (internal/logging/keys.go) would have made log aggregation cleaner from the beginning.
Certificate ARN exposed to the user: Our abstraction still leaks. At the moment a development team needs the certificate ARN in order to wire it into an ALB. This is more of a roadmap issue, since we do not yet have an abstraction (operator) for deploying the rest of the infrastructure. The problem will go away once the whole application/service lifecycle is handled by custom operators.
ADRs as Living Documentation
We maintain Architecture Decision Records for significant decisions:
- ADR-001: Decorator pattern for AWS client
- ADR-002: Certificate state machine
- ADR-003: Metrics cardinality management
- ADR-004: Multi-zone DNS support
Format:
|
|
ADRs prevent re-litigating settled decisions. When someone asks “why don’t we just use a single rate limiter?”, point them to ADR-001. When onboarding new team members, ADRs explain why the code is structured the way it is.
Part 8: Conclusion and Next Steps
Summary
Building a production-grade Kubernetes operator for AWS integration requires more than a reconciliation loop. The patterns that made this operator production-ready:
- State machine: Explicit lifecycle states with differentiated retry behavior
- Decorator pattern: Composable AWS client with rate limiting, caching, and retry logic
- Bounded metrics: Dual-layer strategy that scales without unbounded cardinality
- Validation webhook: Fail-fast error detection before resources reach the controller
- Safety defaults: Protection against accidental certificate deletion
These patterns apply beyond certificate management. Any operator wrapping cloud provider APIs will face similar challenges: rate limits, async workflows, observability at scale.
Deployment
We use Helm to install the operator:
|
|
Future Work
Multi-zone SANs: Currently all SANs must be in the same zone. Supporting SANs across zones requires creating validation records in multiple hosted zones. Deferred due to complexity, but on the roadmap.
Automatic renewal integration: ACM certificates auto-renew, but the operator could proactively trigger renewal or alert when renewal fails.
AWS Private CA support: For internal certificates that don’t need public DNS validation.
Certificate rotation for in-cluster workloads: Currently the operator exposes the ARN for use in AWS resources (ALB, CloudFront). Exporting the certificate for use inside the cluster (e.g., mounting in pods) is a different use case that warrants a separate controller (ACM doesn’t allow exporting a certificate’s private key, so completely new logic is needed).
Appendix
A. Key Code Snippets
CertSpec and CertStatus (api/v1/cert_types.go)
|
|
Jitter function (internal/awsapi/ratelimit.go)
|
|
Throttling detection (internal/awsapi/ratelimit.go)
|
|
B. Architecture Diagrams
Component Architecture
|
|
State Machine
|
|
C. Operator Framework Choice
We used Kubebuilder (controller-runtime) rather than Operator SDK. Both are valid choices; the differences are mainly in scaffolding and CLI tooling.
Key controller-runtime features used:
- Reconciler interface: Standard Reconcile(ctx, Request) (Result, error) pattern
- Client: Cached Kubernetes client with Get, List, Create, Update, Delete
- Event recorder: For creating Kubernetes events
- Leader election: Built-in via manager options
- Metrics: Automatic controller metrics, plus custom metrics registration
- Webhook support: Admission webhook scaffolding
The learning curve is moderate. Understanding the reconciliation loop, watch predicates, and status subresource updates takes time. The controller-runtime book (https://book.kubebuilder.io/) is the canonical resource.
For AWS SDK, we use AWS SDK for Go v2. Key packages:
- github.com/aws/aws-sdk-go-v2/service/acm
- github.com/aws/aws-sdk-go-v2/service/route53
- github.com/aws/aws-sdk-go-v2/config (for credential loading)