Kubernetes operators and custom resources as platform engineering building blocks

From Custom Resource to Certificate: Engineering a Kubernetes Operator for AWS ACM

Why Operators Matter

Building a Kubernetes operator is more than automation. It’s codifying domain knowledge.

Every organization has operational practices that live in runbooks, Slack threads, and the heads of senior engineers. “When you create an ACM certificate, remember to also create the Route53 validation record. Wait for it to propagate. Check the status. If it fails, check the DNS. If it’s still pending after an hour, something’s wrong.” Although this example is trivial and mostly AWS-specific, this kind of knowledge accumulates over years. It leaves when people leave, and without a formal documentation practice it is rarely written down or validated.

An operator captures this knowledge in code. The state machine encodes the expected lifecycle. The retry logic encodes the timing expectations. The validation webhook encodes the constraints learned from past mistakes. The metrics encode what’s worth monitoring.

The process of building an operator forces you to articulate what you actually do. You’ll discover gaps: “What happens if the DNS record already exists?” “How long should we wait before declaring failure?” “What if someone deletes the certificate while it’s attached to a load balancer?” These questions have answers scattered across your team. Writing an operator consolidates them. It reveals weak spots and missing steps with gruesome efficiency. It makes you experience the pain in a controlled setting, pain that would be far worse in a real emergency, or when the last engineer who knows how the system operates leaves the company.

The result is operational excellence made concrete. New team members read the state machine and understand the lifecycle. On-call engineers look at the metrics and know what’s healthy. The validation webhook prevents the mistakes that used to cause incidents.

If you’re already running Kubernetes, operators extend the platform in a native way. Users get resources that behave like any other Kubernetes object: declarative, reconciled, observable. The operational complexity is absorbed by the operator, not pushed to users. You will be doing actual platform engineering instead of just keeping a fancy runtime running, and you will make your development teams’ lives much easier and empower them.

This post walks through the patterns we found necessary to make that work in production.


Part 1: The Problem and Vision

The Pain of Manual Certificate Management

AWS ACM certificates require DNS validation. You request a certificate, ACM gives you a CNAME record to create, and once it detects that record, it issues the certificate. Simple in theory.

In practice, development teams need to:

  1. Request the certificate in ACM (console, CLI, or Terraform)
  2. Copy the validation CNAME details
  3. Create the record in Route53 (or wait for someone with access to do it)
  4. Wait for validation to complete
  5. Grab the certificate ARN and wire it into their load balancer configuration

This workflow requires understanding of both AWS IAM permissions and DNS, and usually also organizational knowledge: who is responsible for each piece of the underlying infrastructure, which Terraform modules are in use, and so on. Most teams have neither the access nor the desire to deal with it. They want HTTPS on their service. Everything else is friction.

Terraform modules exist for this, but they shift the problem rather than solve it. Teams still need to write HCL, understand the module inputs, run terraform apply, and manage state. For organizations running Kubernetes, this means developers context-switch constantly between kubectl and terraform, and more importantly between their code and the platform where it should be running.

What We’re Building

We want an all-in-Kubernetes environment. Development teams deploy manifests. That’s it. No AWS console, no Terraform runs, no DNS tickets.

The pattern: custom resources backed by operators that handle cloud provider interactions. Developers declare intent in YAML; operators reconcile that intent against AWS. We do not want to expose developers to “how to do it”; we want them to say “I want this” and let the operator do the heavy lifting, with the domain knowledge codified into its logic.

ACM certificate management is our first implementation of this model. A developer creates a Cert resource:

apiVersion: acm.ops.example.dev/v1
kind: Cert
metadata:
  name: my-service
spec:
  serviceName: my-service
  environment: prod

The operator:

  1. Requests an ACM certificate for my-service-prod.k8s.example.com
  2. Creates the Route53 validation record
  3. Waits for ACM to issue the certificate
  4. Exposes the certificate ARN in the resource status

The developer runs kubectl get cert my-service and gets the ARN when ready. No AWS knowledge required.

┌─────────────────────────────────────────────────────────────────────────────┐
│                           Kubernetes Cluster                                │
│  ┌─────────────┐     ┌──────────────────────┐     ┌─────────────────────┐   │
│  │   Cert CR   │────▶│  Cert Operator       │────▶│  Status Updates     │   │
│  │  (User)     │     │  (Controller)        │     │  (Ready/Failed)     │   │
│  └─────────────┘     └──────────┬───────────┘     └─────────────────────┘   │
│                                 │                                           │
└─────────────────────────────────┼───────────────────────────────────────────┘
                    ┌─────────────┼─────────────┐
                    │             │             │
                    ▼             ▼             ▼
              ┌──────────┐ ┌──────────┐ ┌──────────────┐
              │ AWS ACM  │ │ Route53  │ │ Prometheus   │
              │          │ │          │ │ Metrics      │
              └──────────┘ └──────────┘ └──────────────┘

Who This Is For

This post is for platform engineers building internal platforms on Kubernetes. If you’re considering building operators that wrap cloud provider APIs, this covers the patterns we found necessary for production use:

  • State machines for async workflows
  • Rate limiting to stay within API quotas
  • Caching to reduce API calls
  • Metrics with bounded cardinality
  • Validation webhooks to catch errors early

We’ll walk through the design decisions, show the code, and explain what worked and what we’d reconsider.


Part 2: API Design – The Cert Custom Resource

Designing for Developer Experience

The API surface determines adoption. If developers need to understand ACM internals to use the resource, we’ve failed. We want to hide complexity, not push it to developers.

The CertSpec has two required fields:

type CertSpec struct {
    ServiceName string `json:"serviceName"`
    Environment string `json:"environment"`
}

From these, the operator generates the FQDN: {serviceName}-{environment}.{zoneName}. A service called api in prod becomes api-prod.k8s.example.com. No DNS knowledge required.
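A minimal sketch of that derivation (the helper name is an assumption; fmt is from the standard library):

func buildFQDN(spec CertSpec, zoneName string) string {
    // {serviceName}-{environment}.{zoneName}, e.g. api-prod.k8s.example.com
    return fmt.Sprintf("%s-%s.%s", spec.ServiceName, spec.Environment, zoneName)
}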

For cases where auto-generation doesn’t fit, there’s an optional override:

type CertSpec struct {
    ServiceName string `json:"serviceName"`
    Environment string `json:"environment"`

    // Optional: override auto-generated FQDN
    DomainName string `json:"domainName,omitempty"`

    // Optional: additional domains on the same certificate
    SubjectAlternativeNames []string `json:"subjectAlternativeNames,omitempty"`

    // Optional: delete ACM cert when CR is deleted (default: false)
    DeleteOnRemoval bool `json:"deleteOnRemoval,omitempty"`
}

The DeleteOnRemoval default is deliberately false. Certificates attached to load balancers shouldn’t disappear because someone ran kubectl delete, and deleting a certificate that is in use fails with an AWS API error anyway. The operator also checks whether a certificate is in use before deletion: if it’s attached to an ALB or CloudFront distribution, the delete is blocked.

The status exposes everything a developer needs:

status:
  state: Ready
  certificateArn: arn:aws:acm:eu-west-1:123456789012:certificate/abc-123
  domainName: api-prod.example.com
  expirationDate: "2025-12-01T00:00:00Z"
  certReady: true

Custom printer columns make kubectl get cert useful:

NAME       STATE   DOMAIN                        READY   EXPIRES
api-prod   Ready   api-prod.example.com          true    2025-12-01
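The columns come from kubebuilder printcolumn markers on the Cert type. A sketch, with JSONPaths assumed to match the status fields shown above and column types illustrative:

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="State",type="string",JSONPath=".status.state"
// +kubebuilder:printcolumn:name="Domain",type="string",JSONPath=".status.domainName"
// +kubebuilder:printcolumn:name="Ready",type="boolean",JSONPath=".status.certReady"
// +kubebuilder:printcolumn:name="Expires",type="date",JSONPath=".status.expirationDate"
type Cert struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   CertSpec   `json:"spec,omitempty"`
    Status CertStatus `json:"status,omitempty"`
}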

Multi-Zone DNS Support

Organizations don’t always have just one DNS zone. They have prod.example.com, staging.example.com, internal.example.com. Running a separate operator instance per zone doesn’t scale.

The operator supports multiple zones via a zone registry. Configuration is a comma-separated list:

--dns-zones="prod.example.com:Z123,staging.example.com:Z456,internal.example.com:Z789"

Zone resolution follows a priority order:

  1. Explicit in spec: If spec.dnsZone.id or spec.dnsZone.name is set, use that zone
  2. Auto-detect from FQDN: Longest suffix match against registered zones
  3. Default: First zone in the configuration

The auto-detection handles most cases. Given FQDN api.staging.example.com and zones example.com and staging.example.com, the algorithm picks staging.example.com over example.com. More specific wins.

func (zr *ZoneRegistry) GetZoneForDomain(fqdn string) (Zone, bool) {
    var bestMatch Zone
    var bestLen int

    for _, zone := range zr.zones {
        // Match on a label boundary so "badexample.com" can't match "example.com".
        if (fqdn == zone.Name || strings.HasSuffix(fqdn, "."+zone.Name)) && len(zone.Name) > bestLen {
            bestMatch = zone
            bestLen = len(zone.Name)
        }
    }
    return bestMatch, bestLen > 0
}

One constraint: all Subject Alternative Names must resolve to the same zone as the primary domain. DNS validation records for all domains are created in one hosted zone. Cross-zone SANs would require validation records in multiple zones, which adds complexity we chose to defer. The webhook rejects these requests with a clear error.

The Validation Webhook

Validation webhooks catch errors before they reach the AWS API. Without one, users create a Cert, wait for reconciliation, and discover the error in the status message minutes later. With a webhook, kubectl apply fails immediately with an actionable message. This reduces noise: the errors we do encounter are far more likely to be genuine problems that need platform team investigation.

The webhook validates:

Field format: serviceName and environment must be RFC 1123 labels (lowercase alphanumeric, hyphens allowed, 63 chars max). This ensures the generated FQDN is valid DNS.

var dns1123LabelRegex = regexp.MustCompile(`^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$`)

Domain validity: If domainName is specified, it must be a valid DNS name. Same for each SAN.

SAN count: ACM allows up to 100 SANs. The webhook enforces this limit.

Zone consistency: All domains (primary + SANs) must resolve to the same zone.

Duplicate detection: Two Cert resources requesting the same FQDN would create a conflict in ACM. The webhook queries an index on status.domainName to detect duplicates before admission.

func (r *CertCustomValidator) checkDuplicateFQDN(ctx context.Context, cert *Cert) error {
    var certList CertList
    if err := r.Client.List(ctx, &certList,
        client.MatchingFields{"status.domainName": fqdn}); err != nil {
        return err
    }

    for _, existing := range certList.Items {
        if existing.UID != cert.UID {
            return fmt.Errorf("certificate for %s already exists: %s/%s",
                fqdn, existing.Namespace, existing.Name)
        }
    }
    return nil
}
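For that MatchingFields query to work, the status.domainName field must be registered with the manager’s field indexer at startup. A sketch, with the function name assumed:

// Register a field index on status.domainName so the webhook can list Certs by FQDN.
func SetupDomainNameIndex(ctx context.Context, mgr ctrl.Manager) error {
    return mgr.GetFieldIndexer().IndexField(ctx, &Cert{}, "status.domainName",
        func(obj client.Object) []string {
            cert := obj.(*Cert)
            if cert.Status.DomainName == "" {
                return nil
            }
            return []string{cert.Status.DomainName}
        })
}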

The webhook runs in fail-closed mode by default. If the webhook itself errors (can’t reach API server, internal bug), the admission is rejected. This prioritizes safety over availability. For clusters where availability matters more, --webhook-fail-closed=false switches to warn-and-allow.


Part 3: The State Machine – Lifecycle Management

Why a State Machine?

ACM certificate provisioning is asynchronous. You request a certificate, create DNS records, and wait. The wait can be seconds or minutes depending on DNS propagation and ACM’s internal processing.

Without explicit state tracking, debugging becomes guesswork. “Why isn’t my certificate ready?” requires digging through logs to figure out where in the process things are stuck. With a state machine, kubectl get cert shows exactly where each certificate is in its lifecycle.

The state machine also enables differentiated retry behavior. A certificate waiting for DNS propagation needs different requeue timing than one waiting for initial ACM response.

The Five States

┌─────────┐    ┌─────────┐    ┌───────────┐    ┌───────┐
│ Pending │───▶│ Created │───▶│ Validated │───▶│ Ready │
└─────────┘    └─────────┘    └───────────┘    └───────┘
     │              │               │
     └──────────────┴───────────────┘
               ┌────────┐
               │ Failed │
               └────────┘
  • Pending: Initial state. No AWS resources created yet. Requeue: 30s with backoff.
  • Created: ACM certificate requested. DNS validation record needs creation. Requeue: 1m with backoff.
  • Validated: DNS validation succeeded. Waiting for ACM to issue. Requeue: 5m fixed.
  • Ready: Certificate issued. ARN available in status. Requeue: 1h (expiry monitoring).
  • Failed: Permanent failure. Validation timeout, revocation, or unrecoverable error. Requeue: 5m (allows retry after a manual fix).

The reconciler checks the current state and executes the appropriate action:

func (r *CertReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var cert acmv1.Cert
    if err := r.Get(ctx, req.NamespacedName, &cert); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    switch cert.Status.State {
    case "", acmv1.CertStatePending:
        return r.handlePending(ctx, &cert)
    case acmv1.CertStateCreated:
        return r.handleCreated(ctx, &cert)
    case acmv1.CertStateValidated:
        return r.handleValidated(ctx, &cert)
    case acmv1.CertStateReady:
        return r.handleReady(ctx, &cert)
    case acmv1.CertStateFailed:
        return r.handleFailed(ctx, &cert)
    }
    return ctrl.Result{}, nil
}
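As an illustration, a sketch of one handler, handlePending (zone resolution, FQDN overrides, and error handling are simplified; helper names are assumptions): it requests the certificate, records the result in status, transitions to Created, and requeues.

func (r *CertReconciler) handlePending(ctx context.Context, cert *acmv1.Cert) (ctrl.Result, error) {
    zone := r.Zones.Default() // hypothetical: default or spec-resolved hosted zone
    fqdn := fmt.Sprintf("%s-%s.%s", cert.Spec.ServiceName, cert.Spec.Environment, zone.Name)

    arn, err := r.AWSClient.RequestCertificate(ctx, fqdn, cert.Spec.SubjectAlternativeNames)
    if err != nil {
        return ctrl.Result{}, err // unexpected error: controller-runtime backs off and retries
    }

    cert.Status.State = acmv1.CertStateCreated
    cert.Status.CertificateArn = arn
    cert.Status.DomainName = fqdn
    if err := r.Status().Update(ctx, cert); err != nil {
        return ctrl.Result{}, err
    }

    // Expected wait: requeue with the Created-state interval (backoff and jitter in practice).
    return ctrl.Result{RequeueAfter: time.Minute}, nil
}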

Exponential Backoff with Jitter

Fixed requeue intervals cause thundering herd problems. If 100 certificates are created simultaneously, they all hit Pending at t=0, all request ACM certificates at t=30s, all check status at t=90s. The AWS API gets hammered in synchronized bursts.

Exponential backoff with jitter spreads the load:

interval = min(maxInterval, baseInterval × 2^attempt) ± 10% jitter

For Pending state (base=30s, max=5m):

Attempt 0: 30s  ± 3s  → [27s, 33s]
Attempt 1: 60s  ± 6s  → [54s, 66s]
Attempt 2: 120s ± 12s → [108s, 132s]
Attempt 3: 240s ± 24s → [216s, 264s]
Attempt 4: 300s ± 30s → [270s, 330s] (capped)

The jitter ensures that even certificates created at the exact same moment will drift apart in their retry timing.

Attempt counts are tracked via annotation and reset on state transition:

const (
    AnnotationReconcileAttempts = "acm.ops.example.dev/reconcile-attempts-in-state"
    AnnotationLastState         = "acm.ops.example.dev/last-reconcile-state"
)

func (r *CertReconciler) getBackoffInterval(cert *acmv1.Cert, base, max time.Duration) time.Duration {
    attempts := getAttemptCount(cert)
    if attempts > 16 {
        attempts = 16 // cap the shift so the multiplication cannot overflow
    }

    interval := base * time.Duration(1<<attempts) // base × 2^attempts
    if interval > max {
        interval = max
    }

    // Add ±10% jitter
    jitter := time.Duration(float64(interval) * (rand.Float64()*0.2 - 0.1))
    return interval + jitter
}
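The counter itself is read from, and reset via, those annotations. A sketch with assumed helper names (strconv is from the standard library):

// getAttemptCount returns the attempts made in the current state, starting
// over whenever the state changed since the last reconcile.
func getAttemptCount(cert *acmv1.Cert) int {
    ann := cert.GetAnnotations()
    if ann[AnnotationLastState] != string(cert.Status.State) {
        return 0 // state transition: restart the backoff sequence
    }
    n, err := strconv.Atoi(ann[AnnotationReconcileAttempts])
    if err != nil || n < 0 {
        return 0
    }
    return n
}

// recordAttempt persists the incremented counter and the state it belongs to.
func recordAttempt(cert *acmv1.Cert) {
    ann := cert.GetAnnotations()
    if ann == nil {
        ann = map[string]string{}
    }
    ann[AnnotationReconcileAttempts] = strconv.Itoa(getAttemptCount(cert) + 1)
    ann[AnnotationLastState] = string(cert.Status.State)
    cert.SetAnnotations(ann)
}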

The Three Return Types

Controller-runtime reconcilers return (ctrl.Result, error). How you use these determines retry behavior:

1. Expected wait (transient state)

return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil

Use when waiting for an external process: DNS propagation, ACM issuance. This is not an error—it’s expected behavior. The controller requeues after the specified duration.

2. Unexpected error

return ctrl.Result{}, err

Use for actual failures: AWS API errors, permission issues, network problems. Controller-runtime applies its own exponential backoff. The error is logged and the request is requeued.

3. Permanent failure

cert.Status.State = acmv1.CertStateFailed
cert.Status.Message = "validation timed out after 72 hours"
r.Status().Update(ctx, cert)
return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil

Use when recovery requires manual intervention: certificate revoked, validation expired. Set state to Failed, update the message, and return with a requeue. The requeue allows automatic recovery if someone fixes the underlying issue.

The distinction matters. Returning an error for expected waits pollutes logs and triggers unnecessary backoff. Returning a requeue for actual errors hides problems.


Part 4: AWS API Layer – The Decorator Pattern

The Challenge

AWS APIs have rate limits. ACM allows roughly 10 requests per second. Route53 allows 5 requests per second. Exceed these and you get throttled—requests fail with ThrottlingException.

A naive operator hammers AWS on every reconciliation. With 100 certificates reconciling every minute, you’re making hundreds of API calls. Add concurrent reconciliation and you’ll hit limits quickly.

We need:

  • Rate limiting to stay within quotas
  • Caching to avoid redundant lookups
  • Retry logic for transient throttling
  • Testability for unit tests

Putting all of this in one client creates a mess. The decorator pattern keeps concerns separated.

The Decorator Stack

Each layer wraps the one below, adding specific behavior:

┌─────────────────────────────────────────────────────────────┐
│                    CachedAWSClient                          │
│  • TTL-based caching for certificate lookups                │
│  • LRU eviction when cache is full                          │
│  • Dual-key index: FQDN and ARN                             │
└────────────────────────────┬────────────────────────────────┘
                             │ wraps
┌─────────────────────────────────────────────────────────────┐
│                 RateLimitedAWSClient                        │
│  • Token bucket rate limiting                               │
│  • Per-zone limiting for Route53                            │
│  • Exponential backoff on throttle                          │
└────────────────────────────┬────────────────────────────────┘
                             │ wraps
┌─────────────────────────────────────────────────────────────┐
│                     RealAWSClient                           │
│  • AWS SDK v2 calls                                         │
│  • Configurable timeouts                                    │
└─────────────────────────────────────────────────────────────┘

The interface is simple:

type AWSClient interface {
    GetCertificate(ctx context.Context, fqdn string) (Certificate, error)
    GetCertificateByArn(ctx context.Context, arn string) (Certificate, error)
    RequestCertificate(ctx context.Context, fqdn string, sans []string) (string, error)
    DeleteCertificate(ctx context.Context, arn string) error
    CertificateInUse(ctx context.Context, arn string) (bool, error)
    CreateValidationRecord(ctx context.Context, arn, zoneId string) error
    DeleteValidationRecord(ctx context.Context, arn, zoneId string) error
}

Composition happens at startup:

realClient := awsapi.NewRealAWSClient(acmClient, route53Client, cfg)
rateLimited := awsapi.NewRateLimitedAWSClient(realClient, rateLimitConfig)
cached := awsapi.NewCachedAWSClient(rateLimited, cacheConfig)

// Use cached client in reconciler
reconciler.AWSClient = cached

For tests, inject a mock directly:

reconciler.AWSClient = &MockAWSClient{...}

Token Bucket Rate Limiting

Token bucket is straightforward: a bucket holds tokens, each request consumes one, and tokens refill at a fixed rate. If the bucket is empty, requests wait. The same logic underpins several AWS services with burst capability (EBS volumes, T-class EC2 instances, and so on).

type RateLimitedAWSClient struct {
    client     AWSClient
    acmLimiter *rate.Limiter
    r53Limiter *rate.Limiter

    // Per-zone Route53 limiters
    r53ZoneMu       sync.RWMutex
    r53ZoneLimiters map[string]*rate.Limiter

    config RateLimitConfig
}
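Each wrapped method blocks on the appropriate limiter before delegating to the inner client. A sketch of two methods (metrics and retry omitted; getOrCreateZoneLimiter is shown below):

// ACM calls share one limiter.
func (r *RateLimitedAWSClient) GetCertificate(ctx context.Context, fqdn string) (Certificate, error) {
    if err := r.acmLimiter.Wait(ctx); err != nil {
        return Certificate{}, err // context cancelled or deadline exceeded while waiting
    }
    return r.client.GetCertificate(ctx, fqdn)
}

// Route53 calls wait on the limiter for their hosted zone.
func (r *RateLimitedAWSClient) CreateValidationRecord(ctx context.Context, arn, zoneId string) error {
    if err := r.getOrCreateZoneLimiter(zoneId).Wait(ctx); err != nil {
        return err
    }
    return r.client.CreateValidationRecord(ctx, arn, zoneId)
}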

We rate limit per hosted zone. A single limiter for all Route53 calls would be wrong: it would throttle zone A because of traffic to zone B. Per-zone limiters are created lazily:

func (r *RateLimitedAWSClient) getOrCreateZoneLimiter(zoneId string) *rate.Limiter {
    if zoneId == "" {
        return r.r53Limiter // fallback
    }

    r.r53ZoneMu.RLock()
    limiter, exists := r.r53ZoneLimiters[zoneId]
    r.r53ZoneMu.RUnlock()
    if exists {
        return limiter
    }

    r.r53ZoneMu.Lock()
    defer r.r53ZoneMu.Unlock()

    // Double-check after acquiring write lock
    if limiter, exists = r.r53ZoneLimiters[zoneId]; exists {
        return limiter
    }

    limiter = rate.NewLimiter(rate.Limit(r.config.Route53RateLimit), r.config.Route53BurstSize)
    r.r53ZoneLimiters[zoneId] = limiter
    return limiter
}

Admittedly, this is a little overengineered: it’s rare to have more than one or two certificate requests in flight at the same time. The default configuration uses roughly 50% of the AWS limits as a safety margin:

func DefaultRateLimitConfig() RateLimitConfig {
    return RateLimitConfig{
        ACMRateLimit:     5,    // AWS limit: 10 TPS
        ACMBurstSize:     10,
        Route53RateLimit: 3,    // AWS limit: 5 TPS per zone
        Route53BurstSize: 5,
    }
}

Cache Design

The cache reduces API calls for certificate lookups. Without it, every reconciliation calls DescribeCertificate; we want to avoid hitting the AWS APIs wherever possible. With caching, repeated lookups hit memory.

Structure:

type CachedAWSClient struct {
    client AWSClient
    cache  *CertificateCache
}

type CertificateCache struct {
    mu       sync.RWMutex
    byFQDN   map[string]*cacheEntry  // primary index
    byARN    map[string]string       // ARN -> FQDN mapping
    ttl      time.Duration
    maxSize  int
}

type cacheEntry struct {
    cert       Certificate
    expiresAt  time.Time
    lastAccess time.Time
}

Dual indexing matters. Sometimes we have the FQDN (initial lookup), sometimes we have the ARN (from status). Both should hit cache:

func (c *CachedAWSClient) GetCertificateByArn(ctx context.Context, arn string) (Certificate, error) {
    if cert, ok := c.cache.GetByArn(arn); ok {
        metrics.RecordCacheHit("arn")
        return cert, nil
    }
    metrics.RecordCacheMiss("arn")

    cert, err := c.client.GetCertificateByArn(ctx, arn)
    if err == nil {
        c.cache.Set(cert)
    }
    return cert, err
}

LRU eviction keeps memory bounded. When maxSize is reached, the least recently accessed entry is removed.
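A sketch of that eviction inside Set, scanning for the stalest entry when the cache is full (the Certificate field names here are assumptions):

func (c *CertificateCache) Set(cert Certificate) {
    c.mu.Lock()
    defer c.mu.Unlock()

    // Evict the least recently accessed entry if inserting would exceed maxSize.
    if _, exists := c.byFQDN[cert.DomainName]; !exists && len(c.byFQDN) >= c.maxSize {
        var oldestKey string
        var oldestAccess time.Time
        for fqdn, entry := range c.byFQDN {
            if oldestKey == "" || entry.lastAccess.Before(oldestAccess) {
                oldestKey, oldestAccess = fqdn, entry.lastAccess
            }
        }
        delete(c.byARN, c.byFQDN[oldestKey].cert.ARN)
        delete(c.byFQDN, oldestKey)
    }

    c.byFQDN[cert.DomainName] = &cacheEntry{
        cert:       cert,
        expiresAt:  time.Now().Add(c.ttl),
        lastAccess: time.Now(),
    }
    c.byARN[cert.ARN] = cert.DomainName
}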

Throttling Detection and Retry

Even with rate limiting, throttling can occur—other processes may share the API quota, or AWS may have internal limits we’re not aware of.

Detection checks for known error codes:

func isThrottlingError(err error) bool {
    var apiErr smithy.APIError
    if errors.As(err, &apiErr) {
        switch apiErr.ErrorCode() {
        case "Throttling", "ThrottlingException", "RequestLimitExceeded",
             "TooManyRequestsException", "SlowDown", "PriorRequestNotComplete":
            return true
        }
    }
    return false
}

Retry uses exponential backoff:

func (r *RateLimitedAWSClient) withRetry(ctx context.Context, operation string, fn func() error) error {
    backoff := r.config.BaseBackoffDelay // 1s
    var lastErr error

    for attempt := 0; attempt <= r.config.MaxRetries; attempt++ {
        err := fn()
        if err == nil {
            return nil
        }
        if !isThrottlingError(err) {
            return err // not throttling, don't retry
        }
        lastErr = err

        metrics.RecordThrottlingEvent(operation)

        if attempt < r.config.MaxRetries {
            jittered := addJitter(backoff, r.config.JitterFactor)
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(jittered):
            }
            backoff *= 2
            if backoff > r.config.MaxBackoffDelay {
                backoff = r.config.MaxBackoffDelay
            }
        }
    }
    return fmt.Errorf("max retries exceeded: %w", lastErr)
}

Part 5: Observability – Metrics Done Right

The Cardinality Challenge

The obvious metric for certificate expiry:

cert_operator_certificate_expiry_days{namespace="prod", name="api-cert"} 45

Works fine with 50 certificates. With 5,000, Prometheus struggles. Each unique label combination creates a time series. High cardinality means high memory usage, slow queries, and expensive storage.

The tension: per-certificate metrics are actionable (you know exactly which cert is expiring), but unbounded cardinality is unsustainable. High cardinality is a very common problem with Prometheus, and we wanted to address it before it became one.

Dual-Layer Metrics Strategy

We use two layers:

Layer 1: Aggregate buckets (fixed cardinality)

All certificates grouped by expiry window:

cert_operator_certificate_expiry_buckets{bucket="0-7d"}   5
cert_operator_certificate_expiry_buckets{bucket="7-14d"}  12
cert_operator_certificate_expiry_buckets{bucket="14-30d"} 23
cert_operator_certificate_expiry_buckets{bucket="30-60d"} 45
cert_operator_certificate_expiry_buckets{bucket="60-90d"} 67
cert_operator_certificate_expiry_buckets{bucket="90d+"}   148

Six time series regardless of certificate count. Useful for dashboards showing overall health.
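The bucket label is just a fixed mapping from days-to-expiry; a sketch using the boundaries above:

// expiryBucket maps days-until-expiry to one of six labels, keeping
// cardinality constant regardless of certificate count.
func expiryBucket(days float64) string {
    switch {
    case days <= 7:
        return "0-7d"
    case days <= 14:
        return "7-14d"
    case days <= 30:
        return "14-30d"
    case days <= 60:
        return "30-60d"
    case days <= 90:
        return "60-90d"
    default:
        return "90d+"
    }
}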

Layer 2: Per-certificate (bounded cardinality)

Individual tracking only for certificates that need attention:

const (
    DefaultExpiryTrackingThresholdDays = 90
    DefaultMaxTrackedCertificates      = 1000
)

A certificate is tracked individually only if:

  1. It expires within 90 days (configurable)
  2. Total tracked count is under 1,000 (configurable)

Overflow is counted:

cert_operator_certificates_not_tracked_total 42

A non-zero overflow counter signals that the threshold or limit needs adjustment.

func SetCertificateExpiryDays(namespace, name string, days float64) {
    key := namespace + "/" + name
    threshold := expiryTrackingThresholdDays.Load()
    maxTracked := maxTrackedCertificates.Load()

    trackedCertsMutex.Lock()
    defer trackedCertsMutex.Unlock()

    shouldTrack := days <= float64(threshold)
    _, alreadyTracked := trackedCertificates[key]

    if shouldTrack {
        if !alreadyTracked && len(trackedCertificates) >= int(maxTracked) {
            CertificatesNotTrackedTotal.Inc()
            return
        }
        trackedCertificates[key] = struct{}{}
        CertificateExpiryDays.WithLabelValues(namespace, name).Set(days)
    } else if alreadyTracked {
        delete(trackedCertificates, key)
        CertificateExpiryDays.DeleteLabelValues(namespace, name)
    }
}

The Full Metrics Suite

Beyond expiry, we track:

Certificate state distribution:

cert_operator_certificates_total{state="Pending"}   2
cert_operator_certificates_total{state="Created"}   1
cert_operator_certificates_total{state="Ready"}     145
cert_operator_certificates_total{state="Failed"}    0

Reconciliation performance:

cert_operator_reconcile_duration_seconds_bucket{state="Ready",success="true",le="0.1"} 523
cert_operator_reconcile_duration_seconds_bucket{state="Ready",success="true",le="1"}   612

AWS API health:

cert_operator_aws_api_calls_total{operation="DescribeCertificate",success="true"} 1523
cert_operator_aws_api_latency_seconds_bucket{operation="DescribeCertificate",le="0.5"} 1400
cert_operator_throttling_events_total{operation="RequestCertificate"} 3

Rate limiter state:

cert_operator_rate_limit_wait_seconds_bucket{service="acm",le="0.1"} 890
cert_operator_rate_limiter_tokens{service="acm"} 8.5

Cache effectiveness:

cert_operator_cache_hits_total{lookup_type="fqdn"} 4521
cert_operator_cache_misses_total{lookup_type="fqdn"} 234

Pre-Built Alerting Rules

The operator ships with PrometheusRule definitions:

- alert: CertificateExpiringSoon
  expr: cert_operator_certificate_expiry_days < 14 and cert_operator_certificate_expiry_days > 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Certificate {{ $labels.name }} expires in {{ $value }} days"

- alert: CertificateExpiryCritical
  expr: cert_operator_certificate_expiry_days <= 7 and cert_operator_certificate_expiry_days > 0
  for: 30m
  labels:
    severity: critical

- alert: AWSAPIThrottlingHigh
  expr: rate(cert_operator_throttling_events_total[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "AWS API throttling detected"

- alert: CertificatesFailed
  expr: cert_operator_certificates_total{state="Failed"} > 0
  for: 15m
  labels:
    severity: warning

- alert: HighReconciliationErrorRate
  expr: |
    sum(rate(cert_operator_reconcile_duration_seconds_count{success="false"}[5m]))
    / sum(rate(cert_operator_reconcile_duration_seconds_count[5m])) > 0.1
  for: 10m
  labels:
    severity: warning

We use Grafana for visualization.


Part 6: Production Hardening

Security Considerations

Error sanitization: AWS error messages often contain account IDs and ARNs. These shouldn’t appear in Kubernetes status fields or logs that users can access.

var accountIDRegex = regexp.MustCompile(`\b\d{12}\b`)
var arnRegex = regexp.MustCompile(`arn:aws:[a-z0-9-]+:[a-z0-9-]*:\d{12}:[^\s]+`)

func SanitizeErrorMessage(msg string) string {
    msg = accountIDRegex.ReplaceAllString(msg, "[ACCOUNT_ID]")
    msg = arnRegex.ReplaceAllString(msg, "[ARN]")
    return msg
}

RBAC minimization: The operator’s ServiceAccount only gets access to Cert CRDs, Events (create), and Leases (leader election). No access to Secrets, ConfigMaps, or other resources.
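In a kubebuilder layout, that role is generated from RBAC markers on the controller; a sketch of what they might look like here (verbs are illustrative, not the real manifest):

// Placed above the Reconcile method; `make manifests` turns these into the (Cluster)Role.
// +kubebuilder:rbac:groups=acm.ops.example.dev,resources=certs,verbs=get;list;watch;update;patch
// +kubebuilder:rbac:groups=acm.ops.example.dev,resources=certs/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=acm.ops.example.dev,resources=certs/finalizers,verbs=update
// +kubebuilder:rbac:groups="",resources=events,verbs=create;patch
// +kubebuilder:rbac:groups=coordination.k8s.io,resources=leases,verbs=get;list;watch;create;update;patch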

No credentials in code: AWS credentials come from IRSA (IAM Roles for Service Accounts) on EKS, or environment variables. Never hardcoded, never in ConfigMaps.

Webhook TLS: The validation webhook requires TLS. We recommend cert-manager for automatic certificate management. Self-signed works but requires manual rotation.

High Availability

For production, run multiple replicas with leader election:

replicaCount: 2
leaderElection:
  enabled: true

Only the leader reconciles. If the leader dies, another replica takes over. Leader election uses the standard Kubernetes Lease mechanism.
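In controller-runtime, this is a manager option. A sketch from main.go (the election ID is an assumption; setupLog as in the kubebuilder scaffold):

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    LeaderElection:   true,
    LeaderElectionID: "cert-operator.acm.ops.example.dev", // Lease name used for the election
})
if err != nil {
    setupLog.Error(err, "unable to start manager")
    os.Exit(1)
}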

PodDisruptionBudget prevents simultaneous termination during node maintenance:

podDisruptionBudget:
  enabled: true
  minAvailable: 1

Certificate Protection

Multiple layers prevent accidental certificate deletion:

1. DeleteOnRemoval defaults to false

Deleting the Cert CR doesn’t delete the ACM certificate by default. Users must explicitly opt in:

spec:
  deleteOnRemoval: true

2. In-use detection

Even with deleteOnRemoval: true, the operator checks if the certificate is attached to an ALB, NLB, or CloudFront distribution:

func (c *RealAWSClient) CertificateInUse(ctx context.Context, arn string) (bool, error) {
    cert, err := c.acm.DescribeCertificate(ctx, &acm.DescribeCertificateInput{
        CertificateArn: aws.String(arn),
    })
    if err != nil {
        return false, err
    }
    return len(cert.Certificate.InUseBy) > 0, nil
}

If in use, deletion is blocked with a clear message.

3. Finalizer

The operator adds a finalizer to every Cert:

metadata:
  finalizers:
    - acm.ops.example.dev/finalizer

This ensures the controller gets a chance to run cleanup logic before the resource is removed from etcd.
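A sketch of the finalizer flow in the reconciler, using controller-runtime’s controllerutil helpers (cleanupAWSResources is a hypothetical helper):

const certFinalizer = "acm.ops.example.dev/finalizer"

func (r *CertReconciler) reconcileFinalizer(ctx context.Context, cert *acmv1.Cert) (deleted bool, err error) {
    if cert.DeletionTimestamp.IsZero() {
        // Not being deleted: make sure the finalizer is present.
        if !controllerutil.ContainsFinalizer(cert, certFinalizer) {
            controllerutil.AddFinalizer(cert, certFinalizer)
            return false, r.Update(ctx, cert)
        }
        return false, nil
    }

    // Being deleted: release AWS resources before letting the object go.
    if controllerutil.ContainsFinalizer(cert, certFinalizer) {
        if err := r.cleanupAWSResources(ctx, cert); err != nil {
            return false, err // retried on the next reconcile
        }
        controllerutil.RemoveFinalizer(cert, certFinalizer)
        return true, r.Update(ctx, cert)
    }
    return true, nil
}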

Configuration Knobs

The operator exposes configuration for tuning:

  • --max-concurrent-reconciles (default: 3): parallel reconciliation limit
  • --acm-rate-limit (default: 5.0): ACM requests per second
  • --route53-rate-limit (default: 3.0): Route53 requests per second per zone
  • --cache-ttl (default: 5m): certificate cache TTL
  • --cache-max-size (default: 1000): maximum cached certificates
  • --aws-default-timeout (default: 30s): AWS API call timeout
  • --metrics-expiry-threshold (default: 90): days threshold for per-certificate metrics

Recommendations by scale:

  • Small (<100 certs): 3 concurrent reconciles, default rate limits
  • Medium (100-500 certs): 5 concurrent reconciles, default rate limits
  • Large (>500 certs): 5-10 concurrent reconciles, default rate limits; monitor throttling metrics

Part 7: Lessons Learned and Patterns

What Worked Well

State machine over ad-hoc logic: Every debugging session starts with “what state is it in?” Having explicit states made troubleshooting straightforward. Users report issues as “stuck in Created state” rather than “it’s not working.”

Decorator pattern for AWS client: When we needed to add caching, it was one new wrapper. No changes to existing code. When we needed per-zone rate limiting, same pattern. Each feature is isolated and testable.

Per-zone rate limiting: Early versions had a single Route53 rate limiter. Multi-zone deployments hit throttling because traffic to zone A consumed quota needed for zone B. Per-zone limiters solved this without changing the rate limiting logic.

Cardinality-aware metrics from day one: We’ve seen operators with unbounded labels cause huge Prometheus memory usage in production. Building the dual-layer strategy upfront avoided this.

Validation webhook: Catching errors at admission rather than during reconciliation improved user experience significantly. “Your request was rejected” is better than “check the status in 5 minutes to see what went wrong.”

What We’d Do Differently

Consider controller-runtime’s built-in rate limiting earlier: We implemented custom backoff before fully understanding what controller-runtime provides. There’s overlap. The custom backoff is still useful for per-state tuning, but we could have started simpler.

Circuit breaker pattern: Currently, if AWS is having issues, we keep retrying with backoff. A circuit breaker would fail fast after repeated failures and probe periodically for recovery. This would reduce load during AWS outages.

More structured logging from the start: We added structured logging keys later. Starting with a consistent logging schema (internal/logging/keys.go) would have made log aggregation cleaner from the beginning.

Certificate ARN exposed to the user: The abstraction still leaks. At the moment a development team needs the certificate ARN to wire it into their ALB. This is more of a roadmap issue, since we do not yet have an operator for deploying the rest of the infrastructure. It will be alleviated once the whole application/service lifecycle is handled by custom operators.

ADRs as Living Documentation

We maintain Architecture Decision Records for significant decisions:

  • ADR-001: Decorator pattern for AWS client
  • ADR-002: Certificate state machine
  • ADR-003: Metrics cardinality management
  • ADR-004: Multi-zone DNS support

Format:

# ADR NNN: Title

## Status
Accepted | Deprecated | Superseded

## Context
What problem are we solving?

## Decision
What did we decide?

## Consequences
What are the tradeoffs?

## Alternatives Considered
What else did we evaluate?

ADRs prevent re-litigating settled decisions. When someone asks “why don’t we just use a single rate limiter?”, point them to ADR-001. When onboarding new team members, ADRs explain why the code is structured the way it is.


Part 8: Conclusion and Next Steps

Summary

Building a production-grade Kubernetes operator for AWS integration requires more than a reconciliation loop. The patterns that made this operator production-ready:

  1. State machine: Explicit lifecycle states with differentiated retry behavior
  2. Decorator pattern: Composable AWS client with rate limiting, caching, and retry logic
  3. Bounded metrics: Dual-layer strategy that scales without unbounded cardinality
  4. Validation webhook: Fail-fast error detection before resources reach the controller
  5. Safety defaults: Protection against accidental certificate deletion

These patterns apply beyond certificate management. Any operator wrapping cloud provider APIs will face similar challenges: rate limits, async workflows, observability at scale.

Deployment

We use Helm to install the operator:

# Install with Helm
helm install cert-operator ./charts/example-k8s-cert-operator \
  --namespace cert-operator \
  --create-namespace \
  --set operator.aws.region=eu-west-1 \
  --set operator.dns.zoneName=example.com \
  --set operator.dns.zoneId=Z0123456789ABC

# Create your first certificate
kubectl apply -f - <<EOF
apiVersion: acm.ops.example.dev/v1
kind: Cert
metadata:
  name: my-app
spec:
  serviceName: my-app
  environment: dev
EOF

# Watch it progress
kubectl get cert my-app -w

Future Work

Multi-zone SANs: Currently all SANs must be in the same zone. Supporting SANs across zones requires creating validation records in multiple hosted zones. Deferred due to complexity, but on the roadmap.

Automatic renewal integration: ACM certificates auto-renew, but the operator could proactively trigger renewal or alert when renewal fails.

AWS Private CA support: For internal certificates that don’t need public DNS validation.

Certificate rotation for in-cluster workloads: Currently the operator exposes the ARN for use in AWS resources (ALB, CloudFront). Exporting the certificate for use inside the cluster (e.g., mounting in pods) is a different use case that warrants a separate controller (ACM doesn’t allow exporting a certificate’s private key, so entirely different logic would be needed).


Appendix

A. Key Code Snippets

CertSpec and CertStatus (api/v1/cert_types.go)

type CertSpec struct {
    ServiceName             string      `json:"serviceName"`
    Environment             string      `json:"environment"`
    DomainName              string      `json:"domainName,omitempty"`
    SubjectAlternativeNames []string    `json:"subjectAlternativeNames,omitempty"`
    DeleteOnRemoval         bool        `json:"deleteOnRemoval,omitempty"`
    DNSZone                 *DNSZoneRef `json:"dnsZone,omitempty"`
}

type CertStatus struct {
    State          CertState    `json:"state,omitempty"`
    Message        string       `json:"message,omitempty"`
    CertificateArn string       `json:"certificateArn,omitempty"`
    DomainName     string       `json:"domainName,omitempty"`
    ExpirationDate *metav1.Time `json:"expirationDate,omitempty"`
    CertReady      bool         `json:"certReady,omitempty"`
    ResolvedZone   *DNSZoneRef  `json:"resolvedZone,omitempty"`
}

Jitter function (internal/awsapi/ratelimit.go)

func addJitter(d time.Duration, jitterFactor float64, r *rand.Rand, mu *sync.Mutex) time.Duration {
    if jitterFactor <= 0 {
        return d
    }
    mu.Lock()
    jitter := (r.Float64() - 0.5) * jitterFactor
    mu.Unlock()
    return time.Duration(float64(d) * (1 + jitter))
}

Throttling detection (internal/awsapi/ratelimit.go)

func isThrottlingError(err error) bool {
    var apiErr smithy.APIError
    if errors.As(err, &apiErr) {
        switch apiErr.ErrorCode() {
        case "Throttling", "ThrottlingException", "RequestLimitExceeded",
             "TooManyRequestsException", "SlowDown", "PriorRequestNotComplete":
            return true
        }
    }
    return false
}

B. Architecture Diagrams

Component Architecture

┌────────────────────────────────────────────────────────────────────────────────────┐
│                               Kubernetes Cluster                                   │
│                                                                                    │
│  ┌──────────────────────────────────────────────────────────────────────────────┐  │
│  │                           Cert Operator Pod                                  │  │
│  │                                                                              │  │
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌──────────────────────────────┐  │  │
│  │  │   Admission     │  │   Controller    │  │     AWS API Layer            │  │  │
│  │  │   Webhook       │  │   Manager       │  │                              │  │  │
│  │  │                 │  │                 │  │  ┌────────────────────────┐  │  │  │
│  │  │  /validate-*    │  │  Reconciler     │  │  │   Cache (TTL + LRU)    │  │  │  │
│  │  └────────┬────────┘  └────────┬────────┘  │  └───────────┬────────────┘  │  │  │
│  │           │                    │           │              │               │  │  │
│  │           │                    │           │  ┌───────────▼────────────┐  │  │  │
│  │           │                    │           │  │   Rate Limiter         │  │  │  │
│  │           │                    └───────────┼──│   (Token Bucket)       │  │  │  │
│  │           │                                │  └───────────┬────────────┘  │  │  │
│  │           │                                │              │               │  │  │
│  │           │                                │  ┌───────────▼────────────┐  │  │  │
│  │           │                                │  │   AWS SDK v2           │  │  │  │
│  │           │                                │  │   (ACM + Route53)      │  │  │  │
│  │           │                                │  └────────────────────────┘  │  │  │
│  │           │                                └──────────────────────────────┘  │  │
│  │           │                                                                  │  │
│  │  ┌────────▼────────┐  ┌─────────────────┐  ┌─────────────────────────────┐   │  │
│  │  │  Zone Registry  │  │    Metrics      │  │     Health Probes           │   │  │
│  │  │  (Multi-Zone)   │  │  (Prometheus)   │  │  (Liveness/Readiness)       │   │  │
│  │  └─────────────────┘  └─────────────────┘  └─────────────────────────────┘   │  │
│  └──────────────────────────────────────────────────────────────────────────────┘  │
│                                                                                    │
└────────────────────────────────────────────────────────────────────────────────────┘

State Machine

                    ┌─────────┐
                    │ Pending │◄────── New CR Created
                    └────┬────┘
                         │ Request ACM cert
                    ┌─────────┐
             ┌──────│ Created │
             │      └────┬────┘
             │           │ Create DNS record
             │           ▼
             │      ┌───────────┐
             │      │ Validated │
             │      └─────┬─────┘
             │            │ Wait for issuance
             │            ▼
             │      ┌─────────┐
             │      │  Ready  │◄──── Certificate Issued
             │      └─────────┘
             │      ┌─────────┐
             └─────▶│ Failed  │◄──── Unrecoverable error
                    └─────────┘

C. Operator Framework Choice

We used Kubebuilder (controller-runtime) rather than Operator SDK. Both are valid choices; the differences are mainly in scaffolding and CLI tooling.

Key controller-runtime features used:

  • Reconciler interface: Standard Reconcile(ctx, Request) (Result, error) pattern
  • Client: Cached Kubernetes client with Get, List, Create, Update, Delete
  • Event recorder: For creating Kubernetes events
  • Leader election: Built-in via manager options
  • Metrics: Automatic controller metrics, plus custom metrics registration
  • Webhook support: Admission webhook scaffolding

The learning curve is moderate. Understanding the reconciliation loop, watch predicates, and status subresource updates takes time. The controller-runtime book (https://book.kubebuilder.io/) is the canonical resource.
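A sketch of how the pieces register with the manager (webhook wiring and option values are assumptions; controller is sigs.k8s.io/controller-runtime/pkg/controller):

func (r *CertReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&acmv1.Cert{}).
        WithOptions(controller.Options{MaxConcurrentReconciles: 3}). // mirrors --max-concurrent-reconciles
        Complete(r)
}

func SetupCertWebhookWithManager(mgr ctrl.Manager) error {
    return ctrl.NewWebhookManagedBy(mgr).
        For(&acmv1.Cert{}).
        WithValidator(&CertCustomValidator{Client: mgr.GetClient()}).
        Complete()
}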

For AWS SDK, we use AWS SDK for Go v2. Key packages:

  • github.com/aws/aws-sdk-go-v2/service/acm
  • github.com/aws/aws-sdk-go-v2/service/route53
  • github.com/aws/aws-sdk-go-v2/config (for credential loading)