It was DNS

Summary

DNS can be challenging in mixed IPv4 and IPv6 environments (dual stack). Many of the basics our current systems rely on are taken for granted without much attention paid to how they actually work. DNS is one of those things. In order to debug complex, intermittent problems you must understand how the basics operate. There are significant differences in DNS resolution between glibc and musl. Know them. Also be familiar with how routing works and with what NAT64 and DNS64 are.

Background

The inspiration for this post and the talk I gave at the GDG Tampere meetup comes from real-life experiences with one of my clients. I'll omit names and describe the setup at a high level.

  • AWS VPC with dual stack (IPv4 and IPv6) subnets
    • Let’s say we have the IPv6 address space 2001:db8::/32 (a prefix reserved for documentation) and the IPv4 address space 10.10.10.0/24 (an RFC 1918 private range) for the workload VPC, and 10.20.0.0/16 for the centralized egress VPC
  • EKS cluster
    • We use IPv6 inside the cluster
    • Pods get only an IPv6 address
  • Centralized networking with dedicated VPC & NAT Gateways and Transit Gateway for egress traffic
    • Our default route for both IPv4 and IPv6 points towards the TGW attachment in our VPC which means that all non-local traffic is directed towards our centralized egress VPC
    • The NAT Gateways in the egress VPC subnets are allocated IPv4 addresses; let’s assume those are 10.20.1.10 and 10.20.1.20
    • IPv4-to-IPv6 address translation is done with the 64:ff9b::/96 prefix, as specified in RFC 6052 (see the sketch right after this list for how the embedding works)
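
To make the 64:ff9b::/96 part concrete: with a /96 prefix the original IPv4 address is simply embedded into the low 32 bits of the IPv6 address. Here is a minimal sketch of that mapping in Python, using documentation addresses rather than our real ones:

    import ipaddress

    def synthesize_nat64(ipv4: str, prefix: str = "64:ff9b::/96") -> ipaddress.IPv6Address:
        """Embed an IPv4 address into a NAT64/DNS64 prefix per RFC 6052.
        For a /96 prefix the IPv4 address occupies the low 32 bits."""
        net = ipaddress.ip_network(prefix)
        addr = ipaddress.IPv4Address(ipv4)
        return ipaddress.IPv6Address(int(net.network_address) | int(addr))

    print(synthesize_nat64("192.0.2.10"))  # -> 64:ff9b::c000:20a

This is exactly the embedding DNS64 uses when it synthesizes an AAAA answer for a name that only has an A record.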

Problem

Setting aside the complexities of routing, setup and so on, we encountered occasional problems when dev teams ran their applications. There were non-deterministic connectivity issues, “every second call fails” situations and other strange problems that usually looked as if the service were in some sort of failure state that the load balancer hadn’t properly removed it from. Sometimes the problems would go away on their own, and sometimes dev teams struggled with them for a long time.

Of course this sort of situation is not something we want to have in production, so we needed to figure out the problems and fix them. This debugging led us down a deep rabbit hole commonly known as DNS.

What did we discover?

A lot. We probably discovered more than we would have liked. In order to understand why some systems behave erratically, we need to understand the “happy path”, so let’s go through how we assumed things would work:

  • A K8s pod is running with an IPv6 address. No IPv4 address.
  • The workload wants to make an API call to a service running on-premises, outside our network, which means the traffic has to go through our egress networking. The external service has an IPv4 address.
  • The workload makes a DNS query, and since the target is an IPv4 address and we only operate on IPv6, the answer must be translated from IPv4 to IPv6. The specific details of how this happens can be read here
    • DNS64 is handled by the Amazon-provided DNS resolver (the NAT Gateway in our egress VPC handles the NAT64 side)
    • The 64:ff9b::/96 prefix mentioned earlier is used for the DNS64 synthesis; the workload gets the DNS response and makes the request (the sketch after this list shows how to see these synthesized answers from inside a pod)
  • 64:ff9b::/96 has a route pointing to the NAT Gateway, so the traffic is sent there
  • The NAT Gateway converts the 64:ff9b::/96 address back to IPv4 and applies NAT
    • To the target service, the traffic appears to originate from one of the IPv4 addresses allocated to the NAT Gateways
  • The connection is established, the API call is made and everyone is happy
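
On the happy path this is also easy to verify from inside a pod: for an IPv4-only target, the resolver should return only synthesized AAAA answers inside 64:ff9b::/96 and no plain IPv4 entries at all. A quick sketch, assuming Python is available in the image and using a placeholder hostname:

    import socket

    HOST = "api.onprem.example.com"  # placeholder for the IPv4-only target

    # On an IPv6-only pod behind DNS64 every answer should be an IPv6 address
    # inside 64:ff9b::/96; a plain IPv4 entry here is already a warning sign.
    for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
            HOST, 443, type=socket.SOCK_STREAM):
        label = "IPv6" if family == socket.AF_INET6 else "IPv4"
        print(label, sockaddr[0])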

Of course, things do not behave this way as we found out.

This is when DNS rears its head and starts messing around. Let’s first clarify what happens. The problem is that sometimes, when the workload makes a DNS query, the result contains an IPv4 answer. It might also get an IPv6 answer, but the IPv4 one can come first. Since the target now appears to be IPv4, the NAT64 translation suddenly happens on the node where the pod is running! This in turn makes the connection seemingly originate from the node. So instead of the NAT Gateway addresses 10.20.1.10 and 10.20.1.20, the connection originates from 10.10.10.X, the address space where the cluster is running. The target service only allows requests from the NAT Gateway addresses, so the connection attempt is blocked and the packets are dropped. The workload gets no response; it just waits for return packets that never arrive.
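
If a team controls its own socket handling, one possible application-side guard is to ask the resolver for IPv6 results only, so a stray A record can never be picked first. This is a sketch of the idea, not the fix we settled on:

    import socket

    def connect_v6_only(host: str, port: int, timeout: float = 5.0) -> socket.socket:
        """Resolve and connect using IPv6 answers only, so an A record that
        happens to come back first cannot bypass the NAT64 path."""
        last_err = None
        for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
                host, port, family=socket.AF_INET6, type=socket.SOCK_STREAM):
            try:
                sock = socket.socket(family, socktype, proto)
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return sock
            except OSError as err:
                last_err = err
        raise last_err or OSError(f"no usable IPv6 answers for {host}:{port}")

Many HTTP libraries let you plug in a custom resolver or force an address family, but how easy that is varies a lot by language and library.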

This kinda sucks. Things are made even trickier by the fact that different resolvers behave differently: glibc, musl and language-specific resolvers each have their own quirks. Debugging is also challenging because tooling support is lacking (I’m looking at you, AWS VPC Reachability Analyzer).
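
A useful first question when a team reports this kind of flakiness is which libc their image actually ships, since that determines which resolver behaviour you are debugging. A tiny sketch, assuming Python is present in the container:

    import platform

    # platform.libc_ver() identifies glibc; an empty result usually means the
    # interpreter is linked against something else, typically musl on Alpine.
    libc, version = platform.libc_ver()
    print(f"{libc} {version}" if libc else "not glibc (likely musl)")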

So, in the end, there is no single absolute solution. The only proper advice is to be aware of what platforms and languages your dev teams are using and to be mindful of the basics. Sure, how technology choices get made is a huge topic in itself; with wide autonomy, the wide spectrum of technologies in use is one of the costs you pay to enable it.