You are visiting relatives, staying overnight. How do you leave your room when it is time to go? Or your Airbnb apartment? Do you leave garbage lying around, water running, the door open and the stove on? Probably not. At least I hope you do not. The same should apply to your applications. Applications should shut down gracefully (and preferably pretty fast). Graceful shutdowns are even more important nowadays, when an orchestration platform such as Kubernetes is handling the lifecycle of your app. Neglecting graceful shutdown causes a lot of headaches for operations:
- Scaling is slower
- Resources leak
- Data stores are left in an inconsistent state and you might lose data
- Operations teams have a harder time managing the app
Follow the simple instructions laid out in this post and keep your operations smooth and operators (or your own team) happy!
Do not swallow OS signals
Your app should be PID 1 in the container, so run your application directly as the ENTRYPOINT.
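A minimal sketch, assuming your binary is built into the image at /app/server; the exec form runs the binary itself as PID 1, so it receives signals directly:

```dockerfile
# Exec form: the binary becomes PID 1 and receives signals directly.
# /app/server is an assumed binary path for illustration.
ENTRYPOINT ["/app/server"]

# Avoid the shell form, which wraps your binary in /bin/sh as PID 1:
# ENTRYPOINT /app/server
```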
Avoid shell wrappers since those can swallow signals. If you are absolutely sure you need such a tool, make sure you use one that is capable of forwarding any OS signals it receives to your application. (Note: running job-management tools is not a good reason. It usually indicates that you are trying to dynamically run parts of your application inside the container, implying multiple critical processes. That is a huge antipattern. You should always have only one critical process inside your application to keep things simple and easy to reason about.)
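If you truly do need a wrapper, a minimal init such as tini is designed for exactly this; a sketch, assuming tini is copied into the image at /tini and your binary lives at /app/server:

```dockerfile
# tini runs as PID 1, forwards signals to the child and reaps zombies.
# The paths (/tini, /app/server) are assumptions for illustration.
ENTRYPOINT ["/tini", "--", "/app/server"]
```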
Handle SIGINT and SIGTERM
You should listen and act on the SIGINT and SIGTERM signals. This is how the underlying runtime notifies your app that it is time to stop whatever it is doing, release resources and quit. Furthermore, your shutdown shouldn't take very long: finish in-flight requests, do not accept new ones, and release any locks and connections you have allocated.
Here is how you can be sure that you are playing nice with your runtime in Go.
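A minimal sketch, assuming a plain net/http server on :8080; it waits for SIGINT or SIGTERM, then gives in-flight requests ten seconds to finish:

```go
package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// ctx is cancelled as soon as SIGINT or SIGTERM arrives.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	srv := &http.Server{Addr: ":8080"}

	go func() {
		if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Block until a shutdown signal is received.
	<-ctx.Done()
	log.Println("shutdown signal received")

	// Give in-flight requests up to 10 seconds to complete,
	// then let the process exit.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	if err := srv.Shutdown(shutdownCtx); err != nil {
		log.Printf("graceful shutdown failed: %v", err)
	}
	log.Println("server stopped")
}
```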
Catch those signals! There is no reason not to implement proper signal handling.
Respect timeouts
When a pod needs to be terminated, Kubernetes sends a SIGTERM signal to the main container process (PID 1) and waits for the pod to terminate. This is the reason why you want your application to be PID 1. If the pod does not terminate within the 30-second grace period, Kubernetes sends a SIGKILL signal to force termination. Note that there is no way to handle SIGKILL: your pod will die and anything you have ongoing/open is lost. Having to resort to SIGKILL should be the absolute worst-case scenario and you never want your pods to be killed that way.
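For reference, the grace period is set per pod via terminationGracePeriodSeconds; a sketch with placeholder names, making the 30-second default explicit:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app        # placeholder name
spec:
  # Default is 30; Kubernetes sends SIGKILL once this period expires.
  terminationGracePeriodSeconds: 30
  containers:
    - name: my-app
      image: my-app:latest   # placeholder image
```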
To avoid being forcefully terminated, your app should clean up and quit as fast as possible. Even though the grace period can be increased, you shouldn't need to increase it. If you have operations running inside the pod that require over 30 seconds to complete (let alone more, if you need to increase the grace period), you are probably doing something that smells fishy. This is especially true for any application that serves as a frontend. An HTTP request running over 30 seconds? You are doing too much: capture the work, store it somewhere and return a token which can be used to query the job status, as shown in the sketch below. Do not block a frontend server with long-running jobs. The only place where longer-running jobs are valid is backend job queues, but even then try to make jobs small so they execute quickly.
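Here is a sketch of that token pattern; the Job type, the in-memory ID counter and the queue channel are hypothetical placeholders, not a library API. The handler enqueues the work and immediately returns 202 with an ID the client can poll:

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"sync/atomic"
)

// Job is a placeholder for whatever unit of work your app processes.
type Job struct{ ID string }

// Naive in-memory ID source, for illustration only.
var jobCounter atomic.Int64

func newJobID() string { return strconv.FormatInt(jobCounter.Add(1), 10) }

// submitJob hands the work to a background worker via the queue channel
// and returns 202 Accepted with a job ID instead of blocking the request.
func submitJob(queue chan<- Job) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		job := Job{ID: newJobID()}
		queue <- job
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusAccepted)
		fmt.Fprintf(w, `{"job_id":%q}`, job.ID)
	}
}
```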
In the Go example above there is a 10-second timeout after which the application will close. Implement a shutdown timeout for your app as well.