Making the Health Check Not Time Out in the Face of Failure
As with all service-based software, it's good to have a health check endpoint so the health of the service can be determined by the services that depend on it.
However, that endpoint needs to not time out when the services it is checking time out.
In Kubernetes, a pod whose liveness probe times out will eventually be killed. When this happens, the kubelet first sends a SIGTERM signal and waits for the termination grace period (terminationGracePeriodSeconds); if the container still hasn't exited, it sends SIGKILL.
Monitoring our pods, this happens with our django service and results in this showing up:
django-5df7bd5fbf-2rhd7 1/1 Running 1 14h
django-5df7bd5fbf-4grdt 1/1 Running 2 14h
django-5df7bd5fbf-66mwq 1/1 Running 0 14h
django-5df7bd5fbf-c6284 1/1 Running 2 14h
django-5df7bd5fbf-cjlcw 1/1 Running 1 14h
django-5df7bd5fbf-l649h 1/1 Running 1 14h
django-5df7bd5fbf-mlsbt 1/1 Running 1 14h
There are a bunch of restarts (not a lot, but some).
Then looking into one of the pods:
$ kubectl describe pod django-5df7bd5fbf-c6284
...
State: Running
Started: Tue, 23 Aug 2022 06:52:17 -0700
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 23 Aug 2022 06:49:17 -0700
Finished: Tue, 23 Aug 2022 06:52:17 -0700
...
Now normally there would be some events for the failed liveness probe, but in this case I didn't find any. Instead I had to dig into Google Cloud's Logs Explorer to find out that these containers were being terminated by Kubernetes.
Using this query:
resource.type="k8s_node"
jsonPayload.MESSAGE=~"illing"
Shows errors like this:
I0823 13:52:07.475101 1984 kubelet.go:2028] "SyncLoop (probe)" probe="liveness" status="unhealthy" pod="app-amigocloud-com/django-5df7bd5fbf-4grdt"
I0823 13:52:07.474945 1984 prober.go:116] "Probe failed" probeType="Liveness" pod="app-amigocloud-com/django-5df7bd5fbf-4grdt" podUID=b4d950ab-d9a9-49ea-9052-f12088b6a97a containerName="django" probeResult=failure output="Get \"http://10.16.3.50:8000/healthz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
I0823 13:52:07.475973 1984 kuberuntime_manager.go:689] "Message for Container of pod" containerName="django" containerStatusID={Type:containerd ID:4e9f8f76ad1a5c558f660621f9379fc84b696cd4529bc1c5befb1568faba26e5} pod="app-amigocloud-com/django-5df7bd5fbf-4grdt" containerMessage="Container django failed liveness probe, will be restarted"
I0823 13:52:07.476056 1984 kuberuntime_container.go:714] "Killing container with a grace period override" pod="app-amigocloud-com/django-5df7bd5fbf-4grdt" podUID=b4d950ab-d9a9-49ea-9052-f12088b6a97a containerName="django" containerID="containerd://4e9f8f76ad1a5c558f660621f9379fc84b696cd4529bc1c5befb1568faba26e5" gracePeriod=60
time="2022-08-23T13:52:07.476692309Z" level=info msg="StopContainer for \"4e9f8f76ad1a5c558f660621f9379fc84b696cd4529bc1c5befb1568faba26e5\" with timeout 60 (s)"
time="2022-08-23T13:52:07.477308945Z" level=info msg="Stop container \"4e9f8f76ad1a5c558f660621f9379fc84b696cd4529bc1c5befb1568faba26e5\" with signal terminated"
So the kubelet found that the liveness probe failed, sent SIGTERM, and the container terminated in less than a second.
So what's going on here? This is our webserver, and it depends on other services like Redis and Postgres. If one of those goes down, we don't want to receive any traffic. Ideally this would be handled on an endpoint-by-endpoint basis, since some requests can still be processed (like getting the list of datasets) while others cannot (like accepting a file upload).
As the development team didn't implement an endpoint-by-endpoint health check, it's better for the entire site to go down if the databases are not running. This happens very rarely (roughly twice a year for an hour), so we work it into our maintenance plan, notify the clients, and so on. Google may also perform an update that causes some unexpected downtime, but for data consistency it's better this way.
So why did the liveness probe fail? It timed out. How do we know which service is causing the issue? We don't. So there are two improvements to make. First, if a service becomes unresponsive or the check hangs trying to communicate with it, terminate that connection and continue. The web service will still go down if the dependent services are unresponsive, but at least it should be possible to identify the bad service quickly and efficiently. Second, ensure that the dependent services have their own health checks so they can be restarted if something goes wrong.
So to do this, I need to call the function that does the health check asynchronously and give it a timeout.
There are a few options:
- Use gevent (we're using gunicorn so this might work if we're using gevent for our workers)
- Use multiprocessing/multithreading (if we're not using gevent then this is a possibility)
Gevent Method
First I checked the gunicorn settings:
-w 4 -b 0.0.0.0:8000 -t 180 amigoserver.wsgi
Production is a little different, but it's easy to see that no gevent worker class is specified (that would require something like -k gevent). The default worker type is sync,
so this won't work.
I tried it just to make sure and it's easy to see that the monkey patching isn't happening:
import gevent.socket as geventsocket
import socket
...
if socket.socket is geventsocket.socket:
    logger.info("Gevent monkey patch has occurred")
else:
    logger.info("Gevent monkey patch has not occurred")
This always showed "Gevent monkey patch has not occurred",
so I moved on to other methods.
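For reference, if the workers had been gevent workers (with monkey patching applied), the check could have been bounded with gevent.Timeout. This is only a sketch of what that might have looked like, reusing the validation_method, service, and timeout setting names from the multiprocessing code further down:
import gevent

# Only works if gevent has monkey-patched the blocking I/O the check uses,
# so a hanging socket call yields back to the gevent hub.
try:
    with gevent.Timeout(settings.HEALTH_CHECK_SERVICES_TIMEOUT_SECONDS):
        validation_method()
except gevent.Timeout:
    logger.error('HealthCheck for {service} took too long'.format(service=service))
    service_status = 'fail'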
Signal Method
Signals can be sent, but if we're waiting on IO it won't work, as the operation is stuck until it gets a chance to handle the signal.
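For completeness, here is roughly what a signal-based timeout would look like (a sketch only, using the same validation_method and timeout setting names as the multiprocessing code below; the handler only runs in the main thread and only once the interpreter regains control, which is exactly the problem):
import signal

class HealthCheckTimeout(Exception):
    pass

def _raise_timeout(signum, frame):
    raise HealthCheckTimeout()

signal.signal(signal.SIGALRM, _raise_timeout)
signal.alarm(settings.HEALTH_CHECK_SERVICES_TIMEOUT_SECONDS)  # schedule SIGALRM
try:
    validation_method()
except HealthCheckTimeout:
    service_status = 'fail'
finally:
    signal.alarm(0)  # cancel the alarm if the check finished in time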
Multiprocessing Method
The multiprocessing method is the one I chose. The difficulty is that "finally" blocks in the child will not run when the process has to be terminated after timing out (terminate() just sends SIGTERM and the child dies). I wish there was an option for that, but that's what we're living with in 2022.
To do that I imported the multiprocessing library:
import multiprocessing
Then used it around where I needed to call the health check functions:
process = multiprocessing.Process(target=validation_method)
process.start()
# Wait up to the configured timeout for the check to finish
process.join(settings.HEALTH_CHECK_SERVICES_TIMEOUT_SECONDS)
if process.is_alive():
    # The check is still running: kill it and mark the service as failed
    process.terminate()
    process.join()
    logger.error('HealthCheck for {service} took too long'.format(service=service))
    service_status = 'fail'
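One thing the snippet above doesn't cover is a check that fails outright inside the child process without timing out, since exceptions raised in the child don't propagate to the parent. Here is a sketch of how this could be wrapped into a reusable helper that uses the child's exit code to report failures (the names are illustrative, and it assumes the default fork start method on Linux, since a nested target function isn't picklable under spawn):
import multiprocessing
import sys

def run_check_with_timeout(validation_method, timeout_seconds):
    def _target():
        try:
            validation_method()
        except Exception:
            sys.exit(1)  # non-zero exit code tells the parent the check failed

    process = multiprocessing.Process(target=_target)
    process.start()
    process.join(timeout_seconds)
    if process.is_alive():
        process.terminate()  # SIGTERM: finally blocks in the child will not run
        process.join()
        return 'timeout'
    return 'pass' if process.exitcode == 0 else 'fail'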
I made the timeout an environment variable that can be adjusted later: I added it to the ConfigMap and to the Django settings. That way I can tweak it in production by manually changing the ConfigMap and restarting the Django workers, which gives me a quick and easy way to keep things running if the number turns out to be off.
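In settings.py that looks roughly like this (a sketch: the environment variable name is assumed to match the setting name, and the default value here is made up):
import os

# Read from the environment (populated from the ConfigMap); the default is illustrative
HEALTH_CHECK_SERVICES_TIMEOUT_SECONDS = int(
    os.environ.get('HEALTH_CHECK_SERVICES_TIMEOUT_SECONDS', '5')
)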
In addition to this, I added logging for each check so that when one fails, it gets logged.
Hopefully this will shed some light on which service is causing problems, since it seems the Django service's health check is failing while the dependent services' own health checks are all succeeding.
So either the dependent services' health checks are wrong, or there is some communication issue between Django and those services.
Notes:
[Gunicorn Worker Types Advice](https://medium.com/@nhudinhtuan/gunicorn-worker-types-practice-advice-for-better-performance-7a299bb8f929)