Container Context and PID 1

In a container context, teardown works as follows:

  • docker stop sends SIGTERM to the main process inside the container and, after a grace period (default 10s), SIGKILL
  • Kubernetes sends SIGTERM to the main process of each container in the Pod and, after a grace period (default 30s), SIGKILL
  • In interactive mode, pressing Ctrl+C sends SIGINT to the main process

To terminate gracefully and responsively, processes inside a container should handle SIGTERM and SIGINT. Many complaints about containers that refuse to exit can be found online[1][2]. Let’s dig into why.

Applications shipped in minimal container images, such as Distroless container images or FROM scratch static binaries, usually have an entrypoint like /app that runs directly as PID 1 in the container’s PID namespace.

But PID 1 is treated specially by Linux[3][4][5]:

  • The process will not terminate on SIGINT or SIGTERM unless it is coded to do so.
  • Within its own PID namespace it is effectively unkillable: signals for which it has installed no handler are ignored, including those that would terminate a regular process.
  • When the process with PID 1 dies for any reason, all other processes in the namespace are killed with SIGKILL.
  • When a process that has children dies for any reason, its children are reparented to the process with PID 1.
  • PID 1 therefore has a unique responsibility: reaping zombie processes.

The following Rust code builds a basic application that prints its PID and lives for 60 seconds. The full code is here.

// sleep.rs
use std::{thread, time, process};

fn main() {
    let pid = process::id();
    println!("pid {}", pid);

    let delay = time::Duration::from_secs(1);

    for i in 1..=60 {
        thread::sleep(delay);
        println!(". {}", i);
    }
}

Let’s build and run it in a container, then send SIGTERM or SIGINT to the sleep process. Our process, with PID 41, is a child of /bin/sh; it exits immediately with code 143 (SIGTERM) or 130 (SIGINT).

This is much like running a program in a terminal, testing it, and pressing Ctrl-C to stop it.

docker run --rm --name sleep -w /src  -it -v $PWD:/src rust:alpine   # $ docker exec sleep ps
/src # rustc sleep.rs                                                # PID   USER     TIME  COMMAND
/src # ./sleep                                                       #     1 root      0:00 /bin/sh
pid 41                                                               #    41 root      0:00 ./sleep
. 1                                                                  #
. 2                                                                  # $ docker exec sleep kill -s SIGTERM 41
. 3                                                                  #       
. 4                                                                  #          
. 5                                                                  #

Terminated
/src # echo $?
143

/src # ./sleep
pid 96
^C
/src # echo $?
130

PID 1 Behavior

When run as PID 1, the process is unstoppable from inside its PID namespace: none of SIGINT, SIGTERM, or SIGKILL has any effect.

docker run --rm --name sleep --entrypoint /src/sleep -it -v $PWD:/src rust:alpine   # $ docker exec sleep ps
pid 1                                                                               # PID   USER     TIME  COMMAND
. 1                                                                                 #     1 root      0:00 ./sleep
. 2                                                                                 #
^C^C. 3                                                                             # $ docker exec sleep kill -s SIGINT 1
^C^C^C^C^C. 4                                                                       # $ docker exec sleep kill -s SIGTERM 1
...                                                                                 # $ docker exec sleep kill -s SIGKILL 1
. 60                                                                                #

Because container processes are ordinary processes in the host PID namespace, sending SIGKILL from the host works as expected. This is how Docker and the kubelet deliver signals to PID 1 in each container.

The process won’t respond to SIGTERM or SIGINT, even when they are sent from the host, because it is not coded to handle them.

# docker run --rm --name sleep --entrypoint /src/sleep -it -v $PWD:/src alpine       
pid 1                                                                 # $ docker exec sleep ps
. 1                                                                   # PID   USER     TIME  COMMAND
. 2                                                                   #     1 root      0:00 ./sleep
. 3                                                                   # 
. 4                                                                   # $ ps -ef | grep sleep
. 5                                                                   # root      9521  9374  0 15:49 pts/0    00:00:00 /src/sleep 
. 6                                                                   # $ kill -s SIGINT  9521 // no effect
. 7                                                                   # $ kill -s SIGTERM 9521 // no effect
. 8                                                                   # $ kill -s SIGKILL 9521 // worked
#                                                                     #

Solution when the entrypoint is the application binary

The solution depends on the behavior you expect. If responding to SIGINT (Ctrl-C) or SIGTERM is the only requirement, languages whose runtime provides a default behavior, such as Go, need nothing more: the program aborts as soon as it receives SIGTERM or SIGINT.

For languages without such default behavior, like Rust, wrapping the container entrypoint with tini or dumb-init is the quickest fix.

tini or dumb-init acts as PID 1 in the container and immediately spawns the command as a child process, taking care to handle and forward signals as they are received.

# docker run --rm --name sleep --entrypoint /dumb-init -v $PWD:/src zengxu/alpine:init /src/sleep
pid 8                                                         # 
. 1                                                           # $ docker exec sleep ps
. 2                                                           # PID   USER     TIME  COMMAND
. 3                                                           #     1 root      0:00 /dumb-init /src/sleep
. 4                                                           #     8 root      0:00 /src/sleep
. 5                                                           #
. 6                                                           # docker exec sleep kill 1
. 7                                                           #
# 

Note that zengxu/alpine:init is built from this Dockerfile:

FROM alpine

ENV TINI_VERSION v0.19.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini-static /tini
RUN chmod +x /tini

RUN wget -O /dumb-init https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_x86_64
RUN chmod +x /dumb-init

The above solution works in any runtime context, including containerd, Docker, Podman, and Kubernetes.

When the runtime is Docker, tini is already bundled. Adding the --init flag to the run command makes Docker run /sbin/docker-init -- /src/sleep as PID 1, wrapping the target’s entrypoint.

# docker run --rm --name sleep --init -v $PWD:/src zengxu/alpine:init /src/sleep
pid 8                                                         # 
. 1                                                           # $ docker exec sleep ps
. 2                                                           # PID   USER     TIME  COMMAND
. 3                                                           #     1 root      0:00 /sbin/docker-init -- /src/sleep
. 4                                                           #     8 root      0:00 /src/sleep
. 5                                                           #
. 6                                                           # docker exec sleep kill 1
. 7                                                           #
# 

Solution when the entrypoint is a script

What about applications that must be started from a shell script? Bash and other shells don’t forward signals like SIGTERM to the process they are currently waiting on[6].

# cat start.sh                                  
#!/bin/sh                                                                         
/src/sleep

# docker run --rm --name sleep -v $PWD:/src --entrypoint /src/start.sh alpine
pid 7
. 1
^C^C. 2
^C^C^C. 3
...
. 60

This is why the annoying scenario in [1] happens: the container can’t be killed with Ctrl-C.

              +------------------- Container PID namespace ------------------
              |
SIGINT/SIGTERM -----> PID 1, /bin/sh /src/start.sh  (won't forward signals to child)
              |
              |      |
              |      |
              |      +----> PID 8, /src/sleep
              |
              +--------------------------------------------------------------

As pointed out by the answers in [6], exec-ing the process so that it replaces the shell process solves this problem. Writing a signal handler in the script is the most thorough approach, but it can be a little complex.

# cat exec.sh 
#!/bin/sh                                                                         
exec /tini -- /src/sleep

# docker run --rm --name sleep -v $PWD:/src --entrypoint /src/exec.sh zengxu/alpine:init
pid 7
. 1
. 2
^C
# 

What happens here is:

              +------------------- Container PID namespace ------------------
              |
SIGINT/SIGTERM -----> PID 1, /tini -- /src/sleep   (/tini replaces /bin/sh as PID 1 and forwards signals to child)
              |
              |      |
              |      |
              |      +----> PID 8, /src/sleep
              |
              +--------------------------------------------------------------
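
The shell’s exec builtin is a thin wrapper around the execve system call, so the same trick is available to launchers written in other languages. Below is a minimal Go sketch, assuming a hypothetical launcher that performs some setup and then hands its PID over to the real command; the resolve helper and the exit codes 126/127 are choices made for this example.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// resolve locates the executable for argv[0], searching $PATH when the
// name contains no slash. (Hypothetical helper for this sketch.)
func resolve(argv []string) (string, error) {
	if len(argv) == 0 {
		return "", errors.New("no command given")
	}
	return exec.LookPath(argv[0])
}

func main() {
	argv := os.Args[1:]
	path, err := resolve(argv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "launcher:", err)
		os.Exit(127)
	}

	// One-time setup (environment, config files, ...) would go here.

	// execve replaces the current process image: the PID stays the same,
	// so no intermediate process is left sitting between the runtime and
	// the command.
	err = syscall.Exec(path, argv, os.Environ())

	// Exec only returns when it failed.
	fmt.Fprintln(os.Stderr, "launcher: exec failed:", err)
	os.Exit(126)
}
```

Note that exec-ing /src/sleep directly would make it PID 1 again, with no handlers installed, which is why exec.sh hands over to /tini instead.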

Child reaping

In some cases, an application uses Unix fork to perform specific tasks, and the duty of reaping the children falls to the entrypoint process. Not every application carefully reaps its children by installing a SIGCHLD handler and calling the wait syscall in the parent, so these children may become long-lived zombie processes. A large number of long-lived zombies is harmful to a Unix system: they exhaust PIDs and process-table slots[7].
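
The SIGCHLD-plus-wait pattern just described can be sketched in Go as follows. This is an illustration under stated assumptions rather than the author's code: reapZombies is a made-up helper, and the program re-executes itself with a CHILD_ID environment variable only to have a few short-lived children to reap.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// reapZombies collects every child that has already exited, without
// blocking, and returns how many were reaped.
func reapZombies() int {
	n := 0
	for {
		var ws syscall.WaitStatus
		pid, err := syscall.Wait4(-1, &ws, syscall.WNOHANG, nil)
		if err != nil || pid <= 0 {
			return n // no children at all, or none has exited yet
		}
		n++
	}
}

func main() {
	if os.Getenv("CHILD_ID") != "" {
		return // child: exit immediately
	}

	// Subscribe to SIGCHLD before creating any child.
	sigchld := make(chan os.Signal, 1)
	signal.Notify(sigchld, syscall.SIGCHLD)

	self, err := os.Executable()
	if err != nil {
		fmt.Fprintln(os.Stderr, "executable:", err)
		os.Exit(1)
	}

	const children = 3
	for i := 1; i <= children; i++ {
		_, err := syscall.ForkExec(self, []string{self}, &syscall.ProcAttr{
			Env:   append(os.Environ(), fmt.Sprintf("CHILD_ID=%d", i)),
			Files: []uintptr{0, 1, 2},
		})
		if err != nil {
			fmt.Fprintln(os.Stderr, "forkexec:", err)
			os.Exit(1)
		}
	}

	// Several exits may be coalesced into one SIGCHLD, so each wake-up
	// reaps in a loop instead of assuming one child per signal.
	reaped := 0
	timeout := time.After(5 * time.Second)
	for reaped < children {
		select {
		case <-sigchld:
			reaped += reapZombies()
		case <-timeout:
			fmt.Fprintln(os.Stderr, "timed out waiting for children")
			os.Exit(1)
		}
	}
	fmt.Println("reaped", reaped, "children")
}
```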

The Go sample below tries to create a zombie every second:

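package main

import (
  "fmt"
  "log"
  "os"
  "syscall"
  "time"
)

func main() {
  pid := os.Getpid()
  for i := 1; i <= 60; i++ {
    fmt.Println(pid, ".", i)
    if _, isChild := os.LookupEnv("CHILD_ID"); !isChild {
      pwd, err := os.Getwd()
      if err != nil {
        log.Fatalf("getwd err: %s", err)
      }
      // Re-execute this binary as a child; the extra argument only labels it in ps output.
      args := append(os.Args, fmt.Sprintf("#child_%d_of_%d", i, os.Getpid()))
      childENV := []string{
        fmt.Sprintf("CHILD_ID=%d", i),
      }
      // The returned child PID is deliberately dropped: nobody ever waits on it.
      if _, err := syscall.ForkExec(args[0], args, &syscall.ProcAttr{
        Dir: pwd,
        Env: append(os.Environ(), childENV...),
        Sys: &syscall.SysProcAttr{
          Setsid: true,
        },
        Files: []uintptr{0, 1, 2}, // print message to the same pty
      }); err != nil {
        log.Fatalf("forkexec err: %s", err)
      }
    } else {
      os.Exit(0) // child exits directly and becomes a zombie
    }
    time.Sleep(time.Second)
  }
}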

If the application doesn’t reap its children, there are 60 zombies in the end. The full code is here. You can play with it as follows:

-- at window/panel 1
docker run --rm --name sleep -w /src  -it -v $PWD:/src ubuntu:jammy  
-- at window/panel 2
docker exec sleep ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.2  0.0 710720  1992 pts/0    Ssl+ 11:48   0:00 ./sleep-zombie
...
root       349  0.0  0.0      0     0 ?        Zs   11:49   0:00 [sleep-zombie] <defunct>
root       355  0.0  0.0   6408  1652 ?        Rs   11:49   0:00 ps aux

A Rust version of the demo is here.

For such cases, a container init system such as tini or dumb-init is a good choice. More at what-is-advantage-of-Tini?

As pointed out previously, in Docker you can simply use docker run --init to solve this problem. In K8s, the pause process can help reap child processes when spec.shareProcessNamespace: true is set. More details at share-process-namespace and the-almighty-pause-container.

Compared with relying on the runtime, building the init into the container image is the better choice: things are handled by default, without depending on correct configuration at run time.

Graceful shutdown guides

An application that should shut down gracefully must be coded to catch SIGTERM or SIGINT, do cleanup such as closing connections, and finally exit with 0. For Rust, this guide (handling-unix-kill-signals-in-rust) can be followed. For Go, this one (how-to-stop-http-listenandserve) can be followed. For other languages you are on your own, but the solutions are much the same.