Using trace capabilities


The trace capabilities gadget allows us to see what capability security checks are triggered by applications running in Kubernetes Pods.

Linux capabilities allow finer-grained privilege control: they can grant individual root-like privileges to a process without giving it full root access, and they can also be taken away from root processes. If a pod directly executes programs as root, we can lock it down further by dropping capabilities. Sometimes we need to add capabilities that are not granted by default. You can see the list of default and available capabilities in the Docker documentation. In particular, if our pod runs as a non-root user (runAsUser: ID), we can grant it a few additional capabilities (think of it as being partly root) and still drop all unused capabilities to really lock it down.

On Kubernetes

Here we have a small demo app which logs failures caused by missing capabilities. Since none of the default capabilities is dropped, we have to find out which non-default capability we need to add.

$ cat docs/examples/app-set-priority.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: set-priority
  labels:
    k8s-app: set-priority
spec:
  selector:
    matchLabels:
      name: set-priority
  template:
    metadata:
      labels:
        name: set-priority
    spec:
      containers:
      - name: set-priority
        image: busybox
        command: [ "sh", "-c", "while /bin/true ; do nice -n -20 echo ; sleep 5; done" ]

$ kubectl apply -f docs/examples/app-set-priority.yaml
deployment.apps/set-priority created
$ kubectl logs -lname=set-priority
nice: setpriority(-20): Permission denied
nice: setpriority(-20): Permission denied

We can see the error messages in the pod’s log. Let’s use Inspektor Gadget to watch the capability checks:

$ kubectl gadget trace capabilities --selector name=set-priority
K8S.NODE         K8S.NAMESPACE  K8S.PODNAME             K8S.CONTAINER PID      COMM  SYSCALL      UID  CAP CAPNAME   AUDIT  VERDICT
minikube-docker  default        set-priorit…495c8-t88x8 set-priority  2711127  nice  setpriority  0    23  SYS_NICE  1      Deny
minikube-docker  default        set-priorit…495c8-t88x8 set-priority  2711260  nice  setpriority  0    23  SYS_NICE  1      Deny
minikube-docker  default        set-priorit…495c8-t88x8 set-priority  2711457  nice  setpriority  0    23  SYS_NICE  1      Deny
minikube-docker  default        set-priorit…495c8-t88x8 set-priority  2711619  nice  setpriority  0    23  SYS_NICE  1      Deny
minikube-docker  default        set-priorit…495c8-t88x8 set-priority  2711815  nice  setpriority  0    23  SYS_NICE  1      Deny
^C
Terminating...

The meaning of the columns is:

SYSCALL: the system call that caused the capability to be exercised
CAP: capability number
CAPNAME: capability name in a human friendly format
AUDIT: whether the kernel should audit the security request or not
VERDICT: whether the capability was present (Allow) or not (Deny)

We can leave the gadget with Ctrl-C. In the output we see that the SYS_NICE capability was checked when nice was run, and that the check was denied. We should therefore add it to our pod template for nice to work. We can also drop all other capabilities from the default list (see the Docker list mentioned above) since nice did not use them:

$ cat docs/examples/app-set-priority-locked-down.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: set-priority
  labels:
    k8s-app: set-priority
spec:
  selector:
    matchLabels:
      name: set-priority
  template:
    metadata:
      labels:
        name: set-priority
    spec:
      containers:
      - name: set-priority
        image: busybox
        command: [ "sh", "-c", "while /bin/true ; do nice -n -20 echo ; sleep 5; done" ]
        securityContext:
          capabilities:
            add: ["SYS_NICE"]
            drop: [all]

Let’s verify that our locked-down version works.

$ kubectl delete -f docs/examples/app-set-priority.yaml
deployment.apps "set-priority" deleted
$ kubectl apply -f docs/examples/app-set-priority-locked-down.yaml
deployment.apps/set-priority created
$ kubectl logs -lname=set-priority

The logs are clean, so everything works!

We can see the same checks but this time with the Allow verdict:

$ kubectl gadget trace capabilities --selector name=set-priority
K8S.NODE         K8S.NAMESPACE  K8S.PODNAME             K8S.CONTAINER PID      COMM  SYSCALL      UID  CAP CAPNAME   AUDIT  VERDICT
minikube-docker  default        set-priorit…66dff-nm5pt set-priority  2718069  nice  setpriority  0    23  SYS_NICE  1      Allow
minikube-docker  default        set-priorit…66dff-nm5pt set-priority  2718291  nice  setpriority  0    23  SYS_NICE  1      Allow
^C
Terminating...

You can now delete the deployment you created:

$ kubectl delete -f docs/examples/app-set-priority-locked-down.yaml

Interpreting advanced columns

Some columns are not displayed by default:

caps: the effective capability bitfield of the process
capsnames: same as caps in a human friendly format
currentuserns: the user namespace of the process
targetuserns: the user namespace that the kernel used to test the capability

They can be useful to understand advanced usage of capabilities. Let’s see two examples.

$ kubectl run -ti --rm --restart=Never \
    --image busybox --privileged testcaps -- \
    chroot /

$ kubectl gadget trace capabilities \
    -o columns=comm,syscall,capName,verdict,targetuserns,currentuserns,caps,capsnames
COMM             SYSCALL                      CAPNAME            VERDICT TARGETUSERNS        CURRENTUSERNS       CAPS                 CAPSNAMES
chroot           chroot                       SYS_CHROOT         Allow   4026531837          4026531837          3fffffffff           chown,dac_override,dac_…

In this example, targetuserns and currentuserns are the same. This is necessarily the case for chroot because the kernel tests the capability in this way:

if (!ns_capable(current_user_ns(), CAP_SYS_CHROOT))

The effective capability bitfield is “3fffffffff”. This can be decoded in this way:

$ capsh --decode=3fffffffff
0x0000003fffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read

The effective capability set includes CAP_SYS_CHROOT and targetuserns and currentuserns are the same. Hence the verdict “Allow”.
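The same decoding can be done without capsh. The following is a minimal sketch in Python, not part of Inspektor Gadget: the helper name decode_caps is hypothetical, and the name table assumes the capability numbering from the kernel's linux/capability.h (0 is CAP_CHOWN, 40 is CAP_CHECKPOINT_RESTORE).

```python
# Capability names indexed by capability number, assuming the numbering
# used in <linux/capability.h> (index 18 = sys_chroot, 21 = sys_admin, ...).
CAP_NAMES = [
    "chown", "dac_override", "dac_read_search", "fowner", "fsetid",
    "kill", "setgid", "setuid", "setpcap", "linux_immutable",
    "net_bind_service", "net_broadcast", "net_admin", "net_raw",
    "ipc_lock", "ipc_owner", "sys_module", "sys_rawio", "sys_chroot",
    "sys_ptrace", "sys_pacct", "sys_admin", "sys_boot", "sys_nice",
    "sys_resource", "sys_time", "sys_tty_config", "mknod", "lease",
    "audit_write", "audit_control", "setfcap", "mac_override",
    "mac_admin", "syslog", "wake_alarm", "block_suspend", "audit_read",
    "perfmon", "bpf", "checkpoint_restore",
]


def decode_caps(bitfield):
    """Return the names of all capabilities whose bit is set in bitfield."""
    names = []
    bit = 0
    while bitfield >> bit:
        if (bitfield >> bit) & 1:
            # Fall back to the bit number for capabilities newer than the table.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else str(bit))
        bit += 1
    return names


if __name__ == "__main__":
    caps = decode_caps(int("3fffffffff", 16))
    print(len(caps), "capabilities set")
    print("sys_chroot present:", "sys_chroot" in caps)
```

The function takes an integer, so a hexadecimal string as shown in the caps column has to be converted first with int(value, 16).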

It is also possible to see the list of capabilities in JSON:

$ kubectl gadget trace capabilities -o json | jq .
{
  "node": "minikube-docker",
  "namespace": "default",
  "pod": "testcaps",
  "container": "testcaps",
  "timestamp": 1677087968732237745,
  "type": "normal",
  "mountnsid": 4026533307,
  "pid": 3277678,
  "comm": "chroot",
  "syscall": "chroot",
  "uid": 0,
  "gid": 0,
  "cap": 18,
  "capName": "SYS_CHROOT",
  "audit": 1,
  "verdict": "Allow",
  "insetid": false,
  "targetuserns": 4026531837,
  "currentuserns": 4026531837,
  "caps": 274877906943,
  "capsNames": [
    ...
    "sys_rawio",
    "sys_chroot",
    "sys_ptrace",
    ...
  ]
}
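Because every event is a JSON object, the trace can be post-processed to list the capabilities a workload is missing. The following Python sketch collects the names of all denied capabilities. It relies only on the capName and verdict fields shown above; the function name denied_capabilities is hypothetical, the sample events are abbreviated by hand, and it assumes that without jq each event arrives as a single JSON object on its own line.

```python
import json


def denied_capabilities(lines):
    """Return the sorted names of capabilities denied at least once."""
    denied = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        event = json.loads(line)
        if event.get("verdict") == "Deny":
            denied.add(event["capName"])
    return sorted(denied)


if __name__ == "__main__":
    # Hand-written, abbreviated events using the field names from the
    # JSON output above; real events carry many more fields.
    sample = [
        '{"comm": "nice", "cap": 23, "capName": "SYS_NICE", "verdict": "Deny"}',
        '{"comm": "chroot", "cap": 18, "capName": "SYS_CHROOT", "verdict": "Allow"}',
    ]
    print(denied_capabilities(sample))
```

The resulting list is a starting point for the add entry of a securityContext, as in the locked-down deployment earlier. To process a live trace, sys.stdin could be passed as the lines argument.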

In the next example, we will create a new user namespace but without creating a new mount namespace. We will then attempt to create a new mount:

$ kubectl run -ti --rm --restart=Never \
    --image busybox --privileged testcaps -- \
    /bin/unshare -Urf /bin/mount -t tmpfs tmpfs /tmp

Let’s have a look at the generated logs for the mount process:

$ kubectl gadget trace capabilities -o json | jq .
{
  "node": "minikube-docker",
  "namespace": "default",
  "pod": "testcaps",
  "container": "testcaps",
  "timestamp": 1677088257998618652,
  "type": "normal",
  "mountnsid": 4026533307,
  "pid": 3287538,
  "comm": "mount",
  "syscall": "mount",
  "uid": 0,
  "gid": 0,
  "cap": 21,
  "capName": "SYS_ADMIN",
  "audit": 1,
  "verdict": "Deny",
  "insetid": false,
  "targetuserns": 4026531837,
  "currentuserns": 4026533310,
  "caps": 2199023255551,
  "capsNames": [
    ...
    "sys_pacct",
    "sys_admin",
    "sys_boot",
    ...
  ]
}

The capability set includes CAP_SYS_ADMIN. However, the verdict is “Deny”.

This can be explained by the interaction with user namespaces: the target and current user namespaces are different. This matters because the kernel tests the capability against the user namespace that owns the mount namespace, that is, the parent user namespace:

if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) || ...
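The reasoning applied to both examples can be summarised in a small Python sketch that works on the cap, caps, targetuserns and currentuserns fields of an event. The function name explain_verdict is hypothetical, and the logic is a deliberate simplification rather than the kernel's algorithm: the kernel also lets a process exercise its capabilities over user namespaces descended from its own, which is not modelled here.

```python
def explain_verdict(event):
    """Give a simplified explanation for the verdict of a capability check."""
    # Is the bit for the tested capability set in the effective bitfield?
    cap_held = bool((event["caps"] >> event["cap"]) & 1)
    same_ns = event["targetuserns"] == event["currentuserns"]
    if not cap_held:
        return "not in the effective set"
    if same_ns:
        return "held in the target user namespace"
    return "held, but in a different user namespace than the target"


if __name__ == "__main__":
    # Field values copied from the two JSON events shown in this section.
    chroot_event = {"cap": 18, "caps": 274877906943,
                    "targetuserns": 4026531837, "currentuserns": 4026531837}
    mount_event = {"cap": 21, "caps": 2199023255551,
                   "targetuserns": 4026531837, "currentuserns": 4026533310}
    print("chroot:", explain_verdict(chroot_event))
    print("mount:", explain_verdict(mount_event))
```

For the chroot event the capability is held in the namespace being tested, which matches the Allow verdict. For the mount event the capability is held only inside the newly created user namespace, which matches the Deny verdict.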

With `ig`

Start ig:

$ ig trace capabilities -r docker -c test
RUNTIME.CONTAINERNAME  PID      COMM     SYSCALL  UID  CAP CAPNAME      AUDIT  VERDICT

Start the test container exercising the capabilities:

$ docker run -ti --rm --name=test --privileged busybox
/ # touch /aaa ; chown 1:1 /aaa ; chmod 400 /aaa
/ # chroot /
/ # mkdir /mnt ; mount -t tmpfs tmpfs /mnt
/ # export PPID=$$;/bin/unshare -i sh -c "/bin/nsenter -i -t $PPID echo OK"
OK

Observe the resulting trace:

RUNTIME.CONTAINERNAME  PID      COMM     SYSCALL  UID  CAP CAPNAME      AUDIT  VERDICT
test                   2609137  chown    chown    0    0   CHOWN        1      Allow
test                   2609137  chown    chown    0    0   CHOWN        1      Allow
test                   2609138  chmod    chmod    0    3   FOWNER       1      Allow
test                   2609138  chmod    chmod    0    4   FSETID       1      Allow
test                   2609138  chmod    chmod    0    4   FSETID       1      Allow
test                   2609694  chroot   chroot   0    18  SYS_CHROOT   1      Allow
test                   2610364  mount    mount    0    21  SYS_ADMIN    1      Allow
test                   2610364  mount    mount    0    21  SYS_ADMIN    1      Allow
test                   2633270  unshare  unshare  0    21  SYS_ADMIN    1      Allow
test                   2633270  nsenter  setns    0    21  SYS_ADMIN    1      Allow
test                   2633270  nsenter  setns    0    21  SYS_ADMIN    1      Allow
