Daniel's Blog

Fixing failure to mount volumes due to "an operation with the given Volume ID already exists"

After upgrading Rook to version 1.6.11, I noticed that we had some pods stuck in the "ContainerCreating" state.

celery-user-5b69dbbff9-xsd5z                   0/1     ContainerCreating   0          5h33m
celery-user-5b69dbbff9-xxx2f                   0/1     ContainerCreating   0          5h33m
django-6ff8784799-n77kx                        0/1     ContainerCreating   0          16h

Looking at these pods, I found an error in the volume mount. This seemed specific to the pod, as other pods in the same deployment started fine.

kubectl describe pod celery-user-5b69dbbff9-xsd5z                   
...
Events: 
  Type     Reason       Age                    From             Message
  ----     ------       ----                   ----             -------
  Warning  FailedMount  57m (x12 over 75m)     kubelet, gold-5  MountVolume.MountDevice failed for volume "pvc-90eca685-5b34-477a-83f2-77ea08e29177" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000001-9f6cc1a8-8f15-11ea-b24a-f271f33f8f2a already exists
...
  Warning  FailedMount  26m (x2 over 38m)      kubelet, gold-5  Unable to attach or mount volumes: unmounted volumes=[djangostatic djangomedia djangomediatests djangosecrets amigo-log], unattached volumes=[default-token-ngwhd djangostatic djangomedia djangomediatests djangosecrets amigo-log]: timed out waiting for the condition

So some volumes can't be mounted due to an existing operation. What is that?
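
The message comes from the ceph-csi driver itself: it keeps an internal lock per Volume ID and rejects new requests while it believes a previous operation on that ID is still running. One way to dig into what that operation is (a sketch, assuming you can identify which ceph-csi plugin pod, RBD or CephFS, runs on the affected node) would be to grep its logs and the node's kubelet journal for the IDs from the event:

$ kubectl --namespace rook-ceph logs <csi-plugin-pod-on-gold-5> --all-containers | grep 9f6cc1a8-8f15-11ea-b24a-f271f33f8f2a
$ ssh gold-5 journalctl -u kubelet | grep pvc-90eca685-5b34-477a-83f2-77ea08e29177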

First I switched to the rook-ceph namespace:

$ kubens rook-ceph
Context "kubernetes-admin@kubernetes" modified.
Active namespace is "rook-ceph".

I checked the cluster status:

$ kubectl exec rook-ceph-tools-6dcbb78845-6xkk8 -- ceph status
  cluster:
    id:     04461f64-e630-4891-bcea-0de24cf06c51
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum k,l,m (age 5m)
    mgr: a(active, since 12m)
    mds: myfs:1 {0=myfs-a=up:active} 1 up:standby-replay
    osd: 13 osds: 6 up (since 42h), 6 in (since 8d)

  data:
    pools:   4 pools, 73 pgs
    objects: 6.69M objects, 2.9 TiB
    usage:   8.8 TiB used, 46 TiB / 55 TiB avail
    pgs:     73 active+clean

  io:
    client:   1.2 KiB/s rd, 34 KiB/s wr, 2 op/s rd, 3 op/s wr

Everything is fine.

Then I took a look at the csi provisioners:

$ kubectl get pods -o wide | grep csi
csi-cephfsplugin-9dbc2                             3/3     Running            0          2d23h   10.1.1.246       gold-6   <none>           <none>
csi-cephfsplugin-gdz9c                             3/3     Running            0          2d23h   10.1.1.241       gold-1   <none>           <none>
csi-cephfsplugin-hfx8f                             3/3     Running            36         2d16h   10.1.1.245       gold-5   <none>           <none>
csi-cephfsplugin-nwdvw                             3/3     Running            0          2d23h   10.1.1.244       gold-4   <none>           <none>
csi-cephfsplugin-provisioner-68887cc8b5-f48n5      3/5     CrashLoopBackOff   1124       43h     10.233.103.231   gold-5   <none>           <none>
csi-cephfsplugin-provisioner-68887cc8b5-g9gm6      3/5     CrashLoopBackOff   1032       43h     10.233.97.56     gold-1   <none>           <none>
csi-rbdplugin-5j4tg                                3/3     Running            0          2d23h   10.1.1.241       gold-1   <none>           <none>
csi-rbdplugin-fhwlp                                3/3     Running            0          2d23h   10.1.1.246       gold-6   <none>           <none>
csi-rbdplugin-provisioner-75ff54ff6b-4582g         3/5     CrashLoopBackOff   1124       43h     10.233.103.159   gold-5   <none>           <none>
csi-rbdplugin-provisioner-75ff54ff6b-gn7vl         3/5     CrashLoopBackOff   1032       43h     10.233.97.180    gold-1   <none>           <none>
csi-rbdplugin-tqfrq                                3/3     Running            39         2d23h   10.1.1.245       gold-5   <none>           <none>
csi-rbdplugin-ts6cv                                3/3     Running            0          2d23h   10.1.1.244       gold-4   <none>           <none>

It seems the provisioners on the gold-5 and gold-1 nodes are stuck. I checked the last state:

kubectl describe pod csi-rbdplugin-provisioner-75ff54ff6b-gn7vl
...
  csi-provisioner: 
...
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Thu, 18 Aug 2022 23:01:37 +0000
      Finished:     Thu, 18 Aug 2022 23:01:37 +0000
...
  csi-resizer: 
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Thu, 18 Aug 2022 23:01:37 +0000
      Finished:     Thu, 18 Aug 2022 23:01:37 +0000
...
  csi-attacher: 
    State:          Running
      Started:      Thu, 18 Aug 2022 22:42:56 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Thu, 18 Aug 2022 21:32:07 +0000
      Finished:     Thu, 18 Aug 2022 22:42:55 +0000

Looks like the provisioner and resizer are terminating with exit code 2, and the attacher is terminating with exit code 255.

csi-provisioner shows this in the log:

$ kubectl logs csi-rbdplugin-provisioner-75ff54ff6b-gn7vl --previous csi-provisioner
unknown flag: --leader-election

csi-resizer shows this in the log:

$ kubectl logs csi-rbdplugin-provisioner-75ff54ff6b-gn7vl --previous csi-resizer
flag provided but not defined: -timeout

And csi-attacher shows this in the log:

$ kubectl logs csi-rbdplugin-provisioner-75ff54ff6b-gn7vl --previous csi-attacher
I0818 21:32:07.681303       1 main.go:92] Version: v2.1.0-0-g26d9e9b8
I0818 21:32:07.683313       1 connection.go:153] Connecting to unix:///csi/csi-provisioner.sock
I0818 21:32:07.684042       1 common.go:111] Probing CSI driver for readiness
W0818 21:32:07.685729       1 metrics.go:142] metrics endpoint will not be started because `metrics-address` was not specified.
I0818 21:32:07.687199       1 leaderelection.go:242] attempting to acquire leader lease  rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com...
E0818 21:32:23.404867       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: etcdserver: request timed out
E0818 21:32:52.535931       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: etcdserver: request timed out
E0818 21:33:09.110779       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: http2: server sent GOAWAY and closed the connection; LastStreamID=5, ErrCode=NO_ERROR, debug=""
E0818 21:33:48.099171       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: i/o timeout
E0818 21:34:01.674726       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: no route to host
E0818 21:34:23.697279       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: unexpected EOF
E0818 21:34:32.819510       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: connection refused
E0818 21:34:38.752454       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: unexpected EOF
E0818 21:34:44.692780       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: connection refused
E0818 21:34:50.276317       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: connection refused
E0818 21:35:03.114382       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: no route to host
E0818 21:35:24.367304       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=NO_ERROR, debug=""
E0818 21:35:36.906501       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: no route to host
E0818 21:36:07.566802       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Unauthorized
E0818 21:36:22.879799       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: unexpected EOF
E0818 21:36:29.789400       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: connection refused
E0818 21:36:41.418592       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: no route to host
E0818 21:36:48.117671       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: connection refused
E0818 21:36:54.877792       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: connection refused
E0818 21:37:03.953546       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: connection refused
E0818 21:37:10.265916       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: connection refused
E0818 21:37:16.486380       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: connection refused
E0818 21:37:37.547353       1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: leases.coordination.k8s.io "external-attacher-leader-rook-ceph-rbd-csi-ceph-com" is forbidden: User "system:serviceaccount:rook-ceph:rook-csi-rbd-provisioner-sa" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "rook-ceph"
I0818 21:37:45.990863       1 leaderelection.go:252] successfully acquired lease rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com
I0818 21:37:45.991040       1 controller.go:121] Starting CSI attacher
I0818 22:42:55.050245       1 leaderelection.go:288] failed to renew lease rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: failed to tryAcquireOrRenew context deadline exceeded
F0818 22:42:55.050318       1 leader_election.go:169] stopped leading

So that is all weird. The provisioner and resizer are being started with flags they don't recognize, which smells like a version mismatch between the sidecar images and the arguments the Rook operator now passes to them, and the attacher keeps losing its leader lease.
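
To see exactly which arguments the operator passes to a sidecar, they can be pulled out of the pod spec (using the container name from the describe output above):

$ kubectl get pod csi-rbdplugin-provisioner-75ff54ff6b-gn7vl \
    -o jsonpath='{.spec.containers[?(@.name=="csi-provisioner")].args}'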

Checking the CSI versions, I have Ceph-CSI v2.0.0:

$ kubectl --namespace rook-ceph get pod -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}' -l 'app in (csi-rbdplugin,csi-rbdplugin-provisioner,csi-cephfsplugin,csi-cephfsplugin-provisioner)' | sort | uniq
quay.io/cephcsi/cephcsi:v2.0.0
quay.io/k8scsi/csi-attacher:v2.1.0
quay.io/k8scsi/csi-node-driver-registrar:v1.2.0
quay.io/k8scsi/csi-provisioner:v1.4.0
quay.io/k8scsi/csi-resizer:v0.4.0

Checking the README of the v1.6.0 release...

Kubernetes 1.11 or newer is supported. I'm on 1.16 so that's fine.
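
If in doubt, the server version is easy to double-check:

$ kubectl version --short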

The CSI v3.3.0 driver is enabled by default. Mine is stuck at v2.0.0. Maybe that is the problem.

Checking old releases, it seems that Ceph-CSI 3.0 has been the default since Rook 1.4. When the cluster I'm working on was deployed, the operator had environment variables that pinned Ceph-CSI to version 2.0. Removing those version variables may allow the operator to upgrade to the latest CSI version. There is no mention of dropping support for version 2.0 of CSI.
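
Those pins should be visible right on the operator deployment. One way to list them without opening an editor (a sketch, assuming the operator container is the first one in the pod spec):

$ kubectl get deployment rook-ceph-operator \
    -o jsonpath='{range .spec.template.spec.containers[0].env[*]}{.name}={.value}{"\n"}{end}' | grep ROOK_CSI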

And indeed, running kubectl edit deployment rook-ceph-operator shows this in the environment:

        - name: ROOK_CSI_CEPH_IMAGE
          value: quay.io/cephcsi/cephcsi:v2.0.0
        - name: ROOK_CSI_REGISTRAR_IMAGE
          value: quay.io/k8scsi/csi-node-driver-registrar:v1.2.0
        - name: ROOK_CSI_RESIZER_IMAGE
          value: quay.io/k8scsi/csi-resizer:v0.4.0
        - name: ROOK_CSI_PROVISIONER_IMAGE
          value: quay.io/k8scsi/csi-provisioner:v1.4.0
        - name: ROOK_CSI_SNAPSHOTTER_IMAGE
          value: quay.io/k8scsi/csi-snapshotter:v1.2.2
        - name: ROOK_CSI_ATTACHER_IMAGE
          value: quay.io/k8scsi/csi-attacher:v2.1.0

That must be where the versions are being pinned. I'll remove them and cross my fingers that it will work.
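
Rather than deleting the lines by hand in the editor, the same thing can be done non-interactively with kubectl set env, where a trailing dash unsets a variable:

$ kubectl set env deployment/rook-ceph-operator \
    ROOK_CSI_CEPH_IMAGE- ROOK_CSI_REGISTRAR_IMAGE- ROOK_CSI_RESIZER_IMAGE- \
    ROOK_CSI_PROVISIONER_IMAGE- ROOK_CSI_SNAPSHOTTER_IMAGE- ROOK_CSI_ATTACHER_IMAGE-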

A master node was lost in the middle of the change, so that needs to be fixed before I can fully verify the result. However, before it went down, it looked like things were being fixed:

$ kubectl get pods
NAME                                               READY   STATUS              RESTARTS   AGE
csi-cephfsplugin-g6kvb                             3/3     Running             0          32s
csi-cephfsplugin-g7r2q                             3/3     Running             0          30s
csi-cephfsplugin-j574c                             3/3     Running             0          8s
csi-cephfsplugin-nwdvw                             3/3     Terminating         0          3d
csi-cephfsplugin-provisioner-5c545745f8-kzst5      5/5     Running             0          31s
csi-cephfsplugin-provisioner-5c545745f8-lptth      5/5     Running             0          31s
csi-rbdplugin-29d6l                                0/3     ContainerCreating   0          3s
csi-rbdplugin-4dhzj                                3/3     Running             0          33s
csi-rbdplugin-5v69r                                3/3     Running             0          17s
csi-rbdplugin-provisioner-7cd74df5b9-28hn6         5/5     Running             0          33s
csi-rbdplugin-provisioner-7cd74df5b9-wm6ft         5/5     Running             0          33s
csi-rbdplugin-zsjnp                                3/3     Running             0          41s

After 10 min or so the cluster came back online and everything was good:

$ kubectl get pods --namespace rook-ceph
NAME                                               READY   STATUS        RESTARTS   AGE
csi-cephfsplugin-g6kvb                             3/3     Running       0          21m
csi-cephfsplugin-g7r2q                             3/3     Running       0          21m
csi-cephfsplugin-j574c                             3/3     Running       6          21m
csi-cephfsplugin-phfxk                             3/3     Running       0          21m
csi-cephfsplugin-provisioner-5c545745f8-kzst5      5/5     Running       1          21m
csi-cephfsplugin-provisioner-5c545745f8-lptth      5/5     Running       5          21m
csi-rbdplugin-29d6l                                3/3     Running       6          21m
csi-rbdplugin-4dhzj                                3/3     Running       0          21m
csi-rbdplugin-5v69r                                3/3     Running       0          21m
csi-rbdplugin-provisioner-7cd74df5b9-28hn6         5/5     Running       10         21m
csi-rbdplugin-provisioner-7cd74df5b9-wm6ft         5/5     Running       3          21m
csi-rbdplugin-zsjnp                                3/3     Running       0          21m
...

Next up is upgrading to Rook 1.7, or first cleaning up by removing the offline nodes from both the Kubernetes cluster and the Ceph cluster.
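
For the node cleanup, the rough shape will probably look something like the sketch below, where <node> stands for an offline worker and the OSD IDs on it still need to be looked up first (Rook's OSD-removal docs cover the full procedure):

$ kubectl exec rook-ceph-tools-6dcbb78845-6xkk8 -- ceph osd tree
$ kubectl exec rook-ceph-tools-6dcbb78845-6xkk8 -- ceph osd purge <osd-id> --yes-i-really-mean-it
$ kubectl drain <node> --ignore-daemonsets --delete-local-data
$ kubectl delete node <node>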