Fixing failure to mount volumes due to "an operation with the given Volume ID already exists"
After upgrading Rook to version 1.6.11, I noticed that we had some pods stuck in the "ContainerCreating" state.
celery-user-5b69dbbff9-xsd5z 0/1 ContainerCreating 0 5h33m
celery-user-5b69dbbff9-xxx2f 0/1 ContainerCreating 0 5h33m
django-6ff8784799-n77kx 0/1 ContainerCreating 0 16h
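For reference, the listing above is just the stuck pods filtered out of kubectl get pods, something like:
$ kubectl get pods | grep ContainerCreating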
Looking at these pods, I found an error mounting one of their volumes. It seemed specific to the pod, as other pods in the same deployment succeeded.
kubectl describe pod celery-user-5b69dbbff9-xsd5z
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 57m (x12 over 75m) kubelet, gold-5 MountVolume.MountDevice failed for volume "pvc-90eca685-5b34-477a-83f2-77ea08e29177" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000001-9f6cc1a8-8f15-11ea-b24a-f271f33f8f2a already exists
...
Warning FailedMount 26m (x2 over 38m) kubelet, gold-5 Unable to attach or mount volumes: unmounted volumes=[djangostatic djangomedia djangomediatests djangosecrets amigo-log], unattached volumes=[default-token-ngwhd djangostatic djangomedia djangomediatests djangosecrets amigo-log]: timed out waiting for the condition
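For context, the volume named in the first event is a dynamically provisioned PV, and the long Volume ID is its CSI volumeHandle; you can confirm which CSI driver owns it with something like:
$ kubectl get pv pvc-90eca685-5b34-477a-83f2-77ea08e29177 -o jsonpath='{.spec.csi.driver}{"  "}{.spec.csi.volumeHandle}{"\n"}'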
So some volumes can't be mounted due to an existing operation. What is that?
First I switched to the rook-ceph namespace:
$ kubens rook-ceph
Context "kubernetes-admin@kubernetes" modified.
Active namespace is "rook-ceph".
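kubens is just a convenience here; the plain-kubectl equivalent would be:
$ kubectl config set-context --current --namespace=rook-ceph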
I checked the cluster status:
$ kubectl exec rook-ceph-tools-6dcbb78845-6xkk8 -- ceph status
cluster:
id: 04461f64-e630-4891-bcea-0de24cf06c51
health: HEALTH_OK
services:
mon: 3 daemons, quorum k,l,m (age 5m)
mgr: a(active, since 12m)
mds: myfs:1 {0=myfs-a=up:active} 1 up:standby-replay
osd: 13 osds: 6 up (since 42h), 6 in (since 8d)
data:
pools: 4 pools, 73 pgs
objects: 6.69M objects, 2.9 TiB
usage: 8.8 TiB used, 46 TiB / 55 TiB avail
pgs: 73 active+clean
io:
client: 1.2 KiB/s rd, 34 KiB/s wr, 2 op/s rd, 3 op/s wr
Everything looks fine; Ceph itself reports healthy, so the problem must be somewhere else.
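For good measure, ceph health detail (run in the same tools pod) would spell out any warnings:
$ kubectl exec rook-ceph-tools-6dcbb78845-6xkk8 -- ceph health detail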
Then I took a look at the csi provisioners:
$ kubectl get pods -o wide | grep csi
csi-cephfsplugin-9dbc2 3/3 Running 0 2d23h 10.1.1.246 gold-6 <none> <none>
csi-cephfsplugin-gdz9c 3/3 Running 0 2d23h 10.1.1.241 gold-1 <none> <none>
csi-cephfsplugin-hfx8f 3/3 Running 36 2d16h 10.1.1.245 gold-5 <none> <none>
csi-cephfsplugin-nwdvw 3/3 Running 0 2d23h 10.1.1.244 gold-4 <none> <none>
csi-cephfsplugin-provisioner-68887cc8b5-f48n5 3/5 CrashLoopBackOff 1124 43h 10.233.103.231 gold-5 <none> <none>
csi-cephfsplugin-provisioner-68887cc8b5-g9gm6 3/5 CrashLoopBackOff 1032 43h 10.233.97.56 gold-1 <none> <none>
csi-rbdplugin-5j4tg 3/3 Running 0 2d23h 10.1.1.241 gold-1 <none> <none>
csi-rbdplugin-fhwlp 3/3 Running 0 2d23h 10.1.1.246 gold-6 <none> <none>
csi-rbdplugin-provisioner-75ff54ff6b-4582g 3/5 CrashLoopBackOff 1124 43h 10.233.103.159 gold-5 <none> <none>
csi-rbdplugin-provisioner-75ff54ff6b-gn7vl 3/5 CrashLoopBackOff 1032 43h 10.233.97.180 gold-1 <none> <none>
csi-rbdplugin-tqfrq 3/3 Running 39 2d23h 10.1.1.245 gold-5 <none> <none>
csi-rbdplugin-ts6cv 3/3 Running 0 2d23h 10.1.1.244 gold-4 <none> <none>
It seems the provisioners on the gold-5 and gold-1 nodes are stuck. I checked their last state:
kubectl describe pod csi-rbdplugin-provisioner-75ff54ff6b-gn7vl
...
csi-provisioner:
...
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Thu, 18 Aug 2022 23:01:37 +0000
Finished: Thu, 18 Aug 2022 23:01:37 +0000
...
csi-resizer:
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Thu, 18 Aug 2022 23:01:37 +0000
Finished: Thu, 18 Aug 2022 23:01:37 +0000
...
csi-attacher:
State: Running
Started: Thu, 18 Aug 2022 22:42:56 +0000
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Thu, 18 Aug 2022 21:32:07 +0000
Finished: Thu, 18 Aug 2022 22:42:55 +0000
It looks like the provisioner and resizer are exiting with code 2, and the attacher is exiting with code 255.
csi-provisioner shows this in the log:
$ kubectl logs csi-rbdplugin-provisioner-75ff54ff6b-gn7vl --previous csi-provisioner
unknown flag: --leader-election
csi-resizer shows this in the log:
$ kubectl logs csi-rbdplugin-provisioner-75ff54ff6b-gn7vl --previous csi-resizer
flag provided but not defined: -timeout
And csi-attacher, which exited with 255, shows a lot of leader-election trouble:
$ kubectl logs csi-rbdplugin-provisioner-75ff54ff6b-gn7vl --previous csi-attacher
I0818 21:32:07.681303 1 main.go:92] Version: v2.1.0-0-g26d9e9b8
I0818 21:32:07.683313 1 connection.go:153] Connecting to unix:///csi/csi-provisioner.sock
I0818 21:32:07.684042 1 common.go:111] Probing CSI driver for readiness
W0818 21:32:07.685729 1 metrics.go:142] metrics endpoint will not be started because `metrics-address` was not specified.
I0818 21:32:07.687199 1 leaderelection.go:242] attempting to acquire leader lease rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com...
E0818 21:32:23.404867 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: etcdserver: request timed out
E0818 21:32:52.535931 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: etcdserver: request timed out
E0818 21:33:09.110779 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: http2: server sent GOAWAY and closed the connection; LastStreamID=5, ErrCode=NO_ERROR, debug=""
E0818 21:33:48.099171 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: i/o timeout
E0818 21:34:01.674726 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: no route to host
E0818 21:34:23.697279 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: unexpected EOF
E0818 21:34:32.819510 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: connection refused
E0818 21:34:38.752454 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: unexpected EOF
E0818 21:34:44.692780 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: connection refused
E0818 21:34:50.276317 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: connection refused
E0818 21:35:03.114382 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: no route to host
E0818 21:35:24.367304 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=NO_ERROR, debug=""
E0818 21:35:36.906501 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: no route to host
E0818 21:36:07.566802 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Unauthorized
E0818 21:36:22.879799 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: unexpected EOF
E0818 21:36:29.789400 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: connection refused
E0818 21:36:41.418592 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: no route to host
E0818 21:36:48.117671 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: connection refused
E0818 21:36:54.877792 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: connection refused
E0818 21:37:03.953546 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: connection refused
E0818 21:37:10.265916 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: connection refused
E0818 21:37:16.486380 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: Get https://10.233.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: dial tcp 10.233.0.1:443: connect: connection refused
E0818 21:37:37.547353 1 leaderelection.go:331] error retrieving resource lock rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: leases.coordination.k8s.io "external-attacher-leader-rook-ceph-rbd-csi-ceph-com" is forbidden: User "system:serviceaccount:rook-ceph:rook-csi-rbd-provisioner-sa" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "rook-ceph"
I0818 21:37:45.990863 1 leaderelection.go:252] successfully acquired lease rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com
I0818 21:37:45.991040 1 controller.go:121] Starting CSI attacher
I0818 22:42:55.050245 1 leaderelection.go:288] failed to renew lease rook-ceph/external-attacher-leader-rook-ceph-rbd-csi-ceph-com: failed to tryAcquireOrRenew context deadline exceeded
F0818 22:42:55.050318 1 leader_election.go:169] stopped leading
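As an aside, the lease the attacher keeps fighting over is an ordinary coordination.k8s.io object and can be inspected directly (we are already in the rook-ceph namespace):
$ kubectl get lease external-attacher-leader-rook-ceph-rbd-csi-ceph-com -o yaml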
So that is all weird. The unknown-flag errors in particular smell like sidecar images that are too old for the arguments the operator is now passing them.
Checking the CSI images, I see the driver is still at cephcsi v2.0.0:
$ kubectl --namespace rook-ceph get pod -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}' -l 'app in (csi-rbdplugin,csi-rbdplugin-provisioner,csi-cephfsplugin,csi-cephfsplugin-provisioner)' | sort | uniq
quay.io/cephcsi/cephcsi:v2.0.0
quay.io/k8scsi/csi-attacher:v2.1.0
quay.io/k8scsi/csi-node-driver-registrar:v1.2.0
quay.io/k8scsi/csi-provisioner:v1.4.0
quay.io/k8scsi/csi-resizer:v0.4.0
Checking the readme of the v1.6.0 release...
Kubernetes 1.11 or newer is supported. I'm on 1.16 so that's fine.
CSI v3.3.0 driver enabled by default. Mine is stuck at 2.0.0, so maybe that is the problem.
Checking older releases, it seems Ceph-CSI 3.0 has been the default since Rook 1.4. When the cluster I'm working on was deployed, the operator was given environment variables pinning Ceph-CSI to version 2.0. Removing those variables should let the operator upgrade to the latest CSI version. There is no mention of dropping support for CSI 2.0, though.
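As a sanity check that the operator itself really is on 1.6.x, its image can be printed directly (output omitted here):
$ kubectl get deployment rook-ceph-operator -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'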
Running kubectl edit deployment rook-ceph-operator shows this in the environment:
- name: ROOK_CSI_CEPH_IMAGE
value: quay.io/cephcsi/cephcsi:v2.0.0
- name: ROOK_CSI_REGISTRAR_IMAGE
value: quay.io/k8scsi/csi-node-driver-registrar:v1.2.0
- name: ROOK_CSI_RESIZER_IMAGE
value: quay.io/k8scsi/csi-resizer:v0.4.0
- name: ROOK_CSI_PROVISIONER_IMAGE
value: quay.io/k8scsi/csi-provisioner:v1.4.0
- name: ROOK_CSI_SNAPSHOTTER_IMAGE
value: quay.io/k8scsi/csi-snapshotter:v1.2.2
- name: ROOK_CSI_ATTACHER_IMAGE
value: quay.io/k8scsi/csi-attacher:v2.1.0
That must be where the versions are being pinned. I'll remove those variables and cross my fingers that it works.
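For the record, the same cleanup can be done non-interactively. kubectl set env with a trailing dash unsets a variable, so something like this should be equivalent to deleting the lines in the editor:
$ kubectl -n rook-ceph set env deployment/rook-ceph-operator \
    ROOK_CSI_CEPH_IMAGE- \
    ROOK_CSI_REGISTRAR_IMAGE- \
    ROOK_CSI_RESIZER_IMAGE- \
    ROOK_CSI_PROVISIONER_IMAGE- \
    ROOK_CSI_SNAPSHOTTER_IMAGE- \
    ROOK_CSI_ATTACHER_IMAGE-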
A master node was lost in the middle of the change, so that needs to be fixed before I can fully verify the result. Before it went down, though, it looked like things were sorting themselves out:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-g6kvb 3/3 Running 0 32s
csi-cephfsplugin-g7r2q 3/3 Running 0 30s
csi-cephfsplugin-j574c 3/3 Running 0 8s
csi-cephfsplugin-nwdvw 3/3 Terminating 0 3d
csi-cephfsplugin-provisioner-5c545745f8-kzst5 5/5 Running 0 31s
csi-cephfsplugin-provisioner-5c545745f8-lptth 5/5 Running 0 31s
csi-rbdplugin-29d6l 0/3 ContainerCreating 0 3s
csi-rbdplugin-4dhzj 3/3 Running 0 33s
csi-rbdplugin-5v69r 3/3 Running 0 17s
csi-rbdplugin-provisioner-7cd74df5b9-28hn6 5/5 Running 0 33s
csi-rbdplugin-provisioner-7cd74df5b9-wm6ft 5/5 Running 0 33s
csi-rbdplugin-zsjnp 3/3 Running 0 41s
After 10 minutes or so the cluster came back online and everything looked good:
$ kubectl get pods --namespace rook-ceph
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-g6kvb 3/3 Running 0 21m
csi-cephfsplugin-g7r2q 3/3 Running 0 21m
csi-cephfsplugin-j574c 3/3 Running 6 21m
csi-cephfsplugin-phfxk 3/3 Running 0 21m
csi-cephfsplugin-provisioner-5c545745f8-kzst5 5/5 Running 1 21m
csi-cephfsplugin-provisioner-5c545745f8-lptth 5/5 Running 5 21m
csi-rbdplugin-29d6l 3/3 Running 6 21m
csi-rbdplugin-4dhzj 3/3 Running 0 21m
csi-rbdplugin-5v69r 3/3 Running 0 21m
csi-rbdplugin-provisioner-7cd74df5b9-28hn6 5/5 Running 10 21m
csi-rbdplugin-provisioner-7cd74df5b9-wm6ft 5/5 Running 3 21m
csi-rbdplugin-zsjnp 3/3 Running 0 21m
...
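To confirm the operator actually rolled out newer CSI images, the same image-listing command from before can be re-run; I'd expect to see cephcsi v3.x and matching sidecars now:
$ kubectl --namespace rook-ceph get pod -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}' -l 'app in (csi-rbdplugin,csi-rbdplugin-provisioner,csi-cephfsplugin,csi-cephfsplugin-provisioner)' | sort | uniq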
Next up is either upgrading Rook to 1.7, or cleaning up the nodes: removing the offline ones from both the Kubernetes cluster and the Ceph cluster.
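For that node cleanup, the rough shape is deleting the dead nodes from Kubernetes and then purging their OSDs from Ceph via the toolbox; the names and IDs below are placeholders, not the real hosts:
$ kubectl drain <node> --ignore-daemonsets --delete-local-data --force
$ kubectl delete node <node>
# then, inside the rook-ceph-tools pod:
$ ceph osd out osd.<id>
$ ceph osd purge <id> --yes-i-really-mean-it
$ ceph osd crush rm <node>   # removes the now-empty host bucket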