Removing OSDs from our Ceph cluster
We lost a Kubernetes node, so I looked at its log using iLO (Integrated Lights-Out, a computer within a computer that allows me to see the console remotely).
I would never recommend using the HP servers because of the monthly fee to use them. Why buy hardware if you have to pay to fully use it? What a rip-off.
Anyway, it's in a kernel panic and someone needs to visit the colo, which won't happen for another month. In the meantime, the Ceph cluster needs to know that the OSDs on that node are now gone and need to be removed.
Removing the node from the cluster
To remove the node from the cluster, I just edit the CephCluster resource:
$ kubectl edit cephcluster rook-ceph
...
storage:
  config: null
  nodes:
  - config: null
    devices:
    - config: null
      name: sdb
    - config: null
      name: sdc
    name: gold-1
    resources: {}
  - config: null
    devices:
    - config: null
      name: sdb
    - config: null
      name: sdc
    name: gold-4
    resources: {}
  - config: null
    devices:
    - config: null
      name: sda
    - config: null
      name: sdb
    name: gold-6
    resources: {}
...
gets changed to:
storage:
  config: null
  nodes:
  - config: null
    devices:
    - config: null
      name: sdb
    - config: null
      name: sdc
    name: gold-1
    resources: {}
  - config: null
    devices:
    - config: null
      name: sda
    - config: null
      name: sdb
    name: gold-6
    resources: {}
which removes gold-4 from the cluster. Also, I'm down to 2 machines, and that's not enough to run the correct number of mgrs and mons, so I'm reducing those counts to 1 as well.
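For reference, the mon and mgr counts live in the same CephCluster spec. A minimal sketch of that change, assuming the standard Rook field names (the rest of the spec is omitted):
spec:
  mon:
    count: 1                     # was 3; only 1 mon can be placed cleanly right now
    allowMultiplePerNode: false
  mgr:
    count: 1                     # one mgr is enough while the cluster is degraded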
Deleting the deployments
Checking the pods shows which OSDs are on the node:
$ kubectl get pods -o wide | grep gold-4 | grep osd-
rook-ceph-osd-6-679bf6779-24pnc 0/1 Terminating 0 47h <none> gold-4 <none> <none>
rook-ceph-osd-9-85987bcb4c-zqhv4 0/1 Terminating 0 47h 10.233.99.89 gold-4 <none> <none>
They are OSD 6 and OSD 9, so these need to be removed from the cluster.
$ kubectl delete deployment rook-ceph-osd-6
deployment.apps "rook-ceph-osd-6" deleted
$ kubectl delete deployment rook-ceph-osd-9
deployment.apps "rook-ceph-osd-9" deleted
Now that the deployments are gone, we wait for the cluster to rebalance the data.
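Deleting the deployments only removes the Kubernetes side of the OSDs. The logs here don't show it, but the standard Ceph-side cleanup from the toolbox would look something like this (a sketch, assuming the toolbox deployment is named rook-ceph-tools; purging is irreversible, so only do it once the data has recovered elsewhere):
$ kubectl exec deploy/rook-ceph-tools -- ceph osd tree     # osd.6 and osd.9 should show as down
$ kubectl exec deploy/rook-ceph-tools -- ceph osd out 6 9
$ kubectl exec deploy/rook-ceph-tools -- ceph osd purge 6 --yes-i-really-mean-it
$ kubectl exec deploy/rook-ceph-tools -- ceph osd purge 9 --yes-i-really-mean-it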
Deleting pods on gold-4
Gold-4 is offline, so it won't respond to any Kubernetes cluster communication, which means the pods need to be force deleted.
$ kubectl delete pod --force --grace-period 0 rook-ceph-osd-6-679bf6779-24pnc rook-ceph-osd-9-85987bcb4c-zqhv4
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-6-679bf6779-24pnc" force deleted
pod "rook-ceph-osd-9-85987bcb4c-zqhv4" force deleted
And there are other pods stuck in "Terminating" state:
$ kubectl get pods -o wide | grep gold-4 | grep Term | awk '{print $1}' | xargs kubectl delete pod --force --grace-period 0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "csi-cephfsplugin-provisioner-5c545745f8-47k44" force deleted
pod "csi-rbdplugin-provisioner-7cd74df5b9-64vps" force deleted
pod "rook-ceph-crashcollector-gold-4-5895bd5548-bc2pl" force deleted
pod "rook-ceph-crashcollector-gold-4-5dcc599c46-bcwdt" force deleted
pod "rook-ceph-csi-detect-version-zkgww" force deleted
pod "rook-ceph-detect-version-6tb2z" force deleted
pod "rook-ceph-mds-myfs-b-866fdb94dd-9h77s" force deleted
pod "rook-ceph-mon-l-7d559f8c49-2vqvv" force deleted
pod "rook-ceph-operator-54747bd8d8-mm69q" force deleted
pod "rook-ceph-tools-6dcbb78845-pwsxk" force deleted
And there are some pods that think they are running, but aren't:
$ kubectl get pods -o wide | grep gold-4 | awk '{print $1}' | xargs kubectl delete pod --force --grace-period 0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "csi-cephfsplugin-phfxk" force deleted
pod "csi-rbdplugin-5v69r" force deleted
Checking the Ceph status
$ kubectl exec rook-ceph-tools-6dcbb78845-vnq2n -- ceph status
  cluster:
    id:     04461f64-e630-4891-bcea-0de24cf06c51
    health: HEALTH_WARN
            1/3 mons down, quorum k,ag
            2 osds down
            1 host (2 osds) down
            Degraded data redundancy: 6693290/20079870 objects degraded (33.333%), 73 pgs degraded, 73 pgs undersized

  services:
    mon: 3 daemons, quorum k,ag (age 37h), out of quorum: l
    mgr: a(active, since 37h)
    mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
    osd: 13 osds: 4 up (since 37h), 6 in (since 2w)

  data:
    pools:   4 pools, 73 pgs
    objects: 6.69M objects, 2.9 TiB
    usage:   5.9 TiB used, 30 TiB / 36 TiB avail
    pgs:     6693290/20079870 objects degraded (33.333%)
             73 active+undersized+degraded

  io:
    client: 1.3 KiB/s rd, 30 KiB/s wr, 2 op/s rd, 3 op/s wr
One of the mons is still reported missing, though there should only be 1 mon now, not 3. Only 4 of the 6 OSDs are up, which is correct since the 2 on gold-4 are gone.
Losing 1/3 of the cluster is not good, but there's nothing that can be done about that.
Now it's time to wait and see if things can rebalance, or if it is a complete loss.
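While waiting, the recovery can be watched from outside the toolbox pod. A simple sketch, again assuming the standard rook-ceph-tools deployment name:
$ watch -n 60 'kubectl exec deploy/rook-ceph-tools -- ceph status'
$ kubectl exec deploy/rook-ceph-tools -- ceph health detail   # lists the degraded/undersized PGs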
Rebalancing woes
Looking at the Ceph cluster after all the pods are running, I found this:
[root@rook-ceph-tools-6dcbb78845-vnq2n /]# ceph status
  cluster:
    id:     04461f64-e630-4891-bcea-0de24cf06c51
    health: HEALTH_WARN
            2 osds down
            1 host (2 osds) down
            Degraded data redundancy: 6678601/20079882 objects degraded (33.260%), 72 pgs degraded, 72 pgs undersized

  services:
    mon: 2 daemons, quorum l,ag (age 12m)
    mgr: a(active, since 37h)
    mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
    osd: 14 osds: 5 up (since 9m), 7 in (since 9m); 32 remapped pgs

  data:
    pools:   4 pools, 73 pgs
    objects: 6.69M objects, 2.9 TiB
    usage:   5.9 TiB used, 40 TiB / 45 TiB avail
    pgs:     6678601/20079882 objects degraded (33.260%)
             2553475/20079882 objects misplaced (12.717%)
             41 active+undersized+degraded
             28 active+undersized+degraded+remapped+backfill_wait
             3  active+undersized+degraded+remapped+backfilling
             1  active+clean+remapped

  io:
    client:   1.2 KiB/s rd, 13 KiB/s wr, 2 op/s rd, 1 op/s wr
    recovery: 341 KiB/s, 132 keys/s, 32 objects/s

  progress:
    Rebalancing after osd.13 marked in (9m)
      [............................]
A quick recheck of the pods shows that there are now 3 OSDs on gold-6:
$ kubectl get pods -o wide | grep gold-6 | grep osd
rook-ceph-osd-1-768b4d5cb8-m87dp 1/1 Running 0 46h 10.233.90.183 gold-6 <none> <none>
rook-ceph-osd-13-5f7fc74964-nx7lf 1/1 Running 0 11m 10.233.90.234 gold-6 <none> <none>
rook-ceph-osd-3-56f6dbbfbd-bg72h 1/1 Running 0 46h 10.233.90.142 gold-6 <none> <none>
rook-ceph-osd-prepare-gold-6-zh8jv 0/1 Completed 0 13m 10.233.90.12 gold-6 <none> <none>
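Before digging into the logs, a quick way to see which host and device the new osd.13 landed on is to ask Ceph directly. A sketch from the toolbox (the exact metadata field names can vary a little between Ceph releases):
$ kubectl exec deploy/rook-ceph-tools -- ceph osd metadata 13 | grep -E '"hostname"|"devices"'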
Why would this happen when there are only 2 devices marked for gold-6:
$ kubectl edit cephcluster rook-ceph
...
- config: null
  devices:
  - config: null
    name: sda
  - config: null
    name: sdb
  name: gold-6
  resources: {}
...
$ kubectl logs rook-ceph-osd-prepare-gold-6-zh8jv
...
2022-08-29 17:19:25.436612 D | cephosd: {
    "1": {
        "ceph_fsid": "04461f64-e630-4891-bcea-0de24cf06c51",
        "device": "/dev/mapper/ceph--e1c76b8f--f519--4cd5--a64e--90ccaaf28bb6-osd--data--f316dbd9--837b--4cbd--9a3c--41fb4693a9b6",
        "osd_id": 1,
        "osd_uuid": "7e776dc8-aa52-4094-a9fc-3b31542ab2b7",
        "type": "bluestore"
    },
    "13": {
        "ceph_fsid": "04461f64-e630-4891-bcea-0de24cf06c51",
        "device": "/dev/mapper/ceph--cf29919d--4d5c--43bb--a061--bfd3d9b51a1f-osd--block--355b64d7--eb93--48ec--9fff--155cd4a2e024",
        "osd_id": 13,
        "osd_uuid": "355b64d7-eb93-48ec-9fff-155cd4a2e024",
        "type": "bluestore"
    },
    "3": {
        "ceph_fsid": "04461f64-e630-4891-bcea-0de24cf06c51",
        "device": "/dev/mapper/ceph--b319867a--43ec--4329--9aa9--f130ca73abdb-osd--data--0d352e53--6a18--43b2--b84f--a52997f77c9e",
        "osd_id": 3,
        "osd_uuid": "6f2e4329-81d2-4a9a-9360-a1fc30cc3bc1",
        "type": "bluestore"
    }
}
...
The new OSD is on device "/dev/mapper/ceph--cf29919d--4d5c--43bb--a061--bfd3d9b51a1f-osd--block--355b64d7--eb93--48ec--9fff--155cd4a2e024". Looking on the machine:
dcaldwel@gold-6:~$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 9.1T 0 disk
`-ceph--cf29919d--4d5c--43bb--a061--bfd3d9b51a1f-osd--block--355b64d7--eb93--48ec--9fff--155cd4a2e024
253:2 0 9.1T 0 lvm
sdb 8:16 0 9.1T 0 disk
`-ceph--e1c76b8f--f519--4cd5--a64e--90ccaaf28bb6-osd--data--f316dbd9--837b--4cbd--9a3c--41fb4693a9b6
253:0 0 9.1T 0 lvm
sdc 8:32 0 9.1T 0 disk
`-ceph--b319867a--43ec--4329--9aa9--f130ca73abdb-osd--data--0d352e53--6a18--43b2--b84f--a52997f77c9e
253:1 0 9.1T 0 lvm
That LVM volume sits on /dev/sda, which is one of the devices marked to be used.
So how did this happen? Well, the RAID controller in HBA mode can randomly reassign the drive names. This has caused problems in the past, and it apparently happened again on the last reboot. (The machine was rebooted to fix an issue with the kernel systemd services.)
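One way to avoid this in the future (a sketch of an option, not something done here) is to list the devices by a stable identifier instead of the sdX name. As far as I know, Rook accepts a full udev path in the device name field, so check the Rook docs for your version before relying on it:
- config: null
  devices:
  - config: null
    name: /dev/disk/by-id/wwn-0x5000cca2550b7a11   # hypothetical stable ID standing in for sda
  name: gold-6
  resources: {}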
So now gold-6 has 3 OSDs and gold-1 has 2 OSDs.
The best bet now is to let the cluster rebalance. Hopefully it will survive until a visit to the colo can happen and the other machines can be repaired and re-added to the cluster. Four need new OS installs, one needs new fans, and another needs failing RAM replaced. It's difficult to work with failing hardware, which is why cloud services are so popular (though very, very expensive).