Daniel's Blog

Removing OSDs from our Ceph cluster

We lost a Kubernetes node. Looking at its log using iLO (Integrated Lights-Out, a computer within a computer that lets me see the console remotely for about 10 seconds, because HP licenses that functionality; you can't just buy a computer and use it, you have to pay a monthly fee to keep using it after you've bought it...), the node is in a kernel panic and a reinstall of the OS is needed. I would reinstall it, but as I said, doing that remotely requires the extra license, so we're stuck until someone physically visits the colo.

I would never recommend these HP servers because of the monthly fee to use them. Why buy hardware if you have to keep paying to fully use it? What a rip-off.

Anyway, it's in a kernel panic, someone needs to visit the colo, and that won't happen for another month. In the meantime, the Ceph cluster needs to know that the OSDs on that node are gone and need to be removed.
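
Before touching anything, it helps to confirm which OSDs Ceph maps to the dead host. A quick sketch from the rook-ceph-tools toolbox (I'm using the deploy/ shorthand instead of a pod name, which assumes a reasonably recent kubectl):

# each host shows up as a CRUSH bucket with its OSDs listed underneath,
# and the dead node's OSDs will be marked "down"
$ kubectl exec deploy/rook-ceph-tools -- ceph osd tree

# quick count of how many OSDs exist and how many are up/in
$ kubectl exec deploy/rook-ceph-tools -- ceph osd stat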

Removing the node from the cluster

To remove the node from the cluster, I just edit the CephCluster resource:

$ kubectl edit cephcluster rook-ceph
...
  storage:
    config: null
    nodes:
    - config: null
      devices:
      - config: null
        name: sdb
      - config: null
        name: sdc
      name: gold-1
      resources: {}
    - config: null
      devices:
      - config: null
        name: sdb
      - config: null
        name: sdc
      name: gold-4
      resources: {}
    - config: null
      devices:
      - config: null
        name: sda
      - config: null
        name: sdb
      name: gold-6
      resources: {}
...

gets changed to

  storage:
    config: null
    nodes:
    - config: null
      devices:
      - config: null
        name: sdb
      - config: null
        name: sdc
      name: gold-1
      resources: {}
    - config: null
      devices:
      - config: null
        name: sda
      - config: null
        name: sdb
      name: gold-6
      resources: {}

which removes gold-4 from the cluster. Also, I'm down to 2 machines, which isn't enough for the usual number of mons and managers, so I'm reducing those counts to 1 as well.
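
The mon and mgr counts live in the same CephCluster spec. Roughly, the relevant bits end up looking like this (a sketch, not a copy of my actual spec; the mgr count field depends on your Rook version):

  mon:
    count: 1    # was 3; only one node left that I trust to stay up
  mgr:
    count: 1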

Deleting the deployments

Checking the pods shows which OSDs are on the node:

$ kubectl get pods -o wide | grep gold-4 | grep osd-
rook-ceph-osd-6-679bf6779-24pnc                    0/1     Terminating   0          47h   <none>          gold-4   <none>           <none>
rook-ceph-osd-9-85987bcb4c-zqhv4                   0/1     Terminating   0          47h   10.233.99.89    gold-4   <none>           <none>

They are osd.6 and osd.9, so these need to be removed from the cluster.

$ kubectl delete deployment rook-ceph-osd-6 
deployment.apps "rook-ceph-osd-6" deleted

$ kubectl delete deployment rook-ceph-osd-9 
deployment.apps "rook-ceph-osd-9" deleted

Now that the deployments are gone, we wait for the cluster to rebalance the data.
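
Deleting the deployments only removes the Kubernetes side of things; Ceph itself will keep listing osd.6 and osd.9 as down until they are purged from its maps. If you want to do that straight away rather than leave them lingering, it looks something like this from the toolbox (a sketch; I'm not showing output here):

# mark the dead OSDs out so recovery starts placing their data elsewhere
$ kubectl exec deploy/rook-ceph-tools -- ceph osd out 6 9

# once they're definitely never coming back, remove them from the
# CRUSH map, the auth keys and the OSD map in one step
$ kubectl exec deploy/rook-ceph-tools -- ceph osd purge 6 --yes-i-really-mean-it
$ kubectl exec deploy/rook-ceph-tools -- ceph osd purge 9 --yes-i-really-mean-it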

Deleting the pods on gold-4

gold-4 is offline, so it won't respond to any Kubernetes cluster communication, and its pods need to be force deleted.

$ kubectl delete pod --force --grace-period 0 rook-ceph-osd-6-679bf6779-24pnc rook-ceph-osd-9-85987bcb4c-zqhv4
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-6-679bf6779-24pnc" force deleted
pod "rook-ceph-osd-9-85987bcb4c-zqhv4" force deleted

And there are other pods stuck in "Terminating" state:

$ kubectl get pods -o wide | grep gold-4 | grep Term | awk '{print $1}' | xargs kubectl delete pod --force --grace-period 0 
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "csi-cephfsplugin-provisioner-5c545745f8-47k44" force deleted
pod "csi-rbdplugin-provisioner-7cd74df5b9-64vps" force deleted
pod "rook-ceph-crashcollector-gold-4-5895bd5548-bc2pl" force deleted
pod "rook-ceph-crashcollector-gold-4-5dcc599c46-bcwdt" force deleted
pod "rook-ceph-csi-detect-version-zkgww" force deleted
pod "rook-ceph-detect-version-6tb2z" force deleted
pod "rook-ceph-mds-myfs-b-866fdb94dd-9h77s" force deleted
pod "rook-ceph-mon-l-7d559f8c49-2vqvv" force deleted
pod "rook-ceph-operator-54747bd8d8-mm69q" force deleted
pod "rook-ceph-tools-6dcbb78845-pwsxk" force deleted

And there are some pods that think they are running, but aren't:

$ kubectl get pods -o wide | grep gold-4 | awk '{print $1}' | xargs kubectl delete pod --force --grace-period 0 
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "csi-cephfsplugin-phfxk" force deleted
pod "csi-rbdplugin-5v69r" force deleted

Checking the Ceph status

$ kubectl exec rook-ceph-tools-6dcbb78845-vnq2n -- ceph status
  cluster:
    id:     04461f64-e630-4891-bcea-0de24cf06c51
    health: HEALTH_WARN
            1/3 mons down, quorum k,ag
            2 osds down
            1 host (2 osds) down
            Degraded data redundancy: 6693290/20079870 objects degraded (33.333%), 73 pgs degraded, 73 pgs undersized
 
  services:
    mon: 3 daemons, quorum k,ag (age 37h), out of quorum: l
    mgr: a(active, since 37h)
    mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
    osd: 13 osds: 4 up (since 37h), 6 in (since 2w)
 
  data:
    pools:   4 pools, 73 pgs
    objects: 6.69M objects, 2.9 TiB
    usage:   5.9 TiB used, 30 TiB / 36 TiB avail
    pgs:     6693290/20079870 objects degraded (33.333%)
             73 active+undersized+degraded
 
  io:
    client:   1.3 KiB/s rd, 30 KiB/s wr, 2 op/s rd, 3 op/s wr

One of the mons is still down, and there should only be 1 mon now anyway, not 3. Only 4 of the 6 "in" OSDs are up, which is correct since the two on gold-4 are gone.

Losing 1/3 of the cluster is not good, but there's nothing that can be done about that.

Now it's time to wait and see if things can rebalance, or if it is a complete loss.
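
While waiting, the degraded state can be watched from the toolbox. ceph -w streams the cluster log (including recovery progress), and ceph osd df shows how full each surviving OSD is getting, which matters since the data from the dead node has to fit somewhere (a sketch, reusing the toolbox pod name from above):

# stream status plus cluster log events, including recovery progress
$ kubectl exec -it rook-ceph-tools-6dcbb78845-vnq2n -- ceph -w

# per-OSD utilization, to spot any OSD that is filling up
$ kubectl exec rook-ceph-tools-6dcbb78845-vnq2n -- ceph osd df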

Rebalancing woes

Looking at the Ceph cluster after all the pods are running, I found this:

[root@rook-ceph-tools-6dcbb78845-vnq2n /]# ceph status
  cluster:
    id:     04461f64-e630-4891-bcea-0de24cf06c51
    health: HEALTH_WARN
            2 osds down
            1 host (2 osds) down
            Degraded data redundancy: 6678601/20079882 objects degraded (33.260%), 72 pgs degraded, 72 pgs undersized
 
  services:
    mon: 2 daemons, quorum l,ag (age 12m)
    mgr: a(active, since 37h)
    mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
    osd: 14 osds: 5 up (since 9m), 7 in (since 9m); 32 remapped pgs
 
  data:
    pools:   4 pools, 73 pgs
    objects: 6.69M objects, 2.9 TiB
    usage:   5.9 TiB used, 40 TiB / 45 TiB avail
    pgs:     6678601/20079882 objects degraded (33.260%)
             2553475/20079882 objects misplaced (12.717%)
             41 active+undersized+degraded
             28 active+undersized+degraded+remapped+backfill_wait
             3  active+undersized+degraded+remapped+backfilling
             1  active+clean+remapped
 
  io:
    client:   1.2 KiB/s rd, 13 KiB/s wr, 2 op/s rd, 1 op/s wr
    recovery: 341 KiB/s, 132 keys/s, 32 objects/s
 
  progress:
    Rebalancing after osd.13 marked in (9m)
      [............................] 

A quick recheck of the pods shows that there are now 3 OSDs on gold-6.

$ kubectl get pods -o wide | grep gold-6 | grep osd
rook-ceph-osd-1-768b4d5cb8-m87dp                   1/1     Running     0          46h   10.233.90.183   gold-6   <none>           <none>
rook-ceph-osd-13-5f7fc74964-nx7lf                  1/1     Running     0          11m   10.233.90.234   gold-6   <none>           <none>
rook-ceph-osd-3-56f6dbbfbd-bg72h                   1/1     Running     0          46h   10.233.90.142   gold-6   <none>           <none>
rook-ceph-osd-prepare-gold-6-zh8jv                 0/1     Completed   0          13m   10.233.90.12    gold-6   <none>           <none>

Why would this happen when only 2 devices are marked for gold-6?

$ kubectl edit cephcluster rook-ceph
...
    - config: null
      devices:
      - config: null
        name: sda
      - config: null
        name: sdb
      name: gold-6
      resources: {}
...
$ kubectl logs rook-ceph-osd-prepare-gold-6-zh8jv
...
2022-08-29 17:19:25.436612 D | cephosd: {
    "1": {
        "ceph_fsid": "04461f64-e630-4891-bcea-0de24cf06c51",
        "device": "/dev/mapper/ceph--e1c76b8f--f519--4cd5--a64e--90ccaaf28bb6-osd--data--f316dbd9--837b--4cbd--9a3c--41fb4693a9b6",
        "osd_id": 1,
        "osd_uuid": "7e776dc8-aa52-4094-a9fc-3b31542ab2b7",
        "type": "bluestore"
    },
    "13": {
        "ceph_fsid": "04461f64-e630-4891-bcea-0de24cf06c51",
        "device": "/dev/mapper/ceph--cf29919d--4d5c--43bb--a061--bfd3d9b51a1f-osd--block--355b64d7--eb93--48ec--9fff--155cd4a2e024",
        "osd_id": 13,
        "osd_uuid": "355b64d7-eb93-48ec-9fff-155cd4a2e024",
        "type": "bluestore"
    },
    "3": {
        "ceph_fsid": "04461f64-e630-4891-bcea-0de24cf06c51",
        "device": "/dev/mapper/ceph--b319867a--43ec--4329--9aa9--f130ca73abdb-osd--data--0d352e53--6a18--43b2--b84f--a52997f77c9e",
        "osd_id": 3,
        "osd_uuid": "6f2e4329-81d2-4a9a-9360-a1fc30cc3bc1",
        "type": "bluestore"
    }
}
...

The new OSD is on device "/dev/mapper/ceph--cf29919d--4d5c--43bb--a061--bfd3d9b51a1f-osd--block--355b64d7--eb93--48ec--9fff--155cd4a2e024",

which, looking on the machine:

dcaldwel@gold-6:~$ lsblk
NAME                                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                            8:0    0   9.1T  0 disk 
`-ceph--cf29919d--4d5c--43bb--a061--bfd3d9b51a1f-osd--block--355b64d7--eb93--48ec--9fff--155cd4a2e024
                                             253:2    0   9.1T  0 lvm  
sdb                                            8:16   0   9.1T  0 disk 
`-ceph--e1c76b8f--f519--4cd5--a64e--90ccaaf28bb6-osd--data--f316dbd9--837b--4cbd--9a3c--41fb4693a9b6
                                             253:0    0   9.1T  0 lvm  
sdc                                            8:32   0   9.1T  0 disk 
`-ceph--b319867a--43ec--4329--9aa9--f130ca73abdb-osd--data--0d352e53--6a18--43b2--b84f--a52997f77c9e
                                             253:1    0   9.1T  0 lvm  

is device /dev/sda, which is one of the devices that is marked to be used.

So how did this happen? Well, the RAID controller in HBA mode can reassign the drive names at random. This has caused problems in the past, and it apparently happened again at the last reboot. (The machine was rebooted to fix an issue with the kernel and systemd services.)
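
One way to take the drive-letter shuffle out of the equation is to point the CephCluster spec at the stable /dev/disk/by-id paths instead of sdX names; Rook accepts a full udev path as the device name (check the docs for your Rook version). The gold-6 entry could then look something like this, with made-up by-id values standing in for the real ones:

    - config: null
      devices:
      - config: null
        name: /dev/disk/by-id/wwn-0x5000c500aaaaaaaa    # the disk currently called sda
      - config: null
        name: /dev/disk/by-id/wwn-0x5000c500bbbbbbbb    # the disk currently called sdb
      name: gold-6
      resources: {}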

So now gold-6 has 3 OSDs and gold-1 has 2.
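
That distribution can be confirmed from the toolbox; ceph osd df tree groups the OSDs under their host buckets along with their usage:

$ kubectl exec rook-ceph-tools-6dcbb78845-vnq2n -- ceph osd df tree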

The best bet now is to let the cluster rebalance. Hopefully it will survive until a visit to the colo can be made and the other machines can be repaired and re-added to the cluster. Four need new OS installs, one needs new fans, and another needs failing RAM replaced. It's difficult to work with failing hardware, which is why cloud services are so popular (though very, very expensive).