Daniel's Blog

Upgrading rook-ceph from Ceph Nautilus (14.2.8) to Octopus (15.2.13)

Upgrading Ceph versions

We are currently on Ceph Nautilus (v14), and after upgrading to Rook version 1.3 there are instructions for upgrading to Ceph Octopus (v15). The documentation in other Rook releases says that Ceph 15.2+ is supported, so the plan is to upgrade to the latest v15 ceph image.

Check the image versions to find the latest

Ceph hosts its images on quay.io and Docker Hub. On Docker Hub the images are here:

https://hub.docker.com/r/ceph/ceph/tags

and as of today the latest v15 tag is v15.2.13. There is also a v16, but I'll look at that when I'm upgrading to later versions of Rook.
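
If you would rather check from the command line than the web UI, something like skopeo can list the tags on both registries (a hedged aside; any tool that can query the registry API works):

skopeo list-tags docker://docker.io/ceph/ceph | grep v15
skopeo list-tags docker://quay.io/ceph/ceph | grep v15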

Upgrading the daemons to v15.2.13

kubectl -n rook-ceph patch CephCluster rook-ceph --type=merge -p "{\"spec\": {\"cephVersion\": {\"image\": \"ceph/ceph:v15.2.13 \"}}}"

For example:

$ kubectl -n rook-ceph patch CephCluster rook-ceph --type=merge -p "{\"spec\": {\"cephVersion\": {\"image\": \"ceph/ceph:v15.2.13 \"}}}"
cephcluster.ceph.rook.io/rook-ceph patched
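
It is worth reading the value back to confirm exactly what got stored in the spec (which would have caught the trailing-space mistake described further down):

kubectl -n rook-ceph get cephcluster rook-ceph -o jsonpath='{.spec.cephVersion.image}'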

Checking the operator log shows that the change to the cluster was detected:

$ kubectl logs rook-ceph-operator-88cdb5f4b-k6xvw

...
2022-08-16 01:18:43.476325 I | op-cluster: The Cluster CR has changed. diff=  v1.ClusterSpec{
        CephVersion: v1.CephVersionSpec{
-               Image:            "ceph/ceph:v14.2.8",
+               Image:            "ceph/ceph:v15.2.13 ",
                AllowUnsupported: false,
        },
        Storage:     v1.StorageScopeSpec{Nodes: []v1.Node{{Name: "gold-1", Selection: v1.Selection{Devices: []v1.Device{{Name: "sdb"}, {Name: "sdc"}}}}, {Name: "gold-4", Selection: v1.Selection{Devices: []v1.Device{{Name: "sdb"}, {Name: "sdc"}}}}, {Name: "gold-6", Selection: v1.Selection{Devices: []v1.Device{{Name: "sda"}, {Name: "sdb"}}}}}, Selection: v1.Selection{UseAllDevices: &false}},
        Annotations: nil,
        ... // 17 identical fields
  }
2022-08-16 01:18:43.476374 I | op-cluster: update event for cluster "rook-ceph" is supported, orchestrating update now
2022-08-16 01:18:43.484473 I | op-config: CephCluster "rook-ceph" status: "Updating". "Cluster is updating"
2022-08-16 01:18:43.507691 I | op-cluster: the ceph version changed from "ceph/ceph:v14.2.8" to "ceph/ceph:v15.2.13 "
2022-08-16 01:18:43.507727 I | op-cluster: detecting the ceph image version for image ceph/ceph:v15.2.13 ...

Nothing happened, so I checked the logs again after about 15 minutes and found these lines:

...
2022-08-16 01:26:50.090394 E | ceph-crashcollector-controller: ceph version not found for image "ceph/ceph:v15.2.13 " used by cluster "rook-ceph". attempt to determine ceph version for the current cluster image timed out
...
2022-08-16 01:33:43.540707 E | op-cluster: unknown ceph major version. "failed to complete ceph version job: failed to run CmdReporter rook-ceph-detect-version successfully. failed waiting for results ConfigMap rook-ceph-detect-version. timed out waiting for results ConfigMap"

So I looked at the images in quay.io here:

https://quay.io/repository/ceph/ceph?tab=tags

There was a version v15.2.17 so I tried that one.

Looking at the diff now, I noticed an unintentional trailing space in the previous version:

        CephVersion: v1.CephVersionSpec{
-               Image:            "ceph/ceph:v15.2.13 ",
+               Image:            "ceph/ceph:v15.2.17",
                AllowUnsupported: false,
        },
        Storage:     v1.StorageScopeSpec{Nodes: []v1.Node{{Name: "gold-1", Selection: v1.Selection{Devices: []v1.Device{{Name: "sdb"}, {Name: "sdc"}}}}, {Name: "gold-4", Selection: v1.Selection{Devices: []v1.Device{{Name: "sdb"}, {Name: "sdc"}}}}, {Name: "gold-6", Selection: v1.Selection{Devices: []v1.Device{{Name: "sda"}, {Name: "sdb"}}}}}, Selection: v1.Selection{UseAllDevices: &false}},
        Annotations: nil,
        ... // 17 identical fields
  }

So maybe that was a problem.

Looking at the log, I noticed that a job was created:

2022-08-16 01:36:20.996433 I | op-cluster: update event for cluster "rook-ceph" is supported, orchestrating update now
2022-08-16 01:36:21.002299 I | op-config: CephCluster "rook-ceph" status: "Updating". "Cluster is updating"
2022-08-16 01:36:21.022645 I | op-cluster: the ceph version changed from "ceph/ceph:v15.2.13 " to "ceph/ceph:v15.2.17"
2022-08-16 01:36:21.022678 I | op-cluster: detecting the ceph image version for image ceph/ceph:v15.2.17...
2022-08-16 01:36:21.032230 I | op-k8sutil: Removing previous job rook-ceph-detect-version to start a new one
2022-08-16 01:36:21.050551 I | op-k8sutil: batch job rook-ceph-detect-version still exists
2022-08-16 01:36:23.058898 I | op-k8sutil: batch job rook-ceph-detect-version deleted

That makes sense.

Looking at the jobs, I see that the detect-version job hasn't completed:

$ kubectl get jobs
NAME                           COMPLETIONS   DURATION   AGE
rook-ceph-detect-version       0/1           4m45s      4m45s
rook-ceph-osd-prepare-gold-1   1/1           20s        81m
rook-ceph-osd-prepare-gold-2   1/1           9s         36d
rook-ceph-osd-prepare-gold-3   1/1           23s        216d
rook-ceph-osd-prepare-gold-4   1/1           27s        81m
rook-ceph-osd-prepare-gold-5   1/1           21s        5d10h
rook-ceph-osd-prepare-gold-6   1/1           22s        81m
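
As an aside, the job controller labels the pods it creates with job-name, so the pod behind the stuck job can also be found directly:

kubectl -n rook-ceph get pods -l job-name=rook-ceph-detect-version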

Looking at the job I can see the pod was created:

$ kubectl describe job rook-ceph-detect-version
Name:           rook-ceph-detect-version
Namespace:      rook-ceph
Selector:       controller-uid=6fef3c97-c2be-4fee-acaf-1f706a91118e
Labels:         app=rook-ceph-detect-version
                rook-version=v1.3.11
Annotations:    <none>
Parallelism:    1
Completions:    1
Start Time:     Tue, 16 Aug 2022 01:36:23 +0000
Pods Statuses:  1 Running / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=rook-ceph-detect-version
                    controller-uid=6fef3c97-c2be-4fee-acaf-1f706a91118e
                    job-name=rook-ceph-detect-version
                    rook-version=v1.3.11
  Service Account:  rook-ceph-cmd-reporter
  Init Containers:
   init-copy-binaries:
    Image:      rook/ceph:v1.3.11
    Port:       <none>
    Host Port:  <none>
    Args:
      copy-binaries
      --copy-to-dir
      /rook/copied-binaries
    Environment:  <none>
    Mounts:
      /rook/copied-binaries from rook-copied-binaries (rw)
  Containers:
   cmd-reporter:
    Image:      ceph/ceph:v15.2.17
    Port:       <none>
    Host Port:  <none>
    Command:
      /rook/copied-binaries/tini
      --
      /rook/copied-binaries/rook
    Args:
      cmd-reporter
      --command
      {"cmd":["ceph"],"args":["--version"]}
      --config-map-name
      rook-ceph-detect-version
      --namespace
      rook-ceph
    Environment:  <none>
    Mounts:
      /rook/copied-binaries from rook-copied-binaries (rw)
  Volumes:
   rook-copied-binaries:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  5m2s  job-controller  Created pod: rook-ceph-detect-version-fdptl

Looking at the pod, I can see that it is unable to pull the v15.2.17 ceph image:

$ kubectl describe pod rook-ceph-detect-version-fdptl
Name:         rook-ceph-detect-version-fdptl
Namespace:    rook-ceph
Priority:     0
Node:         gold-5/10.1.1.245
Start Time:   Tue, 16 Aug 2022 01:36:23 +0000
Labels:       app=rook-ceph-detect-version
              controller-uid=6fef3c97-c2be-4fee-acaf-1f706a91118e
              job-name=rook-ceph-detect-version
              rook-version=v1.3.11
Annotations:  <none>
Status:       Pending
IP:           10.233.103.126
IPs:
  IP:           10.233.103.126
Controlled By:  Job/rook-ceph-detect-version
Init Containers:
  init-copy-binaries:
    Container ID:  docker://1a4a546818e1c592e535b9c28150d4dc9fe2b286239c0b80f6acde2a4751d8de
    Image:         rook/ceph:v1.3.11
    Image ID:      docker-pullable://rook/ceph@sha256:fe233c082d9f845ad053f5299da084bd84a201d6c32b24c0daf0d1a1e0d04088
    Port:          <none>
    Host Port:     <none>
    Args:
      copy-binaries
      --copy-to-dir
      /rook/copied-binaries
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 16 Aug 2022 01:36:25 +0000
      Finished:     Tue, 16 Aug 2022 01:36:25 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /rook/copied-binaries from rook-copied-binaries (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from rook-ceph-cmd-reporter-token-xg29f (ro)
Containers:
  cmd-reporter:
    Container ID:
    Image:         ceph/ceph:v15.2.17
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /rook/copied-binaries/tini
      --
      /rook/copied-binaries/rook
    Args:
      cmd-reporter
      --command
      {"cmd":["ceph"],"args":["--version"]}
      --config-map-name
      rook-ceph-detect-version
      --namespace
      rook-ceph
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /rook/copied-binaries from rook-copied-binaries (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from rook-ceph-cmd-reporter-token-xg29f (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  rook-copied-binaries:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  rook-ceph-cmd-reporter-token-xg29f:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rook-ceph-cmd-reporter-token-xg29f
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     ceph-role
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  <unknown>              default-scheduler  Successfully assigned rook-ceph/rook-ceph-detect-version-fdptl to gold-5
  Normal   Pulled     5m19s                  kubelet, gold-5    Container image "rook/ceph:v1.3.11" already present on machine
  Normal   Created    5m19s                  kubelet, gold-5    Created container init-copy-binaries
  Normal   Started    5m18s                  kubelet, gold-5    Started container init-copy-binaries
  Warning  Failed     4m32s (x3 over 5m17s)  kubelet, gold-5    Failed to pull image "ceph/ceph:v15.2.17": rpc error: code = Unknown desc = Error response from daemon: manifest for ceph/ceph:v15.2.17 not found
  Warning  Failed     4m32s (x3 over 5m17s)  kubelet, gold-5    Error: ErrImagePull
  Warning  Failed     3m55s (x6 over 5m17s)  kubelet, gold-5    Error: ImagePullBackOff
  Normal   Pulling    3m41s (x4 over 5m18s)  kubelet, gold-5    Pulling image "ceph/ceph:v15.2.17"
  Normal   BackOff    7s (x22 over 5m17s)    kubelet, gold-5    Back-off pulling image "ceph/ceph:v15.2.17"

The ceph/ceph image reference defaults to Docker Hub, and Docker Hub doesn't have a v15.2.17 tag (only quay.io does). So it's time to fall back to v15.2.13, which is on Docker Hub, but this time without the trailing space.

This time the operator didn't seem to notice the cluster getting updated.

To trigger another change, I just added a dummy annotation, and then the update was detected (log below).
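
Something like this works for the dummy annotation (the key and value are arbitrary; the point is just to change the object so the operator sees an update event):

kubectl -n rook-ceph annotate cephcluster rook-ceph force-sync="$(date +%s)" --overwrite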

2022-08-16 01:51:23.078775 E | op-cluster: unknown ceph major version. "failed to complete ceph version job: failed to run CmdReporter rook-ceph-detect-version successfully. failed waiting for results ConfigMap rook-ceph-detect-version. timed out waiting for results ConfigMap"
2022-08-16 01:51:23.099828 I | op-cluster: The Cluster CR has changed. diff=  v1.ClusterSpec{
        CephVersion: v1.CephVersionSpec{
-               Image:            "ceph/ceph:v15.2.17",
+               Image:            "ceph/ceph:v15.2.13",
                AllowUnsupported: false,
        },
        Storage:     v1.StorageScopeSpec{Nodes: []v1.Node{{Name: "gold-1", Selection: v1.Selection{Devices: []v1.Device{{Name: "sdb"}, {Name: "sdc"}}}}, {Name: "gold-4", Selection: v1.Selection{Devices: []v1.Device{{Name: "sdb"}, {Name: "sdc"}}}}, {Name: "gold-6", Selection: v1.Selection{Devices: []v1.Device{{Name: "sda"}, {Name: "sdb"}}}}}, Selection: v1.Selection{UseAllDevices: &false}},
        Annotations: nil,
        ... // 17 identical fields
  }
2022-08-16 01:51:23.099868 I | op-cluster: update event for cluster "rook-ceph" is supported, orchestrating update now
2022-08-16 01:51:23.106441 I | op-config: CephCluster "rook-ceph" status: "Updating". "Cluster is updating"
2022-08-16 01:51:23.124800 I | op-cluster: the ceph version changed from "ceph/ceph:v15.2.17" to "ceph/ceph:v15.2.13"
2022-08-16 01:51:23.124845 I | op-cluster: detecting the ceph image version for image ceph/ceph:v15.2.13...

This time the image was pulled and the job ran (I assume; by the time I closed the log and checked for the job, there was nothing there, but the operator log showed resources getting updated).
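
To follow what the operator does next without re-running kubectl logs by hand, tailing the log and filtering is enough (just a convenience, nothing rook-specific):

kubectl -n rook-ceph logs deploy/rook-ceph-operator -f | grep -i version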

Monitoring the update

 watch --exec kubectl -n rook-ceph get deployments -l rook_cluster=rook-ceph -o jsonpath='{range .items[*]}{.metadata.name}{"  \treq/upd/avl: "}{.spec.replicas}{"/"}{.status.updatedReplicas}{"/"}{.status.readyReplicas}{"  \tceph-version="}{.metadata.labels.ceph-version}{"\n"}{end}'

For example:

Every 2.0s: kubectl -n rook-ceph get deployments -l rook_cluster=rook-ceph -o jsonpath={range .items[*]}{.metadata.name}{"  \treq/up...  gold-1: Tue Aug 16 01:54:03 2022

rook-ceph-crashcollector-gold-1         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-crashcollector-gold-4         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-crashcollector-gold-5         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-crashcollector-gold-6         req/upd/avl: 1/1/1      ceph-version=14.2.8-0
rook-ceph-mds-myfs-a    req/upd/avl: 1/1/1      ceph-version=14.2.8-0
rook-ceph-mds-myfs-b    req/upd/avl: 1/1/1      ceph-version=14.2.8-0
rook-ceph-mgr-a         req/upd/avl: 1/1/1      ceph-version=14.2.8-0
rook-ceph-mon-h         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-mon-i         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-mon-k         req/upd/avl: 1//        ceph-version=15.2.13-0
rook-ceph-osd-1         req/upd/avl: 1/1/1      ceph-version=14.2.8-0
rook-ceph-osd-2         req/upd/avl: 1/1/1      ceph-version=14.2.8-0
rook-ceph-osd-3         req/upd/avl: 1/1/1      ceph-version=14.2.8-0
rook-ceph-osd-5         req/upd/avl: 1/1/1      ceph-version=14.2.8-0
rook-ceph-osd-6         req/upd/avl: 1/1/1      ceph-version=14.2.8-0
rook-ceph-osd-9         req/upd/avl: 1/1/1      ceph-version=14.2.8-0

The mons are in the process of updating, and the crash collectors are also updating.

The OSDs that are upgrading all show this warning in their logs:

debug 2022-08-16T04:31:04.993+0000 7fcd20d53700 -1 bluestore(/var/lib/ceph/osd/ceph-5) fsck warning: #2:8ebfb41c:::10000c6f0ec.00000000:head# has omap that is not per-pool or pgmeta

The cluster determined that they are down and started rebalancing. This is probably the worst-case scenario.
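
For what it's worth, Ceph has a noout flag that stops OSDs from being marked out (and data from being rebalanced) while they restart; it can be set from the toolbox before the OSDs upgrade and cleared afterwards:

kubectl exec -it rook-ceph-tools-788f7f4b84-gckp8 -- ceph osd set noout
kubectl exec -it rook-ceph-tools-788f7f4b84-gckp8 -- ceph osd unset noout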

From my research, I found that these fsck warnings are normal and part of the OSD upgrade process. However, the pods are continually killed because the liveness probe is failing.

A check of the pods after letting them run a while results in this:

rook-ceph-osd-1-6f954f9986-56p7z                   1/1     Running       0          5h36m   10.233.90.142    gold-6   <none>           <none>
rook-ceph-osd-2-5db6b9f58f-4j9sl                   1/1     Running       5          3m30s   10.233.97.50     gold-1   <none>           <none>
rook-ceph-osd-3-6b46697d64-tc6nb                   1/1     Running       0          5h34m   10.233.90.48     gold-6   <none>           <none>
rook-ceph-osd-5-7f6db7748b-qw4pz                   1/1     Running       3          3m10s   10.233.97.77     gold-1   <none>           <none>
rook-ceph-osd-6-59cf989d74-lvf9s                   1/1     Running       0          5h30m   10.233.99.81     gold-4   <none>           <none>
rook-ceph-osd-9-7fd4874f66-mlkjb                   1/1     Running       0          5h28m   10.233.99.52     gold-4   <none>           <none>
rook-ceph-osd-prepare-gold-1-vw4wd                 0/1     Completed     0          46m     10.233.97.240    gold-1   <none>           <none>
rook-ceph-osd-prepare-gold-4-276p5                 0/1     Completed     0          46m     10.233.99.148    gold-4   <none>           <none>
rook-ceph-osd-prepare-gold-5-pmcw2                 0/1     Completed     0          5d14h   10.233.103.47    gold-5   <none>           <none>
rook-ceph-osd-prepare-gold-6-kzhz9                 0/1     Completed     0          46m     10.233.90.82     gold-6   <none>           <none>

The OSD pods that are upgrading are continually restarting.

Looking at the pods, they have this for their last state:

    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Tue, 16 Aug 2022 05:29:50 +0000
      Finished:     Tue, 16 Aug 2022 05:31:34 +0000

After using Kubernetes for so long, I recognize that exit code 137 (128 + 9, i.e. SIGKILL) means the container was killed. Usually this is from running out of memory, but in this case the reason is Error rather than OOMKilled, which means the container is being killed because the liveness probe is failing. Indeed, looking at the events, this is there:

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  <unknown>          default-scheduler  Successfully assigned rook-ceph/rook-ceph-osd-5-7f6db7748b-qw4pz to gold-1
  Normal   Pulled     2m25s              kubelet, gold-1    Container image "ceph/ceph:v15.2.13" already present on machine
  Normal   Created    2m25s              kubelet, gold-1    Created container activate
  Normal   Started    2m25s              kubelet, gold-1    Started container activate
  Normal   Pulled     2m23s              kubelet, gold-1    Container image "ceph/ceph:v15.2.13" already present on machine
  Normal   Created    2m23s              kubelet, gold-1    Created container chown-container-data-dir
  Normal   Started    2m23s              kubelet, gold-1    Started container chown-container-data-dir
  Warning  Unhealthy  68s (x3 over 88s)  kubelet, gold-1    Liveness probe failed: no valid command found; 10 closest matches:
0
1
2
abort
assert
bluefs debug_inject_read_zeros
bluefs stats
bluestore allocator dump block
bluestore allocator dump bluefs-db
bluestore allocator fragmentation block
admin_socket: invalid command
  Normal  Killing  68s                  kubelet, gold-1  Container osd failed liveness probe, will be restarted
  Normal  Pulled   36s (x2 over 2m22s)  kubelet, gold-1  Container image "ceph/ceph:v15.2.13" already present on machine
  Normal  Created  36s (x2 over 2m22s)  kubelet, gold-1  Created container osd
  Normal  Started  36s (x2 over 2m21s)  kubelet, gold-1  Started container osd

So it seems that upgrading to v15.2.13, which should be supported according to the documentation, results in a failed liveness probe because the command is not found.

So, to continue the upgrade (since I have no way of downgrading), I removed the liveness probes from the OSD deployments to allow them to come back online.
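
Removing the probe is a one-line JSON patch per deployment; a rough sketch for one OSD (assuming the osd container is the first container in the pod spec):

kubectl -n rook-ceph patch deployment rook-ceph-osd-5 --type=json \
  -p '[{"op":"remove","path":"/spec/template/spec/containers/0/livenessProbe"}]'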

After an hour, one of the OSDs finished and rejoined the cluster, so I made the same change to the others.

Checking the health now shows two warnings:

$ kubectl exec -it rook-ceph-tools-788f7f4b84-gckp8 -- ceph status
  cluster:
    id:     04461f64-e630-4891-bcea-0de24cf06c51
    health: HEALTH_WARN
            clients are using insecure global_id reclaim
            mons are allowing insecure global_id reclaim
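
These two warnings come from the Octopus fix for the insecure global_id reclaim CVE; once all clients have been upgraded, the mon-side warning can be cleared with something like:

kubectl exec -it rook-ceph-tools-788f7f4b84-gckp8 -- ceph config set mon auth_allow_insecure_global_id_reclaim false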

The upgrade seems to have stalled, as the operator log isn't showing any additional changes, just this over and over again:

2022-08-16 06:17:13.743320 I | op-config: CephCluster "rook-ceph" status: "Ready". "Cluster created successfully"
2022-08-16 06:18:14.967352 I | op-config: CephCluster "rook-ceph" status: "Ready". "Cluster created successfully"
2022-08-16 06:19:16.363591 I | op-config: CephCluster "rook-ceph" status: "Ready". "Cluster created successfully"
2022-08-16 06:20:17.683344 I | op-config: CephCluster "rook-ceph" status: "Ready". "Cluster created successfully"
2022-08-16 06:21:18.936284 I | op-config: CephCluster "rook-ceph" status: "Ready". "Cluster created successfully"
2022-08-16 06:22:20.147279 I | op-config: CephCluster "rook-ceph" status: "Ready". "Cluster created successfully"
2022-08-16 06:23:21.438994 I | op-config: CephCluster "rook-ceph" status: "Ready". "Cluster created successfully"

I restarted the rook operator and then it continued the update. This time I was already inside the operator container, so I just killed process 1, similar to this command:

kubectl exec rook-ceph-operator-88cdb5f4b-k6xvw -- kill 1
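
Deleting the operator pod (or doing a rollout restart of its deployment) achieves the same thing a bit more cleanly:

kubectl -n rook-ceph delete pod rook-ceph-operator-88cdb5f4b-k6xvw
kubectl -n rook-ceph rollout restart deployment rook-ceph-operator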

The difficulty is that even if the liveness probes are deleted, they are regenerated when the operator updates the deployment. So after each OSD starts its update, the deployment needs to be patched again, which requires some manual intervention for each OSD.
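
A small loop makes the repeated intervention a bit less tedious; this sketch applies the same patch as above to every OSD deployment (it simply errors for deployments where the probe is already gone, and it has to be re-run whenever the operator regenerates the probes):

for d in $(kubectl -n rook-ceph get deploy -l app=rook-ceph-osd -o name); do
  kubectl -n rook-ceph patch "$d" --type=json \
    -p '[{"op":"remove","path":"/spec/template/spec/containers/0/livenessProbe"}]'
done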

I am wondering if I should have held off until Rook was at version 1.4 or later, but the docs indicated that 1.3 would be fine, since these instructions were at the end of the 1.2-to-1.3 upgrade guide. There is also the chance that updating to 1.4+ will regenerate the liveness probes and cause the same problem, so I'll need to watch out for that at that point too.

While it is upgrading, the cluster health goes crazy:

  cluster:
    id:     04461f64-e630-4891-bcea-0de24cf06c51
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            2 osds down
            2 hosts (4 osds) down
            Degraded data redundancy: 6691197/20073591 objects degraded (33.333%), 73 pgs degraded, 73 pgs undersized

  services:
    mon: 3 daemons, quorum h,i,k (age 9h)
    mgr: a(active, since 9h)
    mds: myfs:1 {0=myfs-a=up:active} 1 up:standby-replay
    osd: 13 osds: 4 up (since 30m), 6 in (since 5d)

There are 13 OSDs, but we marked the dead ones out a long time ago since we haven't had time to visit the colo and replace the drives, so only 6 are in. Here it is saying that 4 of the 6 are up, which is correct, but the health message says that 4 are down. From looking at the logs, only 1 is actually down, so Ceph takes a while to figure things out.
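
To see which OSDs Ceph actually considers down at any given moment, rather than going by the summary counts, the osd tree from the toolbox is more direct:

kubectl exec -it rook-ceph-tools-788f7f4b84-gckp8 -- ceph osd tree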

Finally, after all of this headache, everything was upgraded:

kubectl -n rook-ceph get deployments -l rook_cluster=rook-ceph -o jsonpath={range .items[*]}{.metadata.name}{"  \treq/upd/avl: "}{.spec....  gold-1: Tue Aug 16 16:24:52 2022

rook-ceph-crashcollector-gold-1         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-crashcollector-gold-4         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-crashcollector-gold-5         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-crashcollector-gold-6         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-mds-myfs-a                    req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-mds-myfs-b                    req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-mgr-a                         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-mon-h                         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-mon-i                         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-mon-k                         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-osd-1                         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-osd-2                         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-osd-3                         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-osd-5                         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-osd-6                         req/upd/avl: 1/1/1      ceph-version=15.2.13-0
rook-ceph-osd-9                         req/upd/avl: 1/1/1      ceph-version=15.2.13-0

The operator logs showed everything was working, but the health was still showing problems.

$ kubectl exec -it rook-ceph-tools-788f7f4b84-gckp8 -- ceph status
  cluster:
    id:     04461f64-e630-4891-bcea-0de24cf06c51
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            1 osds down
            Degraded data redundancy: 2838481/20073594 objects degraded (14.140%), 32 pgs degraded, 32 pgs undersized

  services:
    mon: 3 daemons, quorum h,i,k (age 9h)
    mgr: a(active, since 9h)
    mds: myfs:1 {0=myfs-a=up:active} 1 up:standby-replay
    osd: 13 osds: 5 up (since 3m), 6 in (since 5d)

  data:
    pools:   4 pools, 73 pgs
    objects: 6.69M objects, 2.9 TiB
    usage:   8.8 TiB used, 46 TiB / 55 TiB avail
    pgs:     2838481/20073594 objects degraded (14.140%)
             41 active+clean
             32 active+undersized+degraded

  io:
    client:   1.2 KiB/s rd, 40 KiB/s wr, 2 op/s rd, 5 op/s wr

But it went from 2 hosts / 4 OSDs down to 0 hosts / 1 OSD down, so it just takes a while for the cluster to figure out that things are working.

Checking the OSD status shows that the final OSD (3) is still not reporting as up:

kubectl exec -it rook-ceph-tools-788f7f4b84-gckp8 -- ceph osd status                                                                         

ID  HOST     USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0             0      0       0        0       0        0   exists
 1  gold-6  1635G  7677G      0     8190       1       71   exists,up
 2  gold-1  1515G  7797G      0      819       1        0   exists,up
 3  gold-6  1367G  7945G      0      819       0        0   exists
 4             0      0       0        0       0        0   exists
 5  gold-1  1486G  7826G      2     21.5k      0        0   exists,up
 6  gold-4  1442G  7870G      3     46.3k      0        0   exists,up
 7             0      0       0        0       0        0   exists
 8             0      0       0        0       0        0   autoout,exists
 9  gold-4  1559G  7753G      1     9009       1       15   exists,up
10             0      0       0        0       0        0   exists
11             0      0       0        0       0        0   autoout,exists
12             0      0       0        0       0        0   autoout,exists

Looking at future versions, each Rook release from v1.4 through v1.8 documents which Ceph versions it supports.

So eventually the cluster will need to be upgraded to Pacific, but I think I'll wait until version 1.8 or 1.9.

Finally, after about 20 minutes, the cluster was in a better state:

$ kubectl exec -it rook-ceph-tools-788f7f4b84-gckp8 -- ceph status
  cluster:
    id:     04461f64-e630-4891-bcea-0de24cf06c51
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim

  services:
    mon: 3 daemons, quorum h,i,k (age 10h)
    mgr: a(active, since 10h)
    mds: myfs:1 {0=myfs-a=up:active} 1 up:standby-replay
    osd: 13 osds: 6 up (since 4m), 6 in (since 5d)

  data:
    pools:   4 pools, 73 pgs
    objects: 6.69M objects, 2.9 TiB
    usage:   8.8 TiB used, 46 TiB / 55 TiB avail
    pgs:     72 active+clean
             1  active+clean+scrubbing+deep

  io:
    client:   1.7 KiB/s rd, 62 KiB/s wr, 2 op/s rd, 6 op/s wr

And everything is upgraded:

$ kubectl -n rook-ceph get deployment -l rook_cluster=rook-ceph -o jsonpath='{range .items[*]}{"ceph-version="}{.metadata.labels.ceph-version}{"\n"}{end}' | sort | uniq
ceph-version=15.2.13-0

Very frustrating and nerve-wracking upgrade.