Upgrading rook from v1.3.11 to v1.4.9
Okay, so after upgrading from 1.2 to 1.3, it's time to upgrade from 1.3 to 1.4. The worry is that all the liveness probes had to be removed to upgrade Ceph from Nautilus (14.2.8) to Octopus (15.2.13), so hopefully when the liveness probes get re-added by the operator, they will work for the new version of Ceph.
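Once the new operator is in place, one way to check whether it actually put the probes back is to dump the livenessProbe from one of the daemon deployments. A minimal sketch, using deployment names that show up later in this post:
# Print the livenessProbe the operator set on a mon (empty output means no probe)
kubectl -n rook-ceph get deploy rook-ceph-mon-h -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'
# Same idea for an OSD
kubectl -n rook-ceph get deploy rook-ceph-osd-1 -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'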
First, clone the rook repo:
git clone https://github.com/rook/rook.git
Then check for the 1.4 versions:
$ git tag -l | grep v1.4
v1.4.0
v1.4.0-alpha.0
v1.4.0-beta.0
v1.4.1
v1.4.2
v1.4.3
v1.4.4
v1.4.5
v1.4.6
v1.4.7
v1.4.8
v1.4.9
v1.4.9 is the latest, so let's go for that one.
$ git checkout tags/v1.4.9
Previous HEAD position was b28b21a03 Merge pull request #6241 from travisn/backport-ci-disk
HEAD is now at 3bccbc9ef Merge pull request #6963 from travisn/release-1.4.9
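Just to double-check that the right tag is checked out before running anything:
git describe --tags
# should print v1.4.9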
Running the upgrade scripts
The upgrade scripts will now be in the cluster/examples/kubernetes/ceph folder:
kubectl delete -f upgrade-from-v1.3-delete.yaml
kubectl apply -f upgrade-from-v1.3-apply.yaml -f upgrade-from-v1.3-crds.yaml
For example:
$:~/github/rook/cluster/examples/kubernetes/ceph$ kubectl delete -f upgrade-from-v1.3-delete.yaml
clusterrole.rbac.authorization.k8s.io "cephfs-csi-nodeplugin-rules" deleted
clusterrole.rbac.authorization.k8s.io "cephfs-external-provisioner-runner-rules" deleted
clusterrole.rbac.authorization.k8s.io "rbd-csi-nodeplugin-rules" deleted
clusterrole.rbac.authorization.k8s.io "rbd-external-provisioner-runner-rules" deleted
clusterrole.rbac.authorization.k8s.io "rook-ceph-cluster-mgmt-rules" deleted
clusterrole.rbac.authorization.k8s.io "rook-ceph-global-rules" deleted
clusterrole.rbac.authorization.k8s.io "rook-ceph-mgr-cluster-rules" deleted
clusterrole.rbac.authorization.k8s.io "rook-ceph-mgr-system-rules" deleted
clusterrole.rbac.authorization.k8s.io "cephfs-csi-nodeplugin" deleted
clusterrole.rbac.authorization.k8s.io "cephfs-external-provisioner-runner" deleted
clusterrole.rbac.authorization.k8s.io "rbd-csi-nodeplugin" deleted
clusterrole.rbac.authorization.k8s.io "rbd-external-provisioner-runner" deleted
clusterrole.rbac.authorization.k8s.io "rook-ceph-cluster-mgmt" deleted
clusterrole.rbac.authorization.k8s.io "rook-ceph-global" deleted
clusterrole.rbac.authorization.k8s.io "rook-ceph-mgr-cluster" deleted
clusterrole.rbac.authorization.k8s.io "rook-ceph-mgr-system" deleted
$:~/github/rook/cluster/examples/kubernetes/ceph$ kubectl apply -f upgrade-from-v1.3-apply.yaml -f upgrade-from-v1.3-crds.yaml
clusterrole.rbac.authorization.k8s.io/rook-ceph-global created
clusterrole.rbac.authorization.k8s.io/rook-ceph-cluster-mgmt created
clusterrole.rbac.authorization.k8s.io/rook-ceph-mgr-cluster created
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
clusterrole.rbac.authorization.k8s.io/rook-ceph-object-bucket configured
clusterrole.rbac.authorization.k8s.io/rook-ceph-mgr-system created
clusterrole.rbac.authorization.k8s.io/cephfs-csi-nodeplugin created
clusterrole.rbac.authorization.k8s.io/cephfs-external-provisioner-runner created
clusterrole.rbac.authorization.k8s.io/rbd-csi-nodeplugin created
clusterrole.rbac.authorization.k8s.io/rbd-external-provisioner-runner created
serviceaccount/rook-ceph-admission-controller created
clusterrole.rbac.authorization.k8s.io/rook-ceph-admission-controller-role created
clusterrolebinding.rbac.authorization.k8s.io/rook-ceph-admission-controller-rolebinding created
customresourcedefinition.apiextensions.k8s.io/cephclusters.ceph.rook.io configured
customresourcedefinition.apiextensions.k8s.io/cephclients.ceph.rook.io unchanged
customresourcedefinition.apiextensions.k8s.io/cephrbdmirrors.ceph.rook.io created
customresourcedefinition.apiextensions.k8s.io/cephfilesystems.ceph.rook.io configured
customresourcedefinition.apiextensions.k8s.io/cephnfses.ceph.rook.io unchanged
customresourcedefinition.apiextensions.k8s.io/cephobjectstores.ceph.rook.io configured
customresourcedefinition.apiextensions.k8s.io/cephobjectstoreusers.ceph.rook.io unchanged
customresourcedefinition.apiextensions.k8s.io/cephobjectrealms.ceph.rook.io created
customresourcedefinition.apiextensions.k8s.io/cephobjectzonegroups.ceph.rook.io created
customresourcedefinition.apiextensions.k8s.io/cephobjectzones.ceph.rook.io created
customresourcedefinition.apiextensions.k8s.io/cephblockpools.ceph.rook.io configured
customresourcedefinition.apiextensions.k8s.io/volumes.rook.io unchanged
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/objectbuckets.objectbucket.io configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/objectbucketclaims.objectbucket.io configured
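As a sanity check, the new v1.4 CRDs (cephrbdmirrors, cephobjectrealms, cephobjectzonegroups, cephobjectzones) should now show up alongside the existing ones:
kubectl get crd | grep ceph.rook.io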
Upgrading the operator
kubectl -n rook-ceph set image deploy/rook-ceph-operator rook-ceph-operator=rook/ceph:v1.4.9
This change triggered a new operator pod to be created:
rook-ceph-operator-7d7b96897b-mszxr 1/1 Running 0 40s
rook-ceph-operator-88cdb5f4b-k6xvw 0/1 Terminating 8 17h
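If you want to wait for the new operator to be fully up and then follow what it does, something like this works:
kubectl -n rook-ceph rollout status deploy/rook-ceph-operator
# tail the operator logs to watch it reconcile the daemons to v1.4.9
kubectl -n rook-ceph logs deploy/rook-ceph-operator -f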
And the cluster started updating quickly:
kubectl -n rook-ceph get deployments -l rook_cluster=rook-ceph -o jsonpath={range .items[*]}{.metadata.name}{" \treq/upd/avl: "}{.spec....
gold-1: Tue Aug 16 17:02:19 2022
rook-ceph-crashcollector-gold-1 req/upd/avl: 1/1/1 rook-version=v1.4.9
rook-ceph-crashcollector-gold-4 req/upd/avl: 1/1/1 rook-version=v1.4.9
rook-ceph-crashcollector-gold-5 req/upd/avl: 1/1/1 rook-version=v1.4.9
rook-ceph-crashcollector-gold-6 req/upd/avl: 1/1/1 rook-version=v1.4.9
rook-ceph-mds-myfs-a req/upd/avl: 1// rook-version=v1.4.9
rook-ceph-mds-myfs-b req/upd/avl: 1/1/1 rook-version=v1.3.11
rook-ceph-mgr-a req/upd/avl: 1// rook-version=v1.4.9
rook-ceph-mon-h req/upd/avl: 1/1/1 rook-version=v1.4.9
rook-ceph-mon-i req/upd/avl: 1/1/1 rook-version=v1.4.9
rook-ceph-mon-k req/upd/avl: 1/1/1 rook-version=v1.4.9
rook-ceph-osd-1 req/upd/avl: 1/1/1 rook-version=v1.3.11
rook-ceph-osd-2 req/upd/avl: 1/1/1 rook-version=v1.3.11
rook-ceph-osd-3 req/upd/avl: 1/1/1 rook-version=v1.3.11
rook-ceph-osd-5 req/upd/avl: 1/1/1 rook-version=v1.3.11
rook-ceph-osd-6 req/upd/avl: 1/1/1 rook-version=v1.3.11
rook-ceph-osd-9 req/upd/avl: 1/1/1 rook-version=v1.3.11
Finally, after a little while, everything was updated.
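A quick way to confirm that, roughly the one-liner from the Rook upgrade guide, is to check that every deployment carries the same rook-version label (the same pattern works for the ceph-version label):
kubectl -n rook-ceph get deployment -l rook_cluster=rook-ceph -o jsonpath='{range .items[*]}{"rook-version="}{.metadata.labels.rook-version}{"\n"}{end}' | sort | uniq
# expect a single line: rook-version=v1.4.9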
The only error is this one:
mons are allowing insecure global_id reclaim
I tried fixing it with this command:
ceph config set mon auth_allow_insecure_global_id_reclaim false
That resulted in the toolbox no longer being able to talk to the cluster:
[root@gold-1 /]# ceph health
[errno 5] RADOS I/O error (error connecting to the cluster)
So the cluster is running, which is good, but I can't talk to it anymore.
Even upgrading the toolbox doesn't work. I deleted the deployment and applied the latest toolbox.yaml, and this error happens:
$ ceph status
[errno 13] RADOS permission denied (error connecting to the cluster)
The ceph version in the toolbox is 15.2.8:
ceph --version
ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)
Looking here, it says it is fixed in ceph version 15.2.11:
https://docs.ceph.com/en/latest/security/CVE-2021-20288/
Fixed versions
- Pacific v16.2.1 (and later)
- Octopus v15.2.11 (and later)
- Nautilus v14.2.20 (and later)
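Ceph also allows muting the health warning for a while instead of hard-disabling insecure reclaim, which is the gentler option while older clients still need to connect. A sketch, assuming a working ceph client connection to the cluster (which the toolbox doesn't have right now):
ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM 1w
ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 1w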
Looking at the issue here:
https://github.com/rook/rook/issues/7746
It may be fixed in rook version 1.6.7.
So the next step is to upgrade from 1.4.9 to 1.5+ and then to 1.6+.
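For reference, those hops would be the same operator-image bump as above, each preceded by that release's own upgrade guide (pre-upgrade YAMLs, CRDs, etc.). The v1.5 tag below is a placeholder:
# hypothetical sketch - find the actual latest patch tag with: git tag -l | grep v1.5
kubectl -n rook-ceph set image deploy/rook-ceph-operator rook-ceph-operator=rook/ceph:v1.5.x
kubectl -n rook-ceph set image deploy/rook-ceph-operator rook-ceph-operator=rook/ceph:v1.6.7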